Is it faster to replicate your data in HDFS across all your nodes?

July 13, 2015, 12:16 am

I have 6 data nodes. Is it faster to set the replication factor to 6, so that every node holds a full copy of the data and the cluster can split up queries (say, in Hive) without having to move data around? My understanding is that with a replication factor of 3, a 300 GB file put into HDFS is spread across only 3 of the data nodes, so when all 6 nodes are used for a query, data has to be moved to the 3 nodes that don't hold it, which slows responses. Is that accurate?
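To put rough numbers on the scenario, here is a small sketch of the storage arithmetic. It assumes a 128 MB block size (the Hadoop 2.x default for `dfs.blocksize`; a given cluster may be configured differently) and relies on the fact that HDFS stores a file as blocks and replicates each block separately. The helper name `hdfs_footprint` is made up for illustration. It only counts blocks and raw disk usage and says nothing about query speed.

```python
import math

def hdfs_footprint(file_gb, replication, block_mb=128):
    """Return (blocks, block replicas, raw storage in GB) for one file.

    block_mb=128 is an assumption (Hadoop 2.x default dfs.blocksize).
    """
    blocks = math.ceil(file_gb * 1024 / block_mb)  # file is split into blocks
    replicas = blocks * replication                # each block is copied `replication` times
    raw_gb = file_gb * replication                 # total disk consumed across the cluster
    return blocks, replicas, raw_gb

for rep in (3, 6):
    blocks, replicas, raw_gb = hdfs_footprint(300, rep)
    print(f"replication={rep}: {blocks} blocks, "
          f"{replicas} block replicas, {raw_gb} GB raw storage")
```

Under these assumptions the 300 GB file is 2400 blocks either way; going from replication 3 to 6 doubles the raw storage from 900 GB to 1800 GB.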
