I have a Hadoop cluster with 4 DataNodes, and I am confused about the difference between two things: data replication and data distribution.
Suppose I have a 2 GB file, a replication factor of 2, and a block size of 128 MB. When I put this file into HDFS, I see that two copies of each 128 MB block are created, and they are all placed on datanode3 and datanode4. datanode1 and datanode2 are not used at all. The data is replicated as the replication factor requires, but I expected to see some of the blocks on datanode1 and datanode2 as well. Is something wrong?
Let's say that I have 20 DataNodes and a replication factor of 2. If I put a 2 GB file on HDFS, I would again expect to see two copies of each 128 MB block, but I would also expect those blocks to be distributed across the 20 DataNodes.
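
To make my expectation concrete, here is a rough sketch of the arithmetic I have in mind. It assumes that 2 GB means 2048 MB and that replicas are spread evenly over the DataNodes. That even spread is only my expectation, not something I know HDFS guarantees. `block_layout` is a helper name I made up for illustration and is not part of any Hadoop API.

```python
import math

def block_layout(file_size_mb, block_size_mb, replication, datanodes):
    """Return (blocks, total replicas, average replicas per DataNode).

    Assumes an even spread of replicas, which is an expectation only.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times.
    replicas = blocks * replication
    return blocks, replicas, replicas / datanodes

for nodes in (4, 20):
    blocks, replicas, per_node = block_layout(2048, 128, 2, nodes)
    print(f"{nodes} DataNodes: {blocks} blocks, {replicas} replicas, "
          f"about {per_node:.1f} replicas per node if spread evenly")
```

So in both cases there are 16 blocks and 32 replicas in total. With 4 DataNodes I expected roughly 8 replicas on each node, but what I actually see is all 32 replicas on datanode3 and datanode4.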