For example, I wrote a file into HDFS with replication factor 2. The node I was writing from now holds all the blocks of the file, and the other copies of those blocks are scattered across the remaining nodes in the cluster; that is the default HDFS placement policy. What exactly happens if I lower the replication factor of the file to 1? How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the most blocks of the file.
Why I'm asking: if it does, that would make sense, because it would ease processing of the file. If only one copy of each block remained and all the blocks ended up on the same node, the file would be harder to process with MapReduce, since the data would have to be transferred to other nodes in the cluster.
"When a block becomes over-replicated, the name node chooses a replica to remove. The name node will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the data node with the least amount of available disk space. This may help rebalancing the load over the cluster."

Source: The Architecture of Open Source Applications
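To make that preference order concrete, here is a minimal Python sketch of the selection logic described in the quote. It is not the actual NameNode code (which lives in Hadoop's block-placement policy classes); the node names, rack layout, and free-space numbers below are made up for illustration.

```python
from collections import Counter

def choose_replica_to_remove(replicas):
    """Pick which replica to drop, following the quoted preference order.

    replicas: list of dicts with 'node', 'rack', and 'free_bytes' keys.
    1. Prefer not to reduce the number of racks hosting replicas, so
       only consider replicas whose rack holds more than one copy,
       if any such rack exists.
    2. Among the candidates, drop the replica on the data node with
       the least available disk space.
    """
    rack_counts = Counter(r['rack'] for r in replicas)
    candidates = [r for r in replicas if rack_counts[r['rack']] > 1]
    if not candidates:
        # Every rack holds exactly one replica; any removal loses a rack.
        candidates = replicas
    return min(candidates, key=lambda r: r['free_bytes'])

# Hypothetical layout: two replicas in rack r1, one in rack r2.
replicas = [
    {'node': 'dn1', 'rack': 'r1', 'free_bytes': 50_000_000},
    {'node': 'dn2', 'rack': 'r1', 'free_bytes': 10_000_000},
    {'node': 'dn3', 'rack': 'r2', 'free_bytes': 5_000_000},
]

# dn3 is the only replica in rack r2, so removing it would reduce rack
# coverage and is avoided; of dn1 and dn2, dn2 has less free space.
print(choose_replica_to_remove(replicas)['node'])  # prints "dn2"
```

Note that this selection is per block, not per file, which is why lowering the replication factor does not concentrate all of a file's blocks on one node.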