For example, I wrote a file into HDFS with replication factor 2. The node I was writing from now holds all the blocks of the file, and the other copies of those blocks are scattered across the remaining nodes in the cluster; that is the default HDFS placement policy. What exactly happens if I lower the replication factor of the file to 1? How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the most blocks of the file.
Why I'm asking: if it does, that would make sense, because it would ease processing of the file. If only one copy of each block remained and all the blocks ended up on the same node, the file would be harder to process with MapReduce, since the data would have to be transferred to other nodes in the cluster.
"When a block becomes over-replicated, the name node chooses a replica to remove. The name node will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the data node with the least amount of available disk space. This may help rebalancing the load over the cluster."

Source: The Architecture of Open Source Applications
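To make that preference order concrete, here is a minimal Python sketch of the selection logic described in the quote. It is not the actual NameNode code (which lives in Hadoop's block-placement policy classes); the node names, rack layout, and free-space numbers below are made up for illustration.

```python
from collections import Counter

def choose_replica_to_remove(replicas):
    """Pick which replica to drop, following the quoted preference order.

    replicas: list of dicts with 'node', 'rack', and 'free_bytes' keys.
    1. Prefer not to reduce the number of racks hosting replicas, so
       only consider replicas whose rack holds more than one copy,
       if any such rack exists.
    2. Among the candidates, drop the replica on the data node with
       the least available disk space.
    """
    rack_counts = Counter(r['rack'] for r in replicas)
    candidates = [r for r in replicas if rack_counts[r['rack']] > 1]
    if not candidates:
        # Every rack holds exactly one replica; any removal loses a rack.
        candidates = replicas
    return min(candidates, key=lambda r: r['free_bytes'])

# Hypothetical layout: two replicas in rack r1, one in rack r2.
replicas = [
    {'node': 'dn1', 'rack': 'r1', 'free_bytes': 50_000_000},
    {'node': 'dn2', 'rack': 'r1', 'free_bytes': 10_000_000},
    {'node': 'dn3', 'rack': 'r2', 'free_bytes': 5_000_000},
]

# dn3 is the only replica in rack r2, so removing it would reduce rack
# coverage and is avoided; of dn1 and dn2, dn2 has less free space.
print(choose_replica_to_remove(replicas)['node'])  # prints "dn2"
```

Note that this selection is per block, not per file, which is why lowering the replication factor does not concentrate all of a file's blocks on one node.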