Introduction
According to several pieces of documentation (1, 2, 3), HDFS's Location Awareness is about knowing the physical location of nodes and replicating data across different racks to reduce the impact of rack-level failures, e.g. power supply or switch problems.
Question
How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?
Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
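For the script option, Hadoop can be pointed at an executable (through the `net.topology.script.file.name` property in `core-site.xml`) that receives host names or IP addresses as arguments and prints one rack path per argument. The sketch below shows roughly what such a script could look like; the addresses and rack names in `RACK_BY_HOST` are made up for illustration, and a real script would load the mapping from a file or an inventory system.

```python
#!/usr/bin/env python3
"""Minimal topology script sketch: maps each host argument to a rack path."""
import sys

# Hypothetical mapping; the addresses and rack names are illustrative only.
RACK_BY_HOST = {
    "10.0.0.11": "/dc1/rack1",
    "10.0.0.12": "/dc1/rack1",
    "10.0.1.21": "/dc1/rack2",
}

# Rack path Hadoop uses for nodes without an explicit mapping.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per line, one for each host passed on the command line.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

Unknown hosts fall back to `/default-rack`, which puts all unmapped nodes on a single rack.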
Each DataNode is given a network location, which is simply a string, much like a file system path. Example:
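As a hypothetical illustration (the host names and paths below are invented, not taken from a real cluster), a network location can be treated as a slash-separated path whose components name, say, the data center and the rack:

```python
# Hypothetical DataNode -> network location mapping (illustrative names only).
NETWORK_LOCATIONS = {
    "datanode1.example.com": "/dc1/rack1",
    "datanode2.example.com": "/dc1/rack1",
    "datanode3.example.com": "/dc1/rack2",
}


def path_components(location):
    """Split a network location such as '/dc1/rack1' into its parts."""
    return [part for part in location.split("/") if part]


def same_rack(node_a, node_b):
    """Two nodes share a rack when their network locations are identical."""
    return NETWORK_LOCATIONS[node_a] == NETWORK_LOCATIONS[node_b]
```

The depth and naming of the path are up to whoever configures the cluster; HDFS only needs the strings to be consistent across nodes.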
The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.
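The idea can be sketched as follows. This is a simplified model, not Hadoop's actual implementation: it assumes a replication factor of three, a writer that is itself a DataNode, and the commonly described default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function names and the sample cluster are hypothetical.

```python
import random
from collections import defaultdict


def build_topology(locations):
    """Group DataNode names by rack path; `locations` maps node -> rack string."""
    racks = defaultdict(list)
    for node, rack in sorted(locations.items()):
        racks[rack].append(node)
    return dict(racks)


def choose_replicas(locations, writer, rng=random):
    """Pick three nodes: the writer, one on a remote rack, one more on that rack."""
    topology = build_topology(locations)
    local_rack = locations[writer]
    # Only racks other than the writer's, with room for two replicas, qualify.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        # Simplification: a real placement policy would relax its constraints
        # here instead of failing.
        raise ValueError("no remote rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Hypothetical four-node, two-rack cluster.
LOCATIONS = {
    "dn1": "/dc1/rack1",
    "dn2": "/dc1/rack1",
    "dn3": "/dc1/rack2",
    "dn4": "/dc1/rack2",
}
```

With this sample cluster, `choose_replicas(LOCATIONS, "dn1")` keeps one copy on `dn1` and places the other two on `dn3` and `dn4`, so losing either rack still leaves at least one replica available.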