HDFS' Location Awareness


Introduction

According to several sources (1, 2, 3), HDFS' location awareness means knowing the physical location of nodes and placing replicas on different racks, which reduces the impact of rack-level failures such as power supply or switch problems.

Question

How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?


There are 2 best solutions below

Accepted answer

Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
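
As a rough illustration of the script-based option: Hadoop can be pointed at an external executable (through the `net.topology.script.file.name` property in core-site.xml) that receives one or more host names or IP addresses as arguments and prints one rack path per argument. The sketch below assumes a hard-coded lookup table; the addresses and rack names are made up, and `/default-rack` is the fallback Hadoop itself uses for hosts it cannot resolve.

```python
#!/usr/bin/env python3
"""Sketch of a rack topology script; the host-to-rack table is hypothetical."""
import sys

# Hypothetical mapping from DataNode address to rack path.
RACK_BY_HOST = {
    "10.0.1.11": "/datacenter-1/rack-1",
    "10.0.1.12": "/datacenter-1/rack-1",
    "10.0.2.13": "/datacenter-1/rack-2",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # The addresses to resolve arrive as command-line arguments.
    print("\n".join(resolve(sys.argv[1:])))
```

In a real cluster the table would more likely come from a file or an inventory system than from a literal in the script.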

Each DataNode is given a network location, which is simply a string, much like a file system path.

Example:

datacenter-1/rack-1/node1
datacenter-1/rack-1/node2
datacenter-1/rack-2/node3
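
To make the path analogy concrete, here is a small sketch of how a distance between two such locations could be computed: count the steps from each node up to their closest common ancestor. This mirrors the idea behind Hadoop's network distance (same node 0, same rack 2, different rack in the same data center 4), but the function name `distance` is only illustrative and this is not Hadoop's own code.

```python
def distance(loc_a, loc_b):
    """Number of tree edges between two network locations."""
    a = loc_a.strip("/").split("/")
    b = loc_b.strip("/").split("/")

    # Length of the shared prefix, i.e. depth of the closest common ancestor.
    common = 0
    for part_a, part_b in zip(a, b):
        if part_a != part_b:
            break
        common += 1

    # Steps up from each location to that ancestor.
    return (len(a) - common) + (len(b) - common)


node1 = "datacenter-1/rack-1/node1"
node2 = "datacenter-1/rack-1/node2"
node3 = "datacenter-1/rack-2/node3"

print(distance(node1, node1))  # same node
print(distance(node1, node2))  # same rack
print(distance(node1, node3))  # different rack, same data center
```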

The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.
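
A simplified sketch of how such a topology can drive replica placement, assuming a replication factor of 3 and the commonly documented default behaviour: first replica on the writer's node, second on a node in a different rack, third on another node in the same rack as the second. The helper names (`rack_of`, `choose_replicas`) and the extra `node4` are hypothetical, and the real policy also considers load, free space, and other factors that are ignored here.

```python
import random


def rack_of(location):
    """Everything before the last path component is treated as the rack."""
    return location.rsplit("/", 1)[0]


def choose_replicas(writer, nodes, rng=random):
    """Pick up to three DataNodes for one block (simplified sketch)."""
    # 1st replica: the writer's own node if it is a DataNode, else any node.
    first = writer if writer in nodes else rng.choice(nodes)
    chosen = [first]

    # 2nd replica: prefer a node on a different rack than the first.
    candidates = [n for n in nodes if rack_of(n) != rack_of(first)]
    if not candidates:
        candidates = [n for n in nodes if n not in chosen]
    if candidates:
        chosen.append(rng.choice(candidates))

    # 3rd replica: prefer another node on the same rack as the second.
    if len(chosen) == 2:
        second = chosen[1]
        candidates = [n for n in nodes
                      if n not in chosen and rack_of(n) == rack_of(second)]
        if not candidates:
            candidates = [n for n in nodes if n not in chosen]
        if candidates:
            chosen.append(rng.choice(candidates))

    return chosen


nodes = [
    "datacenter-1/rack-1/node1",
    "datacenter-1/rack-1/node2",
    "datacenter-1/rack-2/node3",
    "datacenter-1/rack-2/node4",  # hypothetical extra node
]
replicas = choose_replicas("datacenter-1/rack-1/node1", nodes)
print(replicas)
```

With this layout, losing either rack still leaves at least one replica of the block reachable.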

Second answer

Somebody needs to know where the DataNodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster. That "somebody" is the NameNode.

The NameNode stores this information and also manages the file system namespace.

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
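
The lookup described above can be pictured with a toy in-memory model: the NameNode holds only metadata (which blocks make up a file, and which DataNodes hold each block) and hands that list back to the client. The class and method names here are hypothetical and are not the HDFS client API.

```python
class ToyNameNode:
    """Toy metadata service: stores block locations, never file contents."""

    def __init__(self):
        self.blocks_by_file = {}  # file path -> ordered list of block ids
        self.nodes_by_block = {}  # block id -> list of DataNode locations

    def add_file(self, path, block_map):
        """Register a file; block_map maps block id -> DataNode locations."""
        self.blocks_by_file[path] = list(block_map)
        self.nodes_by_block.update(block_map)

    def get_block_locations(self, path):
        """Return (block id, DataNode locations) pairs for a file."""
        if path not in self.blocks_by_file:
            raise FileNotFoundError(path)
        return [(block, self.nodes_by_block[block])
                for block in self.blocks_by_file[path]]


namenode = ToyNameNode()
namenode.add_file("/logs/app.log", {
    "blk_1": ["datacenter-1/rack-1/node1", "datacenter-1/rack-2/node3"],
    "blk_2": ["datacenter-1/rack-1/node2", "datacenter-1/rack-2/node3"],
})
locations = namenode.get_block_locations("/logs/app.log")
print(locations)
```

The client would then read each block directly from one of the listed DataNodes.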