I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the paths where Hadoop stores its data (by default, Hadoop stores its data under /tmp).
The hdfs-site.xml file looks like this:
<property>
  <name>dfs.data.dir</name>
  <value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node does not start by checking the logs and by running the jps command.
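For reference, this is roughly how I checked; the log path below is an assumption based on a default Hadoop 1.x layout, so adjust it to your install:

# List the running Hadoop JVMs; after a reboot the DataNode entry is missing
jps
# Look at the tail of the DataNode log for the failure reason
# (assumed default log location under $HADOOP_HOME/logs)
tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log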
Then I:
- Stopped the cluster using the stop-all.sh script.
- Formatted HDFS using the hadoop namenode -format command.
- Started the cluster using the start-all.sh script.
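In other words, the recovery sequence was the following (a sketch assuming the Hadoop 1.x scripts are on the PATH; note that formatting the namenode wipes any existing HDFS data):

# Stop all HDFS and MapReduce daemons
stop-all.sh
# Reformat the namenode (destroys existing HDFS metadata and data)
hadoop namenode -format
# Start the daemons again
start-all.sh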
After that, everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and then try to start the cluster.
- Has anyone encountered a similar problem?
- Why is this happening?
- How can this problem be solved?
By changing dfs.datanode.data.dir away from /tmp you did make the data (the blocks) survive across a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant directories point away from /tmp, most notably dfs.namenode.name.dir. (I can't tell which other directories you have to change; it depends on your config, but the namenode directory is mandatory and may also be sufficient.) I would also recommend using a more recent Hadoop distribution. By the way, in Hadoop 1.1 the namenode directory setting is dfs.name.dir.
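A minimal sketch of what hdfs-site.xml could look like with both directories moved away from /tmp; the property names are the Hadoop 1.x ones, and /HADOOP_CLUSTER_DATA/name is an assumed path for the namenode metadata:

<!-- Assumed example paths; keep them outside /tmp so they survive reboots -->
<property>
  <name>dfs.name.dir</name>
  <value>/HADOOP_CLUSTER_DATA/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/HADOOP_CLUSTER_DATA/data</value>
</property>

Make sure both directories exist and are writable by the user that runs the Hadoop daemons; once the namenode metadata lives outside /tmp, a reboot no longer requires reformatting HDFS.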