According to the Hadoop 3.x release notes, erasure coding was introduced to reduce storage overhead.
Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
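The 1.4x figure quoted above follows from simple arithmetic: a scheme with k data blocks and m parity blocks stores (k + m) / k bytes of raw data per byte of user data. Here is a minimal sketch of that calculation; the helper name `ec_overhead` is made up for illustration and is not part of Hadoop.

```shell
#!/bin/sh
# ec_overhead DATA PARITY
# Hypothetical helper: prints the raw-storage multiplier (k + m) / k
# for an erasure coding scheme with k data and m parity blocks.
ec_overhead() {
    awk -v k="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (k + m) / k }'
}

ec_overhead 10 4   # Reed-Solomon (10,4) from the release notes: prints 1.40
ec_overhead 2 1    # XOR-2-1 layout: prints 1.50
```

For comparison, 3-way replication stores three full copies of the data, which is the 3x overhead mentioned in the release notes.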
I am looking for sample configuration files for this feature.
Also, after setting up an EC policy and enabling it with hdfs ec -enablePolicy, does the policy apply only to cold files, or is it applied by default to all files in HDFS?
In Hadoop 3 we can enable an erasure coding policy on any folder in HDFS.

Command to list the supported erasure coding policies:
./bin/hdfs ec -listPolicies

Command to enable the XOR-2-1-1024k erasure coding policy:
./bin/hdfs ec -enablePolicy -policy XOR-2-1-1024k

Command to set an erasure coding policy on an HDFS directory:
./bin/hdfs ec -setPolicy -path /tmp -policy XOR-2-1-1024k

Command to get the policy set on a given directory:
./bin/hdfs ec -getPolicy -path /tmp

Command to remove the policy from the directory, i.e. unset the policy:
./bin/hdfs ec -unsetPolicy -path /tmp

Command to disable the policy:
./bin/hdfs ec -disablePolicy -policy XOR-2-1-1024k

Edit:
A sample EC policy XML file named user_ec_policies.xml.template is available for reference in the Hadoop conf directory ($HADOOP_HOME/etc/hadoop/).

By default, the REPLICATION policy is always enabled. Erasure coding policies are disabled by default.

Erasure coding applies only to the HDFS path you select. For example, if you choose /erasure_code_data as the path when setting the policy, EC applies only to that directory. Other paths already present in HDFS, such as /tmp and /user, keep the REPLICATION policy.
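To see the per-directory scope in practice, the commands above can be combined into a short script. This is only a sketch: the helper names `ec_apply` and `ec_report` and the `HDFS` variable are made up for illustration, and it assumes the `hdfs` CLI is on your PATH. Setting `HDFS=echo` prints the commands instead of running them, which is handy for a dry run.

```shell
#!/bin/sh
# Assumption: the hdfs CLI is on PATH. Set HDFS=echo for a dry run
# that prints each command instead of executing it.
HDFS="${HDFS:-hdfs}"

# Hypothetical helper: enable a policy and apply it to a single directory.
ec_apply() {
    policy="$1"
    dir="$2"
    "$HDFS" ec -enablePolicy -policy "$policy" &&
    "$HDFS" dfs -mkdir -p "$dir" &&
    "$HDFS" ec -setPolicy -path "$dir" -policy "$policy"
}

# Hypothetical helper: print the policy reported for each given path.
ec_report() {
    for path in "$@"; do
        "$HDFS" ec -getPolicy -path "$path"
    done
}

# Example usage (uncomment to run against a cluster):
# ec_apply XOR-2-1-1024k /erasure_code_data
# ec_report /erasure_code_data /tmp /user
```

After `ec_apply`, only /erasure_code_data should report the XOR-2-1-1024k policy, while /tmp and /user should be unaffected. Note that setting a policy on a directory affects only files written there afterwards; files that already exist keep their current layout.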