How to select ${mapred.local.dir}?


Suppose I configure several ${mapred.local.dir} directories, each mounted on a different disk, to store the intermediate results of map tasks. My questions are: 1. Is LocalDirAllocator.java used to manage the ${mapred.local.dir} directories?

2. Is the getLocalPathForWrite() method of LocalDirAllocator.java used to select a ${mapred.local.dir} directory?

Best answer
1. Is LocalDirAllocator.java used to manage the ${mapred.local.dir} directories?

Yes, the TaskTracker uses LocalDirAllocator to manage the local directories/disks in order to store intermediate data. (The way it allocates space is described in the explanation below.)

2. Is the getLocalPathForWrite() method of LocalDirAllocator.java used to select a ${mapred.local.dir} directory?

LocalDirAllocator has 3 overloaded getLocalPathForWrite() methods. They round-robin over the set of disks (via the configured directories) and return the first complete path that has enough space.

Explanation from the Javadoc of LocalDirAllocator.java:

An implementation of a round-robin scheme for disk allocation for creating files. The way it works is that it is kept track what disk was last allocated for a file write. For the current request, the next disk from the set of disks would be allocated if the free space on the disk is sufficient enough to accommodate the file that is being considered for creation. If the space requirements cannot be met, the next disk in order would be tried and so on till a disk is found with sufficient capacity. Once a disk with sufficient space is identified, a check is done to make sure that the disk is writable. Also, there is an API provided that doesn't take the space requirements into consideration but just checks whether the disk under consideration is writable (this should be used for cases where the file size is not known apriori). An API is provided to read a path that was created earlier. That API works by doing a scan of all the disks for the input pathname. This implementation also provides the functionality of having multiple allocators per JVM (one for each unique functionality or context, like mapred, dfs-client, etc.). It ensures that there is only one instance of an allocator per context per JVM.

Note:

  1. The contexts referred above are actually the configuration items defined in the Configuration class like "mapred.local.dir" (for which we want to control the dir allocations). The context-strings are exactly those configuration items.

  2. This implementation does not take into consideration cases where a disk becomes read-only or goes out of space while a file is being written to (disks are shared between multiple processes, and so the latter situation is probable).

  3. In the class implementation, "Disk" is referred to as "Dir", which actually points to the configured directory on the Disk which will be the parent for all file write/read allocations.
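To make the round-robin scheme described above concrete, here is a minimal, self-contained sketch. It is not Hadoop's code: the class `RoundRobinDirSketch`, the nested `Dir` type, the method `pickDirForWrite()` and the simulated free-space numbers are all hypothetical names and values chosen for illustration, and the real LocalDirAllocator differs in details (it queries the actual disks and also checks writability).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of round-robin directory selection; NOT Hadoop's LocalDirAllocator. */
class RoundRobinDirSketch {

    /** A configured directory with a simulated amount of free space in bytes (assumption). */
    static final class Dir {
        final String path;
        long freeBytes;

        Dir(String path, long freeBytes) {
            this.path = path;
            this.freeBytes = freeBytes;
        }
    }

    private final List<Dir> dirs;
    private int next = 0; // index of the directory to try first on the next request

    RoundRobinDirSketch(List<Dir> dirs) {
        this.dirs = new ArrayList<>(dirs);
    }

    /** Returns the chosen directory path, or null if no directory has enough free space. */
    String pickDirForWrite(long requiredBytes) {
        for (int tried = 0; tried < dirs.size(); tried++) {
            Dir candidate = dirs.get(next);
            next = (next + 1) % dirs.size(); // always advance, so the next request starts further on
            if (candidate.freeBytes >= requiredBytes) {
                candidate.freeBytes -= requiredBytes; // simulate the file being written
                return candidate.path;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RoundRobinDirSketch allocator = new RoundRobinDirSketch(Arrays.asList(
                new Dir("/disk1/mapred/local", 100),
                new Dir("/disk2/mapred/local", 10),   // too small for a 50-byte request
                new Dir("/disk3/mapred/local", 100)));

        System.out.println(allocator.pickDirForWrite(50)); // /disk1/mapred/local
        System.out.println(allocator.pickDirForWrite(50)); // /disk3/mapred/local (disk2 is skipped)
    }
}
```

The point of the sketch is only the selection order: each request starts at the directory after the one tried last, and directories without enough space are skipped.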

I don't think we can override its behaviour directly, unless we also override the behaviour of the classes that depend on it.