I have a small Hadoop (2.5.1) cluster with the following configuration (concerning memory limits).
mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2450m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1630m</value>
</property>
yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>13312</value>
</property>
And I run a map-only streaming task in Python (no reducer) where I just read lines from a file and print out specific fields (I keep one of the fields as the key and join the rest into one big string).
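For context, a stripped-down sketch of the mapper looks roughly like this (the tab delimiter and field indices are simplified placeholders, not the real ones):

#!/usr/bin/env python
# Simplified streaming mapper: read records from stdin, emit one field as the
# key and the remaining fields joined into one big string as the value.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")  # placeholder delimiter
    if len(fields) < 2:
        continue                            # skip malformed records
    key = fields[0]                         # placeholder key field
    value = " ".join(fields[1:])            # rest of the record as one string
    print("%s\t%s" % (key, value))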
Each line holds quite a big array, so the default Hadoop configuration was changed to the one above (only to make sure that each record would fit in a mapper, so I can test my code without worrying about memory). Each line/record, though, is smaller than the block size (which I have left at the default value).
My problem is that when I test my code on a 7 GB sample of the original file everything runs perfectly, BUT when I try it on the original file (~100 GB), at about 50% of the map stage I get the error "Container is running beyond physical memory limits", where it reports that it has gone over the 3 GB limit.
Why does a mapper need more memory for a larger file? Isn't the computation supposed to happen record by record? If the block size is much smaller than the available memory, how does a mapper end up using more than 3 GB?
I find this issue a little perplexing.
If I'm interpreting your scenario correctly, it isn't that a single mapper is exhausting your memory; it's possible that many more mappers are being spawned in parallel, since there are so many more blocks of input - this is where much of Hadoop's parallelism comes from. The memory error is probably from too many mappers trying to run at the same time on each node. If you have a small cluster, you probably need to keep the mappers-per-node ratio lower for larger input sets.
This SO question/answer has more details about how to affect the mapper count: Setting the number of map tasks and reduce tasks
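As a rough sanity check on your numbers: with yarn.nodemanager.resource.memory-mb = 13312 and mapreduce.map.memory.mb = 3072, YARN can schedule at most floor(13312 / 3072) = 4 map containers per node at a time. If you want fewer, larger map tasks for the big input, one knob (assuming the default FileInputFormat-based splitting) is the minimum split size; something along these lines in mapred-site.xml, where the 256 MB value is only an illustration you would tune for your data:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <!-- 256 MB in bytes; forces each input split (and therefore each map task) to be at least this large -->
  <value>268435456</value>
</property>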