I have a small Hadoop (2.5.1) cluster with the following configuration (concerning memory limits).
mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2450m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1630m</value>
</property>
yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>13312</value>
</property>
And I run a map-only streaming task in Python (no reducer) where I just read lines from a file and print out specific fields (I keep one of the fields as the key and join the rest into one big string).
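For context, a stripped-down sketch of the mapper looks roughly like this (the tab delimiter and field indices are simplified placeholders, not the real ones):

#!/usr/bin/env python
# Simplified streaming mapper: read records from stdin, emit one field as the
# key and the remaining fields joined into one big string as the value.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")  # placeholder delimiter
    if len(fields) < 2:
        continue                            # skip malformed records
    key = fields[0]                         # placeholder key field
    value = " ".join(fields[1:])            # rest of the record as one string
    print("%s\t%s" % (key, value))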
Each line holds quite a big array, so the default Hadoop configuration was changed to the one above (only to make sure that each record would fit in a mapper, so I can test my code without worrying about memory). Each line/record, though, is smaller than the block size (which I have left at the default value).
My problem is that when I test my code on a 7 GB sample of the original file everything runs perfectly, BUT when I try it on the original file (~100 GB), at about 50% of the map stage I get the error "Container is running beyond physical memory limits", where it reports that it has gone over the 3 GB limit.
Why does a mapper need more memory for a larger file? Isn't the computation supposed to happen record by record? If the block size is much smaller than the available memory, how does a mapper end up using more than 3 GB?
I find this issue a little perplexing.
If I'm interpreting your scenario correctly, it isn't that a single mapper is exhausting your memory; it's possible that many more mappers are being spawned in parallel, since there are so many more blocks of input - this is where much of Hadoop's parallelism comes from. The memory error is probably from too many mappers trying to run at the same time on each node. If you have a small cluster, you probably need to keep the mappers-per-node ratio lower for larger input sets.
This SO question/answer has more details about how to affect the mapper count: Setting the number of map tasks and reduce tasks
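As a rough sanity check on your numbers: with yarn.nodemanager.resource.memory-mb = 13312 and mapreduce.map.memory.mb = 3072, YARN can schedule at most floor(13312 / 3072) = 4 map containers per node at a time. If you want fewer, larger map tasks for the big input, one knob (assuming the default FileInputFormat-based splitting) is the minimum split size; something along these lines in mapred-site.xml, where the 256 MB value is only an illustration you would tune for your data:

<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <!-- 256 MB in bytes; forces each input split (and therefore each map task) to be at least this large -->
  <value>268435456</value>
</property>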