mrjob - automatic tar of source directory


I've created an Amazon EMR job using mrjob. My MapReduce job inherits from a common helper class that simplifies parsing the Apache log I'm processing. The base class is shared among several MapReduce jobs, so this is my file structure:

__init__.py
count_ip.py (mapreduce job)
common/apache.py   (base class count_ip.py inherits from)

I'd like to automatically tar my full src directory on my local machine and have mrjob upload it to Amazon EMR. Right now I have a tar file containing the common directory, common.tar.gz. I've added this tar to my Python packages in mrjob.conf and it works fine. What I'd like is to create common.tar.gz automatically. Does mrjob have any support for this, and if not, what options do I have?
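For reference, the mrjob.conf entry I'm describing looks roughly like this (a sketch; the `python_archives` runner option is from mrjob's documentation, and the archive name matches the setup above):

```yaml
runners:
  emr:
    python_archives:
      - common.tar.gz   # unpacked onto the remote Python path
```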


I'm not a super mrjobber, having only been doing it for the last few months, but I use Python's standard tarfile module to do this.

import tarfile
from os.path import basename

def tar_and_gzip(roots, filename):
    """
    Tars all files starting from the roots provided and gzips the result.
    """
    with tarfile.open(filename, 'w:gz') as tarball:
        for root in roots:
            # arcname keeps paths inside the tarball relative, e.g.
            # common/apache.py rather than the absolute local path
            tarball.add(root, arcname=basename(root))

You can either run this separately before you run your job, or write a script that does both.
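A minimal wrapper script along those lines might look like this (a sketch: the directory and job names come from the question, and launching the job through `subprocess` is one assumed way to do it):

```python
import subprocess
import sys
import tarfile
from os.path import basename

def tar_and_gzip(roots, filename):
    """Tar each root directory and gzip the result."""
    with tarfile.open(filename, 'w:gz') as tarball:
        for root in roots:
            # Keep archive paths relative, e.g. common/apache.py
            tarball.add(root, arcname=basename(root))

def main():
    # Rebuild the archive mrjob.conf points at, then launch the job.
    tar_and_gzip(['common'], 'common.tar.gz')
    subprocess.check_call(
        [sys.executable, 'count_ip.py', '-r', 'emr', 'input.log'])
```

Calling `main()` before each run means common.tar.gz can never go stale relative to the code in common/.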