I've created an Amazon EMR job using mrjob. My MapReduce job inherits from a common helper class that simplifies parsing the Apache log, and that base class is shared among several MapReduce jobs. This is my file structure:
__init__.py
count_ip.py (mapreduce job)
common/apache.py (base class count_ip.py inherits from)
I'd like to automatically tar my full src directory on my local machine and have mrjob upload it to Amazon EMR. Right now I have a tar file of the common directory, common.tar.gz, which I've added to the Python packages in my mrjob.conf, and that works fine. What I'd like is to create common.tar.gz automatically. Does mrjob have any support for this, and if not, what options do I have?
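For reference, the mrjob.conf entry described above might look something like this (a sketch assuming the python_archives option, which older mrjob versions use to ship extra Python packages to EMR; the exact key may differ by mrjob version):

```yaml
runners:
  emr:
    python_archives:
      # Archive of the shared base-class package; extracted on the
      # task nodes and added to PYTHONPATH before the job runs.
      - common.tar.gz
```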
I'm not an mrjob expert, having only used it for the last few months, but I use Python's standard tarfile package to do this. You can either run it separately before you run your job, or write a script that does both.
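A minimal sketch of the tarfile approach, assuming the directory layout from the question (the function name and paths are illustrative, not part of mrjob):

```python
import os
import tarfile

def build_tarball(src_dir, out_path):
    """Create a gzipped tar of src_dir, e.g. common/ -> common.tar.gz."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps just the top-level directory name inside the
        # archive, so it extracts as common/... rather than a full path.
        tar.add(src_dir, arcname=os.path.basename(src_dir))

# Regenerate the archive mrjob.conf points at before launching the job:
# build_tarball("common", "common.tar.gz")
```

You could call this at the top of a small launcher script that then invokes your MRJob, so the archive is always rebuilt from the current source before each run.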