Access a file stored in the HDFS distributed cache from a Python script


I have a Python script that needs to query a MaxMind (.mmdb) database file. My current thought is to load the MaxMind file into HDFS's distributed cache and then pass it through Pig to my Python script. My current Pig script is:

SET mapred.cache.file /path/filelocation/;
SET mapred.createsymlink YES;
SET mapred.cache.file hdfs://localserver:8020/pathtofile#filename;
REGISTER 'pythonscript' USING jython AS myudf;
logfile = LOAD 'filename' USING PigStorage(',') AS (x:int);
RESULT = FOREACH logfile GENERATE myudf.pyFunc(x,"how to pass in MaxMind file");

Any thoughts as to how to access the file from inside the Python script once it's loaded into the distributed cache?

Thanks

Best answer:

I think you can do it like this:

SET mapred.cache.files hdfs:///user/cody.stevens/testdat//list.txt#filename;
SET mapred.createsymlink YES; 
REGISTER 'my.py' USING jython AS myudf;
a = LOAD 'hdfs:///user/cody.stevens/pig.txt' as (x:chararray);
RESULT = FOREACH a GENERATE myudf.list_files(x,'filename');
STORE RESULT INTO '$OUTPUT';

And here is the corresponding my.py that I used for this example:

#!/usr/bin/env python
import os

@outputSchema("t:tuple(x:chararray, contents:chararray)")
def list_files(x, f):
    # 'f' is the symlink name from mapred.cache.files; the symlink is
    # created in the task's working directory, so a relative open works.
    # ls = os.listdir('.')  # useful for debugging what the task dir holds
    fin = open(f, 'rb')
    try:
        return (x, fin.read())
    finally:
        fin.close()


if __name__ == '__main__':
    print "ok"

Almost forgot: I was calling it like this:

pig -param OUTPUT=/user/cody.stevens/pigout -f dist.pig

The file should show up in the task's local working directory, so Python can open it directly. In that example, 'filename' is the name of the symbolic link; update it accordingly. In your case you will want 'filename' to point to your MaxMind file, and depending on what the values in 'a' are, you may need to change the schema back to 'as (x:int)'.
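For the MaxMind part specifically, here is a minimal sketch of what the UDF could look like, assuming the pure-Python maxminddb package is importable under Jython (the C extension won't load there) and that each tuple carries an IP address as a chararray; the lookup_country name and the country/iso_code fields are just illustrative:

#!/usr/bin/env python
# Hypothetical sketch, not the poster's code: query a GeoIP2 .mmdb file
# shipped via the distributed cache. Assumes the pure-Python 'maxminddb'
# package is on the Jython path and that 'x' is an IP address string.
import maxminddb

_reader = None

@outputSchema("country:chararray")
def lookup_country(x, f):
    global _reader
    if _reader is None:
        # 'f' is the cache symlink, relative to the task's working dir;
        # MODE_MEMORY avoids mmap, which Jython may not support.
        _reader = maxminddb.open_database(f, maxminddb.MODE_MEMORY)
    rec = _reader.get(x)
    if rec is None:
        return None
    return rec.get('country', {}).get('iso_code')

The Pig side stays the same; only the GENERATE line changes, e.g. RESULT = FOREACH logfile GENERATE myudf.lookup_country(x, 'filename');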

Good luck!