I have a Python script that needs to open and query a MaxMind (.mmdb) file. My current thought is to load the MaxMind file into Hadoop's distributed cache and then pass it through Pig to my Python script. My current Pig script is:
SET mapred.cache.files 'hdfs://localserver:8020/pathtofile#filename';
SET mapred.create.symlink 'yes';
REGISTER 'pythonscript' USING jython AS myudf;
logfile= LOAD 'filename' USING PigStorage(',') AS (x:int);
RESULT = FOREACH logfile GENERATE myudf.pyFunc(x,"how to pass in MaxMind file");
Any thoughts on how to access the file from inside the Python script once it's loaded into the distributed cache?
Thanks
I think you can do it like this:
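A minimal sketch of the Pig side, assuming the HDFS path from the question and `filename` as the symlink name (the input path and schema are placeholders):

```pig
-- ship the file via the distributed cache and expose it in each task's
-- working directory through a symlink named 'filename'
SET mapred.cache.files 'hdfs://localserver:8020/pathtofile#filename';
SET mapred.create.symlink 'yes';

REGISTER 'my.py' USING jython AS myudf;

-- load whatever values you want to look up against the cached file
a = LOAD 'input' USING PigStorage(',') AS (x:chararray);
```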
And here is the corresponding my.py that I used for this example:
Almost forgot: I was calling it like this.
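Something along these lines, with 'a' being the loaded relation:

```pig
RESULT = FOREACH a GENERATE myudf.pyFunc(x);
DUMP RESULT;
```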
The file should be in your local working directory, so Python should be able to open it. In that example, 'filename' is the name of the symbolic link; you will have to update it accordingly. In your case, 'filename' will be your MaxMind file, and depending on what the values in 'a' are, you may need to change the schema back to '(x:int)'.
Good luck!