Access a file stored in the HDFS distributed cache from a Python script


I have a Python script that needs to query a MaxMind (.mmdb) database file. My current thought is to load the MaxMind file into HDFS's distributed cache and then pass it through Pig to my Python script. My current Pig script is:

SET mapred.cache.file /path/filelocation/;
SET mapred.createsymlink YES;
SET mapred.cache.file hdfs://localserver:8020/pathtofile#filename;
REGISTER 'pythonscript' USING jython AS myudf;
logfile = LOAD 'filename' USING PigStorage(',') AS (x:int);
RESULT = FOREACH logfile GENERATE myudf.pyFunc(x,"how to pass in MaxMind file");

Any thoughts as to how to access the file from inside the Python script once it's loaded into the distributed cache?

Thanks

Best answer:

I think you can do it like this:

SET mapred.cache.files hdfs:///user/cody.stevens/testdat//list.txt#filename;
SET mapred.createsymlink YES; 
REGISTER 'my.py' USING jython AS myudf;
a = LOAD 'hdfs:///user/cody.stevens/pig.txt' as (x:chararray);
RESULT = FOREACH a GENERATE myudf.list_files(x,'filename');
STORE RESULT INTO '$OUTPUT';

And here is the corresponding my.py that I used for this example:

#!/usr/bin/env python
import os

@outputSchema("t:tuple(x:chararray, contents:chararray)")
def list_files(x, f):
    # 'f' is the symlink name from mapred.cache.files; the symlink is
    # created in the task's working directory, so a relative open works.
    # ls = os.listdir('.')  # useful for debugging what the task dir holds
    fin = open(f, 'rb')
    try:
        return (x, fin.read())
    finally:
        fin.close()


if __name__ == '__main__':
    print "ok"

Almost forgot: I was calling it like this:

pig -param OUTPUT=/user/cody.stevens/pigout -f dist.pig

The file should show up in the task's local working directory, so Python can open it directly. In that example, 'filename' is the name of the symbolic link; update it accordingly. In your case you will want 'filename' to point to your MaxMind file, and depending on what the values in 'a' are, you may need to change the schema back to 'as (x:int)'.
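For the MaxMind part specifically, here is a minimal sketch of what the UDF could look like, assuming the pure-Python maxminddb package is importable under Jython (the C extension won't load there) and that each tuple carries an IP address as a chararray; the lookup_country name and the country/iso_code fields are just illustrative:

#!/usr/bin/env python
# Hypothetical sketch, not the poster's code: query a GeoIP2 .mmdb file
# shipped via the distributed cache. Assumes the pure-Python 'maxminddb'
# package is on the Jython path and that 'x' is an IP address string.
import maxminddb

_reader = None

@outputSchema("country:chararray")
def lookup_country(x, f):
    global _reader
    if _reader is None:
        # 'f' is the cache symlink, relative to the task's working dir;
        # MODE_MEMORY avoids mmap, which Jython may not support.
        _reader = maxminddb.open_database(f, maxminddb.MODE_MEMORY)
    rec = _reader.get(x)
    if rec is None:
        return None
    return rec.get('country', {}).get('iso_code')

The Pig side stays the same; only the GENERATE line changes, e.g. RESULT = FOREACH logfile GENERATE myudf.lookup_country(x, 'filename');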

Good luck!