How can I make Apache Spark workers import libraries before running map?


I want to build a per-country traffic report from an nginx access.log file. This is my code snippet using Apache Spark on Python:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    def get_country_from_line(line):
        try:
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print(ips)
    sc.stop()

On a 6 GB log file, the job took an hour to complete (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which presumably has to load some database. Is there any way for me to import geolite2 once on each worker instead of on each line? Would it boost the performance? Any suggestions to improve this code?
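For reference, one common way to pay the import cost once per partition rather than once per line is `RDD.mapPartitions`. A minimal sketch, assuming the same `geolite2` API as above; the optional `lookup` argument is purely an illustrative seam so the per-line logic can be exercised without geolite2 installed:

```python
# Sketch (not benchmarked): process a whole partition per call so the
# geolite2 import, and whatever database load it triggers, happens once
# per partition instead of once per line.

def countries_in_partition(lines, lookup=None):
    if lookup is None:
        # Paid once per partition, not per line.
        from geoip import geolite2
        lookup = geolite2.lookup
    for line in lines:
        try:
            ip = line.split(' ')[0]
            match = lookup(ip)
            yield match.country if match is not None else "Unknown"
        except IndexError:
            yield "Error"

# Then, instead of .map(get_country_from_line):
# rdd = sc.textFile("/Users/victor/access.log").mapPartitions(countries_in_partition)
# ips = rdd.countByValue()
```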

1 Answer

Answered by mgaido:

What about using broadcast variables? The Spark documentation explains how they work. Basically, they are read-only variables that are shipped to every worker node once per worker and can then be accessed whenever necessary.