I wanted to build a per-country traffic report from an nginx access.log file. This is my code snippet, using Apache Spark with Python:
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="PythonAccessLogAnalyzer")

        def get_country_from_line(line):
            try:
                from geoip import geolite2
                ip = line.split(' ')[0]
                match = geolite2.lookup(ip)
                if match is not None:
                    return match.country
                else:
                    return "Unknown"
            except IndexError:
                return "Error"

        rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
        ips = rdd.countByValue()
        print(ips)
        sc.stop()
On a 6 GB log file, it took an hour to complete the task (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which I believe has to load some database. Is there any way for me to import geolite2 once per worker instead of once per line? Would that boost performance? Any suggestions to improve this code?
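One way to get the once-per-worker behavior described above is `RDD.mapPartitions`, which receives a whole partition's iterator, so any expensive setup runs once per partition rather than once per record. A minimal sketch of the pattern (the stub `lookup` dict stands in for the geolite2 database so the sketch is self-contained; in the real job you would do `from geoip import geolite2` at the top of the function and call `geolite2.lookup(ip)` per line):

```python
def get_countries(lines):
    # Expensive setup goes here, executed once per partition.
    # Stub table standing in for the geolite2 lookup (hypothetical data):
    lookup = {"1.2.3.4": "AU", "5.6.7.8": "US"}
    for line in lines:
        try:
            ip = line.split(' ')[0]
            yield lookup.get(ip, "Unknown")
        except IndexError:
            yield "Error"

# With Spark you would apply it as:
#   rdd = sc.textFile("/Users/victor/access.log").mapPartitions(get_countries)
#   counts = rdd.countByValue()

# The generator works on any iterable of lines, so it is easy to test locally:
print(list(get_countries(["1.2.3.4 GET /", "9.9.9.9 GET /"])))  # ['AU', 'Unknown']
```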
What about using broadcast variables? Here is the doc which explains how they work. They are simply read-only variables, shipped to each worker node once and then accessible from any task running there.
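A sketch of the broadcast pattern, assuming the lookup data can be reduced to a picklable structure such as a dict (an assumption on my part; if the geolite2 reader object itself cannot be pickled, you would broadcast the underlying data or fall back to per-partition initialization):

```python
def country_for(line, table):
    # Pure lookup logic, kept separate so it is easy to test without Spark.
    ip = line.split(' ')[0]
    return table.get(ip, "Unknown")

# With Spark (illustration only; the table contents are hypothetical):
#   table_bc = sc.broadcast({"1.2.3.4": "AU"})  # shipped once per worker
#   counts = rdd.map(lambda line: country_for(line, table_bc.value)).countByValue()

print(country_for("1.2.3.4 GET /index.html", {"1.2.3.4": "AU"}))  # AU
```

The key point is that tasks read the value through `table_bc.value`, so the data crosses the network once per worker instead of being captured in every task's closure.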