I wanted to build a per-country traffic report from an nginx access.log file. This is my code snippet, using Apache Spark with Python:
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="PythonAccessLogAnalyzer")

        def get_country_from_line(line):
            try:
                from geoip import geolite2
                ip = line.split(' ')[0]
                match = geolite2.lookup(ip)
                if match is not None:
                    return match.country
                else:
                    return "Unknown"
            except IndexError:
                return "Error"

        rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
        ips = rdd.countByValue()
        print(ips)
        sc.stop()
On a 6 GB log file, it took an hour to complete the task (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line, it has to import geolite2, which I believe has to load some database. Is there any way for me to import geolite2 once per worker instead of once per line? Would that boost performance? Any suggestions to improve this code?
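One way to get the once-per-worker behavior described above is `RDD.mapPartitions`, which receives a whole partition's iterator, so any expensive setup runs once per partition rather than once per record. A minimal sketch of the pattern (the stub `lookup` dict stands in for the geolite2 database so the sketch is self-contained; in the real job you would do `from geoip import geolite2` at the top of the function and call `geolite2.lookup(ip)` per line):

```python
def get_countries(lines):
    # Expensive setup goes here, executed once per partition.
    # Stub table standing in for the geolite2 lookup (hypothetical data):
    lookup = {"1.2.3.4": "AU", "5.6.7.8": "US"}
    for line in lines:
        try:
            ip = line.split(' ')[0]
            yield lookup.get(ip, "Unknown")
        except IndexError:
            yield "Error"

# With Spark you would apply it as:
#   rdd = sc.textFile("/Users/victor/access.log").mapPartitions(get_countries)
#   counts = rdd.countByValue()

# The generator works on any iterable of lines, so it is easy to test locally:
print(list(get_countries(["1.2.3.4 GET /", "9.9.9.9 GET /"])))  # ['AU', 'Unknown']
```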
What about using broadcast variables? Here is the doc which explains how they work. They are simply read-only variables, shipped to each worker node once and then accessible from any task running there.
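A sketch of the broadcast pattern, assuming the lookup data can be reduced to a picklable structure such as a dict (an assumption on my part; if the geolite2 reader object itself cannot be pickled, you would broadcast the underlying data or fall back to per-partition initialization):

```python
def country_for(line, table):
    # Pure lookup logic, kept separate so it is easy to test without Spark.
    ip = line.split(' ')[0]
    return table.get(ip, "Unknown")

# With Spark (illustration only; the table contents are hypothetical):
#   table_bc = sc.broadcast({"1.2.3.4": "AU"})  # shipped once per worker
#   counts = rdd.map(lambda line: country_for(line, table_bc.value)).countByValue()

print(country_for("1.2.3.4 GET /index.html", {"1.2.3.4": "AU"}))  # AU
```

The key point is that tasks read the value through `table_bc.value`, so the data crosses the network once per worker instead of being captured in every task's closure.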