I need to fetch the source code of about 4,000 web pages and extract a few numbers from each one. I do this with urllib and .split(), store the results in a dataframe, and export to CSV.
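For reference, a stripped-down sketch of what the loop looks like (the URLs, markers, and column names here are placeholders, not my real ones):

    import urllib.request
    import pandas as pd

    urls = ["http://example.com/page{}".format(i) for i in range(4000)]  # placeholder URLs

    rows = []
    for url in urls:
        # Fetch the raw HTML as a string
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        # Pull the number out from between two known markers with .split()
        value = html.split('<span id="price">')[1].split('</span>')[0]  # placeholder markers
        rows.append({"url": url, "value": value})

    pd.DataFrame(rows).to_csv("results.csv", index=False)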
After running cProfile:
    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
       290    0.003    0.000  411.894    1.420  request.py:1281(http_open)
       290    0.002    0.000  411.956    1.421  request.py:140(urlopen)
These calls take a long time. Is there a way to fetch the source code faster? If not, are there any disadvantages to splitting the URLs among 6 different kernels, so that each one only has to fetch some 650 pages, and running them in parallel instead of using threading? (A rough sketch of the threaded alternative I'm weighing this against is below.) I am new to Python 3.
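This is roughly the threaded version I have in mind for comparison (sketch only; fetch_one stands in for my real extraction code, and the worker count is a guess):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch_one(url):
        # Network-bound work: each thread spends most of its time waiting on I/O
        return urllib.request.urlopen(url, timeout=30).read()

    urls = ["http://example.com/page{}".format(i) for i in range(4000)]  # placeholder URLs

    with ThreadPoolExecutor(max_workers=20) as pool:
        pages = list(pool.map(fetch_one, urls))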
Also, is the above cProfile excerpt (Python 3) evidence that fetching the source code is the bottleneck in my code? What other factors could contribute to the slowness? I have a decent 8 Mbps connection, but I suspect the TCP handshake is what takes so long.
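To double-check where the time actually goes, I was planning to time a single fetch separately from the parsing, along these lines (placeholder URL and marker):

    import time
    import urllib.request

    url = "http://example.com/page1"  # placeholder

    start = time.perf_counter()
    html = urllib.request.urlopen(url).read()
    fetched = time.perf_counter()
    parts = html.decode("utf-8", errors="replace").split('<span id="price">')  # placeholder marker
    parsed = time.perf_counter()

    print("fetch: {:.3f}s  parse: {:.3f}s".format(fetched - start, parsed - fetched))

Is that a reasonable way to confirm that the network, rather than the string processing, is the slow part?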