I am reading a WARC file with Python's 'warc' library. The file I am currently using is around 4.50 GB. The thing is:
file = warc.open("random.warc")
html_lists = [line for line in file]
Executing these two lines takes up to 40 seconds. Since there will be 64,000 more files like this one, 40 seconds per file is not acceptable. Do you have any tips to improve performance, or any different approaches?
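One thing worth noting: the list comprehension forces every record of the 4.5 GB file into memory at once. Iterating lazily with a generator lets you process and discard records one at a time. The sketch below is illustrative only — it uses a minimal blank-line split on a synthetic WARC-like stream, not the real `warc` library API:

```python
# Sketch: process records lazily instead of materializing them all.
# The parsing here is a stand-in (a minimal WARC-like split on blank
# lines), not the actual `warc` library interface.

def iter_records(stream):
    """Yield one record at a time so memory usage stays flat."""
    buf = []
    for line in stream:
        if line.strip() == "" and buf:
            yield "".join(buf)
            buf = []
        else:
            buf.append(line)
    if buf:
        yield "".join(buf)

# A list comprehension would pull everything into RAM at once;
# a generator lets you filter, transform, and discard as you go.
sample = ["WARC/1.0\n", "header: a\n", "\n", "WARC/1.0\n", "header: b\n"]
records = list(iter_records(sample))
```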
Edit: I found out that the BeautifulSoup operations were taking most of the time, so I removed them and wrote the necessary parts myself. It is 100x faster now: it takes about 60 seconds to read and process 4.50 GB of data. With this line of code I remove the scripts from the data:
clean = re.sub(r"<script.*?</script>", "", string=text)
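One caveat with that pattern: without `re.DOTALL`, the `.` does not match newlines, so any `<script>` block spanning multiple lines is left in place. Compiling the pattern once (with `re.IGNORECASE` added here as an assumption, since script tags may vary in case) also avoids per-call lookup overhead when you run this over 64,000 files:

```python
import re

# Compile once with DOTALL so '.' also matches newlines -- without it,
# <script> blocks spanning multiple lines would not be removed.
SCRIPT_RE = re.compile(r"<script.*?</script>", re.DOTALL | re.IGNORECASE)

text = "before<script>\nvar x = 1;\n</script>after"
clean = SCRIPT_RE.sub("", text)
```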
And with this one I split the text and remove the stamp, which I don't need:
warc_stamp = str(soup).split("\r\n\r\n")
As I said, it is faster, but 60 seconds is still not good in this case. Any suggestions?
Get the source code of that module and check it for optimization potential.
Use a profiler to identify the performance bottlenecks, then focus on those for optimization.
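The standard library's `cProfile` is enough to find out where the time actually goes. A minimal sketch (the `work` function is a hypothetical stand-in for your WARC-reading code):

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for your WARC-reading/cleaning code.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

If most of the time turns out to be in I/O or decompression rather than your own parsing, rewriting the parsing won't help much, so measure before optimizing.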
Rewriting Python code in Cython and compiling it to native code can make a huge difference, so that is likely worth a try.
But in any case, rather than speculating on an internet forum about how to accelerate a two-line script, you really need to work with the actual code underneath!