I have a program I wrote on Windows that runs perfectly fine, but when I run it on Xubuntu it is very slow and I'm not sure why. I have isolated the problem to either Beautiful Soup or threading, so I made a small test program and ran it on both Windows and Xubuntu.
from testThread import myThread
thread1 = myThread("thread1","http://www.youtube.com")
thread2 = myThread("thread2","http://www.youtube.com")
thread1.start()
thread2.start()
thread1.join()
thread2.join()
from threading import Thread
import time
from bs4 import BeautifulSoup
import requests

class myThread(Thread):
    def __init__(self, name, url):
        Thread.__init__(self)
        self.t0 = time.time()
        self.name = name
        self.url = url
        self.r = requests.get(url)

    def run(self):
        start_time = time.time() - self.t0
        soup = BeautifulSoup(self.r._content)
        end_time = time.time() - self.t0
        print self.name, "Time: %s" % str(end_time - start_time)
The program creates two threads, each of which prints how long it takes BeautifulSoup to parse the contents of a URL.
Xubuntu output:
thread1 Time: 6.88162994385
thread2 Time: 6.92221403122
Windows output:
thread1 Time: 0.524999856949
thread2 Time: 0.542999982834
As you can see, Windows was faster by a significant margin, and I have absolutely no idea why.
Xubuntu specs: Intel(R) Core(TM) i7, 2.6GHz, 16GB memory, Ubuntu 15.04
Windows specs: Intel(R) Core(TM) Duo CPU, 2.2GHz, 2GB memory, Windows 7
So why is Xubuntu so much slower than Windows at this task? I've been looking all over for a solution with no luck. Any help would be much appreciated, thank you.
EDIT: I should note that I have tried explicitly specifying the parser for BeautifulSoup:
soup = BeautifulSoup(self.r._content,"lxml")
Using lxml as the parser had no effect on the output.
EDIT: I have isolated the problem even further and have determined that the issue occurs inside Beautiful Soup's dammit.py module.
Code in bs4.dammit:
import codecs
from htmlentitydefs import codepoint2name
import re
import logging
import string

# Import a library to autodetect character encodings.
chardet_type = None
try:
    # First try the fast C implementation.
    #  PyPI package: cchardet
    import cchardet
    def chardet_dammit(s):
        return cchardet.detect(s)['encoding']
except ImportError:
    try:
        # Fall back to the pure Python implementation
        #  Debian package: python-chardet
        #  PyPI package: chardet
        import chardet
        def chardet_dammit(s):
            return chardet.detect(s)['encoding']
        #import chardet.constants
        #chardet.constants._debug = 1
    except ImportError:
        # No chardet available.
        def chardet_dammit(s):
            return None
On Windows it executes the "return None" in
    except ImportError:
        # No chardet available.
        def chardet_dammit(s):
            return None
On Xubuntu, however, it executes "return chardet.detect(s)['encoding']" in
    try:
        # Fall back to the pure Python implementation
        #  Debian package: python-chardet
        #  PyPI package: chardet
        import chardet
        def chardet_dammit(s):
            return chardet.detect(s)['encoding']
        #import chardet.constants
        #chardet.constants._debug = 1
The code path Xubuntu takes is much slower, so the issue seems to come from the libraries Beautiful Soup imports. I'm not exactly sure why Windows hits the nested ImportError and Xubuntu doesn't, but it seems that when Beautiful Soup successfully imports these libraries it slows down greatly, which seems odd. I would appreciate it if someone who better understands what's going on here could explain, thank you. Also, not sure if this has any effect, but I am using PyCharm on both Xubuntu and Windows.
EDIT: Found the problem. On Xubuntu I had the chardet library installed, whereas on Windows I did not. BeautifulSoup was using it to detect the page's encoding, which slowed the program down considerably; after uninstalling it, everything works smoothly.