I am trying to scrape some data from a webpage using Beautiful Soup.
I am running into problems when I try to convert the HTML document into a BeautifulSoup object.
When I run the code
soup = BeautifulSoup(html_doc)
The error message I'm getting is:
SyntaxError: Non-ASCII character '\xa9' in file C:/Users/mlee/PycharmProjects/BsTest/htmlparse.py on line 683, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
I believe this is because there are some ASP.NET viewstate objects in the HTML which are base64-encoded.
Is there a suggested workaround to this or will I have to use a different tool?
Also, I am primarily interested in getting the JavaScript-generated portions of the text. Is there a better way of doing this?
Thank you!
Put this header
# -*- coding: utf-8 -*-
on the first line of your htmlparse.py file, and make sure that PyCharm saves the file as UTF-8. This has nothing to do with ASP.NET or viewstate: you have non-ASCII characters in the source file itself.
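If the HTML is pasted into htmlparse.py as a string literal (the error pointing at line 683 suggests this, but it is only a guess), another way around the problem is to keep the HTML in a separate file and read it at runtime, so the .py source stays plain ASCII. A minimal sketch; load_html is a made-up helper name, and it assumes the saved page is UTF-8:

```python
import io


def load_html(path, encoding="utf-8"):
    """Read an HTML file from disk and return its contents as decoded text.

    Hypothetical helper for illustration; the UTF-8 default is an assumption
    about how the page was saved.
    """
    # io.open decodes while reading and behaves the same on Python 2 and 3.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

You would then build the soup with something like soup = BeautifulSoup(load_html("page.html"), "html.parser"), where page.html is a placeholder file name.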
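You might want to use Selenium WebDriver with its Python bindings for this task. Another option is PhantomJS. Both actually run the page's JavaScript, so the generated text ends up in the DOM you read back.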