How to read an ASP.NET page with BeautifulSoup?

I am trying to scrape some data from a webpage using Beautiful Soup.

I am running into problems when I try to convert the HTML document into a BeautifulSoup object.

When I run the code

soup = BeautifulSoup(html_doc)

the error message I'm getting is:

SyntaxError: Non-ASCII character '\xa9' in file C:/Users/mlee/PycharmProjects/BsTest/htmlparse.py on line 683, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

I believe this is because there are some ASP.NET ViewState objects in the HTML which are base64 encoded.

Is there a suggested workaround to this or will I have to use a different tool?

Also, I am primarily just interested in getting the javascript generated portions of text. Is there a better way of doing this?

Thank you!

Best answer

Put this header

#!/usr/bin/env python
# -*- coding: utf-8 -*-

at the very top of your htmlparse.py file (Python only honors the coding declaration on the first or second line), and make sure that PyCharm saves the file as UTF-8.

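This has nothing to do with ASP.NET or ViewState. The SyntaxError is raised while Python parses htmlparse.py itself, before BeautifulSoup ever runs: the source file contains non-ASCII characters (the error points at line 683), and Python 2 assumes ASCII source unless an encoding is declared.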
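To see why the message names '\xa9' in particular, here is a minimal sketch. The sample bytes and the try_decode helper are made up for illustration; they are not taken from the asker's file. A lone \xa9 byte is how Latin-1 (and Windows-1252) encode the copyright sign, and it is neither valid ASCII nor valid UTF-8 on its own, which would be consistent with a copyright notice pasted into the script.

```python
# Hypothetical sample: a copyright notice as it might be stored on disk
# in a single-byte encoding such as Latin-1.
raw = b"Copyright \xa9 2014"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes 0x00-0x7F, so 0xA9 is rejected.
print(try_decode(raw, "ascii") is None)
# A bare 0xA9 is a continuation byte in UTF-8, so it is rejected as well.
print(try_decode(raw, "utf-8") is None)
# Latin-1 maps 0xA9 to the copyright sign, so decoding succeeds.
print(try_decode(raw, "latin-1") is not None)
```

If the file really is stored in a single-byte encoding, re-saving it as UTF-8 in the editor (as suggested above) rewrites that byte as the two-byte UTF-8 sequence, so the declared encoding and the bytes on disk agree.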

I am primarily just interested in getting the javascript generated portions of text. Is there a better way of doing this?

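You might want to use Selenium WebDriver with its Python bindings for this: it drives a real browser, so the page's JavaScript actually runs before you read the HTML. Another option is PhantomJS, a headless browser.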
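A rough sketch of the Selenium route, under some assumptions: Firefox and its driver are installed, the URL is a placeholder, and rendered_html and main are names invented for this example. The helper only relies on the driver's get() method and page_source attribute.

```python
def rendered_html(driver, url):
    """Load url in the browser behind `driver` and return the current DOM as HTML."""
    driver.get(url)
    # page_source reflects the DOM at the moment it is read, so content
    # written by scripts that have already run is included.
    return driver.page_source


def main():
    # Imported here so the helper above can be tried without a browser installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        # Hypothetical URL; replace with the page you are scraping.
        html = rendered_html(driver, "http://example.com/page.aspx")
    finally:
        driver.quit()  # always shut the browser down, even on errors
    print(html[:200])
```

Calling main() opens a browser, loads the page, and prints the start of the rendered HTML. The returned string can then be passed to BeautifulSoup as in the question. Content that scripts add some time after the page loads may need an explicit wait before page_source is read.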