I'm trying to extract information from many different tables on an HTML page without any of the HTML indentation or tab formatting. I use get_text to generate the content I want, but it prints with a lot of whitespace and tabs. I've tried .strip, and that doesn't accomplish what I want.
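Here's the Python script I'm using:
import csv, simplejson
from urllib.request import urlopen  # Python 3; Python 2 used urllib.urlopen
from bs4 import BeautifulSoup

url = "http://www.thecomedystudio.com/schedule.html"
response = urlopen(url)
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.read(), "html.parser")
text = soup.get_text()
print(text)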
In the end, I'd like to create a csv of the event calendar, but first I'd like to create a .txt or something that doesn't require too much manual cleaning.
Any help appreciated.
You don't need to "clean up" the HTML in order to parse it with BeautifulSoup. Just parse the dates and events into a csv file directly:
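A minimal sketch of that idea, written for Python 3 and assuming the schedule is laid out as ordinary table rows. The sample markup, the helper names table_rows and write_csv, and the file name sample_output.csv are hypothetical and not taken from the real page, whose structure may well need different selectors:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real schedule page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr>
    <td>
        Fri 1   </td>
    <td>
        Example Show
            with\tGuest
    </td>
  </tr>
  <tr><td> Sat 2 </td><td> Another   Show </td></tr>
</table>
"""


def table_rows(html):
    """Return one list of cleaned cell strings per <tr> in the document."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        for cell in tr.find_all(["td", "th"]):
            # split() with no arguments drops tabs, newlines and runs of
            # spaces; joining with one space collapses them to a single gap.
            cells.append(" ".join(cell.get_text().split()))
        if any(cells):  # skip rows that contain only layout whitespace
            rows.append(cells)
    return rows


def write_csv(rows, path):
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


rows = table_rows(SAMPLE_HTML)
write_csv(rows, "sample_output.csv")
```

Calling .strip on the result of get_text only trims the two ends of the string, which is why it leaves the tabs and newlines in the middle untouched; splitting on whitespace and re-joining cleans each cell individually. If the page nests tables inside tables, find_all("tr") will also return the inner rows, so a real script would probably need to target the specific tables it wants.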
The contents of output.csv after running the script: