Scraping blog and saving date to database causes DateError: unknown date format


I am working on a project where I scrape a number of blogs and save a selection of the data, such as the title of each post, the date it was posted, and its content, to a SQLite database. The goal in the end is to do some fancy textual analyses, but right now I have a problem with writing the data to the database. I work with the Pattern library for Python (the module about databases can be found here).

I am busy with the third blog now. The data from the two other blogs is already saved in the database, and for the third blog, which is similarly structured, I adapted the code.

There are several functions that call each other, and they work fine. I can also access all the data correctly: when I try it out in IPython Notebook, it works. When I ran the code as a trial in the console for only one blog page (there are 43 altogether), it also worked and saved everything nicely in the database. But when I ran it again for all 43 pages, it threw a DateError.

There are some comments and print statements inside the functions now, which I used for debugging. The problem seems to happen in the function parse_post_info. It returns a dictionary to the function that goes over all blog pages and opens every single post; that function saves the dictionary only if it is not None. I think it is None here, because something goes wrong with the date format.

Also, why does the code work once, and then the same code throws a DateError the second time?

DateError: unknown date format for '2015-06-09T07:01:55+00:00'

Here is the function:

from pattern.db import Database, field, pk, date, STRING, INTEGER, BOOLEAN, DATE, NOW, TEXT, TableError, PRIMARY, eq, all
from pattern.web import URL, Element, DOM, plaintext


def parse_post_info(p):
    """ This function receives a post Element from the post list and
        returns a dictionary with post url, post title, labels, date.
    """
    try:
        post_header = p("header.entry-header")[0]
        title_tag = post_header("a < h1")[0]
        post_title = plaintext(title_tag.content)
        print post_title
        post_url = title_tag("a")[0].href
        date_tag = post_header("div.entry-meta")[0]
        post_date = plaintext(date_tag("time")[0].datetime).split("T")[0]
        #post_date = date(post_date_text)
        print post_date
        post_id = int(((p).id).split("-")[-1])
        post_content = get_post_content(post_url)
        labels = " "
        print labels
        return dict(blog_no=blog_no,
                    post_title=post_title,
                    post_url=post_url,
                    post_date=post_date,
                    post_id=post_id,
                    labels=labels,
                    post_content=post_content
                    )
    except:
        pass

There is 1 best solution below


The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string or NOW.

Your timestamp can also differ from local time: '2015-06-09T07:01:55+00:00' carries a UTC offset (+00:00).
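If the offset matters for your analysis, one option is to apply it yourself before storing the value. The following is only a sketch using the standard library, not anything from Pattern; the function name to_utc_naive is made up for illustration, and it assumes the timestamp always starts with 19 characters of the form YYYY-MM-DDTHH:MM:SS, optionally followed by "Z" or an offset like "+02:00".

```python
from datetime import datetime, timedelta


def to_utc_naive(raw):
    """Parse an ISO 8601-style timestamp and return a naive datetime in UTC.

    Hypothetical helper: assumes 'YYYY-MM-DDTHH:MM:SS' followed by nothing,
    'Z', or an offset such as '+02:00' / '-0530'.
    """
    raw = raw.strip()
    core, offset = raw[:19], raw[19:]
    parsed = datetime.strptime(core, "%Y-%m-%dT%H:%M:%S")
    if offset in ("", "Z"):
        # No offset given: treat the value as already being in UTC.
        return parsed
    sign = 1 if offset[0] == "+" else -1
    digits = offset[1:].replace(":", "")
    delta = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4] or 0))
    # A '+02:00' timestamp is two hours ahead of UTC, so subtract the offset.
    return parsed - sign * delta
```

With the value from the error message, the offset is zero, so the result is simply the same date and time without the "+00:00" suffix.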

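Also, the format is "YYYY-MM-DD hh:mm:ss", which the string in your error message does not match: it has a "T" between date and time and a "+00:00" offset at the end.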

The format codes for converting times can be found here.
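As a rough sketch of such a conversion, you could normalize the scraped string to "YYYY-MM-DD hh:mm:ss" with Python's own datetime module before passing it on. The helper name normalize_timestamp is hypothetical, and the sketch assumes that the first 19 characters are always YYYY-MM-DDTHH:MM:SS and that the offset can simply be dropped.

```python
from datetime import datetime


def normalize_timestamp(raw):
    """Turn '2015-06-09T07:01:55+00:00' into '2015-06-09 07:01:55'.

    Hypothetical helper: the trailing UTC offset is discarded, which is
    only safe if all posts use the same offset.
    """
    # Keep just the date and time part, e.g. '2015-06-09T07:01:55'.
    core = raw.strip()[:19]
    # strptime raises ValueError if the string does not match the pattern,
    # so malformed dates are reported instead of silently stored.
    parsed = datetime.strptime(core, "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```

The returned string uses a space instead of "T" and has no offset, so it matches the "YYYY-MM-DD hh:mm:ss" layout mentioned above.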