How to scrape data into a PythonAnywhere database

I have a database on PythonAnywhere and credentials in place.

The aim is to scrape a whole load of news websites and chuck the data into a new website using Flask.

Here's the code for one section of the site. (I've left out the imports and the database engine setup; they're all in place and the script runs.)

@app.route("/nationals")
def scrape_nationals():
    feeds = [{"type": "news","title": "BBC", "url": "http://feeds.bbci.co.uk/news/uk/rss.xml"},
        {"type": "news","title": "The Economist", "url": "https://www.economist.com/international/rss.xml"},
        {"type": "news","title": "The New Statesman", "url": "https://www.newstatesman.com/feed"},
        {"type": "news","title": "The New York Times", "url": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"},
        {"type": "news","title": "Metro UK","url": "https://metro.co.uk/feed/"},
        {"type": "news", "title": "Evening Standard", "url": "https://www.standard.co.uk/rss.xml"},
        {"type": "news","title": "Daily Mail", "url": "https://www.dailymail.co.uk/articles.rss"},
        {"type": "news","title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Mirror", "url": "https://www.mirror.co.uk/news/?service=rss"},
        {"type": "news", "title": "The Sun", "url": "https://www.thesun.co.uk/news/feed/"},
        {"type": "news", "title": "Sky News", "url": "https://news.sky.com/feeds/rss/home.xml"},
        {"type": "news", "title": "The Guardian", "url": "https://www.theguardian.com/uk/rss"},
        {"type": "news", "title": "The Independent", "url": "https://www.independent.co.uk/news/uk/rss"},
        #{"type": "news", "title": "The Telegraph", "url": "https://www.telegraph.co.uk/news/rss.xml"},
        {"type": "news", "title": "The Times", "url": "https://www.thetimes.co.uk/?service=rss"}]
    print(feeds)

    data = []                               # <---- initialize empty list here
    for feed in feeds:
        parsed_feed = feedparser.parse(feed['url'])
        #print("Title:", feed['title'])
        #print("Number of Articles:", len(parsed_feed.entries))
        #print("\n")
        for entry in parsed_feed.entries:

            title = entry.title
            print(title)
            url = entry.link
            #print(entry.summary)
            try:
                # cap summaries at 400 characters; fall back if the slice is empty
                summary = entry.summary[:400] or "No summary available"
            except AttributeError:
                # this feed entry has no summary field at all
                summary = "none"
            try:
                date = pd.to_datetime(entry.published)
            except (AttributeError, ValueError):
                # entry has no published date, or it can't be parsed;
                # use the Unix epoch as a placeholder
                date = pd.to_datetime("01-01-1970")
            data.append([title, url, summary, date])          # <---- append data from each entry here

    df = pd.DataFrame(data, columns=['title', 'url', 'summary', 'date'])
    articles = pd.read_sql('nationals', con=engine)   # existing rows already in the table
    articles = articles.drop_duplicates()
    df = pd.concat([df, articles])                    # DataFrame.append was removed in pandas 2.0
    df = df.drop_duplicates()
    df.to_sql('nationals', con=engine, if_exists='replace', index=False)
    return f"Saved {len(df)} articles to the nationals table"   # a Flask view must return a response

It works in VSCode locally, but I can't work out why my table on PythonAnywhere won't populate. What have I got wrong?
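
For what it's worth, here's a minimal check (just a sketch, using the BBC URL from the list above) that I can run in a PythonAnywhere console to see whether the feeds are being fetched at all. "bozo" is feedparser's flag for fetch/parse problems:

import feedparser

# Sketch: try one feed from the list above. On PythonAnywhere, zero entries
# plus a set bozo flag would explain why nothing reaches the table.
test = feedparser.parse("http://feeds.bbci.co.uk/news/uk/rss.xml")
print("bozo:", test.bozo)              # truthy if the fetch or parse failed
print("entries:", len(test.entries))   # 0 means there was nothing to append
if test.bozo:
    print("exception:", test.bozo_exception)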

Glenn answered:

Free accounts can only access sites on our allowlist. A site is only eligible for the allowlist if it has a publicly documented API. Search for "allowlist" on the PythonAnywhere help pages for how to request additions.
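
A quick way to see which of those feed hosts the proxy lets through is something like this (just a sketch, run in a console on the free account; it assumes requests picks up the proxy settings from the environment, which is its default behaviour, and that blocked hosts fail with a proxy error):

import requests

feed_urls = [
    "http://feeds.bbci.co.uk/news/uk/rss.xml",
    "https://www.theguardian.com/uk/rss",
    # ...add the rest of the URLs from the question
]

for url in feed_urls:
    try:
        r = requests.get(url, timeout=10)
        print(url, "->", r.status_code)
    except requests.exceptions.RequestException as exc:
        # On a free account a non-allowlisted host usually fails here,
        # typically with a ProxyError / 403 from the proxy.
        print(url, "-> blocked or unreachable:", exc)

Any URL that prints "blocked or unreachable" is a host you'd need to request on the allowlist (or drop from the feeds list).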