I'm new to coding and this is my first project. So far I've pieced together what I have through Googling, tutorials and Stack Overflow.
I'm trying to add data from a pandas DataFrame of scraped RSS feeds to a remote SQL database, then host the script on Heroku or AWS and have it run every hour.
Someone on here recommended that I use APScheduler, as in this post.
I'm struggling, though, as there aren't any 'dummies' tutorials for APScheduler. This is what I've created so far.
I guess my question is: does my script need to be in a function for APScheduler to trigger it, or can it work another way?
from apscheduler.schedulers.blocking import BlockingScheduler
sched = BlockingScheduler()
@sched.scheduled_job('interval', minutes=1)
sched.configure()
sched.start()
import pandas as pd
from pandas.io import sql
import feedparser
import time
rawrss = ['http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
          'https://www.yahoo.com/news/rss/',
          'http://www.huffingtonpost.co.uk/feeds/index.xml',
          'http://feeds.feedburner.com/TechCrunch/',
          'https://www.uktech.news/feed'
          ]
time = time.strftime('%a %H:%M:%S')
summary = 'text'
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((time, post.title, post.link, summary))
df = pd.DataFrame(posts, columns=['article_time','article_title','article_url', 'article_summary']) # pass data to init
df.set_index(['article_time'], inplace=True)
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://<username>:<host>:3306/<database_name>?charset=utf8', encoding = 'utf-8')
engine.execute("INSERT INTO rsstracker VALUES('%s', '%s', '%s','%s')" % (time, post.title, post.link, summary))
df.to_sql(con=engine, name='rsstracker', if_exists='append') #, flavor='mysql'
Yes. What you want to be executed must be a function (or another callable, like a method). The decorator syntax (@sched.…) needs a function definition (def …) to which the decorator is applied. The code in your example doesn't compile.

Then it's a blocking scheduler, meaning that if you call sched.start(), this method doesn't return (unless you stop the scheduler in some scheduled code) and nothing after the call is executed.

Imports should go at the top; then it's easier to see what the module depends on. And don't import things you don't actually use.
I'm not sure why you import and use pandas for data that doesn't really need DataFrame objects. The same goes for SQLAlchemy: you import it without actually using anything this library offers, and you format values as strings into an SQL query, which is dangerous!

Using just SQLAlchemy for the database, it may look like this:
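A minimal sketch of the database part, not a definitive implementation. It assumes the rsstracker table has the four columns named in the question's DataFrame, and that each post is passed in as a dict keyed by those column names. The helper name store_posts is hypothetical. The values are sent as bound parameters instead of being formatted into the SQL string.

```python
from sqlalchemy import text

# Named placeholders (:name) are filled in by the database driver,
# so quotes or SQL fragments inside a title cannot alter the statement.
INSERT_POST = text(
    'INSERT INTO rsstracker'
    ' (article_time, article_title, article_url, article_summary)'
    ' VALUES (:article_time, :article_title, :article_url, :article_summary)'
)


def store_posts(engine, posts):
    """Insert all posts in one transaction and return how many were sent.

    `posts` is assumed to be a list of dicts with the four column names as keys.
    """
    if not posts:
        return 0
    # engine.begin() commits on success and rolls back on an exception.
    with engine.begin() as connection:
        connection.execute(INSERT_POST, posts)
    return len(posts)
```

The engine would be created once with create_engine() and the connection URL from the question, and the list of dicts would be built in the feedparser loop in place of the list of tuples.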
The time column seems a bit odd; I would have expected a TIMESTAMP or DATETIME here and not a string that throws away much of the information, just leaving the abbreviated weekday and the time.
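To illustrate what gets lost, here is a small sketch using only the standard library. The two dates are arbitrary examples chosen exactly one week apart; the format string is the one from the question.

```python
from datetime import datetime, timedelta


def question_style_label(moment):
    """Format a moment the way the question does: weekday abbreviation and time."""
    return moment.strftime('%a %H:%M:%S')


first = datetime(2024, 3, 15, 9, 30, 0)
second = first + timedelta(weeks=1)

# Two different moments collapse into the same string, so rows stored this
# way can no longer be ordered or filtered by date.
labels_match = question_style_label(first) == question_style_label(second)
moments_match = first == second
```

Here labels_match is True while moments_match is False. Storing the datetime object itself in a DATETIME or TIMESTAMP column keeps the full date, and the short label can still be produced when the data is displayed.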