I'm a student trying to get all the top-level comments from this r/worldnews live thread: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/ for a school research project. I'm coding in Python, using PRAW and the pandas library. Here's the code I've written so far:
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []
def process_comment(comment):
if isinstance(comment, praw.models.Comment) and comment.is_root:
comments_list.append({
'author': comment.author.name if comment.author else '[deleted]',
'body': comment.body,
'score': comment.score,
'edited': comment.edited,
'created_utc': comment.created_utc,
'permalink': f"https://www.reddit.com{comment.permalink}"
})
submission.comments.replace_more(limit=None, threshold=0)
for top_level_comment in submission.comments.list():
process_comment(top_level_comment)
comments_df = pd.DataFrame(comments_list)
But the code times out when limit=None, and using other limits (100, 300, 500) only returns ~700 comments. Any help gathering the top-level comments from this Reddit thread would be greatly appreciated.
I've looked through what feels like hundreds of pages of documentation and Reddit threads and tried the following techniques:
- Adding a "timeout" (sleep) around the Reddit API calls, then continuing to gather comments after the break
- Gathering comments in batches, then calling replace_more again, but to no avail (a rough sketch of this approach is just below this list). I've also gone through the Reddit API rate-limit documentation, hoping there is a way around these limits.
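To illustrate what I mean by batching (a rough sketch, not the exact code I ran; the helper name and the 32/60 values are just placeholders): it calls replace_more with a small limit, sleeps and retries when a request fails, and stops once replace_more reports no remaining MoreComments stubs.

import time
from prawcore.exceptions import RequestException, ServerError

def expand_all_comments(submission, batch_size=32, wait_seconds=60):
    # replace_more returns the MoreComments instances it did NOT replace,
    # so keep calling it in small batches until that list comes back empty
    while True:
        try:
            remaining = submission.comments.replace_more(limit=batch_size)
        except (RequestException, ServerError):
            # Back off when Reddit times out or returns a server error, then retry
            time.sleep(wait_seconds)
            continue
        if not remaining:
            break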
I was eventually able to pull in 190k+ comments by using a recursive function instead of the replace_more method, which got around the timeout issue. Maybe this will help:
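Roughly, the idea looks like this (a simplified sketch rather than the exact script; the function name is just illustrative). Instead of replace_more, it walks submission.comments itself and, whenever it hits a MoreComments stub, calls the stub's .comments() method to fetch the hidden children and recurses into the result:

import praw

def collect_top_level(comment_forest, out):
    # Walk the top-level comment forest, fetching "load more comments"
    # stubs manually instead of relying on replace_more
    for item in comment_forest:
        if isinstance(item, praw.models.MoreComments):
            # .comments() fetches the children behind the stub; that batch can
            # itself end in another MoreComments stub, so recurse into it
            collect_top_level(item.comments(), out)
        elif item.is_root:
            out.append(item)

top_level_comments = []
collect_top_level(submission.comments, top_level_comments)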