Scraping timestamp data on comments from Reddit page with BeautifulSoup doesn't return anything


I am trying to scrape the timestamps of all comments in a Reddit thread for research purposes. As of right now, the post has about 700 comments in total. I thought the best way to scrape this data without using existing apps or extensions would be with BeautifulSoup. Below is the beginning of the Python program I'm writing for this.

from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.reddit.com/r/NewTubers/comments/1bfhcwz/feedback_friday_post_your_videos_here_if_you_want/"
r = requests.get(url)
#print(r.status_code) returned 200

soup = BeautifulSoup(r.content, 'html.parser') #lxml didn't work either
#print(soup.title) returned the correct title of the HTML page
timestamps = soup.find_all('a', class_='_3yx4Dn0W3Yunucf5sVJeFU')

# newline="" stops the csv module from writing blank rows on Windows
with open("scraped_timestamps.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["TIMESTAMPS"])
    for timestamp in timestamps:
        writer.writerow([timestamp.text])

When I inspect the timestamps on comments in the browser, they all share the same class, "_3yx4Dn0W3Yunucf5sVJeFU", on an "a" tag. So, targeting this exact class, I tried to write out the timestamp text. This didn't return anything, so I'm quite confused. I have tried using lxml as the parser, but it didn't work either.
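One way to narrow this down is to check whether the class appears at all in the HTML that the request received, as opposed to the markup the browser's inspector shows after scripts have run. The following is a minimal sketch using only the standard library; `ClassCounter` and `count_anchor_class` are hypothetical helper names, not part of the script above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <a> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_anchor_class(html, class_name):
    """Return how many <a> tags in the raw HTML carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count
```

Calling `count_anchor_class(r.text, "_3yx4Dn0W3Yunucf5sVJeFU")` on the response from the script above would show whether the served HTML contains the class at all. A result of 0 would suggest the served markup differs from what the inspector shows, rather than a problem with the parser.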

Also, Reddit shows you the exact time down to the second when you hover over the timestamp text. I'd like to possibly scrape that data as well in the future but for now I'm simply trying to scrape the generic timestamp text on comments.


There is 1 solution below

KNutellaZ On

OP here. I've reached an offline solution for now: the script reads a copy of the thread's JSON saved locally as Reddit.json. It became very clear how to get the timestamps I wanted once I installed JSONvue in my browser and saw the structure of the JSON data. Excuse any inefficiencies, as the timestamp data was quite nested.

It's also clear that retrieving the JSON by appending .json to the end of the URL did not give me every top-level comment in the post; the rest is probably loaded dynamically by JavaScript, as someone pointed out. I have experimented with requesting the HTML as well as the JSON directly from the URL, but the request fails about half of the time and returns a data structure that is incompatible with my current code. In either case, the data still seems to be only a fraction of what exists. I will keep exploring how to get the data for all comments.
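To make the intermittent failures easier to spot, the response can be checked for the expected shape before any parsing happens. The sketch below is only an illustration under a few assumptions: it uses the standard library's `urllib.request` instead of `requests` to stay dependency-free, the helper names and the User-Agent string are made up, and sending a descriptive User-Agent is a guess at what might reduce the failures, not a confirmed fix.

```python
import json
import urllib.request


def thread_json_url(thread_url):
    """Return the .json variant of a Reddit thread URL (query string dropped)."""
    base = thread_url.split("?", 1)[0].rstrip("/")
    return base + ".json"


def looks_like_thread(payload):
    """True if payload has the [post listing, comment listing] shape."""
    return (
        isinstance(payload, list)
        and len(payload) == 2
        and all(
            isinstance(part, dict) and part.get("kind") == "Listing"
            for part in payload
        )
    )


def fetch_thread(thread_url, user_agent="timestamp-research-script/0.1"):
    """Download the thread JSON and fail loudly if the shape is unexpected."""
    request = urllib.request.Request(
        thread_json_url(thread_url),
        headers={"User-Agent": user_agent},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    if not looks_like_thread(payload):
        raise ValueError("unexpected response shape; not a thread listing")
    return payload
```

With this in place, `fetch_thread(url)` either returns the two-element list that the code below indexes as `html_json[0]` and `html_json[1]`, or raises instead of handing an incompatible structure to the rest of the script.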

Here is the code:

#from bs4 import BeautifulSoup
#import requests
import json
import csv
import datetime

# Reddit.json is a locally saved copy of the thread's .json response
with open('Reddit.json', encoding='utf-8') as data:
    html_json = json.load(data)

parent = html_json[1]['data']['children']
timestamps = []

for i in parent:
    if 'created_utc' in i['data']:
        # created_utc is in epoch seconds; fromtimestamp() converts it to local time
        timestamps.append(datetime.datetime.fromtimestamp(i['data']['created_utc']))
    else:
        break

# newline="" stops the csv module from writing blank rows on Windows
with open("scraped_timestamps.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["TIMESTAMPS"])
    for x in sorted(timestamps):
        writer.writerow([x])

#html_json[0] contains the post itself without the comment section
#html_json[1] contains the entire comment section
#I had to check for the existence of the timestamp data; otherwise the lookup raises
#a KeyError, because html_json[1] also contains a footer entry that has no timestamp
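The loop above only looks at top-level comments and stops at the first entry without a timestamp. As far as I can tell, in Reddit's listing JSON that entry is an object whose kind is "more", a placeholder for comments left out of the response, and each comment's replies sit in a nested listing under its "replies" key. Assuming that layout, a rough sketch of walking the whole tree could look like this; `walk_comments` is a hypothetical helper, and it uses UTC-aware datetimes rather than local time.

```python
import datetime


def walk_comments(children):
    """Collect created_utc from every comment, including nested replies.

    Returns (sorted UTC datetimes, number of comments hidden behind
    "more" placeholders that this response did not include).
    """
    timestamps = []
    unloaded = 0
    stack = list(children)
    while stack:
        item = stack.pop()
        data = item.get("data", {})
        if item.get("kind") == "more":
            # Placeholder entry: count what is missing and move on
            unloaded += data.get("count", 0)
            continue
        if "created_utc" in data:
            timestamps.append(
                datetime.datetime.fromtimestamp(
                    data["created_utc"], tz=datetime.timezone.utc
                )
            )
        replies = data.get("replies")
        # "replies" is an empty string when a comment has no replies
        if isinstance(replies, dict):
            stack.extend(replies["data"]["children"])
    return sorted(timestamps), unloaded
```

It would be called as `stamps, missing = walk_comments(html_json[1]['data']['children'])`. The `missing` count gives a rough idea of how much of the thread the single .json response left out.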