I am trying to scrape the timestamps of all comments in a Reddit thread for research purposes. As of right now, the post has about 700 total comments. I thought the best way to scrape this data without using existing apps or extensions would be BeautifulSoup. Below is the beginning of the Python program I'm writing for this.
from bs4 import BeautifulSoup
import requests
import csv
url = "https://www.reddit.com/r/NewTubers/comments/1bfhcwz/feedback_friday_post_your_videos_here_if_you_want/"
r = requests.get(url)
#print(r.status_code) returned 200
soup = BeautifulSoup(r.content, 'html.parser') #lxml didn't work either
#print(soup.title) returned the correct title of the HTML page
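# newline="" keeps the csv module from writing blank rows on Windows
with open("scraped_timestamps.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["TIMESTAMPS"])
    # find_all is the current spelling of findAll
    timestamps = soup.find_all('a', class_='_3yx4Dn0W3Yunucf5sVJeFU')
    for timestamp in timestamps:
        writer.writerow([timestamp.text])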
When I inspect the timestamps on comments, they all share the same class, "_3yx4Dn0W3Yunucf5sVJeFU", on an "a" tag. So, targeting this exact class, I tried to write out the timestamp text. This didn't return anything, so I'm quite confused. I have tried using lxml as the parser, but it didn't work either.
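One way to narrow this down is to check whether the class appears at all in the HTML that requests actually received, as opposed to the DOM shown in the browser inspector. Below is a minimal sketch using only the standard library; the names `ClassCounter` and `count_class` and the sample HTML strings are made up for illustration and are not part of the program above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or hold several space-separated names
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, tag, class_name):
    """Return how many <tag> elements in html have class_name."""
    parser = ClassCounter(tag, class_name)
    parser.feed(html)
    return parser.count


# Hypothetical sample markup, not real Reddit output
sample = '<a class="_3yx4Dn0W3Yunucf5sVJeFU other">2 hr. ago</a><a class="x">link</a>'
empty = "<div>no comments rendered here</div>"
print(count_class(sample, "a", "_3yx4Dn0W3Yunucf5sVJeFU"))
print(count_class(empty, "a", "_3yx4Dn0W3Yunucf5sVJeFU"))
```

With the real page, `r.text` would be passed in place of the sample string. A count of 0 would suggest the class is not present in the downloaded HTML at all, in which case no selector could match it regardless of the parser.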
Also, Reddit shows you the exact time down to the second when you hover over the timestamp text. I'd like to possibly scrape that data as well in the future but for now I'm simply trying to scrape the generic timestamp text on comments.
OP here. I've reached an offline solution for now. It became very clear how to get the timestamps I wanted once I installed JSONvue on my browser and saw the structure of the JSON data. Excuse any inefficiencies as the timestamp data was quite nested.
It's also clear that retrieving the JSON by putting .json at the end of the URL did not give me the data for every main (top-level) comment in the post, probably because the rest is loaded dynamically by JavaScript, as someone pointed out. I have experimented with requesting the HTML as well as the JSON from the URL itself, but the request fails about half of the time and returns a data structure that is incompatible with my current code. In either case, the data still seems to be only a fraction of what exists. I will keep exploring how to get the data for all comments.
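To illustrate the nesting and the missing comments described above, here is a rough sketch that walks a saved thread JSON recursively. It assumes the usual shape of a Reddit `.json` thread response: a two-element list whose second element is the comment Listing, comments with kind "t1" and a `created_utc` epoch field, and "more" stubs standing in for comments left out of the response. The function names and the sample data are hypothetical, and this is separate from my own script.

```python
from datetime import datetime, timezone


def walk_comments(listing):
    """Yield ("comment", created_utc) or ("more", count) for each node."""
    for child in listing["data"]["children"]:
        kind, data = child["kind"], child["data"]
        if kind == "t1":  # a regular comment
            yield "comment", data["created_utc"]
            replies = data.get("replies")
            # replies is an empty string when there are none, else a nested Listing
            if isinstance(replies, dict):
                yield from walk_comments(replies)
        elif kind == "more":  # stub for comments not included in this response
            yield "more", data.get("count", 0)


def summarize(thread_json):
    """Return (list of UTC datetimes, number of comments hidden behind stubs)."""
    stamps, missing = [], 0
    # thread_json[0] holds the post itself, thread_json[1] the comment tree
    for kind, value in walk_comments(thread_json[1]):
        if kind == "comment":
            stamps.append(datetime.fromtimestamp(value, tz=timezone.utc))
        else:
            missing += value
    return stamps, missing


# Made-up data in the assumed shape: one comment, one reply, one "more" stub
sample = [
    {"kind": "Listing", "data": {"children": []}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"created_utc": 1710500000.0, "replies": {
            "kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"created_utc": 1710500060.0, "replies": ""}},
            ]}}}},
        {"kind": "more", "data": {"count": 5, "children": ["abc", "def"]}},
    ]}},
]

stamps, missing = summarize(sample)
for stamp in stamps:
    print(stamp.isoformat())
print("comments behind 'more' stubs:", missing)
```

Because `created_utc` is an epoch value in seconds, this also gives the exact time that the hover text shows. A nonzero `missing` total would be consistent with the response containing only a fraction of the thread.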
Here is the code: