I want to scrape Twitter articles. Take the following URL as an example: https://twitter.com/UNTechEnvoy/status/1704972265866014829
When this URL is requested, the network traffic shows the API call below, sent with the headers that follow, which fetches the article data.
https://api.twitter.com/graphql/5GOHgZe-8U2j5sVHQzEm9A/TweetResultByRestId?variables=%7B%22tweetId%22%3A%221704972265866014829%22%2C%22withCommunity%22%3Afalse%2C%22includePromotedContent%22%3Afalse%2C%22withVoice%22%3Afalse%7D&features=%7B%22creator_subscriptions_tweet_preview_api_enabled%22%3Atrue%2C%22c9s_tweet_anatomy_moderator_badge_enabled%22%3Atrue%2C%22tweetypie_unmention_optimization_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22responsive_web_twitter_article_tweet_consumption_enabled%22%3Afalse%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22responsive_web_home_pinned_timelines_enabled%22%3Atrue%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Atrue%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Atrue%2C%22longform_notetweets_rich_text_read_enabled%22%3Atrue%2C%22longform_notetweets_inline_media_enabled%22%3Atrue%2C%22responsive_web_graphql_exclude_directive_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Atrue%2C%22responsive_web_media_download_video_enabled%22%3Afalse%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%7D
headers = {
    'authority': 'api.twitter.com',
    'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA',
    'content-type': 'application/json',
    'cookie': 'guest_id_marketing=v1%3A169883004211703651; guest_id_ads=v1%3A169883004211703651; personalization_id="v1_z3S9HEXBgiQBLPn9TMbSLA=="; guest_id=v1%3A169883006823417906; gt=1719644188290040005; guest_id=v1%3A169865337165479828; guest_id_ads=v1%3A169865337165479828; guest_id_marketing=v1%3A169865337165479828; personalization_id="v1_PoXKYFsBsEAzLKCo41vjqw=="',
    'origin': 'https://twitter.com',
    'referer': 'https://twitter.com/',
    'sec-ch-ua': '"Chromium";v="118", "Brave";v="118", "Not=A?Brand";v="99"',
    'sec-ch-ua-platform': '"Windows"',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'x-client-transaction-id': 'H4Tcw9J6LDFN6U2WzYR4exzeOdxZ4+gpEzzZwMqFERoUjGB+92eN6XgJdb9vwzLr9r2s7R+mX1T/a9ExhV4HL7rb/TGGHg',
    'x-guest-token': '1719644188277379350',
    'x-twitter-active-user': 'yes',
    'x-twitter-client-language': 'en-US'
}
Note that the guest token expires every 1-2 hours, so the headers have to be refreshed before the script can keep scraping Twitter articles.
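To make the refresh step concrete, here is a minimal sketch of how a captured request could be reused once a fresh token is available. It assumes the query string keeps the shape shown above (URL-encoded JSON in `variables` and `features`). The helper names `rebuild_url` and `with_guest_token` are hypothetical, and the `features` value in `CAPTURED_URL` is shortened to a single flag for readability. It does not obtain a new guest token; it only shows where one would go.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Shortened copy of the captured call: real 'variables', one real 'features' flag.
CAPTURED_URL = (
    "https://api.twitter.com/graphql/5GOHgZe-8U2j5sVHQzEm9A/TweetResultByRestId"
    "?variables=%7B%22tweetId%22%3A%221704972265866014829%22%2C%22withCommunity%22"
    "%3Afalse%2C%22includePromotedContent%22%3Afalse%2C%22withVoice%22%3Afalse%7D"
    "&features=%7B%22view_counts_everywhere_api_enabled%22%3Atrue%7D"
)


def rebuild_url(captured_url, tweet_id):
    """Return captured_url with the tweetId inside 'variables' replaced."""
    parts = urlsplit(captured_url)
    # parse_qs percent-decodes each value and wraps it in a list.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    variables = json.loads(query["variables"])
    variables["tweetId"] = str(tweet_id)
    query["variables"] = json.dumps(variables, separators=(",", ":"))
    # urlencode percent-encodes the JSON text again.
    return urlunsplit(parts._replace(query=urlencode(query)))


def with_guest_token(headers, guest_token):
    """Return a copy of headers carrying a new x-guest-token value."""
    updated = dict(headers)
    updated["x-guest-token"] = str(guest_token)
    return updated
```

With these, `rebuild_url(CAPTURED_URL, "1704972265866014829")` and `with_guest_token(headers, new_token)` would give the URL and headers to pass to `requests.get`, where `new_token` is whatever fresh value has been captured.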
To that end, I read that the headers of the 'api.twitter..' URL can be retrieved with the scapy library, but I have been unable to get it to work.
I searched the web and tried the partial code below.
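import threading

import requests
from scapy.all import sniff
from scapy.layers.http import HTTPRequest

def process_packet(packet):
    # HTTPRequest only matches plaintext HTTP; traffic on port 443 is
    # TLS-encrypted, so HTTPS requests are not parsed here.
    if HTTPRequest in packet:
        request = packet[HTTPRequest]
        host = request.Host
        path = request.Path
        headers = request.fields
        print(host, path, headers)

def sniff_traffic():
    # The timeout (15 s, an arbitrary choice) lets sniff() return so t.join() does not block forever.
    sniff(filter="tcp and (port 80 or port 443)", prn=process_packet, store=False, timeout=15)

def run(url):
    t = threading.Thread(target=sniff_traffic)
    t.start()
    response = requests.get(url)
    t.join()

run('https://twitter.com/UNTechEnvoy/status/1704972265866014829')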
Can you help me get the headers of an API URL that is called in the network traffic? Please also share any other method that exists, apart from 'mitmproxy'. Thank you all in advance.
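For reference, one alternative I have seen suggested is to drive a real browser and listen for its outgoing requests. The sketch below is untested against Twitter. It assumes Playwright is installed (`pip install playwright`, then `playwright install chromium`) and that the page issues the `TweetResultByRestId` call on load. `is_tweet_api_call` and `capture_api_headers` are hypothetical helper names, and the 5-second wait is an arbitrary choice.

```python
def is_tweet_api_call(url):
    """True for the GraphQL call seen in the network traffic above."""
    return "api.twitter.com/graphql" in url and "TweetResultByRestId" in url


def capture_api_headers(page_url, wait_ms=5000):
    """Open page_url in headless Chromium and return the first matching request."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = {}

    def on_request(request):
        # Skip CORS preflight (OPTIONS) requests and keep only the first match.
        if request.method == "GET" and is_tweet_api_call(request.url) and not captured:
            captured["url"] = request.url
            # Note: request.headers may omit cookie-related headers.
            captured["headers"] = dict(request.headers)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("request", on_request)
        page.goto(page_url)
        page.wait_for_timeout(wait_ms)  # give the page time to issue the API call
        browser.close()
    return captured
```

Calling `capture_api_headers('https://twitter.com/UNTechEnvoy/status/1704972265866014829')` should return a dict with `url` and `headers` keys if the call is observed, and an empty dict otherwise.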