Scraping Data From Website Using Selenium/Requests/Pandas


I am trying to scrape the table on this website (https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html). I tried the requests library, pandas, and Selenium, but to no avail. Does anyone have a workaround? Here are some of the things I have tried so far:

# Attempt 1: requests
import requests
url = "https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html"
requests.get(url)

# Attempt 2: pandas
import pandas as pd
url = "https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html"
df = pd.read_html(url)

# Attempt 3: Selenium
from selenium import webdriver
import pandas as pd

url = 'https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html'
driver = webdriver.Edge()
driver.get(url)
table = driver.find_element("xpath",'//*[@id="productTable1"]')
df = pd.read_html(table.get_attribute('outerHTML'))[0]
driver.quit()

Thanks!!

There are 2 best solutions below

For requests

Adding a 'User-Agent' entry to the request headers works!

import requests
url = "https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html"
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers, timeout=10)
# response.content holds the raw bytes of the page; response.text is the decoded HTML
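
A successful response does not by itself show that the settlement table is in the returned HTML. As a rough check, the sketch below counts `<table>` and `<tr>` tags in a string using only the standard library. The names `TableCounter` and `count_tables` are made up for this illustration and are not part of requests or pandas.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> and <tr> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "table":
            self.tables += 1
        elif tag == "tr":
            self.rows += 1


def count_tables(html):
    """Return (number of <table> tags, number of <tr> tags) found in html."""
    parser = TableCounter()
    parser.feed(html)
    return parser.tables, parser.rows
```

Calling `count_tables(response.text)` on the response above would return `(0, 0)` if the page contains no table markup at all, which would suggest the rows are filled in later by JavaScript.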

For pandas 2.1+

Pass the headers through storage_options:

import pandas as pd
url = "https://www.cmegroup.com/markets/fx/g10/canadian-dollar.settlements.html"
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
tables = pd.read_html(url, storage_options=headers)  # returns a list of DataFrames, one per table found

The actual table data comes from a separate network call, not from the page's HTML. I tried the requests library and pandas with a few more headers added, but I received the blocking error shown below.

{'message': "This IP address is blocked due to suspected web scraping activity associated with it on this CMEgroup.com page. Use of scripts, software, spiders, robots, avatars, agents, tools or other scraping mechanisms is strictly prohibited by CME Group’s website Data Terms of Use. If you are attempting to access data or content from the website via automated means or for commercial purposes, CME has numerous other methods to deliver the content you require. Please contact CME Group's Global Command Center (GCC) at [email protected] and your inquiry will be directed to the appropriate team."}  

So, instead of those libraries, I called curl through the subprocess module.

import subprocess
import json
curl_command = [
    'curl',
    'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/48/FUT?strategy=DEFAULT&tradeDate=02/06/2024&pageSize=500&isProtected&_t=1707294881357',
    '-H', 'authority: www.cmegroup.com',
    '-H', 'accept: application/json, text/plain, */*',
    '-H', 'accept-language: en-GB,en-US;q=0.9,en;q=0.8,ur;q=0.7',
    '-H', 'cache-control: no-cache',
    '-H', 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    '--compressed'
]

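# Run curl and capture its stdout; raises CalledProcessError on a non-zero exit code
curl_output = subprocess.check_output(curl_command)
# The endpoint returns JSON; the table rows are under the 'settlements' key
data = json.loads(curl_output.decode('utf-8'))
table_content = data['settlements']
print(table_content)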
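
The URL in the curl command hard-codes the trade date and the `_t` value. Below is a minimal sketch for building the same URL for other dates. It assumes that `tradeDate` is in MM/DD/YYYY format, that `_t` is a millisecond timestamp used for cache busting, and that `48` in the path identifies the product. All three are guesses from the example URL, not documented behaviour. `build_settlements_url` is a hypothetical helper name.

```python
import time
from datetime import date

# Path taken from the curl command above.
BASE = "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements"


def build_settlements_url(trade_date, product_id=48, page_size=500, timestamp_ms=None):
    """Build the settlements URL for a given trade date (hypothetical helper).

    Assumptions: tradeDate is MM/DD/YYYY, _t is a millisecond timestamp,
    and product_id is the numeric segment before /FUT in the path.
    """
    if timestamp_ms is None:
        # Current time in milliseconds, matching the shape of the example's _t value.
        timestamp_ms = int(time.time() * 1000)
    return (
        f"{BASE}/{product_id}/FUT"
        f"?strategy=DEFAULT&tradeDate={trade_date:%m/%d/%Y}"
        f"&pageSize={page_size}&isProtected&_t={timestamp_ms}"
    )
```

For example, `build_settlements_url(date(2024, 2, 6), timestamp_ms=1707294881357)` reproduces the URL used in the curl command, and the result can replace the hard-coded string in `curl_command`.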