Python requests.get(url) returns empty content in Colab


I'm crawling a website with requests, but although response.status_code returns 200, there's no content in response.text or response.content.

The same code works fine on another site, and it also works in my local Jupyter environment, but for some reason I can't get past the firewall for the URL below in Colab.

Could you give me some advice?

Problem URL: https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1

import requests
from bs4 import BeautifulSoup as bs

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Whale/3.25.232.19 Safari/537.36'}
response = requests.get(url, headers=headers, data={'buscar':100000})
soup = bs(response.content, "html.parser")
soup
<br/>
<br/>
<center>
<h2>
The request / response that are contrary to the Web firewall security policies have been blocked.
</h2>
<table>
<tr>
<td>Detect time</td>
<td>2024-03-12 21:52:05</td>
</tr>
<tr>
<td>Detect client IP</td>
<td>35.236.245.49</td>
</tr>
<tr>
<td>Detect URL</td>
<td>https://gall.dcinside.com/board/view/</td>
</tr>
</table>
</center>
<br/>

I tried changing the user-agent, switching https to http, and following other advice from similar questions, but nothing works.
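For what it's worth, the body isn't actually empty here; it's the firewall's block page, served with status 200. A small helper (a sketch, with the phrase taken from the blocked response shown above) can tell the two apart:

```python
def is_waf_blocked(body: str) -> bool:
    """Heuristic check: the firewall block page quoted above contains this phrase."""
    return 'Web firewall security policies' in body
```

So instead of relying on status_code == 200, you can check is_waf_blocked(response.text) after each request.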

Best answer, by Lakshmanarao Simhadri:

If you're facing issues making HTTP requests with the requests module in Google Colab, there could be a few reasons for this behavior:

1. Firewall or Network Restrictions: Sometimes, network or firewall restrictions might prevent the notebook from accessing external resources. If you are behind a proxy or firewall, you may need to configure the proxy settings in your notebook.

Use the following snippet to set proxy settings in your notebook:

import os

os.environ['HTTP_PROXY'] = 'http://your_proxy_address:your_proxy_port'
os.environ['HTTPS_PROXY'] = 'http://your_proxy_address:your_proxy_port'
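Alternatively, requests accepts a proxies mapping per request, which avoids mutating the environment (the address below is a placeholder, as above):

```python
import requests

# Placeholder proxy address and port; substitute your own.
proxies = {
    'http': 'http://your_proxy_address:your_proxy_port',
    'https': 'http://your_proxy_address:your_proxy_port',
}

# Then pass it explicitly on each call:
# response = requests.get(url, proxies=proxies, timeout=10)
```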

2. Blocked Sites: If the website you are trying to access is blocked in the Colab environment, you won't be able to make requests to it.

Also, add all plausible browser headers to reduce the chance of being blocked. Here is a revised version of the code:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse, parse_qs
import os

# Add your proxy address and port to route requests through that proxy.
# Note: I'm using a ScrapeOps proxy here; you can also get a trial plan and replace api_key with a valid key

api_key = "0565b10e-c1b5-418c-b15d-02d4ebd5d6a2"
proxy_value = f"http://scrapeops:{api_key}@proxy.scrapeops.io:5353"
os.environ['HTTP_PROXY'] = proxy_value
os.environ['HTTPS_PROXY'] = proxy_value

def get_response_by_passing_headers(url):

    # We are parsing query parameters from the URL to pass it to the request
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)
    params = {key: value[0] for key, value in query_params.items()}

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-GB,en;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        }

    # Making the request with all the headers and parameters; verify=False disables
    # TLS certificate verification, which proxies like this typically require
    response = requests.get('https://gall.dcinside.com/board/view/', params=params, headers=headers, verify=False)
    return response

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
response = get_response_by_passing_headers(url)
soup = bs(response.content, "html.parser")
print(soup)
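As a side note, the query-parameter parsing inside get_response_by_passing_headers can be checked in isolation:

```python
from urllib.parse import urlparse, parse_qs

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
parsed_url = urlparse(url)
# parse_qs returns a list of values per key; keep the first value for each
params = {key: value[0] for key, value in parse_qs(parsed_url.query).items()}
print(params)  # {'id': 'piano', 'no': '1', 'exception_mode': 'notice', 'page': '1'}
```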