Colly (GoLang) Web Scraper - 403 Forbidden


I am trying to scrape products from the MediaMarkt site with Colly. Here is my code:

func WebScraper(allowedDomain string, page string, htmlElement string, htmlTag string) {
    /*
        Order in which the Collector's callbacks are executed:
        1. OnRequest  -> Called before a request
        2. OnError    -> Called if an error occurred during the request
        3. OnResponse -> Called after a response is received
        4. OnHTML     -> Called right after OnResponse if the received content is HTML
        5. OnXML      -> Called right after OnHTML if the received content is HTML or XML
        6. OnScraped  -> Called after the OnXML callbacks
    */
    c := colly.NewCollector(
        // MaxDepth is 2, so only the links on the scraped page
        // and links on those pages are visited
        colly.AllowedDomains(allowedDomain),
        colly.MaxDepth(2),
        colly.Async(true),
    )

    // Limit the maximum parallelism to 2.
    // This is necessary when goroutines are created dynamically,
    // to cap the number of simultaneous requests.
    //
    // Parallelism can also be controlled by spawning a fixed
    // number of goroutines.
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    // Step 1: perform some logic before the request is made
    c.OnRequest(func(r *colly.Request) {
        app.InfoLog.Println("Visiting", r.URL.String())
    })

    // Step 2: if an error occurred during the request, handle it
    c.OnError(func(r *colly.Response, err error) {
        app.ErrorLog.Println("Request URL: ", r.Request.URL, " failed with response: ", r, "\nError: ", err)
    })

    // Call the callback on every element matching the htmlElement selector
    c.OnHTML(htmlElement, func(e *colly.HTMLElement) {
        app.InfoLog.Println(e.ChildText(htmlTag))
    })

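    // Visit returns an error if the request could not be started or queued
    if err := c.Visit(page); err != nil {
        app.ErrorLog.Println("Visit failed:", err)
    }
    // Wait until all asynchronous requests have finished
    c.Wait()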
}

I've already tried scraping Wikipedia and some other sites, and it works. But here I am getting a 403 Forbidden error. Here are the response headers:

Permissions-Policy : [accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()]
Expires : [Thu, 01 Jan 1970 00:00:01 GMT]
Set-Cookie : [__cf_bm=eEhiHiAsyTUuG7Ra4_rGhBWBHGxP_FWphwxEIl66hW8-1654161057-0-Aef4Vr6ypA0zr8CVP66c2x9X1s+vUcusYPkMqJR3MhpLt/FxMHi+GXMD0+YEcb2L/cLC6RVhgROG9gOvXVTjQMIYUjwyvfi1/hFvAPthwzC/; path=/; expires=Thu, 02-Jun-22 09:40:57 GMT; domain=.mediamarkt.de; HttpOnly; Secure; SameSite=None]
Vary : [Accept-Encoding]
Date : [Thu, 02 Jun 2022 09:10:57 GMT]
Expect-Ct : [max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"]
Content-Type : [text/html; charset=UTF-8]
Cache-Control : [private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0]
Server : [cloudflare]
Cf-Ray : [714f0f0e3b881c23-SOF]
X-Frame-Options : [SAMEORIGIN]
Strict-Transport-Security : [max-age=15897600]
X-We-Are-Hiring : [We appreciate developers that love to explore what goes on under the hood of software. Apply now at https://careers.mediamarktsaturn.com/MediaMarktSaturn!]

And here is the response body:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>

<title>Please Wait... | Cloudflare</title>
  
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
<style>body{margin:0;padding:0}</style>


<!--[if gte IE 10]><!-->
<script>
  if (!navigator.cookieEnabled) {
    window.addEventListener('DOMContentLoaded', function () {
      var cookieEl = document.getElementById('cookie-alert');
      cookieEl.style.display = 'block';
    })
  }
</script>
<!--<![endif]-->


    <script>
    //<![CDATA[
    (function(){
      window._cf_chl_opt={
        cvId: "2",
        cType: "managed",
        cNounce: "41590",
        cRay: "714f0f0e3b881c23",
        cHash: "7549f8b7d78a2a4",
        cUPMDTk: "\/de\/category\/smartphones-579.html?__cf_chl_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU",
        cFPWv: "g",
        cTTimeMs: "1000",
        cLt: "n",
        cRq: {
          ru: "aHR0cHM6Ly93d3cubWVkaWFtYXJrdC5kZS9kZS9jYXRlZ29yeS9zbWFydHBob25lcy01NzkuaHRtbA==",
          ra: "Y29sbHkgLSBodHRwczovL2dpdGh1Yi5jb20vZ29jb2xseS9jb2xseQ==",
          rm: "R0VU",
          d: "vr3pEux85BB4TszTDjAPScZq2oMqIA1GoFOPEjftlymNdbnhggazvYIWsXBQOTzYsqm6B1QxUgRJqK2CNemXc9VqLj70rk1vMXKFsNRn8eSkCfbX1bVvJbp+S3YSI+zdrPmzOiiq4gO2vWm5pOKlKc+7qmux89XYc1J0YnOprUgYdHNeayUheiiXkRqwPQqW/cY1+5C2IsPzqzcU7M7YCnWjenwMn1pjLFjMclUxEi6s/gu5lLTr8HSnalidGwSVexGj4SBqmKekU99FZqEtE5kJutfFoUEiwuEJmmo7QrYuWrXRfB80Fms3xVWa8J6Ga4M9cnJgv3PP9qRucyj01EtAlfkpx7coaUfTJue65CZcHA4SJcB7WqMHdaUVojdSFsc4UoCYGbnstK2lyuX+v6GAC2GGOtK23s8DcfcB/YJsCChlpkURsIfnGbzmfI5cQf5JqWkhnW6p1UG3oKs7bec/dUNKL+XJjRH0rvyvKFkMX6Ca/0FX00zR0a1WcxnXOhU1iZzQOR2U/ZrXvfE0jeFCRQ+OHvCd0Ncfosas5axWsibMU+MeasO+bYbG8hTjHgvG8+tFc0tYII+nbVWFp44k+mWOBIhKh951P8TAoLl1h4HO9+hxKdpjQGAtjeZJ39oc3daC5julK9RJOng8Hw==",
          t: "MTY1NDE2MTA1Ni45OTkwMDA=",
          m: "cZC1J0+WAKjb0r4I8GxqyYnUTcVqCk2O4D12RYxeP7Q=",
          i1: "90OzQhzN+BROMhNBF2EFBw==",
          i2: "grkPyoRifg7B+X0FEjpHHQ==",
          zh: "q1ZR4e29hYz+cTx2o5UYJG1hFifFh0loDJNTfBOG7gU=",
          uh: "DaHp0r0NTdLobcNE2+1UVaN6g6tbXcsPQKHJoB7xdZI=",
          hh: "+dgxVyY+fQBum8yrY3Q9pqqEvjydD2WPU3jRaUrPF1o=",
        }
      };
    }());
    //]]>
    </script>

<style>
  #cf-wrapper #spinner {width:69px; margin:  auto;}
  #cf-wrapper #cf-please-wait{text-align:center}
  .attribution {margin-top: 32px;}
  .bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
  #cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
  #cf-hcaptcha-container { text-align:center;}
  #cf-hcaptcha-container iframe { display: inline-block;}
  @keyframes fader     { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
  #cf-wrapper #cf-bubbles { width:69px; }
  @-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
  #cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
  #cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
  #cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
</style>
</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
      
        <h1 data-translate="managed_challenge_headline">Please wait...</h1>
        <h2 class="cf-subheadline"><span data-translate="managed_checking_msg">We are checking your browser...</span> www.mediamarkt.de</h2>
      
      </div>
      
      <div class="cf-section cf-highlight cf-captcha-container">
        <div class="cf-wrapper">
          <div class="cf-columns two">
            <div class="cf-column">
            
              <div class="cf-highlight-inverse cf-form-stacked">
                <form class="challenge-form managed-form" id="challenge-form" action="/de/category/smartphones-579.html?__cf_chl_f_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU" method="POST" enctype="application/x-www-form-urlencoded">
    <div id='cf-please-wait'>
      <div id='spinner'>
        <div id="cf-bubbles">
            <div class="bubbles"></div>
            <div class="bubbles"></div>
            <div class="bubbles"></div>
        </div>
      </div>
      <p data-translate="please_wait" id="cf-spinner-please-wait">Please stand by, while we are checking your browser...</p>
      <p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
      </div>
  <input type="hidden" name="md" value="u0AdAefiQaOd5cct_8y26o7DHt3en_YcDPYT5F0ABUY-1654161057-0-ATANjzlyezjgr7F1BeHeI_j_uUY38_a79__nKHeOV0Dk2cJOfgMdCTl3WoYsPTD7L25TEyF0Zu27FsSj21OI2aeiNSKmAbPirtvQwqJkPR_knETzvfp75Sv1rnhXV_52btLnozXuVO3Y_z7ElYk1CZDJDEdTw8Eu-MLyEaxyZGJHxx9Tk58hP1NPpWzN98aAcbhY0L1Au8IvJiH8bVmaRlLhK2KDOcXgM7KFONTOuo5-vGZjUjtE4YbUadBFGqk8jIZTRrIXZmwIZNm7TiPlPBwAz8POM7Rw_uoL7THpV4QUctlXigEqRHrY4g-jLcJEW-uZZm2qVMpzbAFOQjJ6UvkY_RC25ZQ5L0MQr1Nnh32-OQZctZIhj8edoK1TZasOXT6u0bT5lOecpx2j82H8mF59qM_zfUbIs4H6wvEx0prqNpEu-4Z7_x1y_agGnVMtW-2OCpKPjcmn9j1-NZnZYdJbrqTzdn2j6qe-wnn3RuRSna8DnN-W7AQTCS4vn7uYc76FWBFERMIwczuHUk-KrOof_TpwA324htdvh4I7URUK8CxdSCZqdG7UfsKbjgdLStciaw_PGDud2rPsE2hQEClxPXFsbcWju8aM6BDmlxQFJm7KJHZcbJTtA8yPMfgha4EvOTTGrEwaBy16B4U18Tmo9JXUlBUJwzbtBXMxfZ0XVQWu709nvxwpWAMZb8kEPND5aXQi2jEiGZZnM3wx_JlXtxPlBiTsxP5mEJ5pf8a71v1aZAzWUcPAaHtRymR8a92yWS4Z57h4a2HSchUf8LlFiuoogFCLBNEi2IoYTuIWFhww1k1UEhjuUZ2h21G4149DN5k-xfRY53H4EyHRs30oYiABowol3n3te3kZcPwB" />
  <input type="hidden" name="r" value="919uegZGZgLhriycM0_XCKz1utWQOqsLyAsDF2mLcEY-1654161057-0-AXs4pyKaoppndTN8hJ/khCNIxpye10VI2waeNLb4xYndXBU8rLwkuUXxzAWPTOMPsGwR0KAe5aERtjPvehE5pESDCLcHgGq/H6RUBimtjqQMbxRS8fCyoLrV89WrqAv7Okw3Y+i048El6jKYonunXSU7zzKNR/EL8DIe8/qP47CVRqyOxIDJ2pVHq4GwnfXBtiiWpr4z49jikhah7wbqwOALXPYP4WYlFPrk1kZ1+VgBhEf3RtsybLxsR3E8UagLgTf4K+yNUAt+Uzmi+1qvE2oTq8cVRWZ+gBiXsmRKkWnn3hg6qg9h0DPF8X0U+h8ufqBiTIT3/Lb2M8f1/bB1Sjr6ZBo08ZO5lkGvqdx08L6TRwv5MT4yDWrubtXpZL4Dkpw0yuvLJjonxLMdoF7laSt+xW0VP7ZmAPCNBfY89CXhTqnj/78w0GiLvIFjb9kiNk7cnofy1erkGrI2e/rO6HomogGJT0kGb7V5t6HBOU8mW+4JraBqv1rYLpqv7XmPh4cqjr9DJ8iDDGcqxMciL9VWT4g0nTNlipr0JoVv7L1F36+0Yc+5FuIJwvhvIXN64LlK2vyroKNE/wu3r5O9RWVgAToNI2KlZAbJaHFCBBAhDRdDi7EaVZVoNhmA3Ju+YiNXmGJ5L21MWLwX+N9jQP1KRibF3ixAzObVKTlGmAWUQLdfrc98pHn8oDI1cpCWzrhrsdAQImLLMEO49lJQnmvWpF+lP8iULAiJG4pdsZ5dIelChc7f4W51l0bAUvL/2l/lJg7/qLxFd5PqJp8Jo7nzqbgibEvM8/55/A3wtT9WX0kJp2Da8Kez0UzrgKeAb3VdGVrHwr+k1eJ4o3fI/RBesr/aWkbgjk4EM8itKypPg/c1Ejd9h/Kn89EpeJPtgz7t+vxDyH47kzmR0L9+gWOd5UBvVel/KzwxAxpuO0fw/tNYbEO0vJ2A3NWThWuS2g34K60w+y+Tp/TrNw/yrQH6wVUUsYESQCc2ZLkt8aVRPR30GuKuC9Zjaj8C8g3ywF5EDvFPYm9ZSPjayGyW3magUchBTngl9HJTiAADmSJB8sJfFWWNVJzKP8e7QRYdGbZzy+EiKzEUN61jWlCKlhFKFIwZlCZBIQ+TYL4+ukePHWoUgttIef21cFjy/ydCoznkJDPtceQDPNyCJZHBv2ljXGJ/IpPZ3CcLW9mAVOdjorEitBUY5ObbZTnpgFelrEKo9SVuE4tSawF7ba0TBcUR7yQXKcB6xmrsdlpn0Bp2Ki7rm8XnIGcK34U2+SQ2FrVaBEHTWW3vFWcdyfQmPPoD8BQo/to3Vt3Lz3K2RC8Ugh6bDzzD61z+6d1iWJ2qIyostZIvVQoPwNqdhYrWw9eBF4DF4COCxIoA16S9TLaEqSV+5e+fBfoRVw+jmsi0qRWkYbtBI0imU7f99EEIdP4y6sz+3LeHLUufXvHHWZoT2URjpCZSXJfhnYYg77qSZbIDX5z0RcnBpGBjiISfAwpfUpwp1SPe5fqB0rka6hvGektNSI+YgSPsI8mfH4CNh2dnaxN0OJzj64zaEWKJYrG3Jzhmip7RBJ7v7utJqqLQu6EWIfJ2b8vV314ucEgB9ORIjARY0Zb/Lx7/Jzrt4wvlsuEhySPHb7TylWO1Gyra">
  <input type="hidden" name="vc" value="22dd9a5e4ec44559e78aa0e010d110ca">
  <noscript id="cf-captcha-bookmark" class="cf-captcha-info">
  <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
  </noscript>
    <div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
      <p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
    </div>
  <script>
  //<![CDATA[
    var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
      b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};
      b(function(){
        var cookiesEnabled=(navigator.cookieEnabled)? true : false;
        if(!cookiesEnabled){
          var q = document.getElementById('no-cookie-warning');q.style.display = 'block';
        }
      });
  //]]>
  </script>
  <div id="trk_captcha_js" style="background-image:url('/cdn-cgi/images/trace/captcha/nojs/h/transparent.gif?ray=714f0f0e3b881c23')"></div>
</form>
  <script>
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=714f0f0e3b881c23");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=714f0f0e3b881c23";
        
        window._cf_chl_opt.cOgUHash = location.hash === '' && location.href.indexOf('#') !== -1 ? '#' : location.hash;
        window._cf_chl_opt.cOgUQuery = location.search === '' && location.href.slice(0, -window._cf_chl_opt.cOgUHash.length).indexOf('?') !== -1 ? '?' : location.search;
        if (window._cf_chl_opt.cUPMDTk && window.history && window.history.replaceState) {
          var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
          history.replaceState(null, null, "\/de\/category\/smartphones-579.html?__cf_chl_rt_tk=PrWxKIbQcP5Dh7keed1nL5yIqzx2FEiIyMvDz_3jTp0-1654161057-0-gaNycGzNBqU" + window._cf_chl_opt.cOgUHash);
          cpo.onload = function() {
            history.replaceState(null, null, ogU);
          };
        }
        
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
    </script>


              </div>
            </div>

            <div class="cf-column">
              <div class="cf-screenshot-container">
              
                <span class="cf-no-screenshot"></span>
              
              </div>
            </div>
          </div>
        </div>
      </div>

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
            
            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
          </div>

          <div class="cf-column">
            <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
            

            <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>

            <p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>
            
              
            
          </div>
        </div>
      </div>
      

      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
  <p class="text-13">
    <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">714f0f0e3b881c23</strong></span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Your IP</span>: 178.221.155.142</span>
    <span class="cf-footer-separator sm:hidden">&bull;</span>
    <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>
    
  </p>
</div><!-- /.error-footer -->


    </div>
  </div>

  <script>
  window._cf_translation = {};
  
  
</script>


</body>
</html>

It looks like some sort of CAPTCHA or JS issue, but I cannot figure out how to avoid it. Any advice?
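
Some of the `cRq` fields in the challenge page are plain base64. Here is a small sketch that decodes three of them, with the values copied from the body above; the helper name `decodeField` is hypothetical. What the remaining fields contain is not clear from the response alone:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeField decodes one standard-base64 value copied from the cRq object
// in the challenge page. It returns an empty string if decoding fails.
func decodeField(s string) string {
	b, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Field names and values are taken verbatim from window._cf_chl_opt.cRq.
	fields := []struct{ name, value string }{
		{"ru", "aHR0cHM6Ly93d3cubWVkaWFtYXJrdC5kZS9kZS9jYXRlZ29yeS9zbWFydHBob25lcy01NzkuaHRtbA=="},
		{"ra", "Y29sbHkgLSBodHRwczovL2dpdGh1Yi5jb20vZ29jb2xseS9jb2xseQ=="},
		{"rm", "R0VU"},
	}
	for _, f := range fields {
		fmt.Printf("%s = %s\n", f.name, decodeField(f.value))
	}
	// Prints:
	// ru = https://www.mediamarkt.de/de/category/smartphones-579.html
	// ra = colly - https://github.com/gocolly/colly
	// rm = GET
}
```

So the challenge page records the requested URL, the HTTP method, and the User-Agent, and `ra` is Colly's default User-Agent string. The request is therefore openly identifying itself as a scraper. Whether changing the User-Agent alone would get past a managed challenge that expects JavaScript and cookies is uncertain.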
