How to get Apify to send every request through a different proxy, round-robin style?


I'm using Apify to crawl a job board with multiple concurrent requests. I have an array of proxies, but my queued URLs aren't using them in a round-robin fashion even with the setup below. How can I set things up so that every newly requested URL uses a different proxy, round-robin style? In other words, if I have 10 proxy URLs and a maximum concurrency of 5, how do I set up my crawler so that only half of my proxies are used in any batch of requests while the others finish their requests and take a break? I'm running into 429 errors because I'm using the same proxy to request new pages too quickly.

Right now proxy 1 makes requests 1, 2, 3 (then gets blocked), then proxy 2 makes requests 4, 5, 6 (then gets blocked).

How do I make it so that proxy 1 makes request 1, proxy 2 makes request 2, proxy 3 makes request 3, and so on?

const collector = new Apify.PlaywrightCrawler({
    requestQueue,
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    maxConcurrency: 4,
    handlePageFunction: handleFunctionCollection,

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    handleFailedRequestFunction: async ({ request }) => {
        console.log(`Request ${request.url} failed too many times.`);
    },
});

There is 1 solution below


A few things:

  1. You are still using version 2 or lower of the SDK. Version 3, built on Crawlee, has been out for more than a year, so I recommend upgrading. What I say next mostly works the same with the older versions, though.
  2. Browser-based crawlers rotate proxies together with the browser instance, and by default that happens every 100 requests. To rotate more often, you need to set browserPoolOptions and https://crawlee.dev/api/browser-pool/interface/BrowserPoolOptions#retireBrowserAfterPageCount (or its older variant).
  3. We can't see what is in your proxy configuration. If you use Apify Proxy, the IPs are chosen randomly. If you provide a list of proxy URLs, it should rotate through them round-robin.
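To make points 2 and 3 concrete, here is a minimal sketch, not a tested setup. The proxy URLs and the helper name createRoundRobin are made up for illustration, and the Crawlee wiring in the trailing comment is an assumption about how you would plug it in on version 3. The runnable part only shows the rotation logic and the browser pool option. Retiring the browser after every page means one browser launch per request, which is slow, so a slightly higher count may be a better trade-off.

```javascript
// Hypothetical proxy list; replace with your own URLs.
const proxyUrls = [
    'http://user:pass@proxy-1.example.com:8000',
    'http://user:pass@proxy-2.example.com:8000',
    'http://user:pass@proxy-3.example.com:8000',
];

// Returns a function that hands out the URLs one after another,
// wrapping back to the first one after the last.
function createRoundRobin(urls) {
    if (urls.length === 0) {
        throw new Error('At least one proxy URL is required.');
    }
    let next = 0;
    return () => {
        const url = urls[next];
        next = (next + 1) % urls.length;
        return url;
    };
}

const nextProxyUrl = createRoundRobin(proxyUrls);

// Crawler options that retire each browser after a single page, so a new
// browser (and therefore a new proxy) is used for every request.
const crawlerOptions = {
    maxConcurrency: 4,
    useSessionPool: true,
    persistCookiesPerSession: true,
    browserPoolOptions: {
        retireBrowserAfterPageCount: 1,
    },
};

// Assumed wiring with Crawlee v3 (not executed here):
//
//   const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');
//   const proxyConfiguration = new ProxyConfiguration({
//       newUrlFunction: () => nextProxyUrl(),
//   });
//   const crawler = new PlaywrightCrawler({
//       ...crawlerOptions,
//       proxyConfiguration,
//       requestHandler: handleFunctionCollection,
//   });
```

If the plain proxy URL list already rotates the way you want once browsers are retired more often, you can skip the custom function and pass the list directly to the proxy configuration.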