How to crawl 5000 different URLs to find certain links


I have a list of about 5,000 URLs. I want to crawl each of these websites and find a link to a certain other page, which most of these websites have. To do this I wrote the Python script below, which works, but it's too slow. I'm hoping to find a better solution than mine.

The code below takes about 2 minutes to complete for just 10 links. Is there any way to make it faster, or is there some other method altogether?

import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re

# placeholders for values that are defined earlier in my real script
headers = {"User-Agent": "Mozilla/5.0"}   # request headers
pattern = re.compile(r"fee")              # example of the regex I match links against
links = []                                # my full list of ~5000 start URLs

timeout = 5
async def fetch_html(session, url):
    async with session.get(url, headers=headers, timeout=timeout) as response:
        return await response.text()

async def find_info_page(session, start_url, max_depth=2):
    visited_urls = set()
    queue = [(start_url, 0)]
    ans_list = []
    while queue:
        current_url, depth = queue.pop(0)
        if current_url in visited_urls or depth > max_depth:
            continue

        visited_urls.add(current_url)

        try:
            html = await fetch_html(session, current_url)
            soup = BeautifulSoup(html, 'html.parser')
            for link in soup.find_all('a', href=True):
                absolute_link = urljoin(current_url, link['href'])
                if pattern.search(absolute_link.lower()):
                    ans_list.append(absolute_link)
                    return absolute_link
                elif absolute_link != current_url:
                    queue.append((absolute_link, depth + 1))
        except Exception as e:
            print(f"Error processing {current_url}: {e}")
            
    return None

async def main():
    start_urls = links[:10]

    async with aiohttp.ClientSession() as session:
        tasks = [find_info_page(session, url) for url in start_urls]
        results = await asyncio.gather(*tasks)


    for url, result in zip(start_urls, results):
        if result:
            print(f"Fee page found for {url}: {result}")
        else:
            print(f"No fee page found for {url}")

asyncio.run(main())

Short explanation of the code:

fetch_html(): this function is an async coroutine; all it does is make a request to the given URL asynchronously and return the response text.
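For context, here is a minimal, self-contained version of that fetch step on its own (the URL, headers, and timeout values here are just examples, not the real ones from my script):

import asyncio
import aiohttp

async def fetch_html(session, url):
    # one GET request; returns the decoded response body
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
        response.raise_for_status()   # surface HTTP errors as exceptions
        return await response.text()

async def demo():
    async with aiohttp.ClientSession(headers={"User-Agent": "Mozilla/5.0"}) as session:
        html = await fetch_html(session, "https://example.com")
        print(len(html))

asyncio.run(demo())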

find_info_page(): this function is also an async coroutine:

  • it crawls the given URL's page; for that it calls fetch_html() to get the HTML source of the given URL.

  • then it gets all the links (<a> tags) on the page and checks each one against a regex pattern; if no link matches, it crawls the newly found links (up to max_depth) until it finds the page it's looking for (see the sketch after this list).
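To make that matching step concrete, here is the link check pulled out into a small standalone function (the "fee" regex is only an example of the kind of pattern I use):

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

pattern = re.compile(r"fee")  # example pattern; mine looks similar

def extract_links(html, base_url):
    """Return (first matching link or None, other absolute links to crawl next)."""
    soup = BeautifulSoup(html, "html.parser")
    to_visit = []
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])
        if pattern.search(absolute.lower()):
            return absolute, to_visit      # found the page I'm after
        if absolute != base_url:
            to_visit.append(absolute)      # candidate for the next BFS level
    return None, to_visit

html = '<a href="/fees/tuition">Fees</a><a href="/about">About</a>'
print(extract_links(html, "https://example.com"))

Running this prints ('https://example.com/fees/tuition', []), because the function returns as soon as it finds a matching link.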

main(): it's just a wrapper function that creates a shared session object and uses asyncio.gather(), which runs the crawls concurrently.
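For the full list, this is roughly how I picture wiring main() up, with a semaphore to cap how many requests run at once (the limit of 50 is a guess I haven't benchmarked; find_info_page and links are the ones defined above):

import asyncio
import aiohttp

async def run_all(links, limit=50):
    # cap simultaneous crawls so 5000 tasks don't all hit the network at once
    semaphore = asyncio.Semaphore(limit)

    async def bounded(session, url):
        async with semaphore:
            return await find_info_page(session, url)   # coroutine defined above

    async with aiohttp.ClientSession() as session:
        tasks = [bounded(session, url) for url in links]
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(run_all(links))  # one result (or exception) per start URL, in order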
