Is there a function within the beautifulsoup package that allows users to set crawling depth within a site? I am relatively new to Python, but I have used Rcrawler in R before, and Rcrawler provides 'MaxDepth', so the crawler only follows links up to a certain number of hops from the homepage within that domain.
Rcrawler(Website = "https://stackoverflow.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c("div"), MaxDepth = 5)
My current Python script parses all visible text on a page, but I would like to set a crawling depth.
import urllib.request

import bs4 as bs
from bs4 import BeautifulSoup

def tag_visible(element):
    # Ignore text that lives inside tags that are never rendered.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Ignore HTML comments.
    if isinstance(element, bs.element.Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')  # parse the passed-in page, not a global
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com/').read()
print(text_from_html(html))
Any insight or direction is appreciated.
There is no such function in BeautifulSoup, because BeautifulSoup is not a crawler. It only parses a string of HTML so that you can search within that HTML.

There is no such function in requests either, because requests is not a crawler. It only reads data from a server, so you could use it together with BeautifulSoup or a similar parser.

If you use BeautifulSoup and requests, then you have to do everything on your own - you have to build the crawling system from scratch, as in the sketch below.
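For illustration, a minimal sketch of such a depth-limited crawler built on requests and BeautifulSoup might look like this (the crawl function, its max_depth parameter, and the breadth-first queue are my own assumptions, not part of either library):

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2):
    # Breadth-first crawl; pages more than max_depth links from
    # start_url are never fetched.
    domain = urllib.parse.urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(response.text, 'lxml')
        yield url, soup  # hand the parsed page back to the caller
        if depth >= max_depth:
            continue  # at the depth limit: do not queue this page's links
        for link in soup.find_all('a', href=True):
            next_url = urllib.parse.urljoin(url, link['href'])
            # Stay inside the starting domain, like Rcrawler does.
            if urllib.parse.urlparse(next_url).netloc == domain and next_url not in seen:
                seen.add(next_url)
                queue.append((next_url, depth + 1))

Each parsed page could then be fed to a text extractor like the text_from_html function in the question:

for url, soup in crawl('https://stackoverflow.com/', max_depth=2):
    print(url)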
Scrapy is a real crawler (or rather a framework for building spiders and crawling the web), and it has the setting DEPTH_LIMIT.
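A minimal sketch of a spider using that setting might look like the following (DEPTH_LIMIT is a real Scrapy setting; the spider name and the CSS selectors are illustrative assumptions):

import scrapy

class TextSpider(scrapy.Spider):
    name = 'text_spider'  # illustrative name
    start_urls = ['https://stackoverflow.com/']
    custom_settings = {
        'DEPTH_LIMIT': 5,  # ignore requests more than 5 links from start_urls
    }

    def parse(self, response):
        # Collect the page's text nodes, roughly like the question's script.
        yield {'url': response.url,
               'text': ' '.join(response.css('body ::text').getall())}
        # Follow links; Scrapy enforces DEPTH_LIMIT on its own.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

You could run it with scrapy runspider myspider.py -o output.json; Scrapy drops any request whose depth exceeds DEPTH_LIMIT, so you never have to track depth yourself.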