Is there a function within the beautifulsoup package that allows users to set crawling depth within a site? I am relatively new to Python, but I have used Rcrawler in R before, and Rcrawler provides 'MaxDepth', so the crawler only follows links up to a certain number of hops from the homepage within that domain.
Rcrawler(Website = "https://stackoverflow.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c("div"), MaxDepth = 5)
My current Python script parses all visible text on a page, but I would like to set a crawling depth.
import urllib.request

import bs4 as bs
from bs4 import BeautifulSoup

def tag_visible(element):
    # Ignore text that lives inside tags that are never rendered.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Ignore HTML comments.
    if isinstance(element, bs.element.Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')  # parse the passed-in page, not a global
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com/').read()
print(text_from_html(html))
Any insight or direction is appreciated.
There is no such function in BeautifulSoup, because BeautifulSoup is not a crawler. It only parses a string of HTML so that you can search within that HTML.

There is no such function in requests either, because requests is not a crawler. It only reads data from a server, so you could use it together with BeautifulSoup or a similar parser.

If you use BeautifulSoup and requests, then you have to do everything on your own - you have to build the crawling system from scratch, as in the sketch below.
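For illustration, a minimal sketch of such a depth-limited crawler built on requests and BeautifulSoup might look like this (the crawl function, its max_depth parameter, and the breadth-first queue are my own assumptions, not part of either library):

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2):
    # Breadth-first crawl; pages more than max_depth links from
    # start_url are never fetched.
    domain = urllib.parse.urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(response.text, 'lxml')
        yield url, soup  # hand the parsed page back to the caller
        if depth >= max_depth:
            continue  # at the depth limit: do not queue this page's links
        for link in soup.find_all('a', href=True):
            next_url = urllib.parse.urljoin(url, link['href'])
            # Stay inside the starting domain, like Rcrawler does.
            if urllib.parse.urlparse(next_url).netloc == domain and next_url not in seen:
                seen.add(next_url)
                queue.append((next_url, depth + 1))

Each parsed page could then be fed to a text extractor like the text_from_html function in the question:

for url, soup in crawl('https://stackoverflow.com/', max_depth=2):
    print(url)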
Scrapy is a real crawler (or rather a framework for building spiders and crawling the web), and it has the setting DEPTH_LIMIT.
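A minimal sketch of a spider using that setting might look like the following (DEPTH_LIMIT is a real Scrapy setting; the spider name and the CSS selectors are illustrative assumptions):

import scrapy

class TextSpider(scrapy.Spider):
    name = 'text_spider'  # illustrative name
    start_urls = ['https://stackoverflow.com/']
    custom_settings = {
        'DEPTH_LIMIT': 5,  # ignore requests more than 5 links from start_urls
    }

    def parse(self, response):
        # Collect the page's text nodes, roughly like the question's script.
        yield {'url': response.url,
               'text': ' '.join(response.css('body ::text').getall())}
        # Follow links; Scrapy enforces DEPTH_LIMIT on its own.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

You could run it with scrapy runspider myspider.py -o output.json; Scrapy drops any request whose depth exceeds DEPTH_LIMIT, so you never have to track depth yourself.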