How can we extract information from subdomain using Rcrawler in R?

842 Views Asked by Premal At 22 December 2017 at 06:20

I want to extract the content of a web page on a subdomain, starting from the main URL.

I tried using Rcrawler:

library(Rcrawler)

Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

After running this code I got the default INDEX variable, which lists all the URLs of the website. One of them is "http://xbyte-technolabs.com/contact_us.php", and I want to extract the contact details from it.

Can someone please guide me on how to get to this particular URL from the main URL "http://xbyte-technolabs.com/" using Rcrawler in R?


There are 2 best solutions below

Otto Kässi On 22 December 2017 at 07:40 BEST ANSWER

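library(Rcrawler)

# Crawl the site and extract the text of elements matching the ".address" CSS selector
Rcrawler(Website = "http://www.xbyte-technolabs.com/", no_cores = 1, no_conn = 1, ExtractCSSPat = c(".address"))

# Look up the Id of the contact page in INDEX, then use it to select the matching element of DATA
pageid <- as.numeric(INDEX$Id[INDEX$Url == "http://xbyte-technolabs.com/contact_us.php"])
DATA[pageid]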

According to ?Rcrawler, Rcrawler creates two global variables:

INDEX: A data frame in the global environment representing the generic URL index, including the list of fetched URLs and page details (content type, HTTP state, number of out-links and in-links, encoding type, and level), and

DATA: A list of lists in the global environment holding the scraped contents.

The Id variable in INDEX corresponds to the list element in DATA. The code snippet above looks up the Id corresponding to the URL you are interested in.

Sidenote: since you already know the URL you are looking for, crawling the whole website seems like overkill.
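The Id-to-DATA lookup can be tried without crawling anything. The sketch below is only an illustration: `index_mock` and `data_mock` are made-up stand-ins for the INDEX and DATA objects, the example.com URLs and the address text are invented, and it assumes Id is stored as character (which is why as.numeric() is applied).

```r
# Mock stand-ins for the INDEX and DATA objects that Rcrawler creates (all values invented)
index_mock <- data.frame(
  Id  = c("1", "2", "3"),
  Url = c("http://example.com/",
          "http://example.com/about.php",
          "http://example.com/contact_us.php"),
  stringsAsFactors = FALSE
)
data_mock <- list(
  list("Home page text"),
  list("About page text"),
  list("12 Example Street, Sample City")
)

# Find the Id whose Url matches the page of interest
pageid <- as.numeric(index_mock$Id[index_mock$Url == "http://example.com/contact_us.php"])

# Use that Id as a position in the list; [[ ]] returns the element itself,
# whereas [ ] would return a one-element list wrapping it
contact <- data_mock[[pageid]]
print(contact)
```

If the URL is not present in the index, the comparison returns no match and `pageid` has length zero, so it may be worth checking `length(pageid) == 1` before indexing.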
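As a rough sketch of what scraping a single known page could look like, the example below uses the rvest package instead of Rcrawler (a swapped-in technique, not what the answer above uses). The inline HTML and the address text are invented so that the snippet runs offline; whether the real contact page uses an `.address` class is taken from the question, not verified.

```r
library(rvest)  # swapped in for Rcrawler in this sketch

# Inline HTML standing in for the contact page; the markup is invented for illustration
html <- '<html><body>
  <div class="address">12 Example Street, Sample City</div>
  <div class="footer">Other content</div>
</body></html>'

# Parse the document, select nodes by CSS selector, and pull out their text
page <- read_html(html)
address <- html_text(html_nodes(page, ".address"), trim = TRUE)
print(address)
```

For the live page, the same calls should work with the URL passed to read_html() in place of the inline string, for example read_html("http://xbyte-technolabs.com/contact_us.php").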

Premal On 22 December 2017 at 07:35

library(Rcrawler)

Rcrawler("http://www.xbyte-technolabs.com/",no_cores = 4,no_conn = 4)

for (i in length(INDEX)) {
  for (j in nrow(INDEX)) {

    Rcrawler(Website = INDEX[[i]][j], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))

  }

}
#Rcrawler(Website = INDEX[[i]][23], no_cores = 4, no_conn = 4, ExtractCSSPat = c(".address"))
class(DATA)
head(DATA)

ad <- DATA[[1]]
ad <- as.character(ad)
cat(ad)

Sorry, I think something is wrong with this code. It produces the following error:

Error in strsplit(gsub("http://|https://|www\.", "", Website), "/")[[c(1, : subscript out of bounds
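One likely contributor to this error is the loop header. In R, `for (i in length(INDEX))` runs exactly once, with `i` equal to the number of columns, and `for (j in nrow(INDEX))` runs once with `j` equal to the number of rows. `INDEX[[i]][j]` then picks the last row of the last column, which is probably not a URL. The base R sketch below shows the difference on a made-up vector of URLs; it does not call Rcrawler.

```r
# Hypothetical URLs, used only to demonstrate loop behaviour
urls <- c("http://example.com/a", "http://example.com/b", "http://example.com/c")

# for (i in length(x)) loops over a single value: the length itself
visited_once <- c()
for (i in length(urls)) {
  visited_once <- c(visited_once, i)
}

# for (i in seq_along(x)) loops over every position in x
visited_all <- c()
for (i in seq_along(urls)) {
  visited_all <- c(visited_all, i)
}

print(visited_once)
print(visited_all)
```

Looping over `seq_along(INDEX$Url)` and passing `INDEX$Url[i]` would visit each URL, although re-crawling every page that way repeats work the first Rcrawler call has already done.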
