How can I extract multiple items from 1 html using RCrawler's ExtractXpathPat?

Question

Asked by Friso on 02 March 2020 at 21:13

I'm trying to get both the label and the data of each item in a museum collection using Rcrawler. I think I made a mistake with the ExtractXpathPat argument, but I can't figure out how to fix it.

I expect an output like this:

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"Schilderij"
1;"Objectnummer";"SK-A-2931"

However, the output file repeats the title in the third position:

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objectnummer";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

The HTML looks like this:

<div class="item">
      <h3 class="item-label h4-like">Objectnummer</h3>
      <p class="item-data">SK-A-2931</p>
</div>

My method looks like this:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/", 
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c('//*[@class="item-label h4-like"]', '//*[@class="item-data"]'), 
         PatternsNames = c('label','data'),
         ManyPerPattern = TRUE)

Clarification of goal

The HTML page doesn't always have the same labels, and sometimes it has labels without the corresponding data. Sometimes the data is in a paragraph and sometimes in an unordered list.

My end goal is to create a CSV that has all the labels of the site with the corresponding data in each row.

This question is about the first step, collecting the labels and data, which I will then use to create the above-mentioned CSV.

**E.Wiest** · Accepted Answer · 2020-03-03T02:10:41.960000

I don't use RCrawler to scrape, but I think your XPaths need to be fixed. Here is a corrected version:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/", 
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c("//h3[@class='item-label h4-like'][.='Titel(s)']/following-sibling::p/text()",
                             "//h3[@class='item-label h4-like'][.='Objecttype']/following::a[1]/text()",
                             "//h3[@class='item-label h4-like'][.='Objectnummer']/following-sibling::p/text()"),
         PatternsNames = c("Titel(s)", "Objecttype","Objectnummer"),
         ManyPerPattern = TRUE)

I ran it for a few minutes and it seems to work:

DATA[[1]]
$`PageID`
[1] 1

$`Titel(s)`
[1] "De Staalmeesters"                                                                   
[2] "De waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

$Objecttype
[1] "schilderij"

$Objectnummer
[1] "SK-C-6"
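To get from this list to the semicolon-separated rows the question asks for, something like the following might work. It is a minimal sketch in base R, assuming `DATA` is a list with one element per page shaped like the output above. The helper `to_long`, the output file name, and the `" | "` separator for multi-valued fields are illustrative choices, not part of Rcrawler.

```r
# Stand-in for the DATA list produced by Rcrawler, mirroring the output above.
DATA <- list(list(
  PageID = 1,
  `Titel(s)` = c("De Staalmeesters",
                 "De waardijns van het Amsterdamse lakenbereidersgilde"),
  Objecttype = "schilderij",
  Objectnummer = "SK-C-6"
))

# Hypothetical helper: turn one page into one row per label.
to_long <- function(page) {
  fields <- page[setdiff(names(page), "PageID")]
  data.frame(
    PageID = page$PageID,
    label = names(fields),
    # Collapse multi-valued fields into one string (separator is arbitrary).
    data = vapply(fields, paste, character(1), collapse = " | "),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

long <- do.call(rbind, lapply(DATA, to_long))

# Write semicolon-separated rows without a header, as in the expected output.
out_file <- file.path(tempdir(), "items.csv")
write.table(long, out_file, sep = ";", row.names = FALSE, col.names = FALSE)
```

Each page contributes one row per pattern name, so pages with different labels can be stacked in the same file.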

More options:

Brute force. Since you don't yet know all the label names, and if you don't want to write specific XPaths, you can try something like this in RCrawler's ExtractXpathPat:

c("string((//h3[@class='item-label h4-like'])[1]/parent::*)","string((//h3[@class='item-label h4-like'])[2]/parent::*)",...,"string((//h3[@class='item-label h4-like'])[30]/parent::*)")

Here, we just increment from position 1 to position 30. You could try 40 or 50; it's up to you.

PatternsNames = c("Item1", "Item2",...,"Item30")
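Rather than typing the 30 expressions by hand, both vectors can be generated. This is a small sketch in base R; the variable names `n_items`, `xpaths` and `pattern_names` are made up for illustration.

```r
# Number of positions to probe; 30 follows the example above and is arbitrary.
n_items <- 30

# One XPath per position: the string value of the parent of the n-th label.
xpaths <- sprintf(
  "string((//h3[@class='item-label h4-like'])[%d]/parent::*)",
  seq_len(n_items)
)

# Matching pattern names: "Item1", "Item2", ..., "Item30".
pattern_names <- paste0("Item", seq_len(n_items))
```

The two vectors can then be passed as `ExtractXpathPat = xpaths` and `PatternsNames = pattern_names`.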

Example of the result:

Item1:Title(s) The Seven Works of MercyPolyptych with the Seven Works of Charity 
Item2:Object type painting 
Item3:Object number SK-A-2815
...
Item17:Parts The Seven Works of Mercy (SK-A-2815-1) The Seven Works of Mercy (SK-A-2815-2) The Seven Works of Mercy (SK-A-2815-3) The Seven Works of Mercy (SK-A-2815-4) The Seven Works of Mercy (SK-A-2815-5) The Seven Works of Mercy (SK-A-2815-6) The Seven Works of Mercy (SK-A-2815-7)
...
Item29:
Item30:

You then need to tidy the data (split, trim, reorganize...) with appropriate tools (dplyr, stringr) to generate a proper CSV.
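As a rough illustration of that tidying step, here is a sketch in base R (used instead of dplyr/stringr only to keep it dependency-free). It assumes the raw strings still contain the line break that separates the label from the data in the HTML; if the whitespace has already been normalized, the split rule would need to change. The `raw` values and the helper `split_item` are hypothetical.

```r
# Hypothetical raw values as they might come back from the string() XPaths,
# assuming the line break between <h3> and <p> is preserved.
raw <- c(
  Item1 = "\n      Objectnummer\n      SK-A-2931\n",
  Item2 = "\n      Objecttype\n      schilderij\n",
  Item3 = ""  # position with no matching item on this page
)

# Hypothetical helper: first non-empty line is the label, the rest is the data.
split_item <- function(x) {
  parts <- trimws(strsplit(trimws(x), "\n", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0) return(NULL)  # skip empty positions
  data.frame(
    label = parts[1],
    data = paste(parts[-1], collapse = " "),  # "" when a label has no data
    stringsAsFactors = FALSE
  )
}

rows <- Filter(Negate(is.null), lapply(raw, split_item))
tidy <- do.call(rbind, unname(rows))
```

Empty positions are dropped, and a label without data yields an empty `data` field rather than an error.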

If this option doesn't work, you could determine all the label names you could possibly have (get all the //h3[@class='item-label h4-like']/text() values of the webpages and remove duplicates to keep unique values only). Then write the XPaths accordingly. This way the .csv would be easier to generate.
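Once the label texts have been collected, the deduplication and the XPath construction could look like this. It is a sketch in base R; `labels_seen` is a hypothetical stand-in for whatever the crawl returns. The `following-sibling::p/text()` tail is copied from the XPaths earlier in this answer and would need adjusting for labels whose data sits in a list or a link (as with Objecttype above), or for labels containing a single quote.

```r
# Hypothetical label texts gathered across several pages, with repeats.
labels_seen <- c("Titel(s)", "Objecttype", "Objectnummer",
                 "Objecttype", " Titel(s) ")

# Trim stray whitespace, then keep each label once.
labels <- unique(trimws(labels_seen))

# One label-specific XPath per unique label.
xpaths <- sprintf(
  "//h3[@class='item-label h4-like'][.='%s']/following-sibling::p/text()",
  labels
)
```

`labels` can double as `PatternsNames`, so each CSV column is named after the label it holds.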

You could also work outside RCrawler (with other tools) and write some functions to scrape the data properly (with apply functions or for loops).
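One possible shape for such a function, sketched with the xml2 package (my choice of tool here, any HTML parser would do). The inline HTML is made up: the first item mirrors the snippet in the question, while the list-based item and the label without data are assumptions based on the question's description. The helper `extract_items` is hypothetical.

```r
library(xml2)

# Hypothetical page fragment; only the first item is taken from the question.
html <- '
<div class="item">
  <h3 class="item-label h4-like">Objectnummer</h3>
  <p class="item-data">SK-A-2931</p>
</div>
<div class="item">
  <h3 class="item-label h4-like">Objecttype</h3>
  <ul class="item-data"><li><a href="#">schilderij</a></li></ul>
</div>
<div class="item">
  <h3 class="item-label h4-like">Titel(s)</h3>
</div>'

# Pair each label with whatever follows it inside the same item container.
extract_items <- function(doc) {
  items <- xml_find_all(doc, "//div[h3[contains(@class, 'item-label')]]")
  rows <- lapply(items, function(item) {
    label <- xml_text(xml_find_first(item, "./h3"), trim = TRUE)
    # Everything in the item except the label itself: <p>, <ul>, or nothing.
    data_nodes <- xml_find_all(item, "./*[not(self::h3)]")
    data.frame(
      label = label,
      data = paste(xml_text(data_nodes, trim = TRUE), collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

result <- extract_items(read_html(html))
```

Because the label and its data are read from the same container, they cannot drift apart, and a label without data simply gets an empty string.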
