I'm trying to get both the label and the data for items in a museum collection using Rcrawler. I think I made a mistake with the ExtractXpathPat argument, but I can't figure out how to fix it.
I expect an output like this:
1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"Schilderij"
1;"Objectnummer";"SK-A-2931"
However, the output file repeats the title in the third column:
1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objectnummer";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
The HTML looks like this:
<div class="item">
<h3 class="item-label h4-like">Objectnummer</h3>
<p class="item-data">SK-A-2931</p>
</div>
My method looks like this:
Rcrawler(Website = "https://www.rijksmuseum.nl/nl/",
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c('//*[@class="item-label h4-like"]', '//*[@class="item-data"]'),
         PatternsNames = c('label', 'data'),
         ManyPerPattern = TRUE)
Clarification of goal
The HTML page doesn't always have the same labels, and sometimes it has labels without the corresponding data. Sometimes the data is in a paragraph and sometimes in an unordered list.
My end goal is to create a csv that has all the labels of the site with the corresponding data in each row.
This question is to get to the first step of collecting the labels and data, which I will then use to create the above mentioned csv.
I don't use Rcrawler for scraping, but I think your XPaths need to be fixed. I did it for you:
I ran it for a few minutes and it seems to work:
More options:
Brute force. Since you don't yet know all the label names, and if you don't want to write specific XPaths, you can try something like this in Rcrawler's ExtractXpathPat:
Here, we just increment from position 1 to position 30. You could try 40 or 50; it's up to you.
PatternsNames = c("Item1", "Item2",...,"Item30")
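Rather than typing 30 patterns and names by hand, both vectors can be generated. This is only a sketch: the XPath shape `(//div[@class="item"])[i]` is my assumption, based on the `<div class="item">` wrapper in the question's HTML, and `n_items`, `item_xpaths` and `item_names` are illustrative names.

```r
# Assumption: every label/data pair sits inside <div class="item">,
# as in the HTML snippet from the question.
n_items <- 30

# One positional XPath per item: (//div[@class="item"])[1], [2], ...
item_xpaths <- sprintf('(//div[@class="item"])[%d]', seq_len(n_items))

# Matching pattern names: "Item1", "Item2", ..., "Item30"
item_names <- paste0("Item", seq_len(n_items))
```

The two vectors could then be passed as `ExtractXpathPat = item_xpaths` and `PatternsNames = item_names`. Raising `n_items` to 40 or 50 is a one-line change.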
Example of result:
You then need to tidy the data (split, trim, reorganize...) with appropriate tools (dplyr, stringr) to generate a proper CSV.
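As a rough idea of that last step, here is a base R sketch that trims the values and writes them in the semicolon-separated layout shown in the question. The data frame `items` and its column names are hypothetical; your tidied result will look different depending on how you split the extracted strings.

```r
# Hypothetical tidied result: one row per (page, label, data) triple.
# The stray whitespace mimics what scraped text often looks like.
items <- data.frame(
  page  = c(1L, 1L),
  label = trimws(c(" Objecttype ", "Objectnummer")),
  data  = trimws(c("Schilderij\n", " SK-A-2931")),
  stringsAsFactors = FALSE
)

# quote = TRUE quotes character columns only, so the integer page id
# stays unquoted, as in the expected output from the question.
out_file <- tempfile(fileext = ".csv")
write.table(items, out_file, sep = ";", quote = TRUE,
            row.names = FALSE, col.names = FALSE, fileEncoding = "UTF-8")

csv_lines <- readLines(out_file, encoding = "UTF-8")
```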
If this option doesn't work, you could determine all the label names you could possibly have (get all the //h3[@class='item-label h4-like']/text() of the webpages and remove duplicates to keep unique values only), then write the XPaths accordingly. This way the .csv would be easier to generate.
You could also work outside Rcrawler (with other tools) and write some functions to scrape the data properly (with apply functions or for loops).
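To illustrate the "outside Rcrawler" route, here is a minimal sketch with the xml2 package. It is not what I ran above, just one way to keep each label next to its own data. The first item of the inline HTML is copied from the question; the second and third (list data, missing data) are made up to cover the cases you describe, and I am assuming the `<ul>` carries the same `item-data` class. `extract_items` is a hypothetical helper name.

```r
library(xml2)

# Inline sample: item 1 is from the question, items 2 and 3 are invented.
html <- '
<div class="item">
  <h3 class="item-label h4-like">Objectnummer</h3>
  <p class="item-data">SK-A-2931</p>
</div>
<div class="item">
  <h3 class="item-label h4-like">Materiaal</h3>
  <ul class="item-data"><li>doek</li><li>olieverf</li></ul>
</div>
<div class="item">
  <h3 class="item-label h4-like">Leeg</h3>
</div>'

extract_items <- function(page) {
  items <- xml_find_all(page, '//div[@class="item"]')
  # xml_find_first() returns exactly one result per item (a missing node
  # when nothing matches), so labels and data stay aligned row by row.
  label_nodes <- xml_find_first(items, './h3[contains(@class, "item-label")]')
  data_nodes  <- xml_find_first(items, './*[contains(@class, "item-data")]')
  data.frame(label = trimws(xml_text(label_nodes)),
             data  = trimws(xml_text(data_nodes)),
             stringsAsFactors = FALSE)
}

result <- extract_items(read_html(html))

# Unique label names, e.g. to collect across many pages.
all_labels <- unique(result$label)
```

A label without data ends up with `NA` in the `data` column instead of shifting the rows. Note that `xml_text()` concatenates the `<li>` texts of a list without a separator, so lists may need their own handling.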