How to get the bitstream URL of a PDF from the href link on an HTML page


I am using the rvest R package to scrape a PDF file from this webpage, but the final link (a "bitstream" URL, whatever that is) only appears after I click the link named AC1-96-21-01-2011.pdf. The PDF itself seems to be hidden from direct access, so read_html() never reaches it: the file opens only after following the href of the previous link. This is the node that does not get me to the PDF file:

<a href="/judgments/handle/123456789/701">Arbitration Case - AC</a>

The final file is at this URL, which does not appear in that href attribute: http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf

In summary: how do I use rvest to get the link to the PDF file when it is not in the href attribute shown above?

I tried searching for "bitstream", but the results were about something else.

Best answer, by Allan Cameron

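You're looking at the wrong node, I think. The PDF link sits on the item's own page (handle 123456789/563560), in the table-cell anchors that have target="_blank":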

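library(rvest)

# Read the item page and collect the href of every <a target="_blank"> inside a table cell
pdf_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560" %>%
  read_html() %>%
  html_nodes(xpath = "//td/a[@target='_blank']") %>%
  html_attr("href") %>%
  unique() %>%
  # Keep only the links that point to a PDF
  {grep("[.]pdf", ., value = TRUE)} %>%
  # The hrefs are site-relative, so prepend the host
  paste0("http://judgmenthck.kar.nic.in", .)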

print(pdf_url)
# [1] "http://judgmenthck.kar.nic.in/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf"
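
If you want to try the selector without hitting the site, the same steps can be run against an inline HTML snippet. This is only a sketch: the markup below is a hypothetical, simplified stand-in for the item page (not copied from it), and xml2::url_absolute() is swapped in for paste0() to resolve the relative href against the page URL.

```r
library(rvest)

# Hypothetical, simplified stand-in for the item page's markup (an assumption,
# not the site's real HTML): the PDF is linked twice, plus one unrelated link.
page <- read_html('
  <table>
    <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">AC1-96-21-01-2011.pdf</a></td></tr>
    <tr><td><a target="_blank" href="/judgments/bitstream/123456789/563560/2/AC1-96-21-01-2011.pdf">View/Open</a></td></tr>
    <tr><td><a href="/judgments/handle/123456789/701">Arbitration Case - AC</a></td></tr>
  </table>')

# Same selection as in the answer: anchors with target="_blank" inside a table cell
hrefs <- page %>%
  html_nodes(xpath = "//td/a[@target='_blank']") %>%
  html_attr("href") %>%
  unique()

# Keep only the hrefs that end in .pdf
pdf_paths <- grep("[.]pdf$", hrefs, value = TRUE)

# Resolve the site-relative paths against the URL of the page they came from
base_url <- "http://judgmenthck.kar.nic.in/judgments/handle/123456789/563560"
pdf_urls <- xml2::url_absolute(pdf_paths, base_url)

print(pdf_urls)

# To save the file (needs network access, so left commented out here);
# mode = "wb" keeps the binary PDF intact on Windows:
# download.file(pdf_urls[1], destfile = basename(pdf_urls[1]), mode = "wb")
```

Resolving against the page URL rather than pasting the host by hand also copes with hrefs that are already absolute or relative to the current directory, should the site mix the two.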