Launch web browser and copy page information in R


I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.

I started using

browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID

but now I can't find a way to extract the title and the abstract as plain text. I have to do this for multiple PMIDs, so I need to automate it. It would also be fine to copy everything on the page and then keep only what I need afterwards. Is it possible to do that? Thanks!


2 Answers

BEST ANSWER

I suppose what you're trying to do is scrape PubMed for articles of interest?

Here's one way to do this using the rvest package:

#Required libraries.
library(magrittr)
library(rvest)

#Function.
getpubmed <- function(url){
  
  dat <- rvest::read_html(url)
  
  pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
  ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
  pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
  
  return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
  
}

#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")

df <- do.call("rbind", lapply(urls, getpubmed))

The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.

Another option would be to explore the easyPubMed package.
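For reference, here is a minimal sketch of the easyPubMed route. It assumes the package's `get_pubmed_ids()`/`fetch_pubmed_data()`/`article_to_df()` interface; check the package documentation before relying on the exact arguments.

```r
library(easyPubMed)

# Query PubMed by PMID; get_pubmed_ids() also accepts free-text queries.
ids  <- get_pubmed_ids("19592249[PMID] OR 22281223[PMID]")
recs <- fetch_pubmed_data(ids)  # raw XML records

# Convert each record to a one-row data frame and bind them together.
# max_chars = -1 is assumed to keep the full abstract rather than truncating it.
df <- do.call(
  rbind,
  lapply(articles_to_list(recs), article_to_df,
         max_chars = -1, getAuthors = FALSE)
)

df[, c("pmid", "title", "abstract")]
```

This avoids HTML scraping entirely, since easyPubMed talks to the NCBI E-utilities API and returns structured records.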


I would also use a function and rvest. However, I would pass the PMID in as the function argument, use html_node since only a single node needs to be matched, and use faster CSS selectors. String cleaning is done via the stringr package:

library(rvest)
library(stringr)
library(dplyr)

get_abstract <- function(pid){
  
  page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
  
  df <- tibble(
    title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
    abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
  )
  return(df)
}

get_abstract('19592249')