How to account for papers with zero citations in a year with scholar R package?


I am using the scholar package in R to extract citation stats. I am planning on creating a data frame that has

  • pubID or article title
  • year
  • number of citations

I can do this article by article with get_article_cite_history(). However, I get an error for any article that has a year without citations:

Error in data.frame(year = years, cites = vals) : arguments imply differing number of rows: 13, 12

Looking at how the code runs, the function does not insert a zero when a year has no citations. It keeps the year but has no matching citation count, which causes the differing number of rows.
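The mismatch can be reproduced without touching Google Scholar. A minimal sketch with made-up numbers (the years and counts below are invented, not scraped):

```r
# Invented data: 13 years were found, but only 12 citation counts,
# because the year with zero citations has no count element.
years <- 2010:2022
vals  <- c(3, 5, 2, 7, 4, 6, 1, 8, 9, 2, 4, 5)

# data.frame() refuses to combine vectors of different lengths;
# tryCatch() captures the error message instead of stopping.
res <- tryCatch(
  data.frame(year = years, cites = vals),
  error = function(e) conditionMessage(e)
)
print(res)
```

The captured message should match the error quoted above, which suggests the problem is the pairing of the two vectors rather than the request itself.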

I would like to simply run a loop where it will take a pubid, get all the citation information (year and # of cites) and account for any years with 0 cites, and bind everything together to build the data.

Any help would be much appreciated!

There is 1 solution below


This should do it. The code below uses the internals of the get_article_cite_history() function, but instead of getting the years the way the function did originally, it takes them from the last four digits of the href of the link that surrounds the span holding each citation count. Each year is therefore read from the same element as its count, so the years and the numbers of citations always have the same length.
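To illustrate the idea in isolation, here is a small sketch using base R only. The href strings are hypothetical stand-ins (the real ones are only assumed to end in a four-digit year), and the counts are invented:

```r
# Hypothetical href values, one per citation count; not real scraped links.
hrefs <- c("/scholar?cites=123&as_ylo=2004&as_yhi=2004",
           "/scholar?cites=123&as_ylo=2005&as_yhi=2005",
           "/scholar?cites=123&as_ylo=2007&as_yhi=2007")
vals <- c(4, 16, 27)

# Keep only the trailing four digits of each href and convert to integer.
years <- as.integer(gsub(".*(\\d{4})$", "\\1", hrefs))
df <- data.frame(year = years, cites = vals)

# Left-join onto the full range of years, then turn missing counts into 0.
full <- merge(data.frame(year = min(years):max(years)), df, all.x = TRUE)
full$cites[is.na(full$cites)] <- 0
print(full)
```

Here 2006 has no link, so it is missing from df and comes back from the merge with a count of zero.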

library(rvest)
library(scholar)
library(tidyverse)

id <- "_zbP0I0AAAAJ"
pubs <- get_publications(id)

# For each publication, scrape its per-year citation counts
out <- lapply(pubs$pubid, function(article) {
  # Build the URL of the article's citation page
  site <- getOption("scholar_site")
  id <- tidy_id(id)
  url_base <- paste0(site, "/citations?", "view_op=view_citation&hl=en&citation_for_view=")
  url_tail <- paste(id, article, sep = ":")
  url <- paste0(url_base, url_tail)
  res <- get_scholar_resp(url)
  if (is.null(res))
    return(NA)
  httr::stop_for_status(res, "get user id / article information")
  doc <- read_html(res)
  # Citation counts shown in the yearly chart
  vals <- doc %>% html_nodes(".gsc_oci_g_al") %>% html_text() %>%
    as.numeric()
  # Years taken from the last four digits of each count's link
  years <- doc %>%
    html_nodes("a.gsc_oci_g_a") %>%
    html_attr("href") %>%
    gsub(".*(\\d{4})$", "\\1", .) %>%
    as.integer()
  df <- data.frame(year = years, cites = vals)
  if (nrow(df) > 0) {
    # Fill years with no citations between the first and last cited year
    df <- merge(data.frame(year = min(years):max(years)),
                df, all.x = TRUE)
    df[is.na(df)] <- 0
    df$pubid <- article
  } else {
    df$pubid <- vector(mode = mode(article))
  }
  df
})

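# Keep only articles that returned at least one citation year;
# failed requests (returned as NA above) are dropped here as well
keep <- vapply(out, function(x) is.data.frame(x) && nrow(x) > 0, logical(1))

out <- bind_rows(out[keep])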
head(out)
#>   year cites        pubid
#> 1 2004     4 qjMakFHDy7sC
#> 2 2005    16 qjMakFHDy7sC
#> 3 2006    19 qjMakFHDy7sC
#> 4 2007    27 qjMakFHDy7sC
#> 5 2008    38 qjMakFHDy7sC
#> 6 2009    47 qjMakFHDy7sC

Created on 2022-04-11 by the reprex package (v2.0.1)

Running the following code after creating the output above should add back all the publication-years with no citations, including publications that have never been cited.

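# One row per publication and observed year
eg <- expand.grid(pubid = pubs$pubid, year = sort(unique(out$year)),
                  stringsAsFactors = FALSE)

# Publication-years absent from the scraped data get zero citations
out <- full_join(eg, out, by = c("pubid", "year"))
out <- out %>% mutate(cites = ifelse(is.na(cites), 0, cites))

# Drop years before each paper's publication year (papers with no year are kept)
pubyr <- pubs %>% dplyr::select(pubid, year) %>% rename(pubyr = year) %>% na.omit()
out <- left_join(out, pubyr, by = "pubid")
out <- out %>% filter(is.na(pubyr) | year >= pubyr)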
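As a rough check of what this last step does, here is a sketch on toy data. The two data frames below are invented stand-ins for pubs and out (their names carry a toy_ prefix to make that clear), and it assumes dplyr is installed:

```r
library(dplyr)

# Invented stand-ins: paper "a" published in 2019, paper "b" in 2020.
toy_pubs <- data.frame(pubid = c("a", "b"), year = c(2019, 2020))
toy_out  <- data.frame(year  = c(2019, 2020, 2021, 2021),
                       cites = c(2, 0, 5, 1),
                       pubid = c("a", "a", "a", "b"))

# Every combination of publication and observed year.
eg <- expand.grid(pubid = toy_pubs$pubid,
                  year = sort(unique(toy_out$year)),
                  stringsAsFactors = FALSE)

# Combinations with no scraped row get zero citations.
toy_full <- full_join(eg, toy_out, by = c("pubid", "year")) %>%
  mutate(cites = ifelse(is.na(cites), 0, cites))

# Remove years that precede each paper's publication year.
pubyr <- toy_pubs %>% select(pubid, year) %>% rename(pubyr = year)
toy_full <- left_join(toy_full, pubyr, by = "pubid") %>%
  filter(is.na(pubyr) | year >= pubyr) %>%
  arrange(pubid, year)
print(toy_full)
```

Paper "b" gains a 2020 row with zero citations, while its 2019 row is dropped because that year is before it was published.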