Match bibliographic data and remove duplicates even if strings do not match exactly

32 Views Asked by At

I have two sets of bibliographic data. One contains records from PubMed, one from Scopus:

data_PubMed <- data.frame(
  Authors = c("AB, CD.", "XX, XY.", "FG, HJ.", "KL, MN.", "OP, QR."),
  Year = c(2020, 2019, 2018, 2017, 2016),
  Title = c("Title 1", "Title 2", "Title 3", "Title 4", "Title 5"),
  `Source title` = c("Journal A", "Journal B", "Journal C", "Journal D", "Journal E"),
  DOI = c("DOI1", "DOI2", NA, "DOI4", NA),
  Database = c("PubMed", "PubMed", "PubMed", "PubMed", "PubMed")
)

data_Scopus <- data.frame(
  Authors = c("F.G.; H.J.", "K.L.; M.N.", "O.P.; Q.R.", "S.T.; U.V.", "C.D.; E.F."),
  Year = c(2018, 2017, 2016, 2015, 2014),
  Title = c("Title: 3.", "Title 4", "Title - 5.", "Title 6", "Title 7"),
  `Source title` = c("Journal C", "Journal D", "Journal E", "Journal F", "Journal G"),
  DOI = c(NA, "doi4", "dOi5", NA, NA),
  Database = c("Scopus", "Scopus", "Scopus", "Scopus", "Scopus")
)

I want to create a combined data frame which only contains distinct records. In case of a duplicate, I want to retain the record from PubMed, discarding the one from Scopus.

The tricky part is that actual duplicates don't necessarily appear as such due to differently applied conventions when it comes to spelling authors and titles. I tried to reflect these pitfalls in my example code.

So the obvious first choice is to resort to the DOI. Sometimes the same DOI is written lowercase, sometimes uppercase, but this is easy to deal with, using tolower(DOI). So if all records had DOIs, that task would be simple. Just arrange by $Database to have "PubMed" first, then group_by(DOI) and apply a filter that keeps the PubMed record of each group (DOI):

data_Combined <- rbind(data_PubMed, data_Scopus) |>
  mutate(DOI = tolower(DOI)) |>
  arrange(Database) |>
  group_by(DOI) |>
  filter(Database == "PubMed" | (Database == "Scopus" & !any(Database == "PubMed"))) |>
  ungroup()

> data_Combined
# A tibble: 6 × 6
  Authors     Year Title      Source.title DOI   Database
  <chr>      <dbl> <chr>      <chr>        <chr> <chr>   
1 AB, CD.     2020 Title 1    Journal A    doi1  PubMed  
2 XX, XY.     2019 Title 2    Journal B    doi2  PubMed  
3 FG, HJ.     2018 Title 3    Journal C    NA    PubMed  
4 KL, MN.     2017 Title 4    Journal D    doi4  PubMed  
5 OP, QR.     2016 Title 5    Journal E    NA    PubMed  
6 O.P.; Q.R.  2016 Title - 5. Journal E    doi5  Scopus

So with my inconsistent data, this approach is obviously wrong for a couple of reasons. "Title 6" and "Title 7" have been eliminated entirely and "Title 5" is actually a duplicate with slight differences that I would have liked to detect.

I know that there is the stringdist package that can identify "close enough" strings as matches, but I tried to implement this in my approach to no avail.

0

There are 0 best solutions below