Scan bibtexkeys in Rmarkdown documents

I love the simplicity of R Markdown for producing documents, and I maintain my own reference library in a BibTeX (*.bib) file. I follow these instructions to cite in the document (each bibtexkey is preceded by the "@" symbol).

My question is: Is there a way to scan the R Markdown document (*.Rmd) and extract a list of the bibtexkeys cited in it? This would be useful for producing a subset of my library to attach to the project, instead of all of the ca. 6000 references accumulated in my library.

There are 3 best solutions below

0
On BEST ANSWER

After exploring several alternatives, I settled on the function str_extract() from the package stringr. I am assuming here that you have a BibTeX library containing all cited references (and usually more). I also combined Oto Kaláb's example with one of my own, because the two use different bibtexkey styles.

First the Rmd document.

rmd_text <- c("# Introduction",
        "",
        "Lorem ipsum dolor sit amet [@bibkey_a], consectetur adipisici elit [@bibkey_b],",
        "sed eiusmod tempor incidunt ut labore et dolore magna aliqua [@bibkey_c;@bibkey_d].",
        "",
        "According to @Noname2000, the world is round [@Ladybug1999;@Ladybug2009].",
        "This knowledge got lost [@Ladybug2009a].")
writeLines(rmd_text, "document.Rmd")

The next code block is commented step by step. It produces a vector with every cited reference, which can be reduced to the distinct keys with unique().

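library(stringr)

# Bibtexkeys from bib file
keys <- c("bibkey_a", "bibkey_b", "bibkey_c", "bibkey_d",
        "Noname2000", "Ladybug1999", "Ladybug2009", "Ladybug2009a")
# Prefix "@" and append a word boundary, so that "@Ladybug2009"
# does not also match inside "@Ladybug2009a"
keys <- paste0("@", keys, "\\b")

# Read document
document <- readLines("document.Rmd")

# Scan document line by line; str_extract() is vectorised over the patterns
# and returns the first match of each key per line (NA if the key is absent)
cited_refs <- list()
for(i in seq_along(document)) {
    cited_refs[[i]] <- str_extract(document[i], keys)
}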

# Final output
cited_refs <- unlist(cited_refs)
cited_refs <- cited_refs[!is.na(cited_refs)]

summary(as.factor(cited_refs))

The summary() call at the end aggregates the resulting vector, giving the frequency of each key in the text (which I think is also useful for detecting rarely cited references). I am also thinking about adding the line number of each citation to the output.
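The line-number idea can be sketched in base R as follows. This is only a sketch under stated assumptions: find_citations() is a hypothetical helper, and its key pattern (letters, digits and underscores after "@") is a simplification that is narrower than pandoc's actual citation syntax, so keys containing hyphens or colons would be truncated and e-mail addresses would produce false hits.

```r
# Hypothetical helper: return every "@key" found in a character vector of
# lines, together with the line number it was found on.
# ASSUMPTION: keys consist only of letters, digits and underscores.
find_citations <- function(lines) {
  pattern <- "@[[:alnum:]_]+"
  # gregexpr() finds all matches per line; regmatches() extracts them
  found <- regmatches(lines, gregexpr(pattern, lines))
  data.frame(
    line = rep(seq_along(lines), lengths(found)),  # repeat line number per hit
    key  = sub("^@", "", unlist(found)),           # drop the leading "@"
    stringsAsFactors = FALSE
  )
}

lines <- c("According to @Noname2000, the world is round [@Ladybug1999;@Ladybug2009].",
           "",
           "This knowledge got lost [@Ladybug2009a].")

cites <- find_citations(lines)
print(cites)             # one row per citation, with its line number
print(table(cites$key))  # frequency of each key
```

For a real document, the lines would come from readLines("document.Rmd") instead of the inline vector. Unlike the loop above, this does not need the list of keys from the bib file in advance, and it counts repeated citations of the same key within one line.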

2
On

You can just parse your .Rmd document by searching for a given string pattern (i.e. @).

Example:

Create an example file:

Rmd_txt  <- "Lorem ipsum dolor sit amet [@bibkey_a], consectetur adipisici elit [@bibkey_b], sed eiusmod tempor incidunt ut labore et dolore magna aliqua [@bibkey_c;@bibkey_d]."
writeLines(Rmd_txt, "rmdfile.Rmd")

Read file:

Rmd <- readChar("rmdfile.Rmd",nchars=1e6)

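Use a regular expression to find all substrings that start with [@ and end with ]. Note that in-text citations written without brackets (e.g. @Noname2000) are not matched by this pattern.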

pattern <- "\\[@(.*?)\\]"
m <- regmatches(Rmd,gregexpr(pattern,Rmd))[[1]]
m
[1] "[@bibkey_a]"           "[@bibkey_b]"           "[@bibkey_c;@bibkey_d]"

Finally just split and clean the strings to your needs

res <- unlist(strsplit(m, ";"))

res <- gsub("\\[", "", res)
res <- gsub("\\]", "", res)

res
[1] "@bibkey_a" "@bibkey_b" "@bibkey_c" "@bibkey_d"
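Once the keys are extracted, the subset of the library asked for in the question could be produced along these lines. This is a rough sketch, not a full BibTeX parser: subset_bib() is a hypothetical helper, and it assumes that every entry in the .bib file starts on its own line in the form "@type{key," and runs until the next such line. The small bib vector stands in for the real library file.

```r
# Hypothetical helper: keep only the entries of a .bib file (given as a
# character vector of lines) whose key appears in cited_keys.
# ASSUMPTION: each entry starts on its own line as "@type{key,".
subset_bib <- function(bib_lines, cited_keys) {
  starts <- grep("^@[[:alpha:]]+[{]", bib_lines)
  if (length(starts) == 0) return(character(0))
  # an entry runs until the line before the next entry (or the end of file)
  ends <- c(starts[-1] - 1L, length(bib_lines))
  # pull the key out of the "@type{key," header line
  entry_keys <- sub("^@[[:alpha:]]+[{][[:space:]]*([^,[:space:]]+).*$", "\\1",
                    bib_lines[starts])
  # accept cited keys with or without the leading "@"
  keep <- entry_keys %in% sub("^@", "", cited_keys)
  as.character(unlist(Map(function(s, e) bib_lines[s:e],
                          starts[keep], ends[keep])))
}

# Stand-in for readLines() on the full library
bib <- c("@article{bibkey_a,", "  title = {A},", "}", "",
         "@book{bibkey_x,", "  title = {X},", "}", "",
         "@article{bibkey_d,", "  title = {D}", "}")

res <- c("@bibkey_a", "@bibkey_b", "@bibkey_c", "@bibkey_d")
out <- subset_bib(bib, res)
print(out)
```

The result can then be written next to the project with writeLines(out, ...). Cited keys that are missing from the library (here bibkey_b and bibkey_c) are silently skipped, so it may be worth comparing the two key sets with setdiff() first.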

0
On

A simpler solution is to use the function bbt_detect_citations() from the package rbbt.

See also this discussion