Split a PDF file into multiple files every 2 pages in R


I have a PDF document with 300 pages. I need to split this file into 150 files of 2 pages each. For example, the 1st document would contain pages 1 & 2 of the original file, the 2nd document pages 3 & 4, and so on.

Maybe I can use the "pdftools" package, but I don't know how.




1) pdftools Assuming that the input PDF is in the current directory and the outputs are to go into the same directory, change the inputs below as needed. The code gets the number of pages, num, computes the vectors st and en of start and end page numbers, and then calls pdf_subset once for each st/en pair. Note that pdf_length and pdf_subset come from the qpdf R package, but pdftools imports and re-exports them, so loading pdftools is enough.


library(pdftools)

# inputs
infile <- "a.pdf"  # input pdf
prefix <- "out_"  # output pdf's will begin with this prefix

num <- pdf_length(infile)
st <- seq(1, num, 2)
en <- pmin(st + 1, num)

for (i in seq_along(st)) {
  # zero-pad the file index to the width of num, e.g. out_001.pdf for 300 pages
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  pdf_subset(infile, pages = st[i]:en[i], output = outfile)
}
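
As a quick sanity check of the start/end computation, here is a minimal sketch using a made-up page count of 11 (not the asker's file). It shows how pmin keeps the last chunk from running past the end when the page count is odd. No PDF is touched; it only builds the page ranges and the file names the loop would use.

```r
# hypothetical page count, chosen odd to exercise the shorter last chunk
num <- 11
prefix <- "out_"

st <- seq(1, num, 2)      # start pages: 1 3 5 7 9 11
en <- pmin(st + 1, num)   # end pages:   2 4 6 8 10 11

# file names, zero-padded to the width of num as in the loop above
outfiles <- sprintf("%s%0*d.pdf", prefix, nchar(num), seq_along(st))

data.frame(st, en, outfiles)
```

The last row has st equal to en, so the final file would hold the single leftover page.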

2) pdfbox The Apache pdfbox utility can split a PDF into files of 2 pages each. Download the command line utilities .jar file from pdfbox and be sure you have java installed. Then run the line below, assuming that your input file is a.pdf and is in the current directory together with the jar file (or run the quoted part directly from the command line, without the quotes and without R). The jar file name may need to be changed if a later version is used. The one named below was the latest one at the time of writing (not counting the alpha version).

system("java -jar pdfbox-app-2.0.26.jar PDFSplit -split 2 a.pdf")
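
If it is more convenient to pass the arguments as a vector and to check for the jar first, something like the following sketch could be used with base R's system2. The helper name pdfbox_split_args is made up for illustration, and the jar name is simply the one used above; the call itself only runs if the jar is found in the working directory.

```r
# hypothetical helper: builds the argument vector for the PDFSplit command
pdfbox_split_args <- function(jar, infile, n = 2) {
  c("-jar", jar, "PDFSplit", "-split", as.character(n), infile)
}

jar <- "pdfbox-app-2.0.26.jar"   # same jar name as above; adjust to your version
args <- pdfbox_split_args(jar, "a.pdf", n = 2)
print(args)

# run only if the jar is actually present in the working directory
if (file.exists(jar)) {
  status <- system2("java", args)   # exit status of the java process
}
```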

3) animation/pdftk Another option is to install the pdftk program and call it through the pdftk function of the animation R package. Change the inputs at the top of the script below and run it. The script gets the number of pages in the input, num, using pdftk, computes the start and end page numbers, st and en, and then invokes pdftk repeatedly, once for each st/en pair, to extract those pages into a separate file.


library(animation)

# inputs
PDFTK <- "~/../bin/pdftk.exe"  # path to pdftk
infile <- "a.pdf"  # input pdf
prefix <- "out_"  # output pdf's will begin with this prefix

ani.options(pdftk = Sys.glob(PDFTK))

tmp <- tempfile()
dump_data <- pdftk(infile, "dump_data", tmp)
g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
num <- as.numeric(sub(".* ", "", g))

st <- seq(1, num, 2)
en <- pmin(st + 1, num)

for (i in seq_along(st)) {
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  # "cat 1-2", "cat 3-4", ... extracts one page range per call
  pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
}
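
To see what the grep/sub step does, here is a small sketch run on made-up dump_data lines (the sample text is illustrative, not output from a real file). It assumes pdftk reports the count on a line of the form "NumberOfPages: N".

```r
# illustrative sample of what the dump_data file might contain
sample_lines <- c(
  "InfoBegin",
  "InfoKey: Creator",
  "InfoValue: LaTeX via pandoc",
  "NumberOfPages: 300",
  "PageMediaBegin"
)

# keep only the line that mentions NumberOfPages
g <- grep("NumberOfPages", sample_lines, value = TRUE)

# drop everything up to and including the last space, leaving the number
num <- as.numeric(sub(".* ", "", g))
num
```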

Neither pdftools nor qpdf (on which the former depends) provides a single function for splitting a PDF into chunks other than one page per file. One approach is to rely on an external program: pdftk can do this if you call it once for each 2-page output.

I have a 36-page PDF here named quux.pdf in the current working directory.

str(pdftools::pdf_info("quux.pdf"))
# List of 11
#  $ version    : chr "1.5"
#  $ pages      : int 36
#  $ encrypted  : logi FALSE
#  $ linearized : logi FALSE
#  $ keys       :List of 8
#   ..$ Producer       : chr "pdfTeX-1.40.24"
#   ..$ Author         : chr ""
#   ..$ Title          : chr ""
#   ..$ Subject        : chr ""
#   ..$ Creator        : chr "LaTeX via pandoc"
#   ..$ Keywords       : chr ""
#   ..$ Trapped        : chr ""
#   ..$ PTEX.Fullbanner: chr "This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4"
#  $ created    : POSIXct[1:1], format: "2022-05-17 22:54:40"
#  $ modified   : POSIXct[1:1], format: "2022-05-17 22:54:40"
#  $ metadata   : chr ""
#  $ locked     : logi FALSE
#  $ attachments: logi FALSE
#  $ layout     : chr "no_layout"

I also have pdftk installed and available on the path:

Sys.which("pdftk")
#                                        pdftk 
# "C:\\PROGRA~2\\PDFtk Server\\bin\\pdftk.exe" 

With this, I can run an external script to create 2-page PDFs:

list.files(pattern = "pdf$")
# [1] "quux.pdf"

pages <- seq(pdftools::pdf_info("quux.pdf")$pages)
pages <- split(pages, (pages - 1) %/% 2)
# $`0`
# [1] 1 2
# $`1`
# [1] 3 4
# $`2`
# [1] 5 6

for (pg in pages) {
  # one pdftk call per 2-page group, e.g. "cat 1-2 output out_01-02.pdf"
  system(sprintf("pdftk quux.pdf cat %s-%s output out_%02i-%02i.pdf",
         min(pg), max(pg), min(pg), max(pg)))
}
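
The grouping trick generalizes to any chunk size. Below is a minimal sketch using a hypothetical helper, pdftk_cmds (not part of any package), that only builds the command strings so they can be inspected before being passed to system(). The 5-page count is made up to show the shorter last chunk.

```r
# hypothetical helper: one pdftk command string per chunk of `size` pages
pdftk_cmds <- function(infile, n_pages, size = 2, prefix = "out_") {
  pages <- seq_len(n_pages)
  # integer division puts pages 1..size in group 0, the next `size` in group 1, ...
  chunks <- split(pages, (pages - 1) %/% size)
  vapply(chunks, function(pg) {
    sprintf("pdftk %s cat %d-%d output %s%02d-%02d.pdf",
            infile, min(pg), max(pg), prefix, min(pg), max(pg))
  }, character(1), USE.NAMES = FALSE)
}

cmds <- pdftk_cmds("quux.pdf", n_pages = 5)
cmds
# to actually run them: for (cmd in cmds) system(cmd)
```

Building the strings first makes it easy to check the page ranges and file names before any file is written.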

list.files(pattern = "pdf$")
#  [1] "out_01-02.pdf" "out_03-04.pdf" "out_05-06.pdf" "out_07-08.pdf"
#  [5] "out_09-10.pdf" "out_11-12.pdf" "out_13-14.pdf" "out_15-16.pdf"
#  [9] "out_17-18.pdf" "out_19-20.pdf" "out_21-22.pdf" "out_23-24.pdf"
# [13] "out_25-26.pdf" "out_27-28.pdf" "out_29-30.pdf" "out_31-32.pdf"
# [17] "out_33-34.pdf" "out_35-36.pdf" "quux.pdf"     

Inspecting one of the output files with pdf_info shows that it has 2 pages:

# List of 11
#  $ version    : chr "1.5"
#  $ pages      : int 2
#  $ encrypted  : logi FALSE
#  $ linearized : logi FALSE
#  $ keys       :List of 2
#   ..$ Creator : chr "pdftk 2.02 - www.pdftk.com"
#   ..$ Producer: chr "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
#  $ created    : POSIXct[1:1], format: "2022-05-18 09:37:56"
#  $ modified   : POSIXct[1:1], format: "2022-05-18 09:37:56"
#  $ metadata   : chr ""
#  $ locked     : logi FALSE
#  $ attachments: logi FALSE
#  $ layout     : chr "no_layout"