How can I optimize my pdf repository after splitting it by page?


I have about 20 large PDFs which I have split into per-page files for easier access. When I split them by page using qpdf, the total size inflates by about 10x, which suggests redundant data is being duplicated into every per-page PDF. The bloat is very likely caused by embedded fonts. Is there a way to externalize these fonts (e.g. so users can install them on their devices beforehand)? My goal is for the total size after splitting to stay within 1x-2x of the original so that I can host the pages on my website.
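
For reference, the split was done with something along these lines, using qpdf's `--split-pages` mode (a sketch; the output pattern `page-%d.pdf` is made up here, and qpdf substitutes `%d` with each output's page number):

```shell
in=Volume17_Part_III.pdf   # the sample file linked below
# Guard so the command only runs when qpdf and the input are present.
if command -v qpdf >/dev/null 2>&1 && [ -f "$in" ]; then
  # One output PDF per page; %d is replaced by the page number.
  qpdf --split-pages "$in" page-%d.pdf
fi
```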

Here is a sample PDF from the repository:

https://www.mea.gov.in/Images/CPV/Volume17_Part_III.pdf

Any help regarding PDF splitting is welcome.

Thanks!

There is 1 answer below.

I split the file into files of one page each and then tried to squeeze them. There is no unneeded data:

$ cpdf -squeeze 641.pdf -o out.pdf
Initial file size is 947307 bytes
Beginning squeeze: 2178 objects
Squeezing... Down to 1519 objects
Squeezing page data and xobjects
Recompressing document
Final file size is 945176 bytes, 99.78% of original.

So no luck there. About 4/5 of the size of each file is the (uncompressed) XML metadata from the main file. You may well not need this. If so, you can run:

cpdf -remove-metadata in.pdf -o small.pdf

on each output file. This reduces the size of each file by roughly a factor of five. Obviously, if you're splitting into groups of more than one page, the effect will not be as large.
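
To apply that to every split page, a shell loop along these lines should work (a sketch; the `page-*.pdf` names are an assumption about how your split files are named):

```shell
# Run cpdf -remove-metadata over each one-page PDF,
# writing page-N-small.pdf alongside page-N.pdf.
if command -v cpdf >/dev/null 2>&1; then
  for f in page-*.pdf; do
    [ -e "$f" ] || continue   # glob matched nothing: skip
    cpdf -remove-metadata "$f" -o "${f%.pdf}-small.pdf"
  done
fi
```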