I have a large number of PDF files on my local filesystem that I use as a documentation base, and I would like to create an index of these files. I would like to:
- Parse the contents of the PDF files to get keywords.
- Select the most relevant keywords to make a summary.
- Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are:
- Is there an existing tool to perform the whole job?
- What is the most appropriate tool to parse the content of PDF files, filter the words (by length), and count them?
- I am considering using Perl, swish-e, and pdfgrep to write a script. Do you know other tools which could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process that tool's output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also supports working with all kinds of file formats via modules.
As for reading PDF, here are some options if your needs aren't too elaborate:
- Use the CAM::PDF (and CAM::PDF::PageText) or PDF-API2 modules
- Use pdftotext from the poppler library (probably in the poppler-utils package)
- Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system, or qx (backticks) when you need to capture their output.
The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that you mention take a few lines of code.
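As a rough illustration of the extraction and counting steps, here is a minimal sketch. It assumes pdftotext is installed and on your PATH; the sub names, the tiny stop-word list, the minimum length of 4 and the top-10 cutoff are all placeholders of mine, not anything prescribed by the tools.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stop-word list; extend it to suit your documents.
my %STOP = map { $_ => 1 } qw(that this with from have will);

# Run pdftotext and return the extracted text ('' on failure).
# The list form of open bypasses the shell, so file names with
# spaces or quotes need no escaping. '-' sends the text to stdout.
sub pdf_text {
    my ($file) = @_;
    open(my $fh, '-|', 'pdftotext', '-q', $file, '-')
        or do { warn "Cannot run pdftotext on $file: $!\n"; return '' };
    local $/;                      # slurp the whole output at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Count words of at least $min_len letters, case-insensitively.
# Assumption: only ASCII letters count as word characters here.
sub count_words {
    my ($text, $min_len, $counts) = @_;
    for my $word (lc($text) =~ /\b([a-z]+)\b/g) {
        next if length($word) < $min_len or $STOP{$word};
        $counts->{$word}++;
    }
    return $counts;
}

# Return the $n most frequent words (ties broken alphabetically).
sub top_keywords {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                 keys %$counts;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Print the ten most frequent words for each PDF given on the command line.
for my $file (@ARGV) {
    my $counts = count_words(pdf_text($file), 4, {});
    print "$file: ", join(', ', top_keywords($counts, 10)), "\n";
}
```

Run it as, say, perl keywords.pl docs/*.pdf (the script name is arbitrary). Raw frequency is only one possible measure of relevance; you would likely want a longer stop-word list or a weighting across files.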
Then write out the HTML, either directly if it is simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
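For the "directly if simple" route to HTML, a sketch along these lines may be enough. The %index structure (keyword => { file => count }), the sample paths and the sub names are assumptions for illustration only; with HTML::Template you would move the markup into a template file instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attributes.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Build one static page for a keyword: a list of links to the files
# that contain it, most frequent first.
# Simplification: paths are HTML-escaped but not URL-encoded.
sub keyword_page {
    my ($keyword, $files) = @_;    # $files: { path => occurrences }
    my $title = escape_html($keyword);
    my $html  = "<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\">"
              . "<title>$title</title></head>\n<body>\n<h1>$title</h1>\n<ul>\n";
    for my $path (sort { $files->{$b} <=> $files->{$a} or $a cmp $b }
                  keys %$files) {
        my $p = escape_html($path);
        $html .= qq{<li><a href="$p">$p</a> ($files->{$path})</li>\n};
    }
    $html .= "</ul>\n</body>\n</html>\n";
    return $html;
}

# Hypothetical index data, as the counting step might produce it.
my %index = (
    parser => { 'docs/perl.pdf' => 12, 'docs/xml.pdf' => 3 },
);

# Print each page; in a real script, write it to "$keyword.html" instead.
for my $keyword (sort keys %index) {
    print keyword_page($keyword, $index{$keyword});
}
```

Keeping the page builder as a function that returns a string makes it easy to swap in a templating module later without touching the indexing code.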
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse PDF, so you may still be better off with your own script.