I'm trying to parse the contents of PDFs, specifically scientific research papers.
Here's the portion I'm trying to grab:
I only need the paper title and the author name(s).
I'm using the PDF Parser library, and I was able to get the header text with this code:
function get_pdf_prop( $file )
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile( $file );
    $details = $pdf->getDetails();
    $page = $pdf->getPages()[0];

    //-- Extract the text of the first page
    $text = $page->getText();
    $text = explode( 'ABSTRACT', $text, 2 ); //-- get the text before the "ABSTRACT"
    $text = $text[0];

    //-- split the lines
    $lines = explode( "\n", $text );

    return array(
        'total_pages' => $details['Pages'],
        'paper_title' => $lines[0] . $lines[1],
        'author'      => $lines[2]
    );
}
I parse the full text of the first page, which comes back as plain text. Since the content I need appears before the word ABSTRACT, I split the text there and then split the result into lines.
I assume the first two lines are the title and the third line is the author name. So far, papers like the one in the screenshot above give correct results.
But problems arise in the following scenarios:
If the paper title is a single line, I have no way of knowing that beforehand, so my code still returns the first two lines as the paper title. This can put both the title and the author name in paper_title.
If the paper title spans three lines, the result is wrong again.
If there is more than one author, my code does not return the proper data.
So, any suggestions on how I can reliably grab the paper title and author name(s) from a scientific paper in PDF form? I'm sure they all follow the same pattern, since the PDFs are created with LaTeX tools. Any better solutions or clues?
Please note that I'm trying to do this on papers uploaded to my site, and I'm using PHP as the server-side language.
Thank you.
You could try using the PDF metadata to retrieve the fields you need (author, title, and so on). I tried a few scientific papers at random, and they all had metadata for at least pages, author and title.
PDF Parser docs show how this can be done:
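As a rough sketch of that approach (not the library's own example), the array returned by getDetails() can be reduced to a title and a list of authors. The helper name extract_title_and_authors is hypothetical, and the 'Title' and 'Author' keys and the separator rule (comma, semicolon or the word "and") are assumptions; a given PDF may leave these fields empty or format them differently.

```php
<?php
/**
 * Hypothetical helper: reduce a PDF metadata array to the fields of interest.
 * Assumes keys named 'Title', 'Author' and 'Pages'; missing keys become
 * null or an empty list instead of raising notices.
 */
function extract_title_and_authors(array $details): array
{
    // Metadata values may be strings or arrays, so normalise to one trimmed string.
    $flatten = static function ($value): string {
        return trim(is_array($value) ? implode(', ', $value) : (string) $value);
    };

    $title     = isset($details['Title']) ? $flatten($details['Title']) : '';
    $authorRaw = isset($details['Author']) ? $flatten($details['Author']) : '';

    // Assumption: authors are separated by ',', ';' or the word "and".
    // Names written as "Last, First" would be split incorrectly by this rule.
    $authors = preg_split('/\s*(?:[,;]|\band\b)\s*/i', $authorRaw, -1, PREG_SPLIT_NO_EMPTY);

    return array(
        'total_pages' => isset($details['Pages']) ? (int) $details['Pages'] : null,
        'paper_title' => $title !== '' ? $title : null,
        'authors'     => $authors === false ? array() : array_values($authors),
    );
}

// Usage with PDF Parser (requires the smalot/pdfparser package):
// $parser = new \Smalot\PdfParser\Parser();
// $pdf    = $parser->parseFile($file);
// $info   = extract_title_and_authors($pdf->getDetails());
```

If paper_title comes back as null or authors is empty, the metadata was not filled in for that file, and parsing the first-page text remains the fallback.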
Sample output for a randomly picked paper (var_dump($details)):