I'm using pdfminer.six
According to this on page 8 I should be able to modify char_margin and line_overlap in a LAParams object in order to cause a bunch of LTChar objects next to each other to group into LTTextLine objects. Unfortunately, it doesn't matter what values I assign these, nothing changes.
Here is my code and some example output. What am I missing? I've tried reasonable values such as 1, 5, 10, and super high ones such as 10000000000000. But I never get any LTTextLine objects.
from pdfminer.layout import *
from pdfminer.high_level import extract_pages
def rec(element, deep=0):
print(f"{deep}: {element}")
if hasattr(element, '__iter__'):
for item in element:
rec(item, deep+1)
def main():
file = 'completed-intake-form.pdf'
laparams = LAParams(char_margin=4, line_overlap=4.0, word_margin=5.0)
for page_layout in extract_pages(file, laparams=laparams):
for element in page_layout:
rec(element, 0)
$ python3 src/main.py
0: <LTFigure(TLkNsAkXBt) 0.000,0.000,595.500,850.080 matrix=[1.00,0.00,0.00,1.00, (0.00,-7.83)]>
1: <LTRect 0.000,-7.463,596.000,842.250>
1: <LTRect 0.000,-0.707,596.000,842.250>
1: <LTRect 0.000,-0.707,596.000,842.250>
1: <LTRect 36.491,749.262,550.347,787.661>
1: <LTChar 39.671,663.263,47.715,673.269 matrix=[0.75,0.00,0.00,0.75, (39.67,666.25)] font='AAAAAA+EBGaramond-SemiBold' adv=10.717319919600001 text='N'>
1: <LTChar 47.714,663.263,51.967,673.269 matrix=[0.75,0.00,0.00,0.75, (47.71,666.25)] font='AAAAAA+EBGaramond-SemiBold' adv=5.6652499575 text='a'>
1: <LTChar 51.966,663.263,59.941,673.269 matrix=[0.75,0.00,0.00,0.75, (51.97,666.25)] font='AAAAAA+EBGaramond-SemiBold' adv=10.6240099203 text='m'>
1: <LTChar 59.940,663.263,64.022,673.269 matrix=[0.75,0.00,0.00,0.75, (59.94,666.25)] font='AAAAAA+EBGaramond-SemiBold' adv=5.4386399592000005 text='e'>
1: <LTChar 39.671,645.269,46.785,655.275 matrix=[0.75,0.00,0.00,0.75, (39.67,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=9.4776299289 text='A'>
1: <LTChar 46.784,645.269,52.157,655.275 matrix=[0.75,0.00,0.00,0.75, (46.78,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=7.1582099463 text='d'>
1: <LTChar 52.156,645.269,57.529,655.275 matrix=[0.75,0.00,0.00,0.75, (52.16,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=7.1582099463 text='d'>
1: <LTChar 57.529,645.269,61.411,655.275 matrix=[0.75,0.00,0.00,0.75, (57.53,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=5.1720399612 text='r'>
1: <LTChar 61.410,645.269,65.493,655.275 matrix=[0.75,0.00,0.00,0.75, (61.41,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=5.4386399592000005 text='e'>
1: <LTChar 65.492,645.269,68.974,655.275 matrix=[0.75,0.00,0.00,0.75, (65.49,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=4.638839965200001 text='s'>
1: <LTChar 68.974,645.269,72.456,655.275 matrix=[0.75,0.00,0.00,0.75, (68.97,648.25)] font='AAAAAA+EBGaramond-SemiBold' adv=4.638839965200001 text='s'>
1: <LTChar 39.671,627.003,45.564,637.009 matrix=[0.75,0.00,0.00,0.75, (39.67,629.99)] font='AAAAAA+EBGaramond-SemiBold' adv=7.8513699411 text='P'>
1: <LTChar 45.563,627.003,51.016,637.009 matrix=[0.75,0.00,0.00,0.75, (45.56,629.99)] font='AAAAAA+EBGaramond-SemiBold' adv=7.264849945500001 text='h'>
1: <LTChar 51.016,627.003,56.019,637.009 matrix=[0.75,0.00,0.00,0.75, (51.02,629.99)] font='AAAAAA+EBGaramond-SemiBold' adv=6.66499995 text='o'>
1: <LTChar 56.018,627.003,61.551,637.009 matrix=[0.75,0.00,0.00,0.75, (56.02,629.99)] font='AAAAAA+EBGaramond-SemiBold' adv=7.371489944700001 text='n'>
1: <LTChar 61.550,627.003,65.633,637.009 matrix=[0.75,0.00,0.00,0.75, (61.55,629.99)] font='AAAAAA+EBGaramond-SemiBold' adv=5.4386399592000005 text='e'>
1: <LTLine 66.501,665.670,230.891,665.670>
1: <LTLine 168.264,545.088,332.654,545.088>
1: <LTLine 387.881,472.888,412.650,473.221>
1: <LTLine 521.049,472.516,550.331,472.516>
1: <LTLine 365.141,648.330,445.459,648.330>
1: <LTLine 487.873,647.580,550.179,647.580>
1: <LTLine 99.174,57.068,263.564,57.068>
1: <LTLine 416.664,215.165,540.527,215.540>
1: <LTLine 75.393,647.955,239.783,647.955>
1: <LTLine 330.209,629.758,494.599,629.758>
1: <LTLine 385.636,498.845,550.026,498.845>
1: <LTLine 69.056,630.134,157.635,630.206>
1: <LTLine 305.623,665.220,394.946,666.118>
1: <LTLine 45.726,405.074,559.167,405.074>
1: <LTLine 45.726,572.622,559.167,572.622>
1: <LTLine 45.726,166.724,559.167,166.724>
1: <LTRect 50.007,374.845,62.221,387.059>
1: <LTRect 284.236,521.109,296.450,533.323>
1: <LTRect 166.196,495.736,178.410,507.950>
1: <LTRect 119.307,469.260,131.522,481.474>
1: <LTRect 362.813,521.109,375.027,533.323>
1: <LTRect 244.773,495.736,256.987,507.950>
1: <LTRect 197.884,469.260,210.098,481.474>
1: <LTRect 233.699,374.845,245.913,387.059>
1: <LTRect 416.664,372.944,428.878,385.158>
1: <LTRect 50.007,334.925,62.221,347.139>
1: <LTRect 233.699,334.925,245.913,347.139>
1: <LTRect 416.664,333.024,428.878,345.238>
1: <LTRect 50.007,295.005,62.221,307.219>
1: <LTRect 233.699,295.005,245.913,307.219>
1: <LTRect 416.664,293.104,428.878,305.318>
1: <LTRect 50.007,255.084,62.221,267.299>
1: <LTRect 233.699,255.084,245.913,267.299>
1: <LTRect 416.664,253.183,428.878,265.398>
1: <LTRect 50.007,354.885,62.221,367.099>
1: <LTRect 233.699,354.885,245.913,367.099>
1: <LTRect 416.664,352.984,428.878,365.198>
1: <LTRect 50.007,314.965,62.221,327.179>
1: <LTRect 233.699,314.965,245.913,327.179>
1: <LTRect 416.664,313.064,428.878,325.278>
1: <LTRect 50.007,275.045,62.221,287.259>
1: <LTRect 233.699,275.045,245.913,287.259>
1: <LTRect 416.664,273.144,428.878,285.358>
1: <LTRect 50.007,235.124,62.221,247.339>
1: <LTRect 233.699,235.124,245.913,247.339>
1: <LTRect 416.664,233.223,428.878,245.437>
1: <LTRect 50.007,215.164,62.221,227.378>
1: <LTRect 233.699,215.164,245.913,227.378>
1: <LTChar 134.147,719.958,183.612,762.526 matrix=[0.75,0.00,0.00,0.75, (134.15,761.84)] font='BAAAAA+EyesomeScript' adv=65.897018838 text='M'>
1: <LTChar 183.605,719.958,204.336,762.526 matrix=[0.75,0.00,0.00,0.75, (183.61,761.84)] font='BAAAAA+EyesomeScript' adv=27.617769513000002 text='a'>
1: <LTChar 204.334,719.958,217.658,762.526 matrix=[0.75,0.00,0.00,0.75, (204.33,761.84)] font='BAAAAA+EyesomeScript' adv=17.750229687 text='s'>
1: <LTChar 217.656,719.958,230.980,762.526 matrix=[0.75,0.00,0.00,0.75, (217.66,761.84)] font='BAAAAA+EyesomeScript' adv=17.750229687 text='s'>
1: <LTChar 230.978,719.958,251.709,762.526 matrix=[0.75,0.00,0.00,0.75, (230.98,761.84)] font='BAAAAA+EyesomeScript' adv=27.617769513000002 text='a'>
1: <LTChar 251.706,719.958,272.990,762.526 matrix=[0.75,0.00,0.00,0.75, (251.71,761.84)] font='BAAAAA+EyesomeScript' adv=28.3549995 text='g'>
1: <LTChar 272.988,719.958,287.504,762.526 matrix=[0.75,0.00,0.00,0.75, (272.99,761.84)] font='BAAAAA+EyesomeScript' adv=19.338109659000004 text='e'>
1: <LTChar 287.502,719.958,300.272,762.526 matrix=[0.75,0.00,0.00,0.75, (287.50,761.84)] font='BAAAAA+EyesomeScript' adv=17.0129997 text=' '>
1: <LTChar 300.271,719.958,320.959,762.526 matrix=[0.75,0.00,0.00,0.75, (300.27,761.84)] font='BAAAAA+EyesomeScript' adv=27.561059514 text='T'>
1: <LTChar 320.956,719.958,346.157,762.526 matrix=[0.75,0.00,0.00,0.75, (320.96,761.84)] font='BAAAAA+EyesomeScript' adv=33.572319408 text='h'>
1: <LTChar 346.154,719.958,360.669,762.526 matrix=[0.75,0.00,0.00,0.75, (346.15,761.84)] font='BAAAAA+EyesomeScript' adv=19.338109659000004 text='e'>
1: <LTChar 360.668,719.958,378.674,762.526 matrix=[0.75,0.00,0.00,0.75, (360.67,761.84)] font='BAAAAA+EyesomeScript' adv=23.988329577000002 text='r'>
1: <LTChar 378.672,719.958,399.403,762.526 matrix=[0.75,0.00,0.00,0.75, (378.67,761.84)] font='BAAAAA+EyesomeScript' adv=27.617769513000002 text='a'>
1: <LTChar 399.400,719.958,418.343,762.526 matrix=[0.75,0.00,0.00,0.75, (399.40,761.84)] font='BAAAAA+EyesomeScript' adv=25.235949555 text='p'>
1: <LTChar 418.341,719.958,439.625,762.526 matrix=[0.75,0.00,0.00,0.75, (418.34,761.84)] font='BAAAAA+EyesomeScript' adv=28.3549995 text='y'>
1: <LTChar 439.618,719.958,452.389,762.526 matrix=[0.75,0.00,0.00,0.75, (439.62,761.84)] font='BAAAAA+EyesomeScript' adv=17.0129997 text=' '>
1: <LTChar 150.514,707.298,167.303,727.723 matrix=[0.75,0.00,0.00,0.75, (150.51,715.06)] font='CAAAAA+Garet-Book' adv=22.366619178 text='C'>
1: <LTChar 171.038,707.298,182.537,727.723 matrix=[0.75,0.00,0.00,0.75, (171.04,715.06)] font='CAAAAA+Garet-Book' adv=15.319229437 text='L'>
1: <LTChar 186.273,707.298,192.135,727.723 matrix=[0.75,0.00,0.00,0.75, (186.27,715.06)] font='CAAAAA+Garet-Book' adv=7.809269713000001 text='I'>
1: <LTChar 195.872,707.298,208.311,727.723 matrix=[0.75,0.00,0.00,0.75, (195.87,715.06)] font='CAAAAA+Garet-Book' adv=16.570889390999998 text='E'>
1: <LTChar 212.046,707.298,227.753,727.723 matrix=[0.75,0.00,0.00,0.75, (212.05,715.06)] font='CAAAAA+Garet-Book' adv=20.924489231 text='N'>
1: <LTChar 231.488,707.298,244.131,727.723 matrix=[0.75,0.00,0.00,0.75, (231.49,715.06)] font='CAAAAA+Garet-Book' adv=16.842989381 text='T'>
1: <LTChar 247.866,707.298,253.340,727.723 matrix=[0.75,0.00,0.00,0.75, (247.87,715.06)] font='CAAAAA+Garet-Book' adv=7.292279732000001 text=' '>
1: <LTChar 257.078,707.298,262.940,727.723 matrix=[0.75,0.00,0.00,0.75, (257.08,715.06)] font='CAAAAA+Garet-Book' adv=7.809269713000001 text='I'>
1: <LTChar 266.677,707.298,282.384,727.723 matrix=[0.75,0.00,0.00,0.75, (266.68,715.06)] font='CAAAAA+Garet-Book' adv=20.924489231 text='N'>
1: <LTChar 286.118,707.298,298.761,727.723 matrix=[0.75,0.00,0.00,0.75, (286.12,715.06)] font='CAAAAA+Garet-Book' adv=16.842989381 text='T'>
1: <LTChar 302.497,707.298,316.447,727.723 matrix=[0.75,0.00,0.00,0.75, (302.50,715.06)] font='CAAAAA+Garet-Book' adv=18.584429317 text='A'>
1: <LTChar 320.182,707.298,334.112,727.723 matrix=[0.75,0.00,0.00,0.75, (320.18,715.06)] font='CAAAAA+Garet-Book' adv=18.557219318 text='K'>
1: <LTChar 337.847,707.298,350.286,727.723 matrix=[0.75,0.00,0.00,0.75, (337.85,715.06)] font='CAAAAA+Garet-Book' adv=16.570889390999998 text='E'>
1: <LTChar 354.022,707.298,359.495,727.723 matrix=[0.75,0.00,0.00,0.75, (354.02,715.06)] font='CAAAAA+Garet-Book' adv=7.292279732000001 text=' '>
1: <LTChar 363.233,707.298,374.671,727.723 matrix=[0.75,0.00,0.00,0.75, (363.23,715.06)] font='CAAAAA+Garet-Book' adv=15.237599440000002 text='F'>
1: <LTChar 378.407,707.298,395.768,727.723 matrix=[0.75,0.00,0.00,0.75, (378.41,715.06)] font='CAAAAA+Garet-Book' adv=23.12849915 text='O'>
1: <LTChar 399.502,707.298,413.473,727.723 matrix=[0.75,0.00,0.00,0.75, (399.50,715.06)] font='CAAAAA+Garet-Book' adv=18.611639316 text='R'>
1: <LTChar 417.208,707.298,436.019,727.723 matrix=[0.75,0.00,0.00,0.75, (417.21,715.06)] font='CAAAAA+Garet-Book' adv=25.060409079 text='M'>
Been having a similar problem and diving deep into the documentation I think the most likely solution to your issue is the LAParams(). Changing the char_margin/line_margin isn't enough. You need to also change the all_texts to True (default is False).
LTPage performs an analysis of the elements that are present in your PDF. It it finds multiple elements that meets certain criteria (the criteria is set by the LAParams()) it will group elements in your document together to form LTTextBox and LTTextLine and such. Certain objects in PDF are interpreted to be images (usually check marks but if it's a fillable form sometimes text inputs, and even the individual characters, can be interpreted as images).