Tesseract OCR works terribly on iOS 7


I don't know whether I'm doing something wrong or the Tesseract library is at fault, but the recognition results are terrible.

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];

// Restrict recognition to these characters
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZéèô" forKey:@"tessedit_char_whitelist"];
// Image to run OCR on
[tesseract setImage:[UIImage imageNamed:@"sampledoc.jpg"]];
[tesseract recognize];

NSLog(@"%@", [tesseract recognizedText]);

[tesseract clear];

This is the sample image I want to extract text from:

[image: sampledoc.jpg]

And this is what I get after running:

THE SILVER CHAIR
by r 5 Lawn
CHAPTER ow
BEHIND THE cm

lr W1C a dull aulumn day and llll Pole vmscrylng ulmo mo gym
She ms clymg because Illey had been bullymg her Hus Is not gmng In baa school oolyl se I
shall say 15 lane is poslble Ibvlll lllrs schwll which lsnol 1 plusinl subjzrl II was Tcr
eduummlr o sdsooV rm bolh boysuld glrlsl Mm used no he cnllcd o wmxodl schonll some
said on wax ml nculy so mixed as an mlndsohhe people whn an n These penple had um mu
m boyund glrlsshauld loeullma mdn who my mo And unlonunalcb mm ml or
mom aflhc hlggzsl bays mo girls liked best was bullying Ihe mm All suns orlllmgsl hound
mmgso went on Much u an nvdmlry saloon wnuld mm bum flwnd om ma snowed m lulfn
R1my hm al Ilus school xhcy vlucrfl Or mu Iflhcy mo mo people who am am wxc not
expellad m pomsloa The mm no they Mile lntntesilng psycholoycnl msxs mdsaul for
them and mm mlhem for hnun Mo Ifyml knew lhe nghl sorlofdnngxmsay In mo um
mo maul result wos um vou became mlhev 1 fmounlelhan olllnrwlsc
no mswmy ml Pole W crymg on ml dull autumn my on me dlmp Vmlc pith Much runs
bellman um um arm gym ma Ihe lhvubbezy mm ole mam nearly nmulea her ay whan
boy came round Ihz oomuonhogym Mxmlmg mm ms lnmlds m ms pocktu I12 mm In
lmo nu
 CuIV yuu look when yolfre gomw ma JIH Fob
Mu nglur sud me km won mam man a and am he mom hen rm ll WV Polef he
not was upv
ml only mndc lung the am you mm mo yodic llymg oo my somclhmg um um Ihn lfyou
spnk you1l smrl ctymg owl
 lfs mum I suww l as mualr sand me hwy Mlmlbx ouggmg ms hlnds nmm mm ms vovkals
ml waded Them wlsw moo forhurm sly llH1hVlIgoCVOllWiIE ooolo have Said u They both
knew
wow laok has said the beyl Wherek no gond us all r
He mezm WEIL am he am mlk mum mo mlnmne begmnmg n lecmne ml suddenly liew mm a
lmxpcr hvmdl Isqnllc Illkcly llllng Io hlppen Ifyou law been mmrupled in n cryl

I

What am I supposed to do?


There are 2 best solutions below

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setImage:chosenImage];
[tesseract recognize];

NSLog(@"%@",[tesseract recognizedText]);

He meant the resolution in pixels per inch (PPI), not the image dimensions.

I rescaled the image from 96 DPI to 300 DPI, and almost all of the text was then recognized correctly. The image definitely needed pre-processing before the OCR step.
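
To make the arithmetic behind that rescale concrete: going from 96 DPI to 300 DPI means enlarging the pixel dimensions by 300 / 96 = 3.125. The sketch below is only an illustration in plain C (which also compiles as Objective-C). The names PixelSize and ocr_target_size are hypothetical and are not part of Tesseract or UIKit.

```c
#include <stddef.h>

/* Hypothetical helper type: image dimensions in pixels. */
typedef struct {
    size_t width;
    size_t height;
} PixelSize;

/* Returns the pixel dimensions an image should be resampled to so that
   content captured at source_dpi ends up at target_dpi.
   Returns {0, 0} if either DPI value is not positive. */
static PixelSize ocr_target_size(size_t width, size_t height,
                                 double source_dpi, double target_dpi)
{
    PixelSize out = {0, 0};
    if (source_dpi <= 0.0 || target_dpi <= 0.0) {
        return out;
    }
    /* Scale factor, e.g. 300 / 96 = 3.125 */
    double factor = target_dpi / source_dpi;
    /* Round to the nearest whole pixel */
    out.width  = (size_t)((double)width  * factor + 0.5);
    out.height = (size_t)((double)height * factor + 0.5);
    return out;
}
```

For example, a hypothetical 800 x 1000 pixel image at 96 DPI would be resampled to 2500 x 3125 pixels. On iOS, one way to do the resampling itself is to draw the UIImage into a bitmap context of that size (for instance with UIGraphicsBeginImageContextWithOptions and drawInRect:) and pass the result to setImage:. How much this helps will depend on the source image.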