Extract data from a PDF or image file printed with PDFCreator


Hi all. Maybe you can help me with my project. I'm using PDFCreator as a virtual printer to print some images to a file. The output can be a PDF or any image type, but I need to extract data from it. Can it be done? I'm using C#.

There are 1 best solutions below

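You cannot extract text directly from raster images; that would require OCR.

In principle, you can extract text from PDFs, provided they contain real text and not just scanned page images.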

Here are two methods using Free software command-line utilities; maybe one of them fits your needs:

  1. pdftotext.exe (part of Foolabs' Xpdf utilities)
  2. gswin32c.exe (Artifex's Ghostscript)

Example command lines to extract all text from pages 3-7:

pdftotext:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   "d:\path\to\output.txt"

You want to get the text output to stdout instead of a file? OK, try this:

pdftotext.exe ^
   -f 3 ^
   -l 7 ^
   -eol dos ^
   -layout ^
   "d:\path with spaces\to\input.pdf" ^
   -
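If you want to pick up that stdout text from your own program rather than from a cmd.exe window, here is a minimal sketch of the idea in Python. The function names build_pdftotext_args and extract_text are made up for illustration, and it assumes pdftotext.exe is on the PATH and that its output can be decoded as UTF-8 (adjust the decoding if your build emits a different encoding). The same pattern should carry over to C# with System.Diagnostics.Process and RedirectStandardOutput.

```python
import subprocess


def build_pdftotext_args(pdf_path, first_page, last_page, exe="pdftotext.exe"):
    """Build the argument list for a pdftotext call that writes to stdout.

    'exe' is an assumption: it expects pdftotext.exe to be found on the PATH.
    """
    return [
        exe,
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-eol", "dos",           # DOS-style line endings
        "-layout",               # keep the physical page layout
        pdf_path,
        "-",                     # "-" as output file means stdout
    ]


def extract_text(pdf_path, first_page, last_page, exe="pdftotext.exe"):
    """Run pdftotext and return the extracted text as a string."""
    args = build_pdftotext_args(pdf_path, first_page, last_page, exe)
    # Passing a list instead of one command string avoids quoting
    # problems with paths that contain spaces.
    result = subprocess.run(args, capture_output=True, check=True)
    # Assumption: the output is decoded as UTF-8; undecodable bytes
    # are replaced instead of raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text(r"d:\path with spaces\to\input.pdf", 3, 7) would then return the text of pages 3-7, or raise subprocess.CalledProcessError if pdftotext exits with an error.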

Ghostscript: (Check that your installation has ps2ascii.ps in its lib subdirectory)

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dSIMPLE ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   "c:/path/to/input.pdf" ^
   -dQUIET 

Text output will appear on stdout. If you test this in a cmd.exe window, you can redirect this to a file by appending > /path/to/output.txt to the command.