Is there a way to extract text information from a PostScript file (.ps, .eps)?


I want to extract the text contained in a PostScript image file (the captions of my axis labels). The images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu, but neither produced any useful results. Does anyone know of another method?

Thanks


There are 1 best solutions below

On BEST ANSWER

It's likely that pgplot drew the characters directly as line strokes rather than emitting them as text, especially since pgplot is designed to output to a huge range of devices, including plotters, where you would have to do this.
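One quick way to check this is to look for PostScript's text-painting operators in the file. The sketch below is only a rough heuristic: the function names are made up for illustration, it ignores the fact that a file can define its own abbreviations for these operators, and it may miscount words that appear inside strings. A count of zero strongly suggests the labels were stroked as lines.

```python
import re

# Standard PostScript operators that paint glyphs from a string or glyph name.
TEXT_OPERATORS = {
    "show", "ashow", "widthshow", "awidthshow",
    "kshow", "xshow", "yshow", "xyshow", "glyphshow",
}

_WORD = re.compile(rb"[A-Za-z]+")


def count_text_operators(ps_bytes):
    """Count text-painting operators in raw PostScript (rough heuristic)."""
    counts = {}
    for line in ps_bytes.splitlines():
        # Skip comment lines such as "%!PS" or "%%BoundingBox: ...".
        if line.lstrip().startswith(b"%"):
            continue
        for word in _WORD.findall(line):
            name = word.decode("ascii")
            if name in TEXT_OPERATORS:
                counts[name] = counts.get(name, 0) + 1
    return counts


def count_text_operators_in_file(path):
    """Convenience wrapper: read the file as bytes and count operators."""
    with open(path, "rb") as handle:
        return count_text_operators(handle.read())
```

If this returns an empty dictionary for your plots, tools like ps2ascii have nothing to extract, which would explain the results in the question.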

Edit:

If you have enough plots to be worth the effort, then it's a very simple image processing task. Convert each page to something like TIFF, in monochrome. Threshold the image to binary; the text will then be at the max pixel value (invert the image first if your renderer draws dark ink on a light background).
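The conversion and thresholding steps could look roughly like this. It is a minimal sketch that assumes Ghostscript is installed as `gs`; the function names, the 300 dpi default, and the cutoff of 128 are illustrative choices, not anything pgplot prescribes. The image is represented as a plain list of rows of 0-255 grey values to keep the example dependency-free; in practice you would load the TIFF with an imaging library.

```python
import subprocess


def ghostscript_command(ps_path, tiff_path, dpi=300):
    """Build a Ghostscript command that renders a page to greyscale TIFF."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=tiffgray",          # 8-bit greyscale TIFF output device
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",
        ps_path,
    ]


def render_page(ps_path, tiff_path, dpi=300):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(ghostscript_command(ps_path, tiff_path, dpi), check=True)


def threshold(gray, cutoff=128, ink_is_dark=True):
    """Return a binary image in which 1 marks ink and 0 marks background.

    `gray` is a list of rows of 0-255 values. With dark ink on a light
    page (the usual rendering), pixels below the cutoff become 1.
    """
    if ink_is_dark:
        return [[1 if px < cutoff else 0 for px in row] for row in gray]
    return [[1 if px >= cutoff else 0 for px in row] for row in gray]
```

After `threshold`, the text pixels hold the maximum value (1) regardless of how the page was originally rendered.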

Use a template matching technique. If you have a limited set of possible labels, just match the entire label; you can even start with a template of the correct size and rotation. Then flag each plot as containing label[1-n]; there is no need to read the actual text.
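A brute-force version of whole-label template matching on binary images might look like the following. This is a sketch under simple assumptions: images and templates are lists of rows of 0/1 values, the template already has the right size and rotation, and the names `best_match`, `identify_label`, and the 0.95 score cutoff are hypothetical. A real pipeline would more likely use an image library's normalised cross-correlation.

```python
def match_score(image, template, top, left):
    """Fraction of template pixels that equal the image at this offset."""
    rows, cols = len(template), len(template[0])
    hits = sum(
        image[top + r][left + c] == template[r][c]
        for r in range(rows)
        for c in range(cols)
    )
    return hits / (rows * cols)


def best_match(image, template):
    """Slide the template over the image; return (score, top, left)."""
    rows, cols = len(template), len(template[0])
    best = (-1.0, None, None)
    for top in range(len(image) - rows + 1):
        for left in range(len(image[0]) - cols + 1):
            score = match_score(image, template, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best


def identify_label(image, templates, min_score=0.95):
    """Return the name of the best-matching template, or None.

    `templates` maps a label name to its binary template image.
    """
    best_name, best_score = None, -1.0
    for name, template in templates.items():
        score, _, _ = best_match(image, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None
```

Each plot then gets tagged with the name of whichever label template fits best, without any character recognition.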

If you don't know the label, you can still do OCR fairly easily: extract the region around each axis, rotate the vertical one so the text reads horizontally, and run it through Google's free OCR library (Tesseract).
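The crop-and-rotate step before OCR can be sketched as below. It assumes the image is a list of pixel rows, that the y-axis label sits in a strip along the left edge, and that it reads bottom to top, so a 90-degree clockwise turn makes it horizontal. The function names and the strip width are illustrative; the resulting strip would then be handed to whatever OCR engine you use.

```python
def crop(image, top, left, height, width):
    """Return the sub-image of the given size starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def rotate_clockwise(image):
    """Rotate a list-of-rows image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def y_axis_strip(image, strip_width):
    """Cut the left-hand strip and turn it so the label reads left to right.

    Assumes the vertical label is within `strip_width` pixels of the left
    edge and was drawn reading bottom to top.
    """
    strip = crop(image, 0, 0, len(image), strip_width)
    return rotate_clockwise(strip)
```

The x-axis label needs only the `crop` call, since it is already horizontal.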

If you have pgplot, you can even generate the OCR training set or the template images directly, rather than having to harvest them from your existing images.