original image
import cv2
import pytesseract
import matplotlib.pyplot as plt
from pytesseract import Output

img = cv2.imread('eng2.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['level'])
for i in range(n_boxes):
    # draw a rectangle for every entry returned by image_to_data
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.figure(figsize=(10,10))
plt.imshow(img)
The above code produces this image. The image now contains two kinds of boxes: one around each word and one around the whole text. I would like to get only the coordinates of the whole text (the sentence on each line, or the whole paragraph).
This is what I have tried:
import numpy as np
import pandas as pd

box = pd.DataFrame(d)  # dict to dataframe
box['text'] = box['text'].replace('', np.nan)  # replace empty values by NaN
box = box.dropna(subset=['text'])  # delete rows with NaN
print(box)
def lineup(boxes):
    linebox = None
    for _, box in boxes.iterrows():
        if linebox is None:
            linebox = box  # first line begins
        elif box.top <= linebox.top + linebox.height:  # box in same line
            linebox.top = min(linebox.top, box.top)
            linebox.width = box.left + box.width - linebox.left
            linebox.height = max(linebox.top + linebox.height, box.top + box.height) - linebox.top
            linebox.text += ' ' + box.text
        else:  # box in new line
            yield linebox
            linebox = box  # new line begins
    yield linebox  # return last line

lineboxes = pd.DataFrame.from_records(lineup(box))
Output dataframe
n_boxes = len(lineboxes['level'])
for i in range(n_boxes):
    (x, y, w, h) = (lineboxes['left'][i], lineboxes['top'][i], lineboxes['width'][i], lineboxes['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.figure(figsize=(10,10))
plt.imshow(img)
There seems to be no difference between the original coordinates and the coordinates after joining.
How can I get the coordinates of the whole text (the sentence on each line, or the whole paragraph) using the pytesseract library?
You faced a similar issue in one of your previous questions linked here. I did not fully explain what I meant in the comments, so here is a more visual explanation.
By horizontal kernel I meant an array with a single row, such as [1, 1, 1, 1, 1]. The number of columns can be chosen based on the font size and the spacing between characters and words. Using this kernel with a morphological dilation, you can connect entities that lie next to each other horizontally into a single entity. In your case, we would like to extract each line as an individual entity. Let's go through the code:
Code:
Now there is a black border surrounding the original image. In th, the thresholded image, it becomes a white border. Since it is unwanted, we will remove it using cv2.floodFill().
You can get the coordinates for each line from cv2.boundingRect(). This can be seen in the image above. Using those coordinates you can crop each line in the document and feed it to the pytesseract library.