Base problem: The PDFTron library's font.MapToUnicode returns a character in the wrong case.
Details: This happens in one particular book; screenshots are attached below. A few characters come out in lower case even though char.char_code is the code for the upper-case letter. As far as I can tell, the problem is in the mapping between the font's characters and glyphs. Please go through the code and the file and help me with this case.
Environment: PDFNetPython3 library, version 9.4.2
The original PDF has some capital letters, but we get lower-case letters on extraction. For example, the PDF text is a capital 'O', but the PDFTron text extractor gives a lower-case 'o'.
Screenshot of the PDF:
Code:
from PDFNetPython3.PDFNetPython import PDFNet, PDFDoc, ElementReader, Element, Point
from PDFNetPython3.PDFNetPython import Font, GState, ColorSpace, PatternColor, PathData


class CharFromPDF:
    def __init__(self):
        pass

    def print_char_from_pdf(self, pdf_file_path):
        PDFNet.Initialize("demo:1691991990538:7c56930b030000000055aed6bf8e4eb6a00bb237070a3797ee21cafe95")
        doc = PDFDoc(pdf_file_path)
        doc.InitSecurityHandler()
        page_reader = ElementReader()
        # Walk every page and print the characters found on it
        itr = doc.GetPageIterator()
        while itr.HasNext():
            page_reader.Begin(itr.Current())
            self.process_elements(page_reader)
            page_reader.End()
            itr.Next()
        doc.Close()
        PDFNet.Terminate()
        print("Done.")

    def process_path(self, reader, path):
        gs = path.GetGState()
        gs_itr = reader.GetChangesIterator()
        while gs_itr.HasNext():
            if gs_itr.Current() == GState.e_fill_color:
                if (gs.GetFillColorSpace().GetType() == ColorSpace.e_pattern and
                        gs.GetFillPattern().GetType() != PatternColor.e_shading):
                    # Process the pattern data
                    reader.PatternBegin(True)
                    self.process_elements(reader)
                    reader.End()
            gs_itr.Next()
        reader.ClearChangeList()

    def process_text(self, page_reader):
        # Begin text element
        element = page_reader.Next()
        while element is not None:
            element_type = element.GetType()
            if element_type == Element.e_text_end:
                # Finish the text block
                return
            elif element_type == Element.e_text:
                gs = element.GetGState()
                font = gs.GetFont()
                if font.GetType() == Font.e_Type3:
                    # Type 3 font: process its glyph data
                    itr = element.GetCharIterator()
                    while itr.HasNext():
                        page_reader.Type3FontBegin(itr.Current())
                        self.process_elements(page_reader)
                        page_reader.End()
                        itr.Next()  # advance, otherwise this loop never ends
                else:
                    itr = element.GetCharIterator()
                    while itr.HasNext():
                        char_code = itr.Current().char_code
                        a = font.MapToUnicode(char_code)
                        print("Char: ", a[0], " ascii code: ", ascii(a[0]), "char_code", char_code,
                              " Font Name: ", font.GetName())
                        itr.Next()
                print("")
            element = page_reader.Next()

    def process_elements(self, reader):
        element = reader.Next()
        while element is not None:
            element_type = element.GetType()
            if element_type == Element.e_path:
                self.process_path(reader, element)
            elif element_type == Element.e_text_begin:
                self.process_text(reader)
            elif element_type == Element.e_form:
                reader.FormBegin()
                self.process_elements(reader)
                reader.End()
            element = reader.Next()


if __name__ == "__main__":
    cfp = CharFromPDF()
    input_file_path = "text_issue.pdf"
    cfp.print_char_from_pdf(pdf_file_path=input_file_path)
In the example above, font.MapToUnicode is given the character code for an upper-case "O", but the function returns the lower-case letter.
We tried TextExtractor from the same library as well, and it returns the same result, only as text.

The font used to display the text has a ToUnicode cmap that maps 'O' (upper case O) to 'o' (lower case o).
The PDF specification says that when extracting text from a PDF, the ToUnicode cmap should be considered first and the font's encoding only after that.
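To make that lookup order concrete, here is a minimal sketch in plain Python (it does not use PDFNet). The CMap fragment is a made-up example that reproduces the mapping described above, not the stream from the attached file; `parse_bfchar` and `code_to_text` are hypothetical helper names, only `bfchar` entries are handled, and WinAnsi is approximated by Python's cp1252 codec for one-byte codes.

```python
import re

# Hypothetical ToUnicode CMap fragment: code 0x4F ('O' in WinAnsi)
# is mapped to U+006F ('o'), as in the font discussed here.
SAMPLE_TOUNICODE = """
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<4F> <006F>
<54> <0054>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Collect the code -> text pairs listed between beginbfchar and endbfchar."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def code_to_text(char_code, to_unicode, encoding="cp1252"):
    """Use the ToUnicode entry if there is one, otherwise the font encoding."""
    if char_code in to_unicode:
        return to_unicode[char_code]
    # Fallback for simple one-byte fonts only (assumption: WinAnsi ~ cp1252).
    return bytes([char_code]).decode(encoding)


to_unicode = parse_bfchar(SAMPLE_TOUNICODE)
print(code_to_text(0x4F, to_unicode))  # ToUnicode entry wins
print(code_to_text(0x41, to_unicode))  # no entry, falls back to the encoding
```

With this fragment, code 0x4F resolves through the ToUnicode entry, while code 0x41 has no entry and is decoded through the encoding fallback.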
It seems that Acrobat ignores the ToUnicode cmap in favor of the font's WinAnsi encoding. Even after fixing the cmap's code space range, Acrobat still ignores it, so this might be Acrobat's particular behavior with WinAnsi encoding (which is not compliant with the PDF specification).
Other PDF readers such as SumatraPDF use the ToUnicode cmap for text extraction, so their output is the same as PDFTron's.
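If you want to find the affected characters in your own output, one possible check is to compare the mapped text with what the character code would give under WinAnsi and flag results that differ only by case. This is a rough sketch, not part of PDFNet: `winansi_char` and `differs_only_by_case` are hypothetical helpers, cp1252 stands in for WinAnsiEncoding, and it only makes sense for simple one-byte fonts that really use that encoding.

```python
def winansi_char(char_code):
    """Decode a one-byte code with cp1252 (assumed stand-in for WinAnsiEncoding)."""
    return bytes([char_code]).decode("cp1252", errors="replace")


def differs_only_by_case(char_code, mapped_text):
    """True when the mapped text matches the WinAnsi character except for its case."""
    expected = winansi_char(char_code)
    return mapped_text != expected and mapped_text.lower() == expected.lower()


# Code 0x4F is 'O' in WinAnsi; a mapped value of 'o' is the symptom from this thread.
print(differs_only_by_case(0x4F, "o"))
print(differs_only_by_case(0x4F, "O"))
```

In the extraction loop from the question, the second argument would be the first element of the value returned by font.MapToUnicode(char_code).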