How to use the Cyrillic alphabet in borb?


I'm trying to use Cyrillic text in borb. It turns out that the fonts borb uses by default do not support the Cyrillic alphabet. I found a simple but slow solution, as well as a fast one, but I ran into problems with the fast one.

I need to modify a PDF document. Since I can't do that in place (it is a highly non-trivial task), I read all the data from the PDF document, change the coordinates of the characters, create a new page, insert all the characters into it, and delete the old page. I need to be able to control the position of each individual character. At first I used a Rectangle and a Paragraph for each character, but this is inefficient and the code takes a long time to run. On the other hand, with this approach I managed to use a font that supports Cyrillic.

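from decimal import Decimal
from pathlib import Path

from borb.pdf.canvas.font.font import Font
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.text.paragraph import Paragraph

# load a TrueType font that contains Cyrillic glyphs
font_path: Path = Path(__file__).parent / "TimesNewRomanRegular.ttf"
custom_font: Font = TrueTypeFont.true_type_font_from_file(font_path)
for s in symb_arr:
    r: Rectangle = Rectangle(
        Decimal(s.x_coord),  # x of the character
        Decimal(s.y_coord),  # y of the character
        Decimal(s.width + 2),  # width, with a little slack
        Decimal(s.height + 2),  # height, with a little slack
    )
    Paragraph(s.sym, font_size=Decimal(font_size), font=custom_font).paint(page, r)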

Here symb_arr is a list of instances of my own Symbol class:

class Symbol:
    def __init__(self, s, x, y, w, h, f):
        self.sym = s  # character
        self.x_coord = x
        self.y_coord = y
        self.width = w
        self.height = h
        self.font_size = f

For my two-paragraph PDF document this code takes about 5 seconds to run, which is quite slow. However, I found a more elegant and faster solution using low-level syntax:

import zlib

from borb.io.read.types import Decimal as bDecimal, Dictionary, Name, Stream

# create content stream: one text object per character
content_stream = Stream()
content = b""
for s in symb_arr:
    content += b"""
        q
        BT
        /F1 %b Tf
        %b %b Td
        (%b) Tj
        ET
        Q
    """ % (
        format(s.font_size, ".4f").encode("utf-8"),
        format(s.x_coord, ".4f").encode("utf-8"),
        format(s.y_coord, ".4f").encode("utf-8"),
        str(s.sym).encode("utf-8"),
    )

# store the raw bytes and their Flate-compressed form
content_stream[Name("DecodedBytes")] = content
content_stream[Name("Bytes")] = zlib.compress(content_stream["DecodedBytes"], 9)
content_stream[Name("Filter")] = Name("FlateDecode")
content_stream[Name("Length")] = bDecimal(len(content_stream["Bytes"]))

# set content of page
page[Name("Contents")] = content_stream

# set Font: reuse the resources (and therefore the fonts) of the original page
page[Name("Resources")] = doc.get_page(0)["Resources"]

This method takes 0.4 seconds for the same example, roughly 12 times faster than the previous one. However, I can't get Cyrillic text to work with it.
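One detail worth guarding against when building `(...) Tj` operands by hand: inside a PDF literal string, a backslash and any unbalanced parenthesis have to be escaped with a backslash. A minimal sketch, using the hypothetical helper names `escape_pdf_literal` and `show_text_op` (neither is part of borb), that simply escapes all three characters unconditionally:

```python
def escape_pdf_literal(data: bytes) -> bytes:
    """Backslash-escape the bytes that are special inside a PDF literal string."""
    out = bytearray()
    for byte in data:
        # "\", "(" and ")" would otherwise end or corrupt the string
        if byte in b"\\()":
            out += b"\\"
        out.append(byte)
    return bytes(out)


def show_text_op(text: bytes) -> bytes:
    """Build a `(...) Tj` operator for already-encoded text bytes."""
    return b"(" + escape_pdf_literal(text) + b") Tj"
```

With this, `show_text_op(b"f(x)")` yields an operand whose parentheses cannot be mistaken for the end of the string.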

I see two potential problems: either the font used does not support Cyrillic, or there is a byte-encoding problem, since the (%b) Tj operator receives raw bytes as input, while UTF-8 encodes each Cyrillic letter with two bytes.

Presumably I can also change the font when using the low-level syntax, but I don't know how. The borb documentation has an example by the author, but it covers this only briefly, so I don't understand how to do the same with my own font (I have a TTF file).

Here the author defines a font using the low-level syntax:
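Both suspicions can be checked in plain Python, without borb. The sketch below uses two hypothetical helpers; Python's `mac_roman` codec serves only as a stand-in for a Latin-only single-byte encoding such as PDF's MacRomanEncoding, and the two may differ in a few code points:

```python
def utf8_byte_lengths(text: str) -> list:
    """Number of UTF-8 bytes used by each character of the text."""
    return [len(ch.encode("utf-8")) for ch in text]


def fits_codec(text: str, codec: str) -> bool:
    """True if every character of the text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Latin letters take one byte each, Cyrillic letters take two
print(utf8_byte_lengths("Aд"))
# a Latin-only single-byte encoding has no slot for Cyrillic letters
print(fits_codec("Привет", "mac_roman"))
```

So a string written as UTF-8 bytes into `(...) Tj` would be read as twice as many one-byte codes by a single-byte font, and such a font's encoding has no codes for Cyrillic letters in the first place.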

# set Font
page[Name("Resources")] = Dictionary()
page["Resources"][Name("Font")] = Dictionary()
page["Resources"]["Font"][Name("F1")] = Dictionary()
page["Resources"]["Font"]["F1"][Name("Type")] = Name("Font")
page["Resources"]["Font"]["F1"][Name("Subtype")] = Name("Type1")
page["Resources"]["Font"]["F1"][Name("Name")] = Name("F1")
page["Resources"]["Font"]["F1"][Name("BaseFont")] = Name("Helvetica")
page["Resources"]["Font"]["F1"][Name("Encoding")] = Name("MacRomanEncoding")

Answer by Joris Schellekens

Disclaimer: I am the author of borb

I suggest you look at the true_type_font_from_file method in true_type_font.py. This method is the starting point for constructing a font from a .ttf file.

If you walk through the code in debug mode, you should see all the dictionaries being created.

If you need further information on the specifics of fonts, you can also check out the PDF specification (which is included in the repository).

Section 9.5 Introduction to Font Data Structures is a good place to start.
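As a rough illustration of why those font dictionaries matter for the content stream: assuming the embedded font ends up as a composite (Type0) font with Identity-H encoding, each character code in a shown string is two bytes long, and is typically written as a hex string such as `<01230145> Tj`. Whether borb builds exactly this structure for a given TTF file is an assumption to verify in the debugger. The helper `to_hex_string` and the code values in the usage line are hypothetical; real codes have to come from the font itself.

```python
def to_hex_string(text: str, char_to_code: dict) -> bytes:
    """Encode text as a PDF hex string of 2-byte character codes.

    char_to_code maps each character to its 16-bit code in the font
    (hypothetical values here; read the real ones from the font).
    """
    codes = [char_to_code[ch] for ch in text]
    return b"<" + b"".join(b"%04X" % code for code in codes) + b">"


# usage with made-up codes for two Cyrillic letters
operand = to_hex_string("Да", {"Д": 0x0123, "а": 0x0145})
print(operand + b" Tj")
```

A character missing from the mapping raises a `KeyError`, which is a reasonable signal that the font has no glyph for it.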