DDIA Thrift CompactProtocol encoding

When I read DDIA page 119, I see that Thrift CompactProtocol encodes an int64 with a variable-length encoding.

Example taken from DDIA

Number in base 10 to be encoded: 1337. It is encoded as 1|111001|0 0|0010100

1st byte: 1|111001|0
          the first 1 signals that more bytes follow after this one
          then 111001 is the actual coding of 37 (I assume they use little endian)
          the last 0 represents the sign +
2nd byte: 0|0010100

My question is: how are 37 and 13 encoded as 111001 and 0010100? Shouldn't they be 100101 and 1101? Thank you

I searched online, but it seems this is so straightforward that nobody has ever asked about it :(

There are 2 best solutions below

JosefZ

Decimal 1337 is hexadecimal 0539 or binary 0000010100111001.

There is a clear and comprehensive explanation and comparison of BinaryProtocol vs. Thrift CompactProtocol in Designing Data-Intensive Applications by Martin Kleppmann (Chapter 4, Encoding and Evolution), found e.g. among the first few results when googling DDIA "Thrift CompactProtocol":

The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see in Figure 4-3, it packs the same information into only 34 bytes. It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.
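To make the quoted description concrete, here is a small sketch that decodes the two CompactProtocol bytes from the figure, 0xF2 0x14, back into 1337. The names `decodeVarint` and `zigzagToLong` are illustrative, not Thrift's actual API:

```scala
object DecodeDemo {
  // Each byte carries 7 payload bits (least-significant group first);
  // the top bit only says whether more bytes follow.
  def decodeVarint(bytes: Seq[Int]): Long =
    bytes.zipWithIndex.map { case (b, i) => (b & 0x7FL) << (7 * i) }.sum

  // Undo ZigZag: shift the sign bit back from the LSB to the front.
  def zigzagToLong(n: Long): Long = (n >>> 1) ^ -(n & 1)

  val raw: Long   = decodeVarint(Seq(0xF2, 0x14)) // 114 + (20 << 7) = 2674
  val value: Long = zigzagToLong(raw)             // 2674 zigzag-decodes to 1337
}
```

So the "two bytes for 1337" in the quote are really two bytes for the ZigZag value 2674, with one bit per byte spent on the continuation flag.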


JensG

Integers are encoded using what is called "ZigZag encoding". Once you've wrapped your head around it, it's actually pretty simple.

def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)     // move the sign bit to the LSB
def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)     // and move it back again
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ -(n & 1)
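For instance, the mapping these functions produce looks like this (the int definitions are repeated here so the snippet runs on its own):

```scala
object ZigZagExamples {
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
  def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)

  // ZigZag interleaves negative and positive values:
  //   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
  val zero    = intToZigZag(0)     // 0
  val minus1  = intToZigZag(-1)    // 1
  val plus1   = intToZigZag(1)     // 2
  val ddia    = intToZigZag(1337)  // 2674, the value actually varint-encoded
  val back    = zigzagToInt(2674)  // 1337, the round trip
}
```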

The theory behind it is that numbers near zero appear in typical data with a much higher probability than values near the limits of the data type. So the sign bit is removed from the front and placed at the end instead, which means smaller numbers need fewer significant bits than larger ones.

The leading zeroes can then be dropped because they are mostly redundant information. The only caveat is that you need a control bit to tell where the variable-length byte sequence ends; that's why the "payload" bits are grouped into seven-bit groups, leaving room for the "control bit".

Hence, when encoding dec 1337 = bin 0000 0101 0011 1001, the sign bit 0 is first appended at the end (giving the ZigZag value 2674 = bin 1010 0111 0010), and that value is split (remember: least-significant group first) into the low 7 bits 1110010 and the remaining 5 bits 10100, which leads to

ctrl | data | sign
    1|111001|0
    0|0010100

or hex F2 14. No idea where that pound sign in the picture comes from, but it should be an F.
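The split just described can be sketched as a tiny encoder (the helper names are hypothetical; real Thrift code is structured differently):

```scala
object CompactSketch {
  def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)

  // Emit 7 payload bits per byte, least-significant group first;
  // set the top bit of every byte except the last as the "more follows" flag.
  def encodeVarint(n: Long): Seq[Int] =
    if ((n >>> 7) == 0) Seq((n & 0x7F).toInt)
    else ((n & 0x7F) | 0x80).toInt +: encodeVarint(n >>> 7)

  val bytes: Seq[Int] = encodeVarint(longToZigZag(1337L)) // Seq(0xF2, 0x14)
  val hex: String     = bytes.map(b => f"$b%02X").mkString(" ")
}
```

Printing `CompactSketch.hex` yields "F2 14", matching the two bytes derived by hand above.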

> I searched online but it seems this is so straightforward that nobody has ever doubted about this :(

Since this is FOSS, there is also sometimes documentation available at completely unexpected and illogical places ;-) such as: https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md or https://protobuf.dev/programming-guides/encoding/