Why is dictionary page offset 0 for `plain_dictionary` encoding?

The Parquet file was generated by Spark v2.4 (parquet-mr v1.10):

import numpy as np
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

n = 10000
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é', u'a', None, u'a'] * n
z = np.random.rand(len(x)).tolist()
schema = StructType([StructField('x', DoubleType(), True),
                     StructField('y', StringType(), True),
                     StructField('z', DoubleType(), False)])
# `spark` is the active SparkSession (e.g. from the pyspark shell)
dfs = spark.createDataFrame(list(zip(x, y, z)), schema=schema)
dfs.repartition(1).write.mode('overwrite').parquet('test_spark.parquet')

Inspecting the file with parquet-tools v1.12 gives:

row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:864/16573/19.18 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
z:  DOUBLE SNAPPY DO:0 FPO:2500 SZ:560097/560067/1.00 VC:70000 ENC:PLAIN,BIT_PACKED ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000

    y TV=70000 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000] SZ:16514 VC:70000

    z TV=70000 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:                   DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0] SZ:560000 VC:70000

Question:

Should the FPO (first data page offset) always be larger or smaller than the DO (dictionary page offset)? I read somewhere that the dictionary page is stored after the data pages.

Columns x and y use plain_dictionary encoding. Why, then, is the dictionary page offset 0 for both of them?

If I inspect the file with pyarrow v0.11.1, which uses parquet-cpp v1.5.1, it reports has_dictionary_page: False and dictionary_page_offset: None.

Does it have a dictionary page or not?

Answer:

When a column chunk has a dictionary page, the offset of the first data page is always larger than the offset of the dictionary page. In other words, the dictionary comes first and the data pages follow. Two metadata fields are meant to store these offsets: dictionary_page_offset (shown as DO) and data_page_offset (shown as FPO). Unfortunately, parquet-mr does not fill in these fields correctly.

For example, if the dictionary starts at offset 1000 and the first data page starts at offset 2000, then the correct values would be:

  • dictionary_page_offset = 1000
  • data_page_offset = 2000

Instead, parquet-mr stores

  • dictionary_page_offset = 0
  • data_page_offset = 1000

Applied to your example, this means that in spite of parquet-tools showing DO: 0, columns x and y are dictionary encoded nonetheless (column z is not).
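
One way to tell whether a column chunk is dictionary encoded without trusting the offset is to look at its encoding list, the same ENC field that appears in the parquet-tools output in the question. The following is a minimal sketch in Python, not a definitive check. The helper names has_dictionary_encoding and report are made up for illustration, and report assumes a pyarrow version whose column chunk metadata exposes encodings and path_in_schema.

```python
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def has_dictionary_encoding(encodings):
    """Return True if any listed encoding is a dictionary encoding."""
    return any(e in DICTIONARY_ENCODINGS for e in encodings)


def report(path):
    """Print, per column chunk, whether the footer lists a dictionary encoding."""
    # Imported lazily so the helper above also works without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            print(chunk.path_in_schema, has_dictionary_encoding(chunk.encodings))


# Encoding lists copied from the parquet-tools output in the question.
print(has_dictionary_encoding(["RLE", "BIT_PACKED", "PLAIN_DICTIONARY"]))  # columns x and y
print(has_dictionary_encoding(["PLAIN", "BIT_PACKED"]))                    # column z
```

The two print calls give True for x and y and False for z. Note that the encoding list covers the whole chunk, so a chunk that starts out dictionary encoded and later falls back to plain encoding would list both; the check only says that at least some pages use a dictionary.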

It is worth mentioning that Impala follows the specification correctly, so you cannot rely on every file having this deficiency.

This is how parquet-mr handles this situation during reading:

// TODO: this should use getDictionaryPageOffset() but it isn't reliable.
if (f.getPos() != meta.getStartingPos()) {
  f.seek(meta.getStartingPos());
}

where getStartingPos is defined as:

/**
 * @return the offset of the first byte in the chunk
 */
public long getStartingPos() {
  long dictionaryPageOffset = getDictionaryPageOffset();
  long firstDataPageOffset = getFirstDataPageOffset();
  if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
    // if there's a dictionary and it's before the first data page, start from there
    return dictionaryPageOffset;
  }
  return firstDataPageOffset;
}

You can see these lines of code in context here: ParquetFileReader.readDictionary, ColumnChunkMetaData.getStartingPos.
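
For readers working outside parquet-mr, the same fallback can be sketched in Python. This is a transcription of the getStartingPos logic quoted above, not part of any library. The function name starting_pos is made up, and treating None like 0 (pyarrow reports dictionary_page_offset: None in the question) is an assumption added for illustration.

```python
def starting_pos(dictionary_page_offset, first_data_page_offset):
    """Offset of the first byte of a column chunk.

    Tolerates a dictionary offset that is 0 or None (assumption: None is
    treated the same as 0, i.e. as "not recorded").
    """
    if dictionary_page_offset and 0 < dictionary_page_offset < first_data_page_offset:
        # A dictionary offset is recorded and precedes the first data page.
        return dictionary_page_offset
    return first_data_page_offset


print(starting_pos(1000, 2000))  # footer that follows the specification
print(starting_pos(0, 1000))     # footer as written by parquet-mr in the example
print(starting_pos(None, 1000))  # dictionary offset reported as missing
```

All three calls return 1000, so a reader that seeks to this position lands on the dictionary page whether or not the footer follows the specification.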