How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?

1k Views Asked by At

Why do int96 timestamps not work for me?

I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated.

What did I try?

Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe for serialization to parquet. (The exact hive version that is used by AWS is unknown.) While reading the hive code, I came across the following config switch: hive.parquet.write.int64.timestamp. I tried to apply this config switch by changing the Serde parameters in the AWS Glue table config: AWS Glue Serde parameters config Unfortunately, this did not make a difference and my timestamp column is still stored as int96 (checked by downloading a file from S3 and inspecting it with parq my-file.parquet --schema)

1

There are 1 best solutions below

0
On

While I was not able to make Firehose write int64 timestamps, I found a workaround to convert the int96 timestamps returned by the S3 Select query result into something useful.

I used the approach described in

to write the following conversion function in JavaScript:

const hideTimePart = BigInt(64);
const maskToHideJulianDayPart = BigInt('0xffffffffffffffff');
const unixEpochInJulianDay = 2_440_588;
const nanoSecsInOneSec = BigInt(1_000_000_000);
const secsInOneDay = 86_400;
const milliSecsInOneSec = 1_000;

export const parseS3SelectParquetTimeStamp = (ts: string) => {
  const tsBigInt = BigInt(ts);

  const julianDay = Number(tsBigInt >> hideTimePart);
  const secsSinceUnixEpochToStartOfJulianDay = (julianDay - unixEpochInJulianDay) * secsInOneDay;

  const nanoSecsSinceStartOfJulianDay = tsBigInt & maskToHideJulianDayPart;
  const secsSinceStartOJulianDay = Number(nanoSecsSinceStartOfJulianDay / nanoSecsInOneSec);

  return new Date(
    (secsSinceUnixEpochToStartOfJulianDay + secsSinceStartOJulianDay) * milliSecsInOneSec,
  );
};

parseS3SelectParquetTimeStamp('45377606915595481758988800'); // Result: '2022-12-11T20:58:33.000Z'

Note, unlike expected, the timestamps returned by S3 Select stores the julian day part at the beginning and not in the last 4 bytes. The nano sec time part is stored in the last 8 bytes. Furthermore, the byte order is not reversed.

(Regarding the julian day constant 2440588: Using 2440587.5 would be wrong in this context according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html)