Why do int96 timestamps not work for me?
I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated.
What did I try?
Firehose uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
for serialization to parquet. (The exact hive version that is used by AWS is unknown.) While reading the hive code, I came across the following config switch: hive.parquet.write.int64.timestamp
. I tried to apply this config switch by changing the Serde parameters in the AWS Glue table config:
Unfortunately, this did not make a difference and my timestamp column is still stored as int96 (checked by downloading a file from S3 and inspecting it with
parq my-file.parquet --schema
)
While I was not able to make Firehose write int64 timestamps, I found a workaround to convert the int96 timestamps returned by the S3 Select query result into something useful.
I used the approach described in
to write the following conversion function in JavaScript:
Note, unlike expected, the timestamps returned by S3 Select stores the julian day part at the beginning and not in the last 4 bytes. The nano sec time part is stored in the last 8 bytes. Furthermore, the byte order is not reversed.
(Regarding the julian day constant
2440588
: Using2440587.5
would be wrong in this context according to https://docs.oracle.com/javase/8/docs/api/java/time/temporal/JulianFields.html)