I would like to configure Presto to query ORC files in AWS S3. According to the documentation, a Hive Metastore must be configured. This metastore holds the schemas of the tables. Since I have existing ORC files, is there any way to populate this metastore automatically with the schemas of the already existing tables, which should be embedded in the ORC files?
Thanks in advance for your help!
This is totally possible, but it may occasionally fail if the ORC writer is not compatible with Trino (formerly known as PrestoSQL). This is rather unlikely, but worth noting.
The first step is to get the schema right. You can do this by printing the ORC schema with the uber `orc-tools` jar and its `meta` command (see https://orc.apache.org/docs/java-tools.html). You will want the `Type` struct from that output, and you should verify that the schemas are aligned across all of your ORC files.
The next step is to translate the `Type` struct into a `CREATE TABLE` statement.

Once the table is created you should be able to start querying, unless your table is partitioned. In that case you will also need to sync the Hive metastore (or possibly the file metastore) to add the existing partition locations to the metastore.
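As a sketch, assuming a `hive` catalog, a `default` schema, columns resembling the TPC-H `nation` table, and a hypothetical S3 prefix (all of these names are placeholders for your own setup):

```sql
-- External table pointing Trino's Hive connector at existing ORC files.
CREATE TABLE hive.default.nation (
    nationkey bigint,
    name      varchar,
    regionkey bigint,
    comment   varchar
)
WITH (
    external_location = 's3a://my-bucket/path/to/nation/',
    format = 'ORC'
);
```

The column names and types must match the `Type` struct you extracted from the ORC files.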
Assuming you had a partition on the `nationkey` column above, your `CREATE TABLE` statement would look a little different. You would then want to sync the partitions, which can be done with the `system.sync_partition_metadata` procedure.
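A sketch of the partitioned variant, with the same placeholder names as before (note that in the Hive connector, partition columns must be listed last):

```sql
-- Same external table, but partitioned on nationkey.
-- Partition columns must appear last in the column list.
CREATE TABLE hive.default.nation (
    name      varchar,
    regionkey bigint,
    comment   varchar,
    nationkey bigint
)
WITH (
    external_location = 's3a://my-bucket/path/to/nation/',
    format = 'ORC',
    partitioned_by = ARRAY['nationkey']
);
```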
This will look something like this:
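A sketch of the sync call, assuming the catalog is named `hive` and using the same placeholder schema and table names:

```sql
-- Scan the table's location and add/remove partitions so the
-- metastore matches what is actually on S3.
CALL hive.system.sync_partition_metadata(
    schema_name => 'default',
    table_name => 'nation',
    mode => 'FULL'
);
```

The `mode` argument can be `'ADD'`, `'DROP'`, or `'FULL'` depending on whether you want to only add new partitions, only drop missing ones, or both.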
There is also a demo of doing this in the Trino Community Broadcast (back when Trino was still called PrestoSQL).