There are at least two different ways of creating a Hive table backed by Avro data:
Creating a table based on an Avro schema (in this example, stored in HDFS):
CREATE TABLE users_from_avro_schema
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user.avsc');
Creating a table by specifying Hive columns explicitly with the STORED AS AVRO clause:
CREATE TABLE users_stored_as_avro (id INT, name STRING) STORED AS AVRO;
Am I correct that in the first case the metadata of the users_from_avro_schema table is not stored in the Hive Metastore, but inferred by the SerDe class reading the Avro schema file? Or is the table metadata stored in the Metastore when the table is created, and if so, what is the policy for synchronising the Hive metadata with the Avro schema? I mean both cases:
- updating table metadata (adding/removing columns) and
- updating the Avro schema by changing the avro.schema.url property.
In the second case, when I call DESCRIBE FORMATTED users_stored_as_avro, there is no avro.schema.* property defined, so I don't know which Avro schema is used to read/write data. Is it generated dynamically based on the table's metadata stored in the Metastore?
This fragment of the Programming Hive book discusses inferring column information from the SerDe class, but on the other hand HIVE-4703 removes the "from deserializer" text from column comments. How can I then check what the source of the column types is for a given table (Metastore or Avro schema)?
I decided to publish a complementary answer to those given by @DuduMarkovitz.
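To make the code examples more concise, let's clarify that the STORED AS AVRO clause is equivalent to these three lines:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
Let's then take a look at what happens when we create a table that references an Avro schema stored in HDFS. Here is the schema: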
We create our table with the following command:
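A minimal sketch of such a command, assuming the table name and schema location from the question (users_from_avro_schema and hdfs:///user/root/avro/schema/user.avsc); the names used in the original answer may differ:

```sql
-- No column list: the AvroSerDe derives the columns from the schema file.
-- Table name and HDFS path are taken from the question and are assumptions here.
CREATE TABLE users_from_avro_schema
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user.avsc');
```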
Hive has inferred the schema properly, which we can see by calling:
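For illustration, the call could be as simple as the following, again assuming the table name from the question:

```sql
-- Lists the columns as Hive resolves them for this table.
DESCRIBE users_from_avro_schema;

-- The FORMATTED variant additionally prints the table parameters,
-- including avro.schema.url when it is set.
DESCRIBE FORMATTED users_from_avro_schema;
```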
Hive Metastore shows us the same (I use @DuduMarkovitz's query):
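The referenced query is not reproduced here. As a rough sketch of what such a lookup can look like, assuming the Metastore is backed by a relational database with the standard schema tables (DBS, TBLS, SDS, COLUMNS_V2), the table lives in the default database, and the table name is the one from the question:

```sql
-- Run against the Metastore's backing database (e.g. MySQL), not inside Hive.
SELECT c.INTEGER_IDX, c.COLUMN_NAME, c.TYPE_NAME
FROM   DBS d
JOIN   TBLS t       ON t.DB_ID = d.DB_ID
JOIN   SDS s        ON s.SD_ID = t.SD_ID
JOIN   COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE  d.NAME = 'default'                       -- assumed database name
  AND  t.TBL_NAME = 'users_from_avro_schema'    -- assumed table name
ORDER BY c.INTEGER_IDX;
```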
So far, so good; everything works as we expect. But let's see what happens when we update the avro.schema.url property to point to the next version of our schema (user_v2.avsc), which is as follows:
We simply added another field called email.
Now we update a table property pointing to the avro schema in hdfs:
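A sketch of that update, assuming the table name from the question and assuming user_v2.avsc sits in the same HDFS directory as the first schema version:

```sql
-- Repoint the table at the new schema version; the path is an assumption.
ALTER TABLE users_from_avro_schema
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user_v2.avsc');

-- Confirm which schema file the table references now.
SHOW TBLPROPERTIES users_from_avro_schema('avro.schema.url');
```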
Has the table metadata changed?
Yeah, cool! But do you expect the Hive Metastore to contain this additional column?
Unfortunately, nothing has changed in the Metastore:
I suspect that Hive uses the following strategy for inferring the schema: it first tries to get it from the SerDe class specified for the given table. When the SerDe cannot provide the schema, Hive looks into the Metastore.
Let's check that by removing the avro.schema.url property:
DESCRIBE now shows us the data stored in the Metastore. Let's modify it by adding a column:
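The two steps just described, removing the property and then adding a column, might look like this in HiveQL. The table name users_from_avro_schema is taken from the question, and the phone column (with an assumed STRING type) is the one compared with email later in this answer:

```sql
-- Step 1: drop the schema reference so the table no longer points at an Avro schema file.
ALTER TABLE users_from_avro_schema UNSET TBLPROPERTIES ('avro.schema.url');

-- Inspect the columns Hive reports once the property is gone.
DESCRIBE users_from_avro_schema;

-- Step 2: add a column through Hive DDL; the STRING type is an assumption.
ALTER TABLE users_from_avro_schema ADD COLUMNS (phone STRING);
```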
This of course changes the Hive Metastore:
But when we set avro.schema.url back to user_v2.avsc, what is in the Hive Metastore doesn't matter any more:
The Avro schema takes precedence over the Metastore.
The above example shows that we should avoid mixing Hive schema changes with Avro schema evolution; otherwise we can easily end up with a big mess and an inconsistency between the Hive Metastore and the actual schema used while reading and writing data. The first inconsistency occurs when we change our Avro schema definition by updating the avro.schema.url property, but we can live with that if we are aware of Hive's strategy for inferring the schema. I haven't checked in Hive's source code whether my suspicions about the schema logic are correct, but the above example convinces me of what happens underneath.
I extended my answer to show that, even when there is a conflict between the Avro schema and the Hive Metastore, data which complies with the Avro schema can be read. Please have another look at my example above. Our table definition points to an Avro schema with three fields:
whereas in Hive Metastore there are the following columns:
email vs phone
Let's create an Avro file containing a single user record that complies with the user_v2.avsc schema. This is its JSON representation:
To create the Avro file we call:
We are able to query our table despite the fact that the Hive Metastore doesn't contain the email column and contains the phone column instead: