My logs look like this:
client_id;event_1;event_2;event3
And i would like to get an SQL Table like this:
client_id | event
---------------------
... | event_1
... | event_2
... | event_3
I am new to Hive, it seems to me that one log line always provides one entry in the resulting SQL table. I tried the following (unsuccessful):
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
client_id String,
`event` String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\;]+);.*([^\;]+).*$" )
LOCATION 's3://myBucket/prefix/';
It takes only the first event and ignore the others...
Unfortunately, it is not possible to generate rows using SerDe in table DDL. It's possible to do the same in Hive.
(1) Read all user events as a single column:
Check, it should read two columns, user_id and all events concatenated:
'client_id'
and'event_1;event_2;event3'
(2) Split events and explode to generate rows:
Read also about Lateral View.
In Athena use UNNEST with CROSS JOIN: