hive create table input.regex - filter out all rows starting with a char

513 Views Asked by At

I want to create table in Hive

CREATE TABLE table (
    a     string
   ,b     string
) 
PARTITIONED BY ( pr_filename string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='reg_exp') ;

but source data have multiline header starting with "#"

# <some comment>
#
# <some other comments>
# <some other comments>
# <some other comments>
#
a,b
1,2
8,2
8,9

Is it possible to write reg_exp to filter out all rows starting with chosen character or do I have to use temporary table to deal with this header?

1

There are 1 best solutions below

0
On BEST ANSWER

If you try to filter like this:

'input.regex'='^([^#]+),([a-zA-Z])' --first group is everything except #

The row will be returned anyway with NULLs, you can filter such records.

RegexSerDe JavaDocs says: In deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has less than expected groups, the missing groups will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored

The solution is to use intermediate table + filter rows when selecting from it.