Spark DataFrame ORC Hive table reading issue

4.1k Views Asked by Subhasis At 03 July 2018 at 04:10

I am trying to read a Hive table in Spark. Below is the Hive Table format:

# Storage Information       
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde   
InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat    
Compressed: No  
Num Buckets:    -1  
Bucket Columns: []  
Sort Columns:   []  
Storage Desc Params:        
    field.delim \u0001
    serialization.format    \u0001

When I read it through Spark SQL with the command below:

val c = hiveContext.sql("""select  
        a
    from c_db.c cs 
    where dt >=  '2016-05-12' """)
c.show()

I get the following warning:

18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,

The read starts, but it is very slow and runs into network timeouts.

When I try to read the Hive table directory directly, I get the error below.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true") 
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()

org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31, _col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I could convert the Hive table to TextInputFormat, but that would be my last resort, as I would like to keep the benefit of OrcInputFormat for reducing the table size.

I would really appreciate any suggestions.

**Vihit Shah** · Answer 1 · 2018-07-03T06:17:32.637000

I think the table either doesn't have named columns or, if it does, Spark probably isn't able to read the names. You can use the default column names that Spark reports in the error, or set the column names in your Spark code: use printSchema to inspect the schema and toDF to rename the columns. You will still need the mapping from the default names to the real ones, which might require selecting and showing columns individually.
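The renaming suggested above could look roughly like the sketch below. It assumes a Spark 2.x SparkSession with Hive support (the question uses the older HiveContext), and it assumes that the columns in the ORC files appear in the same order as in the metastore definition, with partition columns last. Neither assumption is confirmed by the question, so compare the two schemas before relying on the result.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x or later with Hive support available.
val spark = SparkSession.builder()
  .appName("orc-rename-sketch") // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// Column names as declared in the Hive metastore. This only resolves the
// table definition; it does not scan the data.
val metastoreNames: Array[String] = spark.table("c_db.c").columns

// Read the ORC files directly; the data columns come back as _col0, _col1, ...
val raw = spark.read.orc("/a/warehouse/c_db.db/c")
raw.printSchema()

// toDF renames by position, so the column counts must match and the order
// must be the same on both sides (an assumption, verify it for your table).
require(
  raw.columns.length == metastoreNames.length,
  s"Column count mismatch: ${raw.columns.length} in files, ${metastoreNames.length} in metastore"
)

val named = raw.toDF(metastoreNames: _*)
named.select("a").show()
```

Because the rename is purely positional, a table whose files were written with a different column order would end up silently mislabeled; checking a few known values after the rename is a cheap safeguard.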

**K. Kostikov** · Answer 2 · 2019-03-12T09:45:59.597000

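I found a workaround by reading the table this way:

// Take the column names and types from the Hive metastore definition
val schema = spark.table("db.name").schema

// Apply that schema when reading the ORC files directly from the table path
spark.read.schema(schema).orc("/path/to/table")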

**V.B** · Answer 3 · 2019-12-24T13:02:02.223000

The issue generally occurs with large tables, as the read fails at the maximum field length. I enabled the metastore ORC conversion (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
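For reference, the same setting can also be applied from Scala instead of a SQL set statement. This is a minimal sketch that assumes a Spark 2.x SparkSession with Hive support; with the HiveContext used in the question, the equivalent call would be hiveContext.setConf with the same key and value. The query is the one from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x or later with Hive support available.
val spark = SparkSession.builder()
  .appName("orc-convert-metastore-sketch") // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// Ask Spark to read Hive ORC tables with its own ORC data source
// instead of going through the Hive SerDe.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Same query as in the question.
val c = spark.sql(
  """select a
    |from c_db.c cs
    |where dt >= '2016-05-12'""".stripMargin)
c.show()
```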

**Sreenath Vemireddy** · Answer 4 · 2020-11-17T04:13:40.743000

Setting this conf (set spark.sql.hive.convertMetastoreOrc=true;) works, but it appears to be trying to modify the metadata of the Hive table. Can you please explain what it modifies and whether it affects the table? Thanks.
