I have the following emp table in the Hive testing database:
id  name   sal     dno  dname
1   ram    2000.0  101  market
2   shyam  3000.0  102  IT
3   sam    4000.0  103  finance
4   remo   1000.0  103  finance
I want to transpose this table in PySpark, keeping the first two columns as they are and stacking the last three columns.
I have done the following in the PySpark shell:
from pyspark.sql import Row

test = sqlContext.sql("select * from testing.emp")
data = test.flatMap(lambda row: [Row(id=row['id'], name=row['name'], column_name=col, column_val=row[col])
                                 for col in ('sal', 'dno', 'dname')])
emp = sqlContext.createDataFrame(data)
emp.registerTempTable('mytempTable')
sqlContext.sql('create table testing.test(id int,name string,column_name string,column_val int) row format delimited fields terminated by ","')
sqlContext.sql('INSERT INTO TABLE testing.test select * from mytempTable')
The expected output is:
1 ram sal 2000
1 ram dno 101
1 ram dname market
2 shyam sal 3000
2 shyam dno 102
2 shyam dname IT
3 sam sal 4000
3 sam dno 103
3 sam dname finance
4 remo sal 1000
4 remo dno 103
4 remo dname finance
But the output I get is:
NULL 2000.0 1 NULL
NULL NULL 1 NULL
NULL NULL 1 NULL
NULL 3000.0 2 NULL
NULL NULL 2 NULL
NULL NULL 2 NULL
NULL 4000.0 3 NULL
NULL NULL 3 NULL
NULL NULL 3 NULL
NULL 1000.0 4 NULL
NULL NULL 4 NULL
NULL NULL 4 NULL
Also, please let me know how I can loop over the columns if the table has many of them.
Sorry, I just noticed it is a Hive table.
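The NULLs most likely come from two things: Row(**kwargs) sorts its fields alphabetically, so the columns of mytempTable are (column_name, column_val, id, name) and the positional 'select *' insert misaligns them; and column_val is declared int while dname holds strings. Below is a minimal PySpark sketch that fixes both and shows how to loop over many columns (table and column names are taken from the question; everything else is illustrative):

from pyspark.sql import Row

test = sqlContext.sql("select * from testing.emp")

# Columns to stack; for a wide table, derive the list from the schema
# instead: [c for c in test.columns if c not in ('id', 'name')]
cols_to_stack = ['sal', 'dno', 'dname']

# Cast every stacked value to str so column_val has one consistent type
# (sal is a float, dno an int, dname a string).
data = test.rdd.flatMap(lambda row: [Row(id=row['id'],
                                         name=row['name'],
                                         column_name=c,
                                         column_val=str(row[c]))
                                     for c in cols_to_stack])

# Select the columns explicitly; Row(**kwargs) sorts fields alphabetically,
# so never rely on 'select *' and positional insertion here.
emp = sqlContext.createDataFrame(data).select('id', 'name', 'column_name', 'column_val')
emp.registerTempTable('mytempTable')

# column_val must be string in the target table, because dname holds text.
sqlContext.sql('DROP TABLE IF EXISTS testing.test')
sqlContext.sql('CREATE TABLE testing.test (id int, name string, column_name string, '
               "column_val string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
sqlContext.sql('INSERT INTO TABLE testing.test '
               'SELECT id, name, column_name, column_val FROM mytempTable')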
And in HIVE:
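A HiveQL sketch of the same unpivot, using LATERAL VIEW with explode over a map (table and column names as above; Hive's map() needs all values cast to a single type, hence the casts):

INSERT INTO TABLE testing.test
SELECT id, name, kv.column_name, kv.column_val
FROM testing.emp
LATERAL VIEW explode(map(
    'sal',   cast(sal AS string),
    'dno',   cast(dno AS string),
    'dname', dname
)) kv AS column_name, column_val;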
P.S. You can also do it using only SQL, without Spark, like this:
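One way that needs nothing beyond plain SQL is a UNION ALL, one SELECT per stacked column. It is verbose, but this is exactly the part you would generate in a loop for a wide table:

SELECT id, name, 'sal' AS column_name, cast(sal AS string) AS column_val FROM testing.emp
UNION ALL
SELECT id, name, 'dno' AS column_name, cast(dno AS string) AS column_val FROM testing.emp
UNION ALL
SELECT id, name, 'dname' AS column_name, dname AS column_val FROM testing.emp;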
For your question about small parquet files:
The number of small parquet files equals the number of DataFrame partitions.
You can use df.coalesce() or df.repartition() to decrease the number of DataFrame partitions.
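For example (the output path and partition count here are placeholders):

# coalesce() narrows to fewer partitions without a full shuffle;
# repartition() would force a shuffle but balance the data evenly.
df.coalesce(4).write.mode('overwrite').parquet('/path/to/output')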
But I am not sure whether there is hidden trouble in reducing the DataFrame to only one partition (e.g. OOM?).
And there is another way to combine small files without Spark: just use Hive SQL.
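For example, rewriting the table with Hive's merge settings enabled lets Hive combine the small files on the way out (the size thresholds below are illustrative byte counts):

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=256000000;

INSERT OVERWRITE TABLE testing.test
SELECT * FROM testing.test;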