We just went ahead with a deployment for one of our Hive based table. We renamed our column risk_old to risk_new (renamed). The table is period partitioned. However post deployment, we saw a strange behaviour that all data for that renamed column for all prior periods have been deleted/removed by null. Only newer period data is having values. How is this possible? How can we get back the data.
Sample data prior deployment:
| id | risk_old | period |
|---|---|---|
| 1234 | high | 20230927 |
| 2345 | low | 20230927 |
| 1234 | high | 20230928 |
| 2345 | low | 20230928 |
Steps for deployment:
- Killing old schedule
- Deployment of new code (with changes in code sourcing from new source).
- ALTER statement for renaming column from risk_old to risk_new
- Creating schedule for the new job
Post deployment data:
| id | risk_new | period |
|---|---|---|
| 1234 | 20230927 | |
| 2345 | 20230927 | |
| 1234 | 20230928 | |
| 2345 | 20230928 | |
| 1234 | high | 20230929 |
| 2345 | low | 20230929 |
We are looking on how the new column risk_new can also populate all periods data? We cannot revert back changes since all downstream apps have also changed their code renaming columns to fetch data from our table. One catch in UAT, if we try altering back table name from risk_new back to risk_old, it is showing older data but not the newer one.
Any leads in how to retain the name column name with all periods of data? Thanks in advance
Renaming columns in hive with tables stored as parquet doesnt work properly. Because the schema info is stored in a file and the schema info changes when you fire alter table statement but underlying data doesnt change. This shows odd/old values for that column.
Clean solution would be -
Create a table with new definition from old table and rename columns while selecting
create table newtable as select id, risk_old as risk_new, period from oldtable ;Drop old table
drop table oldtable.Alter new table and rename to oldtable.
alter table newtable rename to oldtable;