I am trying to convert datafarame to desired output format with requirements mentioned below.
Provided requirements:
- Each row can only keep one not Nan value (except Trh1 and Trh2)
- I want to avoid methods that iterate over each row for performance reasons.
- I have only included four columns, for example, in a real scenario there are many more columns to share
Example:
Input:
| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
|---|---|---|---|---|---|---|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | 0.05 | NaN |
| 3 | schema_4 | col_4 | NaN | NaN | 0.06 | 0.07 |
Expected output:
| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
|---|---|---|---|---|---|---|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | NaN | NaN |
| 3 | schema_3 | col_3 | NaN | NaN | 0.05 | NaN |
| 4 | schema_4 | col_4 | NaN | NaN | 0.06 | NaN |
| 5 | schema_4 | col_4 | NaN | NaN | NaN | 0.07 |
I explored following approach: Split row into 2 based on condition pandas. However, this approach is only suitable for splitting a row if there are no Nan values in the two columns.
handling a jump
Output:
original answer
You can use reshaping with de-duplication, with
stack/unstack:NB. this requires no duplicates in the initial
cols.Or with
melt, which might be more memory intensive but also more robust if you have duplicates:Output: