Wide date to long data conversion in Python Pandas using column names

217 Views Asked by At

I have a wide dataframe in Python Pandas like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'student_id'  :[100, 101, 102],
                  'q1.cta1_count':[0, 1, 1],
                  'q1.cta1_date' :[np.nan, '2019-03-26', '2019-03-25' ],
                  'q1.cta1.reminder_count' :[0, 1, 0],
                  'q1.cta1.reminder_date'  :[np.nan, '2019-03-31', np.nan]})

df

enter image description here

and I need to convert to the long format so that it looks like this dataframe:

# student_id type               count  date
# 100        q1.cta1            0      NaN
# 100        q1.cta1.reminder   0      NaN
# 101        q1.cta1            1      2019-03-26
# 101        q1.cta1.reminder   1      2019-03-31
# 102        q1.cta1            1      2019-03-25
# 102        q1.cta1.reminder   0      NaN

How can I do that?

3

There are 3 best solutions below

3
On BEST ANSWER

You can use pandas wide to long, but first you need to reshape the columns to fit pandas wide to long format :

df.columns = ["_".join(entry.split("_")[::-1]) 
              if entry.endswith(("count", "date"))
              else entry
              for entry in df.columns]

Now, you can reshape your data :

pd.wide_to_long(df, 
                ["count", "date"], 
                i="student_id", 
                j="type", 
                sep="_", 
                suffix=".+").reset_index()


                                  
  student_id    type                  count       date      
0  100      q1.cta1                 0          NaN
1  101      q1.cta1                 1          2019-03-26
2  102      q1.cta1                 1          2019-03-25
3  100      q1.cta1.reminder        0          NaN
4  101      q1.cta1.reminder        1          2019-03-31
5  102      q1.cta1.reminder        0          NaN

Alternatively, you could use pyjanitor's pivot_longer function, which is built on pandas' melt function, and offers some more flexibility (full disclosure, I am a contributor to the library):

import pyjanitor
df.pivot_longer(index="student_id", 
                names_to=("type", ".value"), 
                names_sep="_")

    student_id  type             count  date
0     100      q1.cta1              0   NaN
1     100      q1.cta1.reminder     0   NaN
2     101      q1.cta1              1   2019-03-26
3     101      q1.cta1.reminder     1   2019-03-31
4     102      q1.cta1              1   2019-03-25
5     102      q1.cta1.reminder     0   NaN
0
On
df = df.set_index('student_id')
df = pd.concat([
    df[['q1.cta1_count', 'q1.cta1_date']].assign(type='q1.cta1').rename(columns={'q1.cta1_date':'date', 'q1.cta1_count':'count'}),
    df[['q1.cta1.reminder_count', 'q1.cta1.reminder_date']].assign(type='q1.cta1_reminder').rename(columns={'q1.cta1.reminder_date':'date', 'q1.cta1.reminder_count':'count'})
]).sort_index().reset_index()
print(df)

Prints:

   student_id  count        date              type
0         100      0         NaN           q1.cta1
1         100      0         NaN  q1.cta1_reminder
2         101      1  2019-03-26           q1.cta1
3         101      1  2019-03-31  q1.cta1_reminder
4         102      1  2019-03-25           q1.cta1
5         102      0         NaN  q1.cta1_reminder
0
On

Solution: From v0.20, melt is a first order function, you can use

import pandas as pd
import numpy as np

df = pd.DataFrame({'student_id'  :[100, 101, 102],
                  'q1.cta1_count':[0, 1, 1],
                  'q1.cta1_date' :[np.nan, '2019-03-26', '2019-03-25' ],
                  'q1.cta1.reminder_count' :[0, 1, 0],
                  'q1.cta1.reminder_date'  :[np.nan, '2019-03-31', np.nan]})

df

df.rename(columns={'q1.cta1_count': 'q1.cta1', 'q1.cta1.reminder_count': 'q1.cta1.reminder'}, inplace=True)

df[['student_id', 'q1.cta1', 'q1.cta1_date']]\
    .melt(id_vars=['student_id', 'q1.cta1_date'], var_name='q1.cta1', value_name='count')\
    .rename(columns={'q1.cta1_date': 'date', 'q1.cta1': 'type'})\
    .append(    
        df[['student_id', 'q1.cta1.reminder', 'q1.cta1.reminder_date']]\
            .melt(id_vars=['student_id', 'q1.cta1.reminder_date'], var_name='q1.cta1.reminder', value_name='count')\
            .rename(columns={'q1.cta1.reminder_date': 'date', 'q1.cta1.reminder': 'type'})\
    , ignore_index=True)\
    .sort_values(['student_id'])[['student_id', 'type', 'count', 'date']]

Prints:

   student_id              type  count        date
0         100           q1.cta1      0         NaN
3         100  q1.cta1.reminder      0         NaN
1         101           q1.cta1      1  2019-03-26
4         101  q1.cta1.reminder      1  2019-03-31
2         102           q1.cta1      1  2019-03-25
5         102  q1.cta1.reminder      0         NaN