pd.wide_to_long in python is slow

Question

pd.wide_to_long in python is slow

59 Views Asked by Roshan Nuvvula At 13 March 2024 at 02:48

I have a dataframe with 55049 rows and 667 columns in it.

Sample dataframe structure as follows:


data = {
    'g1': [1],
    'g2': [2],
    'g3': [3],
    
    'st1_1': [1],
    'st1_2': [1],
    'st1_3': [1],
    'st1_4': [1],
    'st1_5': [5],
    'st1_6': [5],
    'st1_7': [5],
    'st1_8': [5],
    
    'st1_Next_1': [8],
    'st1_Next_2': [8],
    'st1_Next_3': [8],
    'st1_Next_4': [8],
    'st1_Next_5': [9],
    'st1_Next_6': [9],
    'st1_Next_7': [9],
    'st1_Next_8': [9],
    
    'st2_1': [2],
    'st2_2': [2],
    'st2_3': [2],
    'st2_4': [2],
    'st2_5': [2],
    'st2_6': [2],
    'st2_7': [2],
    'st2_8': [2],
    
    'ft_1': [1],
    'ft_2': [0],
    'ft_3': [1],
    'ft_4': [1],
    'ft_5': [1],
    'ft_6': [0],
    'ft_7': [0],
    'ft_8': [1]
}

df = pd.DataFrame(data)
print(df)

To get my desired output I have the following code where I am using pd.wide_to_long

ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']

df_long = pd.wide_to_long(
      df.reset_index(),
      i=['index']+ilist ,
      stubnames= stublist,
      j='j', sep='_').reset_index()

df_long = df_long[df_long['ft']==1]

Above code is working in fine with expected results.

I perfromed this wide_to_long to apply the filter df_long[df_long['ft']==1]. which means ft_1 need to apply for all _1, ft_2 for all _2.....and so for all _8.

Problem is to perform wide_to_long operation it took around 2 mins, Since I have 800+ source files to process the whole process is taking 1600 mins which is quite high.

I Am looking for any alternative suggestions to transpose the data.

I have Tried this but didn't work for me with much differenece.

As @sammywemmy suggested, I have tried below code. But output is missing st1_Next.


ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']


df_pvot = df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
print(df_pvot)

Output is missing st1_Next and data clubbing with st1 Instead of new column.

Output:

    g1  g2  g3  st1  st2   ft
0    1   2   3    1  2.0  1.0
1    1   2   3    1  2.0  0.0
2    1   2   3    1  2.0  1.0
3    1   2   3    1  2.0  1.0
4    1   2   3    5  2.0  1.0
5    1   2   3    5  2.0  0.0
6    1   2   3    5  2.0  0.0
7    1   2   3    5  2.0  1.0
8    1   2   3    8  NaN  NaN
9    1   2   3    8  NaN  NaN
10   1   2   3    8  NaN  NaN
11   1   2   3    8  NaN  NaN
12   1   2   3    9  NaN  NaN
13   1   2   3    9  NaN  NaN
14   1   2   3    9  NaN  NaN
15   1   2   3    9  NaN  NaN

Original Q&A

There are 1 best solutions below

**sammywemmy** · Accepted Answer · 2024-03-13T03:02:25.820000

One option is with pivot_longer, where you pass the new header names to names_to and a list of regexes to names_pattern:

# pip install pyjanitor
import pandas as pd

df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
   g1  g2  g3  st1  st2  ft
0   1   2   3    1    2   1
1   1   2   3    1    2   0
2   1   2   3    1    2   1
3   1   2   3    1    2   1
4   1   2   3    1    2   1
5   1   2   3    1    2   0
6   1   2   3    1    2   0
7   1   2   3    1    2   1

Another option is a reshaping of the columns, followed by pd.stack:

reshaped = df.set_index(ilist)
reshaped.columns = reshaped.columns.str.split('_',expand=True).set_names([None,'drop'])
reshaped.stack(level='drop').droplevel('drop').reset_index()
   g1  g2  g3  st1  st2  ft
0   1   2   3    1    2   1
1   1   2   3    1    2   0
2   1   2   3    1    2   1
3   1   2   3    1    2   1
4   1   2   3    1    2   1
5   1   2   3    1    2   0
6   1   2   3    1    2   0
7   1   2   3    1    2   1

this is in response to the updated question - the names_pattern relies on regexes, and under the hood, pd.Series.str.contains and np.select is used to extract and pair the columns with the regexes. As such, the regexes have to be crafted properly to match the columns:

# pip install pyjanitor
import pandas as pd
import janitor 

# note the inclusion of digits within the regexes
names_pattern = [r'st1_\d+',r'st1_Next',r'st2_\d+', 'ft']

df.pivot_longer(index=ilist,names_to=stublist,names_pattern=names_pattern)
   g1  g2  g3  st1  st1_Next  st2  ft
0   1   2   3    1         8    2   1
1   1   2   3    1         8    2   0
2   1   2   3    1         8    2   1
3   1   2   3    1         8    2   1
4   1   2   3    5         9    2   1
5   1   2   3    5         9    2   0
6   1   2   3    5         9    2   0
7   1   2   3    5         9    2   1

You can replicate this using stack, this time using pd.Series.str.rsplit(), with n=1.

reshaped = df.set_index(ilist)
reshaped.columns = (reshaped
                    .columns
                    .str
                    .rsplit('_',n=1,expand=True)
                    .set_names([None, 'drop'])
                   )

reshaped.stack(level='drop').droplevel('drop').reset_index()
   g1  g2  g3  st1  st1_Next  st2  ft
0   1   2   3    1         8    2   1
1   1   2   3    1         8    2   0
2   1   2   3    1         8    2   1
3   1   2   3    1         8    2   1
4   1   2   3    5         9    2   1
5   1   2   3    5         9    2   0
6   1   2   3    5         9    2   0
7   1   2   3    5         9    2   1

As long as you understand the pattern your columns have, and adapt your code to them, you should be fine.

pd.wide_to_long in python is slow

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in TRANSFORM

Trending Questions

Popular # Hahtags

Popular Questions