I have a dataframe with 55049 rows and 667 columns in it.
Sample dataframe structure as follows:
data = {
'g1': [1],
'g2': [2],
'g3': [3],
'st1_1': [1],
'st1_2': [1],
'st1_3': [1],
'st1_4': [1],
'st1_5': [5],
'st1_6': [5],
'st1_7': [5],
'st1_8': [5],
'st1_Next_1': [8],
'st1_Next_2': [8],
'st1_Next_3': [8],
'st1_Next_4': [8],
'st1_Next_5': [9],
'st1_Next_6': [9],
'st1_Next_7': [9],
'st1_Next_8': [9],
'st2_1': [2],
'st2_2': [2],
'st2_3': [2],
'st2_4': [2],
'st2_5': [2],
'st2_6': [2],
'st2_7': [2],
'st2_8': [2],
'ft_1': [1],
'ft_2': [0],
'ft_3': [1],
'ft_4': [1],
'ft_5': [1],
'ft_6': [0],
'ft_7': [0],
'ft_8': [1]
}
df = pd.DataFrame(data)
print(df)
To get my desired output I have the following code where I am using pd.wide_to_long
ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']
df_long = pd.wide_to_long(
df.reset_index(),
i=['index']+ilist ,
stubnames= stublist,
j='j', sep='_').reset_index()
df_long = df_long[df_long['ft']==1]
Above code is working in fine with expected results.
I perfromed this wide_to_long to apply the filter
df_long[df_long['ft']==1]. which means ft_1 need to apply for all _1, ft_2 for all _2.....and so for all _8.
Problem is to perform wide_to_long operation it took around 2 mins, Since I have 800+ source files to process the whole process is taking 1600 mins which is quite high.
I Am looking for any alternative suggestions to transpose the data.
I have Tried this but didn't work for me with much differenece.
As @sammywemmy suggested, I have tried below code. But output is missing
st1_Next.
ilist = ['g1','g2','g3']
stublist = ['st1','st1_Next','st2','ft']
df_pvot = df.pivot_longer(index=ilist,names_to=stublist,names_pattern=stublist)
print(df_pvot)
Output is missing st1_Next and data clubbing with st1 Instead of new column.
Output:
g1 g2 g3 st1 st2 ft
0 1 2 3 1 2.0 1.0
1 1 2 3 1 2.0 0.0
2 1 2 3 1 2.0 1.0
3 1 2 3 1 2.0 1.0
4 1 2 3 5 2.0 1.0
5 1 2 3 5 2.0 0.0
6 1 2 3 5 2.0 0.0
7 1 2 3 5 2.0 1.0
8 1 2 3 8 NaN NaN
9 1 2 3 8 NaN NaN
10 1 2 3 8 NaN NaN
11 1 2 3 8 NaN NaN
12 1 2 3 9 NaN NaN
13 1 2 3 9 NaN NaN
14 1 2 3 9 NaN NaN
15 1 2 3 9 NaN NaN
One option is with pivot_longer, where you pass the new header names to
names_toand a list of regexes tonames_pattern:Another option is a reshaping of the columns, followed by pd.stack:
this is in response to the updated question - the
names_patternrelies on regexes, and under the hood,pd.Series.str.containsandnp.selectis used to extract and pair the columns with the regexes. As such, the regexes have to be crafted properly to match the columns:You can replicate this using
stack, this time usingpd.Series.str.rsplit(), withn=1.As long as you understand the pattern your columns have, and adapt your code to them, you should be fine.