using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable

65 Views Asked by At

Follow-up from my last question: pyjanitor pivot_longer multiple sets of columns with common grouping variable and id column

In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).

Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.

The dataset

df = pd.DataFrame({

    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate':[3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})

Desired output

The header of my desired output is the following:

id category rating estimate type
1 first 1.0 1.2 'red'
2

There are 2 best solutions below

2
Shaido On BEST ANSWER

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:

def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first":"_first", "second":"_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]

This will give the columns:

id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third

Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:

import janitor

(df
.pivot_longer(
    column_names="*_*", 
    names_to = (".value", "category"), 
    names_sep="_")
)
0
sammywemmy On

pivot_longer supports multiple .value - you can take advantage of that to reshape your dataframe, using the names_sep parameter:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index='id',
    names_to = (".value", "category", ".value"),
    names_sep = "(first|second|third)")
)
   id category  rating  estimate    type
0   1    first     1.0       1.2     red
1   1    first     2.0       2.4   green
2   1    first     3.0       2.8    blue
3   1   second     2.8       2.4     red
4   1   second     2.9       3.0   green
5   1   second     2.2       2.4  yellow
6   1    third     3.4       3.4     red
7   1    third     3.8       3.8     red
8   1    third     2.9       2.9    blue

If you look at the breakdown you can see what's going on here :

df.columns[1:].str.split("(first|second|third)")
Index([   ['rating', 'first', ''],   ['rating', 'second', ''],
          ['rating', 'third', ''],  ['', 'first', 'estimate'],
       ['', 'second', 'estimate'],  ['', 'third', 'estimate'],
            ['', 'first', 'type'],     ['', 'second', 'type'],
            ['', 'third', 'type']],
      dtype='object')

Note how we have three entries, and one of them is an empty string. This matches our names_to argument -> (".value", "category", ".value"); once pivot_longer matches, it then in the final output combine the .values into one, so for ['rating', 'first', ''], it pulls out ('rating', ''), and finally lumps them into one -> rating, and on and on for the rest of the columns.

Another option is with pd.stack:

temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()

   id category  estimate    type  rating
0   1    first       1.2     red     1.0
1   1   second       2.4     red     2.8
2   1    third       3.4     red     3.4
3   1    first       2.4   green     2.0
4   1   second       3.0   green     2.9
5   1    third       3.8     red     3.8
6   1    first       2.8    blue     3.0
7   1   second       2.4  yellow     2.2
8   1    third       2.9    blue     2.9