How to convert R's TukeyHSD table into a correlation matrix in Python using pandas

I have recently exported a table from R's TukeyHSD test to obtain the p-values for various time groups (0, 5, 10, 20, 30, 40, 50, 60). I'm curious if there's a method to transform this into a correlation matrix, where each axis represents the time groups and corresponds to the respective p-value.

The table includes an index indicating the correspondence between the different time groups (e.g., 5-10 or 10-50). I've imported it as a dataframe into Python. Is there a way to rearrange the dataframe as depicted below?

        p adj
Groups  
50-0    2.815526e-13
60-0    2.855494e-13
20-0    4.764197e-08
50-5    1.712389e-05
50-10   1.483440e-04
50-40   1.643480e-04
60-5    5.873007e-04
60-10   5.218047e-03
60-40   5.613566e-03
10-0    6.878476e-03
40-0    1.270855e-02
20-5    7.380859e-02
50-20   1.574372e-01
40-20   3.264569e-01
20-10   3.369147e-01
5-0     3.816166e-01
60-50   7.301423e-01
60-20   8.503578e-01
10-5    9.731384e-01
40-5    9.820983e-01
40-10   1.000000e+00

I want it to be something like:

        0    5    10    20    ...
0       ...  ...  ...  ...
5       ...  ...  ...  ...
10      ...  ...  ...  ...
20      ...  ...  ...  ...
...

I haven't found anything similar online, so I don't know where to start.
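For reference, the imported dataframe can be reconstructed like this (a minimal sketch, assuming the group pairs sit in the index and the single column is named "p adj", as in the printout above):

import pandas as pd

# Rebuild the TukeyHSD export: "a-b" group pairs as the index, one "p adj" column.
data = {
    "50-0": 2.815526e-13, "60-0": 2.855494e-13, "20-0": 4.764197e-08,
    "50-5": 1.712389e-05, "50-10": 1.483440e-04, "50-40": 1.643480e-04,
    "60-5": 5.873007e-04, "60-10": 5.218047e-03, "60-40": 5.613566e-03,
    "10-0": 6.878476e-03, "40-0": 1.270855e-02, "20-5": 7.380859e-02,
    "50-20": 1.574372e-01, "40-20": 3.264569e-01, "20-10": 3.369147e-01,
    "5-0": 3.816166e-01, "60-50": 7.301423e-01, "60-20": 8.503578e-01,
    "10-5": 9.731384e-01, "40-5": 9.820983e-01, "40-10": 1.000000e+00,
}
df = pd.DataFrame({"p adj": data})
df.index.name = "Groups"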

2 Answers

Andrej Kesely (Best Answer)

Try:

df[["x", "y"]] = df.index.str.split("-", expand=True).to_frame().astype(int).values
print(pd.crosstab(df["x"], df["y"], df["p adj"], aggfunc="first"))

Prints:

y             0         5         10        20        40        50
x                                                                 
5   3.816166e-01       NaN       NaN       NaN       NaN       NaN
10  6.878476e-03  0.973138       NaN       NaN       NaN       NaN
20  4.764197e-08  0.073809  0.336915       NaN       NaN       NaN
40  1.270855e-02  0.982098  1.000000  0.326457       NaN       NaN
50  2.815526e-13  0.000017  0.000148  0.157437  0.000164       NaN
60  2.855494e-13  0.000587  0.005218  0.850358  0.005614  0.730142
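If a fully symmetric grid is preferred (each p-value at both (a, b) and (b, a)), one possible extension is to reindex the result to a square and mirror it across the diagonal. A sketch, where out, groups, and square are hypothetical names and the p-value for "a-b" is assumed to apply equally to "b-a":

# Make the triangular crosstab square and symmetric (a sketch, not part of the answer above).
out = pd.crosstab(df["x"], df["y"], values=df["p adj"], aggfunc="first")
groups = sorted(set(out.index) | set(out.columns))
square = out.reindex(index=groups, columns=groups)
square = square.combine_first(square.T)  # copy each p-value across the diagonal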
Davinder Singh

Just create a dataframe where:

  • the index is the unique values from the 0th split of the original dataframe's (df) index
  • the columns are the unique values from the 1st split of the original dataframe's (df) index

import numpy as np
import pandas as pd

# Empty frame whose axes are the sorted unique group numbers on each side of "-".
final = pd.DataFrame(
    {},
    index = df.index.map(lambda x: x.split('-')[0]).unique().map(lambda x: int(x)).sort_values(),
    columns = df.index.map(lambda x: x.split('-')[1]).unique().map(lambda x: int(x)).sort_values()
)

# Fill each (i, j) cell by looking up the "i-j" label in the original index;
# pairs that don't appear there are left as NaN.
for i in final.index:
    for j in final.columns:
        try:
            final.loc[i, j] = df.loc[f'{i}-{j}', 'p adj']
        except KeyError:
            final.loc[i, j] = np.nan

Output:

Groups        0         5         10        20        40        50
Groups                                                            
5       0.381617       NaN       NaN       NaN       NaN       NaN
10      0.006878  0.973138       NaN       NaN       NaN       NaN
20           0.0  0.073809  0.336915       NaN       NaN       NaN
40      0.012709  0.982098       1.0  0.326457       NaN       NaN
50      0.281553  0.000017  0.000148  0.157437  0.000164       NaN
60           0.0  0.000587  0.005218  0.850358  0.005614  0.730142

Then iterate over each (i, j) location of the final dataframe and fill the value by building the 'i-j' label on the fly and looking it up in the original dataframe's index.

There is likely a more optimal solution that fully uses pandas vectorization and doesn't need to rely on loops.
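For example, a loop-free sketch using pivot (pairs and vectorized are hypothetical names; pivot assumes each (x, y) pair occurs exactly once, which holds for a TukeyHSD table):

# Split the "a-b" labels once, then pivot straight into the matrix.
pairs = df.index.to_series().str.split("-", expand=True).astype(int)
vectorized = (
    df.assign(x=pairs[0].values, y=pairs[1].values)
      .pivot(index="x", columns="y", values="p adj")
)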