Better way to duplicate rows based on two columns, merging those columns into a single column

29 Views Asked by At

I have the following Padas DF... Starting DF

I.e.

,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383

In order for a later merge, it would be better if I had four rows (instead of two) and injection_acqmethod_id and injection_damethod_id were both simply in column method_id as follows... desired result

I.e.

,resultset_id,resultsetrevision_id,injection_id,injection_acqmethod_id,injection_damethod_id,method_id
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3
0,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,5cff24fc-f1b8-43b1-98a5-39fc41c27a33,f85b0a52-52a8-4e8d-93c3-54be11c7f8c3,,None
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,53b34ff9-fec2-472d-a4d0-61e6029d586a
1,8c502f71-9965-43c9-b3be-e7988a2fc89e,023c8953-565e-4953-991a-a842e0444e67,6c005f00-8654-4ebc-8e42-c92bd4a5fa64,53b34ff9-fec2-472d-a4d0-61e6029d586a,cd4cbbd9-5f23-4146-a499-9c90e3c73383,cd4cbbd9-5f23-4146-a499-9c90e3c73383

I'm using the following code...

_injections["method_id"] = (_injections.injection_acqmethod_id.astype(str) + "," + _injections.injection_damethod_id.astype(str)).str.split(",")
_injections = _injections.explode("method_id")

Merging the columns into a list and then exploding seems like unnecessary work. Is tere a more pythonic/faster/more-concise way to do this?

1

There are 1 best solutions below

0
jezrael On

One idea with reshape:

cols = ['injection_acqmethod_id','injection_damethod_id']

out = (df.assign(**{f'_{x}_': df[x] for x in cols})
        .set_index(list(df.columns))
        .stack()
        .droplevel(-1)
        .reset_index(name='method_id'))
print (out)
    

Another idea with numpy ravel:

cols = ['injection_acqmethod_id','injection_damethod_id']
out = df.loc[df.index.repeat(len(cols))].assign(method_id = np.ravel(df[cols].to_numpy()))

Or with concat:

cols = ['injection_acqmethod_id','injection_damethod_id']

out = (pd.concat([df.assign(method_id=df[x]) for x in cols])
         .sort_index(kind='stable', ignore_index=True))