pandas combine two columns with null values


I have a df with two columns and I want to combine both columns ignoring the NaN values. The catch is that sometimes both columns have NaN values in which case I want the new column to also have NaN. Here's the example:

df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})

df
Out[10]:
foodstuff   type
0   apple-martini   None
1   apple-pie   None
2   None    strawberry-tart
3   None    dessert
4   None    None

I tried to solve this with fillna:

df['foodstuff'].fillna('') + df['type'].fillna('')

and I got:

0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4                   
dtype: object

Row 4 has become an empty string. What I want in this situation is a NaN value, since both of the columns being combined are NaN there.

0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4            None       
dtype: object

There are 9 best solutions below

4
On BEST ANSWER

Use fillna on one column with the fill values being the other column:

df['foodstuff'].fillna(df['type'])

The resulting output:

0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4               None
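
One thing worth keeping in mind: fillna only fills the gaps in the first column, so it never concatenates. The sketch below reuses the question's data and adds one hypothetical row (index 5, not in the original example) where both columns hold a value, to show what happens in that case.

```python
import pandas as pd

# Same data as the question, plus one hypothetical row where both columns are filled.
df = pd.DataFrame({
    'foodstuff': ['apple-martini', 'apple-pie', None, None, None, 'cake'],
    'type': [None, None, 'strawberry-tart', 'dessert', None, 'dessert'],
})

combined = df['foodstuff'].fillna(df['type'])

# Row 4 stays missing because both inputs are missing there.
print(combined.isna().tolist())
# Row 5 keeps only the 'foodstuff' value; nothing is concatenated.
print(combined.iloc[5])
```

If rows like the last one can occur and you need both values joined, a concatenating approach is probably a better fit than fillna.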
1
On

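You can always replace the empty string in the new column with NaN. Assign the result back rather than calling replace with inplace=True on the selected column, which may not update the DataFrame in recent pandas versions:

import numpy as np

df['new_col'] = df['new_col'].replace(r'^\s*$', np.nan, regex=True)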

Complete code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 'type':[None, None, 'strawberry-tart', 'dessert', None]})

df['new_col'] = df['foodstuff'].fillna('') + df['type'].fillna('')

# Turn empty or whitespace-only strings back into NaN (assign back instead of inplace=True)
df['new_col'] = df['new_col'].replace(r'^\s*$', np.nan, regex=True)

df

output:

    foodstuff   type    new_col
0   apple-martini   None    apple-martini
1   apple-pie   None    apple-pie
2   None    strawberry-tart strawberry-tart
3   None    dessert dessert
4   None    None    NaN
0
On
  • fillna both columns together
  • sum(axis=1) to add them
  • replace('', np.nan)

df.fillna('').sum(axis=1).replace('', np.nan)

0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4                NaN
dtype: object
0
On

You can use the combine method with a lambda:

df['foodstuff'].combine(df['type'], lambda a, b: ((a or "") + (b or "")) or None, None)

(a or "") evaluates to "" when a is None. The same trick is applied to the concatenated result, so an empty string (both inputs None) becomes None. Note that this relies on the missing values being None: NaN is truthy, so (a or "") would leave a NaN unchanged.
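
If the columns might contain NaN as well as None, a named function with an explicit missing-value check is one way to make the same idea more robust. This is only a sketch; join_or_missing is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def join_or_missing(a, b):
    """Concatenate two cells, treating None/NaN as empty; return None if both are missing."""
    left = '' if pd.isna(a) else a
    right = '' if pd.isna(b) else b
    # An empty result means both inputs were missing, so report it as missing.
    return (left + right) or None

df = pd.DataFrame({
    'foodstuff': ['apple-martini', 'apple-pie', None, None, None],
    'type': [None, None, 'strawberry-tart', 'dessert', float('nan')],
})

combined = df['foodstuff'].combine(df['type'], join_or_missing)
print(combined)
```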

1
On

If your columns are complementary, i.e. each one holds a value only where the others do not, a one-liner that does the job well is:

>>> df.rename(columns={'type': 'foodstuff'}).stack().unstack()
         foodstuff
0    apple-martini
1        apple-pie
2  strawberry-tart
3          dessert

This solution also generalises well if you have multiple columns to merge, as long as you can define your .rename mapping. The renaming deliberately creates duplicate column names, which .stack().unstack() then collapses for you.

As explained, this solution only suits orthogonal columns, i.e. columns that never hold a value in the same row. Note also that row 4, where both columns are missing, is dropped from the output above.

0
On

You can replace the 1s with the corresponding column names like this:

df1= df.replace(1, pd.Series(df.columns, df.columns))

Afterwards, replace the 0s with empty strings and then concatenate the columns:

f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four

Refer to the full code below:

import pandas as pd
df = pd.DataFrame({'Second':[0,1,0,0],'First':[1,0,0,0],'Three':[0,0,1,0],'Four':[0,0,0,1], 'cl': ['3D', 'Wireless','Accounting','cisco']})
df2=pd.DataFrame({'pi':['Accounting','cisco','3D','Wireless']})
df1= df.replace(1, pd.Series(df.columns, df.columns))
f = pd.merge(df1,df2,how='right',left_on=['cl'],right_on=['pi'])
f = f.replace(0, '')
f['new'] = f.First+f.Second+f.Three+f.Four

df1:

In [3]: df1                                                                                                                                                                              
Out[3]: 
   Second  First  Three  Four          cl
0       0  First      0     0          3D
1  Second      0      0     0    Wireless
2       0      0  Three     0  Accounting
3       0      0      0  Four       cisco

df2:

In [4]: df2                                                                                                                                                                              
Out[4]: 
           pi
0  Accounting
1       cisco
2          3D
3    Wireless

Final DataFrame f will be:

In [2]: f                                                                                                                                                                                
Out[2]: 
   Second  First  Three  Four          cl          pi     new
0          First                       3D          3D   First
1  Second                        Wireless    Wireless  Second
2                 Three        Accounting  Accounting   Three
3                        Four       cisco       cisco    Four
0
On

We can generalise the problem and build a solution that works for any number of columns.

The key requirement is to join a group of columns together while ignoring NaNs.

Here is my answer:

df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', None, None, None], 
               'type':[None, None, 'strawberry-tart', 'dessert', None],
              'type1':[98324, None, None, 'banan', None],
              'type2':[3, None, 'strawberry-tart', np.nan, None]})


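# Mark missing values with a sentinel string so every cell can be joined as text
df = df.fillna("NAN")
df = df.astype('str')
df["output"] = df[['foodstuff', 'type', 'type1', 'type2']].agg(', '.join, axis=1)
# Strip the sentinel wherever it appears next to a separator
df['output'] = df['output'].str.replace('NAN, ', '')
df['output'] = df['output'].str.replace(', NAN', '')
# A row where every column was missing is left as a bare 'NAN'; turn it back into NaN
df['output'] = df['output'].replace('NAN', np.nan)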
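
As an alternative that avoids the sentinel string altogether, here is a rough sketch that drops the missing cells of each row before joining. join_present is a hypothetical helper name, and the row-wise apply may be slow on large frames.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foodstuff': ['apple-martini', 'apple-pie', None, None, None],
                   'type': [None, None, 'strawberry-tart', 'dessert', None],
                   'type1': [98324, None, None, 'banan', None],
                   'type2': [3, None, 'strawberry-tart', np.nan, None]})

cols = ['foodstuff', 'type', 'type1', 'type2']

def join_present(row, sep=', '):
    """Join the non-missing cells of a row as text; return NaN if the row is empty."""
    present = row.dropna().astype(str)
    return sep.join(present) if len(present) else np.nan

# Unlike the sentinel approach, the original columns are left untouched.
df['output'] = df[cols].apply(join_present, axis=1)
print(df['output'])
```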


0
On

With combine_first you can fill null values in one column with non-null values from another column:

In [3]: df['foodstuff'].combine_first(df['type'])
Out[3]: 
0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4               None
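
The same call can be chained when there are more than two columns. The sketch below assumes a hypothetical third column named extra (not in the question) and takes, for each row, the first non-null value from left to right.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    'foodstuff': ['apple-martini', None, None, None],
    'type': [None, 'dessert', None, None],
    'extra': [None, 'ignored', 'fallback', None],  # hypothetical third column
})

cols = ['foodstuff', 'type', 'extra']

# Start from the first column and fill its gaps from each later column in turn.
first_present = reduce(lambda acc, name: acc.combine_first(df[name]),
                       cols[1:], df[cols[0]])
print(first_present)
```

Like fillna, this picks one value per row rather than concatenating, so 'ignored' in row 1 is never used.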
0
On

If you initialize your DataFrame with NaNs for missing values rather than None, you can use Series.add() to fill NaN values on the fly when adding the columns together.

Example:

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'foodstuff':['apple-martini', 'apple-pie', np.nan, np.nan, np.nan], 
'type':[np.nan, np.nan, 'strawberry-tart', 'dessert', np.nan]})

df['foodstuff'].add(df['type'], fill_value = '')

Result:

0      apple-martini
1          apple-pie
2    strawberry-tart
3            dessert
4                NaN

This also works nicely for adding numerical columns that have some NaN values, as it allows you to add a number to a NaN value and get the number. Example:

df_test_nums = pd.DataFrame({'left_numbers':[1, 1, np.nan, 3.7, 2.4], 
'right_numbers':[4, np.nan, np.nan, 2.7, 9.4]})
print(df_test_nums)

Result:

   left_numbers  right_numbers
0           1.0            4.0
1           1.0            NaN
2           NaN            NaN
3           3.7            2.7
4           2.4            9.4

Adding these columns together so that the sum of a number and a NaN value will be the number:

df_test_nums['left_numbers'].add(
    df_test_nums['right_numbers'], fill_value = 0)

Result:

0     5.0
1     1.0
2     NaN
3     6.4
4    11.8

Compare this to the use of the + operator, which converts the sum of NaN and a number into NaN:

df_test_nums['left_numbers'] + df_test_nums['right_numbers']

Result:

0     5.0
1     NaN
2     NaN
3     6.4
4    11.8

For operations that involve multiple columns, a more elegant approach is available via df.sum().

print(df_test_nums[
          ['left_numbers', 'right_numbers']].sum(
              axis=1, min_count = 1))

output:

0     5.0
1     1.0
2     NaN
3     6.4
4    11.8

Note that if min_count is left at 0 (the default), the third row (index 2) will equal 0, because the sum of a row containing only NaNs defaults to 0. (See the df.sum() documentation for more information.)
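
To see the effect of min_count side by side, here is a small sketch using the same numeric frame as above; the variable names default_sum and strict_sum are just illustrative.

```python
import numpy as np
import pandas as pd

df_test_nums = pd.DataFrame({
    'left_numbers': [1, 1, np.nan, 3.7, 2.4],
    'right_numbers': [4, np.nan, np.nan, 2.7, 9.4],
})
cols = ['left_numbers', 'right_numbers']

# Default (min_count=0): an all-NaN row sums to 0.
default_sum = df_test_nums[cols].sum(axis=1)
# min_count=1: at least one non-NaN value is required, otherwise the result is NaN.
strict_sum = df_test_nums[cols].sum(axis=1, min_count=1)

print(default_sum.iloc[2])
print(strict_sum.iloc[2])
```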