Filter individuals that don't have data for the whole period


I am using Python 3.9 in PyCharm. I have the following dataframe:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
6  C  2020    5
7  C  2021    4

I want to keep only the individuals that have data for the whole period. In other words, I would like to filter the rows so that I keep only the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
 

Is it feasible in Python?


0
ThePyGuy On

Since you want to keep only the ids for which all three years exist, you can group the dataframe by id, then filter each group by comparing the set of years you want against the set of years available for that id:

>>> years = {2019, 2020, 2021}
>>> # df is your dataframe
>>> df.groupby('id').filter(lambda x: set(x['year'].values) == years)

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
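
If the target years should not be hard-coded, a minimal sketch along the same lines could derive them from the data. This assumes "the whole period" means every year that appears anywhere in the dataframe; the subset test (instead of strict equality) is a design choice of this sketch, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp": [3, 0, 5, 4, 2, 1, 5, 4],
})

# Assumption: the full period is every year present in the data.
years = set(df["year"].unique())

# Keep a group only if its years cover the whole period.
complete = df.groupby("id").filter(lambda g: years.issubset(set(g["year"])))
print(complete)
```

With a fixed period, replacing `years` by an explicit set such as `{2019, 2020, 2021}` keeps ids that cover those years even when they also have rows for other years.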
0
Timeless On

First, count the distinct years in the column year, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and transform to count the rows of each id and compare that count with the number of distinct years. This assumes each (id, year) pair appears at most once.

from io import StringIO
import pandas as pd

s = """id   year    gdp
A   2019    3
A   2020    0
A   2021    5
B   2019    4
B   2020    2
B   2021    1
C   2020    5
C   2021    4
"""
# the sample is whitespace-separated, so split on any run of whitespace
df = pd.read_csv(StringIO(s), sep=r'\s+')

# keep the rows whose id has as many rows as there are distinct years
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())

out = df[mask]

>>> print(out)

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
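
A mask built from row counts relies on each (id, year) pair appearing only once. If duplicates are possible, one option is to count distinct years per id instead. The sketch below uses a hypothetical variant of the sample data in which the C/2020 row is repeated; that duplicate is an assumption added for illustration and is not in the question.

```python
import pandas as pd

# Hypothetical data: same as the question, plus a duplicated C/2020 row.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2020, 2021],
    "gdp": [3, 0, 5, 4, 2, 1, 5, 5, 4],
})

n_years = df["year"].nunique()  # number of distinct years in the data

# Distinct years per id, broadcast back to every row of that id.
mask = df.groupby("id")["year"].transform("nunique").eq(n_years)
out = df[mask]
print(out)
```

Here C has three rows but only two distinct years, so it is dropped, whereas a plain row count would have kept it.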
0
mozway On

Here is a way using pivot and dropna to automatically find ids with missing values:

keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']

out = df[df['id'].isin(keep)]

output:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
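
Note that pivot followed by dropna treats a missing gdp value the same way as a missing year. If only year coverage matters, a possible alternative is to count rows per (id, year) with pandas.crosstab. The sketch below assumes a hypothetical variant of the data in which A's 2020 gdp is NaN; that missing value is added for illustration only.

```python
import pandas as pd

# Hypothetical data: A's 2020 gdp is missing, but the row itself exists.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp": [3, float("nan"), 5, 4, 2, 1, 5, 4],
})

# Row counts per (id, year); gdp values are not looked at.
counts = pd.crosstab(df["id"], df["year"])

# An id is complete when it has at least one row for every year.
covered = counts.gt(0).all(axis=1)
keep = covered[covered].index

out = df[df["id"].isin(keep)]
print(out)
```

With this data, the crosstab route keeps A and B, while pivot plus dropna would keep only B.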