Filter individuals that don't have data for the whole period

69 Views Asked by maxst At 16 September 2022 at 09:23

I am using Python 3.9 in PyCharm. I have the following dataframe:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
6  C  2020    5
7  C  2021    4

I want to keep only the individuals that have data for the whole period. In other words, I would like to filter the rows so that I only keep the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1

Is it feasible in Python?

There are 3 best solutions below

ThePyGuy On 16 September 2022 at 09:48

As you want to include only the ids for which all three years exist, you can group the dataframe by id, then filter each group by comparing the set of years you want with the set of years available for that id:

>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year']) == years)  # df is your dataframe

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
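
For reference, here is a minimal self-contained sketch of the same idea that rebuilds the question's dataframe by hand. Deriving `years` from the data instead of hard-coding it is my assumption about what "the whole period" means (every year that appears anywhere in the `year` column); if the period is fixed in advance, keep the explicit set.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp":  [3, 0, 5, 4, 2, 1, 5, 4],
})

# Assumption: the "whole period" is every year present in the data.
years = set(df["year"])

# Keep only the groups whose set of years equals the whole period.
out = df.groupby("id").filter(lambda g: set(g["year"]) == years)
print(out)
```

With the sample data, only the rows of A and B remain, because C has no row for 2019.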

Timeless On 16 September 2022 at 10:51

First, count the distinct years in the column year, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and GroupBy.transform to count the rows of each id and broadcast that count back onto every row of the id. This assumes each (id, year) pair appears at most once.

from io import StringIO
import pandas as pd

s = """id   year    gdp
A   2019    3
A   2020    0
A   2021    5
B   2019    4
B   2020    2
B   2021    1
C   2020    5
C   2021    4
"""
# split on any run of whitespace, so tabs and spaces both work
df = pd.read_csv(StringIO(s), sep=r'\s+')

# compare each id's row count with the number of distinct years (3 here)
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())

out = df[mask]

>>> print(out)

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
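
A possible variant, in case the same (id, year) pair can appear more than once: `transform('count')` counts rows, so a duplicated row could make an incomplete id look complete. The sketch below counts distinct years per id with `nunique` instead. The extra duplicated row for C is hypothetical and is only there to show the difference.

```python
import pandas as pd

# Sample data from the question, plus one HYPOTHETICAL duplicate row
# (C, 2021) that is not in the original data.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021, 2021],
    "gdp":  [3, 0, 5, 4, 2, 1, 5, 4, 4],
})

# Number of distinct years in the whole dataframe.
n_years = df["year"].nunique()

# Distinct years per id, broadcast back onto every row of that id.
years_per_id = df.groupby("id")["year"].transform("nunique")

# C has three rows but only two distinct years, so it is dropped.
out = df[years_per_id.eq(n_years)]
print(out)
```

A plain row count would keep C here (three rows), while the distinct count does not.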

mozway On 17 September 2022 at 10:14

Here is a way using pivot and dropna to automatically find the ids that have no missing years:

# use keyword arguments: recent pandas versions reject positional ones here
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']

out = df[df['id'].isin(keep)]

output:

  id  year  gdp
0  A  2019    3
1  A  2020    0
2  A  2021    5
3  B  2019    4
4  B  2020    2
5  B  2021    1
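
To see why this works, it can help to look at the intermediate wide table. The sketch below is self-contained and rebuilds the question's dataframe; the variable name `wide` is mine.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp":  [3, 0, 5, 4, 2, 1, 5, 4],
})

# One row per id, one column per year; a missing (id, year) pair becomes NaN.
wide = df.pivot(index="id", columns="year", values="gdp")
print(wide)

# C has no 2019 observation, so dropna() removes its row.
keep = wide.dropna().index
out = df[df["id"].isin(keep)]
print(out)
```

Two caveats, as far as I can tell: `pivot` raises an error if an (id, year) pair is duplicated, and an id whose gdp is itself NaN for some year is dropped as well, since `dropna` cannot tell a missing row from a missing value.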
