Pandas - find all Rows where special Columns contain a part of text

74 Views Asked by At

I have a problem finding rows of a DataFrame where 2 Columns contain a Part of a given String s.

The Column Values (Type String(Object))
I mean about the opposite of str.contains or isin(), because the substring-mask is the Column-Value.
The String is not suitable for clear splitting, because the 3 Values "Cityname", "Districtname" and "Streetname" can contain Whitespaces.

Can you help me?

s = "Bad Testcity Teststr."
df_res = df.loc[(s.find(df['CITY'] != -1) & (s.find(df['DISTRICT'] != -1) & (s.find(df['STREET'] != -1)]`

This sample should return TRUE.

<bound method DataFrame.info of             ZIP              CITY               STREET NUMBER   NUMBER_SFX         DISTRICT   ONKZ ASB      ADSL      VDSL   VDSL_SV         VPSZ OUTDOOR
ID
4025217   12345  Bad Testcity          Teststr.          6              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025219   12345  Bad Testcity          Teststr.          7              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025242   12345  Bad Testcity          Teststr.          8              NaN  Bad Testcity  12345   2  +017.696  +102.784  +185.824  49/12345/30       O
4025244   12345  Bad Testcity          Teststr.         10              NaN  Bad Testcity  12345   2  +017.696  +102.784       NaN  49/12345/30       O
4025245   12345  Bad Testcity          Teststr.         11              NaN  Bad Testcity  12345   2  +017.696  +051.392       NaN  49/12345/30       O
...         ...              ...                   ...        ...              ...              ...    ...  ..       ...       ...       ...          ...     ...

[1569530 rows x 13 columns]>
1

There are 1 best solutions below

0
mozway On BEST ANSWER

Assuming this input:

           ZIP          CITY    STREET  NUMBER  NUMBER_SFX      DISTRICT   ONKZ  ASB    ADSL     VDSL  VDSL_SV         VPSZ OUTDOOR
ID                                                                                                                                 
4025217  12345  Bad Testcity  Teststr.       6         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025219  12345  Bad Testcity  Teststr.       7         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025242  12345  Bad Testcity  Teststr.       8         NaN  Bad Testcity  12345    2  17.696  102.784  185.824  49/12345/30       O
4025244  12345  Bad Testcity  Teststr.      10         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025245  12345  Bad Testcity  Teststr.      11         NaN  Bad Testcity  12345    2  17.696   51.392      NaN  49/12345/30       O

You can concatenate the columns with a space, then use str.contains on the output:

s = "Bad Testcity Teststr."

df_res = df.loc[(df['CITY']+' '+df['DISTRICT']+' '+df['STREET']).str.contains(s)]

Less efficient alternative (can nevertheless be useful):

df_res = df.loc[df[['CITY', 'DISTRICT', 'STREET']]
                .apply(' '.join, axis=1)
                .str.contains(s)]

Output (here unchanged):

           ZIP          CITY    STREET  NUMBER  NUMBER_SFX      DISTRICT   ONKZ  ASB    ADSL     VDSL  VDSL_SV         VPSZ OUTDOOR
ID                                                                                                                                 
4025217  12345  Bad Testcity  Teststr.       6         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025219  12345  Bad Testcity  Teststr.       7         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025242  12345  Bad Testcity  Teststr.       8         NaN  Bad Testcity  12345    2  17.696  102.784  185.824  49/12345/30       O
4025244  12345  Bad Testcity  Teststr.      10         NaN  Bad Testcity  12345    2  17.696  102.784      NaN  49/12345/30       O
4025245  12345  Bad Testcity  Teststr.      11         NaN  Bad Testcity  12345    2  17.696   51.392      NaN  49/12345/30       O