Parsing data with Pandas - how to output match as a new column

133 Views Asked by At

I've got a routine to read in a CSV file and spit out selected columns that match specific criteria:

CSV Input File looks like this

Name Role Login
Phil Role A | Role B 2024/01/01
Bob Role A | Role B 2024/02/01
Arthur Role A | Role C 2024/01/04
Jane Role B | Role C 2024/01/31
Mary Role A | Role D 2024/02/12
Liz Role B | Role F 2024/02/21
Phoebe Role C | Role D 2023/11/21
Mike Role E 2024/02/15
Rick Role D | Role E 2024/01/13
Hilary Role F 2024/01/11

I have a block of code that matches based on a passed value:

# Define function to check if a value matches any of the filter values
def matches_filter(value):
    value_lower = value.lower()
    for filter_value in value_lower.split("|"):
        filter_value_lower = filter_value.lower()
        for fvals in fltr_values:
            if fvals.lower() in filter_value_lower:
                return fvals.lower()
    return None

# Apply filter
# filtered_df = df[df[fltr_field].apply(matches_filter)]
df[fltr_field + "_matched"] = df[fltr_field].apply(matches_filter)

Based on passing the values "Role B" and "Role D" I'd like to replace whatever is in the Role column with the result of the filter. The net resulting table should therefore look something like this:

Name Role Login
Phil Role B 2024/01/01
Bob Role B 2024/01/01
Jane Role B 2024/02/03
Mary Role D 2024/02/02
Liz Role B 2024/02/12
Phoebe Role D 2024/02/21
Rick Role D 2024/01/31

So far, the code will filter so I only get strings containing "Role B" or "Role D" but I'd like to replace the string found with the match criterion, rather than the list of roles. Can someone explain what I need to change here?

To further explain based on comments received so far:

  • What is the content of fltr_field?

fltr_field contains the name of the column to filter on (in this case, I'm filtering on the column called "Role".

  • Can you clearly explain what you're trying to do here?

I wish to replace the contents of the Role column with the matched value.

  • What is the nature of the "Login" column?

The "Login" column contains the last login date.

1

There are 1 best solutions below

2
OCa On

With this input dataframe generator:

df = pd.DataFrame({'Name': {0: 'Phil', 1: 'Bob', 2: 'Arthur', 3: 'Jane', 4: 'Mary', 5: 'Liz', 6: 'Phoebe', 7: 'Mike', 8: 'Rick', 9: 'Hilary'}, 
                   'Role': {0: 'Role A | Role B', 1: 'Role A | Role B', 2: 'Role A | Role C', 3: 'Role B | Role C', 4: 'Role A | Role D', 5: 'Role B | Role F', 6: 'Role C | Role D', 7: 'Role E', 8: 'Role D | Role E', 9: 'Role F'}, 
                   'Login': {0: '2024/01/01', 1: '2024/02/01', 2: '2024/01/04', 3: '2024/01/31', 4: '2024/02/12', 5: '2024/02/21', 6: '2023/11/21', 7: '2024/02/15', 8: '2024/01/13', 9: '2024/01/11'}})

     Name             Role       Login
0    Phil  Role A | Role B  2024/01/01
1     Bob  Role A | Role B  2024/02/01
2  Arthur  Role A | Role C  2024/01/04
3    Jane  Role B | Role C  2024/01/31
4    Mary  Role A | Role D  2024/02/12
5     Liz  Role B | Role F  2024/02/21
6  Phoebe  Role C | Role D  2023/11/21
7    Mike           Role E  2024/02/15
8    Rick  Role D | Role E  2024/01/13
9  Hilary           Role F  2024/01/11

A regex pattern extraction looks like your best option:

# Define search - insert any number of Role letters inside the brackets
role_pattern = "(Role\s[BD])"
# Generate filtered table
df.assign(Role = df['Role'].str.extract(pat=role_pattern)
 ).dropna(subset = ['Role'])

     Name    Role       Login
0    Phil  Role B  2024/01/01
1     Bob  Role B  2024/02/01
3    Jane  Role B  2024/01/31
4    Mary  Role D  2024/02/12
5     Liz  Role B  2024/02/21
6  Phoebe  Role D  2023/11/21
8    Rick  Role D  2024/01/13

I saw no need for the complexity of a dynamic column name (I mean your df[fltr_field + "_matched"]) since the final column in your desired output is called "Role" anyways.


Update to match various roles as whole words
With alternative input,

  • filtering roles "Alpha" and "Role B",
  • in presence of other roles such as "Role A", "Role AB", "Aphal", "Alpha A"*
df = pd.DataFrame({'Name': {0: 'Phil', 1: 'Bob', 2: 'Arthur', 3: 'Jane', 4: 'Mary', 5: 'Liz', 6: 'Phoebe', 7: 'Mike', 8: 'Rick', 9: 'Hilary'}, 
                   'Role': {0: 'Role A | Role B', 1: 'Role A | Role B', 2: 'Role A | Role C', 3: 'Role B | Role C', 4: 'Role A | Role D', 5: 'Role BA | Role F', 6: 'Role C | Role D', 7: 'Aphal', 8: 'Alpha B | Role E', 9: 'Alpha'},
                   'Login': {0: '2024/01/01', 1: '2024/02/01', 2: '2024/01/04', 3: '2024/01/31', 4: '2024/02/12', 5: '2024/02/21', 6: '2023/11/21', 7: '2024/02/15', 8: '2024/01/13', 9: '2024/01/11'}}

     Name              Role       Login
0    Phil   Role A | Role B  2024/01/01
1     Bob   Role A | Role B  2024/02/01
2  Arthur   Role A | Role C  2024/01/04
3    Jane   Role B | Role C  2024/01/31
4    Mary   Role A | Role D  2024/02/12
5     Liz  Role BA | Role F  2024/02/21
6  Phoebe   Role C | Role D  2023/11/21
7    Mike             Aphal  2024/02/15
8    Rick  Alpha B | Role E  2024/01/13
9  Hilary             Alpha  2024/01/11

# simple variant for whole words - caution: will mistake a "Role BA" for a "Role B":
role_pattern = "(Alpha|Role B)"
df.assign(Role = df['Role'].str.extract(pat=role_pattern)
 ).dropna(subset = ['Role'])

# Safe variant for whole words that prevents this: (with lookahead assertion)
role_pattern = '(Alpha(?=($|\s|))|Role B(?=($|\s|)))'
df.assign(Role = df['Role'].str.extract(pat=role_pattern)[0]
 ).dropna(subset = ['Role'])

     Name    Role       Login
0    Phil  Role B  2024/01/01
1     Bob  Role B  2024/02/01
3    Jane  Role B  2024/01/31
9  Hilary   Alpha  2024/01/11

The "safe variant" with lookahead assertion will function as long as roles are entered with " | " in between. It makes sure the role name is followed by

  • either the end of the string ("$"),
  • or your vertical separator after a space ("\s|").

Similarly, we could not only "look ahead" but also "look back"; you see: regex complexity will have to catch-up with the diversity of expected "roles", but there is no need to make it more complex than it needs to be.

References