I've a dataset
| id | ref | name | conditionCol |
|---|---|---|---|
| 1 | 123 | a | no_error |
| 1 | 456 | b | error |
| 1 | 789 | c | no_error |
| 2 | 231 | d | no_error |
| 2 | 312 | e | no_error |
| 2 | 546 | f | no_error |
| 3 | 645 | g | error |
| 3 | 879 | h | error |
| 4 | 789 | i | no_error |
| 4 | 978 | j | no_error |
I'm trying to create a custom error_flag, condition being:
- for each unique
idcolumn elements - if any row in the
conditionColhas the keyworderror, then - for each row should be flagged as
yesin theerror_flag - if for any element in
idcolumn - not even a single row has the keyword
errorinconditionColcolumn, then - for each row should be flagged as
noin theerror_flag
E.g. For id:1, all the values of error_flag is yes, as for id value 1, row #2 of conditionCol has error
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 1 | 123 | a | no_error | yes |
| 1 | 456 | b | error | yes |
| 1 | 789 | c | no_error | yes |
But, for id:2, all the values of error_flag is no, as for id value 2, no row of conditionCol has error
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 2 | 231 | d | no_error | no |
| 2 | 312 | e | no_error | no |
| 2 | 546 | f | no_error | no |
Similarly for id value 3 & 4:
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 3 | 645 | g | no_error | no |
| 3 | 879 | h | no_error | no |
| 4 | 789 | i | error | yes |
| 4 | 978 | j | error | yes |
And final output being:
| id | ref | name | conditionCol | error_flag |
|---|---|---|---|---|
| 1 | 123 | a | no_error | yes |
| 1 | 456 | b | error | yes |
| 1 | 789 | c | no_error | yes |
| 2 | 231 | d | no_error | no |
| 2 | 312 | e | no_error | no |
| 2 | 546 | f | no_error | no |
| 3 | 645 | g | no_error | no |
| 3 | 879 | h | no_error | no |
| 4 | 789 | i | error | yes |
| 4 | 978 | j | error | yes |
Update:
If you wish to play around with the dataset:
import pandas as pd
import numpy as np
id_col = [1,1,1,2,2,2,3,3,4,4]
ref_col = [123,456, 789, 231, 312, 546, 645, 879, 789, 978]
name_col = ['a','b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
conditionCol = ['no_error', 'error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'error', 'error']
df = pd.DataFrame(zip(id_col, ref_col, name_col, conditionCol), columns=['id','ref','name','conditionCol'])
df
update2: Is there a way to work with thresholds, i.e.:
- current question: atleast one occurrence of keyword
errorin conditionCol column for each individual uniqueids, then the value inerror_flagwould beyesfor all the rows in that id value - atleast 4 or atleast 5 occurrence of keyword
errorin conditionCol column for uniqueids, then only the value inerror_flagwould beyesfor all the rows in that id value
Use
numpy.wherewith test if at least one valueerrorper groups byid: