Objective: create a new column in a CSV when an entry is an exact match to one of a list of search entries. I am new to Python, so I apologise if this is confusing.
Below I create some mock CSV files as an example. I have a "database" CSV file, and I use its columns (named by the column headers) to create lists and string patterns:
#!/usr/bin/env python
import pandas as pd
import regex as re
import numpy as np
#creating database example for stack overflow
data = [['Chicken','Chicken Breast'],
['Cattle', 'Beef'],
['Bird']]
database = pd.DataFrame(data, columns = ['Animal', 'Meat'])
database.to_csv('db.csv')
db = pd.read_csv('db.csv')
I also have CSV files of data that include a Source column, which is the column I want to search.
data_to_search = [['ID1', 'Chicken'],
['ID2', 'Chicken Breast'],
['ID3', 'Cat'],
['ID4', 'Unknown']]
search_df = pd.DataFrame(data_to_search, columns=['Identifier','Source'])
search_df.to_csv('info.csv')
Below is an example of my ugly code:
#use the column headers in the source database csv file to create lists and patterns
Animal = db.Animal.tolist()
Animalpattern = "|".join(str(v) for v in Animal)
Meat = db.Meat.tolist()
Meatpattern = "|".join(str(v) for v in Meat)
#read the input file that will be searched to source parses from
search_data = pd.read_csv('info.csv')
#search through the source column in the input file, and identify matches to the patterns from the database csv, then create new columns for matches
search_data['Animal'] = search_data['Source'].str.match(Animalpattern)
search_data['Animal'] = search_data['Animal'].map({True: 'Animal', False: ''})
search_data['Meat'] = search_data['Source'].str.match(Meatpattern)
search_data['Meat'] = search_data['Meat'].map({True: 'Meat', False: ''})
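#replacing empty cells with NaN so can concatenate without worrying about extra commas
#(assigning the result back, because inplace=True on a selected column may not write back to the DataFrame in newer pandas versions)
search_data['Animal'] = search_data['Animal'].replace('', np.nan)
search_data['Meat'] = search_data['Meat'].replace('', np.nan)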
#create a new column that concatenates all of the parsed source information into one
search_data['Source'] = search_data[['Animal', 'Meat']].apply(lambda x: ','.join(x[x.notnull()]), axis=1)
#output a new csv file with source data
search_data.to_csv('output.csv')
The output looks like this:
Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,"Animal,Meat",Animal,Meat
2,ID3,,,
3,ID4,,,
But for the "Chicken Breast" entry I would like it to output only "Meat" rather than "Animal,Meat". It should only be a match to "Meat", but "Chicken" is also being detected. This is the output I want:
Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,Meat,,Meat
2,ID3,,,
3,ID4,,,
I have it working, but I cannot figure out how to get an exact match to work: where it should be just 'Meat' for 'Chicken Breast', I end up with 'Animal,Meat' because 'Chicken' is in 'Chicken Breast'.
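A minimal sketch of what seems to be going on, assuming the cause is that `Series.str.match` only requires the match to begin at the start of the string, while `Series.str.fullmatch` (available in pandas 1.1 and later) requires the whole string to match. The pattern below is hard-coded for illustration rather than built from db.csv, and the `(?:...)` group around the alternatives is an addition that is not in the code above:

```python
import pandas as pd

# Same values as the Source column in info.csv
source = pd.Series(["Chicken", "Chicken Breast", "Cat", "Unknown"])

# Hard-coded stand-in for Animalpattern; the non-capturing group keeps
# the alternatives together when the pattern is anchored.
animal_pattern = "(?:Chicken|Cattle|Bird)"

# str.match anchors only at the start, so "Chicken Breast" counts as a hit
prefix_hits = source.str.match(animal_pattern).tolist()

# str.fullmatch requires the entire cell to equal one of the alternatives
exact_hits = source.str.fullmatch(animal_pattern).tolist()

print(prefix_hits)
print(exact_hits)
```

Under these assumptions only the first call treats 'Chicken Breast' as an Animal, which matches the unwanted 'Animal,Meat' row shown earlier.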
My source/database file has hundreds of entries for some columns, so I need a way to read the columns in as lists, then search for the values in those columns.
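One possible way to do that without regular expressions is sketched below, assuming that "exact match" means the whole Source cell equals one database entry. It loops over the database columns and uses `Series.isin`, which compares complete values. The DataFrames are built in memory as stand-ins for db.csv and info.csv, and the `Parsed` column name is hypothetical (the code above overwrites `Source` instead):

```python
import pandas as pd

# In-memory stand-ins for db.csv and info.csv
db = pd.DataFrame({"Animal": ["Chicken", "Cattle", "Bird"],
                   "Meat": ["Chicken Breast", "Beef", None]})
search_data = pd.DataFrame({"Identifier": ["ID1", "ID2", "ID3", "ID4"],
                            "Source": ["Chicken", "Chicken Breast", "Cat", "Unknown"]})

labels = {}
for category in db.columns:
    # dropna() keeps the empty cells of shorter columns out of the lookup list
    entries = db[category].dropna().tolist()
    # isin compares whole cell values, so "Chicken Breast" is not counted as "Chicken"
    hits = search_data["Source"].isin(entries)
    labels[category] = hits.map({True: category, False: ""})

matches = pd.DataFrame(labels)

# Join the non-empty labels of each row with commas ("Parsed" is a hypothetical name)
search_data["Parsed"] = matches.apply(lambda row: ",".join(v for v in row if v), axis=1)

print(search_data)
```

Because the loop runs over `db.columns`, adding another column to the database file would add another category without further code changes.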
I have tried to understand if I can use info from: How to match any string from a list of strings in regular expressions in python?
But I am still very new to coding, which is why my code is long and ugly; I'm sure for-loops or something similar would simplify it.
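If the regular-expression route from that linked question is kept, a rough sketch of building one pattern from a list could look like the following. It uses the standard-library `re` module rather than the third-party `regex` package imported above, and `build_exact_pattern` is a hypothetical helper name. Escaping each entry is an added precaution so that characters such as `.` or `(` in a database entry are treated literally:

```python
import re

def build_exact_pattern(entries):
    """Compile a pattern that matches any one of the entries (hypothetical helper)."""
    # re.escape makes regex metacharacters inside an entry match literally
    escaped = (re.escape(str(entry)) for entry in entries)
    # The non-capturing group keeps the alternatives together as one unit
    return re.compile("(?:" + "|".join(escaped) + ")")

animal_pattern = build_exact_pattern(["Chicken", "Cattle", "Bird"])

# fullmatch succeeds only when the entire string is one of the entries
print(animal_pattern.fullmatch("Chicken") is not None)
print(animal_pattern.fullmatch("Chicken Breast") is not None)
```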
Any help is appreciated!