Objective: create a new column in a csv if an exact match to a search entry (from a list of search entries) is found. I am new to Python, so I apologise if this is confusing.

Below I create some mock csv files as an example. I have a "database" csv file where I use the column headers to create patterns of strings:

#!/usr/bin/env python

import pandas as pd
import re
import numpy as np

#creating database example for stack overflow
data = [['Chicken','Chicken Breast'],
        ['Cattle', 'Beef'],
        ['Bird']]
database = pd.DataFrame(data, columns = ['Animal', 'Meat'])
database.to_csv('db.csv')
db = pd.read_csv('db.csv')

I also have csv files of data that include a source column which I want to search.

data_to_search = [['ID1', 'Chicken'],
                  ['ID2', 'Chicken Breast'],
                  ['ID3', 'Cat'],
                  ['ID4', 'Unknown']]
search_df = pd.DataFrame(data_to_search, columns=['Identifier','Source'])
search_df.to_csv('info.csv')

Below is an example of my ugly code:

#use the column headers in the source database csv file to create lists and patterns
Animal = db.Animal.tolist()
Animalpattern = "|".join(str(v) for v in Animal)

Meat = db.Meat.tolist()
Meatpattern = "|".join(str(v) for v in Meat)


#read the input file that will be searched to source parses from
search_data = pd.read_csv('info.csv')

#search through the source column in the input file, and identify matches to the patterns from the database csv, then create new columns for matches
search_data['Animal'] = search_data['Source'].str.match(Animalpattern)
search_data['Animal'] = search_data['Animal'].map({True: 'Animal', False: ''})

search_data['Meat'] = search_data['Source'].str.match(Meatpattern)
search_data['Meat'] = search_data['Meat'].map({True: 'Meat', False: ''})

#replacing empty cells with NaN so can concatenate without worrying about extra commas
search_data['Animal'] = search_data['Animal'].replace('', np.nan)
search_data['Meat'] = search_data['Meat'].replace('', np.nan)

#create a new column that concatenates all of the parsed source information into one
search_data['Source'] = search_data[['Animal', 'Meat']].apply(lambda x: ','.join(x[x.notnull()]), axis=1)

#output a new csv file with source data
search_data.to_csv('output.csv')

The output looks like this:

Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,"Animal,Meat",Animal,Meat
2,ID3,,,
3,ID4,,,

But I would like to prevent it from outputting "Animal,Meat" where "Chicken Breast" was an entry, as it should only be a match to "Meat" but is also detecting "Chicken":

Unnamed: 0,Identifier,Source,Animal,Meat
0,ID1,Animal,Animal,
1,ID2,Meat,,Meat
2,ID3,,,
3,ID4,,,

I have it working, but I cannot figure out how to get an exact match to work: where it should be just 'Meat' for 'Chicken Breast', I end up with 'Animal,Meat' because 'Chicken' is in 'Chicken Breast'.
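
A minimal sketch of why this happens, assuming a pandas version that provides Series.str.fullmatch (1.1 or later): str.match only anchors the pattern at the start of each string, so 'Chicken' matches the beginning of 'Chicken Breast', whereas str.fullmatch requires the pattern to cover the whole string. The names source and animals are stand-ins for the 'Source' column and the 'Animal' list above, not part of the original code.

```python
import re
import pandas as pd

# Stand-ins for the 'Source' column and the 'Animal' list from the question
source = pd.Series(["Chicken", "Chicken Breast", "Cat", "Unknown"])
animals = ["Chicken", "Cattle", "Bird"]

# Escape each term so regex metacharacters are treated literally, and wrap
# the alternation in a non-capturing group so anchoring applies to all of it
pattern = "(?:" + "|".join(re.escape(a) for a in animals) + ")"

prefix_hits = source.str.match(pattern)      # anchored at the start only
exact_hits = source.str.fullmatch(pattern)   # must match the entire string

print(prefix_hits.tolist())  # [True, True, False, False]
print(exact_hits.tolist())   # [True, False, False, False]
```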

My source/database file has hundreds of entries for some columns, so I need a way to read the columns in as lists, then search for the values in those columns.
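
One possible approach, sketched here under assumptions rather than as a definitive solution: since only exact matches are wanted, Series.isin can compare 'Source' against each database column as a plain list, with no regex at all, and a loop over db.columns avoids repeating the code per column. The in-memory DataFrames below stand in for db.csv and info.csv, and the names category, terms and hits are illustrative only.

```python
import numpy as np
import pandas as pd

# In-memory stand-ins for db.csv and info.csv from the question
db = pd.DataFrame({"Animal": ["Chicken", "Cattle", "Bird"],
                   "Meat": ["Chicken Breast", "Beef", None]})
search_data = pd.DataFrame({"Identifier": ["ID1", "ID2", "ID3", "ID4"],
                            "Source": ["Chicken", "Chicken Breast", "Cat", "Unknown"]})

for category in db.columns:
    # Drop missing cells so a short column does not add 'nan' as a search term
    terms = db[category].dropna().tolist()
    # isin performs whole-value comparison, so 'Chicken' != 'Chicken Breast'
    hits = search_data["Source"].isin(terms)
    search_data[category] = np.where(hits, category, "")

# Join the non-empty labels per row, then overwrite 'Source' as in the question
labels = search_data[list(db.columns)]
search_data["Source"] = labels.apply(
    lambda row: ",".join(v for v in row if v), axis=1)

print(search_data)
```

With this sketch ID2 is labelled only 'Meat', because 'Chicken Breast' is not an element of the 'Animal' list. Note that isin is case-sensitive and does not trim whitespace, so real data may need normalising first.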

I have tried to work out whether I can use the information from "How to match any string from a list of strings in regular expressions in python?"

But I am still very new to coding (hence my code being long and ugly; I'm sure for-loops or something similar would simplify it).

Any help is appreciated!
