Regex/Difflib/Datastructure algorithm problem

Thank you in advance for the help. I'm in a bit of a pickle with my current problem: I have several data sets that all represent the same data in CSV format, except that the column names vary to a certain degree, for example:

  • ME_loard_MW
  • ME_loard
  • ME_load

These would be the heading names for 3 separate data sets. I'm trying to develop a function that parses the column names (in pandas) and changes the names in any uploaded data set to a specific set. One approach I've tried is a regex function such as

import re

# myregex is a pattern string defined elsewhere (not shown here).
def renamefunc(col_name):
    if re.match(myregex, col_name, flags=re.I):
        return "FLOW202"
    return col_name
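
For illustration, here is a self-contained sketch of how such a function might behave on the example headings. The pattern assigned to myregex and the extra "timestamp" column are assumptions made up for this example, not taken from the real data.

```python
import re

# Hypothetical pattern (an assumption): "ME_lo" followed by any word characters.
myregex = r"ME_lo\w*$"

def renamefunc(col_name):
    # re.match anchors at the start of the string; re.I ignores case.
    if re.match(myregex, col_name, flags=re.I):
        return "FLOW202"
    return col_name

# The three example headings plus one hypothetical unrelated column.
columns = ["ME_loard_MW", "ME_loard", "ME_load", "timestamp"]
renamed = [renamefunc(c) for c in columns]
print(renamed)
```

With pandas, the same function could be passed to df.rename(columns=renamefunc), which calls it once per column name.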

I've also considered using the difflib module (get_close_matches), since the column names are distinct enough that the first element of the returned list will be the one I am targeting. Finally, I have been considering a dictionary-based approach, but this is a bit beyond my current skills since I only started programming in April. Any input, feedback, or criticism is more than welcome; my goal is to improve! Attached is an image of the type of data sets I expect to encounter.

There is 1 best solution below

It sounds like you want to rename the columns in all your datasets to one specific set of names. Provided that all your datasets are aligned, i.e. their columns appear in the same order, you can simply assign the column names like this:

import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year': ['2013', '2013', '2014', '2014', '2015', '2015'],
    'type': ['up', 'down', 'up', 'down', 'up', 'down'],
    'cost': [30, 15, 20, 15, 30, 25],
})

# Target names, listed in the same order as the existing columns.
column_names_set = ('Name', 'Year', 'Type', 'Cost')

# Assigning to df.columns renames every column by position.
df.columns = column_names_set

I cannot be more specific than this because I cannot see your dataset. Perhaps the image you intended to attach did not work.
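
If the columns are not guaranteed to be in the same order, the difflib idea from the question might be worth a try. Below is a rough sketch under stated assumptions: the canonical names "AE_load" and "speed", the helper canonical_name, and the 0.6 cutoff are all hypothetical and would need adjusting to the real data.

```python
from difflib import get_close_matches

# Hypothetical canonical names (assumptions); replace with the real target set.
CANONICAL = ["ME_load", "AE_load", "speed"]

def canonical_name(col_name, targets=CANONICAL, cutoff=0.6):
    """Return the closest canonical name, or col_name if nothing is close enough."""
    # Compare in lower case so that differences in capitalisation do not matter.
    lowered = {t.lower(): t for t in targets}
    matches = get_close_matches(col_name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else col_name

# Build an old-name -> new-name dictionary for one set of headings.
headings = ["ME_loard_MW", "ME_loard", "ME_load", "Speed", "notes"]
mapping = {c: canonical_name(c) for c in headings}
print(mapping)
```

A dictionary like mapping can then be passed to df.rename(columns=mapping). Fuzzy matching can pick the wrong target when canonical names are similar to each other, so it is probably worth printing the mapping and checking it before renaming anything.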