Detecting similar columns across multiple files based on statistical profile

Question

Detecting similar columns across multiple files based on statistical profile

215 Views Asked by Ryan Gross At 10 October 2022 at 14:05

I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match statistical profiles of the columns to data from files that do have good headers. This seems like it should be simple using something like Pandas Profiling, but I haven't found any examples. I'm looking for something that would calculate a score for the similarity between each column in the header-less file and each "known" column for which I already have headers.

Example Data with Headers:

Large Value Column	Small Value Column
100	5
102	12
110	8
98	10

Example Data with only column numbers:

0	1
6	99
9	105
11	101
14	100

For the above example, I would like to automatically determine that column 1 should be added to "Large Value Column" and Column 0 to "Small Value Column".

Original Q&A

There are 2 best solutions below

**Arthur Bricq** · Answer 1 · 2022-10-10T14:26:30.593000

That's an interesting question.

The solution that will work heavily depends on the data you're working with. But here's my thoughts, assuming that your data is similar to the little exampled above, just longer.

The first step would be to create a descriptor that takes as input a row and returns a low-dimension vector that describes this row. For instance: the mean, the standard deviation, the median, ... This should be designed according to what actually describes your data. Then, for each column of your dataset, you're going to have a descriptor vector and (eventually) a label (ie. the column's header), that may or not be known.

The next step can either be classification or clustering, depending on how much data you have and the quality of your descriptors. Look up Scikit documentation about how to do so. Their API is very easy to use and comes with many examples. If you know the number of real headers, you could try to use K-means and I think it would work really well. Maybe KNN can also work. You probably want to look for a sorting algorithm quite 'simple' at first (it's easier to tune), and if nothing simple works then you could try more fancy approaches.

I hope this helps you.

**Ingwersen_erik** · Answer 2 · 2022-10-10T16:08:07.063000

As stated by @Arthur Bricq, the best solution depends a lot on the shape of your data. Unfortunately, here's no "one solution fits all approach". Having said that, the following code was able to correctly rename the columns from your example, using the fuzzywuzzy package:

Note: You need to pip install fuzzywuzzy before running the code below. To do so, run the following command:

pip install fuzzywuzzy


from typing import Tuple, List
from fuzzywuzzy import fuzz
import pandas as pd


def find_similar(
    *dfs: pd.DataFrame, known_df: pd.DataFrame
) -> Tuple[List[pd.DataFrame], pd.DataFrame]:
    """
    Name columns in a list of dataframes based on the columns from known dataframe.

    Each of the dataframes in the list will be renamed based on the columns from the
    known dataframe. To avoid cases where two different unnamed columns match the
    same column from the known dataframe, the function matches each of the known columns
    to the unnamed column that contains the highest similarity score and then removes
    the matched columns from the list of possible columns. For example:

    >>> x = pd.DataFrame(
    ...     {'Large Value Column' :[100, 102, 110, 98,]
    ...      'Small Value Column': [5, 12, 8, 10,]}
    ... )
    >>> y = pd.DataFrame(
    ...     {0: [6, 9, 11, 14],
    ...      1: [20, 25, 55, 65]}
    ... )
    >>> pd.DataFrame(
    ...     [
    ...         [
    ...             fuzz.ratio(
    ...                 list(map(str, y[col].values)), list(map(str, x[_col].values))
    ...             ) for _col in x.columns
    ...         ] for col in y.columns
    ...     ],
    ...     index=y.columns,
    ...     columns=x.columns,
    ... )
       Large Value Column  Small Value Column
    0                  73                  77
    1                  71                  74

    In the above example, "Small Value Column" has the highest similarity score for
    both columns "0" and "1". However, since "0" has the highest similarity score
    then, "1" will be renamed to "Large Value Column".

    Parameters
    ----------
    dfs : pd.DataFrame
        List of dataframes to rename.
    known_df : pd.DataFrame
        Pandas dataframe with the correct column names.

    Returns
    -------
    Tuple[List[pd.DataFrame], pd.DataFrame]
        List of dataframes with renamed columns, and the known dataframe.
    """
    # Store the results in a list
    _dfs = []
    # Loop over the dataframes
    for df in dfs:
        # Create a dataframe that contains the similarities between the columns
        # in the known dataframe and the columns in the current dataframe.
        # The columns represent the known dataframe columns, and the index
        # represents the current dataframe columns.
        similarity_df = pd.DataFrame(
            [
                [
                    fuzz.ratio(
                        list(map(str, df[col].values)),
                        list(map(str, known_df[_col].values)),
                    )
                    for _col in known_df.columns
                ]
                for col in df.columns
            ],
            index=df.columns,
            columns=known_df.columns,
        )
        # Dictionary to map old and new column names
        rename_dict = {}

        # Keep track of the columns that have already been matched
        # to avoid naming two columns using the same name.
        used_names = []
        for col in similarity_df.max().sort_values(ascending=False).keys():
            old_name = similarity_df[~similarity_df.index.isin(used_names)][
                col
            ].idxmax()
            rename_dict[old_name] = col
            used_names.append(old_name)
        # Rename the columns in the current dataframe and append it to the list
        _dfs.append(df.rename(columns=rename_dict, errors="ignore"))
    return _dfs, known_df

Example

import pandas as pd


x = pd.DataFrame(
    {
        "Large Value Column": [100, 102, 110, 98],
        "Small Value Column": [5, 12, 8, 10],
    }
)
y = pd.DataFrame({0: [6, 9, 11, 14], 1: [20, 40, 55, 65]})
z = pd.DataFrame({0: [95, 80, 72, 100], 1: [0, 20, 14, 10]})

pd.concat(find_similar(y, z, known_df=x)[0])

# Returns:
#
#    Small Value Column  Large Value Column
# 0                   6                  20
# 1                   9                  40
# 2                  11                  55
# 3                  14                  65
# 0                   0                  95
# 1                  20                  80
# 2                  14                  72
# 3                  10                 100

Detecting similar columns across multiple files based on statistical profile

There are 2 best solutions below

Example

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in PANDAS-PROFILING

Related Questions in DATA-QUALITY

Related Questions in DATA-PROFILING

Trending Questions

Popular # Hahtags

Popular Questions