Finding string or empty column in pandas

44 Views Asked by At

This is to create a general use case of cleaning up tabular Data from the SEC EDGAR database. Considering this below table, I need to remove columns that may contain '[any_int]'

Column A Column B Column C Column D
val val nan nan
val val nan [1]
val val nan nan

The assumption here is that column names are unknown, as each company in the Database will have unique table structures.

cols_to_drop = df.columns[df.columns.str.contains('\[')]

Using the string.contains() method yields no results, though I was expecting it to assign Column D

1

There are 1 best solutions below

0
Navkar Jain On

A simple approach to get desired output.

import pandas as pd
import numpy as np

# Sample data (replace this with your actual data)
data = {
    'Column A': ['val', 'val', 'val'],
    'Column B': ['val', 'val', 'val'],
    'Column C': [np.nan, np.nan, np.nan],
    'Column D': [np.nan, '[1]', np.nan]
}

# Convert the data dictionary into a DataFrame
df = pd.DataFrame(data)

# Regular expression pattern to match any integer value inside square brackets
pattern = r'\[\d+\]'

# Identify columns containing the specified pattern
columns_to_drop = [col for col in df.columns if any(df[col].apply(lambda x: re.search(pattern, str(x))))]
df_cleaned = df.drop(columns=columns_to_drop)

print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(df_cleaned)