How can I remove a different-format header from my tsv file using read_csv() to convert to pandas?

81 Views Asked by At

I'm trying to read in a bunch of files into a pandas data frame using read_csv(), however the files are formatted like this:

# {parms: values,
# more params: more values, 
# something else,
# params: values}

co1 col2 col3
value value value

Essentially, if I removed the part that starts with #, my file would be a perfectly easily-readable tsv.

Do I have to process all these files first to remove this header (reading it in and writing back every line that doesn't begin with #) or is there some way to do this directly using read_csv or another similar function?

When I try to read it in normally, it of course returns a rather broken-looking pandas data frame.

2

There are 2 best solutions below

2
On

You can remove the headings that start with # before reading them like this:

import pandas as pd

def preprocess_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    processed_lines = [line for line in lines if not line.startswith('#')]
    processed_content = ''.join(processed_lines)

    with open(file_path, 'w') as file:
        file.write(processed_content)

preprocess_file('your_file.csv') #removes hashtag lines

df = pd.read_csv('your_file.csv', delimiter='\t')

The only thing to note, is that it will modify your file, so if you would like to keep it the same, then make a copy of it first like this:

import shutil
import pandas as pd

def preprocess_file(file_path, output_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    processed_lines = [line for line in lines if not line.startswith('#')]
    processed_content = ''.join(processed_lines)

    with open(output_path, 'w') as file:
        file.write(processed_content)

shutil.copy2('your_file.csv', 'your_file_copy.csv')
preprocess_file('your_file_copy.csv', 'your_file_processed.csv')
df = pd.read_csv('your_file_processed.csv', delimiter='\t')
0
On

You could try the following:

pd.read_csv("./pandas_test.tsv", sep=" ", comment="#")

or the following if all your files have the same formatting

pd.read_csv("./pandas_test.tsv", sep=" ", skiprows=5)