I'm trying to read numerical data from a .txt file into a pandas DataFrame, but it needs some wrangling. Some rows are misaligned (I think by extra tabs).
Snippet of data (pasting the table actually makes it appear aligned): [image: .txt dataset with mixed alignment]
What has worked so far is simply dropping the misaligned rows, but it's a small dataset and I'd like to retain every row. Code:
import pandas as pd

# Read the file, silently skipping any row with an unexpected number of fields
df = pd.read_table('path/file.txt', on_bad_lines='skip', header=None)
df
Output:
0
0 15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1
1 14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.9...
2 14.29\t14.09\t0.905\t5.291\t3.337\t2.699\t4.82...
Using read_table without skipping bad lines raises: 'ParserError: Error tokenizing data. C error: Expected 8 fields in line 8, saw 10'
I've also tried rewriting the .txt file to replace each tab with a single space (or a comma) and then reading the new file with that delimiter, but that brings me back to the same ParserError (strategy inspired by "Replace Tab with space in entire text file python").
inputFile = open('path/file.txt', 'r') # read mode
exportFile = open('path/file_v1.txt', 'w') # write mode
for line in inputFile:
    new_line = line.replace('\t', ',')
    exportFile.write(new_line)
inputFile.close()
exportFile.close()
(PS. Python beginner, and first StackOverflow problem. Thanks and sorry in advance if I missed some posting convention)
You can use the sep='\s+' parameter to specify how to split your data. It tells pandas that columns are separated by one or more whitespace characters (spaces or tabs), so a run of consecutive tabs is treated as a single delimiter. Try:
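As a minimal sketch of the sep='\s+' approach, the snippet below reads a small in-memory sample instead of the real file. The first row is taken from the question; the second row is invented here to mimic a misaligned line with doubled tabs, so treat it as an assumption about what the bad rows look like. With the actual data, the path 'path/file.txt' would be passed in place of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical sample data: row 1 comes from the question's output,
# row 2 is made up to imitate a row with two extra (doubled) tabs.
sample = (
    "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
    "14.11\t\t14.1\t0.8911\t5.42\t\t3.302\t2.7\t5.0\t1\n"
)

# sep=r"\s+" splits on any run of whitespace, so doubled tabs
# do not create extra empty fields.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(df.shape)
print(df)
```

Note that this only helps if the misalignment really comes from repeated whitespace. If a row genuinely has missing values, collapsing the delimiters would shift the remaining values into the wrong columns, so it is worth checking a few of the previously skipped rows by eye.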