Pandas read txt file delimiter appears in text


I have a dataset in the form of a text file that looks like this:

beer_name: Legbiter
beer_id: 19827
brewery_name: Strangford Lough Brewing Company Ltd
brewery_id: 10093
style: English Pale Ale
abv: 4.8
date: 1357729200
user_name: AgentMunky
user_id: agentmunky.409755
appearance: 4.0
aroma: 3.75
palate: 3.5
taste: 3.5
overall: 3.75
rating: 3.64
text: Poured from a 12 ounce bottle into a pilsner glass.A: A finger of creamy head with clear-dark amber body.S: Rich brown sugar. Malty...T: Slight sugars, dry malt, vague hops. Big malty-brown with sugar.M: Dry and slightly astringent before a boring endtaste.O: Solid beer. Drinkable and interesting. Still vaguely bland.
review: True

I am using the following call to try to turn it into a proper DataFrame (with a little more processing afterwards, but this is where it throws an error):

rb_file_data = pd.read_csv(os.path.join(MATCHED_BEER_DIR, 'ratings_with_text_rb.txt'), sep=":", header=None, names=["Key", "Value"])

The issue I have is that some reviews use ":" in the text part (I specifically chose to show you one containing some), which throws the following error:

ParserError: Error tokenizing data. C error: Expected 2 fields in line 34, saw 7

I have enough data to get rid of the whole review if needed, but would be happy to keep it if possible.

Is there a way to split only on the first occurrence of the separator in each line, or some other way around this?


There are 3 best solutions below

0
Alok On

You can try the code below, which reads the file manually and splits each line on the first ":" only:

import pandas as pd
import os

MATCHED_BEER_DIR = "give your directory path"

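# Read the file line by line instead of letting read_csv tokenize it
with open(os.path.join(MATCHED_BEER_DIR, 'ratings_with_text_rb.txt'), 'r') as file:
    lines = file.readlines()

# Split on the first ':' only, so colons inside the review text are kept;
# lines without a ':' (e.g. blank separator lines) are skipped
data = [line.strip().split(':', 1) for line in lines if ':' in line]

rb_file_data = pd.DataFrame(data, columns=["Key", "Value"])

# Remove the space that follows the ':' in each value
rb_file_data['Value'] = rb_file_data['Value'].str.strip()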

print(rb_file_data)
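
If you would rather end up with one row per review than a long Key/Value table, the same first-colon split can be used to group lines into records. This is only a minimal sketch: `parse_records` is a hypothetical helper, the sample text is abridged from the question, and it assumes every review starts with a `beer_name` line, which holds for the sample shown but is not confirmed for the whole file.

```python
import io


def parse_records(lines, first_key="beer_name"):
    """Group 'key: value' lines into one dict per review.

    Assumption: each review begins with a `first_key` line.
    """
    records = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        # partition() splits on the first ':' only, so colons in the text survive
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no ':' at all, so this is not a key/value line
        key = key.strip()
        if key == first_key or current is None:
            current = {}  # a new review starts here
            records.append(current)
        current[key] = value.strip()
    return records


# Abridged, made-up sample in the same layout as the question's file
sample = io.StringIO(
    "beer_name: Legbiter\n"
    "abv: 4.8\n"
    "text: Poured from a 12 ounce bottle.A: A finger of creamy head.\n"
    "review: True\n"
    "\n"
    "beer_name: Legbiter\n"
    "abv: 4.8\n"
    "text: S: Rich brown sugar.\n"
    "review: True\n"
)

records = parse_records(sample)
print(len(records))
print(records[0]["text"])
```

Passing the resulting list of dicts to `pd.DataFrame(records)` should give one row per review with the keys as columns.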
0
Timeless On

You can try this:

df = (pd.read_csv("ratings_with_text_rb.txt", header=None, engine="python",
                  sep=r"(.+?):\s*(.+)")  # group 1 = key, group 2 = value
            [[1, 2]].set_index(1).T)  # the transpose is optional

If the file consists of blocks of 17 entries each (which is most likely the case), you can use:

N = 17

tmp = pd.read_csv("ratings_with_text_rb.txt", header=None,
                  engine="python", sep=r"(.+?):\s*(.+)")[[1, 2]]

out = tmp.set_axis(tmp.index // N).set_index(1, append=True)[2].unstack(1)

Output:

print(df)

1 beer_name beer_id         brewery_name  ... rating                 text review
0  Legbiter   19827  Strangford Lough...  ...   3.64  Poured from a 12...   True

[1 rows x 17 columns]
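
To see why columns 1 and 2 are the ones selected, it can help to run the same pattern through `re.split` on a single line. This is a standalone sketch using only the standard library; it assumes the python engine splits each line with the regex in roughly the way `re.split` does, and the sample line is abridged from the question.

```python
import re

# Abridged line from the question, with a second ':' inside the review text
line = "text: Poured from a 12 ounce bottle.A: A finger of creamy head."

# Same pattern as the `sep` above: a lazy key, the first ':', then the rest
pattern = r"(.+?):\s*(.+)"

# The pattern matches the whole line, so re.split returns the (empty) text
# before the match, the two captured groups, and the (empty) text after it
parts = re.split(pattern, line)
print(parts)
```

The key and the value land at positions 1 and 2, with empty strings on either side, which matches the `[[1, 2]]` selection in the code above.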
0
amance On

Similar to https://stackoverflow.com/a/54504598/17142551. You can use:

rb_file_data = pd.read_csv('ratings_with_text_rb.txt', sep='^([^:]+): ', engine='python', usecols=['beer_name', 'Legbiter'])