I have to parse a log txt file with regex in Python. This is an example of the txt file (named file):
20/01/18, 08:11 - Peter: Good morning
How are you?
Peter 20/01/18, 09:00 - Caroline: I am fine thanks. You?
20/01/18, 09:01 - Peter: Good
I had some problems few days ago.
Now I am happy
Are you working?
20/01/18, 09:02 - Caroline: No I have to go to the supermarket to buy vegetables
20/01/18, 09:12 - Peter: Nice!
Where are you now?
I tried to parse the whole text with this regular expression:
import re
import pandas as pd

with open(file, 'r', encoding='utf-8') as f:
    texts = re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', f.read())
df = pd.DataFrame(texts, columns=['data', 'name', 'text'])
However, I have problems matching messages that span one or more newlines (for example, Peter's message at 09:01). I also tried https://regex101.com/ to find a possible solution, but I didn't succeed.
Can you help me please?
By default, . will not match a newline. You need to use DOTALL mode (the re.DOTALL flag) to make it match newlines.
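As a small illustration of the difference, here is a sketch on a two-line sample string. The sample string and the variable names are made up for the demonstration and are not the asker's data file.

```python
import re

# Hypothetical two-line message, shaped like the log in the question.
sample = "20/01/18, 09:01 - Peter: Good\nI had some problems few days ago.\n"

# Default mode: "." does not match "\n", so (.*) stops at the end of the first line.
default = re.search(r"Peter: (.*)", sample)

# DOTALL mode: "." also matches "\n", so the greedy (.*) runs to the end of the string.
dotall = re.search(r"Peter: (.*)", sample, flags=re.DOTALL)

print(repr(default.group(1)))
print(repr(dotall.group(1)))
```

Because the greedy (.*) in DOTALL mode runs to the end of the input, applying the flag to the whole file would swallow every later message into the first match.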
This does not solve the problem of matching the entire rest of the text, though!
See @the-fourth-bird's answer for a real solution.
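That answer is not reproduced here. One pattern in a similar spirit, which is my own assumption rather than a quote from it, keeps "." in its default mode and explicitly consumes continuation lines that do not start with a timestamp. It assumes every message header begins at the start of a line.

```python
import re

# Hypothetical excerpt shaped like the log in the question.
log = (
    "20/01/18, 08:11 - Peter: Good morning\n"
    "How are you?\n"
    "20/01/18, 09:00 - Caroline: I am fine thanks. You?\n"
    "20/01/18, 09:01 - Peter: Good\n"
    "I had some problems few days ago.\n"
    "Now I am happy\n"
)

pattern = re.compile(
    r"^(\d+/\d+/\d+, \d+:\d+) - (.+?): "          # timestamp and sender
    r"(.*(?:\n(?!\d+/\d+/\d+, \d+:\d+ - ).*)*)",  # first line plus any continuation lines
    flags=re.MULTILINE,
)

# Strip the trailing newline that the last message can pick up at end of input.
messages = [(date, name, text.rstrip("\n")) for date, name, text in pattern.findall(log)]

for row in messages:
    print(row)
```

The negative lookahead stops the text group right before the next line that looks like a message header, so each tuple holds one complete message.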
Another, more explicit way to handle it is to read the file line by line and check whether each line is a continuation of the previous message or not.
This may be easier to reason about.
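A minimal sketch of that line-by-line approach follows. The function name parse_lines and the HEADER pattern are illustrative choices of mine, and the sketch assumes a message header always starts at the beginning of a line.

```python
import re

# A line that starts a new message: "date, time - name: text".
HEADER = re.compile(r"(\d+/\d+/\d+, \d+:\d+) - (.+?): (.*)")

def parse_lines(lines):
    """Group header lines and their continuation lines into (date, name, text) tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.match(line)
        if m:
            # New message: start a fresh record.
            records.append(list(m.groups()))
        elif records:
            # Continuation: append to the text of the previous message.
            records[-1][2] += "\n" + line
    return [tuple(r) for r in records]

# Hypothetical input; any iterable of lines works, including an open file object.
rows = parse_lines([
    "20/01/18, 09:01 - Peter: Good\n",
    "I had some problems few days ago.\n",
    "20/01/18, 09:12 - Peter: Nice!\n",
    "Where are you now?\n",
])
print(rows)
```

With the real file, the same function could be called on the open file handle, and the resulting list passed to pd.DataFrame with the columns from the question.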