Extracting data from text (in CSV file) with Python

207 Views Asked by At

I'm helping a dog rescue group analyzing their incoming application for adoption. All the applications come in through an online system, and each application is given an automatically generated Form ID. The applications will then be assigned to different volunteers to process.

Most of the information is straightforward and I can use pandas to process with ease. Part of the analysis is about the processing time for each application. That means from the date the application is created (form status as "Submitted") to a dog is adopted (form status as "Adopted"). When I exported the form data, the status changes and the general comments are mixed under one header called "Comments (inline)."

Here is an abridged example of what the status changes/comments look like for one application.

Abridged view for one application

The text follows some basic patterns.

General comment: CURRENT_PERSON(first name) wrote on DATE: text

Change in volunteers: CURRENT_PERSON(first name) wrote on DATE: Assigned form to NEW_PERSON(first last)

Form status change: CURRENT_PERSON(first name) wrote on DATE: Changed Status from CURRENT_STATUS to NEW_STATUS

I'm new to python (~3mos). The first thing that came to mind is using python and regular expression to parse the text, extract the data, and separate them into two groups (one for general comments and one for status changes). This will take some time since I'm still learning but it seems doable. And the end result will be something like this.

Possible result

The same status can be assigned multiple times so I'd need to give them a number. Then I can calculate the processing time as the days between Adopted-01 and Submitted-01.

However, from what I've learned so far, it seems that using this type of iteration (for loop) is slow and not recommended. I have about 1500 forms so far and the number will only go up.

Should I go ahead with python and re? or is there a better way to get what I described? I strongly believe there is a much better way I just don't know enough yet.

Suggestions are greatly appreciated.

UPDATE

Sample data here in CSV file, here in XML file

You'll find four names in there: Jenny White, Rose Burk, Kerry James, Henry Woods.

I realize that the text CSV file is not as clean as I thought it would be. There is no space in between the comments or the status change. Even though there are patterns, it's not always consistent(first name only vs full name). I updated the patterns. With exporting into CSV, you'll see the following

Henry wrote on 9/22/2020: Assigned form to Rose Burk Rose wrote on 9/22/2020: Sent intro email.

became

Henry wrote on 9/22/2020: Assigned form to Rose BurkRose wrote on 9/22/2020: Sent intro email.

Now I need to look at cleaning up the data much more as well. Thanks for taking a look.

1

There are 1 best solutions below

0
On BEST ANSWER

So I figured out how to use the regular expressions to process the comments based on the patterns.

This following one can grab all different dates and all statuses based on the fact that they all have the section "Changed Status from" and I can match the individual group to whether that's an old status or a new status.

regex = r"(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)\sto\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)"

And this one can grab just the date I need if I only focus on Adopted.

regex = r"(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s([A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?)\sto\sAdopted(?!\sElsewhere)

Also, I do not need to use the for loop. In the particular case focusing on the date for Adopted, I can just use the following to add the adopted date to my dataframe in Jupyter notebook.

df['Adopted']=df['Comments (inline)'].str.extract(r'(\d{1,2}\/\d{1,2}\/\d{4})\:\sChanged\sStatus\sfrom\s[A-Z][a-z]+(\s[a-z]+)?(\s[A-Z][a-z]+)?\sto\sAdopted(?!\sElsewhere)')

This also means that I can just use the original CSV file.