Using regex to extract information from a large SFrame or dataframe without using a loop


I have the following code, in which I use a loop to extract some information and use it to create a new matrix. However, because of the loop, this code takes forever to finish.

I wonder if there is a better way of doing this by using GraphLab's SFrame or pandas dataframe. I appreciate any help!

# This is the regex pattern
# This is the regex pattern (I'm using the third-party `regex` module)
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"

# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: regex.match(pattern_topic_entry_read, x) is not None)]

# Then for each record in the final set, 
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry  = match.group('topic'), match.group('entry')

        # Then, I need to create a new SFrame (or dataframe, or anything suitable) 
        newRow = gl.SFrame({'user_id':[request['user_id']], 
                            'url':[request['url']], 
                            'topic':[topic], 'entry':[entry]})

        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)

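Independent of any SFrame/pandas vectorization, the repeated `append` of one-row frames is the main cost here; collecting the matched fields into plain lists and building the frame once at the end is already much faster. A minimal sketch using the stdlib `re` (with a loosened `\d+` pattern so it also matches the sample URLs below; the two example rows are made up from the question's table):

```python
import re

# loosened widths (\d+ instead of \d{9}/\d{15}) so the pattern fits the sample data
pattern = re.compile(r"/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read")

rows = [
    {'user_id': 1002,
     'url': '/123456832960900/discussion_topics/770000834562343/entries/832350330/read'},
    {'user_id': 1001,
     'url': '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307'},
]

# accumulate columns in plain lists instead of appending one-row frames
user_ids, urls, topics, entries = [], [], [], []
for row in rows:
    m = pattern.search(row['url'])
    if m:
        user_ids.append(row['user_id'])
        urls.append(row['url'])
        topics.append(m.group('topic'))
        entries.append(m.group('entry'))

# then build the frame with a single constructor call, e.g.
# entry_read_matrix = gl.SFrame({'user_id': user_ids, 'url': urls,
#                                'topic': topics, 'entry': entries})
```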
Some sample data:

user_id | url
1000    | /123456832960900/discussion_topics/770000832912345/read
1001    | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002    | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003    | /123456832960900/discussion_topics/770000534344444/entries/832350367/read

I want to obtain this:

user_id | topic           | entry
1002    | 770000834562343 | 832350330
1003    | 770000534344444 | 832350367
There are 2 answers below.

Best answer:

Here, let me reproduce it:

>>> import pandas as pd
>>> df = pd.DataFrame(columns=["user_id","url"])
>>> df.user_id = [1000,1001,1002,1003]
>>> df.url = ['/123456832960900/discussion_topics/770000832912345/read', '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307', '/123456832960900/discussion_topics/770000834562343/entries/832350330/read','/123456832960900/discussion_topics/770000534344444/entries/832350367/read']
>>> df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---")
>>> df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---")
>>> df[df.entry!="---"]

gives you the desired DataFrame.
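Run end-to-end and trimmed to the three columns requested in the question, the same approach looks like this (a self-contained version of the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1000, 1001, 1002, 1003],
    "url": [
        "/123456832960900/discussion_topics/770000832912345/read",
        "/123456832960900/discussion_topics/770000832923456/view?per_page=832945307",
        "/123456832960900/discussion_topics/770000834562343/entries/832350330/read",
        "/123456832960900/discussion_topics/770000534344444/entries/832350367/read",
    ],
})

# "entries" appears as a path segment only in the URLs we want;
# the id after it is at index -2 and the topic id at index -4
df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---")
df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---")

# drop the placeholder rows and keep only the requested columns
result = df.loc[df.entry != "---", ["user_id", "topic", "entry"]]
```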

Second answer:

Pandas' Series has string methods for this. E.g., with your sample table in a string `data` (note the `io` and `re` imports):

import io
import re
import pandas as pd

pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id', engine='python')
df.url.str.extract(pattern, expand=True)

yields

                   topic      entry
user_id                            
1000     770000832912345        NaN
1001     770000832923456        NaN
1002     770000834562343  832350330
1003     770000534344444  832350367
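To get exactly the two-row table from the question, the unmatched rows (where `entry` is NaN) can be dropped. A self-contained sketch of the same approach, with the sample table spelled out as the `data` string the answer assumes:

```python
import io
import re

import pandas as pd

# the question's sample table, as the `data` string the answer assumes
data = """\
user_id | url
1000    | /123456832960900/discussion_topics/770000832912345/read
1001    | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002    | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003    | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
"""

pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id', engine='python')

# extract both groups, then keep only rows where an /entries/ id was captured
result = df.url.str.extract(pattern, expand=True).dropna(subset=['entry'])
```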