Let's assume that I have the following pandas dataframe:
id |opinion
1 |Hi how are you?
...
n-1|Hello!
I would like to create a new pandas POS-tagged column like this:
id| opinion |POS-tagged_opinions
1 |Hi how are you?|hi\tUH\thi
how\tWRB\thow
are\tVBP\tbe
you\tPP\tyou
?\tSENT\t?
.....
n-1| Hello |Hello\tUH\tHello
!\tSENT\t!
From the documentation and a tutorial, I tried several approaches, particularly:
df.apply(postag_cell, axis=1)
and
df['content'].map(postag_cell)
Therefore, I created this POS-tag cell function:
import pandas as pd

df = pd.read_csv('/Users/user/Desktop/data2.csv', sep='|')
print df.head()

def postag_cell(pandas_cell):
    import pprint  # For proper print of sequences.
    import treetaggerwrapper
    # 1) build a TreeTagger wrapper:
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
    # 2) tag your text:
    y = [i.decode('UTF-8') if isinstance(i, basestring) else i for i in [pandas_cell]]
    tags = tagger.tag_text(y)
    # 3) use the tags list... (list of string output from TreeTagger).
    return tags

#df.apply(postag_cell, axis=1)
#df['content'].map(postag_cell)

df['POS-tagged_opinions'] = df['content'].apply(postag_cell)
print df.head()
The above function returns the following:
user:~/PycharmProjects/misc_tests$ time python tagging\ with\ pandas.py
id| opinion |POS-tagged_opinions
1 |Hi how are you?|[hi\tUH\thi
how\tWRB\thow
are\tVBP\tbe
you\tPP\tyou
?\tSENT\t?]
.....
n-1| Hello |Hello\tUH\tHello
!\tSENT\t!
--- 9.53674316406e-07 seconds ---
real 18m22.038s
user 16m33.236s
sys 1m39.066s
The problem is that with a large number of opinions it takes a lot of time:
How can I perform POS-tagging more efficiently and in a more pythonic way with pandas and treetagger? I believe this issue is due to my limited pandas knowledge, since I was able to tag the opinions very quickly with treetagger alone, outside of a pandas dataframe.
There are some obvious modifications that can be done to gain a reasonable amount of time (such as removing the imports and the instantiation of the TreeTagger class from the
postag_cell
function). Then the code can be parallelized. However, the majority of the work is done by treetagger itself. As I don't know anything about this software, I can't tell whether it can be further optimized.

The minimal working code:
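Something along these lines — a minimal sketch of the pattern. A stand-in `tag` function is used here in place of TreeTagger so the sketch runs without the external binary; for real tagging you would replace it with `treetaggerwrapper.TreeTagger(TAGLANG='en').tag_text`. The key point is that the tagger is built once, not once per cell:

```python
import pandas as pd

# Stand-in for TreeTagger so this sketch runs without the binary.
# Real code: import treetaggerwrapper
#            tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
#            tag = tagger.tag_text
def tag(text):
    # Mimics TreeTagger's output: a list of "word\tPOS\tlemma" strings.
    return ['%s\tXX\t%s' % (w, w.lower()) for w in text.split()]

def postag_cell(text):
    # Join the per-token strings into one newline-separated cell value.
    return '\n'.join(tag(text))

# Toy data in place of reading data2.csv.
df = pd.DataFrame({'id': [1, 2], 'opinion': ['Hi how are you?', 'Hello!']})
df['POS-tagged_opinions'] = df['opinion'].apply(postag_cell)
print(df)
```

The tagger (here the stand-in `tag`) lives at module level, so `apply` only pays the per-row tagging cost, not the per-row startup cost.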
I'm not using
pd.read_csv(filename, sep='|')
because your input file is "misformatted" - it contains unescaped
|
characters in some text opinions.

(Update:) After the format fix, the output file looks like this:
If the formatting is not exactly what you want, we can work it out.
Parallelized code
It may give some speedup, but don't expect miracles. The overhead of the multiprocess setup may even exceed the gains. You can experiment with the number of processes
nproc
(here set by default to the number of CPUs; setting more than this is inefficient).

Treetaggerwrapper has its own multiprocess class. I suspect it does more or less the same thing as the code below, so I didn't try it.
Update
In Python 3, all strings are unicode by default, so you can save some trouble and time on decoding/encoding. (In the code below, I also use plain numpy arrays instead of data frames in the child processes - but the impact of this change is insignificant.)

After single runs (so not really statistically significant), I'm getting these timings on your file:
If the only use of the pandas dataframe
pd
is to save everything back to a file, then the next step would be removing pandas from the code entirely. But again, the gain would be insignificant compared with treetagger's work time.
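Removing pandas could look like this — streaming the `|`-separated file row by row with the stdlib csv module. The sketch again uses a stand-in `tag` function and in-memory StringIO buffers in place of the real input and output files:

```python
import csv
import io

def tag(text):
    # Stand-in mimicking TreeTagger's "word\tPOS\tlemma" output.
    return '\n'.join('%s\tXX\t%s' % (w, w.lower()) for w in text.split())

# Toy input standing in for data2.csv; real code would open() the files.
fin = io.StringIO("id|opinion\n1|Hi how are you?\n2|Hello!\n")
fout = io.StringIO()

reader = csv.reader(fin, delimiter='|')
writer = csv.writer(fout, delimiter='|')
header = next(reader)
writer.writerow(header + ['POS-tagged_opinions'])
for row in reader:
    # row[1] is the opinion column; append its tags as a third field.
    writer.writerow(row + [tag(row[1])])

print(fout.getvalue())
```

This avoids building a dataframe at all, though as noted above the saving is negligible next to the tagging itself.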