Count the number of labels in an IOB corpus with Pandas

From my IOB corpus such as:

    mention Tag
170     
171 467 O
172     
173 Vincennes   B-LOCATION
174 .   O
175     
176 Confirmation    O
177 des O
178 privilèges  O
179 de  O
180 la  O
181 ville   B-ORGANISATION
182 de  I-ORGANISATION
183 Tournai I-ORGANISATION
184 1   O
185 (   O
186 cf  O
187 .   O
188 infra   O
189 ,   O

I am trying to compute simple statistics, such as the total number of annotated mentions and the totals per label.

After loading my dataset with pandas, I counted the tags like this:

df = pd.Series(data['Tag'].value_counts(), name="Total").to_frame().reset_index()
df.columns = ['Label', 'Total']
df

Output :

    Label           Total
0   O               438528
1                   36235
2   B-LOCATION      378
3   I-LOCATION      259
4   I-PERSON        234
5   I-INSTALLATION  156
6   I-ORGANISATION  150
7   B-PERSON        144
8   B-TITLE         94
9   I-TITLE         89
10  B-ORGANISATION  68
11  B-INSTALLATION  62
12  I-EVENT         8
13  B-EVENT         2

First, how can I get a representation similar to the one above, but with the IOB prefixes merged into a single label, for example:

Label, Total
PERSON, 300
LOCATION, 154
ORGANISATION, 67
etc.

Second, how can I exclude the "O" labels and the empty strings from my output? I tried .mask() and .where() on my Series, but both failed.

Thank you for your leads.

There is 1 best solution below

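Strip the B- and I- prefixes, then group by label and sum the totals. The column names below match the question's frame (Label, Total). Note that slicing off the first two characters also turns "O" into an empty string, so it lands in the same group as the empty labels.

# Drop the two-character "B-"/"I-" prefix, then sum the totals per entity type
df['Label'] = df['Label'].str[2:]
df.groupby('Label', as_index=False)['Total'].sum()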
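
As a rough, self-contained sketch of this regrouping step, the snippet below rebuilds a small part of the counts table from the question (the `counts` name and the choice of rows are only for illustration). It swaps the slice for a regex replacement, which is an assumption on my part rather than the answer's method, so that labels without a prefix such as "O" stay unchanged.

```python
import pandas as pd

# Subset of the counts table shown in the question (illustrative only).
counts = pd.DataFrame({
    "Label": ["O", "", "B-LOCATION", "I-LOCATION", "I-PERSON", "B-PERSON"],
    "Total": [438528, 36235, 378, 259, 234, 144],
})

# Strip a leading "B-" or "I-"; labels without a prefix ("O", "") are untouched.
counts["Label"] = counts["Label"].str.replace(r"^[BI]-", "", regex=True)

# Sum the totals per entity type, largest first.
grouped = (
    counts.groupby("Label", as_index=False)["Total"]
    .sum()
    .sort_values("Total", ascending=False, ignore_index=True)
)
print(grouped)
```

With these rows, LOCATION sums to 378 + 259 = 637 and PERSON to 234 + 144 = 378, while "O" and the empty label remain as separate rows that can be filtered out afterwards.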

For the second part, keep only the rows whose label is longer than two characters; this drops both "O" and the empty strings:

df.loc[df['Label'].str.len() > 2]
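
Both steps can also be done directly on the raw Tag column. The following is a minimal sketch, assuming a frame shaped like the question's `data` (the toy rows are taken from its sample) and assuming the corpus follows the IOB2 convention, where every mention starts with exactly one B- tag. The variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the question's `data` frame.
data = pd.DataFrame({
    "mention": ["467", "", "Vincennes", ".", "ville", "de", "Tournai"],
    "Tag": ["O", "", "B-LOCATION", "O",
            "B-ORGANISATION", "I-ORGANISATION", "I-ORGANISATION"],
})

tags = data["Tag"]

# Keep only entity tags; this drops "O" and empty strings in one step.
# na=False treats missing values as non-matching (in case empty cells load as NaN).
entity_tags = tags[tags.str.match(r"[BI]-", na=False)]

# Tokens per entity type, with B- and I- tags merged.
tokens_per_type = (
    entity_tags.str[2:]
    .value_counts()
    .rename_axis("Label")
    .reset_index(name="Total")
)

# Mentions per entity type: count only the B- tags (IOB2 assumption).
mentions_per_type = (
    entity_tags[entity_tags.str.startswith("B-")]
    .str[2:]
    .value_counts()
    .rename_axis("Label")
    .reset_index(name="Total")
)

print(tokens_per_type)
print(mentions_per_type)
```

Counting tokens and counting mentions give different numbers: "ville de Tournai" contributes three ORGANISATION tokens but only one ORGANISATION mention.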