Exporting annotated GATE file for further processing in Python causes character offset issues


I've used General Architecture for Text Engineering (GATE) to manually annotate data for a Named Entity Recognition (NER) task. The annotations are at corpus level and are stored with character start and end offsets, like this:

 startOffset endOffset annotationType
 12          17        personName
 21          28        organisationName

When I load the annotations and the original text file in Python, the character offsets do not match. For example, in GATE a given entity starting at offset 12 and ending at offset 20 may correctly correspond to a person's name, while the same offsets in Python select text that is slightly off.

Ideally, I would like to tokenize the corpus and replace annotations in the text with a tuple containing the word and its annotation type, like below:

"This", "is", "Jane", "Doe", "speaking"
"This", "is", ("Jane", personName), ("Doe", personName), "speaking"

This would make it easier to achieve my ultimate goal of transforming the data into IOB format, where each token is written on a new line, followed by its annotation tag, or by "O" if the token has no annotation. An entity can span multiple tokens; in that case, the first token of the entity receives the "(B)eginning" tag and the following tokens receive the "(I)nside" tag, as shown below:

This O
is O
Jane B-personName
Doe I-personName
speaking O
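
For illustration, the two steps described above (labelling tokens, then writing IOB lines) could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a finished converter: it assumes the offsets are already correct, it tokenizes on whitespace only, and the function names `label_tokens` and `labelled_to_iob` are made up for this example.

```python
import re

def label_tokens(text, annotations):
    """Replace annotated tokens with (token, type) tuples.

    `annotations` is a list of (start, end, type) character spans.
    Assumption: tokens are whitespace-separated.
    """
    labelled = []
    for match in re.finditer(r"\S+", text):
        # A token is labelled when it lies fully inside an annotation span.
        label = next(
            (kind for start, end, kind in annotations
             if start <= match.start() and match.end() <= end),
            None,
        )
        labelled.append((match.group(), label) if label else match.group())
    return labelled

def labelled_to_iob(labelled_tokens):
    """Turn the labelled token list into 'token TAG' lines."""
    lines = []
    previous_label = None
    for item in labelled_tokens:
        if isinstance(item, tuple):
            token, label = item
            # Same label as the previous token: continue the entity.
            prefix = "I" if label == previous_label else "B"
            lines.append(f"{token} {prefix}-{label}")
            previous_label = label
        else:
            lines.append(f"{item} O")
            previous_label = None
    return lines

text = "This is Jane Doe speaking"
annotations = [(8, 16, "personName")]  # covers "Jane Doe"
labelled = label_tokens(text, annotations)
print(labelled)
# ['This', 'is', ('Jane', 'personName'), ('Doe', 'personName'), 'speaking']
print("\n".join(labelled_to_iob(labelled)))
```

One limitation of going through the tuple list: two different entities of the same type that directly follow each other are merged into one B/I sequence, because the tuples no longer record where each annotation ends. Punctuation attached to a token (for example "Doe,") would also fall outside the span with this simple tokenizer.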

I am, however, unsure how to approach this problem: with Python, or within GATE using one of the many available plugins, for example the Groovy editor. There are multiple ways to export GATE files, for example as XML, as JSON, or as plain text using Groovy, but importing any of these file types results in mismatching character offsets, even when using UTF-8 encoding. For this reason, I first wrote a Python script to repair the wrong annotations. However, this code is long and inefficient, and it feels like I'm trying to solve a problem that should not be there in the first place. I've described my current high-level approach at the bottom of this post.
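
One possible cause worth ruling out, judging from the `&#xd;` entities in the sample XML further down: the text seems to use Windows line endings (`\r\n`), and Python's default text mode translates those to a single `\n` when reading, which shifts every offset after the first line break. The sketch below reproduces that effect with a temporary file. That the real corpus file has `\r\n` line endings is an assumption, and the file name is made up.

```python
import os
import tempfile

# Two lines of the sample text, written with Windows line endings.
raw = ("Sweden is a country in Europe.\r\n"
       "Paris is the capital of France.\r\n").encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")  # hypothetical file name
    with open(path, "wb") as handle:
        handle.write(raw)

    # Default text mode: universal newlines turn "\r\n" into "\n".
    with open(path, encoding="utf-8") as handle:
        translated = handle.read()

    # newline="" disables the translation, so "\r\n" stays two characters.
    with open(path, encoding="utf-8", newline="") as handle:
        preserved = handle.read()

print(repr(translated[32:37]))  # 'aris ', shifted by one character
print(repr(preserved[32:37]))   # 'Paris', matches StartNode=32, EndNode=37
```

A second thing to check if the text contains emoji or other characters outside the Basic Multilingual Plane: GATE runs on Java, whose strings are indexed in UTF-16 code units, so such a character counts as two positions there but as one in a Python `str`.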

For reference, a sample XML snippet is shown below. The input is the following text, in which country and city names are labelled as "Toponyms":

Sweden is a country in Europe.
Paris is the capital of France.
In Spain it's often sunny.

This results in the following GATE output XML:

<?xml version='1.0' encoding='UTF-8'?>
<GateDocument>
<!-- The document content area with serialized nodes -->

<TextWithNodes><Node id="0"/>Sweden<Node id="6"/> is a country in <Node id="23"/>Europe<Node id="29"/>.&#xd;
<Node id="32"/>Paris<Node id="37"/> is the capital of <Node id="56"/>France<Node id="62"/>.&#xd;
In <Node id="68"/>Spain<Node id="73"/> it's often sunny.<Node id="91"/></TextWithNodes>
<!-- The default annotation set -->

<AnnotationSet>
<Annotation Id="1" Type="gateFinal" StartNode="0" EndNode="6">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">Toponyms</Value>
</Feature>
</Annotation>
<Annotation Id="2" Type="gateFinal" StartNode="23" EndNode="29">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">Toponyms</Value>
</Feature>
</Annotation>
<Annotation Id="3" Type="gateFinal" StartNode="56" EndNode="62">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">Toponyms</Value>
</Feature>
</Annotation>
<Annotation Id="4" Type="gateFinal" StartNode="32" EndNode="37">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">Toponyms</Value>
</Feature>
</Annotation>
<Annotation Id="5" Type="gateFinal" StartNode="68" EndNode="73">
<Feature>
  <Name className="java.lang.String">kind</Name>
  <Value className="java.lang.String">Toponyms</Value>
</Feature>
</Annotation>
</AnnotationSet>
</GateDocument>
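
As a sketch of how this XML could be read without a separate text file, the snippet below rebuilds the document text from `TextWithNodes` with Python's standard `xml.etree.ElementTree` and slices it with the `StartNode`/`EndNode` values. The XML parser keeps `&#xd;` as a carriage return, so the offsets line up for this sample. The embedded sample is abridged to three annotations, and `load_gate_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above (annotations 1, 4 and 5 only).
SAMPLE = """<GateDocument>
<TextWithNodes><Node id="0"/>Sweden<Node id="6"/> is a country in <Node id="23"/>Europe<Node id="29"/>.&#xd;
<Node id="32"/>Paris<Node id="37"/> is the capital of <Node id="56"/>France<Node id="62"/>.&#xd;
In <Node id="68"/>Spain<Node id="73"/> it's often sunny.<Node id="91"/></TextWithNodes>
<AnnotationSet>
<Annotation Id="1" Type="gateFinal" StartNode="0" EndNode="6">
<Feature><Name>kind</Name><Value>Toponyms</Value></Feature>
</Annotation>
<Annotation Id="4" Type="gateFinal" StartNode="32" EndNode="37">
<Feature><Name>kind</Name><Value>Toponyms</Value></Feature>
</Annotation>
<Annotation Id="5" Type="gateFinal" StartNode="68" EndNode="73">
<Feature><Name>kind</Name><Value>Toponyms</Value></Feature>
</Annotation>
</AnnotationSet>
</GateDocument>"""

def load_gate_xml(xml_source):
    """Return (text, annotations) from a GATE XML string.

    Each annotation is (start, end, type, features), sorted by start.
    """
    root = ET.fromstring(xml_source)
    # Concatenating all text pieces between the <Node/> markers
    # restores the original text, including "\r" from "&#xd;".
    text = "".join(root.find("TextWithNodes").itertext())
    annotations = []
    for ann in root.iter("Annotation"):
        features = {f.findtext("Name"): f.findtext("Value")
                    for f in ann.iter("Feature")}
        annotations.append((int(ann.get("StartNode")),
                            int(ann.get("EndNode")),
                            ann.get("Type"),
                            features))
    return text, sorted(annotations, key=lambda a: a[0])

text, annotations = load_gate_xml(SAMPLE)
for start, end, _type, features in annotations:
    print(start, end, text[start:end], features["kind"])
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. This only shows that the offsets agree for the sample above; text with characters outside the Basic Multilingual Plane may still need extra care.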

Currently I "repair" the incorrect offsets by tokenizing the input corpus while remembering the character offset of each token. Then I retrieve the annotations from the "AnnotationSet" part of the XML file. Next, I loop through the list of all annotations, find the first occurrence of each annotated token, and continue the search from the last token found, storing the corrected annotation offset each time. Finally, I merge the corpus offsets with the annotation offsets. However, a small mistake can cause this script to crash, which is why I'm curious to find a better solution.
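
To make the repair step concrete, here is a stripped-down sketch of the search-based realignment described above. It is not my actual script: it assumes the annotated surface strings are available in document order, and `realign` is a name chosen for this example.

```python
def realign(text, mentions):
    """Recompute spans for (surface, type) pairs given in document order."""
    spans = []
    cursor = 0  # search resumes after the previously found mention
    for surface, kind in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            # Fail loudly instead of silently producing wrong offsets.
            raise ValueError(f"{surface!r} not found after offset {cursor}")
        cursor = start + len(surface)
        spans.append((start, cursor, kind))
    return spans

# Text as Python sees it after "\r\n" has been turned into "\n".
text = ("Sweden is a country in Europe.\n"
        "Paris is the capital of France.\n")
mentions = [("Sweden", "Toponyms"), ("Europe", "Toponyms"),
            ("Paris", "Toponyms"), ("France", "Toponyms")]
print(realign(text, mentions))
# [(0, 6, 'Toponyms'), (23, 29, 'Toponyms'), (31, 36, 'Toponyms'), (55, 61, 'Toponyms')]
```

The weakness is visible even in this short form: if an unannotated occurrence of the same string appears before the annotated one, `find` picks the wrong position, and every later mention is then searched from the wrong place.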
