I'm trying to create a dataset for an entity recognition task in Google AutoML, using their script to convert my .txt files to .jsonl and save them in Google Cloud Storage, as explained in this tutorial. The data looks like this (from their example, the NCBI Disease Corpus):
"10021369 Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor . "
After uploading to GCS, the labels are not recognized at all. What data format is expected?
I'm not quite sure whether <category="Modifier"> should work, but as far as I know, the right way according to the Quickstart is to annotate in the following way:

After importing the dataset, in the AutoML NL UI you will see the five annotations that are specified in the jsonl:
For more detail on the jsonl structure of the example above, you can take a look at the sample files in the Quickstart:
If you are using the Python script for your own text strings, you will see that it generates a CSV file (dataset.csv) and jsonl files with content like:
So, you will need to specify the annotations yourself (using start_offset and end_offset), which can be a bit overwhelming to do manually, or you can upload the CSV file in the AutoML UI and label the entities interactively.
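If your texts already carry inline tags like the NCBI example in the question, the offsets do not have to be worked out by hand. Below is a minimal sketch, not part of Google's script, that strips the tags and records where each span lands in the plain text. The function name tagged_text_to_example is made up for illustration, and the JSON layout (annotations, text_extraction, text_segment, display_name, text_snippet) is my understanding of the entity extraction jsonl format, so please check it against the Quickstart sample files. I am also assuming that end_offset points one character past the end of the span.

```python
import json
import re

# Matches inline tags such as <category="Modifier">some text</category>.
# The optional backslash tolerates an escaped closing tag (<\/category>).
TAG_RE = re.compile(r'<category="([^"]+)">(.*?)<\\?/category>', re.DOTALL)


def tagged_text_to_example(tagged):
    """Hypothetical helper: strip inline tags and keep character offsets."""
    plain_parts = []
    annotations = []
    cursor = 0  # current position in the tagged input
    length = 0  # length of the plain text built so far
    for match in TAG_RE.finditer(tagged):
        before = tagged[cursor:match.start()]
        plain_parts.append(before)
        length += len(before)
        label, span = match.group(1), match.group(2)
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": length,
                    # Assumption: end_offset is exclusive.
                    "end_offset": length + len(span),
                }
            },
            "display_name": label,
        })
        plain_parts.append(span)
        length += len(span)
        cursor = match.end()
    plain_parts.append(tagged[cursor:])
    return {
        "annotations": annotations,
        "text_snippet": {"content": "".join(plain_parts)},
    }


if __name__ == "__main__":
    line = ('Identification of APC2, a homologue of the '
            '<category="Modifier">adenomatous polyposis coli tumour'
            '</category> suppressor .')
    # One JSON object per line is what makes the file "jsonl".
    print(json.dumps(tagged_text_to_example(line)))
```

Each printed line can be appended to a .jsonl file, uploaded to GCS, and referenced from the CSV. Nested or overlapping tags are not handled by this sketch.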