What is the required data format for the Google AutoML ".txt to .jsonl" script?

I'm trying to create a dataset for an entity recognition task in Google AutoML, using their script to convert my .txt files to .jsonl and save them in Google Cloud Storage, as explained in this tutorial. The data looks like this (from their example, the NCBI Disease Corpus):

"10021369   Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour<\/category> suppressor .  "

After uploading to GCS, the labels are not recognized at all. What data format is expected?

There is 1 solution below

I'm not sure whether inline tags such as <category="Modifier"> are supposed to work, but as far as I know, the Quickstart annotates the data in the following way:

{"annotations": [
{"text_extraction": {"text_segment": {"end_offset": 85, "start_offset": 52}}, "display_name": "Modifier"}, 
{"text_extraction": {"text_segment": {"end_offset": 144, "start_offset": 103}}, "display_name": "Modifier"}, 
{"text_extraction": {"text_segment": {"end_offset": 391, "start_offset": 376}}, "display_name": "Modifier"}, 
{"text_extraction": {"text_segment": {"end_offset": 1008, "start_offset": 993}}, "display_name": "Modifier"}, 
{"text_extraction": {"text_segment": {"end_offset": 1137, "start_offset": 1131}}, "display_name": "SpecificDisease"}], 
"text_snippet": {"content": "10021369\tIdentification of APC2, a homologue of the adenomatous polyposis coli tumour suppressor .\tThe ... APC - / - colon carcinoma cells . Human APC2 maps to chromosome 19p13 . 3. APC and APC2 may therefore have comparable functions in development and cancer .\n "}
}
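As a quick sanity check on a record like the one above, you can slice the content with each pair of offsets and see which text each annotation covers. This is only a minimal sketch: the helper name annotated_spans is made up for illustration, the record is shortened to the first annotation, and it assumes the offsets count characters the way Python string indexes do. In this example end_offset behaves like the end of a Python slice (it is not included in the span).

```python
import json

# One record in the same shape as the example above, shortened to the
# first annotation and the first sentence for illustration.
record_line = json.dumps({
    "annotations": [
        {"text_extraction": {"text_segment": {"start_offset": 52, "end_offset": 85}},
         "display_name": "Modifier"},
    ],
    "text_snippet": {
        "content": "10021369\tIdentification of APC2, a homologue of the "
                   "adenomatous polyposis coli tumour suppressor ."
    },
})


def annotated_spans(jsonl_line):
    """Return (label, text) pairs for every annotation in one JSONL line.

    Hypothetical helper, not part of Google's script.
    """
    record = json.loads(jsonl_line)
    content = record["text_snippet"]["content"]
    spans = []
    for annotation in record.get("annotations", []):
        segment = annotation["text_extraction"]["text_segment"]
        # Assumption: offsets are character positions, end_offset exclusive.
        text = content[segment["start_offset"]:segment["end_offset"]]
        spans.append((annotation["display_name"], text))
    return spans


for label, text in annotated_spans(record_line):
    print(f"{label}: {text!r}")
```

If a printed span is shifted or cut off, the offsets in that record are probably wrong.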

After importing the dataset, the AutoML NL UI shows the five annotations that are specified in the jsonl.

For more detail on the jsonl structure of the example above, take a look at the sample files from the Quickstart:

$ gsutil cat gs://cloud-ml-data/NL-entity/dataset.csv
TRAIN,gs://cloud-ml-data/NL-entity/train.jsonl
TEST,gs://cloud-ml-data/NL-entity/test.jsonl
$ gsutil cat gs://cloud-ml-data/NL-entity/train.jsonl

If you are using the Python script for your own text strings, you will see that it generates a CSV file (dataset.csv) and jsonl files with content like:

{"text_snippet": {"content": "This is a disease\n Second line blah blabh"}, "annotations": []} 

So you will need to specify the annotations yourself (using start_offset and end_offset), which can be overwhelming to do by hand, or you can upload the CSV file in the AutoML UI and label entities interactively.
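If your texts already carry inline tags like the ones in the question, the offsets can be computed instead of typed by hand. The following is a rough sketch, not part of Google's script: the function name tagged_text_to_record and the regular expression are my own assumptions about the tag syntax (including the escaped closing form <\/category>), and it assumes tags are not nested.

```python
import json
import re

# Assumed tag syntax: <category="Label">entity text</category>.
# The optional backslash also accepts the escaped closing tag <\/category>.
TAG_PATTERN = re.compile(r'<category="([^"]+)">(.*?)<\\?/category>', re.DOTALL)


def tagged_text_to_record(tagged):
    """Strip inline tags and return a dict in the annotated jsonl shape."""
    parts = []        # pieces of the plain (untagged) text
    annotations = []
    position = 0      # length of the plain text emitted so far
    cursor = 0        # read position in the tagged input
    for match in TAG_PATTERN.finditer(tagged):
        before = tagged[cursor:match.start()]
        parts.append(before)
        position += len(before)
        label, entity = match.group(1), match.group(2)
        annotations.append({
            "text_extraction": {"text_segment": {
                "start_offset": position,
                "end_offset": position + len(entity),
            }},
            "display_name": label,
        })
        parts.append(entity)
        position += len(entity)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return {"annotations": annotations,
            "text_snippet": {"content": "".join(parts)}}


sample = ('10021369\tIdentification of APC2, a homologue of the '
          '<category="Modifier">adenomatous polyposis coli tumour</category>'
          ' suppressor .')
# One JSON object per line is what a .jsonl file expects.
print(json.dumps(tagged_text_to_record(sample)))
```

For the sample sentence this yields start_offset 52 and end_offset 85, matching the first annotation in the Quickstart example. Write one such line per document into the .jsonl file that dataset.csv points to.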