Training data preparation for training an ELMo embedding from scratch


I am trying to build my own custom chemical-domain ELMo embedding, following the instructions at https://github.com/allenai/bilm-tf

How do I prepare the training data when the domain, like chemistry, has many multi-word tokens? For example:

1. Original Sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."

Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" is a single token. It contains several whitespace-separated words, so it would be split into three tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide'].
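The unwanted split can be reproduced with a plain whitespace split, which is effectively what happens to each line of a bilm-tf training file:

```python
# A plain whitespace split breaks the chemical name into three tokens.
chemical = "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"
tokens = chemical.split()
print(tokens)
# ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide']
```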

How can I avoid this? Can I give the input training data in one of the following formats instead?

Training data (1): a list of tokens for each sentence, so the training text file contains a list of token lists.

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]

Training data (2): the multi-word token joined with a '|' symbol: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence."
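For reference, format (2) could be produced mechanically if you keep a list of known multi-word terms; a minimal sketch (the helper name and term list are illustrative, not from any library):

```python
def join_multiword_terms(text, terms, sep="|"):
    """Replace the spaces inside known multi-word terms with a separator."""
    for term in terms:
        text = text.replace(term, term.replace(" ", sep))
    return text

terms = ["3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"]
sentence = ("This is a multi word chemical component "
            "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.")
print(join_multiword_terms(sentence, terms))
# This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide.
```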

Please advise on the best way to prepare the training data.

1 Answer

You can customise spaCy's tokenisation so that the chemical name is kept as a single token.

First, install the packages needed.

pip install spacy
python -m spacy download en_core_web_sm

Then, run the following code.

import spacy

nlp = spacy.load("en_core_web_sm")  # Initialise the pipeline

text = ("This is a multi word chemical component "
        "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.\n"
        "This is another sentence.")
chemical = "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"

# Note: tokenizer special cases (Tokenizer.add_special_case) cannot contain
# whitespace, because spaCy splits on spaces before applying them. Instead,
# merge the tokens covering the chemical name back into a single token.
output = []
for line in text.split("\n"):  # One sentence per line
    doc = nlp(line)
    start = line.find(chemical)
    if start != -1:
        span = doc.char_span(start, start + len(chemical),
                             alignment_mode="expand")
        if span is not None:
            with doc.retokenize() as retokenizer:
                retokenizer.merge(span)
    output.append([token.text for token in doc])

print(output)

It should print the following output:

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide', '.'], ['This', 'is', 'another', 'sentence', '.']]

You can then tweak the tokenizer further to your needs (e.g. fine-tuning how punctuation is handled), and write a script that feeds all of your chemistry terms in automatically. For more information on spaCy, refer to its documentation on the Tokenizer and on Linguistic Features. Although this response is late, I hope it helps future developers.
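One caveat when feeding the result to bilm-tf: its training files are read as whitespace-separated tokens, one sentence per line, so a token that still contains internal spaces would be re-split at training time. A sketch of one way to handle this (masking internal spaces with '|' when writing the file is my assumption, matching format (2) from the question, not something bilm-tf prescribes):

```python
# Token lists produced by the tokenisation step above.
sentences = [
    ['This', 'is', 'a', 'multi', 'word', 'chemical', 'component',
     '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide', '.'],
    ['This', 'is', 'another', 'sentence', '.'],
]

# Mask internal spaces so each multi-word term survives as one token.
lines = [" ".join(t.replace(" ", "|") for t in tokens) for tokens in sentences]

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```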