I am trying to build my own custom chemical-domain ELMo embedding, following the instructions from https://github.com/allenai/bilm-tf
How do I prepare the training data when the domain (chemistry, in my case) contains many multi-word tokens? For example:
1. Original Sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."
Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" is a single token. There are multiple words inside the token which are white space separated. This would lead to the above token to be split as 3 tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl','tetrazolium', 'bromide'].
How can I avoid this? Can I give the input training data in one of the following formats instead?
Training data (1): a list of tokens for each sentence, so the training text file holds a list of token lists.
[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]
Training data (2): here I have joined the parts of the multi-word token with a '|' symbol: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence."
Please advise on the best way to prepare the training data.
You can create your own custom spaCy Tokenizer by adding your own special case.
First, install the packages needed.
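For example, assuming you want spaCy plus its small English model (spacy.blank("en") is enough if you only need the tokenizer):

```
pip install -U spacy
python -m spacy download en_core_web_sm
```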
Then, run the following code.
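A minimal sketch, assuming spaCy v2.3 or newer (older versions do not support special cases that contain spaces) and the en_core_web_sm model:

```python
import spacy
from spacy.symbols import ORTH

# en_core_web_sm is assumed here; spacy.blank("en") also works if you only need tokenization.
nlp = spacy.load("en_core_web_sm")

# The chemical name from the question, to be kept as a single token.
chemical = "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"

# Register the whole string as a special case so the tokenizer does not split it.
# Special cases containing spaces require spaCy v2.3+.
nlp.tokenizer.add_special_case(chemical, [{ORTH: chemical}])

doc = nlp(
    "This is a multi word chemical component "
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. "
    "This is another sentence."
)
print([token.text for token in doc])
```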
With the special case registered, it should return output along these lines (exact punctuation handling may differ slightly between spaCy versions):
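```
['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide', '.', 'This', 'is', 'another', 'sentence', '.']
```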
You can then tweak the tokenizer as you like (e.g. fine-tuning the punctuation rules), and you can write a script that adds all of your chemistry terms to the tokenizer automatically, as sketched below. For more information on spaCy, refer to its documentation on the Tokenizer and Linguistic Features. Although this answer is late, I hope it helps future developers.
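A minimal sketch of such a script, assuming a hypothetical chemistry_terms.txt file with one chemical name per line:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# "chemistry_terms.txt" is a hypothetical file holding one (possibly multi-word)
# chemical name per line; replace it with wherever your term list lives.
with open("chemistry_terms.txt", encoding="utf-8") as f:
    for line in f:
        term = line.strip()
        if term:
            # Register each term as a single-token special case.
            nlp.tokenizer.add_special_case(term, [{ORTH: term}])
```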