What is a *.subwords file in natural language processing, to use as a vocabulary file?


I have been trying to create a vocab file for an NLP task, to use with the tokenize method of trax, but I can't find which module/library to use to create the *.subwords file. Please help me out.


There are 2 best solutions below


The easiest way to use trax.data.Tokenize with your own data and a subword vocabulary is with Google's SentencePiece Python module:

import sentencepiece as spm

# Train a BPE subword model on your corpus.
spm.SentencePieceTrainer.train('--input=data/my_data.csv --model_type=bpe --model_prefix=my_model --vocab_size=32000')
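Recent versions of sentencepiece also accept keyword arguments instead of a flag string, which is easier to read (a sketch of the equivalent call):

spm.SentencePieceTrainer.train(
    input='data/my_data.csv',
    model_type='bpe',
    model_prefix='my_model',
    vocab_size=32000,
)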

This creates two files:

  • my_model.model
  • my_model.vocab

We then pass this model to trax.data.Tokenize, adding the parameter vocab_type with the value "sentencepiece":

trax.data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model', vocab_type='sentencepiece')
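Tokenize returns a transform that maps a stream of texts to a stream of token-id arrays, so you can apply it to any iterator of strings (a minimal sketch, assuming my_model.model has been placed in vocab/):

import trax

tokenize = trax.data.Tokenize(vocab_dir='vocab/',
                              vocab_file='my_model.model',
                              vocab_type='sentencepiece')

# Apply the transform to an iterator of raw strings.
for token_ids in tokenize(iter(['hello world'])):
    print(token_ids)  # a numpy array of subword ids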

I think this is the best approach, since you can load the model and query it for the control IDs instead of hard-coding them:

sp = spm.SentencePieceProcessor()
sp.load('my_model.model')

print('bos =', sp.bos_id())
print('eos =', sp.eos_id())
print('unk =', sp.unk_id())
print('pad =', sp.pad_id())

sentence = "hello world"
# encode: text => id
print("Pieces: ", sp.encode_as_pieces(sentence))
print("Ids: ", sp.encode_as_ids(sentence))
# decode: id => text
print("Decode Pieces: ", sp.decode_pieces(sp.encode_as_pieces(sentence)))
print("Decode ids: ", sp.decode_ids(sp.encode_as_ids(sentence)))

print([sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()])
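Note that by default SentencePiece assigns unk=0, bos=1 and eos=2, and leaves padding disabled (sp.pad_id() returns -1); pass --pad_id at training time if you need a pad token.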

If you still want an actual subword file, try this:

python trax/data/text_encoder_build_subword.py \
--corpus_filepattern=data/data.txt --corpus_max_lines=40000 \
--output_filename=data/my_file.subword
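The resulting file can then be used with trax.data.Tokenize by setting vocab_type='subword' (which is the default), for example:

trax.data.Tokenize(vocab_dir='data/', vocab_file='my_file.subword', vocab_type='subword')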

I hope this helps, since there is little clear documentation out there on how to create compatible subword files.


You can use the SubwordTextEncoder from the TensorFlow Datasets (tfds) API.

Use the following code snippet:

import tensorflow_datasets as tfds

# Build a subword vocabulary from any iterable of strings (text_dataset).
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (text_row for text_row in text_dataset), target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)  # e.g. vocab_fname = 'my_vocab'

TensorFlow will append the .subwords extension to the vocab file above.
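You can later reload the vocabulary and round-trip a sentence with it (a minimal sketch, assuming vocab_fname was 'my_vocab', so the file on disk is my_vocab.subwords):

encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file('my_vocab')
ids = encoder.encode('hello world')
print(ids)                  # list of subword ids
print(encoder.decode(ids))  # 'hello world'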