I am trying to run a GloVe word embedding on a Bengali news dataset. The original GloVe release does not support any language other than English, but I found this, which has pretrained word vectors for 30 non-English languages. I am following this notebook on text classification using GloVe embeddings. My question is:
Can I use the pre-trained Bengali word vectors with my custom Bengali dataset, and run them through this model?
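For context on what I mean by "run on this model": as far as I understand the notebook, the parsed embeddings_index is used to fill an embedding matrix for a Keras Embedding layer. A minimal sketch of that step, assuming 300-dimensional vectors and a tokenizer already fitted on my Bengali dataset (embedding_dim and tokenizer are my placeholder names, not from the notebook):

import numpy as np
from tensorflow.keras.layers import Embedding

embedding_dim = 300  # assumed size of the pretrained vectors
word_index = tokenizer.word_index  # tokenizer fitted on my Bengali dataset beforehand

# Rows stay all-zero for words the pretrained file does not cover.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

embedding_layer = Embedding(len(word_index) + 1, embedding_dim,
                            weights=[embedding_matrix], trainable=False)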
The pretrained Bengali word vectors come in tsv format. With the following code, I cannot seem to parse them into word-vector lists:
embeddings_index = {}
f = open(root_path + 'bn.tsv')
for line in f:
    values = line.split('\t')
    word = values[1] ## The first entry is the word
    coefs = np.asarray(values[1:], dtype='float32') ## These are the vectors representing the embedding for the word
    embeddings_index[word] = coefs
f.close()
print('GloVe data loaded')
and I get the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-3a4cb8d8dfb0> in <module>()
4 values = line.split('\t')
5 word = values[1] ## The first entry is the word
----> 6 coefs = np.asarray(values[1:], dtype='float32') ## These are the vectors representing the embedding for the word
7 embeddings_index[word] = coefs
8 f.close()
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'এবং'
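From the traceback, it looks like the Bengali word 'এবং' itself is ending up in the slice passed to np.asarray, so I suspect my column offsets are wrong: if the first tsv column is a row index and the second is the word, the vector values should start at the third column. This is the variant I am considering (the utf-8 encoding, the values[2:] slice, and the header skip are guesses about the file layout that I have not confirmed):

import numpy as np

embeddings_index = {}
with open(root_path + 'bn.tsv', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip('\n').split('\t')
        if len(values) < 3:
            continue  # guess: skip a header row or malformed lines
        word = values[1]  # guess: column 0 is a row index, column 1 is the word
        coefs = np.asarray(values[2:], dtype='float32')  # guess: vector components start at column 2
        embeddings_index[word] = coefs
print('GloVe data loaded')

Is this the right way to read the file?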