Why am I getting an error while loading the IMDB dataset?

from torchtext.datasets import WikiText2, IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tkzer = get_tokenizer('basic_english')

tr_iter = WikiText2(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter), specials=['<unk>'])

tr_iter_imdb = IMDB(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter_imdb), specials=['<unk>'])

The code for WikiText2 runs fine, but for IMDB I get the following error when running build_vocab_from_iterator:

'tuple' object has no attribute 'lower'

Can someone please help me understand why that is the case? I assume the IMDB data is structured differently from WikiText2. If so, how can I build a vocab for the IMDB dataset?

Answer by tomdartmoor:

IMDB() returns a DataPipe that yields (int, str) tuples, a label and the review text, rather than plain strings. Its docstring says:

IMDB Dataset

For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/

Number of lines per split:

train: 25000
test: 25000
Args:
    root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
    split: split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

:returns: DataPipe that yields tuple of label (1 to 2) and text containing the movie review
:rtype: (int, str)

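The basic_english tokenizer calls .lower() on its input, so passing it the whole tuple fails. I suggest checking that the text in each tuple is what you want, then updating your map call so that only the text element is tokenized, something like: map(lambda x: tkzer(x[1]), tr_iter_imdb)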
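To see the failure and the fix without downloading the dataset, here is a minimal sketch. It uses stand-ins, not torchtext: `simple_tokenizer` is a hypothetical replacement for get_tokenizer('basic_english'), `samples` is made-up data shaped like the (label, text) tuples IMDB yields, and `yield_tokens` is a helper name invented for this illustration.

```python
from collections import Counter


def simple_tokenizer(text):
    # Hypothetical stand-in for get_tokenizer('basic_english'):
    # it lowercases the input and splits on whitespace.
    return text.lower().split()


# Made-up items shaped like what IMDB yields: (label, text).
samples = [
    (1, "A dull and lifeless film"),
    (2, "A warm and funny film"),
]


def yield_tokens(data_iter, tokenizer):
    # Unpack each (label, text) tuple and tokenize only the text.
    for _label, text in data_iter:
        yield tokenizer(text)


# Passing whole tuples to the tokenizer reproduces the error.
try:
    list(map(simple_tokenizer, samples))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Tokenizing only the text element works.
token_lists = list(yield_tokens(samples, simple_tokenizer))

# Token counts, as a rough stand-in for what a vocab builder consumes.
counts = Counter(token for tokens in token_lists for token in tokens)
```

With torchtext itself, the same generator could presumably be passed straight to the vocab builder, e.g. build_vocab_from_iterator(yield_tokens(IMDB(split='train'), tkzer), specials=['<unk>']), which is equivalent to the lambda version above.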