Is my training data really being randomized? Error rates are wildly oscillating


So I set the randomization window to 100,000. In my log I can see the error rate oscillating between 0 errors and a lot of errors, which makes me wonder whether the data is truly being randomized. For about 99% of the training sequences the input is typically about 50 tokens and the output is 6 tokens; in the other 1% the output is maybe about 400 tokens (and these are, of course, the most important sequences to learn to output). It seems like several of the longer sequences may be getting clumped together, which would explain why the error rate suddenly goes up. Is that possible?


There is 1 answer below


Please try specifying a larger randomization window if your samples are small, e.g. randomizationWindow=100000000. It may be that your window covers only a single chunk, in which case the data is randomized only within that chunk, not between chunks.

(You can see how the data is split if you specify verbosity=4 in the reader section and look at the "randomized windows [)" information in the log.)
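To illustrate why a window that covers only a single chunk can leave rare sequences clumped, here is a minimal Python sketch. It is not the reader's actual algorithm: it assumes a simplified model in which the data is shuffled only inside consecutive, non-overlapping blocks of `window` items, and it assumes a hypothetical corpus in which the 1% of long sequences are stored contiguously in the file. The function names, corpus size, and batch size are made up for the illustration.

```python
import random

def windowed_shuffle(items, window, seed=0):
    """Shuffle items only within consecutive blocks of `window` items.

    Simplified stand-in for window-limited randomization (an assumption,
    not the real reader implementation).
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(items), window):
        block = items[start:start + window]
        rng.shuffle(block)  # items never leave their own block
        out.extend(block)
    return out

def max_long_per_batch(items, batch_size):
    """Largest number of long sequences (marked 1) in any one minibatch."""
    return max(
        sum(items[i:i + batch_size])
        for i in range(0, len(items), batch_size)
    )

# Hypothetical corpus: 10,000 sequences, 1% long, stored together at the end.
n, n_long = 10_000, 100
corpus = [0] * (n - n_long) + [1] * n_long

small = windowed_shuffle(corpus, window=500)  # window much smaller than the data
large = windowed_shuffle(corpus, window=n)    # window covers the whole data set

small_max = max_long_per_batch(small, batch_size=50)
large_max = max_long_per_batch(large, batch_size=50)
print("max long sequences per minibatch, small window:", small_max)
print("max long sequences per minibatch, large window:", large_max)
```

Under these assumptions, the small window keeps all 100 long sequences inside the last 500 items, so some minibatches are dominated by them while most contain none. The large window spreads them across the whole epoch, which is the behaviour a larger randomizationWindow is meant to give.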

The more data you can fit in memory, the better. This also helps performance: after the initial load, the readers can start prefetching new chunks while the current data is being processed, so your GPU won't be IO bound.