I am trying to train a Doc2Vec model using the following code:
import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// one document per line of the input file
String modelPath = "input_data.csv";
File file = new File(modelPath);
SentenceIterator iter = new BasicLineIterator(file);
AbstractCache<VocabWord> cache = new AbstractCache<>();
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
LabelsSource source = new LabelsSource("DOC_");
ParagraphVectors vec = new ParagraphVectors.Builder()
    .minWordFrequency(1)
    .iterations(5)
    .epochs(1)
    .layerSize(100)
    .learningRate(0.025)
    .labelsSource(source)
    .windowSize(5)
    .iterate(iter)
    .trainWordVectors(false)
    .vocabCache(cache)
    .tokenizerFactory(t)
    .sampling(0)
    .workers(4)
    .build();
vec.fit();
// save the trained model to disk
File tempFile = new File("trained_model.zip");
WordVectorSerializer.writeParagraphVectors(vec, tempFile);
- This code works for small input files.
- When I try to execute this code on a large file (18GB), I get the following error:

.........
o.d.m.s.SequenceVectors - Time spent on training: 5667912 ms
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:477)
    at org.deeplearning4j.examples.nlp.paragraphvectors.ParagraphVectorsTextExample.main(ParagraphVectorsTextExample.java:73)
Caused by: java.lang.RuntimeException: java.io.IOException: Stream Closed
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:393)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:687)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeParagraphVectors(WordVectorSerializer.java:475)
    ... 1 more
Caused by: java.io.IOException: Stream Closed
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
    at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.writeWordVectors(WordVectorSerializer.java:392)
    ... 3 more
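For context, the root `java.io.IOException: Stream Closed` is the standard exception `java.io` raises when a write is attempted on a `FileOutputStream` that has already been closed, so the serializer appears to be writing to (or flushing) a stream after something closed it. A minimal stdlib-only sketch (hypothetical temp file, no DL4J involved) that reproduces the same exception:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class StreamClosedDemo {
    // Writes to a FileOutputStream after closing it and returns the
    // message of the resulting IOException ("Stream Closed").
    public static String reproduce() throws IOException {
        File tmp = File.createTempFile("demo", ".bin");
        tmp.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(tmp);
        fos.close();                        // stream closed here ...
        try {
            fos.write(new byte[]{1, 2, 3}); // ... so this write fails
            return "no exception";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("caught: " + reproduce());
    }
}
```

This only demonstrates the failure mode named in the trace; in my case the close presumably happens somewhere inside `WordVectorSerializer` during the long save of the large model.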
I am not sure what I am doing wrong. Is there any way around this?