I have a Node.js application that needs to fetch this 6GB zip file from Census.gov and then process its contents. However, when I fetch the file using the Node.js https API, the download stops at a different size each time. Sometimes it fails at 2GB, sometimes at 1.8GB, and so on. I am never able to download the whole file with the application, but it downloads fully in the browser. Is there any way to download the full file? I cannot start processing the zip until it is fully downloaded, so my processing code waits for the download to complete before executing.
const file = fs.createWriteStream(fileName);
http.get(url).on("response", function (res) {
  let downloaded = 0;
  res
    .on("data", function (chunk) {
      file.write(chunk);
      downloaded += chunk.length;
      process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    })
    .on("end", async function () {
      file.end();
      console.log(`${fileName} downloaded successfully.`);
    });
});
You have no flow control on the `file.write(chunk)`. You need to pay attention to the return value from `file.write(chunk)`, and when it returns `false`, you have to wait for the `drain` event before writing more. Otherwise, you can overflow the buffer on the writestream, particularly when writing large things to a slow medium like disk.

When you lack flow control while attempting to write large things faster than the disk can keep up, you will probably blow up your memory usage because the stream has to accumulate more data in its buffer than is desirable.
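As a rough illustration of that kind of flow control, here is a minimal sketch. The helper name `copyWithFlowControl` is hypothetical, and it assumes `source` is any readable stream (such as the HTTP response) and `dest` is any writable stream (such as the file). The source is paused whenever `write()` reports a full buffer and resumed on `drain`.

```javascript
// Hypothetical helper: copies `source` into `dest` while respecting backpressure.
function copyWithFlowControl(source, dest) {
  return new Promise((resolve, reject) => {
    source.on("data", (chunk) => {
      // write() returns false once the writable's internal buffer is full.
      if (!dest.write(chunk)) {
        source.pause(); // stop further "data" events while the buffer drains
        dest.once("drain", () => source.resume()); // buffer flushed, carry on
      }
    });
    source.on("end", () => dest.end()); // no more input, finish the output
    source.on("error", reject);
    dest.on("error", reject);
    dest.on("finish", resolve); // everything has been handed off to the destination
  });
}
```

In the question's code this could be called from the response handler as `copyWithFlowControl(res, file)`, with the progress counter kept in a separate `data` listener.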
Since your data is coming from a readable, when you get `false` back from `file.write(chunk)`, you will also have to pause the incoming read stream so it doesn't keep spewing data events at you while you're waiting for the `drain` event on the writestream. When you get the `drain` event, you can then `resume` the readstream.

FYI, if you don't need the progress info, you can let `pipeline()` do all the work (including the flow control) for you. You don't have to write that code yourself. You may even be able to still gather the progress info by just watching the writestream activity when using `pipeline()`.

Here's one way to implement the flow control yourself, though I'd recommend you use the `pipeline()` function in the stream module and let it do all this for you if you can:

There also appeared to be a timeout issue in the http request. When I added this:
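One way to lengthen the timeout is sketched below, on the assumption that the change raised the idle timeout on the response. The helper name `extendTimeout` and the 24-hour value are illustrative assumptions, not figures from this answer.

```javascript
// Assumed value: long enough for a multi-gigabyte transfer on a slow link.
const LONG_TIMEOUT_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: raises the idle timeout on an http.IncomingMessage.
function extendTimeout(res, ms = LONG_TIMEOUT_MS) {
  res.setTimeout(ms); // forwards to the underlying socket's setTimeout()
  return res;
}

// Usage sketch inside the response handler:
// http.get(url).on("response", (res) => { extendTimeout(res); /* write res to disk */ });
```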
I was then able to download the whole 7GB ZIP file.
Here's turnkey code that worked for me:
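A sketch of what such a download could look like, combining the points above: `pipeline()` from `stream/promises` handles the flow control, a pass-through `Transform` counts bytes for progress, and the response gets a long timeout. The function name `download`, the `onProgress` callback, and the timeout value are assumptions for illustration and may differ from the code that was actually run.

```javascript
const fs = require("node:fs");
const http = require("node:http");
const https = require("node:https");
const { Transform } = require("node:stream");
const { pipeline } = require("node:stream/promises");

// Hypothetical helper: downloads `url` to `fileName`, resolving with the byte count.
function download(url, fileName, onProgress = () => {}) {
  const client = url.startsWith("https:") ? https : http;
  return new Promise((resolve, reject) => {
    client
      .get(url, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error(`Unexpected status ${res.statusCode}`));
          return;
        }
        res.setTimeout(24 * 60 * 60 * 1000); // assumed value, see lead-in
        let downloaded = 0;
        // Pass-through stream that only counts bytes on their way to disk.
        const counter = new Transform({
          transform(chunk, _encoding, callback) {
            downloaded += chunk.length;
            onProgress(downloaded);
            callback(null, chunk);
          },
        });
        // pipeline() applies backpressure and cleans up all streams on error.
        pipeline(res, counter, fs.createWriteStream(fileName)).then(
          () => resolve(downloaded),
          reject
        );
      })
      .on("error", reject);
  });
}
```

A caller could print progress the same way the question does, for example `download(url, fileName, (n) => process.stdout.write(`Downloaded ${(n / 1000000).toFixed(2)} MB\r`))`, and start processing the zip once the returned promise resolves.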