Why is "slurping" a file not a good practice for normal text-file I/O, and when is it useful?
For example, why shouldn't I use these?
File.read('/path/to/text.txt').lines.each do |line|
# do something with a line
end
or
File.readlines('/path/to/text.txt').each do |line|
# do something with a line
end
Again and again we see questions asking about reading a text file to process it line-by-line that use variations of read or readlines, which pull the entire file into memory in one action.
The documentation for read says it opens the file and returns its contents, the whole file by default. The documentation for readlines says it reads the entire file and returns its lines as an array. Both, in other words, slurp the whole file.
Pulling in a small file is no big deal, but there comes a point where memory has to be shuffled around as the incoming data's buffer grows, and that eats CPU time. In addition, if the data consumes too much space, the OS has to get involved just to keep the script running and starts spooling to disk, which will bring a program to its knees. On an HTTPd (web host) or anything else needing fast response, it'll cripple the entire application.
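For line-oriented work, the alternative is to read one line at a time so Ruby only ever holds the current line in memory. File.foreach, which the benchmark below relies on, is the usual way:

# Reads one line at a time; only the current line is held in memory.
File.foreach('/path/to/text.txt') do |line|
  # do something with a line
end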
Slurping is usually based on a misunderstanding of the speed of file I/O, or on the belief that it's faster to read the whole buffer and then split it than it is to read the file a single line at a time.
Here's some test code to demonstrate the problem caused by "slurping".
Save this as "test.sh":
It creates five files of increasing sizes. 1K files are easily processed, and are very common. It used to be that 1MB files were considered big, but they're common now. 1GB is common in my environment, and files beyond 10GB are encountered periodically, so knowing what happens at 1GB and beyond is very important.
Save this as "readlines.rb". It doesn't do anything but read the entire file line-by-line internally, and append it to an array that is then returned, and seems like it'd be fast since it's all written in C:
Save this as "foreach.rb":
Running sh ./test.sh on my laptop I get:
Reading the 1K file:
Reading the 1MB file:
Reading the 1GB file:
Reading the 2GB file:
Reading the 3GB file:
Notice how readlines runs twice as slow each time the file size increases, while foreach slows only linearly. At 1MB, we can already see that something is affecting the "slurping" I/O that doesn't affect reading line-by-line. And, because 1MB files are very common these days, it's easy to see they'll slow the processing of files over the lifetime of a program if we don't think ahead. A couple of seconds here or there isn't much when it happens once, but if it happens multiple times a minute it adds up to a serious performance impact by the end of a year.
I ran into this problem years ago when processing large data files. The Perl code I was using would periodically stop as it reallocated memory while loading the file. Rewriting the code so that it didn't slurp the data file, and instead read and processed it line-by-line, gave a huge speed improvement, from over five minutes down to less than one, and taught me a big lesson.
"slurping" a file is sometimes useful, especially if you have to do something across line boundaries, however, it's worth spending some time thinking about alternate ways of reading a file if you have to do that. For instance, consider maintaining a small buffer built from the last "n" lines and scan it. That will avoid memory management issues caused by trying to read and hold the entire file. This is discussed in a Perl-related blog "Perl Slurp-Eaze" which covers the "whens" and "whys" to justify using full file-reads, and applies well to Ruby.
For other excellent reasons not to "slurp" your files, read "How to search file text for a pattern and replace it with a given value".