Summing total file sizes of a directory differs by a large margin: Ruby -e, du -ach, ls -al "total"

ls | ruby -ne 'BEGIN{a= []}; a <<  File.size($_.chomp).to_i; END{puts a.sum}'

The code above gets the file size of each file, puts it into an array, and prints the sum.

The value returned is very different from:

du -ach

And both values are very different from the Total displayed by:

ls -al

There are no hidden files.
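
(Note: piping ls into Ruby breaks on filenames that contain newlines, and also counts directory entries. A more robust sketch of the same sum, using Dir.glob to enumerate the non-hidden plain files in the current directory:)

ruby -e 'puts Dir.glob("*").select { |f| File.file?(f) }.sum { |f| File.size(f) }'

Like the one-liner above, this measures the apparent size of each file, not the space it occupies on disk.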

macOS

BEST ANSWER

If du is showing you a lot of 4K and 8K files, that is because it reports sizes in whole disk blocks. For performance, storage on disk is allocated in blocks; a typical block these days is 4K. Even a single byte of content takes up a full block.

$ echo '1' > this

$ hexdump this
0000000 31 0a                                          
0000002

$ ls -l this
-rw-r--r-- 1 schwern staff 2 Dec  5 15:16 this

$ du -h this
4.0K    this

$ du --apparent-size -h this
2   this

$ ruby -e 'puts File.size(ARGV[0])' this
2

The file in question has 2 bytes of content ("1" plus a newline). ls -l and Ruby's File.size both report those two bytes.

du, by default, reports the number of blocks allocated to the file. It is a Disk Usage tool, after all: you want to know the true amount of disk taken up. Those 2 bytes occupy a full 4K block, and 1000 two-byte files will take 4000K of disk, not 2000 bytes.
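
You can get both numbers from Ruby too. A sketch using File::Stat (st_blocks is counted in 512-byte units on POSIX systems; File::Stat#blocks may be nil on platforms that do not report it):

st = File.stat('this')
puts st.size                        # apparent size: 2 bytes, what ls -l shows
puts st.blocks * 512 if st.blocks   # allocated size: what du counts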

For this reason, many programs avoid creating lots of tiny files and instead save disk space by packing them together into a single larger file. A simple example is Git packfiles.

SECOND ANSWER

The real question is: how do you define "size", how do you define "sum", and are you 100% sure that all three of the commands you showed are actually measuring the same thing, i.e. that all three define those two terms in exactly the same way?

Here are just a few examples of things to consider.

Sparse files

Sparse files are a feature of many filesystems that optimizes the storage of files containing long runs of binary zeroes. Instead of actually storing the zeroes, the file simply records that there is a "hole" at that position; when the file is read, the OS returns zeroes even though none are physically stored.

The most extreme example would be a file that consists only of zeroes. I can store the information "this file contains 2 terabytes of zeroes" in just a few bytes, yet, when I ask the operating system to open and read the file, I will "see" 2 terabytes of zeroes. Now, what is the "size" of this file? Is it 2TB or is it only the couple of bytes that are actually needed to encode the information of the "hole" of the sparse file (which in this case covers the whole file)?
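
You can create such a file yourself. A sketch in Ruby, assuming the filesystem supports sparse files (APFS does, HFS+ does not):

File.open('sparse.bin', 'w') do |f|
  f.seek(1 << 30)   # jump 1 GiB past the start without writing anything
  f.write("\0")     # one real byte at the end; everything before it is a "hole"
end
st = File.stat('sparse.bin')
puts st.size                        # apparent size: 1 GiB + 1 byte
puts st.blocks * 512 if st.blocks   # allocated size: likely only a few KB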

I used to confuse my friends by creating terabyte-size sparse files on 1.44MB floppy disks (or more recently, 32 GB USB sticks).

Metadata overhead

A filesystem not only has to store the content of the file, but also some sort of metadata about the file: when was the file created, when was the file last modified, when was the file last accessed, who owns the file, and so on.

This metadata also takes up space. Do you count that or not? Note that it is different for every filesystem!

Block size

Many filesystems have a smallest possible allocation unit called a "block". It is not possible to allocate space smaller than a block, so unless the size of a file is an exact integer multiple of the block size, the size of the file's content and the size of the file on disk will always differ.

This is especially noticeable for very small files and very large block sizes. E.g. a file which contains only the string "Hello" encoded in ASCII contains at most 7 bytes (worst-case assuming that it ends with a newline, and the newline is a Windows-style CRLF), but it will take up an entire block (typically 4KB) on disk.
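
Computing the on-disk size from the apparent size is just rounding up to the next block boundary (ignoring the inlining and tail-sharing tricks described below). A sketch, assuming a 4096-byte block:

def size_on_disk(apparent_size, block_size = 4096)
  ((apparent_size + block_size - 1) / block_size) * block_size
end

size_on_disk(7)     # => 4096: "Hello\r\n" still occupies a whole block
size_on_disk(4097)  # => 8192: one byte over the boundary costs another block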

Metadata inlining

On the other hand, on some filesystems, very small files get inlined into their metadata entry. So, they don't require any data blocks at all. Does that mean their size is 0?

Tail sharing

On some filesystems, the "tails" of multiple files can share one block. So, if you have multiple files whose sizes are not an integer multiple of the block size, instead of allocating one mostly empty block for each "tail end" of each file, the "tail ends" of multiple files are stuffed into a single block.

However, now this block belongs to multiple files, so if you ask for the size of each file in isolation, this block will be reported multiple times.

Multiple entries for the same file

Many file systems separate the notion of a "file" from the notion of a "file name". For example, in Unix, and any systems derived or inspired from it (Linux, macOS, Android, …), a "file" is simply an unnamed blob of data. A directory is a special kind of file that associates names with files.

However, this means that a file can have more than one name! So, if you have the same file under two different names in your directory, then do you count it once or twice?
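
If you decide to count each file only once, you can deduplicate on the (device, inode) pair that identifies the underlying file, since two names for the same file share it. A sketch:

seen  = {}
total = 0
Dir.glob('*').each do |name|
  st = File.stat(name)
  next unless st.file?
  key = [st.dev, st.ino]   # two names for the same file share dev+ino
  next if seen[key]
  seen[key] = true
  total += st.size
end
puts total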

Directory entry inlining

Similar to metadata inlining, if the file is very small, and there is only one name for the file, then instead of putting a pointer to the file in the directory entry, we can put the data of the file into the directory entry directly.

Again, this has the effect that if we ignore directory entries when looking at file size, the file appears to have a size on disk of 0.

Deduplication

Some filesystems perform deduplication: they look for blocks with the same content and transparently replace the duplicates with links to a single shared block.

Now, when two totally unrelated files happen to have a run of identical content somewhere inside of them, and thus are sharing some deduplicated blocks, are you counting those blocks once or twice?

Compression

Some filesystems transparently compress the contents of the files. This means that the actual size of the file on disk depends on how compressible the content of the file is.

So, do you count the compressed or the uncompressed size?

Alternate Data Streams / Forks

Some filesystems allow you to store more than one data stream inside a single file. NTFS, for example, supports so-called "Alternate Data Streams". Applications use these to store additional application-specific metadata: music players store album covers, play counts, or song-specific equalizer settings inside music files; office applications store backups of older versions of the file; and so on. macOS has a similar feature called "forks".

Almost all standard filesystem APIs will only present the default stream / data fork. Unless you explicitly ask for an Alternate Data Stream or the resource fork using (typically OS-specific or filesystem-specific) APIs, you will never even know that it is there, but it may be of significant size.
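
On macOS, for example, a file's resource fork is reachable through the special path suffix /..namedfork/rsrc, which File.size otherwise ignores completely. A sketch (song.mp3 is a hypothetical example file):

file = 'song.mp3'                   # hypothetical example file
rsrc = "#{file}/..namedfork/rsrc"   # macOS-specific path to the resource fork
data_fork     = File.size(file)
resource_fork = File.exist?(rsrc) ? File.size(rsrc) : 0
puts "data: #{data_fork} bytes, resource fork: #{resource_fork} bytes"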

"Bundles"

Specifically on macOS, you have the concept of "Bundles" which are technically directories as far as the filesystem and the lower levels of the OS are concerned, but are mostly treated as single files when presented to higher levels of the OS and to the user.

So, here you have a thing that looks like a file, where you think "the size of this should be easy to determine", but it is actually a directory, with all of the problems that you noticed in your question.
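
Summing a bundle therefore means walking an entire directory tree, with every complication listed above. A sketch (the bundle path is just an example; symlinks inside the bundle are followed and may be double-counted):

bundle = '/Applications/Safari.app'   # example path; any bundle directory will do
total = Dir.glob(File.join(bundle, '**', '*'), File::FNM_DOTMATCH)
           .select { |f| File.file?(f) }
           .sum    { |f| File.size(f) }
puts total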

Any combination of the above

And of course, all of the above can be combined with each other.

So, as you can see, computing the sum of the sizes of multiple files is not a straightforward thing: files can share pieces of data.

But even if you forget about the sum and only ask about the size of a single file, the answer is still not clear, because there are many different ways to define what "size" means.

So, in order to have a meaningful answer to the question, you need to actually take several steps back, and ask yourself:

  1. Why are you measuring the sum of the sizes of the files of a directory? What do you need this information for? What is your end goal? Which decisions are you actually going to base on this information? How are you going to use this information?

  2. What is it that you actually need to measure to have the necessary information to base the decisions on?

  3. How are you measuring this? Depending on your answer to question #2, the information that you need may be very OS-specific or filesystem-specific, and part of internal filesystem APIs that you don't even have access to as a user.