I'm tinkering with a Mac utility that does HFS compression. Originally it'd read the entire file into memory using read(2), then create a compressed representation which would then be saved to the file's extended attributes or resource fork. The utility can handle multiple files in parallel.
It can be quite memory-hungry, so I started experimenting with mmapping the target file instead, using MAP_PRIVATE (I also hoped that the mmap might fail, at least sometimes, if the file was already in use by another process). So I have
inBuf = mmap(NULL, filesize, PROT_READ, MAP_PRIVATE|MAP_NOCACHE, fd, 0);
instead of
inBuf = malloc(filesize);
Memory use is indeed lower, but somewhat to my surprise performance has dropped: noticeably longer processing times, a lower CPU load and a considerably higher number of major faults (according to tcsh's time utility). The impact appears to be more pronounced when the files are smaller, so I have set an arbitrary limit (64 MB) below which the original malloc+read path is used.
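For reference, here is a minimal sketch of that two-path loader. It is not the utility's actual code: `InputBuffer`, `load_input`, `release_input` and `MMAP_THRESHOLD` are names made up for this illustration, and `MAP_NOCACHE` is defined as a no-op where the platform lacks it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical cutoff: the arbitrary 64 MB limit mentioned above. */
#define MMAP_THRESHOLD ((size_t)64 * 1024 * 1024)

#ifndef MAP_NOCACHE
#define MAP_NOCACHE 0 /* macOS-specific flag; treated as a no-op elsewhere */
#endif

/* Hypothetical holder that remembers how the buffer was obtained. */
typedef struct {
    void  *data;
    size_t size;
    bool   mapped; /* true: release with munmap(); false: release with free() */
} InputBuffer;

/* Loads the file at path: mmap for files of at least `threshold` bytes,
 * malloc+read below it. Returns 0 on success, -1 on failure. */
static int load_input(const char *path, size_t threshold, InputBuffer *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { /* empty files are rejected */
        close(fd);
        return -1;
    }
    size_t size = (size_t)st.st_size;

    if (size >= threshold) {
        void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE | MAP_NOCACHE, fd, 0);
        close(fd); /* the mapping stays valid after the descriptor is closed */
        if (p == MAP_FAILED)
            return -1;
        out->data = p;
        out->size = size;
        out->mapped = true;
        return 0;
    }

    char *buf = malloc(size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    size_t done = 0;
    while (done < size) { /* read(2) may return fewer bytes than requested */
        ssize_t n = read(fd, buf + done, size - done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            free(buf);
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    close(fd);
    out->data = buf;
    out->size = size;
    out->mapped = false;
    return 0;
}

/* Releases a buffer with the call that matches how it was obtained. */
static void release_input(InputBuffer *in)
{
    if (in->mapped)
        munmap(in->data, in->size);
    else
        free(in->data);
    in->data = NULL;
    in->size = 0;
}
```

Passing the threshold as a parameter makes it easy to time both paths on the same file by calling `load_input(path, 1, &buf)` and `load_input(path, SIZE_MAX, &buf)`.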
Is there a rule of thumb for this sort of trade-off? I'd run this through perf on Linux, but the comparable utilities I know of on Mac are all geared towards GUI apps (while mine is a command-line tool, of course).
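Short of a real profiler, the same counters that tcsh's time reports can be read from inside the process with getrusage(2), which would allow comparing the two code paths per file. A rough sketch follows; `UsageSnapshot`, `usage_now` and `usage_delta` are hypothetical helper names, not part of any library.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical snapshot of the counters of interest. */
typedef struct {
    long   major_faults;   /* page faults that required I/O */
    long   minor_faults;   /* page faults served without I/O */
    double user_seconds;
    double system_seconds;
} UsageSnapshot;

/* Fills *s with the current process totals; returns 0 on success, -1 on failure. */
static int usage_now(UsageSnapshot *s)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    s->major_faults   = ru.ru_majflt;
    s->minor_faults   = ru.ru_minflt;
    s->user_seconds   = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
    s->system_seconds = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
    return 0;
}

/* Difference between two snapshots taken around a piece of work. */
static UsageSnapshot usage_delta(const UsageSnapshot *before, const UsageSnapshot *after)
{
    UsageSnapshot d;
    d.major_faults   = after->major_faults - before->major_faults;
    d.minor_faults   = after->minor_faults - before->minor_faults;
    d.user_seconds   = after->user_seconds - before->user_seconds;
    d.system_seconds = after->system_seconds - before->system_seconds;
    return d;
}
```

Taking one snapshot before and one after compressing a file gives the fault counts for that file alone. Note that RUSAGE_SELF covers the whole process, so with several files being handled in parallel the deltas include the other threads' activity as well.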
EDIT: Come to think of it, using mmap() will probably not really decrease memory requirements, will it? I'm guessing data is still going to be copied into "fast RAM" as opposed to accessed directly on disk - and the fact that you can apparently have multiple independent mmaps of the same file seems to support that hypothesis:
- I get a file's content by mmapping it into memory (inBuf above)
- I "hfs-compress" that buffer, and rewrite the file with the compressed content
- inBuf doesn't appear to change because of the rewrite, but then HFS-compressed files are decompressed transparently when read
- I get a new mmap of the now compressed file, and as a verification I compare that with inBuf: the contents are identical. But you'd also expect that if those mmaps always reflect the content of the file, so I'm not certain whether my verification makes any sense.
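One way to make that verification independent of how the mappings behave would be to record a checksum of the original content before the rewrite and compare the re-read content against it afterwards. POSIX leaves it unspecified whether changes made to a file after a MAP_PRIVATE mapping is established are visible through that mapping, so comparing two live mappings may not prove much. The sketch below uses 64-bit FNV-1a only because it is short (it is not cryptographic), and `fnv1a64` and `content_unchanged` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of a byte buffer. */
static uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}

/* Returns 1 when the content read back after the rewrite has the same
 * length and checksum as what was recorded before the rewrite, else 0. */
static int content_unchanged(uint64_t checksum_before, size_t len_before,
                             const void *after, size_t len_after)
{
    return len_before == len_after && fnv1a64(after, len_after) == checksum_before;
}
```

The checksum would be computed from inBuf before the file is rewritten, and `content_unchanged` called on the freshly read (transparently decompressed) content afterwards.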