I don't understand this passage from the Google File System paper:
> A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file.
What difference does a small file make? Aren't large files being accessed by many clients equally likely to cause problems?
I've thought / read the following:
- I assume (correct me if I'm wrong) that the chunks of large files are stored on different chunkservers, thereby distributing load. In that scenario, say 1000 clients each read 1/100th of the file from each of 100 chunkservers. Each chunkserver still ends up receiving 1000 requests. (Isn't this the same as 1000 clients accessing a single small file? The server gets 1000 requests either way, whether for a small file or for parts of a larger one.)
- I read a little about sparse files. According to the paper, small files fill up a chunk or a few chunks. So, to my understanding, small files aren't reconstructed from scattered pieces, and I've eliminated sparseness as the probable cause of the hot spots.
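To make my counting argument concrete, here is a toy sketch (all numbers hypothetical) of the per-chunkserver request count under the assumption that every client reads every chunk and each chunk sits on its own chunkserver:

```python
def requests_per_chunkserver(num_clients, num_chunks):
    # Each client issues one request per chunk; with one chunk per
    # chunkserver, every chunkserver is contacted by every client.
    total_requests = num_clients * num_chunks
    return total_requests // num_chunks

# Small file: one chunk on one chunkserver.
small = requests_per_chunkserver(num_clients=1000, num_chunks=1)
# Large file: 100 chunks spread over 100 chunkservers.
large = requests_per_chunkserver(num_clients=1000, num_chunks=100)

print(small, large)  # both 1000: the raw request counts are equal
```

Which is exactly why I don't see how file size changes anything, if we only count requests.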
Some of the subsequent text can help clarify:
If 1000 clients want to read a small file at the same time, the N chunkservers holding replicas of its only chunk will each receive roughly 1000 / N simultaneous requests. This sudden, concentrated load is what's meant by a hot spot.
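A rough sketch of that split (the replica count and client count here are hypothetical; 3 is just the commonly cited GFS default replication factor), with each client picking a replica at random:

```python
import random

random.seed(0)
N = 3            # assumed replication factor
clients = 1000

load = [0] * N
for _ in range(clients):
    load[random.randrange(N)] += 1  # each client picks one replica

print(load)  # each replica sees roughly a third of the 1000 requests, all at once
```

The point is that ~333 simultaneous requests still land on each of those 3 servers in the same instant, which is the hot spot.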
Large files aren't going to be read all at once by a given client (after all, they are large). Instead, they're going to load some portion of the file, work on it, then move on to the next portion.
In a sharding (MapReduce, Hadoop) scenario, workers may not even read the same chunks at all; each of N clients reads its own 1/N of the file's chunks, distinct from the others'.
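That sharding pattern can be sketched like this (the round-robin assignment and the worker/chunk counts are illustrative assumptions, not what MapReduce literally does):

```python
def chunks_for_worker(worker_id, num_workers, total_chunks):
    # Worker i takes every num_workers-th chunk starting at i,
    # so no two workers ever touch the same chunk.
    return list(range(worker_id, total_chunks, num_workers))

shards = [chunks_for_worker(i, num_workers=4, total_chunks=12) for i in range(4)]
print(shards)
# [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]] -- disjoint chunk sets
```

Since the chunk sets are disjoint, no chunkserver sees more than one worker per chunk.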
Even in a non-sharding scenario, in practice clients will not be perfectly synchronized. They may all end up reading the entire file, but with a random access pattern, so that statistically there is no hot spotting. Or, if they do read it sequentially, they will drift out of sync because of differences in workload (unless you're purposefully synchronizing the clients... but don't do that).
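A toy simulation (all numbers hypothetical) makes the contrast visible: compare the peak number of clients hitting any one chunk when everyone reads in lockstep versus when each client reads the same chunks in its own random order:

```python
import random

random.seed(42)
clients, chunks = 100, 100

def peak_chunk_load(orders):
    # orders[c] is the sequence of chunk ids client c reads;
    # at each step, count how many clients hit the busiest chunk.
    peak = 0
    for step in range(chunks):
        hits = [0] * chunks
        for order in orders:
            hits[order[step]] += 1
        peak = max(peak, max(hits))
    return peak

# Lockstep: every client reads chunk 0, then 1, then 2, ...
lockstep = [list(range(chunks)) for _ in range(clients)]
# Randomized: every client still reads every chunk, just in its own order.
randomized = [random.sample(range(chunks), chunks) for _ in range(clients)]

print(peak_chunk_load(lockstep))    # 100: all clients on the same chunk at once
print(peak_chunk_load(randomized))  # far lower, even though total work is identical
```

Same total reads in both cases; only the synchronization differs, and that alone creates or dissolves the hot spot.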
So even with lots of clients, larger files suffer less hot spotting because of the nature of the work large files entail. It's not guaranteed (which I think is what you're getting at in your question), but in practice distributed clients won't work in tandem on every chunk of a multi-chunk file.