I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem , but...
I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.
I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.
Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.
The
Arrayfamily of types has built-in support for multi-dimensional arrays. The indices can be anything with anIxinstance, in particular for your case(Int, Int). It also does not support sorting, unfortunately.But for your use case, do you really need sorting? If you have a map from identifiers to Int you can just increase the count as you go, and later select all keys with value 1. You can check out the
bytestring-triepackage, although for your use case it suggests to usehashmap.Another algorithm would be to carry two sets (e.g. HashSet), one with identifiers seen exactly once, and one with identifiers seen more than once, and you update these sets while you go through the list.
Also, how do you read your file: If you read it as one large ByteString and carefully construct the small ByteString objects from it, they will actually be just pointers into the big chunk of memory with the large file, possibly eliminating your GC problems. But to assess that we’d need to see your code.