What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?


I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. The only approach I can think of is to parse the entire CSV file into one long list and then feed that list into Accelerate.

My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there is a CSV Enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the array, or at least the number of elements, must be known before an Accelerate array can be built.
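For example, if I'm reading the API correctly, even fromList wants the shape up front (a minimal sketch of what I mean; mkVector is just a name I made up):

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Z(..), (:.)(..))

-- The length n has to be known before the array can be built
mkVector :: Int -> [Double] -> A.Vector Double
mkVector n xs = A.fromList (Z :. n) xs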

Has anyone dealt with this kind of problem before?

Thanks!

2 Answers

Answer 1

I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:

-- Assumed imports (not in the original answer): vector, conduit, primitive
import           Control.Monad.Primitive     (PrimMonad)
import           Control.Monad.Trans.Class   (lift)
import           Data.Conduit                (ConduitM, await)
import qualified Data.Vector.Generic         as GV
import qualified Data.Vector.Generic.Mutable as GMV

-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go,
    -- and also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
          -- Stream exhausted: freeze the filled prefix and return it
          Nothing -> do
            v' <- lift $ GV.freeze $ GMV.slice 0 i v
            return $! (i, v')
          -- Got an element: grow the buffer if it is full, then write
          Just x -> do
            v' <- if GMV.length v == i
                    then lift $ GMV.grow v by
                    else return v
            lift $ GMV.write v' i x
            go (i+1) v'

It basically allocates by empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see whether there will be other, more efficient answers here.

Hope this helps in some way. I do see there's a fromVector function in repa, and perhaps that's your golden ticket in combination with this method.
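For instance, something along these lines ought to connect the pieces (an untested sketch on my part; streamToRepa is a made-up helper, and it assumes conduit's runConduit/.| operators and repa's fromUnboxed, which wraps the vector without another copy):

import qualified Data.Array.Repa     as R
import           Data.Conduit        (runConduit, (.|))
import qualified Data.Conduit.List   as CL
import qualified Data.Vector.Unboxed as U

-- Drain a stream of Doubles into an unboxed vector with sinkVector,
-- then view the result as a one-dimensional repa array
streamToRepa :: [Double] -> IO (R.Array R.U R.DIM1 Double)
streamToRepa xs = do
  (n, v) <- runConduit $ CL.sourceList xs .| sinkVector 1024
  return $ R.fromUnboxed (R.Z R.:. n) (v :: U.Vector Double)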

Answer 2

I haven't tried reading CSV files into repa, but I recommend using cassava (http://hackage.haskell.org/package/cassava). IIRC I had a 1.5 GB file which I used to create my stats, and with cassava my program ran in a surprisingly small amount of memory. Here's an extended example of usage:

http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
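As a rough illustration (my own sketch, not code from that post; data.csv is a stand-in filename), cassava's Data.Csv.Streaming interface decodes records lazily, so the whole file never has to be resident at once:

import qualified Data.ByteString.Lazy as BL
import           Data.Csv             (HasHeader (NoHeader))
import qualified Data.Csv.Streaming   as CS
import           Data.Foldable        (toList)

-- Lazily decode a headerless CSV of two-column Double rows;
-- Records is Foldable, so rows can be consumed incrementally
main :: IO ()
main = do
  contents <- BL.readFile "data.csv"
  let rows = CS.decode NoHeader contents :: CS.Records (Double, Double)
  mapM_ print (toList rows)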

In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do), then one would hope the space usage would also grow incrementally. It's certainly worth an experiment, and possibly also worth contacting the repa folks. Please report back on your results :-)
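If the number of rows and columns is known once parsing finishes, one simple approach (again a hypothetical sketch; rowsToRepa is my own name, and it assumes every row really has cols entries) is to flatten the rows into a single unboxed vector and reshape it:

import qualified Data.Array.Repa     as R
import qualified Data.Vector.Unboxed as U

-- Flatten parsed rows into one unboxed vector and view it as a
-- two-dimensional (rows x cols) repa array
rowsToRepa :: Int -> [[Double]] -> R.Array R.U R.DIM2 Double
rowsToRepa cols rows =
  R.fromUnboxed (R.Z R.:. length rows R.:. cols) (U.fromList (concat rows))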