Parallelize computation of mutable vector in ST

99 Views Asked by At

How can computations done in ST be made to run in parallel?

I have a vector which needs to be filled in by random access, hence the use of ST, and the computation runs correctly single-threaded, but have been unable to figure out how to use more than one core.

Random access is needed because of the meaning of the indices into the vector. There are n things and every possible way of choosing among n things has an entry in the vector, as in the choice function. Each of these choices corresponds to a binary number (conceptually, a packed [Bool]) and these Int values are the indices. If there are n things, then the size of the vector is 2^n. The natural way the algorithm runs is for every entry corresponding to "n choose 1" to be filled in, then every entry for "n choose 2," etc. The entries corresponding to "n choose k" depends on the entries corresponding to "n choose (k-1)." The integers for the different choices do not occur in numerical order, and that's why random access is needed.

Here's a pointless (but slow) computation that follows the same pattern. The example function shows how I tried to break the computation up so that the bulk of the work is done in a pure world (no ST monad). In the code below, bogus is where most of the work is done, with the intent of calling that in parallel, but only one core is ever used.


import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)


main :: IO ()
main = do
  putStrLn $ show (example 9)
  

example :: Int -> Vb.Vector Int
example n = runST $ do
  m <- Vg.new (2^n) :: ST s (Vm.STVector s Int)
  
  Vg.unsafeWrite m 0 (1)
  
  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forM_ newEntries $ \e -> do
      let v = bogus p e
      Vg.unsafeWrite m e v
  
  Gg.unsafeFreeze m


choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k 
  | n == k = [2^n - 1]
  | otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))

prev :: Vm.STVector s Int -> Int -> Int -> ST s Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k ) chs
  let e = map (\k -> toInteger k ) v
  return (sum e)

bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a

fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)

If anyone tests this, using more than 9 or 10 in show (example 9) will take much longer than you want to wait for such a pointless sequence of numbers.

3

There are 3 best solutions below

1
K. A. Buhr On BEST ANSWER

Just do it in IO. If you need to use the result in pure code, then unsafePerformIO is available.

The following version runs about 3-4 times faster with +RTS -N16 than +RTS -N1. My changes involved converting the ST vectors to IO, changing the forM_ to forConcurrently_, and adding a bang annotation to let !v = bogus ....

Full code:

import qualified Data.Vector as Vb
import qualified Data.Vector.Mutable as Vm
import qualified Data.Vector.Generic.Mutable as Vg
import qualified Data.Vector.Generic as Gg
import Control.Monad.ST as ST ( ST, runST )
import Data.Foldable(forM_)
import Data.Char(digitToInt)
import Control.Concurrent.Async
import System.IO.Unsafe

main :: IO ()
main = do
  let m = unsafePerformIO (example 9)
  putStrLn $ show m

example :: Int -> IO (Vb.Vector Int)
example n = do
  m <- Vg.new (2^n)

  Vg.unsafeWrite m 0 (1)

  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = (choiceList n i) :: [Int]
    forConcurrently_ newEntries $ \e -> do
      let !v = bogus p e
      Vg.unsafeWrite m e v

  Gg.unsafeFreeze m

choiceList :: Int -> Int -> [Int]
choiceList _ 0 = [0]
choiceList n 1 = [ 2^k | k <- [0..(n-1) ] ]
choiceList n k
  | n == k = [2^n - 1]
  | otherwise = (choiceList (n-1) k) ++ (map ((2^(n-1)) +) $ choiceList (n-1) (k-1))

prev :: Vm.IOVector Int -> Int -> Int -> IO Integer
prev m n 0 = return 1
prev m n i = do
  let chs = choiceList n i
  v <- mapM (\k -> Vg.unsafeRead m k ) chs
  let e = map (\k -> toInteger k ) v
  return (sum e)

bogus :: Integer -> Int -> Int
bogus prior index = do
  let f = fac prior
  let g = (f^index) :: Integer
  let d = (map digitToInt (show g)) :: [Int]
  let a = fromIntegral (head d)^2
  a

fac :: Integer -> Integer
fac 0 = 1
fac n = n * fac (n - 1)
4
chi On

I think this can not be done in a safe way. In the general case, it seems it would break Haskell's referential transparency.

If we could perform multi-threaded computations within ST s, then we could spawn two threads that race over the same STRef s Bool. Let's say one thread is writing False and the other one True.

After we use runST on the computation, we get an expression of type Bool which is sometimes False and sometimes True. That should not be possible.

If you are absolutely certain that your parallelization does not break referential transparency, you could try using unsafe primitives like unsafeIOToST to spawn new threads. Use with extreme care.

There might be safer ways to achieve something similar. Outside ST, we do have some parallelism available in Control.Parallel.Strategies.

1
lehins On

There are a number of ways to do parallelization in Haskell. Usually they will give comparable performance improvements, however some are better then the others and it mostly depends on problem that needs parallelization. This particular use case looked very interesting to me, so I decided to investigate a few approaches.

Approaches

vector-strategies

We are using a boxed vector, therefore we can utilize laziness and built-in spark pool for parallelization. One very simple approach is provided by vector-strategies package, which can iterate over any immutable boxed vector and evaluate all of the thunks in parallel. It is also possible to split the vector in chunks, but as it turns out the chunk size of 1 is the optimal one:

exampleParVector :: Int -> Vb.Vector Int
exampleParVector n = example n `using` parVector 1

parallel

parVector uses par underneath and requires one extra iteration over the vector. In this case we are already iterating over thee vector, thus it would actually make more sense to use par from parallel directly. This would allow us to perform computation in parallel while continue using ST monad:

import Control.Parallel (par)
...

  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = choiceList n i :: [Int]
    forM_ newEntries $ \e -> do
      let v = bogus p e
      v `par` Vg.unsafeWrite m e v

It is important to note that the computation of each element of the vector is expensive when compared to the total number of elements in the vector. That is why using par is a very good solution here. If it was the opposite, namely the vector was very large, but elements weren't too expensive to compute, it would be better to use an unboxed vector and switch it to a different parallelization method.

async

Another way was described by @K.A.Buhr. Switch to IO from ST and use async:

import Control.Concurrent.Async (forConcurrently_)
...

  forM_ [1..n] $ \i -> do
    p <- prev m n (i-1)
    let newEntries = choiceList n i :: [Int]
    forConcurrently_ newEntries $ \e -> do
      let !v = bogus p e
      Vg.unsafeWrite m e v

The concern that @chi has raised is a valid one, however in this particular implementation it is safe to use unsafePerformIO instead of runST, because parallelization does not violate the invariant of deterministic computation. Namely, we can promise that regardless of the input supplied to example function, the output will always be exactly the same.

scheduler

Green threads are pretty cheap in Haskell, but they aren't free. The solution above with async package has one slight drawback: it will spin up at least as many threads as there are elements in the newEntries list each time forConcurrently_ is called. It would be better to spin up as many threads as there are capabilities (the -N RTS option) and let them do all the work. For this we can use scheduler package, which is a work stealing scheduler:

import Control.Scheduler (Comp(Par), runBatch_, withScheduler_)
...

  withScheduler_ Par $ \scheduler ->
    forM_ [1..n] $ \i -> runBatch_ scheduler $ \_ -> do
      p <- prev m n (i-1)
      let newEntries = choiceList n i :: [Int]
      forM_ newEntries $ \e -> scheduleWork_ scheduler $ do
        let !v = bogus p e
        Vg.unsafeWrite m e v

Spark pool in GHC also uses a work stealing scheduler, which is built into RTS and is unrelated to the package above in any shape or form, but the idea is very similar: few threads with many units of computation.

Benchmarks

Here are some benchmarks on a 16-core machine for all of the approaches with example 7 (value 9 takes on the order of seconds, which introduces too much noise for criterion). We only get about x5 speedup, because a significant part of the algorithm is sequential in nature and can't be parallelized.

enter image description here