Consider the following benchmark:
module Main where
import qualified Data.List as L
import qualified Data.Vector.Unboxed as U
import Criterion.Main
goodSum :: Int -> Double
{-# NOINLINE goodSum #-}
goodSum n =
let ints = U.enumFromN 0 (n * n * 10) :: U.Vector Int
in U.foldl' (+) 0 $ U.map fromIntegral ints
badSum :: Int -> Double
{-# NOINLINE badSum #-}
badSum n = L.foldl' (+) 0.5 [fromIntegral i | i <- [0 .. 10*n*n]]
badSum2 :: Int -> Double
{-# NOINLINE badSum2 #-}
badSum2 n = L.foldr (+) 0.5 [fromIntegral i | i <- [0 .. 10*n*n]]
worstSum :: Int -> Double
{-# NOINLINE worstSum #-}
worstSum n = L.foldl1' (+) $ do
i <- [0 .. n*n]
return $ L.foldl1' (+) $ do
k <- [0 .. 10]
return $ fromIntegral $ k + i
main = do
defaultMain
[ bench "good" $ nf goodSum 500
, bench "bad" $ nf badSum 500
, bench "bad2" $ nf badSum2 500
, bench "worst" $ nf worstSum 500
]
The results:
benchmarking good
time 1.826 ms (1.819 ms .. 1.835 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.810 ms (1.803 ms .. 1.817 ms)
std dev 23.18 μs (19.91 μs .. 27.96 μs)
benchmarking bad
time 38.38 ms (38.07 ms .. 38.74 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.23 ms (38.07 ms .. 38.38 ms)
std dev 298.5 μs (220.6 μs .. 417.8 μs)
benchmarking bad2
time 77.87 ms (73.74 ms .. 82.67 ms)
0.992 R² (0.976 R² .. 0.998 R²)
mean 78.14 ms (75.33 ms .. 82.13 ms)
std dev 5.184 ms (3.056 ms .. 7.966 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking worst
time 80.80 ms (79.53 ms .. 82.10 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 80.73 ms (80.29 ms .. 81.19 ms)
std dev 756.9 μs (591.9 μs .. 938.2 μs)
List comprehensions are good producers and foldr
is a good consumer, so why didn't the List fuse?
Contrary to your question,
foldr
and the list comprehension did, in fact, fuse. However, you have to recall the definition offoldr
and not that it is not a tail recursive function. The prelude definesfoldr
as, which will not compile down to a tight loop like theVector
based example.The important bit of core generated for
badSum2
looks like thiswhich is roughly equivalent to this function (modulo unboxed arithmetic)
Running this through Criterion, this function gives the same runtime as
badSum2
. While the generated function is not producing and consuming intermediate cons cells, it is still performing function calls and doing all the associated stack operations.The poor performance for for the
foldl'
based version is due to the fact thatfoldl'
is not a good consumer, so it cannot fuse with the list comprehension. The left-fold will produce a tail recursive loop which walks the list produced by the list comprehension, incurring the overhead of all the allocation and associated memory operations.I am not sure if you can get the same performance as the
Vector
operations using the standard list operations, but the stream-fusion package provides list combinators using a different fusion method, which can achieve similar performance for this problem.