Problem
Hello, I'm using the accelerate library to create an application that lets the user interactively call functions that process images, which is why I'm building on and extending ghci via the GHC API.
The problem is that running the compiled executable from the shell finishes the computations in under 100ms (slightly less than 80ms on average), while running the same compiled code within ghci takes over 100ms (a bit more than 140ms on average).
Resources
sample code + execution logs: https://gist.github.com/zgredzik/15a437c87d3d8d03b8fc
Description
First of all: the tests were run after the CUDA kernel was compiled (the compilation itself added an extra 2 seconds, but that's not the issue here).
When running the compiled executable from the shell the computations are done in under 10ms (the first and second shell runs in the linked logs were passed different arguments to make sure the data wasn't cached anywhere).
When trying to run the same code from ghci and fiddling with the input data, the computations take over 100ms. I understand that interpreted code is slower than compiled code, but I'm loading the same compiled code within the ghci session and calling the same top-level binding (packedFunction). I've explicitly typed it to make sure it is specialized (same results as with the SPECIALIZE pragma).
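To make this concrete, the binding has roughly the following shape (the signature below is only illustrative, since the real packedFunction works on image data; the body just reuses the expression from the simplified example further down):

module Packed where

import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA as C

-- Illustrative only: a fully monomorphic signature, so there is nothing left
-- for the call site to specialize.
packedFunction :: Vector Double -> Scalar Double
packedFunction = C.run . A.maximum . A.map (+1) . A.use

-- For a polymorphic binding, the equivalent request would be a pragma such as
--   {-# SPECIALIZE packedFunction :: Vector Double -> Scalar Double #-}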
However, the computations do take less than 10ms if I run the main function in ghci (even when changing the input data with :set args between consecutive calls).
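For example, a session along these lines (the file names are made up, just to show the pattern):

$ ghci Main
ghci> :set args input-a.bmp
ghci> main
ghci> :set args input-b.bmp
ghci> main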
I compiled Main.hs with ghc -o main Main.hs -O2 -dynamic -threaded.
I'm wondering where the overhead comes from. Does anyone have any suggestions as to why this is happening?
A simplified version of the example posted by remdezx:
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA as C
import Data.Time.Clock            (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
    start <- getCurrentTime
    -- Build a million-element vector on the host, add 1 to every element on
    -- the GPU, and take the maximum; 'print' forces the result.
    print $ C.run $ A.maximum $ A.map (+1) $ A.use (fromList (Z:.1000000) [1..1000000] :: Vector Double)
    end <- getCurrentTime
    print $ diffUTCTime end start
When I compile and execute it, it takes 0.09s to finish.
$ ghc -O2 Main.hs -o main -threaded
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking main ...
$ ./main
Array (Z) [1000001.0]
0.092906s
But when I precompile it and run it in the interpreter, it takes 0.25s.
$ ghc -O2 Main.hs -c -dynamic
$ ghci Main
ghci> main
Array (Z) [1000001.0]
0.258224s
I investigated accelerate and accelerate-cuda and added some debug code to measure the time both under ghci and in a compiled, optimised version. The results are below; you can see the stack traces and execution times.
ghci run
compiled code run
As we can see there are two major problems: evaluation of fromList (Z:.1000000) [1..1000000] :: Vector Double, which takes 69ms extra under ghci (106ms vs 37ms), and the performGC call, which takes 57ms extra (58ms vs 1ms). These two largely account for the difference between execution under ghci and in the compiled version. I suppose that in a compiled program the RTS manages memory differently than under ghci, so allocation and GC can be faster. We can also test just this part by evaluating the code below (it does not require CUDA at all):
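A minimal sketch of that test (assuming evaluate to force the host-side array and an explicit performGC, timing each step separately):

module Main where

import Control.Exception (evaluate)
import Data.Array.Accelerate as A
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Mem (performGC)

main :: IO ()
main = do
    t0 <- getCurrentTime
    -- Build the million-element host array (the step that was ~70ms slower
    -- under ghci in the trace above).
    _  <- evaluate (fromList (Z:.1000000) [1..1000000] :: Vector Double)
    t1 <- getCurrentTime
    -- Force a collection, mirroring the performGC call seen in the trace.
    performGC
    t2 <- getCurrentTime
    print (diffUTCTime t1 t0)
    print (diffUTCTime t2 t1)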
Results:
This could be another question, but maybe someone knows: can the garbage collector somehow be tuned to work faster under ghci?
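One thing that might be worth trying (I haven't verified that it changes these numbers) is passing RTS options to the ghci process itself, for example a larger allocation area so the nursery is collected less often:

$ ghci +RTS -A128m -RTS Main
ghci> main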