Optimization suggestions when writing storable vector definition for union struct

710 Views Asked by At

I wrote a storable vector instance for the data type below (original question here):

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16

The code for defining those instances for Storable vector is below. While I am getting very good performance with the code below, I am very much interested in generic suggestions to improve the performance of that storable instance. By generic suggestion, I mean the following:

  • It is not specific to a GHC compiler version. You can assume GHC 6.12.3+ to exclude performance bugs if any present in earlier versions, and relevant to the code here.
  • Platform-specific suggestions are ok. You may assume x86_64 Linux platform.
  • A generic suggestion more in the form of algorithm improvement (big O) is very much valued, than a suggestion that exploits hardware-specific optimizations. But, given a basic operation like peek/poke here, there is not much scope for algorithmic improvement, as far as I can tell (and hence more valuable because it is a scarce commodity :)
  • Compiler flags for x86_64 are acceptable (e.g., telling compiler about removing floating point safe check etc.). I am using "-O2 --make" option to compile the code.

If there is any known good library source code that does similar thing (i.e., define storable instances for union/recursive data types), I will be very much interested in checking them.

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
                deriving (Show)

instance Storable Atoms where
  sizeOf _ = 1 + sizeOf (undefined :: Int32)
  alignment _ = 1 + alignment (undefined :: Int32)

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word8) `plusPtr` 1 -- get pointer to start of the    element. First byte is type of element
            t <- peek (castPtr p::Ptr Word8)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              1 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word8) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word8) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the     element. First byte is type of element

Update:

Based on feedback from Daniel and dflemstr, I rewrote the alignment, and also, updated the constructor to be of type Word32 instead of Word8. But, it seems that for this to be effective, the data constructor too should be updated to have unpacked values - that was an oversight on my part. I should have written data constructor to have unpacked values in the first place (see performance slides by John Tibbell - slide #49). So, rewriting the data constructor, coupled with alignment and constructor changes, made a big impact on the performance, improving it by about 33% for functions over vector (a simple sum function in my benchmark test). Relevant changes below (warning - not portable but it is not an issue for my use case):

Data constructor change:

data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16

Storable sizeof and alignment changes:

instance Storable Atoms where
  sizeOf _ = 2*sizeOf (undefined :: Int32)
  alignment _ = 4

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word32) `plusPtr` 1
            t <- peek (castPtr p::Ptr Word32)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              _ -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word32) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word32) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word32) `plusPtr` 1
2

There are 2 best solutions below

2
On BEST ANSWER

If speed is what you're after, then this kind of bit packing isn't the right direction to go in.

A processor always deals with word-sized operations, meaning that if you have e.g. a 32-bit processor, the smallest amount of memory that the processor can (physically) deal with is 32 bits or 4 bytes (and for 64-bit processors its 64 bits or 8 bytes). Further; a processor can only load memory at word-boundaries, meaning at byte addresses that are multiples of the word size.

So if you use an alignment of 5 (in this case), it means that your data is stored like this:

|  32 bits  |  32 bits  |  32 bits  |  32 bits  |
 [    data    ] [    data    ] [    data    ]
 00 00 00 00 01 01 00 01 00 00 00 12 34 56 78 00
 IX Value       IX Value XX XX IX Value

IX = Constructor index
Value = The stored value
XX = Unused byte

As you can see, the data gets more and more out of sync with the word boundaries, making the processor/program have to do more work to access each element.

If you increase your alignment to 8 (64 bits), your data will be stored like this:

|  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
 [    data    ]          [    data    ]          [    data    ]
 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78 00 00 00
 IX Value       XX XX XX IX Value XX XX XX XX XX IX Value       XX XX XX

This makes you "waste" 3 bytes per element, but your data structure will be much faster, since each datum can be loaded and interpreted with far fewer instructions and aligned memory loads.

If you are going to use 8 bytes anyways, you might as well make your constructor index to a Int32, since you aren't using those bytes for anything else anyways, and making all of your datum elements word-aligned further increases speed:

|  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
 [        data         ] [        data         ] [        data         ]
 00 00 00 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78
 Index       Value       Index       Value XX XX Index       Value

This is the price you have to pay for a faster data structures on current processor architectures.

0
On

Four or eight byte aligned memory access is typically much faster than oddly aligned access. It may be that the alignment for your instance is automatically rounded up to eight bytes, but I'd advise to at least measure with explicit eight byte alignment, using 32 bits (Int32 or Word32) for the constructor tag and reading and writing both types of payloads as Int32. That'll waste bits, but there's a good chance it'll be faster. Since you're on a 64-bit platform, it may be even faster to use 16-byte alignment and reading/writing Int64. Benchmark, benchmark, benchmark to find out what serves you best.