Lazy ByteString built from Socket handle cannot be consumed and GCed lazily


I'm writing a network file transfer application, using lazy ByteStrings as an intermediate representation:

import qualified Data.ByteString.Lazy as BSL

When I construct a BSL from a local file and then write it to a Handle obtained from a Socket:

BSL.readFile filename >>= BSL.hPut remoteH  -- OK

This works fine, and memory usage is constant. But when I receive data from the Socket and write it to a local file:

BSL.hGet remoteH size >>= BSL.hPut fileH  -- starts swapping in 1 second

I can see memory usage keep going up until the BSL takes size bytes of memory. Worse, when size exceeds my physical memory, the OS starts swapping immediately.

As a workaround, I have to receive the data as a series of smaller ByteString segments, recursively. That is OK.
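A minimal sketch of such a segment-by-segment receive, using strict ByteStrings so that only one segment is held in memory at a time. The names copyN and receiveToFile and the 64 KiB segment size are assumptions made for illustration, not part of the original program:

```haskell
import qualified Data.ByteString as BS
import System.IO

-- Copy up to n bytes from src to dst, one strict chunk at a time.
-- Returns the number of bytes actually copied (fewer than n on early EOF).
copyN :: Handle -> Handle -> Int -> IO Int
copyN src dst = go 0
  where
    chunkSize = 64 * 1024 -- assumed segment size; tune as needed

    go :: Int -> Int -> IO Int
    go done remaining
      | remaining <= 0 = return done
      | otherwise = do
          -- hGetSome returns as soon as some data is available
          c <- BS.hGetSome src (min chunkSize remaining)
          if BS.null c
            then return done -- EOF: the peer closed the connection
            else do
              BS.hPut dst c
              go (done + BS.length c) (remaining - BS.length c)

-- Hypothetical wrapper: receive `size` bytes from a handle into a file.
receiveToFile :: Handle -> FilePath -> Int -> IO Int
receiveToFile remoteH path size =
  withBinaryFile path WriteMode $ \fileH -> copyN remoteH fileH size
```

Each chunk becomes garbage as soon as it has been written, so memory use stays around one segment regardless of size.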

Why does BSL behave like that?


There are 2 best solutions below

0
On BEST ANSWER

hGet is strict: it reads all of the bytes you requested before it returns, so the whole result is already in memory by the time hPut starts writing. It does this in order to facilitate packet-level reading of data.

However, hGetContentsN is lazy, and readFile is implemented in terms of hGetContentsN.

Consider the two implementations:

hGetContentsN :: Int -> Handle -> IO ByteString
hGetContentsN k h = lazyRead -- TODO close on exceptions
  where
    lazyRead = unsafeInterleaveIO loop

    loop = do
        c <- S.hGetSome h k -- only blocks if there is no data available
        if S.null c
          then do hClose h >> return Empty
          else do cs <- lazyRead
                  return (Chunk c cs)

and

hGet :: Handle -> Int -> IO ByteString
hGet = hGetN defaultChunkSize

hGetN :: Int -> Handle -> Int -> IO ByteString
hGetN k h n | n > 0 = readChunks n
  where
    STRICT1(readChunks)
    readChunks i = do
        c <- S.hGet h (min k i)
        case S.length c of
            0 -> return Empty
            m -> do cs <- readChunks (i - m)
                    return (Chunk c cs)

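The key magic is the laziness in hGetContentsN: unsafeInterleaveIO defers each read until the corresponding chunk is actually demanded, so a consumer such as hPut can write and release one chunk at a time. readChunks in hGetN, by contrast, performs every read before it returns.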
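For comparison, here is a hedged sketch of a size-bounded lazy read written in the same style as hGetContentsN. lazyGetN and receiveLazily are hypothetical names, not part of the bytestring API, and the sketch deliberately leaves the handle open instead of closing it at EOF. It relies on Data.ByteString.Lazy.Internal for the Chunk and Empty constructors and inherits the usual lazy I/O caveat that reads happen whenever the result is consumed:

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Internal (ByteString (..), defaultChunkSize)
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazily read at most n bytes from h in chunks of at most k bytes.
-- Each chunk is only read when the consumer demands it.
lazyGetN :: Int -> Handle -> Int -> IO L.ByteString
lazyGetN k h = lazyRead
  where
    lazyRead n = unsafeInterleaveIO (loop n)

    loop n
      | n <= 0 = return Empty
      | otherwise = do
          c <- S.hGetSome h (min k n)
          if S.null c
            then return Empty -- EOF; the handle is left open on purpose
            else do
              cs <- lazyRead (n - S.length c)
              return (Chunk c cs)

-- Hypothetical usage: stream at most `size` bytes between two handles.
receiveLazily :: Handle -> Handle -> Int -> IO ()
receiveLazily remoteH fileH size =
  lazyGetN defaultChunkSize remoteH size >>= L.hPut fileH
```

Because L.hPut writes chunk by chunk and nothing else keeps a reference to the head of the lazy ByteString, consumed chunks can be garbage collected as the transfer proceeds. The handle must stay open until the result has been fully consumed.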

0
On

I can't answer authoritatively on the behavior of lazy bytestrings, but I would recommend that you look into some kind of streaming approach, like conduit or enumerator. With conduit, you could write something like:

import Data.Conduit
import Data.Conduit.Binary

main = do
    let filename = "something"
    remoteH <- getRemoteHandle
    runResourceT $ sourceHandle remoteH $$ sinkFile filename

You can also bypass the Handle abstraction entirely if you wish by using network-conduit and something like:

runResourceT $ sourceSocket socket $$ sinkFile filename