Why does Frame.ofRecords garble its results when fed a sequence generated by a parallel calculation?

I am running some code that calculates a sequence of records and calls Frame.ofRecords with that sequence as its argument. The records are calculated using PSeq.map from the library FSharp.Collections.ParallelSeq.

If I convert the sequence into a list then the output is OK. Here is the code and the output:

let summaryReport path (writeOpenPolicy: WriteOpenPolicy) (outputs: Output seq) =
    let foo (output: Output) =
        let temp =
            { Name          = output.Name
              Strategy      = string output.Strategy
              SharpeRatio   = (fst output.PandLStats).SharpeRatio
              CalmarRatio   = (fst output.PandLStats).CalmarRatio }
        printfn "************************************* %A" temp
        temp
    outputs
    |> Seq.map foo
    |> List.ofSeq // this is the line that makes a difference
    |> Frame.ofRecords
    |> frameToCsv path writeOpenPolicy ["Name"] "Summary_Statistics"


Name  Name            Strategy    SharpeRatio  CalmarRatio
0     Singleton_AAPL  MyStrategy  0.317372564  0.103940018
1     Singleton_MSFT  MyStrategy  0.372516931  0.130150478
2     Singleton_IBM   MyStrategy               Infinity

The printfn call lets me verify by inspection that temp is calculated correctly in each case. The last line of code is just a wrapper around FrameExtensions.SaveCsv.

If I remove the |> List.ofSeq line then what comes out is garbled:

Name  Name            Strategy    SharpeRatio  CalmarRatio
0     Singleton_IBM   MyStrategy  0.317372564  0.130150478
1     Singleton_MSFT  MyStrategy               0.103940018
2     Singleton_AAPL  MyStrategy  0.372516931  Infinity

Notice that the empty item (corresponding to NaN) and the Infinity item are now on different rows, and other values are mixed up as well.

Why is this happening?

There are 2 answers below.

Best answer

The Frame.ofRecords function iterates over the sequence multiple times, so if your sequence returns different data each time it is enumerated, you will get inconsistent data in the frame.

Here is a minimal example:

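open Deedle

let mutable n = 0.
// Every enumeration of nums keeps incrementing n, so no two passes yield the same values.
let nums = seq { for i in 0 .. 10 do n <- n + 1.; yield n, n }

Frame.ofRecords nums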

This returns:

      Item1 Item2 
0  -> 1     12    
1  -> 2     13    
2  -> 3     14    
3  -> 4     15    
4  -> 5     16    
5  -> 6     17    
6  -> 7     18    
7  -> 8     19    
8  -> 9     20    
9  -> 10    21    
10 -> 11    22    

As you can see, the first item of each tuple (values 1 to 11) is obtained during the first iteration over the sequence, while the second item (values 12 to 22) is obtained during the second iteration.

The repeated iteration should probably be better documented, but it improves performance in typical scenarios. If you can send a PR to the docs, that would be useful.
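To make the effect easier to reproduce without Deedle, here is a minimal sketch. The twoPasses function is a hypothetical stand-in that reads its input once per column; it is a simplification for illustration, not Deedle's actual implementation. The makeNums helper is also an assumption, there only to build a sequence whose values change on every enumeration.

```fsharp
// Hypothetical stand-in for a consumer that enumerates its input
// once per column (a simplification, not Deedle's real code).
let twoPasses (xs: seq<float * float>) =
    let firsts  = xs |> Seq.map fst |> List.ofSeq   // first enumeration
    let seconds = xs |> Seq.map snd |> List.ofSeq   // second enumeration
    List.zip firsts seconds

// Builds a lazy sequence whose values depend on how often it has been enumerated.
let makeNums () =
    let counter = ref 0.
    seq {
        for i in 0 .. 2 do
            counter.Value <- counter.Value + 1.
            yield counter.Value, counter.Value
    }

// Lazy input: the two passes see different values.
let lazyResult = twoPasses (makeNums ())
// [(1.0, 4.0); (2.0, 5.0); (3.0, 6.0)]

// Materialized input: the sequence is enumerated exactly once by List.ofSeq,
// so both passes read the same stored values.
let fixedResult = twoPasses (List.ofSeq (makeNums ()))
// [(1.0, 1.0); (2.0, 2.0); (3.0, 3.0)]
```

This is why adding List.ofSeq before Frame.ofRecords in the question gives consistent rows. Array.ofSeq or Seq.cache should have the same effect, since both store the values from a single enumeration.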

Other answer

Parallel sequences are evaluated in arbitrary order because the work is split across multiple processors, so the result set can come back in a different order each time. You can sort the results afterwards, or process your data without parallelism.
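If the inputs fit in an array, one alternative that keeps the parallelism is Array.Parallel.map from FSharp.Core. Note that this swaps in a different function from the PSeq.map used in the question. It writes each result at the same index as its input and returns an ordinary array, so the output is both ordered and fully materialized. The compute function and the ticker names below are hypothetical placeholders for the real per-item calculation.

```fsharp
// Hypothetical stand-in for the expensive per-item calculation.
let compute (ticker: string) = ticker, ticker.Length * 2

let tickers = [| "AAPL"; "MSFT"; "IBM" |]

// Runs compute on multiple threads, but result i always corresponds to input i,
// and the returned array can be enumerated repeatedly without recomputation.
let results = tickers |> Array.Parallel.map compute
// [|("AAPL", 8); ("MSFT", 8); ("IBM", 6)|]
```

Because results is an array rather than a lazy sequence, passing it to a function that enumerates its input several times gives the same values on every pass.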