Large data-structure overhead when using Julia from Python


I am working with compute-intensive Python code that manipulates very large data structures (such as dictionaries of about half a million elements) for numerical simulations. To accelerate this code, my teammate decided to give Julia a shot: we call, from the Python context, a Julia function that does the relevant portion of the calculations, then retrieve the results in Python.

However, it seems that the way the input and output data structures are passed between Julia and Python in my simplified example below incurs a very large overhead, which degrades performance to an unacceptable degree. On my laptop, the @elapsed Julia section (pretty much the entire body of the function except the return statement) takes 0.6 s, which accounts for only 10% of the 6 seconds my Python timers report for the Julia function call to return.

How can I pass large data structures between Julia and Python with minimum overhead in the simplified example below?

  • Julia v1.9.3 and Python v3.8.10, Ubuntu
  • Input/output can be assumed to always be a dictionary mapping integers to strings (i.e., the type is known at compile time).

Any insight into which steps my timers are actually capturing (or not) is appreciated, as I have very little knowledge of Julia or PyCall. Likewise for code examples!


main.py (main Python script)

from time import time

import julia
jl = julia.Julia(compiled_modules=False)

from julia import Main
Main.include("main.jl")

# Arbitrarily big data-structure
n = 1_000_000
d = {i: str(i) for i in range(n)}

# Call Julia from Python to perform an action on the large data-structure
t1 = time()
res = Main.func(d)
t2 = time()
print(f"Elapsed overall :: {t2-t1} s")

main.jl (code called from the Python file / package)

function func(d)

    t = @elapsed begin
        # Perform action on inputs
        d2 = Dict()
        for (k, v) in d
            if mod(k, 2) == 0
                d2[k] = '0'
            end
        end
    end
    println("In Julia body elapsed:  ", t)

    return d2
end

1 Answer

Firstly, if you can satisfy the requirements of PythonCall (Julia 1.6.1 upwards and Python 3.7 upwards), you should try that instead. It provides non-copy wrappers by default, without needing additional manual steps (and IMO is better documented too).
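
For example, a drop-in juliacall replacement for main.py might look roughly like the sketch below (I have not benchmarked it on your exact setup; the file and function names simply mirror your example):

from juliacall import Main as jl

# Load the same, unchanged main.jl into the Julia Main module
jl.seval('include("main.jl")')

n = 1_000_000
d = {i: str(i) for i in range(n)}

# d is passed to Julia as a non-copy wrapper by default, and the Julia
# Dict that func returns also comes back wrapped rather than copied
res = jl.func(d)

Note that res is then a dict-like juliacall wrapper around the Julia result; converting it to a plain Python dict (e.g. dict(res)) would reintroduce a copy.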

If you need to use PyCall (and for completeness' sake):

The Calling Julia from Python section of the docs says:

A Julia function f(args...) is ordinarily converted to a callable Python object p(args...) that first converts its Python arguments into Julia arguments by the default PyAny conversion ... However, you can exert lower-level control over these argument/return conversions by calling pyfunction(f, ...)

And PyAny is described as:

The PyAny type is used in conversions to tell PyCall to detect the Python type at runtime and convert to the corresponding native Julia type. ... This is convenient, but will lead to slightly worse performance

Using pyfunction (to override the default PyAny conversion with a more appropriate one) like this:

f = pyfunction(func, PyDict{Int, String})

at the end of main.jl, and changing the call in the Python file to:

res = Main.f(d)

improves the performance of the code by about 3x here.
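
For concreteness, here is a sketch of the modified main.jl, assuming PyCall's exports are brought into scope with using PyCall so that pyfunction and PyDict are available:

using PyCall  # provides pyfunction and PyDict

function func(d)
    # ... body unchanged from the question ...
end

# Wrap func so that its argument arrives as a non-copy PyDict{Int, String}
# view of the Python dict instead of going through the default PyAny conversion
f = pyfunction(func, PyDict{Int, String})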

Note that the PyDict section of the docs says:

Currently, passing Julia dictionaries to Python makes a copy of the Julia dictionary.

so my understanding is that we have only eliminated the copy in one direction. Switching to PythonCall may get rid of this second half of the overhead too.
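
If you have to stay on PyCall and also want to attack the return-direction copy, one possible (untested) sketch is to build the result directly into a Python dict via PyCall's PyDict wrapper, so that returning it to Python should not require converting a whole Julia Dict:

# Hypothetical variant of func: PyDict{K, V}() wraps a new, empty Python dict
# as a Julia AbstractDict, so the writes below go straight into the Python
# object and the return conversion should not need to copy the dictionary.
function func_nocopy(d)
    d2 = PyDict{Int, Char}()
    for (k, v) in d
        if mod(k, 2) == 0
            d2[k] = '0'
        end
    end
    return d2
end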