I have freshly installed pyarrow library in my virtualenv.
The first time I import pyarrow as pa
it takes a very long time.
Then, after importing, the first time I instantiate a Table
it takes a very long time. Subsequent Table instantiations are lightning fast however.
If I quit my python shell and re-open it... now the pyarrow import is also fast, and first instantiation is also fast.
I assume this is because it has to compile things the first time the modules are imported, but having done that once, the results are cached in my venv.
I can see in the src there are loads of .pyx and .pyd files, which I believe are Cython.
I want to deploy an app using pyarrow, either as AWS Lambda function or in a Docker container.
So I want to ensure that the library is fully pre-compiled, already in the artefacts that are deployed, before running the app. How can I achieve this?
The .pyx files are compiled when pyarrow is built, not when you import it (if you install from a binary wheel, that compilation happened before install time; the .pyd/.so files are the compiled output). Calling
import pyarrow
doesn't do any Cython compilation. The reason it is faster the second time is either that a bytecode cache (__pycache__) was created the first time, or that the OS is faster at reading all the different files the second time because they are already in the filesystem cache.
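If the slow part really is bytecode compilation, you can trigger it ahead of time with the stdlib `compileall` module, pointed at the package directory (for pyarrow that would be `os.path.dirname(pyarrow.__file__)`). A minimal, self-contained sketch, using a copy of the stdlib `json` package as a stand-in so it runs anywhere without pyarrow installed:

```python
import compileall
import json
import os
import shutil
import tempfile

# Stand-in for a real package directory such as os.path.dirname(pyarrow.__file__).
src = os.path.dirname(json.__file__)

# Copy into a writable temp dir, since system site-packages may be read-only.
tmp = tempfile.mkdtemp()
dst = os.path.join(tmp, "json")
shutil.copytree(src, dst)

# Pre-compile every .py file to bytecode; first import will then skip compilation.
ok = compileall.compile_dir(dst, quiet=1)
print(bool(ok))
```

After this runs, a `__pycache__` directory with `.pyc` files exists under the package, which is exactly what the first import would otherwise have to create.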
You can try adding
RUN python -c "import pyarrow; pyarrow.table({})"
to your Dockerfile, but I doubt it will help at run time.
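If you want to be thorough, you can combine the warm-up import with an explicit `compileall` pass over pyarrow at build time, so the bytecode cache is baked into the image layer. A sketch (the site-packages path and Python version are assumptions; adjust for your base image):

```dockerfile
# Pre-compile pyarrow's .py files to bytecode and warm-import once at build time.
# /usr/local/lib/python3.11/site-packages is an assumed path; check your image.
RUN python -m compileall -q /usr/local/lib/python3.11/site-packages/pyarrow \
 && python -c "import pyarrow; pyarrow.table({})"
```

Note the same caveat applies: most of pyarrow's import cost is loading the compiled shared libraries, which this does not speed up, so measure before and after.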