Imagine the following scenario:
- I have very large datasets hosted in an analytics data warehouse
- The warehouse is very efficient at handling large analytic workloads and can scale arbitrarily
- I need to process the data in a CPU-intensive way that requires loading much of the data into memory at once
- I would like to use a DataFrame API (pandas-like or spark-like)
What should I consider when choosing between Ibis and Spark for such a task?
It seems like the core difference is where the compute happens: with Ibis it runs inside the data warehouse, whereas with Spark it runs on an external cluster.
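To make that distinction concrete, here is a minimal sketch of what I mean. It is not Ibis or Spark code: an in-memory SQLite database stands in for the warehouse, and the table `events` and the functions `pushdown_totals` and `external_totals` are hypothetical names made up for illustration. The first function mimics what I understand Ibis to do (the DataFrame expression is compiled to SQL and executed by the backend, so only the result comes back), and the second mimics the Spark-style pattern of pulling raw rows out and computing on them elsewhere.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 10, float(i)) for i in range(1000)],
)


def pushdown_totals(conn):
    """Ibis-style: the aggregation runs inside the engine; only results come back."""
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
    ).fetchall()
    return dict(rows), len(rows)


def external_totals(conn):
    """Spark-style: raw rows leave the engine and are aggregated elsewhere."""
    rows = conn.execute("SELECT user_id, amount FROM events").fetchall()
    totals = {}
    for user_id, amount in rows:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    return totals, len(rows)


pushed, pushed_rows = pushdown_totals(conn)
pulled, pulled_rows = external_totals(conn)
```

Both approaches produce the same totals, but the second one moves every row out of the engine before any work is done, which is the cost I am trying to weigh against running everything in the warehouse.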
Spark seems to be the more popular choice. However, Ibis sounds like it would be cheaper and more convenient: I could use compute I am already paying for (the data warehouse itself) and avoid having to manage a Spark cluster. If this is true, I don't see why Ibis wouldn't be a more popular choice than Spark.