I'm trying to evaluate Citus and Greenplum in terms of using them as a Data Warehouse. The general idea is that data from multiple OLTP systems will be integrated in real time via Kafka Connect in a central warehouse for analytical queries.
How does Citus compare to Greenplum in this respect? I have read that Citus has some SQL limitations, e.g. correlated subqueries are not supported if the correlation is not on the distribution column, does Greenplum have similar SQL limitations? Will Greenplum work well if data is being streamed into it (as opposed to batch updates)? I'm just having this feeling that Greenplum is more analytics-focused and can sacrifice some OLTP-specific things, which Citus cannot afford since they position themselves as HTAP (not OLAP). Citus also positions itself as a solution for sub second query times, which is not necessary for my use case - several seconds (up to 5) per query will be satisfactory.
I am not aware of any SQL limitations for Greenplum, like the one you mention above. In some cases, i.e.
CUBE
orpercentile_*
window functions (ordered-set aggregate functions) GPORCA, the Greenplum database query optimiser, will fall back to the PostgreSQL query optimiser and these queries won't be as performant as GPORCA-enabled queries - but you would still get a response to your query.I'd say getting streaming data in vs. batch updates is one thing - using Kafka Connection with JDBC, would work out-of-the-box but won't be taking any advantage of the parallel distributed nature of Greenplum as all your data would have to pass through the coordinator.
What would be optimal is to use something like the Greenplum Streaming Server (GPSS) which would write the data delivered from the client directly into the segments of the Greenplum Database cluster and would allow maximum parallelism and best stream loading performance.