I am looking to load ~10GB of data into memory and perform SQL on it in the form of:
- Sort on a single column (any column)
- Aggregate on a single column (any column)
- Filter on a single column (any column)
What might be a good choice for performance? Some solutions I've come across that could possibly work are TimesTen, ExtremeDB, and SQL Server In-Memory OLTP, or even dataframe libraries such as Vaex or cuDF.
I am looking to optimize query times -- that is really all I care about. For a conceptual example, think of something like an Excel table where a user can sort or filter any column (and the application does not know ahead of time which columns to 'index', because any column may be used).
Update: I'm posting my benchmarks from pandas below. I know pandas isn't ideal for this, but it's great to prototype and get benchmarking figures:
File (20M rows, 1.2GB): https://storage.googleapis.com/gcp-files/Sales20M.csv.
- Load time (`pd.read_csv`): 10.7s
- Aggregation (`df.groupby('currency_code_id').count`): 3.3s
- Sort (`df.sort_values('price')`): 6.8s
- Pivot (`df.pivot_table(index='code', columns='territory_id', values='id', aggfunc=len, fill_value=0)`): 3.4s
If using a database, please do not create indexes, since the use case is that we do not know which columns will be used beforehand. (Alternatively, I suppose you could create an index on every field -- but if so, please include the creation of all those indexes in the load time.)
Which tool would be the best for this?
I guess you want to materialize a random data file and perform sub-second queries over it, and you are ready to pay the price (as in-memory features are usually enterprise-only). For SQL Server, for example, there are many options: features such as In-Memory OLTP or columnstore indexes are advertised as achieving "gains of up to 10 times the query performance" or "5 to 20 times faster" performance, or you can just use partitioning, or PostgreSQL, or MongoDB shards. There are so many examples and demonstrations of such technologies showing sub-second performance ... but it depends on the case, because there are limitations.
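To make one of those options concrete, here is a minimal sketch of the columnstore route (assuming the `dbo.Sales20M` table used later in this answer):

```sql
-- Minimal sketch: a single clustered columnstore index over the whole table.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sales20M ON dbo.Sales20M;

-- Aggregations then run as batch-mode scans over compressed column segments:
SELECT currency_code_id, COUNT(*) AS cnt
FROM dbo.Sales20M
GROUP BY currency_code_id;
```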
In your case, having 10 GB of data and wanting good performance, you do not need to do anything special. Just analyze and normalize the data prior to the insert and create the corresponding indexes. Well begun is half done, and spending some time on getting the data written in the right way will give you the performance you need. For example:
- Insert the data file into a table.
- For each column in the table, compute its `count distinct`; if `count distinct / count` is smaller than some threshold X, create a separate lookup table with columns `id` and `value`.

Reducing the size of the table will reduce the number of I/O operations, and searching and grouping by numbers is faster than doing so by text.
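For a low-cardinality text column, a minimal sketch of that step (assuming a table `dbo.Sales` with a text column `city`; both names are made up for illustration):

```sql
-- Sketch only: dbo.Sales and the "city" column are assumed names.
-- 1) Build the lookup table from the distinct values.
CREATE TABLE dbo.Cities
(
    id   INT IDENTITY(1,1) PRIMARY KEY,
    name NVARCHAR(100) NOT NULL UNIQUE
);

INSERT INTO dbo.Cities (name)
SELECT DISTINCT city
FROM dbo.Sales
WHERE city IS NOT NULL;

-- 2) Add the numeric reference column (GO ends the batch so the new column is visible).
ALTER TABLE dbo.Sales ADD city_id INT;
GO

-- 3) Back-fill it, drop the text column, and index the small integer instead.
UPDATE s
SET    s.city_id = c.id
FROM   dbo.Sales AS s
JOIN   dbo.Cities AS c ON c.name = s.city;

ALTER TABLE dbo.Sales DROP COLUMN city;
CREATE INDEX IX_Sales_city_id ON dbo.Sales (city_id);
```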
Of course, you need to change the application: instead of searching by some city name, you will filter by its `id`. And after a `count` of cities per country `id`, you will perform a second query to transform those `id`s into names. I feel that applying some fundamental principles in your case will work better than using some high-level tech at a high price, with limitations that can become critical in the future when new requirements for the application come.
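In query terms, the two steps on that assumed schema could look like this:

```sql
-- Sketch, same assumed schema as above: aggregate on the numeric id only.
SELECT city_id, COUNT(*) AS cnt
FROM   dbo.Sales
GROUP  BY city_id;

-- A second, tiny query maps only the ids that end up on screen back to names.
SELECT id, name
FROM   dbo.Cities
WHERE  id IN (1, 2, 3);   -- ids taken from the first result set
```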
I tested on a virtual machine with 8 GB RAM and 4 virtual processors. Unfortunately, it is on an HDD and the I/O is pretty bad from here, running SQL Server 2019 Standard edition. So, because of the hardware, the data load is slow (with `bcp` it would be better, but ... HDD after all). So, the table looks like:
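Something along these lines (a sketch: the column names come from the benchmark queries above, while the data types and the remaining columns are guesses):

```sql
-- Sketch of the table; data types are assumptions.
CREATE TABLE dbo.Sales20M
(
    id               INT           NOT NULL,
    code             VARCHAR(32)   NULL,
    price            DECIMAL(18,2) NULL,
    currency_code_id INT           NULL,
    territory_id     INT           NULL
    -- ... remaining columns from the CSV
);

-- "Index on each field":
CREATE INDEX IX_Sales20M_id               ON dbo.Sales20M (id);
CREATE INDEX IX_Sales20M_code             ON dbo.Sales20M (code);
CREATE INDEX IX_Sales20M_price            ON dbo.Sales20M (price);
CREATE INDEX IX_Sales20M_currency_code_id ON dbo.Sales20M (currency_code_id);
CREATE INDEX IX_Sales20M_territory_id     ON dbo.Sales20M (territory_id);
```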
You can see how lazy I am: no normalization, and an index on each field, leading to 3.2 GB allocated for the data and indexes. But some results:
```sql
select count(*), currency_code_id
from dbo.Sales20M
group by currency_code_id
```
Of course, 0 seconds, as we use the index.
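The sort is along these lines (a sketch; sorting by `price` and the `TOP (100)` page size are assumptions):

```sql
-- Sketch: sort, but return only the page of rows that will actually be displayed.
SELECT TOP (100) *
FROM dbo.Sales20M
ORDER BY price;
```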
0 seconds, but note that I am using TOP. Basically, when you are sorting, you only need to display part of the rows, right? And if I sort all the rows for some reason, it runs for 2 seconds (I am inserting the data into a table, as the rendering takes time, too); a sketch of that full sort follows.
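(A sketch; `SELECT ... INTO` is one assumed way of writing the sorted rows into a table.)

```sql
-- Sketch: sort all 20M rows and materialize them instead of rendering them.
SELECT *
INTO dbo.Sales20M_sorted
FROM dbo.Sales20M
ORDER BY price;
```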
As to the PIVOT, it is not very fast in SQL Server, but you can use R if you need something massive. I do not understand yours, but I made a PIVOT by `code` and `territory_id`, calculating the average price in USD. I am lazy again and not using a dynamic PIVOT. It takes 0-1 seconds; a sketch of it is below.
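(A sketch; the hard-coded territory ids are assumptions, which is exactly what a dynamic PIVOT would avoid, and the conversion to USD is omitted.)

```sql
-- Sketch: average price per code, pivoted over a few hard-coded territory ids.
SELECT code, [1] AS territory_1, [2] AS territory_2, [3] AS territory_3
FROM
(
    SELECT code, territory_id, price
    FROM dbo.Sales20M
) AS src
PIVOT
(
    AVG(price) FOR territory_id IN ([1], [2], [3])
) AS p;
```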
Conclusion:
My point is that even though my setup is bad and I am being super lazy, not spending time to normalize the data and create proper indexes, I am still getting close-to-0-second results. You can simply start with something free like PostgreSQL and I believe you will get good results. Of course, the "fancy" stuff is always there if you need it in order to optimize a particular use case.