I have been testing various compression algorithms with Parquet files and have settled on Zstd.
As far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it begins with an empty one. However, with a dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.
The files written without a dictionary are considerably smaller than those written with the adaptive one (the number at the end of each name is the compression level):
- Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
- Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
- Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030
And for comparison the log from using the adaptive dictionary:
- Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
- Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
- Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779
Can you help me understand the significance of this? Shouldn't compression with an empty dictionary perform at least as well as Zstd compression with the dictionary disabled?