What's the difference between the two S3 source options that are available in Foundry Data Connection?

  • S3 (through Hadoop)
  • S3 (Direct)

Is one preferred for ingesting parquet files?

S3 (through Hadoop) is currently the best-tested and most flexible S3 option, but its performance with large numbers of files is very poor.

S3 (Direct) reads from S3 using the Amazon S3 SDK directly and performs significantly better than the Hadoop option, as it requires O(1) rather than O(number of files) network calls.

We recommend using the S3 (Direct) source instead where possible.
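
To make the call-count difference concrete, here is a rough back-of-envelope sketch in Python. It is a toy model, not a description of how either source is implemented: the function names, the one-call-per-file figure, the single fixed call, and the 50 ms round-trip latency are all assumptions chosen purely for illustration.

```python
def per_file_calls(num_files: int, calls_per_file: int = 1) -> int:
    """O(number of files) pattern: call count grows with the file count.

    calls_per_file is an assumed value, not a measured one.
    """
    return num_files * calls_per_file


def constant_calls(num_files: int, fixed_calls: int = 1) -> int:
    """O(1) pattern: call count does not depend on the file count.

    fixed_calls is an assumed value, not a measured one.
    """
    return fixed_calls


def overhead_ms(num_calls: int, latency_ms: int = 50) -> int:
    """Total round-trip overhead if calls run one after another.

    latency_ms is a hypothetical per-call latency.
    """
    return num_calls * latency_ms


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        slow = overhead_ms(per_file_calls(n))
        fast = overhead_ms(constant_calls(n))
        print(f"{n:>9} files: per-file {slow} ms, constant {fast} ms")
```

Under these assumed numbers the per-file overhead scales linearly with the number of files while the constant-call overhead stays flat, which is why the gap matters most for sources with many small files.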