If the source table has a primary key, a Sqoop import usually splits the data across mappers without much skew. But what if the table has no primary key defined and we have to use the --split-by parameter to split records among multiple mappers?
In that case the data is quite likely to be skewed, depending on the column we pass to --split-by.
Could you please help me understand how to avoid skew in such scenarios, and also how to determine the optimal number of mappers for any Sqoop import?
This is a duplicate of a question that was originally asked on community.cloudera.com.
There I posted a possible solution for managing skew across the mappers by leveraging xargs. The approach lets you avoid the skew, parallelize the ingest, and throttle the concurrent work.
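A minimal sketch of the general idea, not the exact script from the linked post: instead of letting one Sqoop job split a skewed column across mappers, run one single-mapper import per distinct value of the column and let xargs act as a work queue that caps how many imports run at once. The connection string, table name, column name, target path, and the RUNNER variable are all hypothetical placeholders. The split column is assumed to be numeric so the values need no quoting, and authentication options are left out.

```shell
# Hypothetical settings; replace with your own. All of these are assumptions.
JDBC_URL="${JDBC_URL-jdbc:mysql://dbhost/sales}"
TABLE="${TABLE-orders}"
SPLIT_COL="${SPLIT_COL-region_id}"          # assumed numeric column
TARGET_BASE="${TARGET_BASE-/data/raw/orders}"
MAX_PARALLEL="${MAX_PARALLEL-4}"            # imports allowed to run concurrently
RUNNER="${RUNNER-echo}"                     # "echo" = dry run; set RUNNER= to execute
export JDBC_URL TABLE SPLIT_COL TARGET_BASE RUNNER

# Reads one split-column value per line on stdin.
# xargs starts one single-mapper import per value and keeps at most
# MAX_PARALLEL of them running; a new one starts as soon as a slot frees up.
run_imports() {
  xargs -P "$MAX_PARALLEL" -I {} sh -c '
    $RUNNER sqoop import \
      --connect "$JDBC_URL" \
      --table "$TABLE" \
      --where "$SPLIT_COL = $1" \
      --target-dir "$TARGET_BASE/$SPLIT_COL=$1" \
      --num-mappers 1
  ' _ {}
}
```

For example, `printf '%s\n' 1 2 3 | run_imports` prints three sqoop commands in the default dry-run mode (the order may vary because they run in parallel). In a real run the list of values would come from a query for the distinct values of the split column. Large values still take longer than small ones, but because each import is its own job, a heavy value occupies only one slot while the remaining slots keep working through the rest of the list.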
I also wrote a blog post on how it works: "use xargs to handle split-by skew in sqoop".