Avoiding skew and determining optimal number of mappers in SQOOP import

If the source table has a primary key, SQOOP import splits on it by default and usually produces little or no skew. But what if the table has no primary key and we have to use the --split-by parameter to divide the records among multiple mappers?

Depending on the column we choose for --split-by, there is a high chance of skewed data.

Could you please help me understand how to avoid skew in such scenarios, and how to determine the optimal number of mappers for a SQOOP import?

There is 1 solution below

This is a duplicate of a question that was originally asked on community.cloudera.com.

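I posted the following possible solution for managing skew in the mappers by leveraging xargs. Instead of letting one large import split a column by range, it runs one small import per value of a filter column (order_date in the example) and lets xargs run a fixed number of them at a time. This approach allows you to avoid the skew, parallelize the ingest, and throttle the concurrent work.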

I wrote a blog post on how it works ("use xargs to handle split-by skew in sqoop").

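# Pseudo code: "..." marks arguments and dates elided here.
do_work() {
  # $1 is one order_date value; each call imports only that slice.
  # Note: with --query, Sqoop needs either --split-by or a single mapper (-m 1).
  sqoop import \
    ... \
    --query "SELECT * FROM myDb.myTable WHERE order_date = $1 AND \$CONDITIONS"
}

# Export the function so the child bash started by xargs can call it.
export -f do_work

# Bash array elements are separated by spaces; commas would become part of the values.
declare -a order_dates=(20190101 20190102 ... 20190131 20190201 ...)

# One date per line; xargs runs at most 3 imports at a time.
printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "$1"' _ {}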
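
Before pointing this at a real cluster, it can help to dry-run the fan-out. The sketch below is only an illustration, not the exact script from the blog post: echo stands in for the sqoop call, the four dates in order_dates are a made-up sample, and run_dry is a hypothetical helper name. The worker is inlined into sh -c rather than exported as a function, so it does not depend on bash's export -f, and -P is the short form of --max-procs.

```shell
# Dry run of the xargs fan-out: echo stands in for the real sqoop import.
# order_dates is a hypothetical sample; a real job would list every partition value.
order_dates="20190101 20190102 20190103 20190104"

# run_dry (hypothetical helper) prints one line per date.
# -I {} starts one worker per input line; -P 3 allows at most 3 workers at once.
# The date is passed as the positional argument $1 instead of being spliced
# into the script text, so unusual characters in a value cannot alter the command.
run_dry() {
  # $order_dates is left unquoted on purpose so it splits into one word per date.
  printf '%s\n' $order_dates |
    xargs -P 3 -I {} sh -c 'echo "would import order_date=$1"' _ {}
}

# Workers finish in no fixed order, so sort the output for a stable listing.
run_dry | sort
```

Once the dry run prints the expected dates, the echo can be swapped for the real import. Raising or lowering the -P value is the throttle: it caps how many imports hit the source database at the same time.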