Setting Hive properties in Amazon EMR?


I'm trying to run a Hive query on Amazon EMR, and I'd also like to get Apache Tez working with it. From what I understand from the Hive site, that requires setting the hive.execution.engine property to tez.

I know that Hive properties can usually be set with set hive.{...} or in hive-site.xml, but I don't know whether either approach is possible on Amazon EMR, or how it interacts with it.

So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?

Thanks!


There are 2 best solutions below

0
On

Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.

You can execute Hive commands either interactively (by logging into the master node) or via scripts (submitted as job 'steps').
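
As a rough sketch of those two routes, assuming the hive CLI is available on the master node and the AWS CLI is configured on your own machine. The function names, cluster ID, and S3 path are hypothetical placeholders, and setting the engine to tez only works if Tez is actually installed on the cluster:

```shell
#!/bin/sh
# Route 1: run a statement on the master node (after logging in via SSH).
# --hiveconf sets a Hive property for this invocation only.
# $1 = HiveQL statement to run
run_interactively() {
    hive --hiveconf hive.execution.engine=tez -e "$1"
}

# Route 2: submit a Hive script stored in S3 as a step.
# $1 = cluster ID (hypothetical, e.g. j-XXXXXXXXXXXXX)
# $2 = S3 path of the script (hypothetical)
submit_hive_step() {
    aws emr add-steps --cluster-id "$1" \
        --steps "Type=HIVE,Name=HiveQuery,ActionOnFailure=CONTINUE,Args=[-f,$2]"
}

# Example calls (not executed here):
#   run_interactively 'SHOW TABLES;'
#   submit_hive_step j-XXXXXXXXXXXXX s3://myBucket/query.hql
```

The functions are only defined, not called, so nothing is sent to a cluster until you invoke them with real values.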

You would be responsible for installing Tez on Amazon EMR yourself. I found this forum post: TEZ on EMR

0
On

You can do this in two ways:

1) DIRECTLY WITHIN A SINGLE HIVE SCRIPT (.hql file)

Just set your properties at the beginning of your Hive .hql script. Properties set this way apply only to the session that runs the script:

set hive.execution.engine=tez;
CREATE TABLE...
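
A slightly fuller sketch of the same idea, assuming the hive CLI is on the PATH of the machine where it runs (for example the EMR master node). The file name query.hql and the table my_table are hypothetical:

```shell
#!/bin/sh
# Write a small Hive script whose first statement selects the Tez engine.
cat > query.hql <<'EOF'
SET hive.execution.engine=tez;
-- SET with a key but no value prints the value currently in effect.
SET hive.execution.engine;
SELECT COUNT(*) FROM my_table;
EOF

# Run the script only where the hive CLI exists.
if command -v hive >/dev/null 2>&1; then
    hive -f query.hql
else
    echo "hive CLI not found; wrote query.hql only"
fi
```

Printing the property right after setting it is a cheap way to confirm that the script, and not a cluster-wide default, decided the engine.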

2) VIA APPLICATION CONFIGURATIONS

When you create an EMR cluster, you can specify Hive configurations that apply for the entire lifetime of the cluster. This can be done either via the AWS Management Console or via the AWS CLI.

a) AWS Management Console

  1. Open the AWS EMR service and click the Create cluster button


  2. Click on Go to advanced options at the top


  3. Be sure to select Hive among the applications, then enter a JSON configuration containing the properties you would normally put in hive-site.xml, such as the Tez property hive.execution.engine. You can optionally load the JSON from an S3 path.

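
For reference, a minimal sketch of such a JSON configuration, assuming EMR's hive-site configuration classification and that Tez is available on the cluster. It is written to a local file named configurations.json, the file name used in the CLI example; the heredoc wrapper is only a convenience for creating the file:

```shell
#!/bin/sh
# Create configurations.json with a single hive-site classification.
# Each key under "Properties" corresponds to a property in hive-site.xml.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  }
]
EOF
```

Further hive-site properties can be added as extra key/value pairs inside "Properties".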

b) AWS CLI

As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:

aws emr create-cluster \
    --configurations file://configurations.json \
    --release-label emr-5.9.0 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
    --auto-terminate

The JSON file has the same content as the configuration entered in the Management Console example.

Again, you can optionally specify an S3 path instead:

--configurations https://s3.amazonaws.com/myBucket/configurations.json
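
One possible way to get the file into S3 and to check afterwards that the setting reached the cluster, sketched under assumptions: the AWS CLI is configured, and the function names, bucket name, and cluster ID are hypothetical placeholders.

```shell
#!/bin/sh
# Upload the local configuration file to a bucket.
# $1 = local JSON file, $2 = bucket name (hypothetical, e.g. myBucket)
upload_config() {
    aws s3 cp "$1" "s3://$2/configurations.json"
}

# Print the configurations recorded for a cluster.
# $1 = cluster ID (hypothetical, e.g. j-XXXXXXXXXXXXX)
show_cluster_config() {
    aws emr describe-cluster --cluster-id "$1" --query 'Cluster.Configurations'
}

# Example calls (not executed here):
#   upload_config configurations.json myBucket
#   show_cluster_config j-XXXXXXXXXXXXX
```

On the master node itself, running hive -e 'SET hive.execution.engine;' prints the engine that Hive will actually use.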