Apache Spark-ec2 script: "ERROR: Unknown Spark version". Broken init.sh?


I want to launch an AWS EC2 instance with the spark-ec2 script, but I get this error:

Initializing spark
--2016-11-18 22:33:06--  http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-11-18 22:33:06 ERROR 404: Not Found.
ERROR: Unknown Spark version
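You can confirm the 404 independently of spark-ec2 by requesting just the headers for the URL from the log above:

    $ curl -I http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
    HTTP/1.1 404 Not Found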

My locally installed Spark came from spark-1.6.3-bin-hadoop2.6.tgz, so the installation should not be trying to fetch spark-1.6.3-bin-hadoop1.tgz. In init.sh, the hadoop1 package is downloaded when HADOOP_MAJOR_VERSION == 1:

      if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
  elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
    wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
  else
    wget http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
  fi
  if [ $? != 0 ]; then
    echo "ERROR: Unknown Spark version"
    return -1
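Note that the "Unknown Spark version" message is misleading: the if [ $? != 0 ] test only checks wget's exit status, and wget exits nonzero on any server error response (status 8, "server issued an error response"), so a file that is simply missing from S3 gets reported as an unknown version. The behavior is easy to reproduce in isolation:

    $ wget -q http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
    $ echo $?
    8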

The problems are:

--There is no spark-1.6.3-bin-hadoop1.tgz at http://s3.amazonaws.com/spark-related-packages (the curl check above shows the 404 directly), so that download can never succeed; this is the immediate reason the Spark installation fails.

--HADOOP_MAJOR_VERSION is apparently being set to 1 during installation, even though my cluster runs Hadoop 2.x, which sends init.sh down the hadoop1 branch above (a way to confirm this is shown after this list).

--spark_ec2.py pulls the latest spark-ec2 from GitHub during installation, so I don't see an obvious local fix. I don't feel confident forking and patching that script on GitHub myself.
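A quick way to confirm where that 1 comes from, without touching GitHub, is to grep the launcher in the local Spark distribution (the ec2/ path is where the 1.6.x releases ship it; adjust if yours differs):

    $ grep -n "hadoop-major-version" ec2/spark_ec2.py

This should turn up an optparse option that appears to default to "1"; when the flag is omitted, that default is what gets passed to the cluster as HADOOP_MAJOR_VERSION.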

Any ideas for how to fix this?

1 Answer

The problem is solved by passing this option when invoking the spark-ec2 script locally:

--hadoop-major-version=2

see: https://github.com/amplab/spark-ec2/issues/43
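For reference, a complete launch invocation with the fix might look like this (the key pair name, identity file, region, and cluster name are placeholders for your own values):

    ./spark-ec2 --key-pair=my-key \
      --identity-file=my-key.pem \
      --region=us-east-1 \
      --hadoop-major-version=2 \
      launch my-spark-cluster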