Inexplicable PySpark SQL array indexing error: Index 1 out of bounds for length 1


I'm seeing an array indexing error, Index 1 out of bounds for length 1, that I can't explain: nothing in my AWS MWAA + EMR Serverless PySpark SQL query references an array.
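
For context, the job reads a list of SQL strings and submits each one through spark.sql, roughly like the sketch below (only run_spark_commands, spark_runner, and command_string mirror names visible in the stacktrace further down; the rest is illustrative, since the real spark_job_runner.py isn't shown in full):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_spark_commands(spark_runner, sql_commands):
    # Run each SQL statement in order; the stacktrace further down
    # comes out of this spark_runner.sql() call when it hits the INSERT.
    for command_string in sql_commands:
        spark_runner.sql(command_string)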

Here's the query I'm running:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions
  FROM dev.other_aggregates
  LIMIT 3

Interestingly, if I modify the query to drop the FROM clause, it executes successfully:

INSERT INTO TABLE dev.aggregates
  PARTITION (p_date='2024-03-03')
  BY NAME
  SELECT '13x1' AS dimensions

Here's the target table's definition, from SHOW CREATE TABLE dev.aggregates:

CREATE EXTERNAL TABLE `dev.aggregates`(
  `dimensions` string COMMENT '')
PARTITIONED BY ( 
  `p_date` varchar(11) COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://airflow-dev/tables/aggregates/parquet'
TBLPROPERTIES (
"  'parquet.compression'='SNAPPY', "
"  'projection.enabled'='true', "
"  'projection.p_date.format'='yyyy-MM-dd', "
"  'projection.p_date.interval'='1', "
"  'projection.p_date.interval.unit'='DAYS', "
"  'projection.p_date.range'='2023-12-01,NOW', "
"  'projection.p_date.type'='date', "
  'storage.location.template'='s3://airflow-dev/tables/aggregates/parquet/p_date=${p_date}')

Here's a bit more of the stacktrace:

  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 66, in <module>
    run_spark_commands(spark, sql_commands)
  File "/tmp/spark-9ef48e87-4e57-4225-a373-6579254c3f89/spark_job_runner.py", line 41, in run_spark_commands
    spark_runner.sql(command_string)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 1631, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o101.sql.
: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
    at java.base/java.util.Objects.checkIndex(Objects.java:361)
    at java.base/java.util.ArrayList.get(ArrayList.java:427)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:881)
    at org.apache.hadoop.hive.ql.metadata.Table.createSpec(Table.java:873)
    at org.apache.hadoop.hive.ql.metadata.Partition.getSpec(Partition.java:416)
    at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHivePartition(HiveClientImpl.scala:1245)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitions$4(HiveClientImpl.scala:832)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
...

Thanks in advance for any clues you might offer.

1 Answer

SomeDude:

We may have had multiple issues, but the final piece of the puzzle had to do with the WHERE clause (which I neglected to include in the example above):

INSERT INTO table1
  SELECT ...
  FROM ...
  WHERE ...
    AND id1 IN (SELECT id2 FROM table2 WHERE dt = '2024-03-08')
  ...

For whatever reason, Spark wasn't seeing table2's partition for dt = '2024-03-08', even though we could query it through Athena (via DataGrip); perhaps it has something to do with table1 using partition projection. So we explicitly re-added the partition before the INSERT query in the Spark SQL, and it worked:

ALTER TABLE table2 ADD IF NOT EXISTS
  PARTITION (dt='2024-03-08')
  LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08';
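
In the PySpark job that boils down to issuing the ALTER TABLE right before the INSERT, along these lines (a sketch; spark and insert_sql are placeholders for the real job's session and statement):

# Register the partition Spark wasn't seeing, then run the original INSERT.
spark.sql("""
    ALTER TABLE table2 ADD IF NOT EXISTS
      PARTITION (dt='2024-03-08')
      LOCATION 's3://airflow-dev/tables/table2/dt=2024-03-08'
""")
# spark.sql(insert_sql)  # insert_sql holds the INSERT INTO table1 ... statement elided above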