How do I address this generic error message? SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function

I receive the following error when trying to display the training dataframe created from my training_set.

SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function mycatalog.mydatabase.product_difference_ratio_on_demand_feature(left_MaxProductAmount#6091, left_Amount#6087) failed. 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 217.0 failed 4 times, most recent failure: Lost task 0.3 in stage 217.0 (TID 823) (ip-10-0-32-203.us-west-2.compute.internal executor driver): org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC] Execution of function mycatalog.mydatabase.product_difference_ratio_on_demand_feature(left_MaxProductAmount#6091, left_Amount#6087) failed. 
== Error ==
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
== Stacktrace ==
  File "<udfbody>", line 5, in main
    return calc_ratio_difference(max_price, transaction_amount)
  File "<udfbody>", line 3, in calc_ratio_difference
    return round(((n1 - n2)/n1),2)
SQLSTATE: 39000
== SQL (line 1, position 1) ==
mycatalog.mydatabase.product_difference_ratio_on_demand_feature(`MaxProductAmount`, `Amount`)

Here is my training_set:

from databricks.feature_engineering import FeatureEngineeringClient, FeatureFunction, FeatureLookup
fe = FeatureEngineeringClient()


training_feature_lookups = [
    FeatureLookup(
      table_name="transaction_count_history",
      rename_outputs={
          "eventTimestamp": "TransactionTimestamp"
        },
      lookup_key=["CustomerID"],
      feature_names=["transactionCount", "isTimeout"],
      timestamp_lookup_key="TransactionTimestamp"
    ),
    FeatureLookup(
      table_name="product_3minute_max_price_ft",
      rename_outputs={
          "LookupTimestamp": "TransactionTimestamp"
        },
      lookup_key=["Product"],
      timestamp_lookup_key="TransactionTimestamp"
    ),
    FeatureFunction(
      udf_name="product_difference_ratio_on_demand_feature",
      input_bindings={"max_price":"MaxProductAmount", "transaction_amount":"Amount"},
      output_name="MaxDifferenceRatio"
    )
]

raw_transactions_df = spark.table("raw_transactions")


training_set = fe.create_training_set(
    df=raw_transactions_df,
    feature_lookups=training_feature_lookups,
    label="Label",
    exclude_columns=["_rescued_data"]
)
training_df = training_set.load_df()
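
Nothing fails at this point; the error only surfaces when an action forces the on-demand feature UDF to run over real rows, e.g. when displaying the dataframe:

training_df.show(5)        # plain PySpark action
# display(training_df)     # Databricks notebook equivalent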

What stands out to me is the TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

However, everything is a float. Floats go in, and a float comes out. The function itself works fine in testing.
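
For reference, the same TypeError is reproducible in plain Python whenever either operand is None, which hints that a null is reaching the UDF at runtime. A standalone sketch mirroring the function body from the stack trace:

def calc_ratio_difference(n1, n2):
    return round(((n1 - n2) / n1), 2)

print(calc_ratio_difference(14.99, 9.99))  # 0.33 -- floats are fine
print(calc_ratio_difference(None, 9.99))   # TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'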

1 Answer

Nulls were created when the lookups occurred: transactions with a timestamp earlier than the first row of the feature table had no point-in-time match, so the looked-up MaxProductAmount came back as null and the UDF received None. Putting a minimum timestamp on the base dataframe ensured every row found a matching feature value, which makes sense given the NoneType error.
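
You can confirm the diagnosis by comparing the earliest timestamp in the feature table with the earliest transaction; a quick sketch using the table names from the question:

# Transactions older than the feature table's first LookupTimestamp get no
# point-in-time match, so the lookup returns a null MaxProductAmount.
spark.table("product_3minute_max_price_ft").selectExpr("min(LookupTimestamp)").show()
spark.table("raw_transactions").selectExpr("min(TransactionTimestamp)").show()

The fix, then, is to filter the base dataframe: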

raw_transactions_df = spark.sql("SELECT * FROM raw_transactions WHERE timestamp(TransactionTimestamp) > timestamp('2023-12-12T23:38:00.000+00:00')")
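
An alternative (or complementary) hardening is to make the UDF itself null-safe, so unmatched lookups yield null instead of raising. This is a sketch only: it assumes the function is a Unity Catalog Python UDF with the signature implied by the error message, since the original CREATE FUNCTION body isn't shown in the question.

spark.sql("""
CREATE OR REPLACE FUNCTION mycatalog.mydatabase.product_difference_ratio_on_demand_feature(
    max_price DOUBLE, transaction_amount DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
def calc_ratio_difference(n1, n2):
    # Guard against nulls from unmatched lookups (and a zero denominator).
    if n1 is None or n2 is None or n1 == 0:
        return None
    return round(((n1 - n2) / n1), 2)

return calc_ratio_difference(max_price, transaction_amount)
$$
""")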