I use expectations and a Check to determine whether a column of decimal type can be converted to int or long type. A column can be safely converted if it contains only integers, or decimals whose fractional part contains only zeros. I check this with the regex function rlike, as I couldn't find any other way to do it with expectations.
The question is: can I run such a check on all columns of type decimal without explicitly listing the column names? df.columns is not yet available, as we are not yet inside my_compute_function.
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E
@transform_df(
Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
inp=Input(
"ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950",
checks=Check(
            E.col('DSK').rlike(r'^(\d*(\.0+)?|0E-10)$'),
            'Decimal col DSK can be converted to int/long.',
            on_error='WARN'
        )
    ),
)
def my_compute_function(inp):
    return inp
You are right in that df.columns is not available before my_compute_function's scope is entered. There's also no way to add expectations at runtime, so hard-coding column names and generating expectations from them is necessary with this method, as sketched below.
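For example, a minimal sketch of that hard-coded approach, assuming a hypothetical DECIMAL_COLS list of your decimal column names and that Input's checks parameter accepts a list of Check objects:

from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

# Hypothetical list of the decimal columns to validate ('PRICE' and 'AMOUNT'
# are made-up names used only for illustration).
DECIMAL_COLS = ['DSK', 'PRICE', 'AMOUNT']

# One Check per hard-coded column, all built from the same regex condition.
decimal_checks = [
    Check(
        E.col(c).rlike(r'^(\d*(\.0+)?|0E-10)$'),
        f'Decimal col {c} can be converted to int/long.',
        on_error='WARN'
    )
    for c in DECIMAL_COLS
]

@transform_df(
    Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
    inp=Input(
        "ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950",
        checks=decimal_checks
    ),
)
def my_compute_function(inp):
    return inp

Since df.columns is unavailable at this point, the list itself still has to be maintained by hand, but at least the Check definitions stay in one place.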
To touch on the first part of your question - in an alternative approach you could attempt decimal -> int/long conversion in an upstream transform, store the result in a separate column, and then use E.col('col_a').equals_col('converted_col_a'). This way you could simplify your Expectation condition while also implicitly handling the cases in which the conversion would under/over-flow, as DecimalType can hold arbitrarily large/small values (https://spark.apache.org/docs/latest/sql-ref-datatypes.html).
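A minimal sketch of that alternative, assuming a hypothetical upstream transform that writes the converted value into a DSK_converted column (the intermediate dataset reference below is a placeholder, not a real RID):

from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E
from pyspark.sql import functions as F

@transform_df(
    Output("ri.foundry.main.dataset.dsk-converted"),  # placeholder reference
    source=Input("ri.foundry.main.dataset.79d9fa9c-4b61-488e-9a95-0db75fc39950"),
)
def convert_decimals(source):
    # Cast the decimal column to long and keep it alongside the original.
    # Values with a non-zero fractional part get truncated, and overflowing
    # values are expected to become null (in non-ANSI mode), so neither will
    # equal the original column in the downstream check.
    return source.withColumn('DSK_converted', F.col('DSK').cast('long'))

@transform_df(
    Output("ri.foundry.main.dataset.1e35801c-3d35-4e28-9945-006ec74c0fde"),
    inp=Input(
        "ri.foundry.main.dataset.dsk-converted",  # placeholder reference
        checks=Check(
            E.col('DSK').equals_col('DSK_converted'),
            'Decimal col DSK can be converted to int/long.',
            on_error='WARN'
        )
    ),
)
def my_compute_function(inp):
    return inp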