I was trying to use TensorFlow Data Validation to check for drift/skew in a dataset. The example uses the L-infinity norm as the metric, and I don't understand the concept. Can anyone explain how it is calculated and why the threshold is set to 0.01 here?
import tensorflow_data_validation as tfdv
train_day1_stats = tfdv.generate_statistics_from_tfrecord(data_location=train_day1_data_path)
# Add a drift comparator to schema for 'payment_type' and set the threshold of L-infinity norm for triggering drift anomaly to be 0.01.
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=train_day2_stats, schema=schema, previous_statistics=train_day1_stats)
The COMPARATOR_L_INFTY_HIGH is triggered as follows:
Used Schema Fields:
* feature.skew_comparator.infinity_norm.threshold
* feature.drift_comparator.infinity_norm.threshold
Statistics Fields:
* feature.string_stats.rank_histogram
Detection Condition: L-infinity norm of the vector that represents the difference between the normalized counts from the feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold
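The L-infinity norm of a vector is its largest absolute component: max(abs(x1), ..., abs(xn)). In this case each xi is the difference in normalized counts for one value (bucket) of the feature: xi = count(bucket i in control set) / total values in control set - count(bucket i in treatment set) / total values in treatment set. Once we have the L-infinity norm, we check whether it is greater than feature.skew_comparator.infinity_norm.threshold (for skew) or feature.drift_comparator.infinity_norm.threshold (for drift). If so, COMPARATOR_L_INFTY_HIGH is triggered. A threshold of 0.01 therefore means the anomaly fires as soon as the share of any single value shifts by more than one percentage point between the two datasets. The actual value (0.01) is not special; it needs to be tuned based on your particular case and data statistics.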
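To make the calculation concrete, here is a minimal sketch in plain Python. It does not call TFDV. It only mirrors the arithmetic described above, and the helper names (`normalized_counts`, `l_infinity_distance`) and the 'payment_type' values are made up for illustration.

```python
from collections import Counter


def normalized_counts(values):
    """Map each distinct value to its share of the total count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(control_values, treatment_values):
    """Largest absolute difference between the two normalized histograms."""
    control = normalized_counts(control_values)
    treatment = normalized_counts(treatment_values)
    # A value missing from one side contributes a share of 0.0 on that side.
    all_values = set(control) | set(treatment)
    return max(abs(control.get(v, 0.0) - treatment.get(v, 0.0)) for v in all_values)


# Hypothetical 'payment_type' values for two days (made-up data, not from TFDV).
day1 = ["Cash"] * 50 + ["Credit Card"] * 45 + ["No Charge"] * 5
day2 = ["Cash"] * 40 + ["Credit Card"] * 52 + ["No Charge"] * 8

distance = l_infinity_distance(day1, day2)
threshold = 0.01
drift_detected = distance > threshold
print(f"L-infinity distance: {distance:.2f}, drift detected: {drift_detected}")
```

With this made-up data the per-value differences are 0.10 (Cash), 0.07 (Credit Card) and 0.03 (No Charge), so the L-infinity distance is 0.10 and the comparison against 0.01 reports drift.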