Understanding the L-infinity norm used in TFDV


I am trying to use TensorFlow Data Validation (TFDV) to check for drift/skew in a dataset. TFDV uses the L-infinity norm as the metric, and I don't understand the concept. Can anyone explain how it is calculated, and why the threshold is set to 0.01 here?

import tensorflow_data_validation as tfdv

train_day1_stats = tfdv.generate_statistics_from_tfrecord(data_location=train_day1_data_path)
# Add a drift comparator to schema for 'payment_type' and set the threshold of L-infinity norm for triggering drift anomaly to be 0.01.
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=train_day2_stats, schema=schema, previous_statistics=train_day1_stats)

Tensorflow Website image


There are 2 best solutions below


The COMPARATOR_L_INFTY_HIGH is triggered as follows:

  • Used Schema Fields:
    * feature.skew_comparator.infinity_norm.threshold
    * feature.drift_comparator.infinity_norm.threshold

  • Statistics Fields:
    * feature.string_stats.rank_histogram

  • Detection Condition: L-infinity norm of the vector that represents the difference between the normalized counts from the feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold

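The L-infinity norm of a vector is the largest absolute value among its components: max(abs(x1), ..., abs(xn)). In this case each xi is the difference in normalized count for one bucket, e.g. x1 = count(values in bucket1) / total values in control set - count(values in bucket1) / total values in treatment set. Once we have the L-infinity norm, we check whether it is greater than feature.skew_comparator.infinity_norm.threshold (for skew) or feature.drift_comparator.infinity_norm.threshold (for drift), and if so, COMPARATOR_L_INFTY_HIGH is triggered. A threshold of 0.01 therefore flags an anomaly when the share of any single value changes by more than 0.01 (one percentage point). The actual value needs to be fine-tuned for your particular case and data statistics.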
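To make the calculation concrete, here is a minimal sketch in plain Python. It does not call TFDV and is not TFDV's internal code; the helper names and the counts for 'payment_type' are made up for illustration.

```python
def normalized_counts(counts):
    """Turn raw value counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def l_infinity_distance(control_counts, treatment_counts):
    """Largest absolute difference in normalized frequency over all values."""
    control = normalized_counts(control_counts)
    treatment = normalized_counts(treatment_counts)
    # A value missing from one side counts as frequency 0 on that side.
    values = set(control) | set(treatment)
    return max(abs(control.get(v, 0.0) - treatment.get(v, 0.0)) for v in values)


# Hypothetical counts for 'payment_type' on two days (made-up numbers).
day1 = {"Cash": 500, "Credit Card": 450, "Unknown": 50}
day2 = {"Cash": 460, "Credit Card": 500, "Unknown": 40}

distance = l_infinity_distance(day1, day2)
threshold = 0.01
drift_detected = distance > threshold
print(f"L-infinity distance = {distance:.3f}, drift = {drift_detected}")
```

With these numbers the per-value differences are 0.04 (Cash), 0.05 (Credit Card) and 0.01 (Unknown), so the L-infinity distance is 0.05. That is above 0.01, so a drift anomaly would be reported under this rule.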


The detailed detection conditions are explained in the TensorFlow documentation (link below):

https://www.tensorflow.org/tfx/data_validation/anomalies

For your case, it says:

COMPARATOR_L_INFTY_HIGH

Schema Fields:

feature.skew_comparator.infinity_norm.threshold
feature.drift_comparator.infinity_norm.threshold

Statistics Fields:

feature.string_stats.rank_histogram*

Detection Condition: L-infinity norm of the vector that represents the difference between the normalized counts from the feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) > feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold
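On the question of why 0.01: the condition above only compares the distance against whatever threshold you set, so a sensible value depends on how much the distance fluctuates when nothing has really changed. The following is a rough sketch, not part of TFDV, that estimates this noise by drawing two samples from the same hypothetical distribution many times. The distribution, sample sizes, and function names are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def l_infinity_distance(control_counts, treatment_counts):
    """Largest absolute difference in normalized frequency over all values."""
    control_total = sum(control_counts.values())
    treatment_total = sum(treatment_counts.values())
    values = set(control_counts) | set(treatment_counts)
    return max(
        abs(control_counts.get(v, 0) / control_total
            - treatment_counts.get(v, 0) / treatment_total)
        for v in values
    )


def sample_counts(rng, probabilities, n):
    """Draw n values from a fixed categorical distribution and count them."""
    values = list(probabilities)
    weights = [probabilities[v] for v in values]
    return Counter(rng.choices(values, weights=weights, k=n))


def noise_level(rng, probabilities, rows_per_day, trials=50):
    """Approximate 95th percentile of the distance between two samples
    drawn from the SAME distribution, i.e. with no real drift."""
    distances = sorted(
        l_infinity_distance(
            sample_counts(rng, probabilities, rows_per_day),
            sample_counts(rng, probabilities, rows_per_day),
        )
        for _ in range(trials)
    )
    return distances[int(0.95 * trials)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical "true" distribution of payment_type (made-up numbers).
probabilities = {"Cash": 0.5, "Credit Card": 0.45, "Unknown": 0.05}

noise_small = noise_level(rng, probabilities, rows_per_day=500)
noise_large = noise_level(rng, probabilities, rows_per_day=20_000)
print(f"noise with 500 rows/day:    {noise_small:.4f}")
print(f"noise with 20,000 rows/day: {noise_large:.4f}")
```

In a simulation like this, small daily samples tend to produce distances well above 0.01 from sampling noise alone, while large samples stay much closer to zero. A threshold that works for one dataset can therefore raise false alarms on another, which is why it should be tuned rather than copied.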