How to define feature presence in a TensorFlow Data Validation schema?

I want to create a new TensorFlow Data Validation schema from scratch, with a fixed name, type and presence for each feature.

import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

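# Initialisation: start from an empty schema
my_schem = schema_pb2.Schema()

# New features (one per available type); items() yields (enum name, enum number) pairs
for type_name, type_number in schema_pb2.FeatureType.items():
    my_schem.feature.add(name=f'feat_{type_number}', type=type_name)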

tfdv.display_schema(schema=my_schem)

The code above displays the following schema:

Feature name  Type          Presence  Valency  Domain
'feat_0'      TYPE_UNKNOWN                     -
'feat_1'      BYTES                            -
'feat_2'      INT                              -
'feat_3'      FLOAT                            -
'feat_4'      STRUCT                           -

How can I set the Presence property on my features?

Answer by Maxime Oriol:

As mentioned in the FeaturePresence documentation, two arguments are possible:

  1. min_fraction: the minimum fraction of examples that must have this feature
  2. min_count: the minimum number of examples that must have this feature

If min_fraction=1, 100% of examples need to have this feature, i.e. the feature is required. With any smaller value, the feature is optional.
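To make the meaning of the two arguments concrete, here is a minimal plain-Python sketch of the rule described above. It is not TFDV's implementation: the helper name `satisfies_presence` and the toy `examples` list are assumptions made up for illustration, and it only counts how many examples carry a feature and compares that against the two thresholds.

```python
def satisfies_presence(examples, feature_name, min_fraction=0.0, min_count=0):
    """Return True if enough examples carry `feature_name`.

    Hypothetical helper: it mirrors the min_fraction / min_count wording
    above, not the actual TFDV validation code.
    """
    total = len(examples)
    # An example "has" the feature if the key exists with a non-None value.
    present = sum(1 for ex in examples if ex.get(feature_name) is not None)
    fraction = present / total if total else 0.0
    return fraction >= min_fraction and present >= min_count


# Toy dataset: 'required_feat' is in every example, 'optional_feat' in half.
examples = [
    {'required_feat': 1, 'optional_feat': 7},
    {'required_feat': 2},
    {'required_feat': 3, 'optional_feat': 9},
    {'required_feat': 4},
]

required_ok = satisfies_presence(examples, 'required_feat', min_fraction=1.0)
optional_ok = satisfies_presence(examples, 'optional_feat', min_fraction=0.5)
too_strict = satisfies_presence(examples, 'optional_feat', min_fraction=1.0)
count_ok = satisfies_presence(examples, 'optional_feat', min_count=2)
```

In this toy data, 'optional_feat' passes with min_fraction=0.5 but would fail if it were declared with min_fraction=1.0, which is the difference between an optional and a required feature.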

import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

# Initialisation: start from an empty schema
my_schem = schema_pb2.Schema()

# A new required feature: it must appear in 100% of examples
my_schem.feature.add(
    name='required_feat', type='INT',
    presence=schema_pb2.FeaturePresence(min_fraction=1.0))

# A new optional feature: it must appear in at least 50% of examples
my_schem.feature.add(
    name='optional_feat', type='INT',
    presence=schema_pb2.FeaturePresence(min_fraction=0.5))

tfdv.display_schema(schema=my_schem)

The code above displays the following schema:

Feature name     Type  Presence  Valency  Domain
'required_feat'  INT   required           -
'optional_feat'  INT   optional           -