How do I save TFDV stats in the correct format so they can be loaded back in?


It is puzzling to me that there is a tfdv.load_statistics() function, but no corresponding tfdv.write_statistics() function. How do I go about saving the statistics, and then loading them again?

e.g.

import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)

# how do I save?


# load back for later use
saved_stats = tfdv.load_statistics('saved_stats.stats')

I can save the string representation to a file, but this is not the format that load_statistics expects.

with open('saved_stats.stats', 'w') as o:
    o.write(str(stats))

Pointers anyone?

There are 4 best solutions below

2
On

Have you tried tfdv.utils.stats_util.write_stats_text?

0
On

There's a function called tfdv.load_stats_binary that you can use to solve this problem.
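
As far as I can tell, tfdv.load_stats_binary expects a file holding the serialized bytes of a DatasetFeatureStatisticsList proto, so the matching save step is the protobuf SerializeToString call. A minimal sketch, assuming `stats` is the object returned by tfdv.generate_statistics_from_dataframe; `save_stats_binary` and `load_stats` are hypothetical helper names, not part of tfdv:

```python
from pathlib import Path


def save_stats_binary(stats, path):
    """Write a protobuf message (e.g. DatasetFeatureStatisticsList) as raw bytes.

    Hypothetical helper, not part of the tfdv API.
    """
    Path(path).write_bytes(stats.SerializeToString())


def load_stats(path):
    """Read the binary file back with tfdv.load_stats_binary."""
    # Imported lazily so the save helper can be used without tfdv installed.
    import tensorflow_data_validation as tfdv

    return tfdv.load_stats_binary(path)
```

Usage would be save_stats_binary(stats, 'saved_stats.stats') followed later by load_stats('saved_stats.stats').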

0
On

Okay, I figured out this hacky way to do it.

df = ... # create pandas df
from tensorflow_metadata.proto.v0 import statistics_pb2
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_dataframe(df)

# save it
with open('saved_stats.stats', 'wb') as o:
    o.write(stats.SerializeToString())

# load back for later use
with open('saved_stats.stats', 'rb') as i:
    loaded_stats = statistics_pb2.DatasetFeatureStatisticsList.FromString(i.read())
0
On

As of tfdv version 1.3.0, tfdv.write_stats_text and tfdv.load_stats_text can be used for this:

Example:

import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_dataframe(df)
stats_path = "my-stats-file.stats"

# saving
tfdv.write_stats_text(stats, stats_path)


# loading
stats = tfdv.load_stats_text(stats_path)
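
To check that the saved file really loads back, a small round-trip helper can compare a field before and after. This is only a sketch: `roundtrip_stats_text` is a hypothetical name, and it assumes the stats contain at least one dataset (which is the case for the output of generate_statistics_from_dataframe).

```python
def roundtrip_stats_text(stats, path):
    """Write stats in text format, read them back, and return the loaded proto.

    Hypothetical helper, not part of the tfdv API.
    """
    # Imported lazily so this module can be loaded without tfdv installed.
    import tensorflow_data_validation as tfdv

    tfdv.write_stats_text(stats, path)
    loaded = tfdv.load_stats_text(path)

    # Cheap sanity check: the example count should survive the round trip.
    assert loaded.datasets[0].num_examples == stats.datasets[0].num_examples
    return loaded
```

For example, loaded = roundtrip_stats_text(stats, stats_path) raises an AssertionError if the example count differs after reloading.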