Filter TFRecord data with Spark/Scala without an aggregate step?


I have a very large TFRecord directory and need to filter it on a column to generate new TFRecord files.

The code looks like this:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val df = spark.read.format("tfrecords").option("recordType", "Example").load(input_path).filter(udf_filter(col("label")))
df.write.format("tfrecords").option("recordType", "Example").mode(SaveMode.Overwrite).save(output_path)

When I run it on a Spark cluster, I see it executes in two stages (an aggregate followed by the write). [screenshot of the Spark stage view]

I checked the code at https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L39, and the schema inference there is what performs the aggregation!

Can I avoid it?
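
One workaround I am considering: since the aggregate comes from schema inference, supplying an explicit schema on read might skip it entirely. I am not certain the spark-tensorflow-connector honors a user-supplied schema, and the field names and types below are placeholders for my actual Example features; udf_filter, input_path, and output_path are the same as above.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Placeholder schema: replace the field names/types with the actual columns in the Examples.
val schema = StructType(Seq(
  StructField("label", LongType),
  StructField("features", ArrayType(FloatType))
))

val df = spark.read
  .format("tfrecords")
  .option("recordType", "Example")
  .schema(schema) // if the connector accepts this, the inference aggregate should be skipped
  .load(input_path)
  .filter(udf_filter(col("label")))

df.write
  .format("tfrecords")
  .option("recordType", "Example")
  .mode(SaveMode.Overwrite)
  .save(output_path)

If that works, the job should go straight to a single read-filter-write stage.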

The corresponding GitHub issue is here: https://github.com/tensorflow/ecosystem/issues/201

