I am trying to replicate in TensorFlow Transform some data preprocessing that I originally did in pandas.

I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not clear to me how to reproduce the same data manipulation there.

Let's look at the two main operations: join dataset a and dataset b to produce c, then group by col1 on dataset c. This is straightforward in pandas, but how would I do it in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
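For concreteness, the two pandas operations described above might look like the following minimal sketch. The join column `key`, the aggregated column `value`, the choice of an inner join with a sum, and the toy data are all assumptions made for illustration; only `a`, `b`, `c` and `col1` come from the question.

```python
import pandas as pd

# Hypothetical toy stand-ins for the CSV files mentioned in the question.
a = pd.DataFrame({"key": [1, 2, 3], "col1": ["x", "x", "y"]})
b = pd.DataFrame({"key": [1, 2, 3, 4], "value": [10.0, 20.0, 5.0, 99.0]})

# JOIN dataset a and dataset b to produce c (inner join on the assumed key).
c = a.merge(b, on="key", how="inner")

# Group by col1 on dataset c and aggregate (a sum is assumed here).
totals = c.groupby("col1")["value"].sum()
```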
join datasets with tfx tensorflow transform
Asked by DarioB
There is 1 solution below.
You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would in pandas. You can then use `to_pcollection` to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.

For top-level functions (such as `merge`), import Beam's wrapper for the pandas module as `beam_pd` and call `beam_pd.func(...)` in place of `pd.func(...)`.