I have created a Cloudera 5.x cluster with the Spark option set:
I would like to run a simple test using PySpark to read data from one Datatap and write it to another Datatap.
What are the steps for doing this with PySpark?

For this example, I'm going to use the TenantStorage DTAP that is created by default for my Tenant.
I've uploaded a dataset from https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv
Next, locate the controller node and ssh into it:
Because the tenant is set up with the default Cluster Superuser Privileges (Site Admin and Tenant Admin), I can download the tenant ssh key from the cluster page and use it to ssh into the controller node:
For me, x.x.x.x is the public IP address of my BlueData gateway. Note that we are connecting to port 10007, which is the port of the controller.
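A minimal sketch of that ssh invocation. The key filename (tenant.key) and the user name (bluedata) are assumptions — use whatever the downloaded key is called and the user your deployment expects, and substitute your gateway IP for x.x.x.x:

```shell
# ssh to the controller through the BlueData gateway.
# -i : the tenant ssh key downloaded from the cluster page (assumed filename)
# -p : 10007 is the gateway port mapped to the controller node
chmod 600 ~/tenant.key        # ssh refuses private keys with open permissions
ssh -i ~/tenant.key -p 10007 bluedata@x.x.x.x
```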
Run pyspark:
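On a Cloudera cluster the `pyspark` launcher is on the PATH, so from the controller node this is a single command (the `--master yarn` flag is optional and assumes YARN is the cluster manager):

```shell
# Start the interactive PySpark shell; it creates a SparkContext named `sc`.
pyspark
```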
Access the datafile and retrieve the first record:
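Inside the pyspark shell, `sc` is the pre-created SparkContext. A small sketch of this step, wrapped in a helper for clarity — the dtap:// path assumes the CSV was uploaded to the root of the TenantStorage DTAP:

```python
def first_record(sc, path="dtap://TenantStorage/airline-safety.csv"):
    """Read a text file through a Datatap and return its first line.

    `sc` is the SparkContext the pyspark shell creates for you; the
    dtap:// URI names the Datatap (TenantStorage) and the file path.
    """
    return sc.textFile(path).first()

# In the pyspark shell:
# first_record(sc)   # the first line of the CSV (its header row)
```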
The results are:
If you want to read the data from one Datatap, process it, and save it to another Datatap, it would look something like this:
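For instance, a read–process–write round trip might look like the sketch below. The second Datatap name (OtherDataTap) and the filter are assumptions for illustration; note that saveAsTextFile writes a directory of part files, so the destination path must not already exist:

```python
def airlines_with_incidents(lines):
    """Keep CSV rows whose incidents_85_99 column (index 2) is non-zero.

    `lines` is an iterable of CSV lines whose first line is the header.
    """
    it = iter(lines)
    header = next(it)
    return [line for line in it if int(line.split(",")[2]) > 0]

# In the pyspark shell (sc is the shell's SparkContext):
# src = "dtap://TenantStorage/airline-safety.csv"
# dst = "dtap://OtherDataTap/airline-safety-filtered"   # hypothetical Datatap
# rdd = sc.textFile(src)
# header = rdd.first()
# filtered = rdd.filter(lambda l: l != header and int(l.split(",")[2]) > 0)
# filtered.saveAsTextFile(dst)                          # writes a directory
```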