How to write to HDFS using Kedro


I'm trying to write the output of my Kedro pipeline to the HDFS file system, but I couldn't find how to do that on the internet or in the Kedro documentation. If anybody has configured HDFS in a Kedro catalog, please share a sample of how to do it.

Also, how do I connect to HDFS securely using credentials?

I have the data in a pandas DataFrame.

What would the entry for this in catalog.yml look like, and where do I specify the credentials?


2 Answers

Answer 1

In your catalog you can define the filepath like hdfs://user@server:port/path/to/data

https://kedro.readthedocs.io/en/stable/data/data_catalog.html#specifying-the-location-of-the-dataset
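For example, since the data is in a pandas DataFrame, a catalog entry could look like the sketch below. The dataset name, host, port, and output path are placeholders, and the exact credential keys are an assumption: Kedro passes the credentials dict through to the underlying fsspec HDFS filesystem, so check which fields your setup accepts.

    # conf/base/catalog.yml -- minimal sketch, names and host/port are placeholders
    my_output_dataset:
      type: pandas.ParquetDataSet
      filepath: hdfs://namenode:8020/path/to/output.parquet
      credentials: hdfs_credentials

    # conf/local/credentials.yml -- referenced by the `credentials` key above;
    # the accepted fields depend on the fsspec HDFS implementation
    hdfs_credentials:
      user: my_user

Keeping the credentials under conf/local/credentials.yml (which Kedro resolves by key and keeps out of version control) is the usual place to answer the "where do I mention the credentials" part of the question.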

Answer 2

Assuming you can write to HDFS from outside Kedro (standalone Spark), this should be straightforward from Kedro.

Use the SparkDataSet in your catalog file, define properties such as the Hive metastore in spark.yml, and that should be it. A sketch of such a spark.yml follows below.
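For illustration, a conf/base/spark.yml carrying Hadoop/Hive properties might look like this. It's a sketch: the hostnames and ports are placeholders, and the property values depend entirely on your cluster.

    # conf/base/spark.yml -- loaded into the SparkSession by a project hook;
    # all hostnames and ports below are placeholders for your cluster
    spark.hadoop.fs.defaultFS: hdfs://namenode:8020
    spark.sql.catalogImplementation: hive
    spark.hadoop.hive.metastore.uris: thrift://metastore-host:9083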

Then, like Rahul mentioned above, you need to specify the full path to the HDFS location you want to write to. If you are still facing issues, please share some snapshots.

dataset_name:
  type: spark.SparkDataSet
  filepath: hdfs://your_bucket/location/file.parq
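If the write needs an explicit format or save options, the same entry can be extended. file_format and save_args are standard SparkDataSet options (save_args is passed to Spark's DataFrameWriter), but the values shown here are assumptions for illustration:

    dataset_name:
      type: spark.SparkDataSet
      filepath: hdfs://your_bucket/location/file.parq
      file_format: parquet
      save_args:
        mode: overwrite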