GCP Dataproc - HDFS is deleted automatically when you terminate the Dataproc cluster. How can I make HDFS persistent so that deleting the cluster does not delete the HDFS data? Is it possible?
GCP | Dataproc | How do I create a persistent HDFS volume, so that even if I delete the Dataproc cluster the HDFS data is not deleted? Is it possible?
1.1k Views · Asked by Devender Prakash
There are 3 best solutions below
1

When you create a Dataproc cluster in GCP, it uses the Hadoop Distributed File System (HDFS) for storage.
As you describe, when you terminate a Dataproc cluster your HDFS data is automatically deleted; this happens because it is stored on the cluster's VM disks.
HDFS data and intermediate shuffle data are stored on the VM boot disks, which are Persistent Disks if no local SSDs are attached.
If local SSDs are attached, HDFS and shuffle data are written to the SSDs instead. In either case the disks, including the VM boot disks, are deleted when the cluster is deleted.
You can also check the Dataproc documentation on how to avoid losing HDFS data that is stored on VM disks.
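For context, here is a minimal sketch of where those disk settings are declared when a cluster is created with the google-cloud-dataproc Python client. The project, region, cluster name, and sizes below are placeholder assumptions, not values from the question.

```python
# Sketch: creating a Dataproc cluster and declaring the disk config that backs HDFS.
# Project, region, cluster name, and sizes are placeholder assumptions.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # assumption: replace with your project
REGION = "us-central1"      # assumption: replace with your region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
            # HDFS data lives on these boot disks when no local SSDs are attached.
            "disk_config": {"boot_disk_size_gb": 500, "num_local_ssds": 0},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-4",
            # With num_local_ssds > 0, HDFS and shuffle data are written to the SSDs;
            # either way the disks are deleted together with the cluster.
            "disk_config": {"boot_disk_size_gb": 500, "num_local_ssds": 0},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print(operation.result().cluster_name)
```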
0

- Looking at the available documentation, it appears that persistent HDFS volumes are not available at this time.
- You can watch the Dataproc release notes for updates on this feature.
- A similar scenario to your question, "What happens to my data when a cluster is shut down?", is covered in the official Google documentation FAQ, and its answer might help you.
- As a best practice, Google recommends using Google Cloud Storage as the persistent storage layer for Dataproc. The Cloud Storage connector provides "direct data access", letting you read and write files stored in Cloud Storage directly; see the sketch after this list.
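As a small illustration of that recommendation, the sketch below (with a hypothetical bucket name) uses the google-cloud-storage Python client to show that objects in a bucket live independently of any cluster's disks.

```python
# Sketch: objects in Cloud Storage exist outside the Dataproc cluster lifecycle.
# The bucket name is an assumption; use an existing bucket you own.
from google.cloud import storage

BUCKET_NAME = "my-dataproc-data-bucket"  # assumption

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Write an object; it is stored in the bucket, not on any cluster disk.
blob = bucket.blob("datasets/example.csv")
blob.upload_from_string("id,value\n1,foo\n2,bar\n")

# Read it back; this works whether or not a Dataproc cluster is running.
print(blob.download_as_text())
```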
Google Cloud Storage can be used instead. The Cloud Storage connector is installed by default on Dataproc. Unlike HDFS, your data in Cloud Storage remains accessible after you shut down the Hadoop cluster. See the documentation on how to use the Cloud Storage connector.
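To make that concrete, here is a minimal PySpark sketch where the job reads and writes gs:// URIs through the connector instead of hdfs:// paths, so its data outlives the cluster. The bucket name and paths are assumptions for illustration.

```python
# Sketch: a PySpark job using the Cloud Storage connector via gs:// URIs.
# Submit with `gcloud dataproc jobs submit pyspark` or spark-submit on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-example").getOrCreate()

# Read input directly from Cloud Storage (no copy into HDFS needed).
df = spark.read.csv("gs://my-dataproc-data-bucket/datasets/example.csv", header=True)

# Write results back to Cloud Storage; they persist after the cluster is deleted.
df.write.mode("overwrite").parquet("gs://my-dataproc-data-bucket/output/example_parquet")

spark.stop()
```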