I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform is somewhat slower than running the same job on a lower-capacity virtual machine. My Hadoop job took 4 min 49 s on a 3-node cluster in the cloud (each node with 7.5 GB RAM and a 50 GB disk), while the same job took 3 min 20 s on a single-node virtual machine on my PC (3 GB RAM and a 27 GB disk). Why is the job slower on the multi-node cloud cluster than on a normal PC?
Why is the Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?
Asked by santobedi

To be more specific, here are the numbers and facts that would help pin down why the cloud environment is "slower":
Job type & size:
- size of the data (1 MB or 1 TB?)
- data format (XML, Parquet, ...)
- what kind of processing (e.g. word count, format conversion, ML, ...) and, of course, the options (executors and drivers) for your spark-submit or spark-shell
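The data size matters because a distributed job pays a fixed startup and coordination cost before any work is done. A rough sketch of that trade-off follows; the cost model and every number in it (`startup_s`, `per_node_overhead_s`, `throughput_mb_per_s`) are hypothetical assumptions for illustration, not measurements from Dataproc or Hadoop.

```python
def estimated_runtime_s(data_mb, nodes,
                        startup_s=30.0,            # assumed fixed job startup cost
                        per_node_overhead_s=10.0,  # assumed coordination cost per node
                        throughput_mb_per_s=5.0):  # assumed processing rate of one node
    """Toy model: fixed startup + per-node overhead + evenly split processing time."""
    processing_s = data_mb / (throughput_mb_per_s * nodes)
    return startup_s + per_node_overhead_s * nodes + processing_s


# Compare a single node with a 3-node cluster for small and large inputs.
for data_mb in (1, 1_000, 1_000_000):
    single = estimated_runtime_s(data_mb, nodes=1)
    cluster = estimated_runtime_s(data_mb, nodes=3)
    winner = "3 nodes" if cluster < single else "1 node"
    print(f"{data_mb:>9} MB: 1 node {single:10.1f} s, 3 nodes {cluster:10.1f} s -> {winner} faster")
```

Under these assumed numbers the single node wins for the tiny input and the cluster wins once the processing time outweighs the overhead. Where the crossover lies on a real cluster depends on the actual job.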
Hadoop configuration:
- do you use a distribution (Hortonworks or Cloudera)?
- Spark standalone or YARN mode?
- how are the NodeManagers configured?
First of all: this is not easy to answer without knowing the complete configuration and the type of job you're running.
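Possible reasons are:
1. Fewer resources are actually available to the job than you expect: open the ResourceManager web app at http://HOSTNAME:8080 (or port 8088, the YARN default) and compare the available vcores and memory.
2. The job incurs more overhead when it runs in parallel, which makes it slower.
I would say it is something like 1. and 2.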
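The vcores and memory shown in the ResourceManager web app can also be read from the ResourceManager REST API. The sketch below assumes the `/ws/v1/cluster/metrics` endpoint is reachable from your machine. The function names and the sample payload values are hypothetical; only the endpoint path and field names come from the YARN REST API.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch cluster-wide metrics from a YARN ResourceManager (needs network access)."""
    with urlopen(f"{rm_url}/ws/v1/cluster/metrics", timeout=10) as resp:
        return json.load(resp)


def summarize_resources(payload):
    """Pick out the vcore and memory totals that YARN can hand out to jobs."""
    m = payload["clusterMetrics"]
    return {
        "active_nodes": m["activeNodes"],
        "total_vcores": m["totalVirtualCores"],
        "available_vcores": m["availableVirtualCores"],
        "total_memory_mb": m["totalMB"],
        "available_memory_mb": m["availableMB"],
    }


# Hypothetical payload in the shape the endpoint returns (values are made up).
sample = {
    "clusterMetrics": {
        "activeNodes": 3,
        "totalVirtualCores": 6,
        "availableVirtualCores": 4,
        "totalMB": 18432,
        "availableMB": 12288,
    }
}

for key, value in summarize_resources(sample).items():
    print(f"{key}: {value}")
```

Against a live cluster you would call `summarize_resources(fetch_cluster_metrics("http://HOSTNAME:8088"))`, replacing the host and port with your ResourceManager's address. If the totals are far below what the nodes physically have, the NodeManager memory and vcore settings are worth checking.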
For a more detailed answer, let me know the details listed above.
br