Why is a Hadoop job slower in the cloud (on a multi-node cluster) than on a normal PC?


I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform is somewhat slower than running the same jobs on a lower-capacity virtual machine. My Hadoop job took 4 min 49 sec on a 3-node cluster in the cloud (each node with 7.5 GB RAM and a 50 GB disk), while the same job took 3 min 20 sec on a single-node virtual machine on my PC with 3 GB RAM and a 27 GB disk. Why is the job slower in the cloud with multi-node clustering than on a normal PC?


There are 2 best solutions below

1
On BEST ANSWER

First of all, this is not easy to answer without knowing the complete configuration and the type of job you're running.

Possible reasons are:

  1. Misconfiguration

Open the ResourceManager web app at http://HOSTNAME:8080 (note that YARN's default port for it is 8088) and compare the available vcores and memory.

  2. Job type

The job adds more overhead when it runs in parallel, so it ends up slower.

  3. Hardware

The selected virtual hardware is slower than the local one, through low disk I/O and network overhead.

My guess is that it is a combination of the first two: misconfiguration and job type.
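The vcores and memory that YARN actually sees can also be read from the ResourceManager REST API instead of the web page. The following is a minimal sketch, assuming the ResourceManager is reachable on YARN's default port 8088 and that HOSTNAME is replaced with your master node; the function names `fetch_cluster_metrics` and `summarize` are made up for this example.

```python
import json
from urllib.request import Request, urlopen


def fetch_cluster_metrics(rm_url):
    """Fetch the clusterMetrics object from the YARN ResourceManager REST API."""
    # Ask explicitly for JSON; the endpoint can also serve XML.
    req = Request(f"{rm_url}/ws/v1/cluster/metrics",
                  headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)["clusterMetrics"]


def summarize(metrics):
    """Return (total_vcores, total_memory_mb, active_nodes) from a metrics dict."""
    return (
        metrics["totalVirtualCores"],
        metrics["totalMB"],
        metrics["activeNodes"],
    )


# Example call (HOSTNAME is a placeholder, 8088 is an assumption about your setup):
# vcores, mem_mb, nodes = summarize(fetch_cluster_metrics("http://HOSTNAME:8088"))
# print(f"{nodes} active nodes, {vcores} vcores, {mem_mb} MB known to YARN")
```

If the totals are far below what three 7.5 GB nodes should offer, or fewer than three nodes are active, that points to a configuration problem rather than slow hardware.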

For a more detailed answer, let me know:

  • the size and type of the job and how you run it
  • your Hadoop configuration
  • your cloud architecture

br

0
On

To be a bit more detailed, here are the numbers and facts that would help to find the reason for the "slower" cloud environment:

  1. Job type and size:

    • size of the data (1 MB or 1 TB?)
    • data format (XML, Parquet, ...)
    • what kind of processing (e.g. word count, format conversion, ML, ...) and, of course, the options (executors and drivers) for your spark-submit or spark-shell
  2. Hadoop configuration:

    • do you use a distribution (Hortonworks or Cloudera)?
    • Spark standalone or YARN mode
    • how the NodeManagers are configured
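The data size matters because a distributed job pays a roughly fixed cost (container startup, scheduling, shuffling data over the network) before any parallel speedup shows. The toy model below is only a sketch of that reasoning: every number in it is hypothetical and was not measured on Dataproc or on the asker's machine.

```python
def estimated_runtime(work_seconds, workers, fixed_overhead_seconds):
    """Toy model: a fixed startup/coordination cost plus perfectly divided work."""
    return fixed_overhead_seconds + work_seconds / workers


# All numbers below are hypothetical, chosen only to illustrate the effect.
small_job = 120      # seconds of actual compute for a small input
large_job = 12_000   # seconds of actual compute for a large input

local_overhead = 20      # single node: little scheduling or network cost
cluster_overhead = 150   # 3-node cluster: container startup, shuffle, transfer

for name, work in [("small", small_job), ("large", large_job)]:
    local = estimated_runtime(work, 1, local_overhead)
    cluster = estimated_runtime(work, 3, cluster_overhead)
    print(f"{name}: local {local:.0f}s, cluster {cluster:.0f}s")
# prints:
# small: local 140s, cluster 190s
# large: local 12020s, cluster 4150s
```

With a small input the fixed overhead dominates and the single node wins; only with a large input does the cluster pull ahead. A job that finishes in three to five minutes may well sit in the first regime.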