Distributed Spark and HDFS Cluster with 6 to 7 Nodes hardware configuration

187 Views Asked by Nitesh At 19 July 2017 at 11:56

I am planning to spin up a development cluster for trend analysis in an infrastructure monitoring application. I plan to build it using Spark for analysing failure trends and Cassandra for storing both the incoming data and the analysed data. Consider collecting performance metrics from around 25,000 machines/servers (probably the same set of applications on different servers). I expect about 2 MB/sec of performance metrics from each machine, which I plan to push into a Cassandra table with timestamp and server as the primary key, and application along with some important metrics as clustering keys. I will then run Spark jobs on top of this stored data for failure trend analysis of the performance metrics.

Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster for the above scenario?


There is 1 best solution below

markc On 19 July 2017 at 13:01

Cassandra needs a well-planned data model for things to run well. It is very much worth spending time on planning at this stage, before you have a large data set and find out you would have done better with a different data model!

The general rule of thumb is to shape your model to your queries, while taking care to avoid things like really large rows, large deletes and batches, all of which can carry big performance penalties.
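One common way to keep rows from growing without bound in time-series data is to add a time bucket to the partition key, so each server's data is split into fixed windows. The snippet below is only a minimal sketch of that idea in Python: the function name `partition_key`, the one-hour default bucket and the key format are all assumptions for illustration, not something taken from the question's schema.

```python
from datetime import datetime, timezone


def partition_key(server, ts, bucket_hours=1):
    """Return a (server, bucket) pair to use as a composite partition key.

    Hypothetical helper: `ts` must be a timezone-aware datetime, and
    `bucket_hours` is assumed to divide 24 evenly (1, 2, 3, 4, 6, 8, 12, 24).
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    if 24 % bucket_hours != 0:
        raise ValueError("bucket_hours must divide 24 evenly")

    # Normalise to UTC so every writer computes the same bucket.
    ts = ts.astimezone(timezone.utc)

    # Round the hour down to the start of its bucket.
    bucket_start_hour = ts.hour - (ts.hour % bucket_hours)
    bucket = ts.replace(hour=bucket_start_hour, minute=0, second=0, microsecond=0)

    return server, bucket.strftime("%Y-%m-%dT%H:00Z")


if __name__ == "__main__":
    sample = datetime(2017, 7, 19, 11, 56, tzinfo=timezone.utc)
    print(partition_key("srv-001", sample))
    print(partition_key("srv-001", sample, bucket_hours=6))
```

With a key like this, a query for one server over a time range only has to touch the buckets that overlap the range. The right bucket size depends on how much data each server writes per window, so it is worth testing a few values against your own workload.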

The docs give a good start on planning and testing, which you would probably find useful. I would also recommend the cassandra-stress tool. You can use it to run performance tests against your Cassandra cluster to check latencies and spot any performance problems. You can use your own schema too, which I personally think is super useful!

If you are using cloud-based hardware like AWS, then it's relatively easy to scale up or down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
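Before picking a node count, it can help to turn the figures from the question (25,000 machines at 2 MB/sec each) into a rough ingest and storage estimate. The sketch below does only that arithmetic. The replication factor of 3, the one-day retention and the 2 TB of usable data per node are placeholder assumptions to be replaced with your own values, and the estimate ignores compression and compaction overhead.

```python
import math


def estimate_cluster(machines, mb_per_sec_per_machine, replication_factor,
                     retention_days, usable_tb_per_node):
    """Back-of-envelope storage estimate (decimal units: 1 TB = 1,000,000 MB)."""
    # Aggregate write rate across all monitored machines.
    total_mb_per_sec = machines * mb_per_sec_per_machine

    # Raw (unreplicated) volume written per day.
    raw_tb_per_day = total_mb_per_sec * 86_400 / 1_000_000

    # Total stored volume once replication and retention are applied.
    stored_tb = raw_tb_per_day * replication_factor * retention_days

    # Nodes needed purely for capacity, rounded up.
    nodes = math.ceil(stored_tb / usable_tb_per_node)

    return {
        "total_mb_per_sec": total_mb_per_sec,
        "raw_tb_per_day": raw_tb_per_day,
        "stored_tb": stored_tb,
        "nodes_for_capacity": nodes,
    }


if __name__ == "__main__":
    # Figures from the question, plus assumed RF=3, 1 day retention, 2 TB/node.
    result = estimate_cluster(25_000, 2, 3, 1, 2)
    for name, value in result.items():
        print(f"{name}: {value}")
```

If the result looks far beyond what a small cluster can hold, that is a sign to revisit the per-machine data rate, the retention period, or how much pre-aggregation happens before the data reaches Cassandra.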

I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side too, so these nodes would be running plain Cassandra (lower hardware specs). If, however, you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), then you will need beefier hardware to handle the additional load from Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.
