I am working on the Spark (Berkeley) cluster computing system. During my research, I learned about other in-memory systems such as Redis and Memcachedb. It would be great if someone could give me a comparison between Spark and Redis (and Memcachedb). In what scenarios does Spark have an advantage over these other in-memory systems?
Compare in-memory cluster computing systems
Asked by void (5.4k views)
They are completely different beasts.
Redis and memcachedb are distributed stores. Redis is a pure in-memory system with optional persistence, featuring various data structures. Memcachedb provides a memcached API on top of Berkeley DB. In both cases, they are more likely to be used by OLTP applications or, possibly, for simple real-time analytics (on-the-fly aggregation of data).
Both Redis and memcachedb lack mechanisms to efficiently iterate over the stored data in parallel. You cannot easily scan the stored data and apply some processing to it; they are not designed for this. Also, except through manual client-side sharding, they cannot be scaled out in a cluster (a Redis Cluster implementation is in progress, though).
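To make "manual client-side sharding" concrete, here is a minimal sketch. It assumes plain Python dicts standing in for independent Redis or memcachedb instances, and the `ShardedStore` class and its method names are hypothetical, not part of any client library. The client hashes each key to pick a shard, and any "scan" has to be done by the client walking every shard itself.

```python
import hashlib


class ShardedStore:
    """Client-side sharding over several independent key/value stores.

    Each shard is a plain dict standing in for one Redis or memcachedb
    instance (an assumption so the sketch runs without a server).
    """

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash so every client maps a given key to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def set(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def scan_sum(self):
        # No parallel scan is available: the client visits every shard
        # and pulls the values over to itself before aggregating.
        return sum(v for shard in self.shards for v in shard.values())


store = ShardedStore(shard_count=3)
for i in range(10):
    store.set(f"user:{i}", i)
```

Single-key `get` and `set` calls stay cheap, which is the OLTP-style access pattern these stores are built for, while `scan_sum` shows why whole-dataset processing is awkward: the data has to travel to the client.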
Spark is a system to expedite large scale analytics jobs (and especially the iterative ones) by providing in-memory distributed datasets. With Spark, you can implement efficient iterative map/reduce jobs on a cluster of machines.
Redis and Spark both rely on in-memory data management. But Redis (and memcached) play in the same ballpark as the other OLTP NoSQL stores, while Spark is closer to a Hadoop map/reduce system.
Redis is good at running numerous fast storage/retrieval operations at high throughput with sub-millisecond latency. Spark shines at implementing large-scale iterative algorithms (machine learning, graph analysis, interactive data mining, etc.) on a significant volume of data.
Update: additional question about Storm
The question is to compare Spark to Storm (see comments below).
Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the process to the data than to move the data to the process. Each node stores (or caches) its part of the dataset, and jobs are submitted to the nodes. So the process moves to the data. It is very similar to Hadoop map/reduce, except that memory storage is used aggressively to avoid I/Os, which makes it efficient for iterative algorithms (where the output of the previous step is the input of the next step). Shark is only a query engine built on top of Spark (supporting ad-hoc analytical queries).
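The "move the process to the data" idea for an iterative job can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's API: the in-memory lists stand in for partitions cached on worker nodes, and the function names and sample values are hypothetical. Each iteration sends a small function to every partition and combines the small partial results, while the partitions themselves are loaded once and reused.

```python
from functools import reduce

# Each inner list stands in for a dataset partition cached in the memory
# of one worker node (hypothetical data, loaded once before iterating).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]


def partial_gradient(partition, w):
    # "Map" step, run where the data lives: returns (sum of errors, count).
    return sum(w - x for x in partition), len(partition)


def combine(a, b):
    # "Reduce" step: only these small partial results would cross the network.
    return a[0] + b[0], a[1] + b[1]


def fit_mean(partitions, steps=50, learning_rate=0.5):
    # Gradient descent on squared error; the minimizer is the dataset mean.
    w = 0.0
    for _ in range(steps):
        # The same cached partitions are reused on every iteration.
        partials = [partial_gradient(p, w) for p in partitions]
        total, count = reduce(combine, partials)
        w -= learning_rate * total / count
    return w


estimate = fit_mean(partitions)
```

Here `estimate` converges toward 3.5, the mean of the six values. In a disk-based map/reduce system each of the 50 passes would typically re-read its input from storage, which is the cost that in-memory caching is meant to avoid.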
You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine. Each node implements a basic process, and data items flow in and out through a network of interconnected nodes (contrary to Spark). With Storm, the data moves to the process.
Both frameworks are used to parallelize computations on massive amounts of data.
However, Storm is good at dynamically processing numerous generated/collected small data items (such as calculating some aggregation function or analytics in real time on a Twitter stream).
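The streaming model, where data items flow through fixed processing nodes, can be sketched with chained Python generators. This is a single-process illustration only, not Storm's API: the names `spout`, `split_bolt` and `count_bolt` borrow Storm's vocabulary but are hypothetical functions, and the input messages are made up.

```python
from collections import Counter


def spout(messages):
    # Source node: emits data items one at a time as they arrive.
    for message in messages:
        yield message


def split_bolt(stream):
    # Processing node: turns each message into lowercase words.
    for message in stream:
        for word in message.lower().split():
            yield word


def count_bolt(stream):
    # Processing node: keeps a running count and emits it after each item.
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Hypothetical input standing in for a live feed.
messages = ["Spark and Storm", "storm moves data", "Spark moves code"]
updates = list(count_bolt(split_bolt(spout(messages))))
```

Each item is processed as soon as it is emitted, and the aggregate is updated incrementally rather than recomputed over a stored corpus. In a real deployment the nodes would run on different machines and the input would be unbounded.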
Spark applies to a corpus of existing data (like Hadoop) that has been imported into the Spark cluster. It provides fast scanning capabilities thanks to in-memory management, and minimizes the overall number of I/Os for iterative algorithms.