How to manage very large Solr indexes

I'm trying to plan a SolrCloud implementation, and given current index sizes from testing, my estimated physical index size for 1 billion documents is roughly 20 terabytes. So far, I've been unable to find a cloud host that can support a single volume of this size. I was hoping somebody could provide some guidance with regard to managing an index this large. Is a 20TB index absurd? Is there something I'm missing with regard to SolrCloud architecture? Most of the guidelines I've seen indicate that the entire index, regardless of shard count, should be replicated on every machine to guarantee redundancy, so every node would require a 20TB storage device. If there's anyone out there who can shed some light, I would greatly appreciate it.
611 Views · Asked by LandonC
There is 1 answer below.
I'm not sure where you read such guidelines.
It is completely normal for each shard to hold only a portion of the index (each shard having one master and a number of replicas), so no single node needs to store the entire index.
You will need to decide how to shard your index, either using the built-in routing, which distributes documents by a hash of their ID, or providing your own routing.
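As a rough sketch of what that looks like with SolrJ (the collection name, config name, shard and replica counts, and ZooKeeper addresses below are all placeholders, and this assumes SolrJ 7.x/8.x):

```java
import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster through its ZooKeeper ensemble (placeholder hosts).
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {

            // Split the collection into 40 shards with 2 replicas each (80 cores total).
            // Each node stores only the shard replicas assigned to it, not the full 20 TB.
            CollectionAdminRequest.createCollection("bigindex", "bigindex_conf", 40, 2)
                    .setMaxShardsPerNode(8)   // let one node host several shard replicas
                    .process(client);

            // With the default compositeId router, each document goes to exactly one
            // shard based on a hash of its uniqueKey; no manual routing is needed.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-0000001");
            doc.addField("title", "example document");
            client.add("bigindex", doc);
            client.commit("bigindex");
        }
    }
}
```

With 20 TB split across 40 shards, each shard comes to roughly 500 GB, which is far easier to place on ordinary cloud volumes than a single 20 TB device.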
Edit: if I understand correctly, you are assuming that every node in the cluster must hold either the master or a replica of EVERY shard. If so, the answer is no. To provide resilience, you need a master and at least one replica of every shard somewhere in the cluster, but a given node N can hold nothing at all from shard S, as long as S has its master and replica(s) on other nodes.
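To make the placement point concrete, here is a small follow-on fragment (reusing the client from the sketch above; the shard and node names are hypothetical): replicas are created per shard on whichever nodes you choose, so nothing forces every node to carry every shard.

```java
// Add one extra replica of a single shard and pin it to a particular node;
// the remaining nodes never store any data from that shard.
CollectionAdminRequest.addReplicaToShard("bigindex", "shard7")
        .setNode("solr-node-03:8983_solr")   // hypothetical node name
        .process(client);

// Inspect which nodes actually hold which shard replicas.
System.out.println(
        CollectionAdminRequest.getClusterStatus().process(client).getResponse());
```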