I have worked with Lucene before and am now moving to Solr. The problem is that I cannot index documents in Solr as fast as I can in Lucene.
My Lucene Code:
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexer {
    public static void main(String[] args) {
        String indexDir = "/home/demo/indexes/index1/";
        IndexWriterConfig indexWriterConfig = null;
        long starttime = System.currentTimeMillis();
        try (Directory dir = FSDirectory.open(Paths.get(indexDir));
                Analyzer analyzer = new StandardAnalyzer();
                IndexWriter indexWriter = new IndexWriter(dir,
                        (indexWriterConfig = new IndexWriterConfig(analyzer)));) {
            indexWriterConfig.setOpenMode(OpenMode.CREATE);
            // The field and document instances are created once and reused for every document.
            StringField bat = new StringField("bat", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id = new StringField("id", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField name = new StringField("name", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id1 = new StringField("id1", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField name1 = new StringField("name1", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            StringField id2 = new StringField("id2", "", Store.YES); //$NON-NLS-1$ //$NON-NLS-2$
            Document doc = new Document();
            doc.add(bat);doc.add(id);doc.add(name);doc.add(id1);doc.add(name1);doc.add(id2);
            for (int i = 0; i < 1000000; ++i) {
                bat.setStringValue("book"+i);
                id.setStringValue("book id -" + i);
                name.setStringValue("The Legend of the Hobbit part 1 " + i);
                id1.setStringValue("book id -" + i);
                name1.setStringValue("The Legend of the Hobbit part 2 " + i);
                id2.setStringValue("book id -" + i);//doc.addField("id2", "book id -" + i); //$NON-NLS-1$
                indexWriter.addDocument(doc);
            }
        }catch(Exception e) {
            e.printStackTrace();
        }
        long endtime = System.currentTimeMillis();
        System.out.println("committed"); //$NON-NLS-1$
        System.out.println("process completed in "+(endtime-starttime)/1000+" seconds"); //$NON-NLS-1$ //$NON-NLS-2$
    }
}
Output: Process completed in 19 seconds
My Solr code:
SolrClient solrClient = new HttpSolrClient("http://localhost:8983/solr/gettingstarted"); //$NON-NLS-1$
// Empty the database...
solrClient.deleteByQuery( "*:*" );// delete everything! //$NON-NLS-1$
System.out.println("cleared"); //$NON-NLS-1$
ArrayList<SolrInputDocument> docs = new ArrayList<>();
long starttime = System.currentTimeMillis();
for (int i = 0; i < 1000000; ++i) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("bat", "biok"+i); //$NON-NLS-1$ //$NON-NLS-2$
    doc.addField("id", "biok id -" + i); //$NON-NLS-1$ //$NON-NLS-2$
    doc.addField("name", "Tle Legend of the Hobbit part 1 " + i); //$NON-NLS-1$ //$NON-NLS-2$
    doc.addField("id1", "bopk id -" + i); //$NON-NLS-1$ //$NON-NLS-2$
    doc.addField("name1", "Tue Legend of the Hobbit part 2 " + i); //$NON-NLS-1$ //$NON-NLS-2$
    doc.addField("id2", "bopk id -" + i); //$NON-NLS-1$ //$NON-NLS-2$
    docs.add(doc);
    // Send the accumulated batch every 250,000 documents.
    if (i % 250000 == 0) {
        solrClient.add(docs);
        docs.clear();
    }
}
// Send whatever is left over after the loop.
solrClient.add(docs);
System.out.println("completed adding to Solr. Now committing.. Please wait"); //$NON-NLS-1$
solrClient.commit();
long endtime = System.currentTimeMillis();
System.out.println("process completed in "+(endtime-starttime)/1000+" seconds"); //$NON-NLS-1$ //$NON-NLS-2$
Output : process completed in 159 seconds
My pom.xml is:
<!-- solr dependency -->
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>5.0.0</version>
</dependency>
<!-- other dependency -->
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.1.1</version>
</dependency>
<!-- Lucene dependency -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>5.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.0.0</version>
</dependency>
I downloaded Solr 5.0 and started it with $solr/bin/solr start -e cloud -noprompt, which starts Solr on 2 nodes.
I haven't changed anything in the Solr setup that I downloaded. Can anyone tell me what is going wrong? I read that Solr can be used for near-real-time indexing (http://lucene.apache.org/solr/features.html), but I cannot achieve that in my demo code, whereas Lucene indexes quickly and can be used in near real time, if not real time.
I know Solr uses Lucene, so what mistake am I making? I am still researching the scenario.
Any help or guidance is most welcome.
Thanks in advance! Cheers :)
Solr is a general-purpose, highly configurable search server. The Lucene code in Solr is tuned for general use, not specific use cases. Some tuning is possible in the configuration and the request syntax.
Well-tuned Lucene code written for a specific use-case will always outperform Solr. The disadvantage is that you must write, test, and debug the low-level implementation of the search code yourself. If that's not a major disadvantage to you, then you might want to stick to Lucene. You'll have more capability than Solr can give you, and you can very likely make it run faster.
The response you got from Erick on the Solr mailing list is relevant. To get the best indexing performance, your client must send updates to Solr in parallel.
The ConcurrentUpdateSolrClient that he mentioned is one way to do this, but it comes with a fairly major disadvantage: the client code will not be informed if any of those indexing requests fails, because ConcurrentUpdateSolrClient swallows most exceptions.
If you want proper exception handling, you will need to manage the threads yourself and use HttpSolrClient, or CloudSolrClient if you choose to run SolrCloud. The SolrClient implementations are thread-safe.
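As a rough illustration of managing the threads yourself, here is a minimal sketch that uses only the JDK. ParallelIndexer, BatchSender and indexInParallel are hypothetical names invented for this example, not SolrJ API. BatchSender stands in for whatever thread-safe call actually ships a batch to Solr, and the batch size and thread count are values you would have to tune yourself. The point is only that each batch runs on its own pool thread and that any exception it throws is handed back to the caller instead of being swallowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexer {

    /** Hypothetical stand-in for a thread-safe client call that sends one batch. */
    public interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /**
     * Splits docs into batches of batchSize, sends them on a fixed pool of
     * `threads` worker threads, waits for all of them, and returns the
     * exceptions thrown by failed batches (empty if every batch succeeded).
     */
    public static <T> List<Exception> indexInParallel(
            List<T> docs, int batchSize, int threads, BatchSender<T> sender)
            throws InterruptedException {
        if (batchSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("batchSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<>();
        List<Exception> failures = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                // Copy the slice so each task owns its batch.
                final List<T> batch = new ArrayList<>(docs.subList(start, end));
                Callable<Void> task = () -> {
                    sender.send(batch);
                    return null;
                };
                futures.add(pool.submit(task));
            }
            // Wait for every batch and collect failures instead of losing them.
            for (Future<Void> future : futures) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    failures.add(cause instanceof Exception
                            ? (Exception) cause
                            : new RuntimeException(cause));
                }
            }
        } finally {
            pool.shutdown();
        }
        return failures;
    }
}
```

With SolrJ, the sender would presumably be a lambda such as batch -> solrClient.add(batch) on a single shared HttpSolrClient (or CloudSolrClient), followed by one solrClient.commit() after indexInParallel returns. Check the returned list before committing, and decide whether failed batches should be retried or reported.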