I want to automatically index a document or a website as soon as it is fed to Apache Solr. How can this be achieved? I have seen examples that use a cron job to call a PHP script, but their explanations are not very clear. Using the Java API SolrJ, is there a way to index data automatically, without having to trigger it manually?
How to auto-index data using solr and nutch?
Asked by Saurabh Chaturvedi (518 views). There are 2 answers below.
Ali
If you are using Apache Nutch, use Nutch's Solr indexing plugin (indexer-solr in Nutch 1.x). With this plugin, web documents can be indexed as soon as Nutch has crawled them. The main question then becomes how to schedule Nutch to run periodically.
As far as I know, you need a scheduler for this. I know of an old Nutch project called Nutch-base that uses the Quartz scheduler to schedule Nutch jobs. You can find its source code at the following link:
https://github.com/mathieuravaux/nutchbase
That project contains a plugin called admin-scheduling. Although it was implemented for an old version of Nutch, it could be a good starting point for developing a scheduler plugin for Nutch.
It is also worth noting that if you want to crawl a website periodically and fetch newly added links, you can use this tutorial.
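If you do not want to write a full Nutch plugin, a rough alternative is a small Java program that re-runs your crawl command on a timer. The sketch below is only an illustration under assumptions: the class name CrawlScheduler, the 60-minute delay, and the idea of passing the whole crawl command line as program arguments are mine, not part of Nutch. It uses only the JDK (ProcessBuilder and ScheduledExecutorService).

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: re-runs an external crawl command on a fixed delay.
class CrawlScheduler {

    /** Runs the command once and returns its exit code, or -1 if it could not run. */
    static int runOnce(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // forward the command's output to this console
                    .start();
            return process.waitFor();
        } catch (IOException e) {
            System.err.println("Could not start command: " + e.getMessage());
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return -1;
        }
    }

    /** Runs the command now, then again delayMinutes after each run finishes. */
    static ScheduledExecutorService schedule(List<String> command, long delayMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
                () -> System.out.println("Crawl exited with code " + runOnce(command)),
                0, delayMinutes, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: java CrawlScheduler <crawl command> [arguments...]");
            return;
        }
        // Assumption: args hold your own crawl command line, whatever it is
        // for your Nutch version and installation.
        schedule(List.of(args), 60);
    }
}
```

A fixed delay (rather than a fixed rate) is used so that a slow crawl never overlaps with the next one. A cron entry that calls the same command would do the same job without any Java code.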
You can write a scheduler that calls the SolrJ code that does the indexing or reindexing.
For writing the scheduler, please refer to the links below:
http://www.mkyong.com/java/how-to-run-a-task-periodically-in-java/
http://archive.oreilly.com/pub/a/java/archive/quartz.html
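As a minimal sketch of that idea, the class below checks a folder on a timer and hands every new or changed file to an indexing callback. Everything named here (PeriodicIndexer, the watched folder, the 30-second period) is an assumption for illustration. The callback is a placeholder so the example runs with the JDK alone; with SolrJ on the classpath, that is where you would build a SolrInputDocument and call add(...) and commit() on your SolrClient.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: periodically indexes new or modified files in one folder.
class PeriodicIndexer {
    private final Path watchedDir;
    private final Consumer<Path> indexer;
    // Last-modified time (millis) recorded when each file was last indexed.
    private final Map<Path, Long> indexedAt = new HashMap<>();

    PeriodicIndexer(Path watchedDir, Consumer<Path> indexer) {
        this.watchedDir = watchedDir;
        this.indexer = indexer;
    }

    /** One pass over the folder; returns how many files were handed to the indexer. */
    synchronized int runOnce() {
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(watchedDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                long modified = Files.getLastModifiedTime(file).toMillis();
                Long previous = indexedAt.get(file);
                if (previous == null || previous != modified) {
                    indexer.accept(file); // the SolrJ calls would go here
                    indexedAt.put(file, modified);
                    count++;
                }
            }
        } catch (IOException | RuntimeException e) {
            // Swallow the error: an uncaught exception would cancel all future runs.
            System.err.println("Indexing pass failed: " + e.getMessage());
        }
        return count;
    }

    /** Runs a pass immediately, then again periodSeconds after each pass ends. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::runOnce, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder indexer that only logs the file it would index.
        new PeriodicIndexer(dir, file -> System.out.println("Would index " + file)).start(30);
    }
}
```

Remembering the last-modified time per file means unchanged documents are skipped on later passes, so only newly fed or edited documents are sent for indexing. Quartz, as described in the second link, could replace the ScheduledExecutorService if you need cron-style schedules.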