I am trying to access 7.6 million records from a SQL Server 2008 table on a remote machine (Asia) from my Windows 7 Enterprise workstation (in the US) using Solr 4.6.0. I am trying to do a full-import using clean=false as per the docs. I am using the jTDS driver for SQL Server. I get the following error after about 4 hours:
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    ...
    at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:850)
I am assuming SQL Server is disconnecting the socket connection after that much time. It works when I try from the Solr Admin UI with start=0, rows=100000, i.e. for just 100K records.
My data config file for the DataImportHandler in Solr:
<dataSource name="dbDS"
            type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://xxx.corp.com/xxx;instance=xxx;useCursors=true;useLOBs=false;socketKeepAlive=true;socketTimeout=432000;"
            user="xx" password="xxx" />

<entity name="log"
        dataSource="dbDS"
        query="select * from XXX"
        transformer="TemplateTransformer, DateFormatTransformer"
        deltaImportQuery="select * from log where LOGID = '${dataimporter.delta.id}'"
        deltaQuery="select LOGID from log where TIME_STAMP > '${dataimporter.last_index_time}'"
        preImportDeleteQuery="type:log"
> .....
How can I get all 7.6 million records using the full-import?
Will the batchSize attribute on the dataSource for the JDBC connection be useful?
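If it helps, this is roughly how I would set it. As I understand it, batchSize on a JdbcDataSource is handed to the JDBC driver as the statement fetch size, so rows stream in chunks instead of being buffered; the value 10000 below is just a guess on my part, not a tested setting:

<dataSource name="dbDS"
            type="JdbcDataSource"
            driver="net.sourceforge.jtds.jdbc.Driver"
            url="jdbc:jtds:sqlserver://xxx.corp.com/xxx;instance=xxx;useCursors=true;useLOBs=false;socketKeepAlive=true;socketTimeout=432000;"
            batchSize="10000"
            user="xx" password="xxx" />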
I am at the point of re-inventing the wheel and writing my own code to pull the data from the database in batches of a defined size (100K), storing the last imported timestamp in a properties file as Solr's delta-import does, and adding the documents to Solr so that all the analyzer modifications from schema.xml are applied to the result set. I am thinking of catching this exception and retrying the batch being pulled from the DB. Is there any easier way to do this? I also looked at this: solr, solrj: I/O exception (java.net.SocketException) caught when processing request: Connection reset
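For what it is worth, here is a rough sketch of what I have in mind, using plain JDBC plus SolrJ. The Solr URL, batch size, and retry handling are placeholders, and it assumes LOGID is an increasing numeric key I can use for keyset paging:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchLogImporter {

    private static final int BATCH_SIZE = 100000;

    public static void main(String[] args) throws Exception {
        Class.forName("net.sourceforge.jtds.jdbc.Driver");
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        long lastId = 0L; // TODO: load from a properties file, as dataimport.properties does

        while (true) {
            long batchStart = lastId; // remember where this batch began, for retries
            try (Connection con = DriverManager.getConnection(
                     "jdbc:jtds:sqlserver://xxx.corp.com/xxx;instance=xxx;socketKeepAlive=true",
                     "xx", "xxx");
                 PreparedStatement ps = con.prepareStatement(
                     "select top " + BATCH_SIZE + " * from log where LOGID > ? order by LOGID")) {

                ps.setLong(1, lastId);
                ps.setFetchSize(1000); // stream rows rather than buffering the whole batch

                List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                try (ResultSet rs = ps.executeQuery()) {
                    ResultSetMetaData md = rs.getMetaData();
                    while (rs.next()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        for (int i = 1; i <= md.getColumnCount(); i++) {
                            doc.addField(md.getColumnName(i), rs.getObject(i));
                        }
                        docs.add(doc);
                        lastId = rs.getLong("LOGID");
                    }
                }
                if (docs.isEmpty()) {
                    break; // no rows left: all records imported
                }
                solr.add(docs);   // analyzers from schema.xml are applied at index time
                solr.commit();
                // TODO: persist lastId here so a crash resumes from the last good batch
            } catch (Exception e) {
                // connection reset or Solr error: roll back to the start of this batch
                // and retry (a real version should cap the retry count)
                lastId = batchStart;
                System.err.println("Batch after LOGID " + batchStart + " failed, retrying: " + e);
            }
        }
        solr.shutdown();
    }
}

The idea behind paging on LOGID instead of holding one long-lived cursor is that each batch gets a fresh connection, so a reset only costs me one batch rather than the whole 4-hour import.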
I do have a conf/dataimport.properties, as mentioned in some other posts.
TIA,
Vijay
Vijay, I think this isn't a "Solr" problem per se. This looks like the challenge of using socket-based connectivity for a long period of time over a very long network link. In my experience, the longer you keep a JDBC socket open over a very long network connection, the harder things get! That's why we end up using simpler methods like messaging or request/response when bridging long distances, versus very fast but more brittle methods like sockets and JDBC.
My rule of thumb with DIH is that when it works, it's great. But when it starts getting in your way, due to issues like the one you have, you are better off rolling your own.
If you want to continue to use DIH, can you transfer the data to a local database first, and then ingest? I bet you'll see that work much better. I also wonder a bit about how long ingesting 7 million records over such a long distance may take, especially if they are large meaty log messages.
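If you go the local-copy route, SQL Server's bcp utility is one way to do the bulk transfer. A sketch, with the server names, file name, and credentials as placeholders:

rem Export on (or near) the remote server; -c = character mode
bcp "select * from XXX" queryout log.dat -c -S xxx.corp.com\xxx -U xx -P xxx

rem After copying log.dat to the US side, load it into a local SQL Server
bcp LocalDb.dbo.log in log.dat -c -S localhost -U xx -P xxx

Then point the DIH url at the local instance, and the long-haul socket problem disappears from the indexing step entirely.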