I'm trying to add some data from an SQL query to my Elasticsearch, and it's currently running as a Logstash jdbc input like this:
jdbc {
    type => "database_log"
    jdbc_connection_string => "jdbc:sqlserver://(redacted)"
    jdbc_user => (redacted)
    jdbc_password => (redacted)
    jdbc_driver_library => "D:\ELK_56\sqljdbc4.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    schedule => "0 * * * *"
    statement => "select user_id,First_Name from user_login"
}
Now the problem is that each hour, when the input runs, it re-inserts the entire result of the query, some 124k rows. My intent is to add only the new rows (of which there are about 3-4 per hour) that have appeared since the last time I queried, and to avoid duplicates.
I can't just solve it by adding a WHERE clause to the query, since this particular table has no date column or any other incremental identifier.
Is there any way to have Logstash or Elasticsearch itself subtract the previous result from the latest one and add just the new rows?
Thank you for your reply.
The only way I know of to have Elasticsearch filter out the duplicates is to use the
user_id
column from your database as the _id field in the index. Elasticsearch will still have to process and update all 124k records each time, but that should eliminate the duplicate entries in your index. You should be able to set this in your Logstash config as outlined in this previous question: Change ID in elasticsearch
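As a rough sketch, the elasticsearch output section of your Logstash config might look something like this (the hosts value and index name here are placeholders you would replace with your own; document_id is the real option that maps a field to _id):

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "database_log"
        # Use the user_id field from the jdbc input as the document _id,
        # so re-running the query overwrites existing docs instead of
        # creating duplicates.
        document_id => "%{user_id}"
    }
}

With this in place, each hourly run re-indexes the same 124k documents under the same IDs, so only the 3-4 genuinely new rows appear as new documents.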