Let's say I have 50 documents I want to ingest into an index. After ingesting them, I can retrieve all 50 documents by querying Elasticsearch.
At a later time, perhaps through an automated process, the same 50 documents get ingested again. Now when I query, I see pairs of documents that are identical except for their _id values. Within each pair, the 20-character _id values differ in characters 0-1 and 16-19, but characters 2-15 are exactly the same. I assume these _id values are auto-generated, maybe with the first two characters being some sort of sequence number?
But how would I go about having each document map to the same _id every time it is ingested?
I expect each unique document to map to the same _id value so that my index does not fill up with multiple copies of the exact same information.
You can use the fingerprint ingest processor to calculate an _id field from the fields in your document. You'll need to decide which fields to use, that is, what makes a document "unique": is it all fields, or are there specific identifier fields you want to use?
Then you define an ingest pipeline, such as:
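Here is a minimal sketch of such a pipeline, assuming the documents are uniquely identified by three hypothetical fields, first_name, last_name, and date_of_birth (the pipeline name dedup-pipeline is also just an example). The fingerprint processor hashes the listed fields, and writing the hash to the _id metadata field makes Elasticsearch use it as the document ID:

```console
PUT _ingest/pipeline/dedup-pipeline
{
  "description": "Derive a stable _id from a hash of the identifying fields",
  "processors": [
    {
      "fingerprint": {
        "fields": ["first_name", "last_name", "date_of_birth"],
        "target_field": "_id"
      }
    }
  ]
}
```

If you want every field to count toward uniqueness, list all of them in "fields". The processor also accepts a "method" parameter (e.g. "SHA-256") if you want a different hash than the default.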
When you ingest documents, you specify that you want to use this pipeline:
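For example, with the hypothetical index name my-index and the pipeline sketched above:

```console
POST my-index/_doc?pipeline=dedup-pipeline
{
  "first_name": "Jane",
  "last_name": "Doe",
  "date_of_birth": "1990-01-01"
}
```

Because the same field values always hash to the same _id, re-ingesting a document overwrites the existing copy (its _version increments) instead of creating a duplicate. If you don't want to pass the pipeline parameter on every request, you can set index.default_pipeline in the index settings so the pipeline is applied automatically.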