Structuring Elasticsearch data correctly and efficiently for a search engine


I'm building a search engine for my audio store.

I use a single index for the audio documents, with this structure:

{
  id: { type: 'integer' },
  title: { type: 'search_as_you_type' },
  description: { type: 'text' },
  createdAt: { type: 'date' },
  updatedAt: { type: 'date' },
  datePublished: { type: 'date' },
  duration: { type: 'float' },
  categories: {
    type: 'nested',
    properties: {
      id: { type: 'integer' },
      name: { type: 'text' }
    },
  }
}

It's simple to run a text search over the audio documents and order the results by publication date. I want something more powerful: a text search ordered by trending, based on each audio's listen times and purchase histories within a specific range, e.g. trending audios for the last 3 months or the last 30 days. So I extended the structure as follows:

{
  ...previousProperties,
  listenTimes: {
    type: 'nested',
    properties: {
      timestamp: { type: 'date' },
      progress: { type: 'float' }, // value 0-1.
    },
  },
  purchaseHistories: {
    type: 'nested',
    properties: {
      timestamp: { type: 'date' }
    },
  },
}

Here is my query to get trending audios for the last 3 months, and it works:

{
  bool: {
    should: [
      {
        nested: {
          path: 'listenTimes',
          query: {
            function_score: {
              query: {
                range: {
                  'listenTimes.timestamp': {
                    gte: $range, // placeholder for a date-math string such as 'now+1d-3M/d'
                  },
                },
              },
              functions: [
                {
                  field_value_factor: {
                    field: 'listenTimes.progress',
                    missing: 0,
                  },
                },
              ],
              boost_mode: 'replace',
            },
          },
          score_mode: 'sum',
        },
      },
      {
        nested: {
          path: 'purchaseHistories',
          query: {
            function_score: {
              query: {
                range: {
                  'purchaseHistories.timestamp': {
                    gte: 'now+1d-3M/d',
                  },
                },
              },
              boost: 1.5,
            },
          },
          score_mode: 'sum',
        },
      },
    ],
  },
}

I have a few doubts about my approach:

  • The number of listen-time and purchase-history records per audio is quite large. Is it efficient to structure the data like this? I have only tested with sample data, and it seems to work fine.
  • Will Elasticsearch reindex the whole document every time I push new listen-time or purchase-history records into an audio document?

I'm very new to Elasticsearch, so could someone please give me some advice on this case? Thank you so much!

There is 1 solution below


The first question is a good one. It depends on how you implement it. You will have to watch out for atomicity, since I'm guessing you plan to fetch the number of listen times and then save the incremented value. If you do this from one application in a single thread, and it manages to keep up with the load, then you're fine, but you won't be able to scale. I would say that Elasticsearch is not really made for this kind of transaction. The first idea that comes to mind is to store the numbers in an SQL database and update Elasticsearch on a schedule. I suppose those results don't have to be updated in real time?
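
The scheduled sync suggested above could be sketched roughly as follows. This is only a sketch under assumptions, not a definitive implementation: the index name `audios`, the fields `listenScore90d` and `purchaseCount90d`, and the shape of the rows are all made up for illustration, and the rows are assumed to come from a periodic SQL aggregate over the chosen time window. The function only builds the newline-delimited body for the Bulk API; sending it is left to whichever client you use.

```javascript
// Hypothetical rows, e.g. the result of a periodic SQL aggregate over the
// last 90 days of listen and purchase events (the shape is an assumption).
const rows = [
  { audioId: 1, listenScore: 42.5, purchaseCount: 3 },
  { audioId: 2, listenScore: 7.25, purchaseCount: 0 },
];

// Build an NDJSON body for the Bulk API: one "update" action line followed
// by one partial-document line per audio.
function buildBulkBody(indexName, aggregatedRows) {
  const lines = [];
  for (const row of aggregatedRows) {
    lines.push(JSON.stringify({ update: { _index: indexName, _id: String(row.audioId) } }));
    lines.push(JSON.stringify({
      doc: {
        listenScore90d: row.listenScore,     // assumed field name
        purchaseCount90d: row.purchaseCount, // assumed field name
      },
    }));
  }
  // The Bulk API requires the body to end with a newline.
  return lines.join('\n') + '\n';
}

const bulkBody = buildBulkBody('audios', rows);
console.log(bulkBody);
```

With precomputed counters like these, ordering by trending becomes a sort or score on a plain numeric field instead of a sum over many nested entries. The trade-off is that each time window needs its own field and the numbers are only as fresh as the last sync.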

As for the second question, I'll just quote the Elasticsearch documentation: "The document must still be reindexed, but using update removes some network roundtrips and reduces chances of version conflicts between the GET and the index operation." In other words, yes: every update rewrites the entire document, including all of its nested entries. You can find more on this link.
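
For illustration, a scripted update that appends one listen record could look like the sketch below. It assumes the `listenTimes` field from the question; the index name `audios` in the comment and the helper name `buildAppendListenUpdate` are hypothetical. Even with this request, Elasticsearch still reindexes the whole document; the script only avoids the separate GET on the client side.

```javascript
// Build the request body for POST /audios/_update/<id>
// (the index name "audios" is an assumption).
function buildAppendListenUpdate(timestamp, progress) {
  return {
    script: {
      lang: 'painless',
      // Create the array on first use, then append the new entry.
      source:
        'if (ctx._source.listenTimes == null) { ctx._source.listenTimes = []; } ' +
        'ctx._source.listenTimes.add(params.entry);',
      params: { entry: { timestamp, progress } },
    },
  };
}

const updateBody = buildAppendListenUpdate('2024-01-15T10:00:00Z', 0.8);
console.log(JSON.stringify(updateBody, null, 2));
```

Because the document grows with every append, the cost of each such update grows with it, which is one more reason to consider keeping the raw events outside Elasticsearch.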