Elasticsearch schema for multiple versions of the same text


I'm working on a system that downloads articles from various news sites and runs several NLP analyses on the texts. I want to store multiple versions and aspects of each article, including:

  • The raw HTML
  • A cleaned-up text-only version
  • CoreNLP output of the article

Since I want to store the text-only version in Elasticsearch, I thought about storing everything else there as well. I have no Elasticsearch experience, so I can't tell which is the better way to store these:

  1. Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article: {html: '....', text: '....', CoreNLP: '....'}
  2. Store each kind of information under its own type: /articles/html/1, /articles/text/1, /articles/corenlp/1, etc.
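To make the two layouts concrete, here is a rough sketch in Python that only builds the documents as plain dicts, without talking to a cluster. The field values are placeholders, and the /articles/article/1 path for option 1 and the "content" field name for option 2 are assumptions made for illustration; neither comes from an existing schema.

```python
import json

# Placeholder content; real values would come from the crawler and CoreNLP.
article_id = 1
html = "<html><body><p>Example article.</p></body></html>"
text = "Example article."
corenlp = "...serialized CoreNLP output..."

# Option 1: one document per article, every version is a property of it.
# The "article" type name in the path is an assumption for this sketch.
option_1 = {
    f"/articles/article/{article_id}": {
        "html": html,
        "text": text,
        "CoreNLP": corenlp,
    }
}

# Option 2: one document per version, each under its own type,
# all sharing the same article ID. "content" is a hypothetical field name.
option_2 = {
    f"/articles/html/{article_id}": {"content": html},
    f"/articles/text/{article_id}": {"content": text},
    f"/articles/corenlp/{article_id}": {"content": corenlp},
}

if __name__ == "__main__":
    print(json.dumps(option_1, indent=2))
    print(json.dumps(option_2, indent=2))
```

With option 1 a single request returns every version of an article; with option 2 each version needs its own request, and the shared ID is the only link between them.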

Which one is more common? Is there a third, better option?


There is 1 best solution below


It depends on where you want to run CoreNLP, the HTML clean-up, and so on. If you want to do this inside Elasticsearch, I would use the multi-field type:

https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
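As a rough illustration of what that could look like, the sketch below builds index settings and a 0.90-style multi_field mapping as Python dicts. It assumes the raw HTML is the only thing sent for the article body and that a custom analyzer using the built-in html_strip character filter provides the text-only view for search. The analyzer name "html_text", the sub-field name "stripped", and the "article" type name are made up for this example. The CoreNLP output is assumed to be computed outside Elasticsearch and sent as an ordinary field, since a multi-field only indexes the same source value in different ways.

```python
import json

# Index settings: a custom analyzer that strips HTML markup before tokenizing.
# "html_text" is a hypothetical name chosen for this sketch.
settings = {
    "analysis": {
        "analyzer": {
            "html_text": {
                "type": "custom",
                "char_filter": ["html_strip"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        }
    }
}

# 0.90-style mapping: "html" is a multi_field, so the one submitted value
# is handled in two ways.
mapping = {
    "article": {
        "properties": {
            "html": {
                "type": "multi_field",
                "fields": {
                    # Default sub-field: raw HTML, not searchable.
                    "html": {"type": "string", "index": "no"},
                    # Extra sub-field: same value, analyzed with tags stripped.
                    "stripped": {"type": "string", "analyzer": "html_text"},
                },
            },
            # CoreNLP output supplied by the client, kept but not indexed.
            "CoreNLP": {"type": "string", "index": "no"},
        }
    }
}

if __name__ == "__main__":
    # This is the body one would send when creating the index.
    print(json.dumps({"settings": settings, "mappings": mapping}, indent=2))
```

Searches would then target html.stripped, while the original markup is still returned as part of the document source. Later Elasticsearch versions dropped the multi_field type in favour of a fields parameter on the main field, so the exact syntax depends on the version in use.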

If you do it outside Elasticsearch, which would be less common since this is a good task for Elasticsearch, you could use the multiple-fields approach: one document per article, with a separate field for each version.