CouchDB vs OpenSearch for filtering and querying a large dataset

78 Views Asked by At

I have to store and be able to filter and search through a lot of XML files and download wanted files (at the moment there are 800-900k of XML files, in the future this number could probably surpass 10x this number). The filtering will be done based on information in the XML files themselves but also based on some metadata which is not available directly in the XML files themselves. The filtering and download capabilities would be available through a web UI.

The XML files in question follow a standard which means that adding any additional data inside them is not desirable.

My idea was to extract all the "interesting" info from each XML file (a subset of all attributes) and store it with additional aforementioned metadata in a JSON format into some kind of a document-oriented database while the actual XML documents would be stored on a filesystem of some server. This way, when the user would apply various filters through a web UI, a query would be sent to the back-end via a REST API and data (a list of JSON documents) would be fetched based on it. When the user chooses to download a certain XML document, it would be fetched from a filesystem with the actual file location available inside the JSON document. An example JSON document would look something like:

{
    "metadata": {
        "meta_attr1": "someValue",
        "meta_attr2": "someOthervalue",
        ...
    },
    "data": {    # This data is extracted from a XML file upon the 
                 # addition of a new XML file into the DB
        "attr1": 1234,
        "attr2": "foo",
        "attr3": {
            "subattr1": 3.14,
            "subattr2": "bar"
        },
        "attr4": [1, 2, 3, 4]
        ...
    },
    # URI used to fetch actual XML file that is represented by this JSON
    "xml_uri": "https://some-server/project-files/example.xml"
}

The schema of a JSON documents would be pretty much standardized. However, the original XMLs follow a standard which has multiple versions and new ones will probably appear therefore potentially altering JSON schema.

There shouldn't be too much parallel requests. At least in the beginning the service would be used internally with approx. 100 users having access to it.

Now, our legal department has told me that MongoDB is not acceptable due to its SSPL license.

I'm currently trying to decide whether to use CouchDB or OpenSearch for storing of JSON documents that would be used for filtering. I've done some research into both CouchDB and OpenSearch. However, I've only used OpenSearch a bit before so I'm more aware of its capabilities.

They both seem to have a lot of similar characteristics, e.g.:

  • Both work with JSON documents
  • Both are designed to be easy for horizontal scaling and therefore there shouldn't be too much worry about data loss due to single node failure.
  • Both should be fast when doing reads over large amounts of data (for CouchDB views need to be created)

However I'm not sure about the downsides of either one of them in my use case.

So my actual questions are if anybody had experience with both CouchDB and OpenSearch and can maybe give some pros and cons of each of the two from their experience? What to watch out for when setting up or using any of these two? What do you suggest I should use in this case? Is there maybe a third option that would be more suitable for this kind of a use case?

0

There are 0 best solutions below