Centralized database for multiple consumers vs local datasets


I was looking for some general advice for the following situation. Here is some context:

An organization has an API that is used by multiple internal consuming applications to create, update, and view specific information that is an important input to the calculations these applications provide to end users. The data this API serves is entered by the users themselves and essentially represents facts about their organization (sets of POST, PUT, and GET endpoints). At the same time, it is important to end users that the information they enter is the SAME across applications, which is why the central database is a big benefit: it offers a single source of truth (and is why this API and its database have multiple consumers).

The benefit of this is that the customer's data is always accurate and does not need to be maintained separately in multiple systems by the end user. A major con is that a central database like this introduces a single point of failure, and performance degrades as more and more data gets stored in it.

The organization has recently decided that increasing the performance of these consuming applications, as well as removing the over-reliance on this centralized database and API, is something that should be pursued. One approach being discussed is to continue publishing data to the central source of truth, but instead of always retrieving it via GET (which is where the performance issues occur), each consuming application would keep a local database that is a copy of the central source and is continuously updated by change events that the API emits. To be specific, the retrieval of information (not the publishing of it) is where these issues come up.
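To make the proposal concrete, here is a minimal sketch of what such an event-fed local copy could look like. It assumes JSON change events carrying an entity id and a version number; the `facts` table, field names, and SQLite storage are purely illustrative, not our actual schema or stack.

```python
# Sketch of a consumer applying change events from the central API to a local
# read copy. All names (facts table, event fields) are illustrative assumptions.
import json
import sqlite3

conn = sqlite3.connect("local_copy.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS facts (
        entity_id TEXT PRIMARY KEY,
        payload   TEXT NOT NULL,
        version   INTEGER NOT NULL
    )
""")

def apply_change_event(event: dict) -> None:
    """Upsert a change event, ignoring anything older than what we already have."""
    row = conn.execute(
        "SELECT version FROM facts WHERE entity_id = ?", (event["entity_id"],)
    ).fetchone()
    if row is not None and row[0] >= event["version"]:
        return  # stale or duplicate event: safe to drop (idempotent consumer)
    conn.execute(
        "INSERT INTO facts (entity_id, payload, version) VALUES (?, ?, ?) "
        "ON CONFLICT(entity_id) DO UPDATE SET payload = excluded.payload, "
        "version = excluded.version",
        (event["entity_id"], json.dumps(event["data"]), event["version"]),
    )
    conn.commit()

# Example event as it might arrive from a broker or webhook:
apply_change_event({"entity_id": "org-42", "version": 7, "data": {"employees": 120}})
```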

To me, this seems like a very fragile solution, as we are relying heavily on events to keep the data up to date; if the datasets ever get out of sync, the whole value of a central source of truth is lost. Also, once each application maintains its own local database, there is no longer really a single source of truth, even though the copies are kept up to date by essentially subscribing to the API's change events.

My question is: are there any alternative solutions that are worth considering to increase performance of these consuming applications and remove the single point of failure (two separate problems, I guess) WHILE still using that central database? To be clear, the data centralization piece is still a major benefit that the org understands needs to be preserved... but I fear that with the approach above it will not be. Thank you for reading!

Just looking for advice on architecture design; I have thought about alternatives to the approach described above.

BEST ANSWER

To me, this seems like a very fragile solution, as we are relying heavily on events to keep the data up to date; if the datasets ever get out of sync, the whole value of a central source of truth is lost

I would argue that this solution is not that fragile; it is actually quite a common approach in microservice/async architectures, and there are ways to mitigate the effects of sync failures with appropriate message handling (idempotent consumers, ordering/version checks, retries, periodic reconciliation). The main trade-off is that you are switching from so-called strong consistency to eventual consistency, while strong consistency, based on the question, is still desired (though you need to make sure that it actually is; there are quite a lot of cases where those requirements can be relaxed).
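As an illustration of the kind of message handling I mean, here is a rough sketch of a periodic reconciliation job that detects drift between a local copy and the central source. The `/facts/checksums` endpoint and the response shape are made up for the example; the idea is simply to compare versions and re-fetch only what is stale, so lost events do not silently break the copy.

```python
# Hypothetical periodic reconciliation job: if events were lost, this catches
# drift between the local copy and the central source of truth. The endpoint
# and response shape are assumptions for illustration only.
import requests

CENTRAL_API = "https://central.example.internal"

def reconcile(local_versions: dict[str, int]) -> list[str]:
    """Return entity ids whose local version lags the central one."""
    central = requests.get(f"{CENTRAL_API}/facts/checksums", timeout=10).json()
    return [
        entity_id
        for entity_id, version in central.items()
        if local_versions.get(entity_id, -1) < version
    ]

# e.g. stale = reconcile({"org-42": 7, "org-43": 3})
# then re-fetch only the entities in `stale` via GET, instead of everything.
```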

are there any alternative solutions that are worth considering to increase performance of these consuming applications and remove the single point of failure

There are alternative approaches. If you have amounts of data and/or load that are not manageable by a single database instance/server (note that sometimes adding synchronous read replica(s) can be enough, though I assume you have already done that), you can consider database sharding (in simple terms, you split your data into several logical groups and have each database instance handle one or several of them) or switching to one of the distributed databases, which are basically trying to solve the same problem you have. Note that different databases give you different guarantees regarding atomicity (transactions) and fault tolerance, and you will need to dive deeper to understand whether they satisfy your needs.
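For a sense of what sharding can look like from the application side, here is a toy sketch of hash-based shard routing; the connection strings and the choice of hash are made up for illustration, and real deployments usually lean on the database's own sharding/partitioning features rather than hand-rolled routing.

```python
# Toy illustration of hash-based sharding: requests for a given customer are
# routed deterministically to one of several database instances.
import hashlib

SHARDS = [
    "postgres://db-shard-0.internal/facts",
    "postgres://db-shard-1.internal/facts",
    "postgres://db-shard-2.internal/facts",
]

def shard_for(customer_id: str) -> str:
    """Pick a shard deterministically from the customer id."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("org-42"))  # every caller computes the same shard for org-42
```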

Read more:

  • CAP theorem

    states that any distributed data store can provide only two of the following three guarantees:

    • Consistency: Every read receives the most recent write or an error.
    • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
    • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

    When a network partition failure happens, it must be decided whether to do one of the following:

    • cancel the operation and thus decrease the availability but ensure consistency
    • proceed with the operation and thus provide availability but risk inconsistency.

    Thus, if there is a network partition, one has to choose between consistency or availability. Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (youtube summary)