Reporting denormalized vs normalized database


We are building a reporting application into which we will be dumping large volumes of data on a daily basis. Currently we have two data models: one operational and one for reporting.

A few columns in the reporting data model are also present in the operational tables. If the operational data model gets updated, how can we ensure that the reporting data model makes the same changes, and how much will those updates cost?

Example:

Report Table -
    user_name
    organisation_name
    event_name
    ... 10 more columns

User Table -
    id
    name
    ...

Organisation Table -
    id
    name

Let's consider a Report table containing 1 million records. An organisation's name gets changed, and that change needs to be reflected in the Report table. Assume such details change with average frequency.

So we have two choices:

  • Normalized database - saves updates on the Report table, but query processing takes longer because of the joins.
  • Denormalized database - gives faster query processing, but is more complex to keep up to date.
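For illustration, here is a rough SQL sketch of the trade-off; the user_id and organisation_id columns are hypothetical and would exist only in the normalized variant:

    -- Denormalized: reporting reads one wide table with no joins,
    -- but a rename must rewrite every matching row.
    UPDATE report
    SET    organisation_name = 'NewCo'
    WHERE  organisation_name = 'OldCo';  -- can touch many of the 1M rows

    -- Normalized: the rename is a single-row update...
    UPDATE organisation
    SET    name = 'NewCo'
    WHERE  id = 42;

    -- ...but every report query now pays for joins.
    SELECT u.name AS user_name,
           o.name AS organisation_name,
           r.event_name
    FROM   report r
    JOIN   "user" u       ON u.id = r.user_id
    JOIN   organisation o ON o.id = r.organisation_id;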

Please advise which way we should go. Our data volume is very high, and the reporting data is highly granular.


There are 4 answers below.

Answer 1

The problem with the word "denormalization" is that it doesn't specify which design principles you are going to use. A better plan is to adopt a specific design, one that doesn't lead to a fully normalized database but that has something going for it.

A fairly widely used design is the star schema, or its near variant, the snowflake schema. These two schemas have been used for reporting databases, as well as for data marts and data warehouses.

In every case where I've dealt with a star schema, the data was being copied from one or more other databases, and those source databases were normalized. The source databases were used for OLTP (Online Transaction Processing), while the star schema was used for OLAP (Online Analytical Processing) purposes, including reporting.
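To make that concrete, a minimal star schema for this question's data might look like the sketch below; all table and column names here are assumptions, not taken from an actual system:

    CREATE TABLE dim_user (
        user_key  INT PRIMARY KEY,  -- surrogate key owned by the warehouse
        user_id   INT NOT NULL,     -- id from the operational database
        user_name VARCHAR(100)
    );

    CREATE TABLE dim_organisation (
        org_key  INT PRIMARY KEY,
        org_id   INT NOT NULL,
        org_name VARCHAR(100)
    );

    CREATE TABLE fact_event (
        event_key  INT PRIMARY KEY,
        user_key   INT REFERENCES dim_user (user_key),
        org_key    INT REFERENCES dim_organisation (org_key),
        event_name VARCHAR(100),
        event_date DATE
        -- ... the remaining report columns
    );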

The process of transferring data from an OLTP store to an OLAP store on a periodic basis, such as once a day, is called ETL (Extract, Transform, and Load). It's an art in and of itself, and there are tools you can buy to facilitate ETL. There are also techniques you can learn if you want to build your own ETL process.
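If you roll your own, the core of a load step can be as simple as an INSERT ... SELECT. This sketch assumes the hypothetical star schema above plus an operational_event source table; date arithmetic syntax varies by database:

    -- Nightly load: copy yesterday's operational rows into the fact table,
    -- translating operational ids into warehouse dimension keys.
    INSERT INTO fact_event (event_key, user_key, org_key, event_name, event_date)
    SELECT e.id,
           du.user_key,
           dorg.org_key,
           e.name,
           e.created_on
    FROM   operational_event e
    JOIN   dim_user         du   ON du.user_id = e.user_id
    JOIN   dim_organisation dorg ON dorg.org_id = e.organisation_id
    WHERE  e.created_on >= CURRENT_DATE - INTERVAL '1' DAY;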

The pattern of having two databases, one for OLTP and the other for OLAP, lets you get the benefits of two different design approaches in two different contexts. Maintaining two different databases costs nearly twice as much as maintaining one, and you have to manage the transfer process as well.

All this doesn't give you a definitive answer to your question, but it does give you a few buzzwords to use in searching for relevant items on the web.

Answer 2

The only real option is a normalized database. Why?

Pros:

  • Easier maintenance.
  • Since you will have a lot of data, a normalized database will also need less disk space. Instead of repeating, say, a 10-character username on every row, each row stores a small fixed-size integer id (typically 4 bytes). Over millions of rows, that is a huge difference (see the sketch below).

The "slower" querying is not actually really slower. The only con, which is just a piece of cake is, that you will have to code a bit more. But you code once, the database stays forever.

Answer 3

First Question:

Let's consider a Report table containing 1 million records. An organisation's name gets changed, and that change needs to be reflected in the Report table...

...how much will those updates cost?

It should be low, because we don't need to update the "report table" at all. You only need to update the dimension table, and here is why:

Consider not making a "report table" that gets read directly by the reporting tool. Consider instead using a star schema (one of the "denormalized" options). The report is then generated by joining the fact table to the dimension tables at runtime.

Please refer to this entity relationship diagram (ERD) of a sales star schema: https://upload.wikimedia.org/wikipedia/en/f/fe/Star-schema-example.png

Let's imagine that the example ERD in the Wiki article is a data warehouse for a company that owns different store brands, so their names would be different from each other.

So let's add a "store_name" column to the DIM_STORE table, and only to the DIM_STORE table. The FACT_SALES table remains as is.

When a store changes its name we update the DIM_STORE.store_name column.

DIM_STORE and FACT_SALES join on store_id, allowing us to get the current store name from the DIM.
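In SQL terms, the update and the reporting join might look like this (a sketch loosely following the Wiki ERD; the exact column names are illustrative):

    -- The rename touches one dimension row, not the million-row fact table.
    UPDATE dim_store
    SET    store_name = 'New Brand Name'
    WHERE  store_id   = 7;

    -- Reports join fact to dimension at runtime, so they always pick up
    -- the current name.
    SELECT d.store_name,
           SUM(f.units_sold) AS units_sold
    FROM   fact_sales f
    JOIN   dim_store d ON d.store_id = f.store_id
    GROUP BY d.store_name;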

Stores rarely change their names, but when it does happen the report users often want this change to be recorded. This type of dimensional update is called a slowly changing dimension (SCD).

This Wiki article explains SCDs: https://en.wikipedia.org/wiki/Slowly_changing_dimension

FYI, SCD Types 1 and 2 are commonly used. I prefer Type 2 because it keeps history, but choose whichever best fits your reporting requirements.
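As an example, a Type 2 change to the store dimension can be sketched as two statements; the valid_from/valid_to/is_current columns and the literal surrogate key value are assumptions about how the dimension might be built:

    -- Close the current version of the row...
    UPDATE dim_store
    SET    valid_to   = CURRENT_DATE,
           is_current = 'N'
    WHERE  store_id   = 7
    AND    is_current = 'Y';

    -- ...and insert a new version with a fresh surrogate key, so facts
    -- loaded before the rename keep pointing at the old name.
    INSERT INTO dim_store
           (store_key, store_id, store_name, valid_from, valid_to, is_current)
    VALUES (1008, 7, 'New Brand Name', CURRENT_DATE, NULL, 'Y');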

The ERD comes from this Wiki article on star schemas: https://en.wikipedia.org/wiki/Star_schema

Second Question:

If the operational data model gets updated, how can we ensure that the reporting data model makes the same changes...

If table structures are changed in the source system, then you will have to manually update your load process and your data warehouse tables to match. In some cases this may involve re-loading all of the data.

Surrogate Keys: closely related to your question, surrogate keys are necessary for maintaining SCDs: http://www.bidw.org/datawarehousing/what-is-surrogate-key/

In the Wiki article on SCDs, the supplier_key is the surrogate key generated by the data warehouse or the ETL process, and the supplier_code is analogous to your organisation id (or the store_id in the Wiki article about star schemas), which comes from the transactional source database.
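Side by side, the two kinds of key look like this (a sketch using the Wiki SCD example's column names):

    CREATE TABLE dim_supplier (
        supplier_key   INT PRIMARY KEY, -- surrogate key, generated by the warehouse
        supplier_code  VARCHAR(10),     -- natural/business key from the source system
        supplier_name  VARCHAR(100),
        supplier_state CHAR(2)
    );
    -- Under SCD Type 2 the same supplier_code can appear in several rows,
    -- each row version getting its own supplier_key; fact tables reference
    -- supplier_key, never supplier_code.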

I think these concepts require some research and re-reading to digest, so I hope you don't hurry things. Done right, they require a lot of time for planning and design up front but will save a lot of development time later.

Answer 4

I think the first thing you need to find out is how frequently the organisation name is likely to change. If the answer is "not very often", then I would keep the reporting table as it is (denormalized).