Is a star schema still necessary for a big data warehouse?


I am designing a new Hadoop-based data warehouse using Hive, and I was wondering whether the classic star/snowflake schemas are still a "standard" in this context.

Big Data systems embrace redundancy, so fully normalized schemas usually perform poorly (for example, in NoSQL databases like HBase or Cassandra).

Is it still a best practice to build star-schema data warehouses with Hive?

Or is it better to design wide, redundant (denormalized) tables that exploit the new columnar file formats?
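To make the two options concrete, here is a minimal sketch in Hive DDL; the sales/customer tables and column names are purely illustrative, not from any real schema:

```sql
-- Classic star schema: a narrow fact table joined to dimension tables.
CREATE TABLE dim_customer (
  customer_key BIGINT,
  customer_name STRING,
  country       STRING
)
STORED AS ORC;

CREATE TABLE fact_sales (
  customer_key BIGINT,
  product_key  BIGINT,
  sale_date    DATE,
  amount       DECIMAL(18,2)
)
STORED AS ORC;

-- Denormalized alternative: one wide table with the dimension attributes
-- repeated on every row. A columnar format (ORC/Parquet) keeps the
-- redundancy cheap, because repeated values compress well and only the
-- queried columns are read.
CREATE TABLE sales_wide (
  customer_name STRING,
  country       STRING,
  product_name  STRING,
  sale_date     DATE,
  amount        DECIMAL(18,2)
)
STORED AS ORC;
```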

2 Answers

Answer 1:

When designing for NoSQL databases, you tend to optimize for a specific query by precomputing parts of it, and thus you store a denormalized copy of the data (albeit denormalized in a query-specific way).

The star schema, on the other hand, is an all-purpose denormalization that's usually appropriate.

When you're planning on using Hive, you're really not using it for that kind of query-specific optimization but for the general-purpose nature of SQL, and as such I'd imagine the star schema is still appropriate. For a NoSQL database with a non-SQL interface, however, I'd suggest a more query-specific design.
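To illustrate the contrast, here is a hedged sketch in Hive, reusing the hypothetical fact_sales and dim_customer tables from the question's example: a query-specific, pre-aggregated copy (the NoSQL-style approach) versus an ad-hoc query over the general-purpose star schema.

```sql
-- Query-specific denormalization: precompute exactly one question.
CREATE TABLE sales_by_country_daily
STORED AS ORC AS
SELECT c.country, f.sale_date, SUM(f.amount) AS total_amount
FROM   fact_sales f
JOIN   dim_customer c ON f.customer_key = c.customer_key
GROUP BY c.country, f.sale_date;

-- General-purpose star schema: any question can still be asked ad hoc,
-- at the cost of joining at query time.
SELECT c.country, SUM(f.amount) AS total_amount
FROM   fact_sales f
JOIN   dim_customer c ON f.customer_key = c.customer_key
WHERE  f.sale_date BETWEEN '2015-01-01' AND '2015-03-31'
GROUP BY c.country;
```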

Answer 2:

Joins are evil, in particular on Hadoop, where we can't guarantee data co-locality, especially when we need to join two large tables. This is one of the differences between Hadoop and a traditional MPP such as Teradata or Greenplum. In an MPP, I evenly distribute my data across all nodes in the cluster based on a hashed key, so the relevant rows of the order and order_item tables end up on the same nodes, which at least eliminates data transfer across the network. In Hadoop, you would instead nest the order_item data inside the order table, which eliminates the need for the join.
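As a rough illustration, Hive's complex types can express that nesting directly; the orders_nested table below is hypothetical:

```sql
-- Nested design: order_item rows are embedded in the order table as an
-- array of structs, so reading an order requires no join.
CREATE TABLE orders_nested (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  items       ARRAY<STRUCT<product_id:BIGINT, quantity:INT, price:DECIMAL(18,2)>>
)
STORED AS ORC;

-- Items can still be analysed individually by exploding the array.
SELECT o.order_id,
       item.product_id,
       item.quantity * item.price AS line_total
FROM   orders_nested o
LATERAL VIEW explode(o.items) t AS item;
```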

If, on the other hand, you have a small lookup/dimension table and a large fact table, you can broadcast the small table to all nodes in your cluster, thereby eliminating the need for network transfer.
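In Hive this corresponds to a map-side (broadcast) join. A minimal sketch, assuming a small hypothetical dim_date table alongside the large fact_sales table from earlier:

```sql
-- Hive converts a common join into a map-side (broadcast) join when one
-- side is below the size threshold: the small table is loaded into memory
-- on every node, so the large fact table is never shuffled across the network.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- ~25 MB threshold

-- dim_date is a hypothetical small dimension table; fact_sales is large.
SELECT d.calendar_year,
       SUM(f.amount) AS total_amount
FROM   fact_sales f
JOIN   dim_date   d ON f.sale_date = d.date_key
GROUP BY d.calendar_year;
```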

In summary, star schemas are still relevant, but mostly from a logical modelling point of view. Physically, you may be better off denormalizing even further into one big, compressed, columnar, nested fact table.
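A hedged sketch of what such a physical table might look like in Hive, combining nesting, partitioning and ORC compression (all names are illustrative):

```sql
-- One big, compressed, columnar, nested fact table, partitioned by date.
CREATE TABLE fact_orders_big (
  customer_name STRING,
  country       STRING,
  items         ARRAY<STRUCT<product_name:STRING, quantity:INT, price:DECIMAL(18,2)>>,
  order_total   DECIMAL(18,2)
)
PARTITIONED BY (order_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```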

I have written up a full blog post that discusses the purpose and usefulness of dimensional models on Hadoop and Big Data technologies.