How to know if it is a good choice to make a field with only a few distinct possible values as a dimension table?

79 Views Asked by At

I am trying to create a fact table and dimension tables from a taxi trips records where I downloaded from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. This is the data dictionary describes the meaning of the fields.

I am not quite sure which columns should store the fact table and which should have dimension tables.

For example, the dataset I am exploring is yellow_tripdata_2023-01.parquet. There are 3,066,766 records. For column mta_tax (description: $0.50 MTA tax that is automatically triggered based on the metered rate in use.), there are only 10 different values enter image description here , which are saved in float.

So, I think it would be a good idea to have a dimension table to save the field and use foreign key to represent the value is fact table, which can reduce the disk space. The count for each value: enter image description here

However, from the description of mta_tax, it should be saved in fact table.

What should be a better way to do it?

2

There are 2 best solutions below

0
David Browne - Microsoft On

What questions does the business need to answer that involves the mta_tax? If not, omit the column. Otherwise would it be more convenient to model it as a dimension or a fact table attribute, or perhaps use it in the calculation of some other fact table attribute?

Dimensional modeling is unlike normalization in that it's not an abstract and mathematical exercise. You need to know more than the functional dependencies to create a useful dimensional model. You need to know what information you are trying to extract from the data, and the easiest way to think about that is in terms of the questions that users are trying to answer.

0
NickW On

If this is a value that will be used to aggregate or filter measures then it is a dimensional attribute and should be in a dimension. However, if there is no existing dimension that it is logically part of (and no other attributes it can be grouped with) then it makes little sense to create a dimension table with just one attribute - therefore it can sit in the fact table as a degenerate dimension.

If this is a measure that can be aggregated in queries then it should be in a fact table. If it is only used as part of a calculation for other measures then it should be used in the ETL logic to load those measures and doesn't need to be in the fact table - unless those calculations can only occur at query time (e.g. averages)

An attribute can be both a measure and a dimensional attribute - and so sit in both a fact table and a dimension table