Data Mart star schema development solution

603 Views Asked by At

i have to translate a DB into a DM, but i have some doubt about, this is the DB schema:

https://i.stack.imgur.com/PHha1.png

This is a simple DB to store authors, books and various othe things (foreign keys of author table are wrongs and the table "book" as another field called year). I should built a DM to analyse how authors works trought years (coauthors and books). I would like to add even a way to see citations of an author... The DM i'm building is something like this:

https://i.stack.imgur.com/MPCTL.png

Now my doubt is: how could i add citations in this datamart?

PS for citation i mean a book that cites an author and i'm working with kettle and penthao

1

There are 1 best solutions below

2
On

Citations and book authorship have different granularities. As such, they should be in different fact tables.

How I would do it:

  1. Citations fact table: grain is 1 citation of 1 person in 1 book. external keys are for the time dimension, cited author dimension, author dimension, book dimension and whatever else you may need. This data mart gives you directly counts of citations of person X, broken down by time, book author, etc.

  2. Authorship fact table: one may think that the grain is 1 book, but in fact it's not. The grain is 1 author of 1 book. That's the most atomic level of data. To get a book count you can either define that 1 book co-authored by 1 person counts as 1 book, counts as 0.5, as 1/Number of co-authors or any other useful metric. If you also want to count books, you should use the 1/N metric, together with any other you find useful.

  3. Co-authorship relationships: trying to determine the authors that publish the most together: this is trickier. Here the fact granularity is also authorship, but with 1 entry for each pair of co-authors available. So, if a book is written by Albert, Bill and Charles, you'd get 1 entry with author Albert and co-author Bill, one for Albert as author and Charles as co-author, etc (all 6 combinations). This allows you to get a full list of authors and their co-authors and count how many times they appear combine, but everything will show up as double counted: Albert+Bill and Bill+Albert shows up twice. The best way to filter out the duplicates would be to either define "authors in alphabetical order, where Albert+Bill, Albert+Charles and Bill+Charles are stored but not the others, or on the client side, removing duplicates as a query post-processing.

To combine multiple metrics arriving from multiple data marts, you should add a post-processing layer to your visualisation tool, to cross reference all these results.

Finally, one comment: this problem doesn't seem to be best treated with a data mart. Book metadata doesn't have a fixed data schema and a schemaless structure may be best to do all those searchs (look into Elastic Search and Mongo DB, they are perhaps better suited for this specific problem.