Storing wide-form dataframes in datajoint table

81 Views Asked by At

Say I have some analysis that spits out a wide-form pandas dataframe with a multiindex on the index and columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern to use to store the outputs in a datajoint table? The following come to mind, each with pros and cons

  1. Reshape to long-form and store single entries with index x column levels as primary keys
  • Pros: Preserves the ability to query/constrain based on both index and columns
  • Cons: Each analysis would insert millions of rows to the table, and I may have to do hundreds of such analyses. Even adding this many rows seems to take several minutes per dataframe, and queries become slow
  1. Keep as wide-form and store single rows as longblob with just index levels as primary keys
  • Pros: Retain ability to query based on index levels, results in tables with a more reasonable number of rows
  • Cons: Loses the ability to query based on column levels, the columns would then also have to be stored somewhere to be able to reconstruct the original dataframes. Since dataframes with different numbers of columns need to be stored in the same table, it is not feasible to explicitly encode all the columns in the table definition
  1. Store the dataframe itself as e.g. an h5 and store it in the database simply as a filepath or as an attachment
  • Pros: Does not result in large databases, simple to implement
  • Cons: Does not really feel in the "spirit" of datajoint, lose the ability to perform constraints or queries

Are there any designs or pros/cons I haven't thought of?

1

There are 1 best solutions below

3
Dimitri Yatsenko On

Before providing a more specific answer, let's establish a few basics (also known as normal forms).

DataJoint implements the relational data model. Under the relational model, complex dataframes of the type you described require normalization into multiple related tables related to each other through their primary keys and foreign keys.

Each table will represent a single entity class: Units and Trials will be represented in separate tables.

All entities in a given table will have the same attributes (columns). They will be uniquely identified by the same attribute(s) comprising the primary key.

In addition to the primary key, tables may have additional secondary indexes to accelerate queries.

If you already knew about normalization, we can talk how about to normalize your design. If not, we can refer you to a quick tutorial.