Storing wide-form dataframes in datajoint table

Question

Storing wide-form dataframes in datajoint table

81 Views Asked by zagaroo At 25 July 2022 at 16:49

Say I have some analysis that spits out a wide-form pandas dataframe with a multiindex on the index and columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern to use to store the outputs in a datajoint table? The following come to mind, each with pros and cons

Reshape to long-form and store single entries with index x column levels as primary keys

Pros: Preserves the ability to query/constrain based on both index and columns
Cons: Each analysis would insert millions of rows to the table, and I may have to do hundreds of such analyses. Even adding this many rows seems to take several minutes per dataframe, and queries become slow

Keep as wide-form and store single rows as longblob with just index levels as primary keys

Pros: Retain ability to query based on index levels, results in tables with a more reasonable number of rows
Cons: Loses the ability to query based on column levels, the columns would then also have to be stored somewhere to be able to reconstruct the original dataframes. Since dataframes with different numbers of columns need to be stored in the same table, it is not feasible to explicitly encode all the columns in the table definition

Store the dataframe itself as e.g. an h5 and store it in the database simply as a filepath or as an attachment

Pros: Does not result in large databases, simple to implement
Cons: Does not really feel in the "spirit" of datajoint, lose the ability to perform constraints or queries

Are there any designs or pros/cons I haven't thought of?

Original Q&A

There are 1 best solutions below

**Dimitri Yatsenko** · Answer 1 · 2022-07-25T18:49:55.110000

Before providing a more specific answer, let's establish a few basics (also known as normal forms).

DataJoint implements the relational data model. Under the relational model, complex dataframes of the type you described require normalization into multiple related tables related to each other through their primary keys and foreign keys.

Each table will represent a single entity class: Units and Trials will be represented in separate tables.

All entities in a given table will have the same attributes (columns). They will be uniquely identified by the same attribute(s) comprising the primary key.

In addition to the primary key, tables may have additional secondary indexes to accelerate queries.

If you already knew about normalization, we can talk how about to normalize your design. If not, we can refer you to a quick tutorial.

Storing wide-form dataframes in datajoint table

There are 1 best solutions below

Related Questions in PANDAS

Related Questions in DATAJOINT

Trending Questions

Popular # Hahtags

Popular Questions