Question
What is the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? I don't want to slip back to MATLAB.
Explanation
I like R's analysis functions and syntax (and stunning plots) much better than MATLAB's, and have been working hard to refactor my stuff over. However, I keep getting hung up on the way data is organized in my work.
MATLAB
It's typical for me to work with multivariate time series repeated over many trials, which are stored in a big multidimensional array of SERIESxSAMPLESxTRIALS. This lends itself to some nice linear algebra stuff occasionally, but is clumsy when it comes to another variable, namely CLASS. Typically class labels are stored in another vector of dimension 1xTRIALS.
When it comes to analysis I basically plot as little as possible, because it takes so much work to get together a really good plot that teaches you a lot about the data in MATLAB. (I'm not the only one who feels this way).
R
In R I've been sticking as close as I can to the MATLAB structure, but things get annoyingly complex when trying to keep the class labeling separate; I'd have to keep passing the labels into functions even though I'm only using their attributes. So what I've done is separate the array into a list of arrays by CLASS. This adds complexity to all of my apply() functions, but seems to be worth it in terms of keeping things consistent (and bugs out).
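For concreteness, here is a minimal sketch of the two layouts described above. The sizes, the random data, and the names `x`, `labels`, `by_class` and `class_means` are all made up for illustration.

```r
set.seed(1)
n_series <- 4; n_samples <- 100; n_trials <- 6

# MATLAB-style layout: one SERIES x SAMPLES x TRIALS array ...
x <- array(rnorm(n_series * n_samples * n_trials),
           dim = c(n_series, n_samples, n_trials))
# ... plus a separate vector of class labels, one per trial
labels <- factor(c("a", "b", "a", "b", "a", "b"))

# Current workaround: split the trial dimension into a list of arrays by class
by_class <- lapply(split(seq_len(n_trials), labels),
                   function(idx) x[, , idx, drop = FALSE])

# Every analysis step now needs an lapply() wrapped around the apply() call,
# e.g. the per-series mean within each class
class_means <- lapply(by_class, function(a) apply(a, 1, mean))
```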
On the other hand, R just doesn't seem to be friendly with tensors/multidimensional arrays. Just to work with them, you need to grab the abind library. Documentation on multivariate analysis, like this example, seems to operate under the assumption that you have a huge 2-D table of data points like a data frame, and doesn't mention how to get 'there' from where I am.
Once I get to plotting and classifying the processed data, it's not such a big problem, since by then I've worked my way down to data frame-friendly structures with shapes like TRIALSxFEATURES (melt has helped a lot with this). On the other hand, if I want to quickly generate a scatterplot matrix or latticist histogram set for the exploratory phase (i.e. statistical moments, separation, in/between-class variance, histograms, etc.), I have to stop and figure out how I'm going to apply() these huge multidimensional arrays into something those libraries understand.
If I keep pounding around in the jungle coming up with ad-hoc solutions for this, I'm either never going to get better or I'll end up with my own weird wizardly ways of doing it that don't make sense to anybody.
So what's the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? Please, I don't want to slip back to MATLAB.
Bonus: I tend to repeat these analyses over identical data structures for multiple subjects. Is there a better general way than wrapping the code chunks into for loops?
Maybe dplyr::tbl_cube ?
Working on from @BrodieG's excellent answer, I think that you may find it useful to look at the new functionality available from dplyr::tbl_cube. This is essentially a multidimensional object that you can easily create from a list of arrays (as you're currently using), which has some really good functions for subsetting, filtering and summarizing which (importantly, I think) are used consistently across the "cube" view and "tabular" view of the data.
Loading arrays into cubes
Here's an example using arr as defined in the other answer:
Note that D means Dimensions and M Measures, and you can have as many as you like of each.
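A minimal sketch of this step, assuming the cubelyr package is installed (tbl_cube has lived there since dplyr 1.0.0). The other answer's definition of `arr` is not reproduced here, so the array below is a stand-in, and its sizes and dimension names are made up.

```r
library(cubelyr)  # provides tbl_cube(); assumed to be installed

set.seed(1)
# Stand-in for `arr`: 5 series x 10 samples x 3 trials of random values
arr <- array(sample(100, 150, replace = TRUE), dim = c(5, 10, 3))

# Dimensions: a named list of vectors, one per array axis
dims <- list(ser = paste0("series", 1:5), smp = 1:10, trial = 1:3)

# Measures: a named list of arrays whose shape matches the dimensions
arr.cube <- tbl_cube(dimensions = dims, measures = list(val = arr))

arr.cube  # the printout marks dimensions with D and measures with M
```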
Easy conversion from multi-dimensional to flat
You can easily make the data tabular by returning it as a data.frame (which you can simply convert to a data.table if you need the functionality and performance benefits later)
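As a sketch of that conversion, again assuming cubelyr and a made-up stand-in for `arr`: `as.data.frame()` has a method for tbl_cube objects that returns one row per cell of the cube.

```r
library(cubelyr)

set.seed(1)
# Stand-in cube: 5 series x 10 samples x 3 trials (hypothetical sizes and names)
arr <- array(sample(100, 150, replace = TRUE), dim = c(5, 10, 3))
arr.cube <- tbl_cube(
  dimensions = list(ser = paste0("series", 1:5), smp = 1:10, trial = 1:3),
  measures   = list(val = arr)
)

# The dimensions become id columns and the measure becomes a value column
arr.df <- as.data.frame(arr.cube)
head(arr.df)
```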
Subsetting
You could obviously flatten all data for every operation, but that has many implications for performance and utility. I think the real benefit of this package is that you can "pre-mine" the cube for the data that you require before converting it into a tabular format that is ggplot-friendly, e.g. simple filtering to return only series 1:
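A sketch of that filter on a stand-in cube (the array, its sizes, and the dimension names are assumptions; cubelyr supplies the tbl_cube method for dplyr's `filter()`):

```r
library(dplyr)
library(cubelyr)

set.seed(1)
arr <- array(sample(100, 150, replace = TRUE), dim = c(5, 10, 3))
arr.cube <- tbl_cube(
  dimensions = list(ser = paste0("series", 1:5), smp = 1:10, trial = 1:3),
  measures   = list(val = arr)
)

# Subset on the cube first: each condition refers to one dimension
series1 <- filter(arr.cube, ser == "series1")

# Only the 10 x 3 cells that survive the filter get flattened
series1.df <- as.data.frame(series1)
```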
tbl_cube currently works with the dplyr functions summarise(), select(), group_by() and filter(). Usefully you can chain these together with the %>% operator (%.% in early dplyr releases).
For the rest of the examples, I'm going to use the inbuilt nasa tbl_cube object, which has a bunch of meteorological data (and demonstrates multiple dimensions and measures):
Grouping and summary measures
So here is an example showing how easy it is to pull back a subset of modified data from the cube, and then flatten it so that it's appropriate for plotting:
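One way this might look, assuming cubelyr is installed (it ships the `nasa` cube along with the tbl_cube methods). The choice of year and the output name `mean_temp` are just for illustration.

```r
library(dplyr)
library(cubelyr)  # also provides the `nasa` example cube

# Filter, group and summarise on the cube itself ...
temp_by_loc <- nasa %>%
  filter(year == 2000) %>%
  group_by(lat, long) %>%
  summarise(mean_temp = mean(temperature, na.rm = TRUE))

# ... and only then flatten the (much smaller) result for plotting
temp_df <- as.data.frame(temp_by_loc)
head(temp_df)
```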
Consistent notation for n-d and 2-d data structures
Sadly the mutate() function isn't yet implemented for tbl_cube, but it looks like that will just be a matter of (not much) time. You can use it (and all the other functions that work on the cube) on the tabular result, though - with exactly the same notation. For example:
Plotting - as an example of R functionality that "likes" flat data
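A minimal sketch of the flat-data workflow: summarise on the cube, flatten, apply mutate() to the tabular result, and build a plot from it. It assumes cubelyr and ggplot2 are installed; the derived column `temp_dev` and the tile plot are illustrative choices, not taken from the original answer.

```r
library(dplyr)
library(cubelyr)
library(ggplot2)

# Summarise on the cube, then flatten to a data frame
temp_df <- nasa %>%
  group_by(lat, long) %>%
  summarise(mean_temp = mean(temperature, na.rm = TRUE)) %>%
  as.data.frame()

# mutate() has no tbl_cube method, but works on the flat result
temp_df <- temp_df %>%
  mutate(temp_dev = mean_temp - mean(mean_temp, na.rm = TRUE))

# Flat data plugs straight into ggplot2: one tile per lat/long cell
p <- ggplot(temp_df, aes(x = long, y = lat, fill = temp_dev)) +
  geom_tile()
# print(p) draws the plot
```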
Then you can plot with ggplot() using the benefits of flattened data:
Using data.table on the resulting flat data
I'm not going to expand on the use of data.table here, as it's done well in the previous answer. Obviously there are many good reasons to use data.table - for any situation here you can return one by a simple conversion of the data.frame:
Working dynamically with your cube
Another thing I think is great is the ability to add measures (slices / scenarios / shifts, whatever you want to call them) to your cube. I think this will fit well with the method of analysis described in the question. Here's a simple example with arr.cube - adding an additional measure which is itself an (admittedly simple) function of the previous measure. You access/update measures through the syntax yourcube$mets[$...]
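A sketch of adding a derived measure, assuming cubelyr and a stand-in `arr.cube` (the array, the dimension names, and the new measure `val_sq` are hypothetical). A tbl_cube is a list with `dims` and `mets` components, so a new measure is simply a new array in `mets`.

```r
library(cubelyr)

set.seed(1)
arr <- array(sample(100, 150, replace = TRUE), dim = c(5, 10, 3))
arr.cube <- tbl_cube(
  dimensions = list(ser = paste0("series", 1:5), smp = 1:10, trial = 1:3),
  measures   = list(val = arr)
)

# The new measure must have the same shape as the existing one
# so that it lines up with the cube's dimensions
arr.cube$mets$val_sq <- arr.cube$mets$val^2

# It then shows up as an extra column in the flattened view
names(as.data.frame(arr.cube))
```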
Dimensions - or not ...
I've played a little with trying to dynamically add entirely new dimensions (effectively scaling up an existing cube with additional dimensions and cloning or modifying the original data using yourcube$dims[$...]) but have found the behaviour to be a little inconsistent. Probably best to avoid this anyway, and structure your cube first before manipulating it. Will keep you posted if I get anywhere.
Persistence
Obviously one of the main issues with having interpreter access to a multidimensional database is the potential to accidentally bugger it with an ill-timed keystroke. So I guess just persist early and often:
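One way to do that is base R's saveRDS() and readRDS(), sketched here on a stand-in cube. The array and the temporary file path are placeholders, and the original answer may well have used a different function.

```r
library(cubelyr)

set.seed(1)
arr <- array(sample(100, 150, replace = TRUE), dim = c(5, 10, 3))
arr.cube <- tbl_cube(
  dimensions = list(ser = paste0("series", 1:5), smp = 1:10, trial = 1:3),
  measures   = list(val = arr)
)

# Write the cube to disk (a temporary file here; use a real path in practice)
path <- tempfile(fileext = ".rds")
saveRDS(arr.cube, path)

# Later, or after an accidental overwrite, restore it
restored <- readRDS(path)
```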
Hope that helps!