I've been dipping my toe into Matlab's categorical variable pool in the context of Matlab tables. Actually, I may have wandered into that territory in the past, but if so, only in a relatively superficial manner.
These days, I want to use Matlab code patterns to do what I normally would do in MS Access, e.g., various types of joins and filtering. Much of my data is categorical, and I've read up on the advantages of using categorical variables in tables. However, they mostly centre around descriptiveness (over enumerated types) and memory efficiency. I haven't run across mention of speed. Do categorical variables offer a speed advantage?
I also wonder how advisable it is to use categorical variables when doing various types of joins. The categorical variables will occupy different tables, so it's not clear to me how equivalence in values is established if such variables are involved in the SQL ON clause (which Matlab refers to as the 'Keys' parameter).
From the dearth of relevant Google hits, it almost seems like I'm in new territory, which to me is a scary thing. The lack of documented best practices, and the resulting need for trial and error and reverse engineering, demands more time than I can devote, so I'll sadly revert to using strings.
If anyone can point me to online guidance, I'd appreciate it.
A partial answer only....
The following test indicates that categorical data behaves sensibly when used as join keys:
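A minimal sketch of this kind of test, using two made-up tables (the table names, variable names, and values are all illustrative assumptions). The categorical key columns deliberately have different category sets, so their underlying integer codes do not line up; if the join matched on codes rather than on category names, the paired values would come out wrong.

```matlab
% Two tables whose key columns are categorical arrays with DIFFERENT
% category sets: {a,b,c} on the left, {b,c,d} on the right.
tLeft = table(categorical({'b'; 'c'; 'a'}), [1; 2; 3], ...
    'VariableNames', {'Key', 'LeftVal'});
tRight = table(categorical({'c'; 'd'; 'b'}), [10; 20; 30], ...
    'VariableNames', {'Key', 'RightVal'});

% The internal codes for the same label differ between the two tables.
disp(double(tLeft.Key)')    % codes relative to the left table's categories
disp(double(tRight.Key)')   % codes relative to the right table's categories

% Inner join on the categorical key (the analogue of SQL's ON clause).
tJoined = innerjoin(tLeft, tRight, 'Keys', 'Key');
disp(tJoined)
```

If matching is done by category name, only the rows keyed 'b' and 'c' should survive, each paired with the value that carries the same label in the other table.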
So the only question now is about speed. This is a general question, not just for joins, so I'm not sure what would be a good test. There are many possibilities, e.g., number of table rows, number of categories, whether it's a join or a filtering, etc.
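One possible starting point is a rough timing comparison along the following lines. This is only a sketch under assumptions: the row count, the 50 category names, and the variable names are all made up, and it covers just one filter and one inner join. Results will likely vary with Matlab version, table size, and number of categories, so no outcome is claimed here.

```matlab
rng(0);                                    % reproducible random data
nRows = 1e5;                               % assumed number of table rows
names = "cat" + (1:50)';                   % 50 hypothetical category names
strKey = names(randi(numel(names), nRows, 1));   % keys stored as strings
catKey = categorical(strKey);              % same keys stored as categorical

% Filtering: time an equality test on each representation.
tFilterStr = timeit(@() strKey == "cat7");
tFilterCat = timeit(@() catKey == 'cat7');

% Joining: time an inner join against a 50-row lookup table.
dataStr = table(strKey, 'VariableNames', {'Key'});
dataCat = table(catKey, 'VariableNames', {'Key'});
lookupStr = table(names, (1:50)', 'VariableNames', {'Key', 'Val'});
lookupCat = table(categorical(names), (1:50)', 'VariableNames', {'Key', 'Val'});
tJoinStr = timeit(@() innerjoin(dataStr, lookupStr, 'Keys', 'Key'));
tJoinCat = timeit(@() innerjoin(dataCat, lookupCat, 'Keys', 'Key'));

fprintf('filter: string %.4g s, categorical %.4g s\n', tFilterStr, tFilterCat);
fprintf('join:   string %.4g s, categorical %.4g s\n', tJoinStr, tJoinCat);
```

Varying nRows and the number of names would give a first feel for how the two representations scale.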
In any case, I believe that the answers to both questions should be better documented.