Data reduction/transformation


Has anyone seen a method to reduce the data in order to cut down the amount of computation? What I mean is: when the number of features is huge, one may apply PCA to reduce the dimensionality and hence the computation. But what if we have only a handful of features and a huge number of data points (a time series)? How can one reduce that?


Answer by Has QUIT--Anony-Mousse:

Subsampling is fairly common.

Many statistical properties are well preserved when you subsample. If you have 1,000,000 points, the mean estimated from just 10,000 of them is already very close, and probably well within the measurement reliability of your data.
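A minimal sketch of this with NumPy, using synthetic data (the distribution and sizes here are hypothetical, chosen only to illustrate the point):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: one million points with mean 5 and std dev 2.
data = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

# Uniform random subsample of 10,000 points, drawn without replacement.
sample = rng.choice(data, size=10_000, replace=False)

# The subsample mean differs from the full mean by roughly
# sigma / sqrt(10_000) = 0.02 here -- far below most measurement noise.
print(data.mean(), sample.mean())
```

The standard error of the subsample mean shrinks as 1/sqrt(sample size), which is why even a 1% subsample is usually good enough for estimating simple statistics.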

Another approach is clustering with a simple, fast method such as k-means, using a large k, say sqrt(N). This approximates your data under a least-squares objective using k representative points. (You should also carry the cluster weights forward afterwards, since the resulting centroids represent different amounts of data.)
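A self-contained sketch of that idea, assuming plain Lloyd's k-means (a production version would use a library implementation such as scikit-learn's MiniBatchKMeans; the data and sizes here are made up for illustration):

```python
import numpy as np

def kmeans_reduce(X, k, iters=20, seed=0):
    """Reduce X (n x d) to k centroids plus per-centroid weights
    via plain Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Weight = how many original points each centroid stands for.
    weights = np.bincount(labels, minlength=k)
    return centroids, weights

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))       # hypothetical: many points, few features
k = int(np.sqrt(len(X)))               # k = sqrt(N), as suggested above
centroids, weights = kmeans_reduce(X, k)
```

Any downstream algorithm that accepts sample weights can then be run on the k centroids with these weights instead of on all N points.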

Last but not least, many reduction techniques - probably including PCA - can be applied to the transposed matrix. Then you reduce the number of instances instead of the number of variables. But PCA is fairly expensive, and on the transposed matrix it would scale as O(n³) in the number of points. So I would rather consider working directly with a truncated SVD.
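One way to see why the truncated SVD is cheap here: with few features d and many rows n, the Gram matrix XᵀX is only d×d, costs O(n·d²) to build, and its eigendecomposition yields the singular values and right singular vectors of X without ever forming an n×n matrix. A sketch under those assumptions (sizes and data are hypothetical; for sparse or very large problems one would instead use an iterative solver such as scipy's `svds` or scikit-learn's `TruncatedSVD`):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100_000, 5
# Hypothetical tall-and-thin data matrix: many points, few features.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))

# X^T X is a tiny d x d matrix; its eigenvectors are the right
# singular vectors of X, and its eigenvalues are squared singular values.
G = X.T @ X
eigvals, V = np.linalg.eigh(G)              # returned in ascending order
order = np.argsort(eigvals)[::-1]
sing_vals = np.sqrt(np.clip(eigvals[order], 0.0, None))
V = V[:, order]

r = 2                                        # keep the top-r components
# r x d "summary" of the whole dataset: the dominant directions,
# scaled by how much variance they carry.
summary = np.diag(sing_vals[:r]) @ V[:, :r].T
```

This reduces the n×d dataset to an r×d summary at O(n·d²) cost, rather than the O(n³) a PCA on the transposed matrix would imply.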

But apparently your data are time series. I would suggest looking for a data reduction method that incorporates your domain knowledge about what is important here.
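As one concrete example of a time-series-aware reduction (my example, not something prescribed in the answer above): Piecewise Aggregate Approximation (PAA) replaces each fixed-length window with its mean, preserving the coarse shape of the series while shrinking it by a large factor. A minimal sketch with hypothetical data:

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: split the series into
    n_segments consecutive chunks and keep only each chunk's mean."""
    series = np.asarray(series, dtype=float)
    segments = np.array_split(series, n_segments)
    return np.array([seg.mean() for seg in segments])

# Hypothetical noisy sine wave with 100,000 samples.
t = np.linspace(0.0, 10.0, 100_000)
series = np.sin(t) + 0.1 * np.random.default_rng(3).normal(size=t.size)

reduced = paa(series, 200)   # 100,000 points -> 200 segment means
```

Whether segment means, extrema, or change points are the right summary depends entirely on what aspect of the series matters for the downstream task, which is exactly the domain knowledge referred to above.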