I have a very large training dataset. It contains 1,050 gestures, each with 12,000 data points. Feeding our machine learning models this many data points results in very slow performance and poor accuracy. I therefore used PCA to remove irrelevant characteristics from the high-dimensional space and project the most important features onto a lower-dimensional subspace, improving classification accuracy and reducing computational time. With PCA I reduced the 12,000 data points per gesture to 15 principal components without compromising the information extracted from the data.

In the future, I would like to store my machine learning model on an Arduino. An Arduino is a small microcontroller board with roughly 256 KB of storage. The training dataset I fit the PCA on is 225 MB, so storing it on the board is not possible.

Is there a way to fit PCA to my training dataset once, so that I can later transform my unseen testing data on the Arduino without having to store the training dataset on the Arduino for fitting?
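From what I understand, a fitted scikit-learn PCA keeps everything needed for the projection in its mean_ and components_ attributes, so in principle only those two arrays (not the training data) are needed at transform time. Below is a rough sketch of that equivalence, using placeholder shapes instead of my real gesture data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))   # stand-in for the gesture training matrix
X_new = rng.normal(size=(1, 50))       # one unseen sample

pca = PCA(n_components=15).fit(X_train)

# transform() only centres with the stored mean and projects onto the
# stored components, so mean_ and components_ are all that must be kept
manual = (X_new - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X_new)))  # True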

Here is my code to fit PCA to my training dataset:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

transposed_normDF.columns = transposed_normDF.columns.map(str)
features = [str(i) for i in range(12000)]  # one column name per data point
x = transposed_normDF.loc[:, features].values
y = df.loc[:,['label']].values

pca = PCA(n_components=0.99)  # keep enough components to explain 99% of the variance
principalComponents = pca.fit_transform(x)

pc = pca.explained_variance_ratio_.cumsum()  # cumulative explained variance per component
x1 = StandardScaler().fit_transform(principalComponents)  # standardise the PCs (the fitted scaler is not kept)
full_newdf = pd.DataFrame(data=x1,
                          columns=[f'pc_stdscaled_{i}' for i in range(len(pc))])
full_finalDf = pd.concat([full_newdf, df[['label']]], axis = 1)
print(full_finalDf)
print(full_newdf.shape)

Here is my code to transform unseen data (at the moment this relies on the fitted pca and on principalComponents from the training step, so it only works in the same session):

# reuse the pca fitted on the training data above;
# re-creating PCA(n_components=0.99) here would give an unfitted model
newdata_transformed = pca.transform(in_data)
pc = pca.explained_variance_ratio_.cumsum()
# the scaler is refitted on the training principal components,
# which is why the training data still has to be available
x1 = StandardScaler().fit(principalComponents)
X1 = x1.transform(newdata_transformed)
newdf = pd.DataFrame(data=X1,
                     columns=[f'pc_stdscaled_{i}' for i in range(len(pc))])
newdf.head()

There is 1 answer below.


Yes, it is possible to fit PCA on a training set and reuse it later in another program. You can use pickle to save the fitted model and load it back. Here is a code snippet for that:

from sklearn.decomposition import PCA
import pickle as pk
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10, centers=3, n_features=20, random_state=0)
pca = PCA(n_components=2)
result = pca.fit_transform(X)  # X has 20 features; PCA reduces them to 2
sample = X[0]
result = pca.transform([sample])
print(result)  # output: [[ 25.27946068  -2.74478573]]
pk.dump(pca, open("pca.pkl", "wb"))  # save the fitted PCA to disk

After saving the fitted PCA, you can reload it in another program and transform new input samples without loading the training data, as follows:

# later, reload the pickle file; no training data needed
pca_reloaded = pk.load(open("pca.pkl", "rb"))
result_new = pca_reloaded.transform([sample])  # transform the same sample with the reloaded PCA
print(result_new)  # output: [[ 25.27946068  -2.74478573]]

When you compare result and result_new, you will find that they are equal.
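Since the question's workflow also standardizes the principal components with a StandardScaler, both fitted steps need to be persisted, not just the PCA. One way to do that (a sketch, not from the original answer; the file name is a placeholder) is to wrap both steps in a scikit-learn Pipeline and pickle that single object:

import pickle as pk
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=100, centers=3, n_features=20, random_state=0)

# PCA first, then standardisation of the principal components,
# mirroring the order used in the question's code
pipe = Pipeline([("pca", PCA(n_components=0.99)),
                 ("scale", StandardScaler())])
pipe.fit(X)
pk.dump(pipe, open("pca_pipeline.pkl", "wb"))

# later, in another program: reload and transform unseen samples directly
pipe_reloaded = pk.load(open("pca_pipeline.pkl", "rb"))
print(pipe_reloaded.transform(X[:1]))

joblib.dump and joblib.load work the same way and are what the scikit-learn persistence docs suggest for estimators holding large NumPy arrays.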

Source: https://datascience.stackexchange.com/questions/55066/how-to-export-pca-to-use-in-another-program