I am using the pandas library to import a dataset. I split the dataset into X and y variables, preprocessed X through a pipeline, and stored the result in the X_data variable. Later, I train XGBoost and RF models on this data.
I used the sys and pympler libraries to see how much memory the variables in my workspace are taking up.
First Issue: I am getting different values from sys.getsizeof() and asizeof.asizeof(). I have looked around other forums and learned that sys does not work well for certain containers (e.g., set), but I am seeing a huge difference even for an ordinary pandas Series. I am not sure which value to take as the size of 'y'.
Second issue: I am using asizeof.asizeof() to find the size of the XGBoost and RF models. For the fitted RF model I get about 261KB, but for the fitted XGBoost model only around 5.2KB. I am not sure whether either value is correct. Maybe I am missing something?
Below are the code and outputs.
For Issue# 1
from pympler import asizeof
import sys
###
# Code for Preprocessing and model training not shown
###
type(y)
output: pandas.core.series.Series
asizeof.asizeof(y)
output: 6706024
sys.getsizeof(y)
output: 121104
_____________
type(X_data)
output: scipy.sparse._csr.csr_matrix
asizeof.asizeof(X_data)
output: 2301736
sys.getsizeof(X_data)
output: 56 # I discarded this value; from my research, this is just the size of the Python wrapper object
For Issue# 2 I also tried pickle and joblib.dump, but I am getting different values for the models.
import sys
import pickle
import joblib
from pympler import asizeof
from sklearn.ensemble import RandomForestClassifier
rf_classifier1 = RandomForestClassifier(**param, random_state=42)
## before training
print(asizeof.asizeof(rf_classifier1))
print(sys.getsizeof(rf_classifier1))
joblib.dump(rf_classifier1, 'rf_classifier1.joblib')
# note: these measure the filename string, not the saved file
print(sys.getsizeof('rf_classifier1.joblib'))
print(asizeof.asizeof('rf_classifier1.joblib'))
print("----------")
rf_classifier1.fit(X_data, y)
## after training
# checking size directly using sys and asizeof commands
print(asizeof.asizeof(rf_classifier1))
print(sys.getsizeof(rf_classifier1))
print("----------")
# checking with joblib
joblib.dump(rf_classifier1, 'rf_classifier1.joblib')
# note: these measure the filename string, not the saved file
print(sys.getsizeof('rf_classifier1.joblib'))
print(asizeof.asizeof('rf_classifier1.joblib'))
print("----------")
# checking with pickle.dumps
p = pickle.dumps(rf_classifier1)
print(sys.getsizeof(p))
print(asizeof.asizeof(p))
Output ---
2712
56
70
72
----------
361064
56
----------
70
72
----------
17719262
17719264
The last outputs did not make sense to me: the fitted model was about 361K according to asizeof, whereas pickle.dumps came to 17719262 bytes (about 17 MB), and the joblib.dump check showed no change at all.
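Part of the confusion is that sys.getsizeof('rf_classifier1.joblib') measures the 22-character filename string (hence the constant 70/72 bytes), not the file it names. A sketch of the comparison I think was intended, assuming scikit-learn is installed (the model, data, and filename here are illustrative):

```python
import os
import pickle
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

in_memory = len(pickle.dumps(rf))             # bytes of the in-memory pickle
joblib.dump(rf, "rf_model.joblib")
on_disk = os.path.getsize("rf_model.joblib")  # size of the file itself, not the filename

print(in_memory, on_disk)
```

With the file measured via os.path.getsize, the joblib and pickle figures should land in the same ballpark, both well above the asizeof number, since asizeof counts live Python objects while pickling serializes all the fitted tree arrays.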