I'm currently working on an image-classification problem using Bayesian networks. I have tried pomegranate, pgmpy, and bnlearn. My dataset contains more than 200,000 images; on each one I run a feature-extraction algorithm and get a feature vector of size 1026.
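For concreteness, here is a toy stand-in for the table that all three libraries receive (random values; the real `feature_df` has over 200,000 rows and 1026 columns):

```python
import numpy as np
import pandas as pd

# Toy stand-in: the real feature_df has >200,000 rows and 1026 columns
n_images, n_features = 1000, 1026
rng = np.random.default_rng(0)
feature_df = pd.DataFrame(
    rng.random((n_images, n_features)),           # feature values in [0, 1]
    columns=[f"f{i}" for i in range(n_features)],
)
```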
pgmpy
from pgmpy.models import BayesianModel
from pgmpy.estimators import HillClimbSearch, BicScore

# Use the same 20-row subset for both the search and the score
# (the original code passed the full DataFrame to HillClimbSearch
# but only 20 rows to BicScore, which was unintended)
est = HillClimbSearch(feature_df[:20], scoring_method=BicScore(feature_df[:20]))
best_model = est.estimate()
edges = best_model.edges()
model = BayesianModel(edges)
pomegranate
from pomegranate import BayesianNetwork

# 'exact' structure search is exponential in the number of variables
model = BayesianNetwork.from_samples(feature_df[:20], algorithm='exact')
bnlearn
library(bnlearn)
df <- read.csv('conv_encoded_images.csv')
df$Age <- as.numeric(df$Age)
res <- hc(df)                    # hill-climbing structure search
model <- bn.fit(res, data = df)
The bnlearn program in R finishes in a couple of minutes, while pgmpy runs for hours and pomegranate freezes my system after a few minutes. As you can see from my code, I give only the first 20 rows for training in the pgmpy and pomegranate programs, while bnlearn takes the whole data frame. Since all my image preprocessing and feature extraction happens in Python, it is inconvenient for me to switch between R and Python for training.
My data contains continuous values ranging from 0 to 1. I've also tried discretizing the data to 0s and 1s, which didn't resolve the issue.
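A minimal sketch of the binarization step, assuming a 0.5 threshold (the cut point shown here is illustrative):

```python
import numpy as np
import pandas as pd

# Toy continuous features in [0, 1], as in the real data
rng = np.random.default_rng(0)
feature_df = pd.DataFrame(
    rng.random((100, 8)),
    columns=[f"f{i}" for i in range(8)],
)

# Discretize to 0/1 by thresholding (0.5 is an assumed cut point)
binary_df = (feature_df > 0.5).astype(int)
```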
Is there any way I can speed up training in these python packages or am I doing anything wrong in my code?
Thanks for any help in advance.
Edit:
https://drive.google.com/file/d/1HbAqDQ6Uv1417zPFMgWBInC7-gz233j2/view?usp=sharing
This is a sample of the dataset with 300 columns and ~40,000 rows, in case you want to try reproducing the problem.