This dataset is great! However, I am stuck. There seems to be a logarithmic relationship between the number of votes (x axis) and the approval index (y axis), but when I tried both scipy's curve_fit and lmfit to fit the data, I couldn't get a usable equation: either the fit gives the wrong equation and hence the wrong graph, or I get a truth value error when I try with lmfit.
However, I used an online website to generate a curve and it worked, sort of. I took x values at evenly spaced intervals and the corresponding y values, plotted those points, and generated a curve on the website. The equation I got was y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291. When I enter these values as the initial guesses (p0) for curve_fit or as the make_params values for lmfit, it still doesn't seem to work. I have added the code below for reference!
If anyone can tell me what I am doing wrong, that would be highly appreciated, as I've spent about 2 hours troubleshooting this... Thank you :)
Here is the link to the dataset as well: https://www.kaggle.com/datasets/alessandrolobello/the-ultimate-film-statistics-dataset-for-ml
Below is the code I tried:
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
import os
import plotly.express as px
df=pd.read_csv('movie_statistic_dataset.csv')
#number of votes vs approval index
data = df[['movie_numerOfVotes','approval_Index']].reset_index(drop=True)
data.columns = ['Votes','Approval']
X = data[['Votes']].values
Y = data[['Approval']].values
#trying curve_fit
from scipy.optimize import curve_fit
def logfunction(a,b,x):
    return a*np.log(b*x)
popt,pcov = curve_fit(logfunction,X.ravel(),Y.ravel())
a,b = popt
ypred_log = logfunction(a,b,X.ravel())
plt.scatter(X.ravel(),ypred_log)
plt.scatter(X.ravel(),Y.ravel())
#the curve doesn't fit
#trying lmfit
from lmfit import Model, Minimizer
def logfunc(x):
    return a*np.log(b*x + c)
mod = Model(logfunc)
params = mod.make_params(a=1,b=15,c=6000)
result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))
result.fit_report() #gives an error!
#equation I found online:
x = np.arange(0,6000001,10000)
y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291
#trying again with lmfit
def logfunc(x):
    return y = a*np.log(b*x + c) - d
mod = Model(logfunc)
params = mod.make_params(a=0.982, b=12.7,c=64567,d=8.302)
result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))
#also doesn't work!!
The main changes I made were:
In curve_fit, the first argument of the model function must be the independent variable (x in this case); the parameters to fit come after it.
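To illustrate the argument-order point, here is a minimal sketch of the two-parameter fit with x moved to the front. It does not use the Kaggle file: x_data and y_data are synthetic stand-ins that I generate from assumed values of a and b (borrowed from the equation in the question) plus a little noise, so the names and numbers are for illustration only. The bounds argument is my own addition to keep b positive, not something the original code had.

```python
import numpy as np
from scipy.optimize import curve_fit

# Model function: the independent variable x comes first,
# followed by the parameters that curve_fit should estimate.
def logfunction(x, a, b):
    return a * np.log(b * x)

# Synthetic stand-in for the Votes/Approval columns (an assumption,
# not the real dataset): known a and b plus small Gaussian noise.
rng = np.random.default_rng(0)
true_a, true_b = 0.986203, 12.6695
x_data = np.geomspace(1e3, 1e6, 200)
y_data = logfunction(x_data, true_a, true_b) + rng.normal(0.0, 0.01, x_data.size)

# Keep b strictly positive so log(b*x) stays defined during the search.
popt, pcov = curve_fit(
    logfunction,
    x_data,
    y_data,
    p0=[1.0, 1.0],
    bounds=([-np.inf, 1e-9], [np.inf, np.inf]),
)
a_fit, b_fit = popt

# Call the model with the same argument order as its definition.
y_pred = logfunction(x_data, a_fit, b_fit)
print(a_fit, b_fit)
```

With the real data you would pass X.ravel() and Y.ravel() in place of x_data and y_data. Note that a*np.log(b*x) is undefined at x = 0, so rows with zero votes would need to be dropped or the model given an offset term.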