fitting a log curve using lmfit and scipy.curve_fit - not working! (Dataset: The Ultimate Film Statistics Dataset - for ML)


This dataset is great! However, I am stuck... there seems to be a logarithmic relationship between number of votes (x axis) and approval index (y axis), but when I try to fit the data with both scipy.optimize.curve_fit and lmfit, I either get the wrong equation (and therefore the wrong curve) or, in the lmfit case, a truth-value error.

However, an online curve-fitting website sort of worked: I took x values at evenly spaced intervals with their corresponding y values, plotted the points, and had the website generate a curve. The equation it gave was y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291. But even when I supply these values as initial guesses (p0 for curve_fit, or via make_params for lmfit), the fit still doesn't work. I have added the code below for reference!

If anyone can tell me what I am doing wrong, that would be highly appreciated, as I've spent about 2 hours troubleshooting this... Thank you :)

Here is the link to the dataset as well: https://www.kaggle.com/datasets/alessandrolobello/the-ultimate-film-statistics-dataset-for-ml

Below is the code I tried:

import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.metrics import mean_squared_error

import os
import plotly.express as px

df=pd.read_csv('movie_statistic_dataset.csv')

#number of votes vs approval index

data = df[['movie_numerOfVotes','approval_Index']].reset_index(drop=True)
data.columns = ['Votes','Approval']

X = data[['Votes']].values
Y = data[['Approval']].values


#trying curve_fit

from scipy.optimize import curve_fit

def logfunction(a,b,x):
    return a*np.log(b*x)

popt,pcov = curve_fit(logfunction,X.ravel(),Y.ravel())

a,b = popt
ypred_log = logfunction(a,b,X.ravel())

plt.scatter(X.ravel(),ypred_log)
plt.scatter(X.ravel(),Y.ravel())

#the curve doesn't fit

#trying lmfit

from lmfit import Model, Minimizer

def logfunc(x):
    return a*np.log(b*x + c)

mod = Model(logfunc)

params = mod.make_params(a=1,b=15,c=6000)

result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))

result.fit_report()   #gives an error! 

#equation I found online:
x = np.arange(0,6000001,10000)
y = 0.986203*np.log(12.6695*x + 64567.3) - 8.30291


#trying again with lmfit

def logfunc(x):
    return a*np.log(b*x + c) - d

mod = Model(logfunc)

params = mod.make_params(a=0.982, b=12.7,c=64567,d=8.302)

result = mod.fit(np.array(data['Approval']),params=params,x=np.array(data['Votes']))
#also doesn't work!!


1 Answer

Answer by Muhammed Yunus:

The main changes I made were:

  • For curve_fit, the first argument of the model function must be the independent variable (x in this case); see the short sketch just after this list
  • I adjusted the form of the equation to get a better fit
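
To illustrate the first point, here is a minimal sketch of the signature fix applied to the question's original model, reusing the X and Y arrays defined above and the online-tool coefficients as the initial guess p0 (those starting values and the maxfev setting are assumptions, and convergence with this parameterisation is not guaranteed):

from scipy.optimize import curve_fit
import numpy as np

#independent variable first, then the parameters to be fitted
def logfunction(x, a, b, c, d):
    return a*np.log(b*x + c) - d

#initial guesses taken from the online-tool equation in the question
p0 = [0.986203, 12.6695, 64567.3, 8.30291]

#X, Y are the Votes/Approval arrays defined in the question
popt, pcov = curve_fit(logfunction, X.ravel(), Y.ravel(), p0=p0, maxfev=10000)
a, b, c, d = popt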

[Plot: fitted curve drawn over the Votes vs Approval scatter]

import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error

import os

df = pd.read_csv('movie_statistic_dataset.csv')

#number of votes vs approval index

data = df[['movie_numerOfVotes','approval_Index']].reset_index(drop=True)
data.columns = ['Votes','Approval']

X = data[['Votes']].values
Y = data[['Approval']].values


#trying curve_fit

from scipy.optimize import curve_fit

def logfunction(x, a, c): #first argument must be the independent variable
    return a * np.log(x + 1e-8)**3.8 + c
    # return a * np.log(x + 1e-6)**b + c #could estimate b instead

popt,pcov = curve_fit(logfunction, X.ravel(), Y.ravel())
a, c = popt

#Fitted curve at the data points
ypred_log = logfunction(X.ravel(), a, c)

#Fitted curve on a new axis
x_fine = np.linspace(X.min(), X.max(), num=300)
ypred_log_fine = logfunction(x_fine, a, c)

#Plot
plt.scatter(X.ravel(), Y.ravel(), marker='.', s=20, color='darkgray', label='data')

plt.plot(x_fine, ypred_log_fine, color='black', label='fit')
#Plot the fit at data points:
# plt.scatter(X.ravel(), ypred_log, marker='|', color='black', label='fit')
plt.legend()

plt.gcf().set_size_inches(8, 3)
plt.xlabel('X')
plt.ylabel('Y')
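
As a side note on the lmfit attempts from the question (the answer above only covers curve_fit): lmfit's Model builds its parameters from the model function's signature, so a, b, c and d need to be declared as function arguments rather than referenced as globals, with the independent variable x as the first argument. A minimal sketch along those lines, keeping the question's functional form and starting values (untested on this dataset, so treat it as a starting point rather than a definitive fit):

from lmfit import Model
import numpy as np

#parameters must appear in the signature so Model can discover them;
#the first argument is treated as the independent variable
def logfunc(x, a, b, c, d):
    return a*np.log(b*x + c) - d

mod = Model(logfunc)
params = mod.make_params(a=0.986, b=12.7, c=64567, d=8.302)

#'data' is the Votes/Approval DataFrame built above
result = mod.fit(data['Approval'].values, params, x=data['Votes'].values)
print(result.fit_report())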