Below is my R code for a model that predicts diamond prices from the diamonds dataset. I cannot fit the model when I log-transform each variable, and without the log transform I get a poor model with inaccurate price predictions. I have included the error and a link to the dataset for reference.
The error is:
> mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'x'
Link for the dataset is attached here: https://www.kaggle.com/shivam2503/diamonds
The code is below:
setwd ("C:/akash/study videos/virginia")
akash = read.csv("diamonds.csv")
#summary(akash)
ind = sample(2, nrow(akash),replace = TRUE , prob = c(0.8,0.2))
train = akash[ind==1,]
test = akash[ind==2,]
mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
summary(mod)
predicted = predict(mod,newdata = test)
mon = round(exp(predicted),0)
head(mon)
#head(test)
#View(akash)
Your model fails because the minimum value of each of your variables x, y, and z is 0, so when you log-transform them you obtain -Inf. You can try log-transforming just the outcome, excluding the zero values before the transformation, or simply changing the model.
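As a minimal sketch of the first two options, assuming a small synthetic data frame in place of the Kaggle file (the values and the names d, clean, mod_outcome and mod_full are made up for illustration, not taken from your data):

```r
# Synthetic stand-in for the diamonds data (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 200
carat <- runif(n, 0.2, 2)
d <- data.frame(
  carat = carat,
  x = 6.5 * carat^(1/3) * runif(n, 0.98, 1.02),
  y = 6.5 * carat^(1/3) * runif(n, 0.98, 1.02),
  z = 4.0 * carat^(1/3) * runif(n, 0.98, 1.02)
)
d$price <- exp(8.5 + 1.7 * log(d$carat) + rnorm(n, sd = 0.1))

# Inject a few zero dimensions, as found in the real file.
d$x[1:3] <- 0
d$z[4:5] <- 0

log(0)  # -Inf, which lm() cannot handle in a predictor

# Fitting on the raw rows fails; capture the error message instead of stopping.
err <- tryCatch(
  lm(log(price) ~ log(carat) + log(x) + log(y) + log(z), data = d),
  error = function(e) conditionMessage(e)
)

# Option 1: log-transform only the outcome.
mod_outcome <- lm(log(price) ~ carat + x + y + z, data = d)

# Option 2: drop rows with a zero dimension, then log-transform everything.
ok <- with(d, x > 0 & y > 0 & z > 0)
clean <- d[ok, ]
mod_full <- lm(log(price) ~ log(carat) + log(x) + log(y) + log(z), data = clean)

# Predictions are on the log scale, so back-transform with exp().
head(round(exp(predict(mod_full, newdata = clean)), 0))
```

With your own data you would apply the same filter to akash before splitting into train and test, so that neither subset contains zero dimensions.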
Just as an example, here I compare the RMSE of the lm with no transformation, the lm with the log(price) transformation, and a simple random forest model from the ranger package. I'm using caret so that all three share the same model interface (by default, caret::train performs 25 bootstrap resamples to choose the best parameters for the given model, so in this example only the random forest has any tuning parameters), which gives me: