How do we make a model in R using more than one row


Below is my R code for a model that predicts diamond prices from the diamonds dataset. I am not able to fit the model when I take the log of each variable. Without the logs I get a very poor model with badly wrong predicted prices. I have also included the error message and a link to the dataset for reference.

The error is:

> mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'x'

The dataset is available here: https://www.kaggle.com/shivam2503/diamonds

The full code is:

setwd ("C:/akash/study videos/virginia")
akash = read.csv("diamonds.csv")
#summary(akash)
ind = sample(2, nrow(akash),replace = TRUE , prob = c(0.8,0.2)) 
train = akash[ind==1,]
test = akash[ind==2,]
mod =(lm(log(price)~log(carat)+log(x)+log(y)+log(z),data=train))
summary(mod)
predicted = predict(mod,newdata = test)
mon = round(exp(predicted),0)
head(mon)
#head(test)
#View(akash)


Your model fails because the minimum value of each of the variables x, y and z is 0, so when you log-transform them you get -Inf:

lapply(c("x","y","z"),function(x)summary(log(diamonds[[x]])))

You can log-transform only the outcome, drop the rows with zero values before taking logs, or simply change the model.
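As a minimal sketch of the second option, the snippet below drops rows where x, y or z is zero and then fits the log-log model from the question. The `toy` data frame and its values are made up for illustration and only stand in for the real diamonds columns; the names `zero_counts` and `clean` are likewise hypothetical.

```r
# Hypothetical toy data mimicking the diamonds columns (values are invented)
toy <- data.frame(
  price = c(326, 334, 423, 2757, 2759, 5139, 6200, 12800, 15000, 18500),
  carat = c(0.23, 0.29, 0.31, 0.70, 0.71, 1.00, 1.20, 2.25, 2.40, 2.80),
  x     = c(3.95, 4.20, 4.34, 5.70, 5.73, 0.00, 6.80, 8.50, 8.60, 9.10),
  y     = c(3.98, 4.23, 4.35, 5.72, 5.76, 6.40, 6.75, 8.45, 8.62, 9.05),
  z     = c(2.43, 2.63, 2.75, 3.52, 3.56, 0.00, 4.20, 5.20, 0.00, 5.60)
)

# log(0) is -Inf, which is the value lm.fit() refuses to accept
log(0)

# Count the zero entries in each dimension column
zero_counts <- colSums(toy[, c("x", "y", "z")] == 0)
zero_counts

# Keep only the rows where every dimension is strictly positive
clean <- subset(toy, x > 0 & y > 0 & z > 0)

# With the zero rows gone, the log-log model fits without the NA/NaN/Inf error
mod <- lm(log(price) ~ log(carat) + log(x) + log(y) + log(z), data = clean)
coef(mod)
```

On the real data the same `subset()` call would be applied to the training and test sets before fitting and predicting. Whether dropping those rows is acceptable depends on how many there are and on why the zeros were recorded in the first place.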

Just as an example, here I compare the RMSE for the lm with no transformation, the lm with log(price) as the outcome, and a simple random forest model from the ranger package. I'm using caret so that all models share the same interface (by default caret::train performs 25 bootstrap resamples to choose the best parameters for the given model, so in this example only the random forest has any tuning parameters).

library(ggplot2)#for "diamonds" dataset
data("diamonds")
set.seed(5)
ind = sample(2, nrow(diamonds),replace = TRUE , prob = c(0.8,0.2)) 
train = diamonds[ind==1,]
test = diamonds[ind==2,]

library(caret)
rf <- train(price~carat+x+y+z,data=train,method="ranger")
lm <- train(price~carat+x+y+z,data=train,method="lm")
lm_log <- train(log(price)~carat+x+y+z,data=train,method="lm")

RMSE(predict(rf,test),test$price)/mean(test$price)*100
RMSE(predict(lm,test),test$price)/mean(test$price)*100
RMSE(exp(predict(lm_log,test)),test$price)/mean(test$price)*100

which gives me:

[1] 35.73012
[1] 40.2437
[1] 45.92143
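These numbers are the RMSE expressed as a percentage of the mean test price, so lower is better. For reference, here is a rough base R sketch of the same quantity; `rel_rmse` is a hypothetical helper written for this illustration, not a caret function, and the `obs` and `pred` vectors are invented.

```r
# Hypothetical helper: RMSE as a percentage of the mean observed value
rel_rmse <- function(pred, obs) {
  sqrt(mean((pred - obs)^2)) / mean(obs) * 100
}

# Invented observed prices and predictions on the original scale
obs  <- c(100, 200, 300)
pred <- c(110, 190, 300)
rel_rmse(pred, obs)  # about 4.08

# For a model fitted on log(price), back-transform with exp() first,
# so that the error is measured on the price scale
log_pred <- log(pred)
rel_rmse(exp(log_pred), obs)  # same value as above
```

The back-transformation matters because an RMSE computed on the log scale is not comparable with one computed on the price scale.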