TypeError: unsupported operand type(s) for +: 'int' and 'str' in PyCaret regression

I have read several existing questions about this error, but I still do not understand my problem.

I am trying to build a regression model using PyCaret:

from pycaret.regression import *
fooPy = setup(data = foo, target = 'pts', session_id = 123)

I receive the following error:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

I am not sure where the problem is, because I do not see any string columns in my data:

pts_500                   float64
pts_500_p                 float64
OBP_avg                   float64
SLG_avg                   float64
SB_avg                    float64
RBI_avg                   float64
R_avg                     float64
home                      int64
first_time_pitcher        int32
park_ratio_OBP            float64
park_ratio_SLG            float64
order                     float64
SO_avg_p                  float64
pts_500_parkadj_p         float64
pts_500_parkadj           float64
SLG_avg_parkadj           float64
OPS_avg_parkadj           float64
SLG_avg_parkadj_p         float64
OPS_avg_parkadj_p         float64
pts_BxP                   float64
SLG_BxP                   float64
OPS_BxP                   float64
whip_SO_BxP               float64
whip_SO_B                 float64
whip_SO_B_parkadj         float64
order                     float64
ops x pts_500 order15     float64
ops x pts_500 parkadj     float64
ops23 x pts_500           float64
ops x pts_500 orderadj    float64
whip_p                    float64
whip_SO_p                 float64
whip_SO_parkadj_p         float64
whip_parkadj_p            float64
pts                       float64
dtype: object

home and first_time_pitcher are integers.

Full error looks like:

Appreciate any tips!

There are 2 best solutions below

Best answer

I found the answer myself, and it was trivial and embarrassing.

The order variable was included twice in the dataset (it appears twice in the column listing in the question). I found this by checking the correlations, which showed a correlation of 1.0 between the two identical columns.

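import numpy as np

# Check pairwise correlations (df is the dataset, features is the list of feature columns)
cor = df[features].corr()
# Keep only the lower triangle so each pair appears once and the diagonal is dropped
cor.loc[:, :] = np.tril(cor, k=-1)
cor = cor.stack()
# Show strongly correlated pairs; a duplicated column shows up with a correlation of 1.0
cor[(cor > 0.7) | (cor < -0.7)]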
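
The correlation check finds the duplicate indirectly. A more direct check, sketched here on the assumption that the data is a pandas DataFrame, is to ask pandas which column labels repeat. The helper name duplicated_column_names and the toy frame are made up for illustration and are not part of PyCaret.

```python
import pandas as pd


def duplicated_column_names(df: pd.DataFrame) -> list:
    """Return the column labels that occur more than once."""
    cols = df.columns
    # Index.duplicated() marks the second and later occurrences of each label
    return list(cols[cols.duplicated()].unique())


# Toy frame (hypothetical data) in which 'order' appears twice
toy = pd.DataFrame(
    [[1.0, 2.0, 1.0, 3.0]],
    columns=["order", "whip_p", "order", "pts"],
)
print(duplicated_column_names(toy))  # ['order']
```

An empty list means every column label is unique.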

Just to add onto @Anakin Sykwalker's answer. This error (with the confusing error message) is caused by duplicated column names.

It can be resolved by eliminating the duplicate, either by renaming one of the columns (e.g. df.rename) or by dropping it (e.g. df.drop).
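
One caveat when doing this in pandas: label-based calls such as df.drop(columns='order') or df.rename(columns={'order': ...}) act on every column carrying that label, so both copies are affected. The sketch below shows two position-aware alternatives. The helper names and the toy frame are hypothetical, and the _1, _2 suffix scheme is only one possible convention (it does not guard against a suffixed name that already exists).

```python
import pandas as pd


def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each column label and drop later repeats."""
    return df.loc[:, ~df.columns.duplicated()]


def suffix_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep every column, renaming repeats to name_1, name_2, ..."""
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    out = df.copy()
    out.columns = new_names
    return out


# Toy frame (hypothetical data) in which 'order' appears twice
toy = pd.DataFrame([[1.0, 2.0, 1.0]], columns=["order", "pts", "order"])
print(list(drop_duplicate_columns(toy).columns))    # ['order', 'pts']
print(list(suffix_duplicate_columns(toy).columns))  # ['order', 'pts', 'order_1']
```

Dropping is the natural choice when the two columns hold identical values, as in the question. Renaming fits when they hold different data that happened to receive the same name.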

An example that reproduces the error is included below (using pycaret 2.3.6):

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# artificially create 2 columns with same name, Number of times pregnant
diabetes.columns = ['Number of times pregnant',
       'Number of times pregnant',
       'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
       '2-Hour serum insulin (mu U/ml)',
       'Body mass index (weight in kg/(height in m)^2)',
       'Diabetes pedigree function', 'Age (years)', 'Class variable']

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

This ends with the following error message:

TypeError: unsupported operand type(s) for +: 'int' and 'str'