Returning Sklearn iris-data with cross validation


I am slightly confused by what the following code returns for X and y:

from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

I see that print(X) gives the iris data of shape 150x4, which seems correct. However, I am trying to understand what exactly print(y) gives. It simply returns this vector:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

I assume that 0, 1 and 2 are the class labels of the iris data, corresponding to 'setosa', 'versicolor' and 'virginica'. Am I correct? Could someone elaborate on this and perhaps make it slightly more intuitive?

Vons On

Broadly speaking, there are two types of datasets: those for regression and those for classification. Here you have a classification dataset, where X holds the predictors and y holds the group memberships.

from sklearn import datasets

iris = datasets.load_iris()

print(iris.DESCR)

Output:

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

As you can see from the description, the classes are listed in label order: Setosa corresponds to 0, Versicolour to 1, and Virginica to 2.
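If you want to check the mapping directly rather than read it off the description, the object returned by load_iris() also carries a target_names array. Here is a minimal sketch; the variable names label_to_name, y_names and class_counts are just illustrative choices of mine:

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

# target_names is ordered so that index i holds the name of class label i.
label_to_name = {int(i): str(name) for i, name in enumerate(iris.target_names)}
print(label_to_name)

# Indexing target_names with the label vector turns integer codes into names.
y_names = iris.target_names[iris.target]

# Count how many samples fall into each class.
labels, counts = np.unique(iris.target, return_counts=True)
class_counts = {label_to_name[int(l)]: int(c) for l, c in zip(labels, counts)}
print(class_counts)
```

This should print the three species names keyed by 0, 1 and 2, followed by 50 samples per class, which matches the vector shown in the question.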

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

X, y = datasets.load_iris(return_X_y=True)

# Combine the four features and the class label into one DataFrame
df = pd.DataFrame(np.column_stack((X, y)), columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

# Scatter plot of sepal length against petal length, colored by class
sns.FacetGrid(df, hue='class').map(plt.scatter, 'sepal_length', 'petal_length').add_legend()
plt.show()

[Figure: scatter plot of sepal_length against petal_length, colored by class]
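One side note: np.column_stack puts the labels in the same float array as the features, so the 'class' column ends up holding 0.0, 1.0 and 2.0. If you would rather have species names in the DataFrame, something like the following might work. It is only a sketch; df_named and the 'species' column are names I made up for illustration:

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()

# Keep the features as floats and store the label as a named categorical column.
df_named = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
df_named['species'] = pd.Categorical.from_codes(iris.target, categories=iris.target_names)

# Per-species summary, e.g. the mean petal length of each class.
print(df_named.groupby('species', observed=True)['petal_length'].mean())
```

Passing hue='species' instead of hue='class' to the plotting call should then give a legend with the species names rather than numeric codes.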