`unique` row in DataFrame.describe() shows NaN for numeric columns [python][pandas]


Hi, this is probably something fundamental, but I can't fix it. unique() returns the unique values of each column, yet describe() shows NaN in the `unique` row for those same columns. Why? Any help is appreciated, thanks.

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv', header=0)

# works:
train['Pclass'].unique()
# array([3, 1, 2], dtype=int64)
train['Survived'].unique()
# array([0, 1], dtype=int64)

# does not work as expected (unique is NaN for the numeric columns):
train.describe(include='all')
#         PassengerId    Survived      Pclass               Name   Sex  \
# count    891.000000  891.000000  891.000000                891   891   
# unique          NaN         NaN         NaN                891     2   
# top             NaN         NaN         NaN  Mitkoff, Mr. Mito  male   
# freq            NaN         NaN         NaN                  1   577   
# mean     446.000000    0.383838    2.308642                NaN   NaN   
# std      257.353842    0.486592    0.836071                NaN   NaN   
# min        1.000000    0.000000    1.000000                NaN   NaN   
# 25%      223.500000    0.000000    2.000000                NaN   NaN   
# 50%      446.000000    0.000000    3.000000                NaN   NaN   
# 75%      668.500000    1.000000    3.000000                NaN   NaN   
# max      891.000000    1.000000    3.000000                NaN   NaN   
# 
#                Age       SibSp       Parch  Ticket        Fare        Cabin  \
# count   714.000000  891.000000  891.000000     891  891.000000          204   
# unique         NaN         NaN         NaN     681         NaN          147   
# top            NaN         NaN         NaN  347082         NaN  C23 C25 C27   
# freq           NaN         NaN         NaN       7         NaN            4   
# mean     29.699118    0.523008    0.381594     NaN   32.204208          NaN   
# std      14.526497    1.102743    0.806057     NaN   49.693429          NaN   
# min       0.420000    0.000000    0.000000     NaN    0.000000          NaN   
# 25%      20.125000    0.000000    0.000000     NaN    7.910400          NaN   
# 50%      28.000000    0.000000    0.000000     NaN   14.454200          NaN   
# 75%      38.000000    1.000000    0.000000     NaN   31.000000          NaN   
# max      80.000000    8.000000    6.000000     NaN  512.329200          NaN   
# 
#        Embarked  
# count       889  
# unique        3  
# top           S  
# freq        644  
# mean        NaN  
# std         NaN  
# min         NaN  
# 25%         NaN  
# 50%         NaN  
# 75%         NaN  
# max         NaN  

Best answer:

For numeric columns, the describe method doesn't report the number of unique values, since this is usually not particularly meaningful for numeric data. For string columns it does:

import pandas as pd
df = pd.DataFrame({'string_column': ['a', 'a', 'b'], 'numeric': [1, 2, 1]})

df['numeric'].describe()
Out[6]: 
count    3.000000
mean     1.333333
std      0.577350
min      1.000000
25%      1.000000
50%      1.000000
75%      1.500000
max      2.000000
Name: numeric, dtype: float64

df['string_column'].describe()
Out[7]: 
count     3
unique    2
top       a
freq      2
Name: string_column, dtype: object

Since your DataFrame contains both kinds of columns, the two sets of results are merged, and NaN is inserted wherever a statistic doesn't apply to a column.
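
If all you need is the number of distinct values per column, whatever the dtype, `DataFrame.nunique()` is one way to get it. Here is a minimal sketch that reuses the toy frame from above; the names `unique_counts` and `summary` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'string_column': ['a', 'a', 'b'], 'numeric': [1, 2, 1]})

# nunique() counts the distinct non-null values in every column,
# numeric or not
unique_counts = df.nunique()
print(unique_counts)

# describe(include='all') stacks the numeric and the string summaries;
# a statistic that does not apply to a column shows up as NaN
summary = df.describe(include='all')
print(summary.loc['unique'])
```

The second print shows the same effect as in the question: `unique` is filled in for the string column and NaN for the numeric one.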

If your numeric columns are actually just codes reflecting different classes/categories, you might want to convert them to Categorical to get more meaningful info about them:

df['categorized'] = pd.Categorical(df['numeric'])

df['categorized'].describe()
Out[10]: 
count     3
unique    2
top       1
freq      2
Name: categorized, dtype: int64
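
The same idea could be applied to several columns at once before calling describe(include='all'). This is only a sketch: the small frame below is a made-up stand-in for train.csv (the values are illustrative, not the real data), and treating Survived and Pclass as category codes is an assumption based on the question:

```python
import pandas as pd

# Made-up stand-in for train.csv; the values are illustrative only
train = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0],
    'Pclass': [3, 1, 3, 2, 3],
    'Fare': [7.25, 71.28, 7.92, 13.0, 8.05],
})

# Assumption: these integer columns are really class labels, not quantities
code_columns = ['Survived', 'Pclass']
train = train.astype({column: 'category' for column in code_columns})

# The categorical columns now get count/unique/top/freq,
# while Fare keeps its numeric statistics
summary = train.describe(include='all')
print(summary.loc[['count', 'unique', 'top', 'freq']])
```

Fare stays numeric here, so its `unique`, `top` and `freq` entries are still NaN.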