Original dataframe:
dp.head(10)
Creating new dataframe using recommended selection method:
dtest = pd.DataFrame(dp[dp['numdept'].isin([3,6,8,10])]).dropna()
dtest.reset_index(drop=True, inplace=True)
dtest.head(10)
Testing to make sure that only the values in [3,6,8,10] are in dtest['numdept']:
print "numdept is 5:", dtest[dtest["numdept"].isin([5])]
print "set of distinct values in the numdept column:", sorted(set(dtest['numdept'].tolist()))
>> numdept is 5: Empty DataFrame
>> Columns: [numgrade, numyear, numdept]
>> Index: []
>> set of distinct values in the numdept column: [3, 6, 8, 10]
Plotting:
plt.figure(figsize=(16, 8))
sb.boxplot(x="numyear", y="numgrade", hue="numdept", data=dtest)
Question: Why are the "numdept" categories in the plot legend showing values other than 3, 6, 8, 10?
Problem surfaced in an ipython notebook, but recurs even when I carry the code to a regular environment. Also tried to avoid seaborn related issues by using the suggestion here, to no avail.
Using Canopy 1.7.4.3348, jupyter 1.0.0-15, pandas 0.19.0-1, matplotlib 1.5.1-9, and seaborn 0.7.0-6.
EDIT: On an impulse, inserted the following before the plotting code:
grouped = dtest.groupby(['numdept', 'numyear'])
grouped.mean()
The output has numdept values that should not exist in dtest.
Does this make it a pandas bug?
You are using a categorical variable. It appears the legend is based on the categories in the categorical variable, not the values that are actually present. A categorical variable may represent categories that don't actually occur in the data, and these categories are still shown in the legend.
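To check whether this is what is happening, compare the values present in the column with the categories it still carries. The following is a minimal sketch, assuming numdept has a categorical dtype; the small dp frame here is made-up stand-in data, not the asker's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the question's data; numdept is categorical.
dp = pd.DataFrame({
    "numdept": pd.Categorical([3, 5, 6, 8, 10, 3]),
    "numgrade": [70, 80, 90, 60, 75, 85],
})

# Same kind of selection as in the question.
dtest = dp[dp["numdept"].isin([3, 6, 8, 10])].reset_index(drop=True)

# Values that actually occur in the filtered column.
values_present = sorted(dtest["numdept"].unique().tolist())

# Categories the column still carries after filtering.
categories_kept = list(dtest["numdept"].cat.categories)

print(values_present)
print(categories_kept)
```

If the second list is longer than the first, the extra entries are unused categories that survived the filter.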
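As suggested in the documentation, you can remove the empty categories. The method returns a new Series rather than modifying the column in place, so assign the result back:
dtest['numdept'] = dtest['numdept'].cat.remove_unused_categories()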
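Here is a short sketch of the fix applied end to end, again on made-up stand-in data rather than the asker's frame. Note that remove_unused_categories() returns a new Series, so the result is assigned back to the column. The groupby call passes observed=False explicitly, which assumes a pandas version that supports that parameter (it is not available in 0.19).

```python
import pandas as pd

# Hypothetical stand-in data; numdept is categorical and includes 5.
dp = pd.DataFrame({
    "numdept": pd.Categorical([3, 5, 6, 8, 10, 3]),
    "numyear": [1, 1, 2, 2, 1, 2],
    "numgrade": [70.0, 80.0, 90.0, 60.0, 75.0, 85.0],
})
dtest = dp[dp["numdept"].isin([3, 6, 8, 10])].reset_index(drop=True)

# Drop categories with no remaining rows and assign the result back.
dtest["numdept"] = dtest["numdept"].cat.remove_unused_categories()
categories_after = list(dtest["numdept"].cat.categories)

# Even when asked to show every category, groupby now has only four groups.
means = dtest.groupby("numdept", observed=False)["numgrade"].mean()
group_index = list(means.index)

print(categories_after)
print(group_index)
```

With the unused categories gone, anything that enumerates the categories (a groupby result or a seaborn hue legend) should list only the values that are actually present.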