I have a data set of N numbers that I want to test for normality. I know scipy.stats has a kstest function, but there are no examples of how to use it or how to interpret the results. Is anyone here familiar with it who can give me some advice?
According to the documentation, using kstest returns two numbers, the KS test statistic D and the p-value. If the p-value is greater than the significance level (say 5%), then we cannot reject the hypothesis that the data come from the given distribution.
When I do a test run by drawing 10000 samples from a normal distribution and testing for gaussianity:
import numpy as np
from scipy.stats import kstest
mu,sigma = 0.07, 0.89
kstest(np.random.normal(mu,sigma,10000),'norm')
I get the following output:
(0.04957880905196102, 8.9249710700788814e-22)
The p-value is less than 5% which means that we can reject the hypothesis that the data are normally distributed. But the samples were drawn from a normal distribution!
Can someone explain the discrepancy here?
(Does testing for normality assume mu = 0 and sigma = 1? If so, how can I test that my data are gaussianly distributed but with a different mu and sigma?)
Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation of 1.
The null hypothesis (H0) is that the distribution from which your data was sampled is the standard normal distribution, with mean 0 and standard deviation 1. The p-value is the probability, assuming H0 is true, of seeing a test statistic at least as large as D.
In other words, with a p-value of about 8.9e-22, it is highly unlikely that H0 is true. That is reasonable, since the means and standard deviations don't match.
Compare your result with:
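A minimal sketch of that comparison, assuming a sample drawn from the standard normal distribution (the seed and the variable names are illustrative choices, not part of the question):

```python
import numpy as np
from scipy import stats

# Seed chosen arbitrarily so the run is repeatable (an assumption of this sketch).
rng = np.random.default_rng(0)

# Draw from N(0, 1), which is exactly what kstest(..., 'norm') tests against.
sample = rng.normal(0.0, 1.0, 10000)

D, p_value = stats.kstest(sample, 'norm')
print(D, p_value)
```

Here D should be small, and the p-value will usually (though not always) be well above 5%, so H0 is not rejected.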
To test whether your data is Gaussian with a known mu and sigma, you could shift and rescale it so that, under H0, it is normal with mean 0 and standard deviation 1:
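A short sketch of the rescaling, assuming mu and sigma are known in advance (here they are the values used to generate the data). As an alternative, kstest also accepts an args tuple that is passed on to the named distribution, so the same test can be run on the raw data. The seed and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
mu, sigma = 0.07, 0.89          # the known parameters from the question
data = rng.normal(mu, sigma, 10000)

# Option 1: standardize with the known parameters, then test against N(0, 1).
normed_data = (data - mu) / sigma
D_normed, p_normed = stats.kstest(normed_data, 'norm')

# Option 2: test the raw data against N(mu, sigma) by passing loc and scale.
D_args, p_args = stats.kstest(data, 'norm', args=(mu, sigma))

print(D_normed, p_normed)
print(D_args, p_args)
```

Both calls evaluate the same hypothesis, so the two statistics should agree up to floating-point rounding.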
Warning (many thanks to user333700, aka scipy developer Josef Perktold): if you don't know mu and sigma, estimating the parameters from the data makes the p-value invalid. stats.kstest may not reject the expected number of null hypotheses if the sample is normalized using the sample's own mean and standard deviation.
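One way to check this is a small simulation that counts how often H0 is rejected when it is actually true. The following is a sketch only: the number of repetitions, the sample size, and the seed are assumptions chosen so that it runs quickly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
mu, sigma = 0.07, 0.89          # true parameters, as in the question
alpha = 0.05
num_tests, n = 400, 500         # kept small so the loop finishes quickly

rejects_known = 0
rejects_estimated = 0
for _ in range(num_tests):
    data = rng.normal(mu, sigma, n)

    # Standardize with the true parameters: the p-value is valid.
    _, p_known = stats.kstest((data - mu) / sigma, 'norm')

    # Standardize with estimates from the same sample: the p-value is not valid.
    _, p_est = stats.kstest((data - data.mean()) / data.std(), 'norm')

    rejects_known += int(p_known < alpha)
    rejects_estimated += int(p_est < alpha)

rate_known = rejects_known / num_tests
rate_estimated = rejects_estimated / num_tests
print('known parameters:     {:.3f} rejection rate'.format(rate_known))
print('estimated parameters: {:.3f} rejection rate'.format(rate_estimated))
```

With the true parameters, the rejection rate should land near alpha. With estimated parameters it should be far lower, because fitting the mean and standard deviation to the sample pulls the empirical distribution toward the reference one.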