How to give negative log10 distribution to Python probplot function for qq-plotting p-values?


I want to translate the following code from R to Python using scipy.stats.probplot.

qqplot(-log10(ppoints(1000)), -log10(p_value))

This is a Q-Q plot of the p-values against the uniform distribution, drawn on a -log10 scale. I am after something like the following. (I know that other libraries can do this, but I am looking for an answer that uses probplot.)

probplot(-np.log10(p_values_data), dist="uniform", sparams=(0, 1), plot=plt)

This does not work correctly because the x-axis (the theoretical quantiles) is uniform rather than -log10 of uniform. Here, plt comes from import matplotlib.pyplot as plt. I found the post here, among others, but I did not find anything on modifying the dist parameter to accommodate -log10(uniform).

How can I get this plot using probplot?

Here is a revision of the problem description.

Here is the data generation.

import numpy as np
from scipy.stats import chi2, probplot
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

def compute_p_with_chi2(x,y):
    model = ols('y ~ x', data=dict(y=y, x=x)).fit()
    t_stat = model.tvalues['x']
    p_value = 1-chi2.cdf(t_stat**2, 1)
    return p_value

def compute_pvalues(X_data, p_data):
    p_values = []
    for col in X_data.T:
        p_value = compute_p_with_chi2(col, p_data)
        p_values.append(p_value)
    return p_values

n = 100
p = 1000
X = np.random.binomial(2, 0.4, size=(n, p))
y = np.random.normal(size=n)

p_values = compute_pvalues(X,y)

Doing a histogram of the p-values, I get a uniform distribution as expected.

plt.hist(p_values)

However, when I draw the Q-Q plot with probplot, the points do not fall along the diagonal fit line. This is the call I use:

probplot(-np.log10(p_values), dist="uniform", sparams=(0, 1), plot=plt)

The desired output is the plot that R produces with the first snippet above.

My feeling is that this is something very simple, but I am somehow missing it.


There are 3 best solutions below

Federicofkt On

You can transform the data manually in order to compare it to a uniform distribution:

import numpy as np
from scipy import stats

transformed_data = -np.log10(p_values_data)

expected_quantiles = stats.uniform.ppf(np.linspace(0.001, 0.999, len(transformed_data)))

and then use the probplot call you have already provided.
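As a point of comparison, the R call in the question can be reproduced by hand without probplot. The sketch below is an assumption about what is wanted, not part of the answer above: the helper names ppoints and neglog10_qq are made up for illustration, ppoints follows the plotting-position formula documented for R's ppoints, and the sample p-values are simulated.

```python
import numpy as np

def ppoints(n):
    """Plotting positions (i - a) / (n + 1 - 2a), the formula R documents for ppoints(n)."""
    a = 3.0 / 8.0 if n <= 10 else 0.5
    return (np.arange(1, n + 1) - a) / (n + 1 - 2 * a)

def neglog10_qq(p_values):
    """Return (expected, observed) quantiles on the -log10 scale, both ascending."""
    p = np.sort(np.asarray(p_values, dtype=float))[::-1]  # largest p-value first
    expected = -np.log10(ppoints(p.size)[::-1])            # ascending after -log10
    observed = -np.log10(p)                                # ascending after -log10
    return expected, observed

# Simulated p-values stand in for the real data (assumption).
rng = np.random.default_rng(0)
expected, observed = neglog10_qq(rng.uniform(size=1000))
```

The two arrays can then be drawn with plt.scatter(expected, observed) plus a reference diagonal; for uniform p-values the points should lie close to that diagonal.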

Matt Haberland On

The dist parameter accepts an object that works like a probability distribution; more specifically, it must have a ppf method. If you want to compare your data against the log-uniform distribution (the distribution of a random variable whose logarithm is uniformly distributed), you would do:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot
from scipy import stats

a, b = 1e-6, 1
p_values_data = np.logspace(-6, 0, 100)
probplot(p_values_data, dist=stats.loguniform, sparams=(a, b), plot=plt)


If you want your x-axis to be log-spaced, you would add:

plt.gca().set_xscale('log')
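The same ppf mechanism can also be used for the case where the p-values themselves are uniform on (0, 1). The following is a minimal sketch under that assumption; the class name NegLog10Uniform is made up for illustration and the data are simulated. If U is uniform on (0, 1), then -log10(U) has quantile function -log10(1 - q), which is the exponential distribution with scale 1/ln(10), so the built-in expon distribution should give the same theoretical quantiles.

```python
import numpy as np
from scipy import stats

class NegLog10Uniform:
    """Distribution of -log10(U) for U ~ Uniform(0, 1); only ppf is defined."""

    def ppf(self, q):
        # Quantile function: P(-log10(U) <= x) = 1 - 10**(-x)
        return -np.log10(1.0 - np.asarray(q, dtype=float))

# Simulated uniform p-values stand in for the real data (assumption).
rng = np.random.default_rng(0)
p_values_data = rng.uniform(size=1000)

# Pass plot=plt to draw the figure; without it, probplot only returns the numbers.
(osm, osr), (slope, intercept, r) = stats.probplot(
    -np.log10(p_values_data), dist=NegLog10Uniform()
)

# The same quantiles from the built-in exponential distribution.
q = np.linspace(0.01, 0.99, 50)
matches_expon = np.allclose(
    NegLog10Uniform().ppf(q), stats.expon.ppf(q, scale=1 / np.log(10))
)
```

With the built-in distribution, the equivalent call would be probplot(-np.log10(p_values_data), dist="expon", sparams=(0, 1 / np.log(10)), plot=plt).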


If this doesn't answer your question, please include in your question a minimal reproducible example - in this case, the data and an example of the plot you are trying to produce.

Matt Haberland On

It sounds like the first argument is required to be:

-np.log10(p_values_data)

where p_values_data are some values distributed between 0 and 1, and the comparison is to be made against the uniform distribution. This is equivalent to my previous answer, but it's a different way of visualizing it.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

a, b = 1e-6, 1
p_values_data = np.logspace(-6, 0, 1000)
probplot(-np.log10(p_values_data), dist="uniform", sparams=(0, 6), plot=plt)
# plt.gca().set_xscale('log')  # if desired for some reason


The main difference from the code in the question is that the parameters passed to the uniform distribution should reflect the range of the transformed data. The lower limit can be 0, but in that case the upper limit needs to be at least -np.log10(np.min(p_values_data)). If you choose to change the lower limit, note that SciPy's uniform distribution is parameterized by the left endpoint (loc) and the scale (the difference between the endpoints), not by the left and right endpoints separately.
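To make the loc/scale convention concrete, here is a small sketch that builds sparams from the data used above. The variable names left, right and y are chosen for illustration only.

```python
import numpy as np
from scipy import stats

p_values_data = np.logspace(-6, 0, 1000)
y = -np.log10(p_values_data)

# Endpoints of the transformed data: 0 up to -log10(min p-value).
left, right = 0.0, y.max()

# SciPy's uniform takes (loc, scale), so the scale is right - left.
sparams = (left, right - left)
dist = stats.uniform(*sparams)

lower_end = dist.ppf(0.0)  # left endpoint of the support
upper_end = dist.ppf(1.0)  # right endpoint of the support, loc + scale
```

The resulting tuple can be passed as probplot(y, dist="uniform", sparams=sparams, plot=plt).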

If one of the p_values_data is exactly 0, then -log10(0) is infinite. You would need to specify what you want to happen in that case.
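One possible way to handle exact zeros, shown only as a sketch, is to clip the p-values to a small positive floor before taking the logarithm. The helper name neglog10_clipped and the default floor are assumptions; dropping the zeros instead would be an equally valid choice, depending on what the plot is meant to show.

```python
import numpy as np

def neglog10_clipped(p_values, floor=np.finfo(float).tiny):
    """Return -log10(p) after clipping p into [floor, 1] so the result is finite."""
    p = np.asarray(p_values, dtype=float)
    return -np.log10(np.clip(p, floor, 1.0))

# A zero p-value no longer produces an infinite value.
values = neglog10_clipped([0.0, 0.5, 1.0])
```

Clipped points pile up at the largest plotted value, so it may be worth reporting how many p-values were affected.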

Again, if this doesn't answer your question, please include in your question a minimal reproducible example - in this case, the data and an example of the plot you are trying to produce.