Pearsonr & pvalues on large dataset including nulls

178 Views Asked by At

I have a datafile of ~375 cell lines and ~14,000 genes. I'm attempting to compute the pairwise correlations for each gene with every other gene.

Code is very simple as I'm using the pingouin package:

import pandas as pd
import pingouin as pg
df = pd.read_csv("CCLE Proteomics.csv", index_col=0, header=0)
df_corr = df.rcorr(stars=False)
print(df_corr)

Attempting to run this code returns:

ValueError: x and y must have length at least 2.

Pingouin uses Scipy pearsonr to do the calculations, and using pearsonr without Pingouin returns the same error.

I've also tried using a dummy dataset (5x7 dataframe of random numbers) which works fine when it doesn't include any null values, but returns the same error if null values exist within the dataframe. Based on this I believe the null values in my dataset are causing the issue - unfortunately the data is spotty enough that removing ALL rows/columns containing a null value leaves me with no rows/columns left, and in the dummy data set even one NaN value is enough to throw the error. As rcorr removes NaN values before feeding in to pearsonr, I believe it's dropping all my datapoints and having nothing left to feed in.

df.corr can calculate my r-values just fine, but I'm in need of a method to calculate p-values for this dataset as well, as we expect a significant number of these correlations to be insignificant.

Is there a way I can drop/mask NaN values within my dataset without dropping entire rows/columns? Is there a way to run pearsonr that behaves similarly to spearmanr with (nan_policy:'omit')? Am I off base and it's not the NaN values that are the issue here?

1

There are 1 best solutions below

0
On

I can't say what Pingouin is doing because I'm not certain what function you're using. (pairwise_corr?) In any case, here's how you can do this with SciPy directly, although it requires the use of a private decorator.

Properly configured, _axis_nan_policy_factory returns a decorator that adds axis, nan_policy, keepdims, and masked array support to reducing functions. Here is how it can be applied to scipy.stats.pearsonr:

import numpy as np
from scipy import stats
# This works in SciPy 1.11.x. That said, direct use of private functions is not
# supported, and private APIs may change without notice.
from scipy.stats._axis_nan_policy import _axis_nan_policy_factory

# Add axis/nan_policy support to stats.pearsonr
pearsonr = _axis_nan_policy_factory(lambda *res: tuple(res), n_samples=2, 
                                    too_small=1, paired=True)(stats.pearsonr)

# Generate random data
rng = np.random.default_rng(4534645763457)
m, n = 5, 7
x = rng.random(size=(m, n))
y = rng.random(size=(m, n))

# Compute correlation coefficients for corresponding rows in one vectorized call
res = pearsonr(x, y, axis=-1)

# Loop over rows manually
ref = np.zeros((2, m))
for i, samples in enumerate(zip(x, y)):
  ref[:, i] = stats.pearsonr(*samples)

# Compare the results
np.testing.assert_allclose(res, ref)

In the example above, we computed the statistic and pvalue of corresponding rows of x and y. However, you want to perform the test on pairwise combinations of rows. (I'm assuming rows, but this can be adjusted for columns). To do this, we can use standard NumPy broadcasting rules (rather than the ad hoc rule used in spearmanr).

x1 = np.expand_dims(x, 0)
x1.shape  # (1, 5, 7)
x2 = np.expand_dims(x, 1)
x2.shape  # (5, 1, 7)

statistic, pvalue = pearsonr(x1, x2, axis=-1)
# statistic and p-value are for each pair
statistic.shape  # (5, 5)
pvalue.shape  # (5, 5)

Using nan_policy='omit', you can perform the calculation with NaNs omitted:

x[1, 3] = np.nan  # add a NaN in row 1
x1 = np.expand_dims(x, 0)
x2 = np.expand_dims(x, 1)
statistic2, pvalue2 = pearsonr(x1, x2, axis=-1, nan_policy='omit')

# Confirm that the NaNs are not propagated to the result
np.all(np.isfinite(statistic))
np.all(np.isfinite(pvalue))

# statistic (and p-value) will not agree with previous result in tests
# involving row 2 (except row 2 vs row 2)
statistic2 == statistic
# array([[ True, False,  True,  True,  True],
#        [False,  True, False, False, False],
#        [ True, False,  True,  True,  True],
#        [ True, False,  True,  True,  True],
#        [ True, False,  True,  True,  True]])

The fact that spearmanr has ad hoc support for 2D arrays and axis is all that is preventing us from applying this functionality to the correlation functions in SciPy. If you would like to see this functionality added to SciPy, please leave your thoughts in scipy/scipy#9307.