I have a datafile of ~375 cell lines and ~14,000 genes. I'm attempting to compute the pairwise correlations for each gene with every other gene.
The code is very simple, as I'm using the pingouin package:
import pandas as pd
import pingouin as pg
df = pd.read_csv("CCLE Proteomics.csv", index_col=0, header=0)
df_corr = df.rcorr(stars=False)
print(df_corr)
Attempting to run this code returns:
ValueError: x and y must have length at least 2.
Pingouin uses Scipy pearsonr to do the calculations, and using pearsonr without Pingouin returns the same error.
I've also tried a dummy dataset (a 5x7 dataframe of random numbers), which works fine when it doesn't include any null values but returns the same error if null values exist within the dataframe. Based on this, I believe the null values in my dataset are causing the issue. Unfortunately, the data is spotty enough that removing ALL rows/columns containing a null value leaves me with no rows/columns, and in the dummy dataset even one NaN value is enough to throw the error. As rcorr removes NaN values before feeding into pearsonr, I believe it's dropping all my datapoints and having nothing left to feed in.
df.corr can calculate my r-values just fine, but I also need a method to calculate p-values for this dataset, as we expect a large number of these correlations not to be statistically significant.
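To illustrate why df.corr copes with the gaps, here is a small sketch on made-up data (the gene_a/gene_b/gene_c columns and their values are hypothetical). pandas excludes missing values pair by pair, so each coefficient uses only the rows where both columns are present, which is different from dropping every row that contains a NaN:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: every column has a gap in a different row.
toy = pd.DataFrame({
    "gene_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "gene_b": [2.1, np.nan, 3.3, 3.9, 5.2],
    "gene_c": [5.0, 4.1, 3.2, np.nan, 1.1],
})

# DataFrame.corr drops NaNs separately for each pair of columns.
r_pandas = toy.corr()

# The same r (plus a p-value) for one pair, using only rows where both exist.
both = toy["gene_a"].notna() & toy["gene_b"].notna()
r_manual, p_manual = stats.pearsonr(toy.loc[both, "gene_a"],
                                    toy.loc[both, "gene_b"])

# Listwise deletion, by contrast, discards every row that has any NaN.
rows_left = len(toy.dropna())
print(r_pandas.loc["gene_a", "gene_b"], r_manual, p_manual, rows_left)
```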
Is there a way I can drop/mask NaN values within my dataset without dropping entire rows/columns? Is there a way to run pearsonr that behaves similarly to spearmanr with nan_policy='omit'? Or am I off base, and the NaN values aren't the issue here?
I can't say what Pingouin is doing because I'm not certain what function you're using (pairwise_corr?). In any case, here's how you can do this with SciPy directly, although it requires the use of a private decorator.

Properly configured, _axis_nan_policy_factory returns a decorator that adds axis, nan_policy, keepdims, and masked array support to reducing functions. Here is how it can be applied to scipy.stats.pearsonr:

In the example above, we computed the statistic and pvalue of corresponding rows of x and y. However, you want to perform the test on pairwise combinations of rows. (I'm assuming rows, but this can be adjusted for columns.) To do this, we can use standard NumPy broadcasting rules (rather than the ad hoc rule used in spearmanr).

Using nan_policy='omit', you can perform the calculation with NaNs omitted:

The fact that spearmanr has ad hoc support for 2D arrays and axis is all that is preventing us from applying this functionality to the correlation functions in SciPy. If you would like to see this functionality added to SciPy, please leave your thoughts in scipy/scipy#9307.
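If depending on a private decorator is a concern, pairwise omission of NaNs can also be sketched with public APIs only. This is a different and much slower technique than the decorator approach: a plain loop over column pairs that calls scipy.stats.pearsonr on the rows where both columns are present. The function name pairwise_pearson and the min_obs threshold are hypothetical, and it works on columns (like df.corr) rather than rows:

```python
import itertools

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_pearson(df, min_obs=3):
    """Pearson r and p-value for every pair of columns, omitting NaNs pairwise.

    min_obs is an assumed threshold: pairs sharing fewer observations than
    this are left as NaN instead of being passed to pearsonr.
    """
    values = df.to_numpy(dtype=float)
    valid = ~np.isnan(values)
    k = values.shape[1]
    r = np.full((k, k), np.nan)
    p = np.full((k, k), np.nan)
    for i, j in itertools.combinations(range(k), 2):
        both = valid[:, i] & valid[:, j]  # rows where both columns exist
        if both.sum() < min_obs:
            continue
        r_ij, p_ij = stats.pearsonr(values[both, i], values[both, j])
        r[i, j] = r[j, i] = r_ij
        p[i, j] = p[j, i] = p_ij
    cols = df.columns
    # The diagonal is left as NaN; only off-diagonal pairs are computed.
    return (pd.DataFrame(r, index=cols, columns=cols),
            pd.DataFrame(p, index=cols, columns=cols))


# Usage on made-up data with a couple of missing values.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(8, 4)), columns=list("ABCD"))
demo.iloc[0, 1] = np.nan
demo.iloc[3, 2] = np.nan
r, p = pairwise_pearson(demo)
print(r.round(3))
print(p.round(3))
```

With ~14,000 columns this is roughly 98 million pairs, so a Python-level loop like this will take a long time; it is meant to show the per-pair masking, not to be fast.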