I am trying to run models on genomic data using Dask, but I am getting an error when I standardize or process the data.
I am working on a SLURM cluster, so I first start a cluster:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=16,
    processes=1,
    name='dask-worker',
    memory='32GB',
    walltime='12:00:00',
    log_directory='logs',
    python='srun --cpu_bind=verbose python',
    death_timeout=300,
    env_extra=['export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK',
               'source ~/.conda/envs/NN001/bin/activate'],
)
client = Client(cluster)
Then I am working with PLINK files of ~8 GB, which I read in using pandas-plink (https://pandas-plink.readthedocs.io/en/latest/api/pandas_plink.read_plink1_bin.html). The data looks like this:
from pandas_plink import read_plink1_bin, Chunk

G = read_plink1_bin('xxx.bed', verbose=True, chunk=Chunk(nsamples=4000, nvariants=4000))
G = G.isel(sample=slice(0, n_samples), variant=slice(0, n_features))  # n_samples/n_features are defined earlier (here 10000 and 200000)
G = G.fillna(G.median(dim='sample'))
print(G)
#Output:
<xarray.DataArray 'genotype' (sample: 10000, variant: 200000)>
dask.array<where, shape=(10000, 200000), dtype=float32, chunksize=(55, 2000), chunktype=numpy.ndarray>
Coordinates: (12/14)
* sample (sample) object '12' '13' ... '15' '16'
* variant (variant) <U13 'variant0' 'variant1' ... 'variant998' 'variant999'
fid (sample) object '12' '13' ... '15' '16'
iid (sample) object '12' '13' ... '15' '16'
father (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
mother (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
... ...
chrom (variant) object '1' '1' '1' '1' '1' '1' ... '1' '1' '1' '1' '1'
snp (variant) object '123' '124' ... '999'
cm (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
pos (variant) int32 123456 123457 123458 ... 987654 987655 987656
a0 (variant) object 'A' 'C' 'A' 'A' 'A' 'C' ... 'C' 'A' 'T' 'A' 'T'
a1 (variant) object 'G' 'T' 'G' 'C' 'G' 'T' ... 'T' 'G' 'C' 'G' 'C'
I need to standardize the data using fit_transform() from dask_ml's StandardScaler:
from dask_ml.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(G)
But then I get this error:
2022-05-14 02:10:18,224 - distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
...
raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type tuple.', '(<function _read_bed_chunk at
0x148a6e435160>, memmap([255, 255, 255, ..., 255, 255, 63], dtype=uint8), 613049,
458747, 0, 1024, 0, 1024, <Allele.a1: 1>)')
The error also occurs when calling G.compute() or G.values.
I only get this error when starting a cluster. Without a cluster, everything works as expected.
It seems that the data array X is not serializable. Can I still use it? Or is there a way to make it serializable?
EDIT:
- Added information about reading the data in.
- Fixed typo in 3rd code block (input to fit_transform is G not X)
pandas_plink.read_plink returns a tuple with three items: alleles, samples, and genotypes. The documentation shows how to unpack this tuple into three variables. I don't know the subject matter well enough to say why they are called bim, fam, and bed, but it reflects the fact that the function returns a tuple with three items.
It isn't clear to me which of bim, fam, or bed you want to scale, but the reason you are getting the error is that the variable you are passing to the StandardScaler is the tuple returned by read_plink, so you need to identify which item in the tuple is the array that you want to scale. A sketch of that unpacking is below.
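This is a minimal sketch assuming the genotype matrix (bed) is the array you want to scale; the file prefix and the transpose are illustrative, not taken from your code:

from pandas_plink import read_plink
from dask_ml.preprocessing import StandardScaler

# read_plink returns a tuple of three items; unpack it as the docs show
bim, fam, bed = read_plink('xxx')  # bim: variant info, fam: sample info, bed: genotype dask array

# pass the array (not the whole tuple) to the scaler
scaler = StandardScaler()
X = scaler.fit_transform(bed.T)  # assuming rows should be samples and columns variants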