I don't have any experience with Hadoop and am running into issues while attempting to run a Singularity container that uses it. It seems like Hadoop is not really getting started, and I'm trying to figure out why. After sifting through all the output, this appears to be the first indication of the problem:
2024-03-10 17:25:12 INFO Server:3102 - IPC Server handler 7 on default port 51139, call Call#41 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from localhost:60086 / 127.0.0.1:60086: java.io.FileNotFoundException: File does not exist: /user/ark19/cloudgene-cli/job-20240310-170017/temp/outputimputation/18/temp/_temporary/1/_temporary/attempt_1710089589002_0004_m_000000_0/part-m-00000 (inode 16888) [Lease. Holder: DFSClient_attempt_1710089589002_0004_m_000000_0_-1772981706_1, pending creates: 1]
Since I have very little understanding of how Hadoop works, I'm not even sure where to start troubleshooting. The first thing I notice is that I don't even see a /user directory:
Singularity> ls /
anaconda-post.log bin data dev environment etc home hpc lib lib64 localdata media mnt opt proc root run sbin singularity srv sys tmp usr var
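(If it helps, I could also try listing that path through the Hadoop client rather than the ordinary filesystem, since I gather /user may be an HDFS path rather than a local one; something like the following, assuming the hdfs wrapper inside the container is the right one to use:)
Singularity> hdfs dfs -ls /user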
I am working on my institution's computer cluster, which runs AlmaLinux 9.3 and Slurm, and I'm running everything from inside the container, which I start like this:
srun -c 32 --x11 --mem=128G --pty bash -i
singularity shell --hostname localhost --bind ${workingDirectory}:/data ${containerDirectory}/imputation-protocol-latest.sif
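(The two variables are just paths I export beforehand; the values below are placeholders rather than my real paths, but they give the idea:)
export workingDirectory=/scratch/$USER/imputation-run   # bound to /data inside the container
export containerDirectory=/scratch/$USER/containers     # directory holding the .sif image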
Perhaps I'm missing something obvious, or I need to troubleshoot my configuration further. I thought I was potentially onto something when I discovered this comment in hadoop/config/hdfs-site.xml:
Immediately exit safemode as soon as one DataNode checks in. On a multi-node cluster, these configurations must be removed.
But editing the lines that follow that comment didn't change anything.
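(If the exact properties under that comment matter, I can paste them; something like this should pull out the relevant block from the same file:)
Singularity> grep -B 2 -A 10 safemode hadoop/config/hdfs-site.xml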
I'm happy to provide further information about my setup, but I don't yet know what is most relevant. Thanks!
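(For example, here is roughly what I can gather from inside the container right away, assuming these commands behave normally in this image:)
Singularity> hadoop version
Singularity> java -version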