Haskell Shake build: how can I set up a shared cache folder using shakeShare and/or shakeCloud?

301 Views Asked by At

I understand this is a new feature being worked on for GHC's Hadrian build system, so the workflow might be advanced, oddly specific, or still evolving. I read these so far:

It sounds like it should work for my use case: a domain-specific language for bioinformatics that would benefit greatly from caching comparisons between large genomes. I would be happy working from a basic minimal example or description of where to look. But I've included more details about my program too in case they make it easier...

The shortcut interpreter builds lots of artifacts with names derived from hashes of their inputs (somewhat Nix-like), and theoretically they should be portable across machines or even operating systems. A small program run might generate files + symlinks like this:

~/.shortcut
├── cache
│   ├── each
│   │   ├── 59e0192b1f
│   │   │   ├── 2633d268bf.ndb -> ../../../exprs/makeblastdb_nucl/2633d268bf_0.ndb
│   │   │   └── 2633d268bf.ndb.args -> ../../../cache/lines/3428ab5186.txt
│   │   ├── 623e07ac5b
│   │   │   ├── b9361606af.str.list -> ../../../exprs/extract_queries/b9361606af_0.str.list
│   │   │   └── b9361606af.str.list.args -> ../../../cache/lines/12ae82a598.txt
│   │   └── f477cfe47b
│   │       ├── 35e97350d8.bht -> ../../../exprs/blastn/ce1c174684_420e4f7fdf_35e97350d8_0.bht
│   │       └── 35e97350d8.bht.args -> ../../../cache/lines/9491bbe6a0.txt
│   ├── lines
│   │   ├── 0094e500eb.txt
│   │   ├── 12ae82a598.txt
│   │   ├── 246ddae0d8.txt
│   │   ├── 3428ab5186.txt
│   │   ├── 46767f8ae8.txt
│   │   ├── 5d54256d91.txt
│   │   ├── 61a97fd32d.txt
│   │   ├── 6de4b9ad67.txt
│   │   ├── 778251fd80.txt
│   │   ├── 81f7f42c42.txt
│   │   ├── 91ce94df26.txt
│   │   ├── 9491bbe6a0.txt
│   │   ├── b575c745e6.txt -> ../../cache/lines/6de4b9ad67.txt
│   │   ├── f094bac04c.txt
│   │   └── fcfb7a47a6.txt
│   ├── load
│   │   ├── 1e7afd22cf.gbk -> /home/jefdaj/shortcut/data/Mycoplasma_bovis_HB0801-P115.gbk
│   │   └── 28ce925871.gbk -> /home/jefdaj/shortcut/data/Mycoplasma_genitalium_M2321.gbk
│   ├── makeblastdb
│   │   └── 6de4b9ad67
│   │       ├── 6de4b9ad67.ndb.err
│   │       ├── 6de4b9ad67.ndb.nhr
│   │       ├── 6de4b9ad67.ndb.nin
│   │       ├── 6de4b9ad67.ndb.nsq
│   │       └── 6de4b9ad67.ndb.out
│   └── seqio
├── exprs
│   ├── any
│   │   └── e5677b1051_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   ├── blastn
│   │   ├── ce1c174684_420e4f7fdf_35e97350d8_0.bht -> ../../exprs/blastn/ce1c174684_420e4f7fdf_35e97350d8_0.bht.out
│   │   ├── ce1c174684_420e4f7fdf_35e97350d8_0.bht.out
│   │   └── ce1c174684_420e4f7fdf_35e97350d8_0.bht.out.err
│   ├── blastn_db
│   │   ├── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht -> ../../exprs/blastn_db/46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out
│   │   ├── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out
│   │   └── 46e62edec1_420e4f7fdf_9ef76468c4_0.bht.out.err
│   ├── blastn_each
│   │   └── 46e62edec1_420e4f7fdf_2943ae4ea3_0.bht.list -> ../../cache/lines/12ae82a598.txt
│   ├── extract_queries
│   │   ├── 53376e198d_0.str.list -> ../../cache/lines/246ddae0d8.txt
│   │   ├── 53376e198d_0.str.list.tmp
│   │   ├── 53376e198d_0.str.list.tmp.err
│   │   ├── b9361606af_0.str.list -> ../../cache/lines/246ddae0d8.txt
│   │   ├── b9361606af_0.str.list.tmp
│   │   └── b9361606af_0.str.list.tmp.err
│   ├── extract_queries_each
│   │   └── d724d35317_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   ├── gbk_to_fna
│   │   ├── 262cf7e4e4_a355cc10e8_0.fna
│   │   └── 262cf7e4e4_cdab12f059_0.fna
│   ├── list
│   │   ├── 3bbbf950a3_0.str.list.list -> ../../cache/lines/91ce94df26.txt
│   │   ├── 65127d0127_0.fna.list -> ../../cache/lines/6de4b9ad67.txt
│   │   └── df91bd4d94_0.str.list.list.list -> ../../cache/lines/46767f8ae8.txt
│   ├── load_fna
│   ├── load_gbk
│   │   ├── 15e3d91521_0.gbk -> ../../cache/load/28ce925871.gbk
│   │   └── 74e27ec9a5_0.gbk -> ../../cache/load/1e7afd22cf.gbk
│   ├── makeblastdb_nucl
│   │   ├── 2633d268bf_0.ndb -> ../../cache/lines/81f7f42c42.txt
│   │   └── 954abc5fe7_0.ndb -> ../../cache/lines/81f7f42c42.txt
│   ├── makeblastdb_nucl_each
│   │   └── 209fe6406d_0.ndb.list -> ../../cache/lines/5d54256d91.txt
│   ├── num
│   │   └── a53b190835_0.num -> ../../cache/lines/778251fd80.txt
│   ├── singletons
│   │   └── 954abc5fe7_0.fna.list.list -> ../../cache/lines/0094e500eb.txt
│   └── str
│       ├── 90811d06ee_0.str -> ../../cache/lines/f094bac04c.txt
│       ├── b4da62b027_0.str -> ../../cache/lines/fcfb7a47a6.txt
│       └── b81c880be5_0.str -> ../../cache/lines/61a97fd32d.txt
├── profile.html
└── vars
    ├── mapped.str.list.list -> ../exprs/extract_queries_each/d724d35317_0.str.list.list
    ├── mbov.fna -> ../exprs/gbk_to_fna/262cf7e4e4_a355cc10e8_0.fna
    ├── mgen.fna -> ../exprs/gbk_to_fna/262cf7e4e4_cdab12f059_0.fna
    ├── result -> ../exprs/any/e5677b1051_0.str.list.list
    └── single.str.list -> ../exprs/extract_queries/53376e198d_0.str.list

27 directories, 64 files

I tried adding shakeShare = Just "sharedir" to my Shake options. When I run the build, delete all artifacts, and re-run, it fails to find a cached file:

error! Error when running Shake build system:
  at want, called at ./ShortCut/Core/Eval.hs:253:7 in main:ShortCut.Core.Eval
* Depends on: eval
  at need, called at ./ShortCut/Core/Eval.hs:256:25 in main:ShortCut.Core.Eval
* Depends on: /root/.shortcut/vars/result
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/all/b2ba759dce_0.str.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/any/0a89231af6_0.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/list/df91bd4d94_0.str.list.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/vars/mapped.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/extract_queries_each/730901fdda_0.str.list.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/tblastn_each/46e62edec1_183f436b85_a5a22079c6_0.bht.list
  at need, called at ./ShortCut/Core/Actions.hs:110:3 in main:ShortCut.Core.Actions
* Depends on: /root/.shortcut/exprs/num/a53b190835_0.num
* Raised the exception:
/home/jefdaj/shortcut/sharedir/.shake.cache/2faae061b9976bed/0x134125AC: getPermissions:getFileStatus: does not exist (No such file or directory)

That was expected, but now how do I go about fixing it? The hashes should be stable because all symlinks are relative to the top level tmpdir (~/.shortcut here). And Shake should know about each of them since I make sure to call trackWrite.

Do I need to explicitly add all files to a cache using newCache, newCacheIO, and/or an oracle as they're built, or am I missing something simpler?

Ideally I would have a shared directory to cache everything done on the demo server, and also give users the option to connect their own instances to it like a Nix binary cache.

2

There are 2 best solutions below

0
On BEST ANSWER

I misunderstood what Shake caches are for: I assumed they cache build artifacts indexed by all their inputs, but actually they just cache reading + processing of files used to decide things while producing those artifacts. Like in this example from the docs:

digits <- newCache $ \file -> do
    src <- readFile' file
    return $ length $ filter isDigit src
"*.digits" %> \x -> do
    v1 <- digits (dropExtension x)
    v2 <- digits (dropExtension x)
    writeFile' x $ show (v1,v2)

As it says, "This function is useful when creating files that store intermediate values, to avoid the overhead of repeatedly reading from disk, particularly if the file requires expensive parsing." I think the cache here is equivalent to writing an intermediate .digits file and needing it twice.

I implemented what I was looking for separately by writing a need' function that checks if a file is in the cache (local or remote), fetches it if possible, and then calls Development.Shake.need. Depending on the build system, it might also be possible to do this kind of "caching" with something like rsync.

The Shake cache still looks very useful for the part of my code that needs to read a list of all sequence IDs into memory once per program run and refer to them later from various rules.

0
On

I think there's a confusion here that newCache and newCacheIO are involved with the Cloud Shake feature. They aren't - in fact, newCache happens to be one of the things that has no impact on the cloud cache (coincidentally, rather than because of the Cache in the name). I see no reason your use case shouldn't work with Cloud Shake.

The exception /home/jefdaj/shortcut/sharedir/.shake.cache/2faae061b9976bed/0x134125AC: getPermissions:getFileStatus: does not exist (No such file or directory) seems to imply that Shake has recorded that the above file is in the cache, and can be used to satisfy the file. So the question is where did the file go? Is it possible when you cleaned the files that they deleted the files from the cache too? Shake can be run with both --share-copy and --share-symlink - I'd recommend the former to ensure there is less action-at-a-distance, which might also explain what is deleting the files. I'd also recommend using Shake HEAD, as there are some fixes that haven't yet been released.

If that still doesn't work, perhaps raise a ticket on the Shake GitHub?