Many inputs to one output, access wildcards in input files


Apologies if this is a straightforward question; I couldn't find anything in the docs.

Currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow and summarizing them.

Is there a way to avoid this manual regex step to parse the wildcards in the filenames?

I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee a consistent order.

rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        # one row per input file (pandas is assumed to be imported as pd at the top of the Snakefile)
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])

        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom cross // chrom, cross =
            df.loc[i] = stat, chrom, cross

This seems a bit awkward when this information must be in the environment somewhere.


There is 1 best solution below


(via Johannes Köster on the Google group)

To answer your question: expand uses itertools.product from the standard library. Hence, to iterate over the same combinations, you could write

from itertools import product

product(config["chromosomes"], cross_ids)
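As a minimal sketch of how this could replace the regex step, the pairs from itertools.product can be zipped with the input files so that each file arrives together with its chrom and cross. The chromosome names and cross IDs below are hypothetical stand-ins for config["chromosomes"] and cross_ids, and the files list stands in for input.files. It assumes the keyword arguments are passed to expand in the same order as the arguments to product.

```python
from itertools import product

# Hypothetical stand-ins for config["chromosomes"] and cross_ids.
chromosomes = ["2L", "2R"]
cross_ids = ["c1", "c2", "c3"]

# Stand-in for input.files: built in product order (chrom outer, cross inner),
# which is assumed here to match the order expand produces.
files = [
    f"output/{chrom}/{cross}.in"
    for chrom, cross in product(chromosomes, cross_ids)
]

rows = []
for fn, (chrom, cross) in zip(files, product(chromosomes, cross_ids)):
    # The per-file statistic would be computed from fn at this point.
    rows.append((fn, chrom, cross))

for row in rows:
    print(row)
```

Inside the rule's run block, the same zip would be applied to input.files, so no filename parsing is needed.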