Bash: How to extract parent directory of 3 files at a time


I have file names like this:

/foo/bar/bazz/JMA01023D_E07/JMA01023D_E07_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01023D_E08/JMA01023D_E08_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01023D_E09/JMA01023D_E09_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01022D_E06/JMA01022D_E06_EKDL230054767-1A_22HF2WLT3_L7_1.fq.gz
/foo/bar/bazz/JMA01001D_A01/JMA01001D_A01_EKDL230054750-1A_222T7MLT4_L1_1.fq.gz
/foo/bar/bazz/JMA01001D_A02/JMA01001D_A02_EKDL230054750-1A_222T7MLT4_L1_1.fq.gz

3 of these files (full path, sorted alphabetically) form a triplet. I would like to get the parent folder name for 3 files at a time.

So the desired output would be:

JMA01001D_A01 JMA01001D_A02 JMA01022D_E06
JMA01023D_E07 JMA01023D_E08 JMA01023D_E09

Something like this:

find "$@" -iname '*_1.fq.gz' | sort | xargs -I % -n3 sh -c echo % | sed -r 's/ *[^ ]*\/([^ ]+)\/([^ ]+)/\1 /g\'

And ideally, I would like to support spaces, so something with find -print0, sort -z and xargs -0 would be ideal. But I just can't seem to get it to work.

Could someone please help me untangle my brain? It doesn't have to use sed, something with dirname/basename or awk would be fine as well...


There are 4 best solutions below

0stone0 On BEST ANSWER

You can use awk to get the parent folder name and pipe the result into xargs -n 3 to print three items per line:

... | awk -F'/' '{print $(NF-1)}' | xargs -n 3

So if I place your input in /tmp/foo and run the following:

sort /tmp/foo | awk -F'/' '{print $(NF-1)}' | xargs -n 3

The output is

JMA01001D_A01 JMA01001D_A02 JMA01022D_E06
JMA01023D_E07 JMA01023D_E08 JMA01023D_E09

pmf On

You can use awk with its built-in special variables.

  • NF gives the number of fields (when split by what is defined with -F). Use it as $(NF-1) to get the second-to-last item, which is the parent directory of a file found.
  • NR is the number of the record (or "line") being processed. Check it for its divisibility by three using NR%3 to decide what to print after the parent directory.
  • OFS and ORS are the output field and record (or "line") separators, which default to a space and a line break, respectively, and are printed after the parent directory, depending on whether the current item is an "inner" first or second part of the triplet, or the "final" third one.

find "$@" -iname '*_1.fq.gz' | sort |
awk -F/ '{printf "%s" (NR%3? OFS: ORS), $(NF-1)}'

Output:

JMA01001D_A01 JMA01001D_A02 JMA01022D_E06
JMA01023D_E07 JMA01023D_E08 JMA01023D_E09

Note: This correctly handles spaces but not line breaks occurring in the file names.
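If line breaks in file names are a concern, the NUL-delimited route mentioned in the question (find -print0, sort -z, xargs -0) is one way to go. The following is only a sketch under assumptions: it relies on the GNU versions of find, sort and xargs, and triplet_parents is a hypothetical helper name chosen for illustration. Each group of three paths is handed to a small sh script that strips the file name and keeps the last directory component.

```shell
# triplet_parents is a hypothetical helper name, not taken from the answers.
# Assumes GNU find, sort and xargs (-print0, -z, -0 and -r are not POSIX).
triplet_parents() {
  find "$@" -iname '*_1.fq.gz' -print0 |
    sort -z |
    xargs -0 -r -n3 sh -c '
      line=
      for path do
        dir=${path%/*}                      # drop the file name
        line="$line${line:+ }${dir##*/}"    # append the last directory component
      done
      printf "%s\n" "$line"
    ' sh
}
```

It would be called as triplet_parents /foo/bar/bazz. Because the names are joined with single spaces, a directory name that itself contains a space is printed intact but cannot be told apart from two names in the output; a different join character would be needed if that matters.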

jonas On

The following solution uses a combination of dirname and basename:

find . -iname '*_1.fq.gz' -exec dirname {} \; | xargs -n1 basename | sort | xargs -n3
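To make the dirname/basename combination concrete, here is a minimal sketch for a single path. parent_name is a hypothetical helper introduced only for illustration; the example path is the first one from the question.

```shell
# parent_name is a hypothetical helper, not part of the answer above.
parent_name() {
  # dirname drops the file name; basename keeps the last remaining component.
  basename -- "$(dirname -- "$1")"
}

parent_name "/foo/bar/bazz/JMA01023D_E07/JMA01023D_E07_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz"
# prints: JMA01023D_E07
```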

potong On

This might work for you (GNU parallel):

... | sort | parallel echo {//} | parallel -n3 echo {/}

or if you prefer:

... | sort | parallel -n3 'echo {=s#/.*/(.*)/.*#$1#=}'