Randomly sample lines retaining commented header lines


I'm attempting to randomly sample lines from a (large) file while always retaining a set of "header lines". Header lines are always at the top of the file and, unlike all other lines, begin with a #.

The actual file format I'm dealing with is VCF, but I've kept the question general.

Requirements:

  • Output all header lines (identified by a # at line start)
  • The command / script should (have the option to) read from STDIN
  • The command / script should output to STDOUT

For example, consider the following sample file (file.in):

#blah de blah
1
2
3
4
5
6
7
8
9
10

An example output (file.out) would be:

#blah de blah
10
2
5
3
4

I have a working solution in bash (in this case selecting 5 non-header lines at random). It can read from STDIN (I can cat the contents of file.in into the rest of the command); however, it writes to a named file rather than to STDOUT:

cat file.in | tee >(awk '$1 ~ /^#/' > file.out) | awk '$1 !~ /^#/' | shuf -n 5 >> file.out

Best answer

With process substitution (thanks, Tom Fenech), the output of each command is presented to cat as if it were a file.
cat then concatenates these "files" in order, headers first, and writes the result to STDOUT.

cat <(awk '/^#/' file) <(awk '!/^#/' file | shuf -n 10)

Input

#blah de blah
1
2
3
4
5
6
7
8
9
10

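Output (all ten non-header lines appear, shuffled, because the command asks for -n 10; use -n 5 to sample five)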

#blah de blah
1
9
8
4
7
2
3
10
6
5
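
The cat command above opens file twice, so it needs a named file and cannot consume a stream. Below is a minimal sketch of a single-pass variant that can read from STDIN, assuming GNU shuf is installed: awk prints header lines straight to STDOUT and pipes every other line to shuf, which emits its sample once the input ends. The function name sample_keep_header is made up for illustration, and header lines are assumed to start with # in the first column.

```shell
# Hypothetical helper: sample_keep_header N [FILE...]
# Prints every line starting with '#', then N randomly chosen other lines.
# With no FILE argument, awk reads from STDIN.
sample_keep_header() {
  n=$1
  shift
  awk -v cmd="shuf -n $n" '
    /^#/ { print; fflush(); next }  # header: write it out immediately
         { print | cmd }            # any other line: hand it to shuf
    END  { close(cmd) }             # wait for shuf to finish writing
  ' "$@"
}
```

It could be used as cat file.in | sample_keep_header 5 > file.out, or as sample_keep_header 5 file.in. The fflush() call is there so that the headers reach STDOUT before shuf writes its sample; headers that appear after the first data line would still be printed, but the question states they are always at the top.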