Linux Shell "paste" command - guaranteed line-based interleaving?

I'm using the paste command in the Linux shell (bash) to interleave the output of two commands on an every-other-line basis. Empirically, my code seems to work, but I'm wondering whether paste has guaranteed behavior of alternating lines of output from each command. If I ran my below code 100 times, and one of the commands ran slower than the other one time, could I get two lines of output from one command before the other command caught up? I don't see any documentation of this in the man page on my systems (OS X or Ubuntu 18.04); it just seems to work...

Find the sample code below (FYI, it's producing a formatted hex dump in a format I like):

# Version 1
paste -d "\n" <(
    cat "${largeFile}" | 
    LC_CTYPE=C tr -c '[:print:]' '.' | 
    sed -E $'s/.{30}/&\\\n/g' | 
    sed 's/./&  /g'
) <(
    xxd -p "${largeFile}" |
    sed 's/../& /g'
)

Output:

G  I  F  8  9  a  .  .  q  .  w  .  .  !  .  3  C  r  e  a  t  e  d     w  i  t  h     t  
47 49 46 38 39 61 c7 01 71 00 77 ff 00 21 fe 33 43 72 65 61 74 65 64 20 77 69 74 68 20 74 
...

This question is sort of the opposite topic of the behavior of cat when you give it two streams: cat nominally prints from one stream first (perhaps all of its lines), then the other. For example:

# Echo buffers at 8096 characters on my system
numLines=$(( 2 * (8096 + 36) / 37 ))

# Make some dummy text that is double the size of echo's buffer
longText1="$( seq -f "Line %04.0f: $( printf "%s" {a..z} )" 1 ${numLines} )"
longText2="$( seq -f "Line %04.0f: $( printf "%s" {A..Z} )" 1 ${numLines} )"

# Print two streams simultaneously and check for interleaving
cat <(echo "${longText1}") <(echo "${longText2}")

The above code prints all lines from longText1 (lower case alphabet) first "most of the time", then the text from longText2 (upper case alphabet). But if you run it enough times, that isn't always true. Refer to this post: https://unix.stackexchange.com/a/476089/464414

So for cat, interleaving behavior for two streams is actually undefined, but you could easily guess that the behavior is "it doesn't interleave" based on empirical testing. What about for paste -- is the behavior guaranteed? I worry because I don't see anything about that written in the man page.


SIDEBAR:

In case anyone runs into this post searching for keywords "hexdump formatting" or something, here is one more version of the code that sometimes runs a little faster. Only use if the answers say it's safe ;-)

# Version 2
paste -d "\n" <(
    cat "${largeFile}" | 
    LC_CTYPE=C tr -c '[:print:]' '.' |
    fold -bw 1 | 
    tr "\n" " " | 
    fold -bw 2 | 
    tr "\n" " " | 
    fold -bw 90
) <(
    xxd -p "${largeFile}" | 
    fold -bw2 | 
    tr -s "\n" " " | 
    fold -bw 90
)

Or:

formatAscii=\
'/0 "%010_ad  |\t" '\
'30/1 " %_p " '\
'"\n"'
formatHex=\
'/0 "%010_ad  |\t" '\
'30/1 "%02x " '\
'"\n"'
hexdump -v -e "${formatAscii}" -e "${formatHex}" "${largeFile}"
There are 2 best solutions below

I'm wondering whether paste has guaranteed behavior of alternating lines of output from each command

This is by definition how paste is supposed to work. The only way it could behave otherwise is if your paste command were broken, i.e. if it did not comply with the POSIX standard, which states: "The paste utility shall concatenate the corresponding lines of the given input files, and write the resulting lines to standard output."

In other words, paste file1 file2 reads one line from file1 and one line from file2, then it outputs the concatenation of the two lines, then it repeats these operations with the next line.
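
As a small illustration of that pairing (a sketch only; the scratch directory and the names file1 and file2 are made up for the demo), the same two inputs can be pasted once with the default tab delimiter and once with a newline delimiter, which is what produces the alternating lines in the question:

```shell
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$tmpdir/file1"
printf 'b1\nb2\nb3\n' > "$tmpdir/file2"

# Default delimiter is a tab: corresponding lines are joined side by side.
joined=$(paste "$tmpdir/file1" "$tmpdir/file2")

# With a newline as delimiter, the same pairing appears as alternating lines:
# a1, b1, a2, b2, a3, b3.
interleaved=$(paste -d '\n' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$joined" "$interleaved"
rm -rf "$tmpdir"
```

The alternation is simply the tab-joined output with the tab swapped for a newline, so it follows the same line-by-line pairing.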

If (...) one of the commands ran slower than the other one time, could I get two lines of output from one command before the other command caught up?

No. paste file1 file2 always concatenates in the order of the input files (file1 then file2), and it proceeds to the concatenation only after it has read one line from each input file.
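
One way to convince yourself is to make one input deliberately slow. The sketch below is an assumption-laden demo, not your original command: slow_lines is a hypothetical helper, the one-second sleep is arbitrary, and the slow stream is fed through standard input (the "-" operand) instead of a process substitution so that it also runs in a plain POSIX sh.

```shell
# Hypothetical scratch file holding the "fast" input.
tmpdir=$(mktemp -d)
printf 'fast 1\nfast 2\nfast 3\n' > "$tmpdir/fast"

# Hypothetical slow producer: pauses before emitting each line.
slow_lines() {
    for i in 1 2 3; do
        sleep 1
        printf 'slow %s\n' "$i"
    done
}

# paste blocks on the slow input for each line before touching the fast one,
# so the fast file never gets ahead.
result=$(slow_lines | paste -d '\n' - "$tmpdir/fast")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The output should alternate slow 1, fast 1, slow 2, fast 2, slow 3, fast 3 regardless of how long the sleep is.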

The only expected exception is when EOF is met (one file has fewer lines than the other), and that behavior is well defined by POSIX: "If an end-of-file condition is detected on one or more input files, but not all input files, paste shall behave as though empty lines were read from the files on which end-of-file was detected."
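
A quick sketch of that EOF rule, using made-up files named long and short and a ":" delimiter so that the empty fields are visible (with -d '\n' they would show up as blank lines instead):

```shell
# Hypothetical scratch files of unequal length.
tmpdir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$tmpdir/long"
printf 'b1\n' > "$tmpdir/short"

# Once "short" hits EOF, paste keeps going and treats it as supplying
# empty lines, so the delimiter is still written.
uneven=$(paste -d ':' "$tmpdir/long" "$tmpdir/short")

printf '%s\n' "$uneven"
# a1:b1
# a2:
# a3:
rm -rf "$tmpdir"
```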

I don't see any documentation of this in the man page on my systems

If a line takes time to be read, paste will wait until it is fully available (i.e. until the ending \n has been read or EOF is met). This is the expected behavior when your manual page states "xxx reads a line of input". If there is a timeout or any special condition that may disrupt the normal behavior (i.e. make the read abort), then and only then will it be mentioned in the manual.
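
The same point can be checked with a line that arrives in two pieces. This is only a sketch: partial_line is a hypothetical helper, the sleep is arbitrary, and the slow input is passed on standard input ("-") to keep the example portable.

```shell
# Hypothetical scratch file for the second input.
tmpdir=$(mktemp -d)
printf 'other\n' > "$tmpdir/other"

# Hypothetical producer that writes half a line, stalls, then finishes it.
partial_line() {
    printf 'left-'
    sleep 1
    printf 'half\n'
}

# paste keeps reading the first input until the newline arrives, so the
# first output line is the complete "left-half", never a fragment.
whole=$(partial_line | paste -d '\n' - "$tmpdir/other")

printf '%s\n' "$whole"
rm -rf "$tmpdir"
```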

This question is sort of the opposite topic of the behavior of cat when you give it two streams: cat nominally prints from one stream first (perhaps all of its lines), then the other.

Yes, this is how cat is defined in the POSIX standard: "The cat utility shall read files in sequence and shall write their contents to the standard output in the same sequence."

But if you run it enough times, that isn't always true.

No. You may run cat 10 billion times; it will always behave as defined by POSIX. If your cat file1 file2 sometimes outputs file2 before file1, then your cat is broken.
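
For comparison, here is a hedged sketch of cat with a deliberately slow first input. The helper first_slow and the file named second are invented for the demo, and the first input is supplied on standard input ("-") in place of a process substitution.

```shell
# Hypothetical scratch file that is available immediately.
tmpdir=$(mktemp -d)
printf 'second 1\nsecond 2\n' > "$tmpdir/second"

# Hypothetical slow producer for the first operand.
first_slow() {
    sleep 1
    printf 'first 1\n'
    sleep 1
    printf 'first 2\n'
}

# cat reads its operands in sequence: it drains standard input to EOF
# before it opens and copies the second file, however slow the first one is.
sequential=$(first_slow | cat - "$tmpdir/second")

printf '%s\n' "$sequential"
rm -rf "$tmpdir"
```

Even though the second file is ready from the start, its lines should only appear after both lines of the first input.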

0
On

guaranteed line-based interleaving?

GNU tools are licensed under the GPL, which gives no warranty of any kind. BSD utilities are under the BSD license and likewise give no warranty of any kind. There are licensed POSIX operating systems that might give you a guarantee of the behavior.

guaranteed line-based interleaving?

There is no guarantee of any kind. Yes, this is how this utility works.

if you run it enough times, that isn't always true

Unless cosmic rays or other highly unlikely events mess with your hardware, that is always true.

is the behavior guaranteed?

There is no guarantee of any kind. Yes, this is how it works.


All programs, shell and kernel you are using are open source. Inspect their source to become familiar with how they work.

The "most important" description of standard utilities is in POSIX, see POSIX cat and POSIX paste.