Today in “footguns I should have known to avoid”, we take a look at output buffering of command line tools such as grep.
Problem
The other day I was looking at some logs being streamed live to stdout by a log listener tool. I was interested specifically in log entries containing the string GRPC and NOT containing the string ipv4:192.168.
Let’s grep! That should be easy enough. I typed in the command below, hit enter …
❯ log_listener | grep GRPC | grep -v ipv4:192.168
And saw … nothing. Because I was also running an unfiltered log_listener in another terminal window, I could see a few matching log entries being produced; but still nothing appeared in my grep output. What’s going on?
Repro: a case of missing dogs
To reproduce this behavior, we can use a simple Python program that emulates a live producer of log entries:
import time
from itertools import cycle

ANIMALS = ['CAT', 'DOG', 'ZEBRA']
iterate_over_animals = cycle(ANIMALS)

while True:
    print(next(iterate_over_animals))
    time.sleep(1)
Running it like so:
$ python3 repro.py
the program prints one line per second, cycling through the friendly animal names:
CAT
DOG
ZEBRA
CAT
So far, so good. Now, let’s grep for dogs 🐕:
❯ python3 repro.py | grep DOG
… where is my output??
We’d expect a line containing the DOG string to appear in the output once every 3 seconds. Instead, no output appears at all. What’s going on??
Standard output buffering
What’s going on is stdout buffering. We’ll see it in practice shortly, but first let’s recap the basic idea of output buffering. When a program produces a bit of output, depending on the context, it may make sense to either:
- make it immediately available to the consumer (e.g. the next program, or the end user looking at the terminal window) by issuing a write syscall (no buffering)
- store the output in memory, but wait until even more output is produced before calling the write syscall (buffering)
Both approaches can make sense, depending on the context. For example, if the processing takes place in an interactive process (and a user is watching the output being produced), it makes sense to write the output as it is produced (no buffering). On the other hand, the write syscall is costly, so for a program running in the background it makes sense to optimize performance by buffering the writes (especially if the program frequently produces small chunks of output).
The output buffering behavior is controlled by each program. In practice, most programs follow the default behavior applied by their programming environment.
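To make the two strategies concrete, here is a minimal Python sketch (the function names are made up for illustration): the unbuffered variant pays for one write syscall per chunk, while the buffered variant accumulates everything in memory and pays for a single write at the end.
import sys

def write_unbuffered(chunks):
    # one write() syscall per chunk: the consumer sees output immediately
    for chunk in chunks:
        sys.stdout.write(chunk)
        sys.stdout.flush()  # push the data down to the OS right now

def write_buffered(chunks):
    # accumulate in memory first, then issue a single write() at the end
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
    sys.stdout.write(''.join(buffer))
    sys.stdout.flush()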
Are we connected to a terminal?
On Unix-like systems, most command line programs match the output buffering strategy applied by libc, documented in this part of the GNU libc manual: “Newly opened streams are normally fully buffered, with one exception: a stream connected to an interactive device such as a terminal is initially line buffered.”
This means that the buffering behavior differs depending on whether the output stream “is connected to an interactive device”. Again, this makes practical sense: a program producing large amounts of output in small chunks will perform better if multiple chunks are grouped together and written in a single syscall. Line buffering will write the output whenever we have a complete line; full buffering will write the output whenever we have a full block (e.g. 8KiB or 64KiB, depending on the system and the context).
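A quick way to see which strategy a Python process ended up with is to inspect its stdout wrapper (a small sketch; line_buffering is a standard TextIOWrapper attribute, and the reported size is the io module default rather than a guaranteed block size):
import io
import sys

print("line buffered stdout:  ", sys.stdout.line_buffering)
print("io default buffer size:", io.DEFAULT_BUFFER_SIZE, "bytes")
Run on a terminal, it should report line buffering; piped into another command, it shouldn’t.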
But how would a program (or libc, running within the program) know whether it’s connected to a terminal?
Enter isatty(), a POSIX-standard C function taking a file descriptor as an argument:
int isatty(int fd);
It returns 1 or 0, depending on whether the given file descriptor is connected to a terminal. We can see it working as advertised with the following program:
#include <unistd.h>
#include <stdio.h>

int main() {
    if (isatty(STDOUT_FILENO)) {
        printf("output connected to a terminal\n");
    } else {
        printf("output NOT connected to a terminal\n");
    }
    return 0;
}
We’re going to see:
❯ gcc check_for_tty.c -o check_for_tty && ./check_for_tty
output connected to a terminal
❯ gcc check_for_tty.c -o check_for_tty && ./check_for_tty > output_file && cat output_file
output NOT connected to a terminal
Magic ✨! We just created a program that produces different output depending on how it is consumed. (Whether that’s a good idea is a separate question, but hey, who’s gonna stop us!)
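For comparison, here is a rough Python equivalent of the same check: os.isatty() (and the sys.stdout.isatty() convenience method) wraps the same POSIX isatty() call.
import os
import sys

# same check as check_for_tty.c, via the POSIX isatty() wrapper
if os.isatty(sys.stdout.fileno()):
    print("output connected to a terminal")
else:
    print("output NOT connected to a terminal")
Running it directly and then redirecting it to a file (or piping it through cat) should reproduce the two messages from the C version.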
For bonus points, let’s quickly find out how this isatty() is even possible. The library function either issues a system call that can somehow distinguish between different types of output file descriptors, or it is powered by pure magic. strace can check the first hypothesis:
(Note: the previous snippet works on both Linux and macOS; for strace we need to run on Linux.)
❯ strace ./check_for_tty
<CUT>
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
<CUT>
write(1, "output connected to a terminal\n", 31) = 31
exit_group(0) = ?
+++ exited with 0 +++
We got it – it’s not magic after all, it’s the ioctl() syscall, a catch-all mechanism for manipulating various device-specific IO settings. As we can see in the man page, TCGETS is a request understood by the terminal driver. The syscall sets errno to ENOTTY if the stream (here file descriptor 1, i.e. stdout) is not connected to a terminal.
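For completeness, the same check can be reproduced “by hand” from Python with the termios module, a thin wrapper around these terminal ioctls: tcgetattr() fails on a file descriptor that isn’t a terminal (a small, Unix-only sketch).
import sys
import termios

fd = sys.stdout.fileno()
try:
    termios.tcgetattr(fd)  # issues the terminal-attributes ioctl under the hood
    print("output connected to a terminal")
except termios.error:
    # the ioctl failed, typically with ENOTTY ("Inappropriate ioctl for device")
    print("output NOT connected to a terminal")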
Where it goes wrong
OK, so far we’ve seen a pretty reasonable policy (line buffering if the output is connected to a terminal, full buffering if not). What, then, is going wrong with our simple example:
❯ python3 repro.py | grep DOG
The problem is that the isatty() call does what it says it does: it detects whether the file descriptor is connected to a terminal. Directly. But with pipes connecting programs together, it cannot tell that the stdout in question is connected to a terminal indirectly, via a pipe and then another process.
All that the Python interpreter can tell is that it’s connected to a pipe. The end consumer could be a terminal, a file, or a treasure chest at the bottom of the ocean for all it knows. There is no mechanism to propagate information about the end consumer of the stream back through all the processes linked via pipes.
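Here is a small sketch that makes this visible: the parent below runs a child whose stdout is a pipe, then prints what each side sees. Even though the parent may well relay the child’s output to a real terminal, the child itself only ever sees a pipe.
import subprocess
import sys

# the child's stdout is a pipe (capture_output=True), so isatty() is False there
child = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.isatty())'],
    capture_output=True, text=True,
)
print('child saw a terminal on stdout: ', child.stdout.strip())   # False
print('parent sees a terminal on stdout:', sys.stdout.isatty())   # True when run interactively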
Wait, why?
Actually, this isn’t necessarily a bad thing. If we want to grep a file, we’re fine: cat file.txt | grep GRPC | grep -v ipv4:192.168 does what we want. Depending on the size of the file, the first process will either finish or reach the block size and flush before we notice any delay. Block buffering actually makes it faster.
However, if the source of the data is a live program, it will likely use block buffering for standard output and leave the other two hanging. Even if the first process doesn’t use block buffering (i.e. it overrides the libc default to use line buffering or no buffering), the second program (grep) will be block-buffered by default and we will still not see any output until a full block is filled.
Making the dogs show up
Probably the simplest solution is to find out how to disable output buffering (or force line buffering instead of block buffering) for each of the programs in the chain. For example:
- for python3, it’s -u (“force the stdout and stderr streams to be unbuffered”)
- for grep, it’s --line-buffered (“Force output to be line buffered. By default, output is line buffered when standard output is a terminal and block buffered otherwise”)
- for sed, it’s -l (“Make output line buffered”)
As we can see already, there isn’t much consistency: sed has -u like python3, but python3 doesn’t have -l. Anyway, this at least works, and of course we don’t have to bother with the last program in the chain:
❯ python3 -u repro.py | grep DOG
DOG
DOG
DOG
🐕 !
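If we control the producer’s source code, we can also fix things there instead of on the command line. Below is a sketch of a repro.py variant (assuming Python 3.7+ for reconfigure()): it asks for a line-buffered stdout explicitly, so the -u flag is no longer needed; a print(animal, flush=True) on every line would work just as well. Programs in the middle of the chain, like grep, still need their own flags.
import sys
import time
from itertools import cycle

# force line buffering even when stdout is a pipe (Python 3.7+)
sys.stdout.reconfigure(line_buffering=True)

for animal in cycle(['CAT', 'DOG', 'ZEBRA']):
    print(animal)  # flushed at each newline thanks to line buffering
    time.sleep(1)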
Conclusion
Streaming live output through grep will teach you patience (or teach you about output buffering).
See also
- Stdout buffering – a good overview of standard output buffering
- Improving CLIs with isatty – an interesting article arguing that tools meant to read from a pipe should warn if they’re used to read from standard input