If you’ve learned about the Go programming language at all, you’ve probably come across the koan, “Don’t communicate by sharing memory; share memory by communicating.” It’s a snappy little bit of chiasmus, but what does it actually mean? The natural inclination is to say, “It means ‘channels good; mutexes bad.’ ” Certainly, that’s not too far off the mark as a first-order approximation of its meaning. But it’s actually a bit deeper than that.

A few months back, I read through The Go Programming Language by Donovan and Kernighan. It’s a great book with a lot of good ideas about how to write clear, clean code. One bit in particular that got me thinking was their use of channels to implement semaphores. The basic idea is that when you write, e.g., a web crawler in Go, you quickly run into the problem of too much parallelism. You can’t run more than a few dozen requests simultaneously, or else you’ll exhaust the system’s resources and fail. As it happens, I had written a web crawler in Go to scrape my own blog back in 2013, so I was already familiar with this exact problem.

One way they solve the problem is to restrict the number of active go-routines by rationing out slots in a buffered channel as “tokens”:

    // tokens is a counting semaphore used to
    // enforce a limit of 20 concurrent requests.
    var tokens = make(chan struct{}, 20)

    func crawl(url string) []string {
        fmt.Println(url)
        tokens <- struct{}{} // acquire a token
        list, err := links.Extract(url)
        <-tokens // release the token

        if err != nil {
            log.Print(err)
        }
        return list
    }

This is a very elegant solution to the problem. make(chan struct{}, 20) means “create a buffered channel that can hold up to twenty items before it blocks.” You can spawn as many go-routines as necessary and know that at any given time, only 20 simultaneous calls to links.Extract will be made, because only 20 “tokens” (empty structs) can be in the tokens channel at once.
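To see the pattern in action, here is a minimal sketch of a driver loop in the spirit of the book’s crawler, with crawl as above and the worklist bookkeeping simplified (and, like the book’s version, the loop never terminates on its own):

    // Spawn a go-routine per link; the tokens channel inside crawl,
    // not the number of go-routines, is what caps concurrent requests.
    func main() {
        worklist := make(chan []string)
        go func() { worklist <- os.Args[1:] }()

        seen := make(map[string]bool)
        for list := range worklist {
            for _, link := range list {
                if !seen[link] {
                    seen[link] = true
                    go func(link string) {
                        worklist <- crawl(link)
                    }(link)
                }
            }
        }
    }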

Naturally, I wanted to write a helper object to encapsulate this pattern and add a few other “nice to have” features as well. Ideally, I wanted to combine the helper object with a done channel and end up with an interface like:

    var s = semaphore.New(maxN)

    func task(args ...Arg) {
        if !s.Acquire() {
            return
        }
        defer s.Release()

        // work ...

        if moreWork {
            go task(moreArgs...)
        }
    }

    // time passes ...

    s.Stop()

The basic idea is that if Stop has not previously been called, s.Acquire will block until there is a free token to acquire. If Stop has been called, Acquire returns false, alerting the go-routine that it needs to bail out.

My first attempt to create a Semaphore type revolved around two channels of empty structs:

    func New(n int) *Semaphore {
        return &Semaphore{
            sem:  make(chan struct{}, n),
            done: make(chan struct{}),
        }
    }
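The Semaphore struct itself isn’t shown above, but given these methods, it presumably looked something like this (the rw mutex joins later, once the races described below appeared):

    type Semaphore struct {
        rw   sync.RWMutex // added later to fix the races below
        sem  chan struct{}
        done chan struct{}
    }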

Acquire listened to the two channels to get a return value:

    func (s *Semaphore) Acquire() bool {
        select {
        case s.sem <- struct{}{}:
            return true
        case <-s.done:
            return false
        }
    }

To make this work, we can rely on the fact that once a channel is closed, receives on it always succeed immediately, yielding the channel’s zero value. So, we only need to close s.done once to make Acquire return false to all future callers.
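You can see this property in isolation in a few lines:

    done := make(chan struct{})
    close(done)
    <-done // returns immediately
    <-done // and again: receives on a closed channel never block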

However, I soon realized that there was a difficult-to-eliminate race condition at the heart of this type. If Stop and Release are called at around the same time, another token might be freed up in the sem channel. An attempt to send on a closed channel panics, so we can’t just close sem. Instead, we can use the fact that a nil channel always blocks to prevent sem from accepting any new tokens. So, I tried having Stop set sem to nil. But this was still flagged as a race by Go’s race detector. And in fact, it didn’t totally solve the problem, because Acquire’s select statement would still be listening to the original sem channel from before it was set to nil. (select evaluates all of its channel operands as expressions before waiting for a communication event.) I ended up having to surround each method with a lock to get rid of the race conditions:

    func (s *Semaphore) Acquire() bool {
        // Snapshot the current sem channel under the read lock,
        // since Stop may set s.sem to nil at any time.
        s.rw.RLock()
        sem := s.sem
        s.rw.RUnlock()

        select {
        case sem <- struct{}{}:
            return true
        case <-s.done:
            return false
        }
    }

    func (s *Semaphore) Release() {
        s.rw.RLock()
        sem := s.sem
        s.rw.RUnlock()

        select {
        case <-sem:
        case <-s.done:
        }
    }

    func (s *Semaphore) Stop() {
        s.rw.Lock()
        defer s.rw.Unlock()

        // A nil channel blocks forever, so no new tokens can be
        // sent; closing done unblocks all current and future callers.
        s.sem = nil
        close(s.done)
    }

So, now I had a functioning semaphore type, but I had not managed to write a pure channel-based solution because of the race conditions. Thinking about the race conditions was hard and often set me to wondering, “Well, what if A and B happen at the same time? And then in the middle C? Okay, but what about A?” It was hard to keep track of how the different locks interacted, and my thoughts tended to go in circles. Notice that the unlock calls are not deferred in Acquire or Release. That’s because holding the lock until the end of the method call would block the other methods and thereby cause a deadlock. Convincing myself of that took a lot of work playing around with a sample program and the race detector.
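The sample programs were variations on a stress test like this one (a hypothetical reconstruction, but representative), run under go test -race:

    // Hammer Acquire and Release while Stop lands somewhere in the middle.
    func TestSemaphoreRace(t *testing.T) {
        s := semaphore.New(3)
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                if !s.Acquire() {
                    return // Stop won the race
                }
                s.Release()
            }()
        }
        s.Stop()
        wg.Wait()
    }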

A few weeks later, I had a new idea. To be really useful, it should be possible to have the call to Stop block until all the acquired tokens have been released. Normally, the way to do something like this would be by creating a sync.WaitGroup and calling Add, Done, and Wait as appropriate. So, I started working on this, but the race conditions kept popping up, and I couldn’t figure out how to make it work without either triggering the race detector or hitting deadlock. Finally, a new approach occurred to me. What if I created a controller go-routine and had it keep track of how many outstanding tokens there were? The new approach looked like this:

    func New(n int) *Semaphore {
        s := Semaphore{
            acquire:  make(chan bool),
            release:  make(chan struct{}),
            stop:     make(chan chan struct{}),
        }
        go s.start(n)
        return &s
    }
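Here the struct is nothing but channels; the definition is implied by the constructor:

    type Semaphore struct {
        acquire chan bool          // Acquire receives a token, or false after Stop
        release chan struct{}      // Release receives to hand a token back
        stop    chan chan struct{} // Stop sends a channel to block on
    }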

    func (s *Semaphore) start(max int) {
        count := 0 // number of outstanding tokens

        for {
            acquire := s.acquire

            // A nil channel always blocks sends, so when we're
            // at capacity, stop offering tokens to Acquire.
            if count >= max {
                acquire = nil
            }

            select {

            case acquire <- true:
                count++

            case s.release <- struct{}{}:
                count--

            case wait := <-s.stop:
                // Closing acquire makes Acquire return false forever.
                close(s.acquire)

                // Drain remaining calls to Release
                for count > 0 {
                    s.release <- struct{}{}
                    count--
                }
                close(wait)
                return
            }
        }
    }

In the new approach, start acted as a central dispatcher for the state of the Semaphore. The old approach superficially looked “Go-ish” because it used a channel, but that channel was basically just acting as a queue with a built-in lock. The Go way is to “share memory by communicating.” In the old approach, each of the methods had access to the relevant memory: the number of outstanding tokens. In the new approach, count is a variable local to the dispatcher go-routine. The other methods have to communicate with the dispatcher in order to learn the current state of the system. As a result, Acquire and Release became radically simpler:

    func (s *Semaphore) Acquire() bool {
        return <-s.acquire
    }

    func (s *Semaphore) Release() {
        <-s.release
    }

With this approach, every method on the object gets its own channel for communicating with the central dispatcher go-routine. On first consideration, one might design Acquire and Release to send a message on a channel to the dispatcher to tell it that the caller would like to acquire or release a token. But since the only information that needs to be conveyed in these cases is the very existence of a caller, a send or a receive will work just as well. The deciding factor is that a closed channel yields its zero value as many times as it is received from. So, by having Acquire receive from a chan bool, false can be quickly and efficiently returned to every caller simply by closing the s.acquire channel when Stop is called.

Stop is still a little bit complicated, since it needs to wait for a signal back from the start go-routine, but not overly so:

    func (s *Semaphore) Stop() {
        blocker := make(chan struct{})
        s.stop <- blocker
        <-blocker
    }

Stop could also have been written to use two predefined channels in the struct, one for sending the signal to the dispatcher and one for blocking until cleanup is done, but using a channel of channels has the advantage of not creating the blocker channel until it’s actually needed. Another option would have been to have Stop itself listen for all the outstanding Release calls, but I think it’s better to have all the logic for communication in one place (the dispatcher).
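For comparison, the two-predefined-channels version of Stop might have looked something like this (the stopped field name is hypothetical):

    type Semaphore struct {
        // acquire, release, etc. as before ...
        stop    chan struct{} // signals the dispatcher to shut down
        stopped chan struct{} // closed by the dispatcher when cleanup is done
    }

    func (s *Semaphore) Stop() {
        s.stop <- struct{}{}
        <-s.stopped
    }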

So, what are the overall pros and cons of this new design? One way to test this objectively is with some quick benchmarks.

At first glance, however, the benchmarks don’t look so good for the new version:

    func Benchmark(b *testing.B) {
        s := semaphore.New(1)
        for i := 0; i < b.N; i++ {
            s.Acquire()
            s.Release()
        }
    }

The results are approximately 1,400 ns/op for the new channel version vs. 400 ns/op for the old locking version. An added microsecond of overhead!

However, if one digs deeper into the benchmarks, the numbers for the channel version start to look better:

        s := semaphore.New(5)
        go s.Stop()
        for s.Acquire() {
        }

This performs virtually identically between the two versions (~2,600 ns/op). Change the number of tokens from 5 to 50, however, and suddenly the channel version has markedly better performance: it holds steady at around 2,600 ns/op, while the locking version degrades to 11,500 ns/op. I think what the benchmarks show is that the slowness of the dispatcher version comes from the time it takes to coordinate the context switch to the other go-routine. But if you’re using a semaphore, you’re probably going to be waiting on a bunch of go-routines that are blocked on I/O anyway, so it’s silly to measure the performance of locking and unlocking by itself. A central dispatcher is fast enough for most purposes.

A con of the dispatcher approach is that it adds another go-routine to the mix. If Stop is never called, that go-routine becomes a memory leak. Most uses of a semaphore will probably end up calling Stop anyway once the timeout is over, but it’s another thing to remember to do.

One concern we might have about the dispatcher approach is that all of the state of the system goes through a single centralized bottleneck. Aren’t god objects and long methods supposed to be bad? Isn’t it better to move logic out of the dispatcher’s main loop and into the methods? While that advice is very good as a general principle, when dealing with race conditions, centralization makes reasoning about the code radically simpler. When writing locking code, if you have M states and N methods, you need to think about all M states in each of the N methods, giving you an M × N code explosion. By centralizing the logic in a dispatch loop, the M states only need to be considered in one location.

This is a classic engineering tradeoff. In lock-based code, the logic that the central dispatcher funnels into one place still has to be handled, some of the time, in each method that touches the shared state, but because that logic is distributed across the methods, you can pick and choose how much of it to implement in each branch. That’s a feature, because it lets you optimize the code, but it’s also a drawback (and a source of potential bugs), because it forces you to do that optimization and makes it easy to miss that one method also needs a particular bit of logic for the synchronization to work.

On the whole, I think the easiest way to write code that you can reason about is to localize memory access to one go-routine and then synchronize in order to communicate that state out to the others. If your object has some particular bit of state that it needs to keep track of (say, the count of how many active callers there are, a set of URLs that have been downloaded, or a hash map of keys and values), try isolating it in one central go-routine and then sharing access to it through channels, one per method, that communicate with that go-routine. It’s not always the best approach, but it’s often the easiest to reason about. In other words, “Don’t communicate by sharing memory; share memory by communicating.”
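For instance, here is what the same pattern might look like wrapped around a hash map (a sketch only; all of the names are hypothetical):

    // KV is a string-to-string map owned by a single go-routine.
    // Other go-routines read and write it only by communicating.
    type KV struct {
        get chan getReq
        set chan setReq
    }

    type getReq struct {
        key   string
        reply chan string
    }

    type setReq struct {
        key, val string
    }

    func NewKV() *KV {
        kv := &KV{get: make(chan getReq), set: make(chan setReq)}
        go func() {
            m := make(map[string]string) // local to the dispatcher
            for {
                select {
                case req := <-kv.get:
                    req.reply <- m[req.key]
                case req := <-kv.set:
                    m[req.key] = req.val
                }
            }
        }()
        return kv
    }

    func (kv *KV) Get(key string) string {
        reply := make(chan string)
        kv.get <- getReq{key, reply}
        return <-reply
    }

    func (kv *KV) Set(key, val string) {
        kv.set <- setReq{key, val}
    }

As with the semaphore, a Stop method (omitted here) would be needed to shut the dispatcher down and avoid leaking its go-routine.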