Practical Go compiler (GC) optimizations

Update: After publishing this article, it was pointed out to me that there is a builtin function for clearing slices and maps.

The compiler has some neat optimizations to make our code run faster where possible. There's a whole list of them, but these are some of my favorites. I originally came across them while reading mistake #96 of the book 100 Go Mistakes and How to Avoid Them.

1. Map lookup by []byte

This is possibly my most used optimization, and the one I remember best, as it comes up often enough. Small things like this get me excited. As I was reading about it, I wanted to see why it is faster. As the compiler wiki explains, the optimization avoids allocating on the heap. If you are interested, you can have a look at the optimized method here.

There are three instances where the compiler will choose not to allocate:

  • Used for m[T1{... Tn{..., string(k), ...} ...}] and m[string(k)]
    where k is []byte, T1 to Tn is a nesting of struct and array literals.
  • Used for "<"+string(b)+">" concatenation where b is []byte.
  • Used for string(b)=="foo" comparison where b is []byte.

It's faster when variables are not allocated on the heap and stay on the stack. When a stack frame is popped, its memory is simply marked as free and can be overwritten, i.e. it effectively clears itself. Variables on the heap, on the other hand, need to be reclaimed by the garbage collector, which is a lot more expensive.

The one I have used the most is the first, especially m[string(k)] where k is a []byte. To be honest, I only found out about the second and third as I was looking through the doc just now; they are good to know too.
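To make the first case concrete, here is a minimal sketch (the map and key names are my own, purely illustrative). The conversion inside the index expression is the pattern the compiler recognizes:

```go
package main

import "fmt"

func main() {
	counts := map[string]int{"foo": 1, "bar": 2}

	// k is a []byte, e.g. a field sliced out of a network buffer.
	k := []byte("foo")

	// The string conversion happens directly inside the index
	// expression, so the compiler can look the key up without
	// allocating a new string on the heap.
	v := counts[string(k)]
	fmt.Println(v) // 1

	// By contrast, binding the converted string to a variable first
	// is a real conversion, which may allocate.
	s := string(k)
	fmt.Println(counts[s]) // 1
}
```

You can confirm the difference yourself with a benchmark and `testing.AllocsPerRun`, or by reading the compiler's escape analysis output via `go build -gcflags="-m"`.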

The opposite also works: ranging over []byte(s), where s is a string. The following conversion will not allocate either:

s := "foo"
for i, c := range []byte(s) {
    // ...
}

2. Function Inlining

According to Wikipedia:

In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function. Inline expansion is similar to macro expansion, but occurs during compilation, without changing the source code (the text), while macro expansion occurs prior to compilation, and results in different text that is then processed by the compiler.

The compiler has a few criteria for inlining:

  1. The function should be simple; the wiki defines simple as code with fewer than 80 AST nodes. This threshold is also known as the inlining budget.

You can check whether your function can be inlined by passing -gcflags="-m" to go build. If your function cannot be inlined, the compiler will say something like:
cannot inline foo: function too complex: cost xx exceeds budget 80

  2. The function doesn't contain complex constructs like closures, defer, recover, select, etc.

  3. The function isn't marked with the //go:noinline directive.

  4. The function isn't marked with //go:uintptrescapes, since the escape information would be lost during inlining.

  5. The function has a body.
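As a quick sketch, a function like the one below is comfortably under the budget (the function name is mine, for illustration only). Building it with `go build -gcflags="-m"` should report it as inlinable, though the exact wording of the output varies between Go versions:

```go
package main

import "fmt"

// add is trivially simple, well under the 80-node inlining budget,
// and contains none of the disqualifying constructs (closures,
// defer, recover, select), so the compiler will inline it.
func add(a, b int) int {
	return a + b
}

func main() {
	// After inlining, this call site is replaced by the body of add.
	fmt.Println(add(2, 3)) // 5
}
```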

This optimization is done automatically by the compiler. It allows for better instruction caching due to improved spatial locality (I've gone through caches in a previous article). It also lets the compiler decide whether a variable needs to be allocated on the heap or can stay on the stack.

You might be wondering: if it's done automatically by the compiler, why should I care? Before Go 1.9, only leaf functions were considered for inlining, but newer versions of Go allow mid-stack inlining. So if you have a function with a fast path and a slow path, you can move the slow path into its own function so that the fast path gets inlined. This improves performance, especially if the fast path is hit often enough, allowing for more cache hits.

I saw this yesterday as I was trying to find out why the mutex Lock method is considered read-like.

// Lock locks m.
// If the lock is already in use, the calling goroutine
// blocks until the mutex is available.
func (m *Mutex) Lock() {
	// Fast path: grab unlocked mutex.
	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
		if race.Enabled {
			race.Acquire(unsafe.Pointer(m))
		}
		return
	}
	// Slow path (outlined so that the fast path can be inlined)
	m.lockSlow()
}

3. Optimized memclr using clear

The last one is a simple one. Your program may require you to clear the slice or map you were using, either to reuse the data structure or to avoid leaking stale data.

Since v1.21.0, there is a builtin function called clear that deletes all entries from a map and resets the elements of a slice to their zero values.

You can read more about clear here.
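A minimal sketch of clear in action, on both a map and a slice:

```go
package main

import "fmt"

func main() {
	// clear on a map deletes every entry, leaving an empty map.
	m := map[string]int{"a": 1, "b": 2}
	clear(m)
	fmt.Println(len(m)) // 0

	// clear on a slice sets every element to its zero value;
	// the length and capacity are unchanged.
	s := []int{1, 2, 3}
	clear(s)
	fmt.Println(s) // [0 0 0]
}
```

Note the asymmetry: for maps the entries are removed entirely, while for slices the elements are kept but zeroed.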

For versions prior to 1.21.0

Since Go 1.5, the gc compiler recognizes loops that range over a slice or array and reset each element to its zero value:

for i := range a {
	a[i] = zero
}

Provided the evaluation of a is free of side effects, the compiler replaces these loops with calls to memclr.
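A concrete instance of the pattern above (the slice name is mine, for illustration): a straightforward loop that writes the zero value at every index, which the compiler can recognize and lower to a single memclr call.

```go
package main

import "fmt"

func main() {
	buf := []int{1, 2, 3, 4}

	// This loop matches the memclr pattern: it ranges over buf and
	// assigns the zero value at each index, with no side effects in
	// the evaluation of buf.
	for i := range buf {
		buf[i] = 0
	}
	fmt.Println(buf) // [0 0 0 0]
}
```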