Lab 14: Performance Tuning

Time: 45 minutes | Level: Advanced | Docker: docker run -it --rm golang:1.22-alpine sh

Overview

Master Go runtime performance tuning: GOGC, GOMEMLIMIT, GOMAXPROCS, goroutine stack growth, zero-copy I/O with net.Buffers, TCP socket options, and systematic benchmarking with -benchmem.


Step 1: GOMAXPROCS — CPU Parallelism

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func cpuIntensiveWork(n int) float64 {
	result := 0.0
	for i := 0; i < n; i++ {
		result += float64(i) * 1.0001
	}
	return result
}

func parallelWork(procs int) time.Duration {
	runtime.GOMAXPROCS(procs)
	var wg sync.WaitGroup
	start := time.Now()

	numTasks := runtime.NumCPU() * 2
	for i := 0; i < numTasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cpuIntensiveWork(1_000_000)
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Printf("NumCPU: %d\n", runtime.NumCPU())
	fmt.Printf("Current GOMAXPROCS: %d\n", runtime.GOMAXPROCS(0))

	// GOMAXPROCS controls how many OS threads run Go code concurrently
	// Default: runtime.NumCPU()
	// Set via env: GOMAXPROCS=4 go run main.go
	// Set programmatically:
	old := runtime.GOMAXPROCS(4)
	fmt.Printf("Set GOMAXPROCS=4 (was %d)\n", old)
	runtime.GOMAXPROCS(old) // restore
}

Step 2: GOGC Tuning


Step 3: GOMEMLIMIT (Go 1.19+)


Step 4: Goroutine Stack Growth


Step 5: Benchmark Comparison


Step 6: TCP Socket Options


Step 7: GC Stats + Memory Limits Demo

📸 Verified Output:


Step 8: Capstone — Systematic Performance Analysis

📸 Verified Benchmark Output:

Analysis:

  • concatPlus: 99 allocs/op — quadratic allocation (string is immutable, every += copies)

  • concatBuilder: 1 alloc/op — 20x faster, ~48x fewer allocations

  • concatBuffer: 2 allocs/op — good, but Builder is slightly better for pure string building


Summary

Knob
How to Set
Effect

GOMAXPROCS

runtime.GOMAXPROCS(n)

CPU parallelism

GOGC

debug.SetGCPercent(n)

GC frequency vs memory

GOMEMLIMIT

debug.SetMemoryLimit(n)

Prevent OOM, control GC

TCP_NODELAY

conn.SetNoDelay(true)

Low-latency writes

SO_KEEPALIVE

conn.SetKeepAlive(true)

Dead connection detection

net.Buffers

buffers.WriteTo(conn)

Zero-copy scatter-gather

-benchmem

go test -benchmem

Measure allocs/op

Key Takeaways:

  • GOMEMLIMIT (Go 1.19) is the preferred way to control memory — beats GOGC alone

  • GOMAXPROCS defaults to NumCPU — rarely needs tuning

  • strings.Builder is 20x faster than += string due to zero intermediate allocations

  • TCP_NODELAY trades throughput for latency — enable for interactive protocols

  • net.Buffers uses writev for efficient multi-buffer writes without copying

  • Always benchmark before optimizing — measure, don't guess

Last updated