# Beating C with 70 Lines of Go
Chris Penner’s recent article, Beating C with 80 Lines of Haskell, has generated quite some controversy over the Internet, and it has since turned into a game of trying to take on the venerable `wc` with different languages.
Today, we will be pitting Go against `wc`. Being a compiled language with excellent concurrency primitives, it should be trivial to achieve comparable performance to C.
While `wc` is also designed to read from `stdin`, handle non-ASCII text encodings, and parse command line flags (manpage), we will not be doing that here. Instead, like the article mentioned above, we will focus on keeping our implementation as simple as possible.
The source code for this post can be found here.
## Benchmarking & comparison
We will use the GNU time utility to compare elapsed time and maximum resident set size.
```shell
$ /usr/bin/time -f "%es %MKB" wc test.txt
```
We will use the same version of `wc` as the original article, compiled with gcc 9.2.1 and `-O3`. For our own implementation, we will use go 1.13.4 (I did try `gccgo` too, but the results were not very promising). We will run all benchmarks with the following setup:
- Intel Core i5-6200U @ 2.30 GHz (2 physical cores, 4 threads)
- 4+4 GB RAM @ 2133 MHz
- 240 GB M.2 SSD
- Fedora 31
For a fair comparison, all implementations will use a 16 KB buffer for reading input. The input will be two us-ascii encoded text files of 100 MB and 1 GB.
## A naïve approach
Parsing arguments is easy, since we only require the file path:
```go
if len(os.Args) < 2 {
	panic("no file path specified")
}
filePath := os.Args[1]

file, err := os.Open(filePath)
if err != nil {
	panic(err)
}
defer file.Close()
```
We’re going to iterate through the text bytewise, keeping track of state. Fortunately, in this case, we require only 2 states:
- The previous byte was whitespace
- The previous byte was not whitespace
When going from a whitespace character to a non-whitespace character, we increment the word counter. This approach allows us to read directly from a byte stream, keeping memory consumption low.
```go
const bufferSize = 16 * 1024
reader := bufio.NewReaderSize(file, bufferSize)

lineCount := 0
wordCount := 0
byteCount := 0

prevByteIsSpace := true
for {
	b, err := reader.ReadByte()
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}

	byteCount++

	switch b {
	case '\n':
		lineCount++
		prevByteIsSpace = true
	case ' ', '\t', '\r', '\v', '\f':
		prevByteIsSpace = true
	default:
		if prevByteIsSpace {
			wordCount++
			prevByteIsSpace = false
		}
	}
}
```
To display the result, we will use the native `println()` function - in my tests, importing the `fmt` library caused a ~400 KB increase in executable size!

```go
println(lineCount, wordCount, byteCount, file.Name())
```
Let’s run this:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-naive` | 100 MB | 0.77 s | 1416 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-naive` | 1 GB | 7.69 s | 1416 KB |
The good news is that our first attempt has already landed us pretty close to C in terms of performance. In fact, we’re actually doing better in terms of memory usage!
## Splitting the input
While buffering I/O reads is critical to improving performance, calling `ReadByte()` and checking for errors in a loop introduces a lot of unnecessary overhead. We can avoid this by manually buffering our read calls, rather than relying on `bufio.Reader`.
To do this, we will split our input into buffered chunks that can be processed individually. Fortunately, to process a chunk, the only thing we need to know about the previous chunk (as we saw earlier) is if its last character was whitespace.
Let’s write a few utility functions:
```go
type Chunk struct {
	PrevCharIsSpace bool
	Buffer          []byte
}

type Count struct {
	LineCount int
	WordCount int
}

func GetCount(chunk Chunk) Count {
	count := Count{}

	prevCharIsSpace := chunk.PrevCharIsSpace
	for _, b := range chunk.Buffer {
		switch b {
		case '\n':
			count.LineCount++
			prevCharIsSpace = true
		case ' ', '\t', '\r', '\v', '\f':
			prevCharIsSpace = true
		default:
			if prevCharIsSpace {
				prevCharIsSpace = false
				count.WordCount++
			}
		}
	}

	return count
}

func IsSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r' || b == '\v' || b == '\f'
}
```
Now, we can split the input into `Chunk`s and feed them to the `GetCount` function.
```go
totalCount := Count{}
lastCharIsSpace := true

const bufferSize = 16 * 1024
buffer := make([]byte, bufferSize)

for {
	bytes, err := file.Read(buffer)
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}

	count := GetCount(Chunk{lastCharIsSpace, buffer[:bytes]})
	lastCharIsSpace = IsSpace(buffer[bytes-1])

	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
```
To obtain the byte count, we can make one system call to query the file size:
```go
fileStat, err := file.Stat()
if err != nil {
	panic(err)
}
byteCount := fileStat.Size()
```
Now that we’re done, let’s see how this performs:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-chunks` | 100 MB | 0.34 s | 1404 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-chunks` | 1 GB | 3.31 s | 1416 KB |
Looks like we’ve blown past `wc` on both counts, and we haven’t even begun to parallelize our program yet.[^1] `tokei` reports that this program is just 70 lines of code!
## Parallelization
Admittedly, a parallel `wc` is overkill, but let’s see how far we can go. The original article reads from the input file in parallel, and while it improved runtime, the author does admit that performance gains due to parallel reads might be limited to only certain kinds of storage, and would be detrimental elsewhere.

For our implementation, we want our code to be performant on all devices, so we will not be doing this. We will set up two channels, `chunks` and `counts`. Each worker will read and process data from `chunks` until the channel is closed, and then write the result to `counts`.
```go
func ChunkCounter(chunks <-chan Chunk, counts chan<- Count) {
	totalCount := Count{}
	for chunk := range chunks {
		count := GetCount(chunk)
		totalCount.LineCount += count.LineCount
		totalCount.WordCount += count.WordCount
	}
	counts <- totalCount
}
```
We will spawn one worker per logical CPU core:
```go
numWorkers := runtime.NumCPU()

chunks := make(chan Chunk)
counts := make(chan Count)

for i := 0; i < numWorkers; i++ {
	go ChunkCounter(chunks, counts)
}
```
Now, we run in a loop, reading from the disk and assigning jobs to each worker:
```go
const bufferSize = 16 * 1024
lastCharIsSpace := true

for {
	buffer := make([]byte, bufferSize)
	bytes, err := file.Read(buffer)
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}
	chunks <- Chunk{lastCharIsSpace, buffer[:bytes]}
	lastCharIsSpace = IsSpace(buffer[bytes-1])
}
close(chunks)
```
Once this is done, we can simply sum up the counts from each worker:
```go
totalCount := Count{}
for i := 0; i < numWorkers; i++ {
	count := <-counts
	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
close(counts)
```
Let’s run this and see how it compares to the previous results:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-channel` | 100 MB | 0.27 s | 6644 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-channel` | 1 GB | 2.22 s | 6752 KB |
Our `wc` is now a lot faster, but there has been quite a regression in memory usage. In particular, notice how our input loop allocates memory at every iteration! Channels are a great abstraction over sharing memory, but they aren’t always the best tool for the job.
## Better parallelization
In this section, we will allow every worker to read from the file, and use `sync.Mutex` to ensure that reads don’t happen simultaneously. We can create a new `struct` to handle this for us:
```go
type FileReader struct {
	File            *os.File
	LastCharIsSpace bool
	mutex           sync.Mutex
}

func (fileReader *FileReader) ReadChunk(buffer []byte) (Chunk, error) {
	fileReader.mutex.Lock()
	defer fileReader.mutex.Unlock()

	bytes, err := fileReader.File.Read(buffer)
	if err != nil {
		return Chunk{}, err
	}

	chunk := Chunk{fileReader.LastCharIsSpace, buffer[:bytes]}
	fileReader.LastCharIsSpace = IsSpace(buffer[bytes-1])

	return chunk, nil
}
```
We then rewrite our worker function to read directly from the file:
```go
func FileReaderCounter(fileReader *FileReader, counts chan Count) {
	const bufferSize = 16 * 1024
	buffer := make([]byte, bufferSize)
	totalCount := Count{}

	for {
		chunk, err := fileReader.ReadChunk(buffer)
		if err != nil {
			if err == io.EOF {
				break
			} else {
				panic(err)
			}
		}
		count := GetCount(chunk)
		totalCount.LineCount += count.LineCount
		totalCount.WordCount += count.WordCount
	}

	counts <- totalCount
}
```
Like earlier, we can now spawn these workers, one per CPU core:
```go
fileReader := &FileReader{
	File:            file,
	LastCharIsSpace: true,
}
counts := make(chan Count)

for i := 0; i < numWorkers; i++ {
	go FileReaderCounter(fileReader, counts)
}

totalCount := Count{}
for i := 0; i < numWorkers; i++ {
	count := <-counts
	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
close(counts)
```
Let’s see how this performs:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-mutex` | 100 MB | 0.12 s | 1580 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-mutex` | 1 GB | 1.21 s | 1576 KB |
Our parallelized implementation runs at more than 4.5x the speed of `wc`, with lower memory consumption! This is pretty significant, especially if you consider that Go is a garbage collected language.
## Conclusion
While in no way does this article imply that Go > C, I hope it demonstrates that Go can be a viable alternative to C as a systems programming language.
If you have suggestions, questions, or complaints, feel free to drop me a mail!
[^1]: How is this possible? Being a garbage-collected language with a runtime, one would not expect Go to beat C in a fair test. Upon closer inspection of the source code, we see the main reason this is happening - `wc` is calling `iswspace()` in a loop, which is built to support wide characters as well as different locales. Our implementation uses a much more naïve comparison, and changing the C code to reflect this gives us a major performance boost there too!