# Beating C with 70 Lines of Go
Chris Penner’s recent article, Beating C with 80 Lines of Haskell, has generated quite some controversy over the Internet, and it has since turned into a game of trying to take on the venerable `wc` with different languages.
Today, we will be pitting Go against `wc`. Being a compiled language with excellent concurrency primitives, it should be trivial to achieve comparable performance to C.
While `wc` is also designed to read from `stdin`, handle non-ASCII text encodings, and parse command line flags (manpage), we will not be doing that here. Instead, like the article mentioned above, we will focus on keeping our implementation as simple as possible.
The source code for this post can be found here.
## Benchmarking & comparison
We will use the GNU time utility to compare elapsed time and maximum resident set size.
```shell
$ /usr/bin/time -f "%es %MKB" wc test.txt
```
We will use the same version of `wc` as the original article, compiled with gcc 9.2.1 and `-O3`. For our own implementation, we will use go 1.13.4 (I did try `gccgo` too, but the results were not very promising). We will run all benchmarks with the following setup:
- Intel Core i5-6200U @ 2.30 GHz (2 physical cores, 4 threads)
- 4+4 GB RAM @ 2133 MHz
- 240 GB M.2 SSD
- Fedora 31
For a fair comparison, all implementations will use a 16 KB buffer for reading input. The input will be two us-ascii encoded text files of 100 MB and 1 GB.
## A naïve approach
Parsing arguments is easy, since we only require the file path:
```go
if len(os.Args) < 2 {
	panic("no file path specified")
}
filePath := os.Args[1]

file, err := os.Open(filePath)
if err != nil {
	panic(err)
}
defer file.Close()
```
We’re going to iterate through the text bytewise, keeping track of state. Fortunately, in this case, we require only 2 states:
- The previous byte was whitespace
- The previous byte was not whitespace
When going from a whitespace character to a non-whitespace character, we increment the word counter. This approach allows us to read directly from a byte stream, keeping memory consumption low.
```go
const bufferSize = 16 * 1024
reader := bufio.NewReaderSize(file, bufferSize)

lineCount := 0
wordCount := 0
byteCount := 0

prevByteIsSpace := true
for {
	b, err := reader.ReadByte()
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}

	byteCount++

	switch b {
	case '\n':
		lineCount++
		prevByteIsSpace = true
	case ' ', '\t', '\r', '\v', '\f':
		prevByteIsSpace = true
	default:
		if prevByteIsSpace {
			wordCount++
			prevByteIsSpace = false
		}
	}
}
```
To display the result, we will use the native `println()` function - in my tests, importing the `fmt` library caused a ~400 KB increase in executable size!

```go
println(lineCount, wordCount, byteCount, file.Name())
```
Let’s run this:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-naive` | 100 MB | 0.77 s | 1416 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-naive` | 1 GB | 7.69 s | 1416 KB |
The good news is that our first attempt has already landed us pretty close to C in terms of performance. In fact, we’re actually doing better in terms of memory usage!
## Splitting the input
While buffering I/O reads is critical to improving performance, calling `ReadByte()` and checking for errors in a loop introduces a lot of unnecessary overhead. We can avoid this by manually buffering our read calls, rather than relying on `bufio.Reader`.
To do this, we will split our input into buffered chunks that can be processed individually. Fortunately, to process a chunk, the only thing we need to know about the previous chunk (as we saw earlier) is if its last character was whitespace.
Let’s write a few utility functions:
```go
type Chunk struct {
	PrevCharIsSpace bool
	Buffer          []byte
}

type Count struct {
	LineCount int
	WordCount int
}

func GetCount(chunk Chunk) Count {
	count := Count{}

	prevCharIsSpace := chunk.PrevCharIsSpace
	for _, b := range chunk.Buffer {
		switch b {
		case '\n':
			count.LineCount++
			prevCharIsSpace = true
		case ' ', '\t', '\r', '\v', '\f':
			prevCharIsSpace = true
		default:
			if prevCharIsSpace {
				prevCharIsSpace = false
				count.WordCount++
			}
		}
	}

	return count
}

func IsSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r' || b == '\v' || b == '\f'
}
```
Now, we can split the input into `Chunk`s and feed them to the `GetCount` function.
```go
totalCount := Count{}
lastCharIsSpace := true

const bufferSize = 16 * 1024
buffer := make([]byte, bufferSize)

for {
	bytes, err := file.Read(buffer)
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}

	count := GetCount(Chunk{lastCharIsSpace, buffer[:bytes]})
	lastCharIsSpace = IsSpace(buffer[bytes-1])

	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
```
To obtain the byte count, we can make one system call to query the file size:
```go
fileStat, err := file.Stat()
if err != nil {
	panic(err)
}
byteCount := fileStat.Size()
```
Now that we’re done, let’s see how this performs:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-chunks` | 100 MB | 0.34 s | 1404 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-chunks` | 1 GB | 3.31 s | 1416 KB |
Looks like we’ve blown past `wc` on both counts, and we haven’t even begun to parallelize our program yet.[^1] `tokei` reports that this program is just 70 lines of code!
## Parallelization
Admittedly, a parallel `wc` is overkill, but let’s see how far we can go. The original article reads from the input file in parallel, and while it improved runtime, the author does admit that performance gains due to parallel reads might be limited to only certain kinds of storage, and would be detrimental elsewhere.

For our implementation, we want our code to be performant on all devices, so we will not be doing this. We will set up two channels, `chunks` and `counts`. Each worker will read and process data from `chunks` until the channel is closed, and then write the result to `counts`.
```go
func ChunkCounter(chunks <-chan Chunk, counts chan<- Count) {
	totalCount := Count{}
	for chunk := range chunks {
		count := GetCount(chunk)
		totalCount.LineCount += count.LineCount
		totalCount.WordCount += count.WordCount
	}
	counts <- totalCount
}
```
We will spawn one worker per logical CPU core:
```go
numWorkers := runtime.NumCPU()

chunks := make(chan Chunk)
counts := make(chan Count)

for i := 0; i < numWorkers; i++ {
	go ChunkCounter(chunks, counts)
}
```
Now, we run in a loop, reading from the disk and assigning jobs to each worker:
```go
const bufferSize = 16 * 1024
lastCharIsSpace := true

for {
	buffer := make([]byte, bufferSize)
	bytes, err := file.Read(buffer)
	if err != nil {
		if err == io.EOF {
			break
		} else {
			panic(err)
		}
	}
	chunks <- Chunk{lastCharIsSpace, buffer[:bytes]}
	lastCharIsSpace = IsSpace(buffer[bytes-1])
}
close(chunks)
```
Once this is done, we can simply sum up the counts from each worker:
```go
totalCount := Count{}
for i := 0; i < numWorkers; i++ {
	count := <-counts
	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
close(counts)
```
Let’s run this and see how it compares to the previous results:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-channel` | 100 MB | 0.27 s | 6644 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-channel` | 1 GB | 2.22 s | 6752 KB |
Our `wc` is now a lot faster, but there has been quite a regression in memory usage. In particular, notice how our input loop allocates memory at every iteration! Channels are a great abstraction over sharing memory, but they aren’t always the best tool for the job.
## Better parallelization
In this section, we will allow every worker to read from the file, and use `sync.Mutex` to ensure that reads don’t happen simultaneously. We can create a new `struct` to handle this for us:
```go
type FileReader struct {
	File            *os.File
	LastCharIsSpace bool
	mutex           sync.Mutex
}

func (fileReader *FileReader) ReadChunk(buffer []byte) (Chunk, error) {
	fileReader.mutex.Lock()
	defer fileReader.mutex.Unlock()

	bytes, err := fileReader.File.Read(buffer)
	if err != nil {
		return Chunk{}, err
	}

	chunk := Chunk{fileReader.LastCharIsSpace, buffer[:bytes]}
	fileReader.LastCharIsSpace = IsSpace(buffer[bytes-1])

	return chunk, nil
}
```
We then rewrite our worker function to read directly from the file:
```go
func FileReaderCounter(fileReader *FileReader, counts chan Count) {
	const bufferSize = 16 * 1024
	buffer := make([]byte, bufferSize)
	totalCount := Count{}

	for {
		chunk, err := fileReader.ReadChunk(buffer)
		if err != nil {
			if err == io.EOF {
				break
			} else {
				panic(err)
			}
		}
		count := GetCount(chunk)
		totalCount.LineCount += count.LineCount
		totalCount.WordCount += count.WordCount
	}

	counts <- totalCount
}
```
Like earlier, we can now spawn these workers, one per CPU core:
```go
fileReader := &FileReader{
	File:            file,
	LastCharIsSpace: true,
}
counts := make(chan Count)

for i := 0; i < numWorkers; i++ {
	go FileReaderCounter(fileReader, counts)
}

totalCount := Count{}
for i := 0; i < numWorkers; i++ {
	count := <-counts
	totalCount.LineCount += count.LineCount
	totalCount.WordCount += count.WordCount
}
close(counts)
```
Let’s see how this performs:
| | input size | elapsed time | max memory |
| --- | --- | --- | --- |
| `wc` | 100 MB | 0.58 s | 2052 KB |
| `wc-mutex` | 100 MB | 0.12 s | 1580 KB |
| `wc` | 1 GB | 5.56 s | 2036 KB |
| `wc-mutex` | 1 GB | 1.21 s | 1576 KB |
Our parallelized implementation runs at more than 4.5x the speed of `wc`, with lower memory consumption! This is pretty significant, especially if you consider that Go is a garbage collected language.
## Conclusion
While in no way does this article imply that Go > C, I hope it demonstrates that Go can be a viable alternative to C as a systems programming language.
If you have suggestions, questions, or complaints, feel free to drop me a mail!
[^1]: How is this possible? Being a garbage-collected language with a runtime, one would not expect Go to beat C in a fair test. Upon closer inspection of the source code, we see the main reason this is happening - `wc` is calling `iswspace()` in a loop, which is built to support wide characters as well as different locales. Our implementation uses a much more naïve comparison, and changing the C code to reflect this gives us a major performance boost there too!