Pages

Friday, October 1, 2010

Google Go: A Brief Stroll Down GC Lane

The following is a quick summary of the Google compiler, gc, for their language Go. I have recently been hooked onto Go as it is a relatively new language, and my research interests are compilers. Not to mention I have been looking for a compiler to hack on for my PhD research. Oddly enough, I caught a talk about Go where Rob Pike, Go evangelist and developer, gave an awesome presentation on the language, which kinda lured me in. Plus, to benefit my cause, the source is free and open, so I decided to take a stroll down the source code lane to see how Go churns one's source code into an executable.

Note, this is my interpretation of the gc compiler, I cannot say it is 100% accurate, nor am I reviewing the compiler. Since there is relatively little public documentation on how the internals of gc tick, I figured this exploration would be a good learning experience for me. Maybe someone else will find this somewhat useful.

TL;DR: Cool compiler for a cool language Go which focuses on memory safety and easy implementation of parallelization. GC is based on Plan 9 with their specific assembly and object file format.

The following is a brief outline of how a Google Go source file is compiled by the gc compiler taken from the Google repository at revision 6277:ed9fe47b342d (September 14, 2010)

Quick Overview

The gc compiler primarily functions from two spots in the Go source tree, the general non-architecture-specific code in go/src/cmd/gc and the specific architecture based code in the go/src/{5g,6g,8g} directory. The other portions of the build phase are also located in the respected 5l/6l/8l and 5a/6a/8a directories for the linker and assembler. The 5/6/8 prefix denotes the specific architectures ARM, x86 64-Bit, or x86 32-Bit respectively. All three of the aforementioned tools (compiler, assembler, linker) make up the build process.

The gc compiler is derived from the Bell Labs Plan 9 operating-system project. Heck, why start from scratch when part of the Plan 9 team now works at Google :-) Anyways, the main focus of this post is on the compiler, but some mention of the linker will also be made. The Plan 9 errrr go tools (compiler, assembler, and linker) are written in C. What does this mean? Well, for one, the gc compiler for Go is not self-hosting, it does not build itself, but there is good reason for this, mainly bootstrapping and to support the online community who can simply bootstrap with an existing C compiler, e.g. gcc [5]. Most users probably already have the latter.

There is also a GCC front-end for Go, gccgo, which can be obtained from the gcc project's source repository. Consider that since gccgo does utilize the gcc framework, it should be able to use a set of optimizations that gcc provides for their intermediate representation. However, this post focuses on the gc compiler, developed by Google.

From Source to Binary

The system adheres to a common build phase:
Compile --> Assemble (Plan 9 Object Files) --> Link the object files.

Object Files

The object files are not pure binary output from the compiler, rather they have some human-readable text as well. The format is specific to Plan 9 and is not documented properly (publically) as far as I am aware. The information in the object files, later to be assembled and linked, allow for link-time-optimization, roughly known as whole program optimization, or cross translation-unit/compilation-module optimization. I am unaware to the degree of cross-module optimization that takes place, as the Plan 9 Compiler document [2] states: "The loader [linker] is a multiple pass program that reads object files and libraries and produces an executable binary. The loader also does some minimal optimizations and code rewriting."

From Source to Object File

Since this post is mainly concerned with what the compiler is doing, and mostly ignoring the linking phase now, the compiler begins its work in main(), of course, remember, it is written in C. Anyways, src/lib9/main.c is where all the magic kicks-off. From there the source file(s) must be lexed, which is built on the lex/flex and yacc/bison tools. The parser and lexer do their jobs to process the source-code into data that the compiler will use to manipulate and generate object files in three passes. Those being interpreted from the commented source as:
  1. The compiler first processes the "const, type, and names and types of funcs".
  2. Then, in the second pass, the compiler handles "variable assignments."
  3. Finally, the thrid pass processes function bodies.
As mentioned, the final pass is when the functions are processed. This processing starts with a call to the funccompile() routine. The body of the latter is defined in src/cmd/gc/dcl.c and then calls the architecture-specific compile() routine. compile() is specific to either the 5g or 6g compiler. The remainder of this overview assumes the x86-64 6g compiler.

The compiler builds up an internal representation, e.g syntax tree, built of Nodes. The program as a whole is a NodeList (list of Node structures). The Node structure is a mammoth generic structure that represents all constructs in the language. A function body is represented as a Node with internal pointers to more nodes defining the other pieces of that routine. For instance, a statement can be represented as a NodeList, the Nodes in that list are the various components that makeup the statement walkret() in go/src/cmd/gc/walk.c shows a traversal across the function body to find the return statement for that function.

But before we get too far, the typecheck() routine should first be examined. This routine is called for all three passes and occurs before the compiler jumps to the architecture-specific call to the compile() routine. The typecheck() routine acts on a single Node that it is passed. typecheck() first checks to determine if the Node represents an IOTA (in essence an enumerated value). If it is such, its value is changed to a constant. String literals and other constructs are further qualified, and some error checking occurs, such as array boundaries. Basically, further resolution from what the parser gives us and safety checks.

So, after some type-checking is complete by the three passes, it's time to do more architecture-specific compiling via the compile() routine. The width of the construct represented by the Node is first calculated (e.g. data types like float, int, etc). If the value is a string, a flag is flipped here making note of this needed runtime check, similarly the same occurs for some cases of arrays. Next, the arguments are setup for functions.

The walk() routine in go/src/cmd/gc/walk.c is then called which further processes the body of the function including the return statement. Unused variables are also checked. The walk{stmt,expr,switch,select}() routines are called on the function body. The walks add more data to the Nodes that comprise the ultimate representation of the function being compiled. Walking also adds calls and variables as needed. The uniqueness of how the language is implemented and works stem from these walks.

The final phase of compile() is a call to heapmoves() which, according to the source code, "migrates" the funcion input/output values "between the stack and the heap." The latter calls the paramstoheap() routine (also in src/cmd/walk.c), where the function arguments (input) are walked. This portion of compilation "generates" the "code to allocate copies of escaped parameters to the heap." Finally, after the latter call returns, and just before compile() exits, the output arguments of the function being processed (return arguments), are copied to the stack.

*whew* now the compile() is complete, and the object code is assembled in a Go/Plan9 syntax that is dumped to output. To begin this object code production, a call from main() (after funccompile()) in go/src/cmd/gc/lex.c to dumpobj() in go/src/cmd/gc/obj.c is executed. Some of the routines responsible for emitting the object code are 5g/6g compiler specific, for instance dumpdata() and dumpfuncs() located in go/src/cmd/{5g,6g}/gobj.c. The code that is internally assembled by the 5g/6g compiler can be viewed via the compiler's -S argument( e.g. ./6g my_source.go -S). The object file basically contains that information in a binary format, with some metadata for linking.

References

  1. Google Go Source Code,
    https://go.googlecode.com/hg (source code)
    Revision 6277:ed9fe47b342d (September 14, 2010)

  2. Ken Thompson, Plan 9 C Compilers,
    http://lsub.org/sys/doc/compiler.ps

  3. Plan 9 Man Pages (9c, 2a)

  4. Google Group for Go:
    http://groups.google.com/group/golang-nuts/

  5. Google Go FAQ:
    http://www.golang.org/doc/go_lang_faq.html

  6. Plan 9 From Bell Labs (Website)
    http://plan9.bell-labs.com/plan9/

-Matt

No comments:

Post a Comment