Friday, November 9, 2012

Devil Hands: The Wonders of a Loosely Typed Compiler

GCC has been something that has captured my attention for years. It seems there is a set of chosen developers whom have bowed to the darklord himself, to create a tool which is just mind-blowingly fascinating. gcc takes text and makes binaries! Sure, that's what most compilers do, but gcc just towered over me with as being, almost, a pinnacle of hackery. It's just shear black magic! And I like the dark arts.

For years I have enjoyed being a software developer; however, I wanted to get my hands dirty on a lower-level than just application development. For some reason, I wanted to work on tools that others use to construct their tools (and applications). Enter the realm of compilers. Since gcc is open source, and many people and organizations rely on it, I felt I would sell my soul. After all, idle hands do the devil's work. So, I decided to put my devil hands to work.

Well, part of my wanting to get into compilers eventually led me towards a PhD... still working on it. Well, I should be working on it, but I am writing this post instead. Anyways, during a PhD meeting the other day, one of my advisers mentioned that gcc has something like over 100 different representations for a certain data structure used inside the compiler. This structure is used to represent various things about the program that gcc is compiling. Things like variables, statements, declarations, etc. This adviser is right, in fact, he is never wrong. No, really, he knows all.

Well, I have been dipping my hands in the secret sauce that is gcc for a while now. In fact, one of the reasons I wanted to get a PhD was to hack on gcc. So my compiler driven region based memory management uses GCC-plugins to accomplish the research. Anyways, I have mentioned to my advisors many times that, yes, in fact the "tree" node data structure in gcc can be anything!

So what is the big deal if a data structure, used inside the compiler, can represent anything? Isn't that polymorphism at its finest? Well... no. It means that the internals of gcc are essentially untyped. Heck, a "tree" is defined as a union consisting of 41 structures. To top it off, there is another file (tree.def) that defines about 195 (in gcc 4.7.1) different "things" a tree can be! This is all fine and dandy, if you know what you are doing. But it means one must be an 31337 ninja when writing gcc code. The danger comes in referencing data inside the tree node. If accessed improperly, the data will not have the correct meaning. While there are a handful of data structure inside a "tree" node, there are over a hundred of possibilities which use certain fields or not. And if you do not ask the type, at runtime, what the type is, and perform your reading of the data properly then *boom* you get bad data. This is what I call a type bomb. The result of a lack of typing on a data structure. In other words, you have a grey box, and you basically have two things you can do, before pulling data out of it. It has structure, but certain parts are not always populated. Either you can ask the black box what is inside it, or you can assume what is inside it. A lack of type safety makes reading data can be dangerous. Since it permits the possibility that you can read the data incorrectly. But, I like this, it makes playing with gcc kinda fun, in a masochistic way.

To reiterate, how you read data in a "tree" instance can vastly change. Suppose that you want to check that whatever the "tree" represents is a global (e.g. global variable). Well, you can use the "is_global_var()" predicate. Just pass a tree instance to it, and it will return true or false. But, one must be very cautious. See, "is_global_var()" only works on declaration nodes, and not the handful of other "things" a "tree" can represent. I can pass a variable declaration "tree" node and the predicate will produce a proper truth value. However, if I pass it a particular instance of a variable, such as given by a SSA representation, then the result is not sound. Since the predicate will always return true-or-false, a value will be returned, it just might be wrong.

Consider the following block of C code:

void foo(void)
    int x;
    x = 1;

During a walk over the statements in this code, well actually, gcc will be walking a more verbose representation, its intermediate (GIMPLE) representation. Anyways, suppose your compiler pass wants to check if the 'x' variable in the declaration statement 'int x;' is global. The "is_global_var()" will return a proper truth value. However, if you pass the 'x' in the 'x = 1;' assignment statement, the value returned by the predicate cannot be trusted. Since a tree node can be thought of as an opaque type kinda (like a "void *") you can read a "variable" type as if it were a "declaration." And there is nothing wrong with that, no compiler warning. The "is_global_var()" does not check first what type of "tree" node it was passed, and will just report the result if the node were a declaration. This is dangerous, and is what causes the lack of sanity. The type enforcement comes in the comment above the "is_global_var()":
From gcc-4.7.1:

/* Return true if T (assumed to be a DECL) is a global variable.
      A variable is considered global if its storage is not automatic.  */

TLDR; Untyped data structures are convenient, but require that the programmer be an omniscient monkey-saurus and know what is up with the data they are flinging. Tread lightly young warriors. Oh, and to get an idea of what "things" a tree can be, check here. From gcc 4.7.1.

Resources: gcc-4.7.1



  1. There is no "global" scope in C. There are four kinds of scopes: function, file, block, and function prototype. A variable that appears outside of any block or list of parameters has file scope, which terminates at the end of the translation unit.

  2. Ah, thank you for informing me. I was referring to anything that has external linkage.

  3. The access macros actually support type checking, but only when compiled in "checking" mode, not in release builds.