Before I could use ANTLR for a large production quality compiler I needed to understand how to write ANTLR grammars and work with the ANTLR parser. Rather than start right in on a large language grammar, I developed several smaller test cases. I found these very helpful in understanding ANTLR. They are published here in the hope that they will help others as well.
In a modern compiler the output of the parser is an abstract syntax tree (AST) which usually mirrors the input language. The AST is then processed by subsequent passes (perhaps transformed into other ASTs) and finally used as the basis to construct the control flow/data flow graph used for code generation. As a result, the examples all generate AST output.
The AST I use is based on a binary tree. I originally saw this type of tree used in the ANSI C compiler developed by Peter Donovan (this compiler was used by Philips for their Trimedia processor). A similar AST structure has been used for lcc. This kind of binary tree is described on page 33 (Intermediate Representations and Translation) of Language Translation Using PCCTS and C++ by Terence Parr.
The most basic form of this kind of tree is described by:
class tree_node {
public:
    const char *name;
    class tree_node *kid;
    class tree_node *sib;
}; // tree_node
An expression like
A + B

is represented by the AST
       +
      / \
   kid   sib
    /      \
   A       NULL
  / \
kid  sib
 /     \
NULL    B
In this tree A is the child (or kid) of + and B is the sibling (or sib) node of A.
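The tree above can be built directly with the tree_node class. The sketch below is my own illustration: the constructor and the build_plus_tree helper are additions for this example, not part of any ANTLR code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The minimal kid/sib tree node described above, with a convenience
// constructor added for this example.
class tree_node {
public:
    const char *name;
    tree_node  *kid;
    tree_node  *sib;
    tree_node( const char *n ) : name(n), kid(NULL), sib(NULL) {}
}; // tree_node

// Build the tree for "A + B": A is the kid of "+", B is the sib of A.
tree_node *build_plus_tree()
{
    tree_node *plus = new tree_node( "+" );
    tree_node *a    = new tree_node( "A" );
    tree_node *b    = new tree_node( "B" );

    plus->kid = a;   // first operand hangs off the kid pointer
    a->sib    = b;   // remaining operands chain through the sib pointer
    return plus;
}
```

Note that "+" never uses its own sib pointer here; an operator's operands are reached through kid, and the operands are chained to each other through sib.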
If the buildAST flag in the ANTLR parser options section is set to true, the parser created by ANTLR will read the source language and create ANTLR abstract syntax trees in memory. For example, a grammar for simple assignment expressions is shown below. Since the buildAST option is true, ASTs will be created. Note that grammar terminals followed by a ! will not be added to the AST created by ANTLR. By default ANTLR creates trees as "sibling lists".
options {
    language="Cpp";
}

class MyExprParser extends Parser;

options {
    k = 2;
    exportVocab=MyExpr;
    buildAST = true;
}

exprlist
    : ( assignment_statement )* EOF!
    ;

assignment_statement
    : assignment SEMICOLON!
    ;

assignment
    : (IDENT ASSIGN )? expr
    ;

primary_expr
    : IDENT
    | constant
    | (LPAREN! expr RPAREN! )
    ;

sign_expr
    : (MINUS)? primary_expr
    ;

mul_expr
    : sign_expr (( TIMES | DIVIDE | MOD ) sign_expr)*
    ;

expr
    : mul_expr (( PLUS | MINUS ) mul_expr)*
    ;

constant
    : (ICON | CHCON)
    ;
Click here for the complete parser and lexer grammar. The grammar, Makefile and main.cpp to print the tree structure can be downloaded as a tar file here.
The expression
a = b + c * d;
will result in the AST
   a
  / \
NULL  =
       \
        b
         \
          +
           \
            c
             \
              *
               \
                d
If the simple expression parser is built and executed:
expr < test_expr
it will print
tree traverse:
 a sib: = sib: b sib: + sib: c sib: * sib: d
 sib: p sib: = sib: r sib: * sib: s sib: + sib: t sib: * sib: u
The grammar must be annotated with tree construction operators to produce a parser that creates trees in the correct shape (that is, operators at the root, with operands as children). A somewhat more complicated expression parser can be seen here and downloaded in tar form here. Note that grammar terminals which should be at the root of a sub-tree are annotated with ^.
If this parser is built and executed it will display properly constructed trees:
tree traverse:
 = kid: a sib: + kid: b sib: * kid: c sib: d
 sib: = kid: p sib: + kid: * kid: r sib: s sib: * kid: t sib: u
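The traversal that produces this kind of printout can be sketched as a simple recursion over the kid and sib pointers. This is my own sketch of the idea, not the code from the example tar file:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A kid/sib tree node, as described earlier.
struct tree_node {
    const char *name;
    tree_node  *kid;
    tree_node  *sib;
};

// Append the tree to 'out', labeling each edge that is followed:
// " kid: " when descending to a child, " sib: " when stepping to a sibling.
void traverse( const tree_node *t, std::string &out, const char *label )
{
    if (t == NULL)
        return;
    out += label;
    out += t->name;
    traverse( t->kid, out, " kid: " );
    traverse( t->sib, out, " sib: " );
}
```

For a tree with "=" at the root, "a" as its kid, and "b" as the sib of "a", this produces the line "= kid: a sib: b".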
The tree shape control operators (for example ^ and !) can only be applied to terminal nodes.
The tree printout above is useful because it shows the "kid" and "sibling" structure of the tree. But it is not very readable, especially for large trees. After working with Peter Donovan on his ANSI C compiler, which served as an early foundation for Quickturn's SimServer Verilog compiler, I got used to Peter's style of tree printing. Peter describes the history of this as:
The print style is derived from early notes from Frank DeRemer, he likes to call it "Cambridge polish form", since he came up with it at MIT, the implementation is my own.
I have rewritten the algorithm in C++ and adapted it for ANTLR but it is largely Peter's original algorithm, which I think is rather elegant (see print_tree.hpp and print_tree.cpp or download here).
This tree print algorithm will print the two assignment expression trees above as:
<= a <+ b <* c d > > >
<= p <+ <* r s > <* t u > > >
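The output format can be sketched as a short recursion: an interior node prints as "<name children... >" and a leaf prints as its name. The code below is only my sketch of the format, not Peter's implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A kid/sib tree node, as described earlier.
struct tree_node {
    const char *name;
    tree_node  *kid;
    tree_node  *sib;
};

// Cambridge-polish-style printing: "<name child1 child2 ... > " for an
// interior node, "name " for a leaf. Children are the kid and its sibs.
void print_polish( const tree_node *t, std::string &out )
{
    if (t == NULL)
        return;
    if (t->kid != NULL) {
        out += "<";
        out += t->name;
        out += " ";
        for (const tree_node *c = t->kid; c != NULL; c = c->sib)
            print_polish( c, out );
        out += "> ";
    } else {
        out += t->name;
        out += " ";
    }
}
```

Applied to the tree for "a = b + c * d" this yields "<= a <+ b <* c d > > >", matching the first line of the printout above.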
I developed a "tiny C" grammar for ANTLR, annotated to produce trees, to better understand what a real ANTLR language front end would be like. One of the things I learned with this grammar is that debugging ANTLR grammars is not much easier than, say, debugging a YACC grammar. There was a mistake in the production describing the for-loop, but ANTLR gave a non-determinism warning for the unary minus production. I spent hours poring over my expression grammar containing the unary minus, but I could find no error in it. I looked at the section of the C++ parser created by ANTLR for expressions and it looked fine. Finally I gave up. It was not until I started working on the trees that I noticed the problem in the for-loop grammar. Perhaps a debugger would have helped.
Creating ASTs for a programming language is more complex than creating ASTs for expressions, since the grammar is larger. For example, my AST for the C style for-loop has four children: the loop initialization, the loop condition, the loop increment and the loop statement. The AST for
for (i = 0; i < 10; i = i + 1) {
    sum = sum + i;
    ave = sum / i;
}
is shown below:
<for <= i 0 >
     << i 10 >
     <= i <+ i 1 > >
     <block <= sum <+ sum i > >
            <= ave </ sum i > > > >
The ANTLR grammar for the for-loop is shown below:
for_loop
    : "for"^ loop_cntrl statement
    ;

loop_cntrl
    : LPAREN! loop_init loop_cond loop_incr RPAREN!
    ;

loop_init
    : SEMICOLON!
      { #loop_init = #([NULL_NODE, "null_init"]); }
    | assignment SEMICOLON!
    ;

loop_cond
    : SEMICOLON!
      { #loop_cond = #([NULL_NODE, "null_cond"]); }
    | expr SEMICOLON!
    ;

loop_incr
    : ()  // empty
      { #loop_incr = #([NULL_NODE, "null_incr"]); }
    | assignment
    ;
Note that the structure of the grammar productions is influenced by the desired structure of the resulting tree. For example, if trees were not being built the grammar for a for-loop might be:
for_loop
    : "for" LPAREN assignment SEMICOLON expr SEMICOLON assignment RPAREN statement
    ;
The tree construction annotation for this ANTLR grammar will construct NULL nodes for any missing elements of the grammar. For example, it is legal in C to have a loop where the loop initialization, loop condition, loop increment and loop statement are empty:
for ( ; ; ) ;
This results in a for-loop tree that has four null children.
for kid: null_init sib: null_cond sib: null_incr sib: null_stmt
My complete tiny C grammar can be seen here and a tar file containing the grammar and associated Makefile can be downloaded here.
By defining a grammar with the appropriate tree construction annotation, ANTLR will create a scanner and a parser that builds ASTs. For a simple language translator this is great because very little code has to be written to get a lot done. However, if the standard ANTLR trees and tree construction don't fit into your application, things get more complicated. For example, I like very "light weight" trees composed of simple nodes. A simple example of this kind of light weight tree class is shown here. I also want to be able to allocate these tree nodes from a memory pool that I can rapidly deallocate when I no longer need it (deallocating a memory pool is much faster than calling delete for every node in a tree). For example, a C compiler usually compiles a C program one function at a time. After a function has been compiled, its AST is no longer needed and the memory allocated for it can be reused.
Custom trees can be created during the ANTLR parse by including tree construction code in the ANTLR semantic actions (e.g., C++ code associated with the syntax rules). Building a syntax tree requires passing sub-trees "upward" through the parse. For example, the grammar for an expression like "a + b" could be expressed as
expr
    : primary_expr PLUS primary_expr
    ;

primary_expr
    : IDENT
    ;
If we are going to build a tree for this grammar, the primary_expr production needs to allocate a node and pass it upward to the expr production, which will create the tree with PLUS at the root. In the grammar below the expr production has two local variables declared in it, e1 and e2. These variables are assigned the result of the parse of the two primary_expr rules. This result will be the identifier node that is allocated in the primary_expr production.
expr returns [pNode expr_tree]
    {
      pNode e1 = NULL;
      pNode e2 = NULL;
    }
    : e1 = primary_expr PLUS e2 = primary_expr
      { expr_tree = factory.build_binary( PLUS, e1, e2 ); }
    ;

primary_expr returns [pNode id_node]
    : IDENT
      { id_node = factory.make_node( n_id ); }
    ;
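The factory used by these semantic actions might look something like the following. This is a sketch based only on the calls that appear in the grammar above; the pNode type, the node kinds and the factory methods are my assumptions, not ANTLR code or the code in the tar file:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node kinds; a real grammar would also have n_assign,
// n_times, n_icon and so on.
enum node_kind { n_id, n_plus };

// A light weight binary tree node.
struct Node {
    node_kind kind;
    Node     *left;
    Node     *right;
};
typedef Node *pNode;

class node_factory {
public:
    // Allocate a leaf node of the given kind.
    pNode make_node( node_kind kind )
    {
        pNode n = new Node;
        n->kind  = kind;
        n->left  = NULL;
        n->right = NULL;
        return n;
    }

    // Put the operator at the root with the two sub-trees as children,
    // which is what the expr rule's action does for PLUS.
    pNode build_binary( node_kind op, pNode e1, pNode e2 )
    {
        pNode n = make_node( op );
        n->left  = e1;
        n->right = e2;
        return n;
    }
}; // node_factory
```

The key point is that each rule returns the sub-tree it built, and the rule above it combines those sub-trees into a larger tree.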
The complete grammar for a simple assignment expression grammar is shown here. A tar file, containing the grammar, associated C++ code to build and print the trees and a Makefile can be downloaded here.
If this expression parser is given an input file containing
a = b + c * d;
x = -y;
p = --q;
e = f + g + h + i;
j = k * l * m * n;
q = 3 + 4 * 5;
it will print
<n_expr_list <n_assign n_id[a] <n_plus n_id[b] <n_times n_id[c] n_id[d] > > >
             <n_assign n_id[x] <n_uminus n_id[y] > >
             <n_assign n_id[p] <n_uminus <n_uminus n_id[q] > > >
             <n_assign n_id[e] <n_plus <n_plus <n_plus n_id[f] n_id[g] > n_id[h] > n_id[i] > >
             <n_assign n_id[j] <n_times <n_times <n_times n_id[k] n_id[l] > n_id[m] > n_id[n] > >
             <n_assign n_id[q] <n_plus n_icon[3] <n_times n_icon[4] n_icon[5] > > > >
This grammar is interesting because it shows how objects (trees and tree nodes) built at lower levels in the grammar can be passed upward. The problem with this grammar is that the semantic actions (the C++ code) tend to obscure the grammar, which harms its readability. For example, compare this tree construction grammar with the grammar to build ANTLR trees. Building trees via semantic actions not only makes the grammar less readable, but also takes more coding than simply letting ANTLR build the trees. The ideal case would be to redefine ANTLR tree node and tree allocation objects.
The allocation and destruction of data structures like abstract syntax trees (AST) and symbol tables is central to compiler design and implementation. It must be possible to rapidly deallocate data structures when they are no longer used. The standard C++ delete operator consumes too much computation time when it is applied to each node object in a large AST. ASTs and symbol tables can be rapidly deallocated if they are allocated from memory pools. A memory pool can be built out of chained lists of large memory blocks. Symbols and AST nodes can be allocated from these blocks as the AST and symbol table are built. When they are no longer needed the large memory blocks can be rapidly deallocated and the memory reused.
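A minimal sketch of such a pool is shown below. It assumes individual allocation requests are always smaller than the block size, and the class and method names are my own, chosen to match the GetMem function used later in this article:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>

// A minimal memory pool: small allocations are carved out of large
// malloc'ed blocks chained together; free_all() releases every block at
// once, instead of a delete call per tree node.
class memory_pool {
    struct block {
        block  *next;
        size_t  used;
        char    data[1];   // the allocation arena follows the header
    };
    enum { BLOCK_SIZE = 64 * 1024 };
    block *head;

public:
    memory_pool() : head(NULL) {}
    ~memory_pool() { free_all(); }

    // Hand out num_bytes from the current block, starting a new block
    // when the current one is full. Assumes num_bytes < BLOCK_SIZE.
    void *GetMem( size_t num_bytes )
    {
        num_bytes = (num_bytes + 7) & ~(size_t)7;   // 8-byte alignment
        if (head == NULL || head->used + num_bytes > BLOCK_SIZE) {
            block *b = (block *)malloc( sizeof(block) + BLOCK_SIZE );
            b->next = head;
            b->used = 0;
            head = b;
        }
        void *mem = head->data + head->used;
        head->used += num_bytes;
        return mem;
    }

    // Deallocate the whole pool in one pass over the block chain.
    void free_all()
    {
        while (head != NULL) {
            block *next = head->next;
            free( head );
            head = next;
        }
    }
}; // memory_pool
```

After a function has been compiled, a single free_all call releases every AST node and symbol allocated from the pool.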
The ANTLR ASTs are composed of tree nodes built from the AST class (see AST.hpp). The AST class supports the kid and sibling pointers discussed above. The AST class also points to an ASTNode object. The ASTNode object (see ASTNode.hpp) is the base class for the object containing the node information other than the kid and sibling pointers. The base class defines virtual functions to get the node type (getType) and the node name (getText). The CommonASTNode object (see CommonASTNode.hpp) is an ASTNode subclass that provides an implementation for these functions.
ANTLR trees can be pool allocated if the new operator is overridden for the ASTNode class and the AST class. Note that any sub-class will inherit the overridden new operator. The overridden new operator uses C++ "placement new", which allows an allocation function or object to be passed to new. In this case it is a simple function that allocates memory and returns a void * pointer.
from antlr/cpp/antlr/ASTNode.hpp

... existing antlr includes go here ...
#include <assert.h>

class ASTNode {
public:
    void *operator new( size_t num_bytes )
    {
        assert( FALSE );
        return NULL;
    }

    //
    // Overloaded new
    //
    void *operator new( size_t num_bytes, void *(GetMemFx)( size_t ) )
    {
        return GetMemFx( num_bytes );
    }

    ... the rest of the ASTNode definition ...
Here the standard new operator is redefined to assert (issue a runtime error) if it is called, and an overloaded "placement new" operator is defined in its place. The redefined new operator can be used by the sub-class CommonASTNode to allocate ANTLR AST nodes:
from CommonASTNode.cpp

static void *GetMem( size_t num_bytes )
{
    void *new_mem;

    new_mem = malloc( num_bytes );
    return new_mem;
} // GetMem

ASTNode* CommonASTNode::factory()
{
    return new( GetMem ) CommonASTNode;
}
The new operator must also be redefined this way for the AST class. Once this is done, the allocation code in the ASTFactory class can be slightly rewritten to use placement new:
from ASTFactory.cpp

static void *GetMem( size_t num_bytes )
{
    void *new_mem;

    new_mem = malloc( num_bytes );
    return new_mem;
} // GetMem

/** Create a new empty AST node; if the user did not specify
 *  an AST node type, then create a default one: CommonAST.
 */
RefAST ASTFactory::create()
{
    AST *pAST;
    ASTNode* node = nodeFactory(); // new CommonASTNode();

    node->setType(Token::INVALID_TYPE);
    pAST = new( GetMem ) AST( node );
    return RefAST( pAST );
}
I have only used the memory allocation function here to show how a custom placement new operator can be created. Obviously this function is not doing pool allocation. To implement memory pool allocation, a pool object should be passed as the argument to placement new. Instead of calling the C library malloc function, new might invoke a pool class function like GetMem:
void *operator new( size_t num_bytes, memory_pool &pool )
{
    return pool.GetMem( num_bytes );
}
The abstract syntax tree (AST) is probably the core data structure of a compiler. The compiler front end builds the AST; the middle passes of the compiler perform tree to tree transformation and control flow graph construction from the AST. As a result, I want direct control over the definition of the AST. At the same time, it would be nice if ANTLR tree construction could be used to build the AST during the parse, since the grammar would not be obscured by a lot of semantic action code.
A custom AST node can be defined by sub-classing the ASTNode class (just as the CommonASTNode class does) and rewriting the ASTFactory class to, for example, create special nodes for constants, identifiers and non-terminals. This customization, plus memory pool allocation would give me the structure I need for a production compiler.
There are problems with this approach, however. The custom code is embedded in the ANTLR support code (e.g., the code in the antlr/cpp and antlr/cpp/antlr directories). It is possible that future versions of ANTLR will change this support code, making it incompatible with the custom tree building code. Anyone who modifies or maintains the custom code must also understand the structure and operation of the ANTLR AST classes and the ASTFactory class. This is a lot of overhead when the only gain is a clearer grammar.
Another alternative would be to build pool allocated ANTLR trees and then construct a custom AST by traversing the ANTLR trees. When the custom AST is created the ANTLR trees could be discarded. This would use twice the memory needed to create the custom trees directly. For languages like C, C++ or Java this extra memory use might not be a problem, since system memory is so large on modern computers. On the other hand, if the compiler is given automatically generated input (from a compiler that compiles a language into C, for example), which can create huge functions, the extra memory usage could cause the compiler to fail. Problems could also arise compiling Hardware Design Languages (HDLs) like Verilog or VHDL. Especially in the case of Verilog, large amounts of intermediate information must be kept in memory during the compile. Doubling the memory use by using an ANTLR intermediate would be unacceptable, since HDL compilers tend to be memory limited for large circuit designs.
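That second pass might look something like the sketch below, assuming both trees use kid/sib pointers. The node types here are stand-ins for illustration, not the actual ANTLR classes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-ins for the ANTLR tree node and the custom "light weight" node.
struct antlr_node { const char *text; antlr_node *kid; antlr_node *sib; };
struct my_node    { const char *text; my_node    *kid; my_node    *sib; };

// Walk the ANTLR tree and build a parallel custom tree; once this returns,
// the ANTLR tree (and the memory pool it was allocated from) can be
// discarded.
my_node *convert( const antlr_node *t )
{
    if (t == NULL)
        return NULL;

    my_node *n = new my_node;
    n->text = t->text;     // a real compiler would copy this into its pool
    n->kid  = convert( t->kid );
    n->sib  = convert( t->sib );
    return n;
}
```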
ANTLR generated trees are great for simple language translators. But the complexity of redefining the AST and AST construction (in the ASTFactory class) has forced me to conclude that using ANTLR's tree construction is not practical for a production quality compiler for languages like C, C++ or Java. For the reasons outlined above I believe that it is more desirable to build the trees directly from the grammar using semantic action code (see the expression example above).
Ian Kaplan, August 1, 1999