Skip navigation

Monthly Archives: April 2010

It is quite interesting to see how automated tools can speed up development, not only in the context of professional development but also when you just want to do stuff ( or for assignments :() I recently took a graduate compilers course, which was quite a lot of work for me, but it was actually because i didnt know the right tools to provide the right abstractions. So, now i will try to explain how to use CUP/JFlex to get a quick and clean compiler up and running.


The job of a lexer is to read the input files as a stream of characters and produce a stream of tokens. Each token has an associated value which provides a higher level function. So, just as an example, instead of having the later stages deal with 33+43 as a stream of characters, the lexer would say that this is a stream of 3 tokens: INT1 + INT2 , where INT1 has value 33 and INT2 has value 43. JFlex is a tool to generate a lexer using a standard set of rules, and it makes the job really easy. You can find more information about JFlex from There are 3 pars of a JFlex file:

  1. User code: This is the code you want to be copied as such in your Lexer java file. Generally this would include the imports and packages you would need in your lexer code.
  2. Options and declarations: The user code section is ended with %% on a line. After that, you can specify options for the JFLex that determine how the lexer code is generated. Generally you would have the following set of options:
    • %cup directs JFLex to generate code that would be used by CUP parser (more on this later).
    • %class ClassName directs JFlex to name the generated lexer as ClassName
    • %line enables you to access the line of the token via yyline
    • %column is similar to %line, and you can access the column via yycolumn
    • %unicode specifies the character set of the source file, generally, unicode is used.
    • %debug is an optional flag that is very useful if your parser (not you lexer, but even for a lexer) is failing and you cannot tell which token it fails on. It will generate debug information about each token as it is passed on to the parser so you can tell exactly what the problem is.
  3. And following these is optional user code. Here you would include code that you would like to include in the class definition of the lexer. So, any data members you would like to use the lexer to maintain would be declared here. You can also have member functions. One of the most common things to do here is to include
  private Symbol symbol(int type) {
    return new Symbol(type, yyline, yycolumn);
  private Symbol symbol(int type, Object value) {
    return new Symbol(type, yyline, yycolumn, value);

The two functions above are used in the lexer code to return symbols of the appropriate type and value. These would return instances of java_cup.runtime.Symbol when invoked by CUP based on the type of the symbol and any associated values with it. Having these in the declaration sections helps prevent editing the generated lexer since it would be ready to use once generated. All the code in this section is to be included in %{ and %} appearing on separate lines by themselves.
Following this is the macro declaration section, where you would define macros for terminals. These are regular expressions which would be used in the actual lexical rules later. The reason for having separate macros is to make the lexical rules more readable. These cannot be recursive (nor mutually recursive) and would be reported as errors by JFlex. A macro is specified by a name for the macro, followed by a = sign, followed by the regular expression the macro name should replace. For instance :

LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]
Identifier = [:jletter:] [:jletterdigit:]*

Represents some macros for the different white spaces. You can make use of predefined classes to make the macros shorter. For instance, a common definition for identifiers is give above as well. This definition makes use of predefined classes jletter and jletterdigit.
The final part of this section is used to define the state variables which represent the state of the lexer should maintain. This can be any number of state variables separated by comma. The definition starts with %state followed by any number of comma separated state variables. So a common example would be to have a STRING state, which would be set upon entry to a double quoted region, and cleared upon exit. This way, the contents inside the string are not parsed as actual tokens, instead treated as a string.
3. Lexical rules.
This is the core of the lexer, since all we have done so far is to set up how the lexer actually does stuff, and how we want it to facilitate other tools (CUP). Now we will define the actual rules used to convert the stream of characters to a stream of tokens. There are two types of lexical rules; those that require the lexer to be in a particular state, and those that don’t.
• First, let us see how we can toggle between states. This is done by calling the function yybegin(STATE). So, when reading a double quote, to enter into STRING state, you would have a rule as follows:

\" { string.setLength(0); yybegin(STRING); }
•	For rules that require the lexer to be in a particular state, you would have that state precede the rule. So, if you just want to concatenate all the characters you read into a buffer once you are in STRING state (remember, upon entering a region enclosed by double quotes we would enter STRING state). The rule would be

\"  { yybegin(YYINITIAL);
      return symbol(sym.STRING_LITERAL,
            string.toString());    }
[^\n\r\"\\]+ { string.append( yytext() ); }
\\t { string.append('\t'); }
\\n { string.append('\n'); }
\\r { string.append('\r'); }
\\\"{ string.append('\"'); }
\\  { string.append('\\'); }

• For rules that should be applied regardless of the state of the lexer, you need not specify an initial state.
Initially the lexer is in YYINITIAL state, and all the rules that should be applied by the lexer are included in a block.
One thing that you might like to do is to store values for different tokens as they are packaged as a single token by the lexer. This is useful for instance in a parser where you would like some information about the actual token text. To do this you can pair the symbol identifier (type) with the value of the text. For instance, in the case of an identifier, the rule would be

{Identifier} {return symbol(sym.IDENTIFIER, new String(yytext())); }

And in your parser (more on this later) you would access the “value” of the token.
That was how you would basically put up a lexer.
Now that we have a lexer up and running, we can move forward to generate a parser that uses the tokens generated by the lexer to verify if the sentence is indeed from a grammar of expression. Keep in mind that there would be two regular expressions here, one for defining the tokens themselves, and another that describes how the tokens can be combined to produce larger constructs.


CUP is a recursive descent parser generator. It can be integrated with JFlex very easily as described above. Now we will take a look at how to specify the grammar to CUP in a nice way. A CUP specification has four parts.

  1. Preliminary and miscellaneous declarations specifying how parser is to be generated, and supply parts of runtime code. This includes import and package statements, initialization code as well as linking up with the scanner. When using with JFlex, you do not need to specify the initialization code, or scanner code since the %cup flag in flex file takes care of that.
  2. Declaring terminals and non-terminals and associate classes with them.Here you would declare the terminals and non terminals to be used in your grammar. The terminals are defined first, followed by the non-terminals. For each terminal/non-terminal you can also define the type of object/value associated with that terminal/non-terminal. This type would determine what return value the parser would expect for each of these terminals/non-terminals.
  3. Precedence and associativity of terminals. Here you would define the precedence and associativity of the tokens (for instance operators). One key thing to do here is to add an associativity rule for “ELSE” tokens that handles the dangling else.
  4. Grammar. This is the core of the parser since it specifies how the non-terminals are expressed in terms of other non-terminals or terminals. You would also place semantic actions (such as building the AST) in this section.

Now we shall consider an extended example of how to write the CUP specification for a parser. As described above, the first part is trivial, and you would generally include the following statements in it.

import java_cup.runtime.*;
import compiler.ast.*;
import java.util.*;

This list would include all the packages you would use in your parser, as well as the package statement if desired. Now in the second section, you would describe the terminals and non-terminals. For instance:

terminal String IDENTIFIER;

non terminal  Node program, procedure_declaration;
non terminal  Type type;
non terminal  Node procedure_heading, procedure_body, statement_sequence;
non terminal  Node identifier, formal_parameter, formal_parameter_section;
non terminal  Node statement, assignment_statement, procedure_call, if_statement;
non terminal  Node while_statement, expression, expression_list, designator;
non terminal  Node constant_expression, constant_declaration;
non terminal  Node structure_type, field_list;
non terminal  Node equality_expression, simple_expression, term, factor, dummystart;
non terminal  Node comparator_operator, factor_operator, else_statement, expression_listS;
non terminal String identifier_arrayS, identifier_array;

non terminal LinkedList variable_declaration;

Here, you would describe the type of the node (terminal or non terminal) and then the class associated with that node. This class is used by CUP to return appropriate objects for each node. In my opinion, it is easier to first build an acceptor (leaving the classes empty) and then add actions. This way you can be sure that the grammar is correct, and can add actions to it later (which would do something with the program).
Now that you have the terminals and non-terminals, you would specify any precedence/associativity rules for the grammar. This is done as follows;

precedence left PLUS, MINUS;
precedence left ASTERICK, FSLASH, MOD;
precedence right ELSE;

The last precedence rule is required to handle the dangling else problem.
Now we come to the core of the parser specification. This part has the grammar in the form of productions. Each production (or rule) can have an action associated with it. This section starts with an optional “start with” statement directing CUP which is the start non-terminal. If you want the top level non-terminal to be PROGRAM_NODE, you would have a statement start with PROGRAM_NODE;. The basic form of the rule is as follows:
Non-terminal ::= some_node some_node | some_node some_node ;
So, for instance, for a program you would have something like

program::= var_const_declarationS procedure_declarationS;
var_const_declarationS::= constant_declarationS variable_declarationS;
            procedure_declarationS procedure_declaration;

The last rule for procedure_declarationS shows an example of an epsilon production, where the non-terminal goes to a null node. To extend this example so we have some actions executed by the parser when it actually applies the reduction (in other words, when it successfully applies a rule) we can associate semantic actions with the rules, one with each rule. So the above grammar now becomes:

program::= var_const_declarationS procedure_declarationS  {: System.out.println("Program"); :};
var_const_declarationS::=constant_declarationS variable_declarationS {: System.out.println("Const-Var Decls"); :};
procedure_declarationS::= {: System.out.println(“End of ProcedureList”); :}
            |procedure_declarationS procedure_declaration {: System.out.println("ProcedureList"); :};

This would associate java code snippets specified between the {: :} that is executed when the rule is applied by the parser. So, good, but when you actually want to do something useful with the semantic actions (instead of just printing out which rule was applied) you need to refer to the terminals and non-terminals in the rule. Suppose you wanted to create a list of all the variables declared in a given type declaration, so for some statement like
int a, b, c[100];
You would like to associate a, b, and c with integer types. To do this you would keep a list of all the values, represented by an identifier_list. The (incomplete) rules for such a production would be as follows:

              variable_declarationS variable_declaration {: System.out.println("VarsList"); :};
variable_declaration ::=
              type identifier_list SEMICOLON {: System.out.println ("VarDecl"); :}  ;
constant_declarationS ::=
              constant_declarationS constant_declaration {::};
identifier_array ::= IDENTIFIER identifier_arrayS {:System.out.println("Array/ID");:};
identifier_arrayS ::=
            |  LSQPAREN constant_expression RSQPAREN {:System.out.println("Index");:};
               identifier_array identifier_listS {::};
identifier_listS ::=
            |  COMMA identifier_list {: System.out.println ("ID-Arry"); :}   ;

type ::=      SYM_INT|SYM_LONG {: System.out.println("Type"); :};
constant_declaration ::=
              CONST type IDENTIFIER EQ constant_expression SEMICOLON {: System.out.println ("ConstDecl"); :}   ;

The rule of interest, that would illustrate the problem is
identifier_array ::= IDENTIFIER identifier_arrayS {:System.out.println(“Array/ID”);:};
Since you are creating an array of identifiers, you would want to store the identifier in a data structure as well as the remaining portion of the sentence in that data structure. This would mean you would need some way to reference the terminals/non-terminals in the production to build up a resultant value; need to access the values of IDENTIFIER and identifier_arrayS to compute the values of identifier_array. The problem is more subtle when you have a recursive rule such as
constant_declarationS ::=
constant_declarationS constant_declaration {::};
where you have the same non-terminal nodes on both sides of the production, so there has to be some way to distinguish between the non-terminal on the left and on the right. To do this, we can associate names with the terminals and non-terminals in the right hand side of the production. The non-terminal on the left has the special name RESULT, and this is the value returned by the semantic action. So, now all our rules would have the form

                  RESULT = new SomeObject();
              non_terminal_2:n_t_2 non_terminal_3:n_t_3
                   RESULT = new SomeObject(n_t_2.getValue(), n_t_3.getValue());

It should be noted that the type of RESULT was defined in the second section (where you also defined the terminal and non-terminals along with their class types). You do not need to add return statement since these are automatically inserted by the parser. You can refer to the objects on the right hand sides as if they were actual objects in Java, so in the above example, getValue are methods of the associated classes. The variable declaration example can now be rewritten as follows :

                  System.out.println("End VarsList");
              variable_declarationS:lst variable_declaration:v_decl

variable_declaration ::=
              type:t_id identifier_list:id_l SEMICOLON
                  LinkedList r = new LinkedList();
                  VarDecl n;
                  Iterator li = id_l.iterator();
                      n = new VarDecl(t_id, (String)(;
                      System.out.println("Added "+n);
                  RESULT = r;
                  System.out.println ("#############VarDecl "+id_l);
constant_declarationS ::=
              constant_declarationS constant_declaration {::};

identifier_array ::= IDENTIFIER:id identifier_arrayS:id_a
                  RESULT =  id + id_a;
                  System.out.println("Array/ID"+id+" "+id_a);
identifier_arrayS ::=
                    RESULT= new String();
            |  LSQPAREN constant_expression:c_e RSQPAREN
                    RESULT = "[EXP]";
               identifier_array:id_a identifier_listS:id_l
                   RESULT = id_l;
identifier_listS ::=
                   RESULT = new LinkedList();
            |  COMMA identifier_list:id_l
                   RESULT = id_l;
                   System.out.println ("ID-Arry");

Using the above notation, you can basically build up a whole parser. The approach I think makes more sense is to first write an acceptor and then annotate the productions with semantic actions. This makes the job easier since you know you have the right set of rules before you get into the semantic actions.
To test your build, you can have a script do the following :

 java -jar ../jflex/lib/JFlex.jar Lexer.flex
 java -jar ../cup/java-cup-11a.jar Parser.cup
 javac -cp ../cup/java-cup-11a.jar:../jflex/lib/JFlex.jar:.
 for d in $(find  examples/ -name *.c |sort); do
    echo "Processing " $d
    java -cp ../cup/java-cup-11a.jar:../jflex/lib/JFlex.jar:. Tester $d 2>&1

Where is a simple java program that calls the parser and its parse method.
So, that’s all for now, looking forward to your comments on where I can add things to make it more clearer.