Writing a Compiler 1: Expressions and Intermediate Language

Recently I’ve been getting increasingly broody to write a compiler. They’ve always been something of a mystery to me and, while I’ve always fancied writing one, I’ve never had a convincing reason to do so. But, since I’ve returned to retro computing, I can see definite limitations in the options currently available for writing software for 8-bit machines.

Goals

I’ll write at some future point about my visions for the language. For now, I’ll state the following goals:

  • The language will be based on Pascal, but with enhancements to make it more suitable for use on low-end machines.
  • It will be coded in Delphi (which is an IDE and libraries based around Pascal).
  • It will target the Z80 (it could, possibly, also output some form if Intermediate Language, similar to P-Code, which could the be executed by a run-time either on the Z80 or other systems).
  • It will initially run as a cross-compiler, running on Windows and targeting the Z80.
  • It should be able to self compile. That will, eventually, mean it will be able to compile itself. At that point I will also have a version which runs on the Z80 and outputs Z80 code. The self-hosted version will almost certainly lose some functionality in order to be fast enough and compact enough to be practical when running on the Z80. I’d hope to confine this loss of functionality to features such as optimisations rather than core language syntax.

The need to self-compile means I’m writing this in mostly ‘old-school’ Pascal syntax and dropping things such as objects. I am, however, re-using old code (such as my file parser class) where I already have it to hand, and using some modern data structures to save time. I can update the code to replace these when I move towards the self-compiling version.

Beyond that I’ll state two disclaimers:

  1. I am not a ‘professional’ compiler writer – don’t treat this as a lesson in how you should write a compiler. It’s more a diary of what I’m discovering along the way.
  2. The code at the moment is mostly exploratory. You can think of it as rapid prototyping. I want to explore the necessary techniques and get an overview of how to write every stage of a compiler. I’ll return later to fill in the detail and refactor as necessary. In particular the syntax of any test code is not final, and error handling of the syntax is going to be pretty rough and ready at this stage.

Having said that, I have one eye on the fact that I will probably bash this code into shape for the final compiler, rather than completely rewriting it. So I’m trying to write things in a way that can be reasonably translated to work efficiently on a Z80, and including a reasonable amount of error handling.

And one final point – there’s a lot of jargon in compiler writing. I don’t find most of it very helpful or meaningful so I won’t be using much of it. Instead of terms such as ‘syntactical analysis’ and ‘lexical analysis’ I’ll use phrases such as ‘parsing the source code’. In general I’ll be describing practical algorithms and source code rather than abstract mathematical concepts.

Expression Evaluation

A compiler comprises a lot of moving parts. A lot of them – such as syntax parsing – are fairly straight forward grunt-work. Others – such as expression parsing and code generation – are a mysterious and feel complex and difficult. On the basis that I want to understand the difficult bits I’m starting with the expression parser.

Have a look at an expression such as:

a + b * c - 4

and you’ll see that the b * c part needs to be executed before the addition. In other words, we can’t just work from left to right generating code as we go. We need to re-order the expression.

I’ve written a few expression evaluators over the years – code which reads in an expression and calculates the answer, similar to an interpreter. The normal method is to create a stack for operands and a stack for operators. You work through the expression from left to write pushing items onto the stacks as you go. Taking the expression above, by the time you get to the subtraction, you’ll have the following:

Operands: a b c
Operators: + *

However, before you push each operand you compare it’s precedence with the current top-of-stack (TOS) operator. If the new operator is of higher precedence you push it on the operator stack – as we did with the multiply. If the new operator is the same or lower precedence than the TOS then you pop the TOS operator and the two TOS operands, perform the operation, and push the result back onto the operand stack.

So, when we read in the subtract operator we pop the multiply and b and c and the stacks look like this:

Operands: a %1
Operators +

Where %1 is shorthand for the intermediate result of the multiplication.

We now, again, compare the precedence of the subtraction operator with the TOS. This time the TOS operator is the addition. Addition and subtraction are both the same precedence so again we pop the operator (addition) and the top two operands and end up with:

Operands: %2
Operators:

with the %2 referring to the second intermediate result.

Since the operator stack is now empty we can continue parsing the input and pushing onto the stacks to get:

Operands: %2 c
Operators: -

We’ve reach the end of the input, which we treat as the lowest possible precedence, and we now work our way back through through stacks popping operators and operands for each operation in turn. In this case we do the subtraction and get the final result.

Stacks are simple data structures, and our code is simply looping through the input. All pretty simple stuff.

Intermediate Language

But, how do we go about rewriting the expression and turning it into code? And, given my end goals, how do you do with the minimum of code and using only basic data structures.

I’ll be rewriting the code into an Intermediate Language (IL), which could also be called Intermediate Representation (IR). This will be a set of primitives representing the operations to be performed at each stage. Primitives will be things such as ‘addition’, ‘multiplication’, etc which can be fairly easily mapped to assembly language. As mentioned earlier, this IL code could also be directly executed by a run-time1.

This IL code will be stored in binary data structures. I’ll be using a textual representation in these articles for convenience. A primitive will take the form:

destination = operation operand, operand

Where,

  • Destination is where the result will be placed. In an expression this will be an intermediate result23. This is essentially the same as the temporary results above referred to with the %n syntax.
  • Operation is … erm … the operation
  • The two operands are the values passed into the operation. These could be literal values, variables, or intermediate results.

Every every intermediate result will be given a unique identifier (%n). This allows the lifetimes of values to be analysed and will be used later during optimisation4 and code generation5.

As for variables, every value written to a variable will be given a unique ‘version number’, for the same reasons as the indexing of intermediate results. These values are normally written as subscripts after the variable name and are usually referred to as subscripts.

Thus I ended up with the following data structure:

type TILItem = record
    Op: TILOperation;       //The operation
    DestLoc: TILLocation;
    DestData: Integer;      //Destination (a temp result index). -1 if operation has no result
    DestSub: Integer;

    Param1Loc: TILLocation; //Location of parameter 1
    Param1Data: Integer;    //Data for parameter 1
    Param1Sub: Integer;     //Subscript for parameter 1 (i.e. the variable write index)

    Param2Loc: TILLocation; //Location for parameter 2
    Param2Data: Integer;    //Data for parameter 2
    Param2Sub: Integer;     //Subscript for parameter 2 (i.e. the variable write index)
  end;

The ‘Data’ items contain either a literal value, an index into the list of variables, or the index (identifier) of an intermediate result.

Expression Rewriting

We’re now ready to parse and rewrite the expression into this Intermediate Language. The trick we need to use is recursion.

The algorithm I’ve written processes items in pairs – an operand and the following operator (or ‘no operator’ at the end of the expression). I’m referring to each of these blocks as a ‘slug’ because I don’t have a better term. The example above gets broken up as follows:

| a + | b * | c - | 4 |

where the | marks the bounds of each slug.

The code reads the first two pairs as the left and write slugs. Thus:

Left: a +
Right: b *

It compares the precedence of the two operators. Multiplication is higher precedence than addition, and therefore needs to be performed first. Thus the code recurses onto itself. The old right slug now becomes the left slug, and a new right slug is read:

Left: b *
Right: c -

This time the right slug operator is lower precedence than the left. The code writes the first line of IL:

%1 = multiply b, c

It then combines this intermediate result and the old right hand operator and returns it to the caller. The caller now has:

Left: a +
Right: %1 -

This time the two operators have the same precedence. The following IL data is produced:

%2 = add a, %1

The left slug is updated, and a new right slug is read:

Left: %2 -
Right: 4 <no operator>

No operator indicates the end of the expression and, for convenience, is handled internally by assigning it a lower precedence and any ‘actual’ operator. Thus another line of IL is generated:

? = subtract %2, 4

and returned to the caller. There’s no destination data here, that will need to be assigned by the caller. For example, if the code is a variable assignment then the caller must add the relevant data for that variable. If the code is part of the parameter list to a function call the value will need to be placed on the relevant stack, and so on for other code constructs.

Therefore the algorithm, in outline, is (for every pair of slugs):

  • If the right slug is higher precedence -> recurse with right slug as left
  • If the priorities are the same -> output the IL for the left slug and the operand from the right slug.
  • Update the left slug with the intermediate result and operator from the right slug.
  • If the right slug is lower precedence -> output IL as if the priorities were the same, return to the caller with the right slug updated to reflect the intermediate result.

In practice the task of assigning intermediate result indexes is assigned to the caller, on the basis that the ultimate destination of the expression needs to be assigned by whatever code called the expression parser.

Examples

A couple of examples of the parser/IL generator in action should help to show what’s happening. The first is for the expression example above, with the result being assigned to a variable.

Source:

a:=a+b*c-4

Output:

%1 = Multiply %b_0, %c_0
%2 = Add %a_0, %1
%a_1 = Subtract %2, 4

Note the subscripts on variable references, here represented with an ‘_n’ suffix.

And a more complex example:

Source:

a:=a*2+(b+1)*(b-1)

Output:

%1 = Multiply %a_0, 2
%2 = Add %b_0, 1
%3 = Subtract %b_0, 1
%4 = Multiply %2, %3
%a_1 = Add %1, %4

Contrast this with a similar expression but where we add the two blocks in parenthesis, instead of multiplying them.

Source:

a:=a*2+(b+1)+(b-1)

Output:

%1 = Multiply %a_0, 2
%2 = Add %b_0, 1
%3 = Add %1, %2
%4 = Subtract %b_0, 1
%a_1 = Add %3, %4

Source Code

Read the source code here. Note that this is a work in progress and the code may have changed by the time you read it.

Footnotes

  1. In practice some of the data will only be needed at compile time and could be removed if creating a separate executable.
  2. Which could be placed into register, pushed on the stack, or saved to memory in a hidden variable.
  3. As the code gets expanded beyond expressions, the entire syntax will get rewritten into these primitives. At that stage the destination could also be a variable. I’ll talk more about writing to variables later.
  4. For example values which are calculated but never used can be eliminated as ‘dead code’.
  5. For example we’ll nee to allocates values to registers, and if we run out of registers we’ll need to ‘spill over’ values to the stack or temporary variables. We’ll probably want to use registers for values which are used shortly after, those values not used until later are the ones to store on the stack or in hidden variables.

One Reply on “Writing a Compiler 1: Expressions and Intermediate Language”

Comments are closed.