Building an Interpreter, Part 2: The Parser

In part 1 I explained how the first phase of an interpreter, the lexical analysis, transforms a given piece of code (in our case in the source language Nand2Tetris HDL) into a list of tokens. In this part I will turn that list of tokens into a tree form that is easier to reason about. The tool in charge of this transformation is called a parser.

The Phrasal Syntax of HDL

Before I can go into the details of the second phase of an interpreter—the syntactical analysis or simply parsing—I have to define the phrasal syntax of HDL. Just like the lexical syntax dictates how tokens are made of single characters, the phrasal syntax is a recipe of how to build phrases in HDL.

In the following I will show you the lexical syntax as a little reminder followed by the regular expressions that represent the phrasal syntax of HDL.

The lexical syntax of HDL
The phrasal syntax of HDL

I highlighted tokens identified by the scanner during the lexical analysis and also reused the syntactical categories id and num (which also produce tokens of the same name). Please note that in general regular expressions do not suffice to model the phrasal syntax of a programming language. You will most likely need some kind of grammar. If you want to know more, you may read up on the Chomsky hierarchy. I was lucky that HDL is rather simple and a declarative language. That’s why I got away with regular expressions only.

Now let’s take some small HDL script and have a look at the syntax tree produced by the rules above.

CHIP Not {
  IN in;
  OUT out; // True iff a[0]=a[1]
  PARTS:
    Nand(a=in, b=in, out=out);
}
chip}partspart;)connectionsconnectionpinExpidout=pinExpidout,connectionpinExpidin=pinExpidb,connectionpinExpidin=pinExpida(idNand:PARTS;pinspinStmtidoutOUT;pinspinStmtidinIN{idNotCHIP

This syntax tree represents a piece of code and allows me to annotate certain parts of that code with information. For example, so far I have not distinguished between pin identifiers that appear in the IN or OUT sections and the PARTS section. I can now do so and actually have done so by calling them pinStmt (pin statements inside IN/OUT) and pinExp (pin expressions inside PARTS), respectively.

I think of a syntax tree as a smarter code representation than just the list of tokens I had before. As a matter of fact, the syntax tree above might be a wee bit too smart. A lot of the encoded information is actually not needed. I can dumb this down a bit.

From Concrete to Abstract Syntax

While the concrete syntax dictates how a language’s phrases are built out of tokens (by means of the phrasal syntax) and characters (by means of the lexical syntax), the abstract syntax answers the question, “What are the significant parts of a phrase?” It then describes those significant parts in tree form. The resulting tree is called an abstract syntax tree or simply AST.

Let’s take a simple example to nail down the differences between concrete and abstract syntax. Imagine a very simple language that allows us to add numbers. The sum expression could have several different concrete syntaxes: For example, the common infix syntax 2 + 3 or the prefix/Polish syntax + 2 3.

The parse trees for the concrete syntax of infix and prefix notation

No matter which concrete syntax you look at, the abstract syntax is the same, though. The significant parts of a sum expression are the operator and its two operands.

Plus32
The AST is the same for both concrete productions

When I look at the phrasal syntax of HDL, it is clear what the significant parts of each syntactical category are. For example, a pinExp is really either a string (the pin’s identifier), or a string and a number (in case it’s a bus), or a string and two numbers (if it’s a sub-bus). If I take this up to the top, I understand that a chip is nothing more than an indentifier, two lists of pin statements (input and output), and a list of parts.

The grammar for the abstract syntax of HDL can be written down as Reason type declarations with variant constructors.

type pinStmt =
  | PinDecl(string)
  | BusDecl(string, int);

type pins = list(pinStmt);

type pinExp =
  | Pin(string)
  | Bus(string, int)
  | SubBus(string, int, int);

type connection =
  | Conn(pinExp, pinExp);

type connections = list(connection);

type part =
  | Part(string, connections);

type parts = list(part);

type chip =
  | Chip(string, pins, pins, parts);

If I take the HDL code bit from above and apply this grammar I get the following AST.

Chip[]Part[]ConnPin“out”Pin“out”ConnPin“in”Pin“b”ConnPin“in”Pin“a”“Nand”[]PinDecl“out”[]PinDecl“in”“Not”

This looks a lot cleaner and more managable, yet I still got all the information I need to evaluate this bad boy—excellent! The next and final step is turning all the type declarations from above into methods that create ASTs in code.

Implementing the Parser

I already mentioned that I am quite “lucky” with HDL being a simple language. That is also the reason why the parser looks a lot like the scanner I implemented in part 1. The parser will take a list of tokens and turn that into an AST. It does its job by matching the list of tokens against the tokens the grammar for the concrete syntax expects, e.g. a chip has to start with the keyword CHIP, followed by an identifier, followed by an LBRACE, and so on.

Each type declared above will get a matching function that can produce all possible type variants while checking that the concrete syntax is valid.

Note that this parser is called top-down because I will start parsing with the chip function and make my way downwards by until all functions are done and return the values of the leaves, i.e. strings and numeric values.

let chip = (ts) => {
  switch (ts) {
  | [CHIP, ID(chipName), LBRACE, IN, ...tr] => {
    switch (pins(tr)) {
    | (astIn, [SEMIC, OUT, ...tr]) => {
      switch (pins(tr)) {
      | (astOut, [SEMIC, PARTS, COLON, ...tr]) => {
        switch (parts(tr)) {
        | (astParts, tr) => (Chip(chipName, astIn, astOut, astParts), tr)
        }
      }
      | _ => raise(SyntaxError("Expected SEMIC PARTS COLON", tr))
      }
    }
    | _ => raise(SyntaxError("Expected SEMIC OUT", tr))
    }
  }
  | _ => raise(SyntaxError("Expected CHIP ID() LBRACE IN", ts))
  }
};

The chip function assumes the presence of certain tokens dictated by the concrete syntax and extracts the sub-ASTs for its input pins, output pins, and parts. It ultimately returns a Chip variant. The functions for parts, parts, connections, and so on are omitted for brevity here, but you can play around with the parser and the full code on sketch.sh.

The only thing left to do for the parser function is calling chip and checking that there are no remaining tokens.

let parser = (ts) => {
  switch (chip(ts)) {
  | (ast, []) => ast
  | _ => raise(SyntaxError("Unnecessary code after chip", ts))
  }
};

let code = "
CHIP Eq {
  IN a, b;
  OUT out; // True iff a=b
  PARTS:
    Xor(a=a, b=b, out=uneq);
    Not(in=uneq, out=out);
}";

code |> scanner |> parser;
/*
Chip(
  "Eq",
  [
    PinDecl("b"),
    PinDecl("a")
  ],
  [
    PinDecl("out")
  ],
  [
    Part(
      "Not",
      [
        Conn(Pin("out"), Pin("out")),
        Conn(Pin("in"), Pin("uneq"))
      ]
    ),
    Part(
      "Xor",
      [
        Conn(Pin("out"), Pin("uneq")),
        Conn(Pin("b"), Pin("b")),
        Conn(Pin("a"), Pin("a"))
      ]
    )
  ]
)
 */

Note that although the code contains the line Xor(a=a, b=b, out=uneq); the resulting AST has the order of the connections reversed. On a closer look even the parts themselves and the pin declarations are in reverse. This does not matter, though. I am once again relying on the declarative nature of the HDL language—there is not explicit order for things. If I would write a parser for an imperative or functional programming language it would be a totally different discussion.

You can have a look at a demo of this parser over on sketch.sh and come back for part 3 where I will take the AST and evaluate it.