Building an Interpreter, Part 1: The Scanner

A couple of weeks ago, I started the fantastic Coursera course From Nand to Tetris where one implements a full virtual computer starting from the smallest building block, the Nand gate. The course is very project-centered and uses a small declarative language called—quite generically—HDL to implement the various hardware pieces in software.

The course instructors even provide a hardware simulator where you can run, test, and visually simulate your HDL scripts. That works all fine and dandy, however, I have some issues with that approach. For example, there is no way to edit your HDL scripts right inside the hardware simulator, tests are not re-run on changes in your HDL code which makes testing quite cumbersome, and the Java UI is not really pleasant on the eyes.

The GUI of the Hardware Simulator

I immediately thought about creating an in-browser IDE for HDL scripts with a simple UI, auto-testing, and nothing to download or install. Before I could think about the feature set, though, I needed an interpreter for HDL. And so my quest begins.

Initial Considerations

In a sense an interpreter implements a language, not unlike a compiler. However, compilation in the traditional sense is a two-stage process where a high-level language is first transformed into machine code for the target machine and then executed by this machine to produce results. An interpreter, on the other hand, is a program that takes code in a source language and executes/evaluates it right away.

Now that I knew what I wanted to make, I had to think about the right tools. I wanted the HDL interpreter to run inside the browser so JavaScript is a sensible implementation language. There are a couple of great examples of interpreters/compilers made with JavaScript out there:

All of these projects are excellent resources and I highly recommend checking them out, but using JavaScript as the implementation language did not feel right to me. I wanted to write code that clearly describes the concepts and steps an interpreter is made of and with JavaScript I always felt they are getting lost at some point.

And then it hit me! In my very first lecture in college Programming – An Introduction to Computer Science we built an interpreter with SML for a very simple programming language called F. And it was a thing of beauty. We started thinking about the syntax and semantics of F, then assembled grammars and inference rules on paper, and finally poured everything into easy-to-read, functional SML code.

I wanted the same for my HDL interpreter on top of being runnable inside the browser. While there is the SMLtoJs project, I opted for a combo with a much larger community: Reason/OCaml with Bucklescript.

The Phases of an Interpreter

Interpreter design and the underlying theory of programming languages is well understood in computer science and includes the following steps:

  1. The lexical analysis (also called tokenization or scanning) takes a piece of code in our source language and turns its characters and symbols into a stream of words that are valid according to the source language’s concrete syntax. Those words are called tokens.
  2. The syntax analysis (also known as parsing) is in charge of turning this list of tokens into a tree form which complies with the languags’s grammar for its abstract syntax. The resulting tree is called abstract syntax tree, or AST.
  3. The semantic analysis verifies that the AST is semantically meaningful and adheres to the source language’s static semantics. That also includes type checking and flow control checking.
  4. The final step is called evaluation which takes the verified AST and evaluates its expressions according to the language’s dynamic semantics inside a certain environment. This environment holds the rules for evaluating expressions of our source language and also a symbol table to lookup variables.

In this post we will tackle the lexical analysis of HDL by starting with describing the lexical syntax. The lexical syntax describes how tokens are created out of characters and symbols by means of regular expressions. It is the first half of the concrete syntax.

The second half is called phrasal syntax and describes how more complex hierarchical structures are made out of tokens. We will talk more about the phrasal syntax in part two of this series.

The Lexical Syntax of HDL

Let’s have a first glimpse at an HDL script:

/**
 * Checks if two input bits are equal
 */

CHIP Eq {
  IN a[2];
  OUT out; // True iff a[0]=a[1]
  PARTS:
    Xor(a=a[0], b=b[1], out=uneq);
    Not(in=uneq, out=out);
}

There are reserved keywords (CHIP, IN, …), identifiers (Eq, a, …), numbers (2), and symbols ({, [, ;, …). Converting this piece of code into a stream of tokens would yield something like: CHIP, ID("Eq"), LBRACE, IN, ID("a"), LBRACK, NUM(2), RBRACK, and so on.

Based on the HDL primer inside the book The Elements of Computing Systems (the book accompanying the online course I mentioned in the intro of this article) I came up with the following regular expressions describing the lexical syntax of HDL.

That means a token is either one of the keywords, a certain symbol, a number, or an identifier. A number is a sequence of digits with at least one digit. An identifier is a letter followed by any sequence of letters and digits. HDL is case sensitive so I included lowercase and uppercase letters.

This simple token type of the grammar above can now be transformed one-to-one into Reason code using a variant:

type token =
  | CHIP | IN | OUT | PARTS
  | TRUE | FALSE
  | LPAR | RPAR | LBRACE | RBRACE | LBRACK | RBRACK | EQUAL | COLON | SEMIC | COMMA | DDOT
  | NUM(int)
  | ID(string)

Isn’t this gorgeous? token is a variant with a number of so-called constructors. A variant is a very elegant way to say that a token can either be this or that or that other thing. Constructors can even hold extra data, like NUM (holding an int value) and ID (holding the identifier string).

Now that we defined the lexical syntax and introduced a token variant, we can move on to implementing the actual scan function that takes our code and transforms it into a list of tokens.

Implementing the Scanner

Let’s recap our plan: We want to write a function that takes our code as a string, traverses it character by character and spits out a list of tokens. Unfortunately, a string in Reason is not a list of chars (unlike in Standard ML). So we have to write a little helper function that does that for us.

let explode = (s) => {
  let rec exp = (cs, i) => i < 0 ? cs : exp([s.[i], ...cs], i - 1);
  exp([], String.length(s) - 1)
}

I also introduce some helper functions to check whether a character is a digit, a letter, or alphanumerical (digit or letter). Finally, I declare an exception which is raised if we come across an unknown character.

exception InvalidCharacter(char)

let isDigit = (c) => Char.code(c) >= 48 && Char.code(c) <= 57

let isLetter = (c) => {
  let lowercaseC = Char.lowercase(c);
  Char.code(lowercaseC) >= 97 && Char.code(lowercaseC) <= 122
}

let isAlphaNum = (c) => isLetter(c) || isDigit(c)

All that’s left to do now is implementing the actual scan function. This is the moment where Reason really shines and offers us two amazing features—pattern matching and mutually recursive functions—to produce concise and readable code.

let rec scan = (cs) =>
  switch (cs) {
    | [] => []
    | [' ' | '\n' | '\r' | '\t', ...cr] => scan(cr)
    | ['C', 'H', 'I', 'P', ...cr] => [CHIP, ...scan(cr)]
    | ['I', 'N', ...cr] => [IN, ...scan(cr)]
    | ['O', 'U', 'T', ...cr] => [OUT, ...scan(cr)]
    | ['P', 'A', 'R', 'T', 'S', ...cr] => [PARTS, ...scan(cr)]
    | ['t', 'r', 'u', 'e', ...cr] => [TRUE, ...scan(cr)]
    | ['f', 'a', 'l', 's', 'e', ...cr] => [FALSE, ...scan(cr)]
    | ['(', ...cr] => [LPAR, ...scan(cr)]
    | [')', ...cr] => [RPAR, ...scan(cr)]
    | ['{', ...cr] => [LBRACE, ...scan(cr)]
    | ['}', ...cr] => [RBRACE, ...scan(cr)]
    | ['[', ...cr] => [LBRACK, ...scan(cr)]
    | [']', ...cr] => [RBRACK, ...scan(cr)]
    | ['=', ...cr] => [EQUAL, ...scan(cr)]
    | [':', ...cr] => [COLON, ...scan(cr)]
    | [';', ...cr] => [SEMIC, ...scan(cr)]
    | [',', ...cr] => [COMMA, ...scan(cr)]
    | ['.', '.', ...cr] => [DDOT, ...scan(cr)]
    | ['/', '/', ...cr] => scanCommLine(cr)
    | ['/', '*', ...cr] => scanCommBlock(cr)
    | [c, ...cr] => if (isDigit(c)) {
        scanNum(0, [c, ...cr])
      }
      else if (isLetter(c)) {
        scanId("", [c, ...cr])
      }
      else {
        raise(InvalidCharacter(c))
      }
  }
and scanNum = (v, cs) => if (List.length(cs) == 0 || !isDigit(List.hd(cs))) {
    [NUM(v), ...scan(cs)]
  }
  else {
    scanNum(10 * v + Char.code(List.hd(cs)) - Char.code('0'), List.tl(cs)) /* A */
  }
and scanId = (v, cs) => if (List.length(cs) == 0 || !isLetterNum(List.hd(cs))) {
    [ID(v), ...scan(cs)]
  }
  else {
    scanId(v ++ String.make(1, List.hd(cs)), List.tl(cs))
  }
and scanCommLine = (cs) => if (List.hd(cs) == '\n') {
    scan(List.tl(cs))
  }
  else {
    scanCommLine(List.tl(cs))
  }
and scanCommBlock = (cs) => if (List.hd(cs) == '*' && List.hd(List.tl(cs)) == '/') { /* B */
    scan(List.tl(List.tl(cs)))
  }
  else {
    scanCommBlock(List.tl(cs))
  }

If cs (the source code as a list of characters) is empty we return the empty list and are done. If the first element (or head) of the character list is whitespace we ignore it and move on. If the characters match any of the reserved words or special symbols of HDL we consume those characters, append the respective token to our list, and move on.

Comments, numbers, and identifiers are a bit trickier and need some extra care. I call helper functions in case we detect a number, letter, or comment symbol. The functions scanCommLine and scanCommBlock use the next token(s) to detect whether the comment has ended. scanNum and scanId use the helper functions introduced earlier to return the variant constructors NUM and ID, respectively. NUM carries the int value of the number and ID the string value of the identifier. Once the helper functions are done with their work, we can call scan to carry on with the main task, thanks to mutually recursive functions (aka the and syntax).

Let me just highlight a couple of details:

Lastly, we wrap this auxiliary scan function with a wrapper function called scanner. The sole purpose of this function is converting a string of HDL code into an array of characters and calling scan.

let scanner = (code) => {
  let rec scan = (cs) =>
    /* scan code is omitted for brevity */
  ;
  scan(explode(code))
};

let code = "CHIP Eq {}";

scanner(code)
/* [CHIP, ID("Eq"), LBRACE, RBRACE] */ 

scanner("!")
/* Exception: InvalidCharacter('!'). */

And there you have it, our scanner is done! You can play around with a working version on Sketch.sh. In the next part of this series, we will deal with the phrasal and abstract syntax and use that knowledge to build a parser that generates an abstract syntax tree (AST).