An HDL Interpreter with Reason, Part 1: The Lexer

A couple of weeks ago, I started the fantastic Coursera course From Nand to Tetris where one implements a full virtual computer starting from the smallest building block, the Nand gate. The course is very project-centered and uses a small declarative language called HDL to implement the various hardware pieces in software.

The course instructors even provide a hardware simulator where you can run, test, and visually simulate your HDL scripts. That all works fine and dandy; however, I have some issues with that approach. For example, there is no way to edit your HDL scripts right inside the hardware simulator, tests are not re-run when your HDL code changes (which makes testing quite cumbersome), and the Java UI is not exactly easy on the eyes.

The GUI of the Hardware Simulator

I immediately thought about creating an in-browser IDE for HDL scripts with a simple UI, auto-testing, and nothing to download or install. Before I could think about the feature set, though, I needed an interpreter for HDL. And so my quest begins.

Initial Considerations

In a sense an interpreter implements a language, not unlike a compiler. However, compilation in the traditional sense is a two-stage process where a high-level language is first transformed into machine code for the target machine and then executed by this machine to produce results. An interpreter, on the other hand, is a program that takes code in a source language and executes/evaluates it in a certain environment. This environment holds the rules for evaluating expressions of our source language and also a lookup table for identifiers.
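Just to make the idea of that lookup table concrete, here is a tiny sketch of an environment as a list of (identifier, value) pairs. The names env and lookup are made up for this illustration; the real interpreter will get its own definitions later in this series.

/* A hypothetical environment: identifiers mapped to values of some type 'a. */
type env('a) = list((string, 'a));

/* Look up the value bound to a name (List.assoc raises Not_found if it is unbound). */
let lookup = (name: string, e: env('a)): 'a => List.assoc(name, e);

let example = lookup("out", [("a", 1), ("out", 0)]); /* example is 0 */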

Now that I knew what I wanted to make, I had to think about the right tools. I wanted the HDL interpreter to run inside the browser so JavaScript is a sensible implementation language. There are a couple of great examples of interpreters/compilers made with JavaScript out there:

Each one of these projects is marvelous and I highly recommend checking them out, but using JavaScript as the implementation language did not feel right to me. I wanted to write code that clearly describes the concepts and steps an interpreter is made of, and with JavaScript I always felt they got lost at some point.

And then it hit me! In my very first lecture in college, Programming – An Introduction to Computer Science, we built an interpreter in SML for a very simple programming language called F. And it was a thing of beauty. We started by thinking about the syntax and semantics of F, then assembled grammars and inference rules on paper, and finally poured everything into easy-to-read SML code.

I wanted the same for my HDL interpreter, on top of it being runnable inside the browser. While there is the SMLtoJs project, I opted for a combo with a much larger community: Reason/OCaml with BuckleScript.

The Phases of an Interpreter

Interpreter design and the underlying theory of programming languages are well understood in computer science. An interpreter typically comprises the following phases:

  1. The lexical analysis or tokenization takes a piece of code in our source language and turns its characters and symbols into a stream of words that are valid according to the source language’s grammar. This grammar is therefore a recipe for the language’s concrete syntax and those words are also called tokens.
  2. The syntactical analysis or parsing is in charge of turning this list of tokens into a tree form which complies with the language’s grammar for its abstract syntax. The resulting tree is also called Abstract Syntax Tree, or AST.
  3. The final step is called evaluation which takes the AST and evaluates it according to the language’s dynamic semantics inside an environment that maps identifiers to values.
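Seen as code, the three phases chain together into a pipeline. The following is only a toy sketch with stand-in types to show the data flow; the real token, AST, and value definitions are what the rest of this series is about.

/* Stand-in types, invented purely for this sketch. */
type token = Word(string);
type ast = Node(list(token));
type value = Result(int);

/* 1. Lexical analysis: characters to tokens. */
let lex = (code: string): list(token) => [Word(code)];

/* 2. Syntactical analysis: tokens to a tree. */
let parse = (tokens: list(token)): ast => Node(tokens);

/* 3. Evaluation: tree to a value. */
let eval = (tree: ast): value =>
  switch (tree) {
  | Node(ts) => Result(List.length(ts))
  };

let interpret = (code: string): value => code |> lex |> parse |> eval;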

In this post we will tackle the lexical analysis of HDL starting with describing the lexical syntax.

The Lexical Syntax of HDL

Let’s have a first glimpse at an HDL script:

/**
 * Checks if two input bits are equal
 */

CHIP Eq {
  IN a, b;
  OUT out; // True iff a=b
  PARTS:
    Xor(a=a, b=b, out=uneq);
    Not(in=uneq, out=out);
}

I can see things like identifiers (a, b, …), single- and multi-line comments, reserved words (CHIP, IN, …), and symbols ({, }, ;, …). Based on the HDL primer in The Elements of Computing Systems (the book accompanying the online course I mentioned in the intro of this article), I came up with the following grammar describing the lexical syntax of HDL.

   token = "CHIP" | "IN" | "OUT" | "PARTS" | "true" | "false" | "(" | ")" | "{" | "}" | "[" | "]" | "=" | ":" | ";" | "," | ".." | num | id
     num = digit [ num ]
      id = letter [ alphanum ]
alphanum = ( digit | letter ) [ alphanum ]
  letter = "a" | … | "z" | "A" | … | "Z" 
   digit = "0" | … | "9"

That means a token is either one of the reserved words, a certain symbol, a number, or an identifier. An identifier is any sequence of letters and digits not starting with a digit. HDL is case-sensitive, so I included lowercase and uppercase letters.

This simple token rule of the grammar above can now be transformed one-to-one into Reason code using a variant:

type token =
  | CHIP | IN | OUT | PARTS
  | TRUE | FALSE
  | LPAR | RPAR | LBRACE | RBRACE | LBRACK | RBRACK | EQUAL | COLON | SEMIC | COMMA | DDOT
  | NUM(int)
  | ID(string)

Isn’t this gorgeous? token is a variant with a number of so-called constructors. A variant is a very elegant way to say that a token can either be this or that or that other thing. Constructors can even hold extra data, like NUM (holding an int value) and ID (holding the identifier string).
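To make that tangible: lexing the line IN a, b; from the sample chip above should produce the following list of tokens, written out by hand here:

let sample: list(token) = [IN, ID("a"), COMMA, ID("b"), SEMIC];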

Now that we defined the lexical syntax and introduced a token variant, we can move on to implementing the actual lex function that takes our code and transforms it into a list of tokens.

Implementing the Lexer

Let’s recap our plan: we want to write a function that takes our code as a string, traverses it character by character, and spits out a list of tokens. Unfortunately, a string in Reason is not a list of chars, and there is no built-in explode like in Standard ML. So we have to write a little helper function that does the conversion for us.

let explode = (s) => {
  let rec exp = (cs, i) => i < 0 ? cs : exp([s.[i], ...cs], i - 1);
  exp([], String.length(s) - 1)
}
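A quick sanity check of what explode produces (just a hypothetical usage of the function above):

/* "IN a" becomes a plain list of characters. */
assert(explode("IN a") == ['I', 'N', ' ', 'a']);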

I also introduce some helper functions to check whether a character is a digit, a letter, or alphanumerical (digit or letter). Finally, I declare an exception which is raised if we come across an unknown character.

exception InvalidCharacter(char)

let isDigit = (c) => Char.code(c) >= 48 && Char.code(c) <= 57

let isLetter = (c) => {
  let lowercaseC = Char.lowercase(c);
  Char.code(lowercaseC) >= 97 && Char.code(lowercaseC) <= 122
}

let isAlphaNum = (c) => isLetter(c) || isDigit(c)
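A few spot checks for these helpers (again, just hypothetical usage):

assert(isDigit('7') && !isDigit('x'));
assert(isLetter('Q') && !isLetter(';'));
assert(isAlphaNum('a') && isAlphaNum('0') && !isAlphaNum('{'));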

All that’s left to do now is implementing the actual lex function. This is the moment where Reason really shines and offers us two amazing features—pattern matching and mutually recursive functions—to produce concise and readable code.

let rec lex = (cs) =>
  switch (cs) {
    | [] => []
    | [' ' | '\n' | '\r' | '\t', ...cr] => lex(cr)
    | ['C', 'H', 'I', 'P', ...cr] => [CHIP, ...lex(cr)]
    | ['I', 'N', ...cr] => [IN, ...lex(cr)]
    | ['O', 'U', 'T', ...cr] => [OUT, ...lex(cr)]
    | ['P', 'A', 'R', 'T', 'S', ...cr] => [PARTS, ...lex(cr)]
    | ['t', 'r', 'u', 'e', ...cr] => [TRUE, ...lex(cr)]
    | ['f', 'a', 'l', 's', 'e', ...cr] => [FALSE, ...lex(cr)]
    | ['(', ...cr] => [LPAR, ...lex(cr)]
    | [')', ...cr] => [RPAR, ...lex(cr)]
    | ['{', ...cr] => [LBRACE, ...lex(cr)]
    | ['}', ...cr] => [RBRACE, ...lex(cr)]
    | ['[', ...cr] => [LBRACK, ...lex(cr)]
    | [']', ...cr] => [RBRACK, ...lex(cr)]
    | ['=', ...cr] => [EQUAL, ...lex(cr)]
    | [':', ...cr] => [COLON, ...lex(cr)]
    | [';', ...cr] => [SEMIC, ...lex(cr)]
    | [',', ...cr] => [COMMA, ...lex(cr)]
    | ['.', '.', ...cr] => [DDOT, ...lex(cr)]
    | ['/', '/', ...cr] => lexCommLine(cr)
    | ['/', '*', ...cr] => lexCommBlock(cr)
    | [c, ...cr] => if (isDigit(c)) {
        lexNum(0, [c, ...cr])
      }
      else if (isLetter(c)) {
        lexId("", [c, ...cr])
      }
      else {
        raise(InvalidCharacter(c))
      }
  }
and lexNum = (v, cs) => if (List.length(cs) == 0 || !isDigit(List.hd(cs))) {
    [NUM(v), ...lex(cs)]
  }
  else {
    lexNum(10 * v + Char.code(List.hd(cs)) - Char.code('0'), List.tl(cs)) /* A */
  }
and lexId = (v, cs) => if (List.length(cs) == 0 || !isAlphaNum(List.hd(cs))) {
    [ID(v), ...lex(cs)]
  }
  else {
    lexId(v ++ String.make(1, List.hd(cs)), List.tl(cs))
  }
and lexCommLine = (cs) => if (List.hd(cs) == '\n') {
    lex(List.tl(cs))
  }
  else {
    lexCommLine(List.tl(cs))
  }
and lexCommBlock = (cs) => if (List.hd(cs) == '*' && List.hd(List.tl(cs)) == '/') { /* B */
    lex(List.tl(List.tl(cs)))
  }
  else {
    lexCommBlock(List.tl(cs))
  }

If cs (the source code as a list of characters) is empty, we return the empty list and are done. If the first element (or head) of the character list is whitespace, we ignore it and move on. If the characters match any of the reserved words or special symbols of HDL, we consume those characters, append the respective token to our list, and move on.

Comments, numbers, and identifiers are a bit trickier and need some extra care. I call helper lexers whenever we detect a digit, a letter, or a comment symbol. The functions lexCommLine and lexCommBlock use look-ahead to detect whether the comment has ended. lexNum and lexId use the helper functions introduced earlier to decide where the number or identifier ends, and then return the variant constructors NUM and ID, respectively. NUM carries the int value of the number and ID the string value of the identifier. Once the helper lexers are done with their work, we can call lex to carry on with the main task, thanks to mutually recursive functions (aka the and syntax).
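To see it all come together, here is a small end-to-end example (hypothetical usage, combining explode and lex from above):

let tokens = lex(explode("IN a, b; // the two inputs\n"));
/* tokens is [IN, ID("a"), COMMA, ID("b"), SEMIC] */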

Let me just highlight a couple of details:

  1. At /* A */, lexNum builds up the number's value digit by digit: the value collected so far is multiplied by 10, and the numeric value of the current character is added by subtracting the character code of '0' from its character code.
  2. At /* B */, lexCommBlock peeks at the next two characters to look for the closing */ of a block comment; only once it finds them does it hand control back to lex.

And there you have it, our lexer is done! You can play around with a working version on Sketch.sh that also includes a wrapper called lexer which takes care of splitting up our code string and calling lex.
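I won't reproduce the Sketch.sh version here, but a minimal wrapper along those lines could look like this (my own sketch; the linked one may differ):

let lexer = (code: string): list(token) => code |> explode |> lex;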