Hacking on Esprima, the JS Parser, Day 2

OSS(9y ago)

Yesterday, we looked at how Esprima parses 2+2. Today, we're going to go one step further and see how Esprima parses sum.

function sum(a, b) {
  return a + b;
}

function sum(a, b) {
  return a + b;
}

Like yesterday, this post is just a polished version of my notes. The transitions will be a bit rough and there are probably some incomplete thoughts, but hopefully you'll be able to see how Esprima goes about the work of parsing a function.

#How do you add two numbers?

So lets start from the beginning.

esprima.parse("function sum(a,b) {return a + b;}");

esprima.parse("function sum(a,b) {return a + b;}");

When you parse the sum function you get an output like this:

{
  "type": "Program",
  "body": [
    {
      "type": "FunctionDeclaration",
      "id": {
        "type": "Identifier",
        "name": "sum"
      },
      "params": [
        {
          "type": "Identifier",
          "name": "a"
        },
        {
          "type": "Identifier",
          "name": "b"
        }
      ],
      "defaults": [],
      "body": {
        "type": "BlockStatement",
        "body": [
          {
            "type": "ReturnStatement",
            "argument": {
              "type": "BinaryExpression",
              "operator": "+",
              "left": {
                "type": "Identifier",
                "name": "a"
              },
              "right": {
                "type": "Identifier",
                "name": "b"
              }
            }
          }
        ]
      },
      "generator": false,
      "expression": false
    }
  ]
}

{
  "type": "Program",
  "body": [
    {
      "type": "FunctionDeclaration",
      "id": {
        "type": "Identifier",
        "name": "sum"
      },
      "params": [
        {
          "type": "Identifier",
          "name": "a"
        },
        {
          "type": "Identifier",
          "name": "b"
        }
      ],
      "defaults": [],
      "body": {
        "type": "BlockStatement",
        "body": [
          {
            "type": "ReturnStatement",
            "argument": {
              "type": "BinaryExpression",
              "operator": "+",
              "left": {
                "type": "Identifier",
                "name": "a"
              },
              "right": {
                "type": "Identifier",
                "name": "b"
              }
            }
          }
        ]
      },
      "generator": false,
      "expression": false
    }
  ]
}

Notes:

FunctionDeclaration - declarations take the form function foo() {}
type.name is set to sum
params are an array of identifiers (a, b)
body is a ReturnStatement with an argument BinaryExpression.

#Call Stack

Alright, now that we see the output, lets take a look at how it was generated. In this case, if we set a breakpoint at parseFunctionDeclaration the call stack is pretty simple.

parseFunctionDeclaration
parseStatementListItem
parseScriptBody
parseProgram
parse

This shows us that, the function declaration is just a StatementItem found by parseStatementListItem.

function parseStatementListItem() {
  if (lookahead.type === Token.Keyword) {
    switch (lookahead.value) {
      case "export":
        return parseExportDeclaration();
      case "import":
        return parseImportDeclaration();
      case "let":
        return parseLexicalDeclaration({ inFor: false });
      case "function":
        return parseFunctionDeclaration(new Node());
      case "class":
        return parseClassDeclaration();
    }
  }

  return parseStatement();
}

function parseStatementListItem() {
  if (lookahead.type === Token.Keyword) {
    switch (lookahead.value) {
      case "export":
        return parseExportDeclaration();
      case "import":
        return parseImportDeclaration();
      case "let":
        return parseLexicalDeclaration({ inFor: false });
      case "function":
        return parseFunctionDeclaration(new Node());
      case "class":
        return parseClassDeclaration();
    }
  }

  return parseStatement();
}

Looking at parseStatementListItem, all it's doing is comparing the lookahead's value to different types. It's not surprising that it finds the function declaration.

#How is a function parsed?

Here's a simplified version of parseFunctionDeclaration We'll look at how four things are handled: (names, params, strict mode, body).

function parseFunctionDeclaration(node) {
  if (!match("(")) {
    token = lookahead;
    var id = parseVariableIdentifier();
  }

  var params = parseParams();

  var previousStrict = strict;
  var body = parseFunctionSourceElements();
  strict = previousStrict;

  return node.finishFunctionDeclaration(id, params, body);
}

function parseFunctionDeclaration(node) {
  if (!match("(")) {
    token = lookahead;
    var id = parseVariableIdentifier();
  }

  var params = parseParams();

  var previousStrict = strict;
  var body = parseFunctionSourceElements();
  strict = previousStrict;

  return node.finishFunctionDeclaration(id, params, body);
}

#How is the function name parsed?

Function names are parsed the same way other variable identifiers are. There are a couple edge cases to watch out for like yield and strict mode, but basically Esprima lexes, finds the next token and returns the node.

function parseVariableIdentifier() {
  var token = lex();
  if (token.type !== Token.Identifier) throwUnexpectedToken(token);
  return node.finishIdentifier(token.value);
}

function parseVariableIdentifier() {
  var token = lex();
  if (token.type !== Token.Identifier) throwUnexpectedToken(token);
  return node.finishIdentifier(token.value);
}

#How are function params parsed?

When we're parsing params function sum (a,b) {return a + b}, we're looking for the a and b identifiers inside of the parens. We also know that they're separated by commas.

Parse params order of operations is as follows:

first check for a '('
check for the second ')' if it finds it return early
parse params

function parseParams() {
  options = { params: [] };
  expect("(");

  if (match(")")) return options;
  while (parseParam(options));

  return options;
}

function parseParams() {
  options = { params: [] };
  expect("(");

  if (match(")")) return options;
  while (parseParam(options));

  return options;
}

Are you surprised that parseParams delegates responsiblity to parseParam? You shouldn't be, parsers will always break the problem into the smallest possible unit!

Findings:

Params are just variable identifiers. Remember function names? Same thing!
options is a variable that is shared between parseParams and parseParam. Interesting choice sir...
parseParam checks for the last paren ).

function parseParam(param) {
  token = lookahead;
  var param = parseVariableIdentifier();
  options.params.push(param);

  return !match(")");
}

function parseParam(param) {
  token = lookahead;
  var param = parseVariableIdentifier();
  options.params.push(param);

  return !match(")");
}

#What's going on with strict modes?

When parseFunctionDeclaration parses the body it first saves the current strict mode. It then, parses the function body and resets the strict mode.

function parseFunctionDeclaration(node) {
  if (!match("(")) {
    token = lookahead;
    var id = parseVariableIdentifier();
  }

  var params = parseParams();

  var previousStrict = strict;
  var body = parseFunctionSourceElements();
  strict = previousStrict;

  return node.finishFunctionDeclaration(id, params, body);
}

function parseFunctionDeclaration(node) {
  if (!match("(")) {
    token = lookahead;
    var id = parseVariableIdentifier();
  }

  var params = parseParams();

  var previousStrict = strict;
  var body = parseFunctionSourceElements();
  strict = previousStrict;

  return node.finishFunctionDeclaration(id, params, body);
}

The reason for this is pretty cool. Remember that each function block can define it's own strict mode preference, well, by keeping a reference to the wrapping block's strict mode preference we can make sure it doesn't get clobbered.

function() { 'use strict' }

function() { 'use strict' }

#How is the function body parsed?

parseFunctionSourceElements will do a couple things to set specific function body fields, but the gist of it is parsing statements, which is a lot like parsing a JS script body.

Order of operations:

check for a left curly {
start parsing statements and checking for the right curly }

function parseFunctionSourceElements() {
  expect("{");
  var body = [];

  while (startIndex < length) {
    token = lookahead;
    if (match("}")) break;

    body.push(parseStatementListItem());
  }

  return node.finishBlockStatement(body);
}

function parseFunctionSourceElements() {
  expect("{");
  var body = [];

  while (startIndex < length) {
    token = lookahead;
    if (match("}")) break;

    body.push(parseStatementListItem());
  }

  return node.finishBlockStatement(body);
}

You might expect us to jump into how Esprima parses a statement, but I promise you that if we do that then this will be the blog post that never ends. In the spirit of the parser, if you're curious, I point you to yesterday's post on how 2+2 is parsed. If you've seen one statement parsed, you've seen them all :) Not really, but ya know we got to save something for tomorrow.

#Overview

Alright, so that finishes today's journey into how sum is parsed.

function sum(a, b) {
  return a + b;
}

function sum(a, b) {
  return a + b;
}

We looked at how sum's name, params, and body were parsed. Along the way, we saw the importance of breaking a big problem down into small pieces and the value of always checking for small grammatical hints ("(", ")", ",", ", ").