Hacking on Esprima, the JS Parser, Day 2

OSS(9y ago)

Yesterday, we looked at how Esprima parses 2+2. Today, we're going to go one step further and see how Esprima parses sum.

function sum(a, b) {
return a + b;
}
function sum(a, b) {
return a + b;
}

Like yesterday, this post is just a polished version of my notes. The transitions will be a bit rough and there are probably some incomplete thoughts, but hopefully you'll be able to see how Esprima goes about the work of parsing a function.

#How do you add two numbers?

So lets start from the beginning.

esprima.parse("function sum(a,b) {return a + b;}");
esprima.parse("function sum(a,b) {return a + b;}");

When you parse the sum function you get an output like this:

{
"type": "Program",
"body": [
{
"type": "FunctionDeclaration",
"id": {
"type": "Identifier",
"name": "sum"
},
"params": [
{
"type": "Identifier",
"name": "a"
},
{
"type": "Identifier",
"name": "b"
}
],
"defaults": [],
"body": {
"type": "BlockStatement",
"body": [
{
"type": "ReturnStatement",
"argument": {
"type": "BinaryExpression",
"operator": "+",
"left": {
"type": "Identifier",
"name": "a"
},
"right": {
"type": "Identifier",
"name": "b"
}
}
}
]
},
"generator": false,
"expression": false
}
]
}
{
"type": "Program",
"body": [
{
"type": "FunctionDeclaration",
"id": {
"type": "Identifier",
"name": "sum"
},
"params": [
{
"type": "Identifier",
"name": "a"
},
{
"type": "Identifier",
"name": "b"
}
],
"defaults": [],
"body": {
"type": "BlockStatement",
"body": [
{
"type": "ReturnStatement",
"argument": {
"type": "BinaryExpression",
"operator": "+",
"left": {
"type": "Identifier",
"name": "a"
},
"right": {
"type": "Identifier",
"name": "b"
}
}
}
]
},
"generator": false,
"expression": false
}
]
}

Notes:

  • FunctionDeclaration - declarations take the form function foo() {}
  • type.name is set to sum
  • params are an array of identifiers (a, b)
  • body is a ReturnStatement with an argument BinaryExpression.

#Call Stack

Alright, now that we see the output, lets take a look at how it was generated. In this case, if we set a breakpoint at parseFunctionDeclaration the call stack is pretty simple.

  1. parseFunctionDeclaration
  2. parseStatementListItem
  3. parseScriptBody
  4. parseProgram
  5. parse

This shows us that, the function declaration is just a StatementItem found by parseStatementListItem.

function parseStatementListItem() {
if (lookahead.type === Token.Keyword) {
switch (lookahead.value) {
case "export":
return parseExportDeclaration();
case "import":
return parseImportDeclaration();
case "let":
return parseLexicalDeclaration({ inFor: false });
case "function":
return parseFunctionDeclaration(new Node());
case "class":
return parseClassDeclaration();
}
}

return parseStatement();
}
function parseStatementListItem() {
if (lookahead.type === Token.Keyword) {
switch (lookahead.value) {
case "export":
return parseExportDeclaration();
case "import":
return parseImportDeclaration();
case "let":
return parseLexicalDeclaration({ inFor: false });
case "function":
return parseFunctionDeclaration(new Node());
case "class":
return parseClassDeclaration();
}
}

return parseStatement();
}

Looking at parseStatementListItem, all it's doing is comparing the lookahead's value to different types. It's not surprising that it finds the function declaration.

#How is a function parsed?

Here's a simplified version of parseFunctionDeclaration We'll look at how four things are handled: (names, params, strict mode, body).

function parseFunctionDeclaration(node) {
if (!match("(")) {
token = lookahead;
var id = parseVariableIdentifier();
}

var params = parseParams();

var previousStrict = strict;
var body = parseFunctionSourceElements();
strict = previousStrict;

return node.finishFunctionDeclaration(id, params, body);
}
function parseFunctionDeclaration(node) {
if (!match("(")) {
token = lookahead;
var id = parseVariableIdentifier();
}

var params = parseParams();

var previousStrict = strict;
var body = parseFunctionSourceElements();
strict = previousStrict;

return node.finishFunctionDeclaration(id, params, body);
}

#How is the function name parsed?

Function names are parsed the same way other variable identifiers are. There are a couple edge cases to watch out for like yield and strict mode, but basically Esprima lexes, finds the next token and returns the node.

function parseVariableIdentifier() {
var token = lex();
if (token.type !== Token.Identifier) throwUnexpectedToken(token);
return node.finishIdentifier(token.value);
}
function parseVariableIdentifier() {
var token = lex();
if (token.type !== Token.Identifier) throwUnexpectedToken(token);
return node.finishIdentifier(token.value);
}

#How are function params parsed?

When we're parsing params function sum (a,b) {return a + b}, we're looking for the a and b identifiers inside of the parens. We also know that they're separated by commas.

Parse params order of operations is as follows:

  1. first check for a '('
  2. check for the second ')' if it finds it return early
  3. parse params
function parseParams() {
options = { params: [] };
expect("(");

if (match(")")) return options;
while (parseParam(options));

return options;
}
function parseParams() {
options = { params: [] };
expect("(");

if (match(")")) return options;
while (parseParam(options));

return options;
}

Are you surprised that parseParams delegates responsiblity to parseParam? You shouldn't be, parsers will always break the problem into the smallest possible unit!

Findings:

  • Params are just variable identifiers. Remember function names? Same thing!
  • options is a variable that is shared between parseParams and parseParam. Interesting choice sir...
  • parseParam checks for the last paren ).
function parseParam(param) {
token = lookahead;
var param = parseVariableIdentifier();
options.params.push(param);

return !match(")");
}
function parseParam(param) {
token = lookahead;
var param = parseVariableIdentifier();
options.params.push(param);

return !match(")");
}

#What's going on with strict modes?

When parseFunctionDeclaration parses the body it first saves the current strict mode. It then, parses the function body and resets the strict mode.

function parseFunctionDeclaration(node) {
if (!match("(")) {
token = lookahead;
var id = parseVariableIdentifier();
}

var params = parseParams();

var previousStrict = strict;
var body = parseFunctionSourceElements();
strict = previousStrict;

return node.finishFunctionDeclaration(id, params, body);
}
function parseFunctionDeclaration(node) {
if (!match("(")) {
token = lookahead;
var id = parseVariableIdentifier();
}

var params = parseParams();

var previousStrict = strict;
var body = parseFunctionSourceElements();
strict = previousStrict;

return node.finishFunctionDeclaration(id, params, body);
}

The reason for this is pretty cool. Remember that each function block can define it's own strict mode preference, well, by keeping a reference to the wrapping block's strict mode preference we can make sure it doesn't get clobbered.

function() { 'use strict' }
function() { 'use strict' }

#How is the function body parsed?

parseFunctionSourceElements will do a couple things to set specific function body fields, but the gist of it is parsing statements, which is a lot like parsing a JS script body.

Order of operations:

  • check for a left curly {
  • start parsing statements and checking for the right curly }
function parseFunctionSourceElements() {
expect("{");
var body = [];

while (startIndex < length) {
token = lookahead;
if (match("}")) break;

body.push(parseStatementListItem());
}

return node.finishBlockStatement(body);
}
function parseFunctionSourceElements() {
expect("{");
var body = [];

while (startIndex < length) {
token = lookahead;
if (match("}")) break;

body.push(parseStatementListItem());
}

return node.finishBlockStatement(body);
}

You might expect us to jump into how Esprima parses a statement, but I promise you that if we do that then this will be the blog post that never ends. In the spirit of the parser, if you're curious, I point you to yesterday's post on how 2+2 is parsed. If you've seen one statement parsed, you've seen them all :) Not really, but ya know we got to save something for tomorrow.

#Overview

Alright, so that finishes today's journey into how sum is parsed.

function sum(a, b) {
return a + b;
}
function sum(a, b) {
return a + b;
}

We looked at how sum's name, params, and body were parsed. Along the way, we saw the importance of breaking a big problem down into small pieces and the value of always checking for small grammatical hints ("(", ")", ",", ", ").