Guts

NOTE: THIS DOCUMENT IS A WORK IN PROGRESS!

Overview

The goal of this document is to take a deep dive into the guts of Eureka, starting at the high-level objects and drilling all the way down into the parser / compiler / assembler / interpreter. If you’re just interested in seeing what the language itself offers or finding out some basic information (why the language was created, etc.), please take the tour or read the FAQ.

You’ve been warned… time to dig in!

Common Programming Patterns

Just to save a bit of time when discussing various structures over the course of this document, I will give some general hints now that will help you find these structures in the codebase and know which functions to expect.

For the most part, each structure described in this document is declared and implemented in a .h/.c combination named the same as the structure, most likely in the eureka/core subdirectory. For each structure, every function call will be prefixed with the name of the structure, and will take an ekContext (explained in the next section) as its first argument. Every structure is likely to have a Create and a Destroy function.

As an example, the struct ekContext is defined in eureka/core/ekContext.h, implemented in eureka/core/ekContext.c, and the functions ekContextCreate() and ekContextDestroy() exist.
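
Here is a sketch of these conventions applied to a made-up structure (ekThing and these exact signatures are illustrative assumptions, not real API):

typedef struct ekThing
{
    ekS32 placeholder;                          // members omitted
} ekThing;

ekThing *ekThingCreate(struct ekContext *E);    // every structure gets a Create...
void ekThingDestroy(struct ekContext *E, ekThing *thing); // ...and a Destroy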

The only notable exception to this is ekTypes.h, which contains a base set of low level typedefs which aid in portability, such as ekS32 for signed integers or ekBool for a boolean value. Every structure in Eureka should be defined using these base types (or other Eureka structures, of course).
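
For illustration, those typedefs presumably look something like this (only the names ekS32 and ekBool are confirmed above; the underlying types are assumptions and would vary per platform):

typedef int ekS32;                          // assumed: signed 32-bit integer
typedef char ekBool;                        // assumed: boolean value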

Arrays and Strings (ekArray and ekString) take a slightly different approach. They follow the naming convention and function call style, but are not defined as explicit structures. To aid a little bit in code readability and type safety, ekArrays are declared as double pointers to the structure the array is intended to contain. For example, stack frames owned by ekContext are declared as such:

struct ekFrame **frames;

ekArray*() functions take references to these pointers (a triple pointer! The horror!), and internally maintain capacity with a bit of pointer arithmetic. As an example, this is legal code:

extern ekContext *E;                        // created elsewhere
ekFrame **frames = NULL;
ekFrame *newFrame = ekFrameCreate(E, ...);
ekArrayPush(E, &frames, newFrame);          // creates array storage
ekArrayDestroy(E, &frames, ekFrameDestroy); // frees everything + storage

A similar mechanism exists for ekString, using a simple char pointer.
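
By analogy with the ekArray example above, ekString usage presumably looks something like this sketch (the function names below are assumptions based on the naming convention, not confirmed API):

extern ekContext *E;                        // created elsewhere
char *s = NULL;                             // an "ekString" is just a char pointer
ekStringConcat(E, &s, "Hello, world!");     // hypothetical; creates string storage
ekStringDestroy(E, &s);                     // hypothetical; frees storage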

The Eureka Context

Everything Eureka does is performed inside of a single “context”, contained completely in the struct ekContext. Each ekContext provides its own set of memory routines, types (built-in and custom), compiler information (syntax tree, etc), interpreter state (global variables, stack frames, values, pools), and errors (both compile and runtime). An ekContext completely owns all state, making tracking and cleanup fairly easy. Every C routine involving Eureka has an ekContext as its first argument, using the variable name “E”.

All embedding usage of Eureka starts with creating a context and adding any global functions, variables, or custom types. Then arbitrary code can be compiled and executed on the fly with ekContextEval(), or objects can be manipulated directly by digging into the guts of your ekContext.
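
In code, that flow might look roughly like this minimal sketch (the exact arguments to ekContextCreate() and ekContextEval() are assumptions here; consult the real headers):

ekContext *E = ekContextCreate(NULL);                 // assumed: NULL requests default memory routines
ekContextEval(E, "print(\"Hello, world!\\n\");", 0);  // assumed: trailing argument is eval flags
ekContextDestroy(E);                                  // cleans up everything the context owns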

Compilation Flow

It is time for an example. Let’s start with hello world:

print("Hello, world!\n");

Passing this text into a call to ekContextEval() will start by spawning another top-level structure named ekCompiler. ekCompiler is responsible for all phases of compilation: lexing, parsing, and assembling. A successful compilation creates two substructures inside of ekCompiler. The lexing and parsing phases create a tree of ekSyntax (the AST, or “Abstract Syntax Tree”), and the assembling phase creates an ekChunk, which is the compiled code.

A failed compilation simply populates an array of strings (errors), and may or may not produce either of the other two structures.

The AST is only useful after compilation for generating pretty graphs or otherwise inspecting mistakes made during the parsing phase, but the generated chunk can be handed to the parent context (thus transferring ownership of it), which then allows the code to be executed and any variables/functions created inside to continue to exist within the context. The associated ekCompiler structure is destroyed before ekContextEval() finishes.
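
Conceptually, the inside of ekContextEval() boils down to something like this sketch (ekCompilerCreate/ekCompilerDestroy follow the Create/Destroy convention described earlier; ekCompile and the chunk handoff are loose assumptions):

ekCompiler *compiler = ekCompilerCreate(E);
ekCompile(E, compiler, code);               // hypothetical: lex + parse + assemble
if (compiler->chunk)
{
    // hand the chunk to the context (transferring ownership), then execute it
}
ekCompilerDestroy(E, compiler);             // the AST and error strings die with the compiler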

Lexing and Parsing

Let’s rewind a bit and talk about creating the syntax tree, using two intertwined pieces: the lexer (ekLexer and the function ekLex) and the parser (the generated function ekParse). The goal of the lexer is to break the raw Eureka source code into tokens, which are then passed to the parser function. Eureka’s lexer is mostly implemented using a tool named re2c, which can generate a complex lexer from a series of regular expressions that define all of the tokens. The re2c source to the lexer is located in the file ekLexer.re, from which the (impossible to read) file ekLexer.re.c is generated during the build.

Looking over the re2c source, you will see lines such as:

"{"             { return ETT_OPENBRACE; }
"}"             { return ETT_CLOSEBRACE; }
"["             { return ETT_OPENBRACKET; }
"]"             { return ETT_CLOSEBRACKET; }
"."             { return ETT_PERIOD; }
"::"            { return ETT_COLONCOLON; }
":"             { return ETT_COLON; }
"\?"            { return ETT_QUESTIONMARK; }
"break"         { return ETT_BREAK; }
"this"          { return ETT_THIS; }
"while"         { return ETT_WHILE; }
"var"           { return ETT_VAR; }

The left side of the file is just a list of simple regular expressions, and the right side is the C code that should be executed upon matching that regex as the next token in the stream. Every time a new token is matched, the ID associated with that token is handed to the parser, along with the actual matched text. This creates a “token stream”, which is then consumed by the parser as it slowly builds the syntax tree.
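
The pump between the lexer and the parser presumably looks something like this sketch (ekLex’s real signature is an assumption; the end-of-input convention is standard for Lemon-generated parsers):

ekLexer lexer;                              // assumed: initialized elsewhere with the source text
void *parser;                               // assumed: created elsewhere by the Lemon allocator
ekToken token;
int id;
while ((id = ekLex(&lexer, &token)) > 0)    // assumed: nonpositive ID means end of input or error
{
    ekParse(parser, id, token);             // Lemon-style entry point: (parser, token ID, token value)
}
ekParse(parser, 0, token);                  // Lemon convention: token ID 0 signals end of input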

The parser source is located in ekParser.y, and is in Lemon format. Lemon is known as a “parser generator” or a “compiler compiler”, which means its input is a set of rules/directives providing a complete “grammar” (Eureka’s grammar), and its output is a state machine that creates and attaches pieces of syntax together into a tree (the syntax tree).

The parser generated by Lemon is a typical “shift-reduce” parser. Every token the parser receives is shifted (added) onto a stack of tokens, and then the parser checks the stack to see if it can “reduce” any adjacent tokens/syntax into a larger piece of syntax, which replaces them on the stack. If the token stream is valid, the final token should reduce all remaining items into a single top-level syntax node, which will be the root of the syntax tree.

This combination of token stream and shifting/reduction is the heart of the language’s look and feel. Keywords can be added/aliased/removed by just manipulating the lexer, and sweeping changes in grammar rules or syntactic sugar can be made by tweaking a directive in the parser source.

Quick vocab: the “tokens” here are what the parser refers to as terminals, and the pieces of syntax that we’re creating are referred to as nonterminals. These terms distinguish the “branches” of the tree from the “leaves”, with the leaves being where a branch ends (terminates). Since everything starts with a token and works towards the root, the innards of a syntax tree are always syntax nodes (branches, nonterminals) and the tips are always tokens (leaves, terminals).

The hello world example provided earlier would produce this token stream:

ETT_IDENTIFIER     print
ETT_LEFTPAREN      (
ETT_LITERALSTRING  "Hello, world!\n"
ETT_RIGHTPAREN     )
ETT_ENDSTATEMENT   ;
ETT_ENDSTATEMENT

The identifier would be shifted onto the stack, and then immediately be reduced into an identifier syntax node. The left paren would be shifted, which would clue in the parser to reduce the previous identifier into an lvalue_indexable, as it is the only possible valid combination for that order of tokens. The literal string is then shifted and reduced to an expression, and the following shifted right paren would cause leftparen + expression + rightparen to be further reduced into a paren_expr_list. The shifted ENDSTATEMENTs that follow would cause the lvalue_indexable + paren_expr_list to be reduced to a single lvalue_indexable (which is now internally recognized as a function call), and then further reduced into a simple lvalue, which would then be reduced into an expression, then an expr_list, then a statement, then a statement_list, and finally a chunk.

Did you follow that? It might sound crazy, but this full chain of reductions can be traced backwards from the following rules (in order of appearance, comments on the right):

lvalue_indexable ::= IDENTIFIER.                        "print"

expression       ::= LITERALSTRING.                     "Hello, world!\n"
expr_list        ::= expression.                        "Hello, world!\n"
paren_expr_list  ::= LEFTPAREN expr_list RIGHTPAREN.   ("Hello, world!\n")

lvalue_indexable ::= lvalue_indexable paren_expr_list.  [call print]
lvalue           ::= lvalue_indexable.                  ...
expression       ::= lvalue.                            ...
expr_list        ::= expression.                        ...
statement        ::= expr_list ENDSTATEMENT.            [call print] ;
statement_list   ::= statement.                         ...
chunk            ::= statement_list.                    ...

You can interpret the ::= symbol as “a (rightside) can be reduced to a (leftside).” Since the goal of a shift-reduce parser is to reduce all the way down to the first listed nonterminal in the grammar (chunk, in our case), the generated parser has every possible valid token order combination predetermined. When it sees that first ENDSTATEMENT, the only possible token stream that would still reduce down to a chunk is one following the reduction path above. The LITERALSTRING terminal gets bundled up inside a parenthesized expression list, which mixes with the identifier to make a function call, which becomes the only statement in the chunk; a single statement in the statement_list array at the root of the syntax tree. This is visible when graphing the syntax tree.

The syntax tree root is the only output of this phase, and the only input for the next phase: assembling.

Assembling

If a valid syntax tree is created, it must then be assembled into something that the interpreter can execute. The entry point into the assembling phase is ekAssemble, located in the compiler source. ekAssemble takes an ekCompiler as its input, and slowly builds an ekChunk structure as it recursively walks the syntax tree.

Inside of ekCompiler.c, there exists a mapping from every type of node in the AST to a function that provides its assembly, and those functions are responsible for assembling all of their child nodes. Since the output of a compilation is a “chunk”, and chunks are defined as statement lists (as shown in the grammar earlier), the assembled chunk is really just the output of ekAssembleStatementList.
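
Here is a sketch of what such a mapping might look like (the typedef and table below are illustrative assumptions; only ekAssembleStatementList is a confirmed name):

// Hypothetical: each AST node type maps to an assembly function of this
// shape, where 'keep' is explained in the next paragraph.
typedef ekS32 (*ekAssembleFunc)(struct ekCompiler *compiler, struct ekSyntax *syntax, ekS32 keep);

static const ekAssembleFunc assembleFuncs[] =
{
    ekAssembleStatementList,                // for statement list nodes
    ekAssembleCall,                         // hypothetical: for function call nodes
    ekAssembleAdd,                          // hypothetical: for addition nodes
    // ... one entry per ekSyntax node type ...
};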

Because the Eureka interpreter is entirely stack based (no registers), very little information needs to be conveyed from node to node during assembly. The input to each assembly function (along with some flags) is simply “how many values does the parent node want left on the stack”, and the return value is “how many values are left on the stack”. Each assembly function is required to leave the exact number of requested values on the stack. This is achieved with a simple return macro PAD, which ensures that the requested count (keep) matches the number of values left on the stack by appending POP or PUSHNULL opcodes accordingly.
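
In spirit, PAD does something like this reconstruction (the real macro will differ; emitOp, valuesOnStack, and the EOP_ opcode names are hypothetical):

// Conceptual reconstruction: ensure exactly 'keep' values remain on the
// stack, then return that count to the parent node.
#define PAD(keep)                                               \
    {                                                           \
        while (valuesOnStack < (keep))                          \
        {                                                       \
            emitOp(compiler, EOP_PUSHNULL); /* too few: pad */  \
            ++valuesOnStack;                                    \
        }                                                       \
        while (valuesOnStack > (keep))                          \
        {                                                       \
            emitOp(compiler, EOP_POP);      /* too many: pop */ \
            --valuesOnStack;                                    \
        }                                                       \
        return (keep);                                          \
    }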

For the chunk, the input is simple; when a chunk is done executing, there shouldn’t be any values left on the stack, so zero is passed in at the top level. To see something more interesting, let’s use an example such as this:

a = 3 + 5;

This would create this syntax tree (slightly simplified):
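
A plausible rendering of that tree, with node names guessed from the grammar above (the real node names may differ):

chunk
  statement_list
    statement (assignment)
      lvalue: identifier "a"
      expression: add
        integer literal 3
        integer literal 5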

TODO:

  • Finish Assembling
  • Interpreting
  • ekValue
  • ekValueType