documentation/docs/implementation/32-parsing.md

Parsing

Once we have generated a list of tokens (instances of Token) from the Lexer instance, we need to turn them into a structure that represents our program's source code using in-memory data structures which we can traverse and process at a later stage.

Overview

The Parser class contains several methods for parsing the different sub-structures of a TLang program, each returning the data type that represents the structure it parsed. The parser can move back and forth within the token stream provided and fetch the current token, as well as analyse it to return the type of symbol the token represents, known as the SymbolType (see the “Symbol types” section).

For example, the method parseIf() is used to parse if statements; it is called on the occurrence of the if token and returns an instance of type IfStatement. Then there are methods like parseBody(), which is responsible for making several sub-calls to methods such as parseIf() and building up a list of Statement instances (Statement being the top-type for all parser nodes).

The entry point to call is parse() which will return an instance of type Module.

!!! info The entry point handling may change soon with the advent of proper module support

API

The API exposed by the parser is rather minimal, as there isn't much more to a parser than controlling the token stream pointer (the position in the token stream), fetching the current token and acting upon the type or value of said token. Therefore we have the methods summarised below:

  1. nextToken()
    • Moves the token pointer to the next token
  2. previousToken()
    • Moves the token pointer to the previous token
  3. getCurrentToken()
    • Returns the current Token instance at the current token pointer position
  4. hasTokens()
    • Returns true if there are tokens still left in the stream (i.e. tokenPtr < tokens.length), false otherwise
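
To make the four methods concrete, here is a minimal sketch of a token cursor. It is not the actual Parser implementation: the Token struct (which only carries the token text), the TokenCursor name and the drain() helper are all hypothetical stand-ins introduced for illustration.

```d
/// Hypothetical stand-in for the real Token class: just the token text.
struct Token
{
    string text;
}

/// Minimal cursor over a token array, mirroring the four methods above.
struct TokenCursor
{
    private Token[] tokens;
    private size_t tokenPtr = 0;

    this(Token[] tokens)
    {
        this.tokens = tokens;
    }

    /// Moves the token pointer to the next token.
    void nextToken()
    {
        tokenPtr++;
    }

    /// Moves the token pointer to the previous token.
    void previousToken()
    {
        assert(tokenPtr > 0, "already at the start of the stream");
        tokenPtr--;
    }

    /// Returns the token at the current token pointer position.
    Token getCurrentToken()
    {
        assert(hasTokens(), "no tokens left in the stream");
        return tokens[tokenPtr];
    }

    /// True while tokenPtr < tokens.length.
    bool hasTokens() const
    {
        return tokenPtr < tokens.length;
    }
}

/// Hypothetical helper: walks the whole stream and collects the token text.
string[] drain(Token[] tokens)
{
    auto cursor = TokenCursor(tokens);
    string[] seen;
    while (cursor.hasTokens())
    {
        seen ~= cursor.getCurrentToken().text;
        cursor.nextToken();
    }
    return seen;
}
```

Calling drain() on the tokens for `int i ;` visits them in order and stops once hasTokens() returns false, which is the same loop shape that the parsing methods rely on.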

Initialization

The initialization of the parser is rather simple: an instance of the Parser class must be instantiated, and the following arguments must be provided to the constructor:

  1. Token[] tokens
    • This is an array of Token to be provided to the parser for parsing. This would have been derived from the Lexer via its performLex() and getTokens() calls.

A new instance would therefore be created with something akin to:

// Tokenize the following program
string sourceCode = "int i = 2;";
Lexer lexer = new Lexer(sourceCode);
lexer.performLex();

// Extract tokens and pass to the parser
Token[] tokens = lexer.getTokens();
Parser parser = new Parser(tokens);

Symbol types

The token stream is effectively a list of Token instances, each consisting of just the token itself as a string and the coordinates of the token (where it occurs). However, some tokens, despite being different strings, can be of the same type or syntactical grouping. For example, 1.5 and 25.2 are different tokens but both are floating-point literals. This is where the notion of symbol types comes in.

The enum SymbolType in parsing/symbols/check.d describes all of the types of tokens available in the grammar of the Tristan programming language, like so:

public enum SymbolType {
    LE_SYMBOL,
    IDENT_TYPE,
    NUMBER_LITERAL,
    CHARACTER_LITERAL,
    STRING_LITERAL,
    SEMICOLON,
    LBRACE,
    ...
}

Given an instance of Token one can pass it to the getSymbolType(Token) method which will then return an enum member from SymbolType. When a token has no associated symbol type then SymbolType.UNKNOWN is returned. Now for an example:

// Create a new token with (0, 0) as coordinates
Token token = new Token("100", 0, 0);

// Get the symbol type
SymbolType symType = getSymbolType(token);
assert(symType == SymbolType.NUMBER_LITERAL);

This assertion would pass as the symbol type of such a token is a number literal.

API

The API for working with SymbolTypes is made available within parsing/data/check.d and contains the following methods:

  1. isType(string)
    • Returns true if the given string (a token) is a built-in type
    • Built-in type strings would be: byte, ubyte, short, ushort, int, uint, long, ulong, void
  2. getSymbolType(Token)
    • Returns the SymbolType associated with the given Token
    • If the token is not of a valid type then SymbolType.UNKNOWN is returned
  3. getCharacter(SymbolType)
    • This performs the reverse of getSymbolType(Token) in the sense that you provide it with a SymbolType and it returns the corresponding string that is of that type.
    • This only works for back-mapping a sub-section of tokens: you won't get anything back if you provide SymbolType.IDENT_TYPE, as there are infinite possibilities for that rather than one fixed token.
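
The three methods can be sketched over a reduced symbol set. This is an illustration under stated assumptions, not the contents of check.d: the Token struct, the five-member enum, the helper functions, the rule that any identifier-like string (including type names) maps to IDENT_TYPE, and the choice to return an empty string from getCharacter() for non-fixed tokens are all assumptions of this sketch.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit;

/// Hypothetical stand-in for the real Token class: just the token text.
struct Token
{
    string text;
}

/// Reduced subset of the SymbolType enum, for illustration only.
enum SymbolType
{
    IDENT_TYPE,
    NUMBER_LITERAL,
    SEMICOLON,
    LBRACE,
    UNKNOWN
}

/// True if the string is one of the built-in type names listed above.
bool isType(string token)
{
    switch (token)
    {
        case "byte", "ubyte", "short", "ushort", "int",
             "uint", "long", "ulong", "void":
            return true;
        default:
            return false;
    }
}

/// Hypothetical helper: non-empty and made up of ASCII digits only.
bool isAllDigits(string text)
{
    if (text.length == 0) return false;
    foreach (char c; text)
    {
        if (!isDigit(c)) return false;
    }
    return true;
}

/// Hypothetical helper: starts with a letter or `_`, then letters, digits or `_`.
bool looksLikeIdentifier(string text)
{
    if (text.length == 0) return false;
    if (!(isAlpha(text[0]) || text[0] == '_')) return false;
    foreach (char c; text[1 .. $])
    {
        if (!(isAlphaNum(c) || c == '_')) return false;
    }
    return true;
}

/// Maps a token to its symbol type, or UNKNOWN when nothing matches.
SymbolType getSymbolType(Token token)
{
    if (token.text == ";") return SymbolType.SEMICOLON;
    if (token.text == "{") return SymbolType.LBRACE;
    if (isAllDigits(token.text)) return SymbolType.NUMBER_LITERAL;
    if (looksLikeIdentifier(token.text)) return SymbolType.IDENT_TYPE;
    return SymbolType.UNKNOWN;
}

/// Reverse mapping; only fixed tokens have one (empty string otherwise).
string getCharacter(SymbolType type)
{
    switch (type)
    {
        case SymbolType.SEMICOLON: return ";";
        case SymbolType.LBRACE:    return "{";
        default:                   return "";
    }
}
```

Note how getCharacter() can only answer for SEMICOLON and LBRACE here: NUMBER_LITERAL and IDENT_TYPE each cover an unbounded set of strings, so there is no single token to map back to.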

Data types

Every node returned by a parseX() method is of a certain type and there are some important types to mention here. The following types are from either parsing/data.d or parsing/containers.d.

Statement

The Statement type is the top-type for most parse nodes. It has the following important methods and fields:

  1. weight
    • This holds a byte value which is used when statements need to be re-ordered. It defaults to 0, which is the most prioritized re-ordering value (i.e. a smaller weight means the statement appears first)
  2. parentOf()
    • This returns the Container that this Statement is parented to, i.e. the container that this Statement sits in.
    • It can be null if this Statement has not been parented.
  3. parentTo(Container)
    • Set the parenting Container of this Statement to the one provided.
  4. toString()
    • The default string representation method for Statements (unless overridden) is to show a rolling count which is incremented with every instantiation of a Statement object.
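
The members above can be sketched as follows. This is a simplified illustration rather than the real class: the empty Container interface, the "Statement_N" string format, the rollingCount and id field names and the reorder() helper are assumptions made for this sketch.

```d
import std.algorithm : sort, SwapStrategy;
import std.conv : to;

/// Placeholder so that parentOf()/parentTo() have a type to work with.
interface Container {}

class Statement
{
    /// Re-ordering priority: smaller values come first, 0 is the default.
    byte weight = 0;

    private Container container;

    /// Rolling count shared by all instances (name is hypothetical).
    private static ulong rollingCount = 0;
    private ulong id;

    this()
    {
        // Each instantiation takes the next value of the rolling count
        id = rollingCount++;
    }

    /// The container this statement is parented to, or null.
    Container parentOf()
    {
        return container;
    }

    /// Sets the parenting container of this statement.
    void parentTo(Container container)
    {
        this.container = container;
    }

    /// Default representation based on the rolling count (format assumed).
    override string toString()
    {
        return "Statement_" ~ to!string(id);
    }
}

/// Hypothetical helper: returns a copy ordered by ascending weight.
/// A stable sort keeps the source order of statements with equal weight.
Statement[] reorder(Statement[] statements)
{
    auto ordered = statements.dup;
    sort!((a, b) => a.weight < b.weight, SwapStrategy.stable)(ordered);
    return ordered;
}
```

A stable sort is used in reorder() so that statements which share a weight keep the order in which they appeared in the source.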

Entity

The Entity type is a sub-type of Statement and represents any named entity, along with its initialization scope and accessor type (TODO: these are not yet implemented semantically). The following methods and fields are of note:

  1. this(string)
    • Constructs a new instance of an Entity with the provided name.
  2. getName()
    • Returns the name of the entity.
  3. setAccessorType(AccessorType accessorType)
    • TODO: Describe this
  4. getAccessorType()
    • TODO: Describe this
  5. setModifierType(InitScope initScope)
    • TODO: Describe this
  6. InitScope getModifierType()
    • TODO: Describe this
  7. bool isExternal()
    • If this returns true then it is a signal that this Entity should be emitted in a manner pertaining to an external symbol rather than one found in the current T module
  8. void makeExternal()
    • Mark this Entity as external
    • You will see this used in parseExtern() as that is where we need to mark entities as external for link-time resolution

Container

The Container type is an interface that specifies a set of methods for a type to implement. These methods allow the type to act as a container, i.e. to hold one or more instances of Statement (a Statement[]).

It should be noted that the parenting method is used to climb up the hierarchy given a Statement instance, whereas the Container technique is useful for a top-down search for an Entity. They are independent in that sense but can be used together (TODO: double check, but I believe this is right).
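
The two directions of traversal can be sketched as follows. The method names on Container (addStatement() and getStatements()), the Block class, the fact that addStatement() also parents the statement, and the helpers findEntity() and depthOf() are assumptions of this sketch and may differ from the real code.

```d
/// Assumed shape of the interface: add statements and list them.
interface Container
{
    void addStatement(Statement statement);
    Statement[] getStatements();
}

class Statement
{
    private Container container;

    Container parentOf() { return container; }
    void parentTo(Container container) { this.container = container; }
}

class Entity : Statement
{
    private string name;

    this(string name) { this.name = name; }
    string getName() { return name; }
}

/// Hypothetical named container (think of a module or a function body).
class Block : Entity, Container
{
    private Statement[] statements;

    this(string name) { super(name); }

    void addStatement(Statement statement)
    {
        // In this sketch, adding a statement also parents it to us
        statement.parentTo(this);
        statements ~= statement;
    }

    Statement[] getStatements() { return statements; }
}

/// Top-down: search a container (and nested containers) for an entity.
Entity findEntity(Container container, string name)
{
    foreach (statement; container.getStatements())
    {
        if (auto entity = cast(Entity) statement)
        {
            if (entity.getName() == name) return entity;
        }
        if (auto inner = cast(Container) statement)
        {
            if (auto found = findEntity(inner, name)) return found;
        }
    }
    return null;
}

/// Bottom-up: count how many containers enclose the given statement.
size_t depthOf(Statement statement)
{
    size_t depth = 0;
    Container current = statement.parentOf();
    while (current !is null)
    {
        depth++;
        auto asStatement = cast(Statement) current;
        current = asStatement is null ? null : asStatement.parentOf();
    }
    return depth;
}
```

findEntity() only ever uses getStatements(), while depthOf() only ever uses parentOf(), which illustrates why the two mechanisms are independent yet complementary.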

How to parse

The basic flow of the parser involves the following process:

  1. Firstly you need an entry point. For us this is the parse() method, which returns an instance of Module representing the module, i.e. the TLang program.
  2. Every parseX() method gets called by another such method dependent on the current symbol (and sometimes a lookahead)
    • For example, when we come across SymbolType.IDENT_TYPE we call parseName(), which can then call either parseFuncCall(), parseTypedDeclaration() or parseAssignment(). This requires a lookahead to check what follows the identifier, because by itself it is too ambiguous grammatically.
    • After determining what comes next the token is pushed back using previousToken() and then we proceed into the correct function
    • Lookaheads are rare but they do appear in situations like that
  3. The parseX() methods return instances of Statement which is the top type for all parser-generated nodes or AST nodes.
  4. When you are about to parse a sub-section (like an if statement) of a bigger syntax group (like a body), you leave the offending token as the current token and then call the parsing method (in this case parseIf()), letting it handle the call to nextToken() - this is simply the structure of parsing that TLang follows.
  5. Upon exiting a parseX() method you call nextToken(). The token this lands on determines whether the method continues parsing or not; if not, you return and the caller carries on from that current token.
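
The lookahead-and-push-back step from point 2 can be sketched over plain string tokens. The grammar rules used here (`(` means a function call, `=` means an assignment, a following identifier means a typed declaration) are a simplification assumed for illustration, and MiniParser, NameKind and classifyName() are hypothetical names rather than parts of the real Parser.

```d
import std.ascii : isAlpha;

/// What an identifier turns out to introduce (hypothetical enum).
enum NameKind
{
    FUNC_CALL,
    TYPED_DECLARATION,
    ASSIGNMENT,
    UNKNOWN
}

struct MiniParser
{
    private string[] tokens;
    private size_t tokenPtr = 0;

    this(string[] tokens)
    {
        this.tokens = tokens;
    }

    void nextToken() { tokenPtr++; }
    void previousToken() { tokenPtr--; }
    string getCurrentToken() { return tokens[tokenPtr]; }
    bool hasTokens() const { return tokenPtr < tokens.length; }

    /// Assumes the current token is an identifier.
    NameKind classifyName()
    {
        // Lookahead: step onto the token that follows the identifier
        nextToken();

        NameKind kind = NameKind.UNKNOWN;
        if (hasTokens())
        {
            string following = getCurrentToken();
            if (following == "(")
            {
                kind = NameKind.FUNC_CALL;
            }
            else if (following == "=")
            {
                kind = NameKind.ASSIGNMENT;
            }
            else if (following.length > 0
                     && (isAlpha(following[0]) || following[0] == '_'))
            {
                kind = NameKind.TYPED_DECLARATION;
            }
        }

        // Push back, so the chosen parseX() method starts on the identifier
        previousToken();
        return kind;
    }
}
```

After classifyName() returns, the current token is the identifier again, so whichever parsing method is chosen next sees the stream exactly as if no lookahead had happened.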

Example of parsing if-statements

We will now look at an example of how we deal with parsing if statements in our parser, specifically within parseBody(). The method begins by setting up an array of Statement, so that we can build up a body of them, and then moves us off the offending token that made us call parseBody() (hence the call to nextToken()):

gprintln("parseBody(): Enter", DebugType.WARNING);

Statement[] statements;

/* Consume the `{` symbol */
nextToken();

Now we are within the body. As you can imagine, a body is made up of several statements and we do not know in advance how many there are. Therefore we set up a loop that will iterate until we run out of tokens:

while (hasTokens())
{
    ...
}

The next thing we want to do is grab the current token and check what type of symbol it is:

while (hasTokens())
{
    /* Get the token */
    Token tok = getCurrentToken();
    SymbolType symbol = getSymbolType(tok);
    gprintln("parseBody(): SymbolType=" ~ to!(string)(symbol));

    ...
}

Following this we have several checks that make use of the result of getSymbolType(Token) in order to determine what the token's type is. In our case, if the token is "if" then we make a call to parseIf() and append the returned Statement sub-type to the body of statements (Statement[]):

while(hasTokens())
{
    ...

    /* If it is a branch */
    else if (symbol == SymbolType.IF)
    {
        statements ~= parseIf();
    }

    ...
}
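
Putting the fragments together, the following is a self-contained sketch of the same control flow over plain string tokens. It is not the real parseBody(): the toy grammar (a body is `{ ... }`, an if statement is `if ( name ) body`, anything else is `name ;`), the node classes, the expect() helper and the choice to end the loop on `}` are assumptions made so that the example can run on its own.

```d
import std.exception : enforce;

class Statement {}

/// Hypothetical node for anything that is not an if statement.
class SimpleStatement : Statement
{
    string text;
    this(string text) { this.text = text; }
}

/// Hypothetical, reduced version of an if-statement node.
class IfStatement : Statement
{
    string condition;
    Statement[] statements;

    this(string condition, Statement[] statements)
    {
        this.condition = condition;
        this.statements = statements;
    }
}

class MiniParser
{
    private string[] tokens;
    private size_t tokenPtr = 0;

    this(string[] tokens) { this.tokens = tokens; }

    private void nextToken() { tokenPtr++; }
    private bool hasTokens() { return tokenPtr < tokens.length; }

    private string getCurrentToken()
    {
        enforce(hasTokens(), "unexpected end of tokens");
        return tokens[tokenPtr];
    }

    /// Hypothetical helper: checks that the current token is the wanted one.
    private void expect(string wanted)
    {
        enforce(getCurrentToken() == wanted, "expected '" ~ wanted ~ "'");
    }

    /// Entered with `{` as the current token, leaves the cursor just past `}`.
    Statement[] parseBody()
    {
        Statement[] statements;

        /* Consume the `{` symbol */
        expect("{");
        nextToken();

        while (hasTokens())
        {
            string tok = getCurrentToken();

            if (tok == "}")
            {
                /* Consume the `}` symbol and finish this body */
                nextToken();
                return statements;
            }
            /* If it is a branch */
            else if (tok == "if")
            {
                // `if` stays the current token; parseIf() moves off it
                statements ~= parseIf();
            }
            else
            {
                statements ~= parseSimple();
            }
        }

        throw new Exception("unterminated body");
    }

    /// Entered with `if` as the current token.
    IfStatement parseIf()
    {
        nextToken();                 // move off `if`
        expect("(");
        nextToken();
        string condition = getCurrentToken();
        nextToken();
        expect(")");
        nextToken();                 // now on `{`, which parseBody() consumes
        return new IfStatement(condition, parseBody());
    }

    /// Parses `name ;` into a SimpleStatement.
    Statement parseSimple()
    {
        string text = getCurrentToken();
        nextToken();
        expect(";");
        nextToken();
        return new SimpleStatement(text);
    }
}
```

Each method follows the convention described earlier: the caller leaves the offending token current, the callee consumes it, and on return the cursor rests on the first token the callee did not handle.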