Lexical analysis

Lexical analysis is the process of taking a program as an input string \(A\) and splitting it into a list of \(n\) sub-strings \(A_{1}, A_{2}, \ldots, A_{n}\) called tokens. The length \(n\) of this list is dependent on several rules that determine how, when and where new tokens are built; this set of rules is called a grammar.
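For example, an input such as int i = 0; might be split into the five tokens int, i, =, 0 and ;.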

Grammar

The grammar is described in the language section and can be viewed alongside this section for some context.

Overview

The source code for the lexical analysis part of the compiler is located in source/tlang/compiler/lexer/ which contains a few important module and class definitions.

Lexer API

The Lexer API describes the required methods that a tokenizer must have in order to be used within the TLang compiler infrastructure. The reason for describing such an interface is that improved tokenizers can later be integrated without much churn in the parts of the code base that use the lexer (namely the parser).

The API is described in the table below; the file in question is source/tlang/compiler/lexer/core/lexer.d, which contains the LexerInterface:

| Method name | Return type | Description |
|-------------|-------------|-------------|
| getCurrentToken() | Token | Returns the Token at the current cursor position |
| nextToken() | void | Moves the cursor forward once |
| previousToken() | void | Moves the cursor backwards once |
| setCursor(ulong) | void | Sets the cursor's position to the given index |
| getCursor() | ulong | Returns the cursor's current position |
| hasTokens() | bool | Returns true if there are more tokens to be consumed, otherwise false |
| getLine() | ulong | Returns the line number the lexer is at |
| getColumn() | ulong | Returns the column number the lexer is at |
| getTokens() | Token[] | Exhausts the lexer's token stream and returns all gathered tokens in an array |
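
To give an idea of how a consumer such as the parser might drive this interface, below is a small sketch which walks the entire token stream using the cursor methods (the import paths are assumptions made for illustration):

import tlang.compiler.lexer.core.lexer : LexerInterface;
import tlang.compiler.lexer.core.tokens : Token; // module path assumed

void walkTokens(LexerInterface lexer)
{
    // Consume tokens one-by-one until the stream is exhausted
    while(lexer.hasTokens())
    {
        Token current = lexer.getCurrentToken();

        // ... do something with `current` here ...

        // Move the cursor onto the next token
        lexer.nextToken();
    }

    // Rewind the cursor back to the start of the stream
    lexer.setCursor(0);
}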

Character constants

For completeness we include the commonly used character constant definitions. These come in the form of an enumeration type, shown below:

public enum LexerSymbols : char
{
    L_PAREN = '(',
    R_PAREN = ')',
    SEMI_COLON = ';',
    COMMA = ',',
    L_BRACK = '[',
    R_BRACK = ']',
    PLUS = '+',
    MINUS = '-',
    FORWARD_SLASH = '/',
    PERCENT = '%',
    STAR = '*',
    AMPERSAND = '&',
    L_BRACE = '{',
    R_BRACE = '}',
    EQUALS = '=',
    SHEFFER_STROKE = '|',
    CARET = '^',
    EXCLAMATION = '!',
    TILDE = '~',
    DOT = '.',
    COLON = ':',
    SPACE = ' ',
    TAB = '\t',
    NEWLINE = '\n',
    DOUBLE_QUOTE = '"',
    SINGLE_QUOTE = '\'',
    BACKSLASH = '\\',
    UNDERSCORE = '_',
    LESS_THAN = '<',
    BIGGER_THAN = '>',

    ESC_NOTHING = '0',
    ESC_CARRIAGE_RETURN = 'r',
    ESC_TAB = 't',
    ESC_NEWLINE = 'n',
    ESC_BELL = 'a',

    ENC_BYTE = 'B',
    ENC_INT = 'I',
    ENC_LONG = 'L',
    ENC_WORD = 'W',
    ENC_UNSIGNED = 'U',
    ENC_SIGNED = 'S',
}
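
Since the enumeration is backed by char, its members can be compared directly against characters read from the input. A minimal sketch of such a comparison (the predicate itself is hypothetical, purely for illustration):

import tlang.compiler.lexer.core.lexer : LexerSymbols; // module path assumed

// Hypothetical predicate: does this character terminate a statement?
bool endsStatement(char c)
{
    return c == LexerSymbols.SEMI_COLON;
}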

Helper methods

There are also quite a few helper methods which are commonly used across the lexer implementation and are therefore worth being aware of. You can find all of these within the tlang.compiler.lexer.core.lexer module.

| Method name | Return type | Description |
|-------------|-------------|-------------|
| isOperator(char c) | bool | Checks if the provided character is an operator, returning true if so |
| isSplitter(char c) | bool | Checks if the provided character is a splitter, returning true if so |
| isNumericalEncoder_Size(char) | bool | Checks if the provided character is a numerical size encoder |
| isNumericalEncoder_Signage(char) | bool | Checks if the provided character is a numerical signage encoder |
| isNumericalEncoder(char) | bool | Checks if the provided character is either a numerical size encoder or a signage encoder |
| isValidEscape_String(char) | bool | Checks if the given character is a valid escape character (something which would have followed a \) |
| isValidDotPrecede(char) | bool | Checks if the given character may validly precede a '.' |
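
To give a feel for these predicates, the assertions below show the results one would expect based on the descriptions above and the character constants listed earlier (an illustrative sketch, not taken from the actual test suite):

unittest
{
    import tlang.compiler.lexer.core.lexer;

    // '+' is an operator, 'a' is not
    assert(isOperator('+'));
    assert(!isOperator('a'));

    // Whitespace splits tokens
    assert(isSplitter(' '));

    // 'U' (ENC_UNSIGNED) encodes signage, 'I' (ENC_INT) encodes size
    assert(isNumericalEncoder_Signage('U'));
    assert(isNumericalEncoder_Size('I'));
}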

The Token

A Token represents, well, a token: a sub-string produced by following the grammar.

| Method name | Return type | Description |
|-------------|-------------|-------------|
| this(string, ulong, ulong) | Constructor | Constructs a new Token with the given contents, followed by the line and column numbers |
| opEquals(Object other) | bool | Overrides the behaviour of == to allow comparing Token(s) based on content |
| getToken() | string | Returns the contents of this token |
| toString() | string | Returns a string representation of the token, including its data and line information |

Below, as an example of the API, we show how you can compare tokens (if you ever needed to):

// Create two tokens containing the contents "int"
Token token1 = new Token("int", 0, 0);
Token token2 = new Token("int", 0, 0);

// Compare them for equality
// (evaluates to true based on content equality,
// rather than the default reference equality in D)
assert(token1 == token2);

The LexerException

This is a simple exception type which extends the TError exception type (the base error type used within the TLang compiler system).

It is rather simple; the constructor takes in the following (in order of appearance):

  1. LexerInterface
    • The offending instance of the lexer which generated this exception
    • This is so that coordinate information (the \((x, y)\) source text pointer) can be added to error messages
  2. LexerError
    • An optional parameter which defaults to LexerError.OTHER
    • The base reason for the exception
  3. string
    • An optional parameter which defaults to ""
    • The custom error text

The LexerError is an enumeration type comprising the following members:

/** 
 * The specified error which occurred
 */
public enum LexerError
{
    /** 
     * If all the characters were
     * exhausted
     */
    EXHAUSTED_CHARACTERS,

    /** 
     * Generic error
     */
    OTHER
}
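
As a hypothetical illustration (the guard function and module paths here are assumptions, not actual TLang code), an implementation could raise this exception as follows:

import tlang.compiler.lexer.core.lexer : LexerInterface;
// LexerException and LexerError are assumed importable from the same core package

// Hypothetical guard inside a lexer implementation which raises a
// LexerException when the input runs out mid-token
void expectMoreCharacters(LexerInterface lexer, bool hasMore)
{
    if(!hasMore)
    {
        // The (line, column) coordinates for the error message come from `lexer`
        throw new LexerException(lexer, LexerError.EXHAUSTED_CHARACTERS, "Ran out of characters mid-token");
    }
}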

Implementation of the single-pass tokenizer

The current lexer implementation in use is the BasicLexer (available at source/tlang/compiler/lexer/kinds/basic.d), and it is a single-pass lexer. This means that before you use any methods such as nextToken() you must first have called performLex() so that the Token[] can be generated.

This is not the most efficient approach; a streaming lexer is not yet implemented, but one is planned.

Overview

A quick overview of some of the fields which are used for tracking the state of the token building process:

| Name | Type | Purpose |
|------|------|---------|
| sourceCode | string | The whole input program (as a string) to be tokenized |
| position | ulong | Holds the index of the current character in the sourceCode string |
| currentChar | char | The current character at index position |
| tokens | Token[] | The list of the currently built tokens |
| line | ulong | Current line the tokenizer is on (with respect to the source code input) |
| column | ulong | Current column the tokenizer is on (with respect to the source code input) |
| currentToken | string | The token string that is currently being built up, char-by-char |

The implementation of the lexer, the BasicLexer class, is explained in detail in this section. The lexical analysis is done one-shot via the performLex() method, which attempts to tokenize the input program, returning false on failure and true otherwise. In the successful case the tokens array will have been filled with the created tokens, which can then be retrieved via a call to getTokens().

Below is an example usage of the BasicLexer which processes the following input source code:

new A().l.p.p;

Here is the code to do so:

/**
 * Test correct handling of the dot-operator for
 * non-floating point cases
 *
 * Input: `new A().l.p.p;`
 */
unittest
{
    import std.algorithm.comparison;
    import std.conv : to;
    import gogga; // for gprintln (assumed logging module)

    string sourceCode = "new A().l.p.p;";
    BasicLexer currentLexer = new BasicLexer(sourceCode);
    currentLexer.performLex();
    gprintln("Collected "~to!(string)(currentLexer.getTokens()));
    assert(currentLexer.getTokens() == [
        new Token("new", 0, 0),
        new Token("A", 0, 0),
        new Token("(", 0, 0),
        new Token(")", 0, 0),
        new Token(".", 0, 0),
        new Token("l.p.p", 0, 0),
        new Token(";", 0, 0)
    ]);
}

Using performLex()

This method contains a looping structure which will read character-by-character from the sourceCode string and follow the rules of the grammar, looping whilst there are still characters available for consumption (position < sourceCode.length).

We loop through each character and, dependent on its value, begin building new tokens. Certain characters cause the current token to finish being built, which is often triggered by isSplitter(character) being true. A typical token building procedure involves tacking the final character onto the current token build-up, creating a new Token object and appending it to the tokens list, and finishing off by flushing the build-up string and incrementing the coordinates:

/* Generate and add the token */
currentToken ~= "'";
tokens ~= new Token(currentToken, line, column);

/* Flush the token */
currentToken = "";

/* Move the coordinates along */
column += 2;
position += 2;

Character and token availability

These are helper functions relating to character and token availability.

| Method name | Return type | Description |
|-------------|-------------|-------------|
| hasToken() | bool | Returns true if there is a token currently being built (i.e. currentToken.length != 0), false otherwise |
| isBackward() | bool | Returns true if we can move the character pointer backwards, false otherwise |
| isForward() | bool | Returns true if we can move the character pointer forward, false otherwise |
| isNumericalStr() | bool | Checks whether the build-up, currentToken, is a valid numerical string: false if the string is empty or contains anything other than digits, true otherwise (see the sketch below) |
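
Going purely by the description above, isNumericalStr() behaves like the following re-implementation (an illustrative sketch, not the actual source):

import std.ascii : isDigit;

// Returns true only for a non-empty, all-digit build-up
bool isNumericalStr(string input)
{
    // An empty build-up is not a valid numerical string
    if(input.length == 0)
    {
        return false;
    }

    // Any non-digit character disqualifies the string
    foreach(char c; input)
    {
        if(!isDigit(c))
        {
            return false;
        }
    }

    return true;
}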

Grammar-wise

These are all the methods which pertain to the construction of tokens based on different states of the state machine.

These methods follow a common convention whereby they return true if there are characters left in the buffer which can still be processed after they return, and false if none are left.

| Method name | Return type | Description |
|-------------|-------------|-------------|
| doIdentOrPath() | bool | Processes an identifier, with or without a dot-path |
| doChar() | bool | Tokenizes a character |
| doString() | bool | Tokenizes a string |
| doComment() | bool | Processes the various different types of comments |
| doEscapeCode() | bool | Lexes an escape code; if a valid one is found it is added to the token, else an exception is thrown |
| doNumber() | bool | Lexes a number: a plain number, a float or a numerically-encoded number |
| doEncoder() | bool | Lexes a numerical encoder |
| doFloat() | bool | Lexes a floating point; the initial part of the number is lexed by doNumber() |

Buffer management

These are methods for managing the advancement of the lexing pointer, the \((x, y)\) coordinates (used for error reporting) and so forth.

| Method name | Return type | Description |
|-------------|-------------|-------------|
| flush() | void | Flushes the current token to the token buffer (see the sketch below) |
| buildAdvance() | bool | Consumes the current character into the current token; returns true on a non-empty buffer |
| improvedAdvance(int inc = 1, bool shouldFlush = false) | bool | Advances the source code pointer |
| advanceLine() | bool | Advances the position, line and current token, resetting the column to 1; returns true on a non-empty buffer |
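
Based on the field overview and the description above, flush() plausibly behaves like the sketch below (illustrative only; the body is an assumption, with field names matching the overview table):

// Illustrative sketch of flush(), imagined inside BasicLexer
private void flush()
{
    // Move the built-up token into the token buffer...
    tokens ~= new Token(currentToken, line, column);

    // ...and reset the build-up for the next token
    currentToken = "";
}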