Lexical analysis
Lexical analysis is the process of taking a program as an input string \(A\) and splitting it into a list of \(n\) sub-strings \(A_{1},\,A_{2},\ldots,A_{n}\) called tokens. The length \(n\) of this list is dependent on several rules that determine how, when and where new tokens are built; this set of rules is called a grammar.
Grammar
The grammar is described in the language section and can be viewed alongside this section for some context.
Overview
The source code for the lexical analysis part of the compiler is located in source/tlang/compiler/lexer/, which contains a few important module and class definitions.
Lexer API
The Lexer API describes the methods that a tokenizer must implement in order to be used within the TLang compiler infrastructure. The reason for describing such an interface is so that improved tokenizers can be integrated later on without requiring much churn in the parts of the code base that use the lexer (namely the parser).
The API is described in the table below. The file in question, source/tlang/compiler/lexer/core/lexer.d, contains the `LexerInterface`:
| Method name | Return type | Description |
|---|---|---|
| `getCurrentToken()` | `Token` | Returns the `Token` at the current cursor position |
| `nextToken()` | `void` | Moves the cursor forward once |
| `previousToken()` | `void` | Moves the cursor backwards once |
| `setCursor(ulong)` | `void` | Sets the cursor's position to the given index |
| `getCursor()` | `ulong` | Returns the cursor's current position |
| `hasTokens()` | `bool` | Returns `true` if there are more tokens to be consumed, otherwise `false` |
| `getLine()` | `ulong` | Returns the line number the lexer is at |
| `getColumn()` | `ulong` | Returns the column number the lexer is at |
| `getTokens()` | `Token[]` | Exhausts the lexer's token stream and returns all gathered tokens in an array |
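As an illustrative sketch (not taken from the code base), the cursor-based API can be used to walk a token stream as follows; here `lexer` is assumed to be some `LexerInterface` implementation:

```d
import std.conv : to;
import std.stdio : writeln;

void walkTokens(LexerInterface lexer)
{
    // Walk the token stream one token at a time
    while(lexer.hasTokens())
    {
        Token curToken = lexer.getCurrentToken();
        writeln("Token at cursor "~to!(string)(lexer.getCursor())~": "~curToken.toString());

        // Advance the cursor to the next token
        lexer.nextToken();
    }
}
```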
Character constants
For completeness we include the commonly used character constant definitions. These come in the form of an enumeration type, as shown below:
```d
public enum LexerSymbols : char
{
    L_PAREN = '(',
    R_PAREN = ')',
    SEMI_COLON = ';',
    COMMA = ',',
    L_BRACK = '[',
    R_BRACK = ']',
    PLUS = '+',
    MINUS = '-',
    FORWARD_SLASH = '/',
    PERCENT = '%',
    STAR = '*',
    AMPERSAND = '&',
    L_BRACE = '{',
    R_BRACE = '}',
    EQUALS = '=',
    SHEFFER_STROKE = '|',
    CARET = '^',
    EXCLAMATION = '!',
    TILDE = '~',
    DOT = '.',
    COLON = ':',
    SPACE = ' ',
    TAB = '\t',
    NEWLINE = '\n',
    DOUBLE_QUOTE = '"',
    SINGLE_QUOTE = '\'',
    BACKSLASH = '\\',
    UNDERSCORE = '_',
    LESS_THAN = '<',
    BIGGER_THAN = '>',

    ESC_NOTHING = '0',
    ESC_CARRIAGE_RETURN = 'r',
    ESC_TAB = 't',
    ESC_NEWLINE = 'n',
    ESC_BELL = 'a',

    ENC_BYTE = 'B',
    ENC_INT = 'I',
    ENC_LONG = 'L',
    ENC_WORD = 'W',
    ENC_UNSIGNED = 'U',
    ENC_SIGNED = 'S',
}
```
Helper methods
There are also quite a few helper methods which are commonly used across the lexer implementation and are therefore worth being aware of. You can find all of these within the tlang.compiler.lexer.core.lexer module.
| Method name | Return type | Description |
|---|---|---|
| `isOperator(char c)` | `bool` | Checks if the provided character is an operator, returning `true` if so |
| `isSplitter(char c)` | `bool` | Checks if the provided character is a splitter, returning `true` if so |
| `isNumericalEncoder_Size(char)` | `bool` | Checks if the provided character is a numerical size encoder |
| `isNumericalEncoder_Signage(char)` | `bool` | Checks if the provided character is a numerical signage encoder |
| `isNumericalEncoder(char)` | `bool` | Checks if the provided character is either a numerical size encoder or a signage encoder |
| `isValidEscape_String(char)` | `bool` | Checks if the given character is a valid escape character (something which would follow a `\`) |
| `isValidDotPrecede(char)` | `bool` | Checks whether the given character may validly precede a `.` |
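As a brief sketch of how these helpers are typically used (assuming the module is imported, and assuming, as the grammar suggests, that `;` counts as a splitter and `+` as an operator), a tokenizer can classify the current character before deciding how to extend or finish a token:

```d
import tlang.compiler.lexer.core.lexer;

// Illustrative classification of a few characters
char c = ';';
if(isSplitter(c))
{
    // A splitter ends the token currently being built
}
else if(isOperator('+'))
{
    // Operators become single-character tokens of their own
}
```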
The Token
A `Token` represents a single token produced by following the grammar.
| Method name | Return type | Description |
|---|---|---|
| `this(string, ulong, ulong)` | Constructor | Constructs a new `Token` with the given contents, followed by line and column numbers |
| `opEquals(Object other)` | `bool` | Overrides the behaviour of `==` to allow comparing `Token`s based on content |
| `getToken()` | `string` | Returns the contents of this token |
| `toString()` | `string` | Returns a string representation of the token, including its data and line information |
Below, as an example of the API, we show how you can compare tokens (if you ever need to):
```d
// Create two tokens containing the contents "int"
Token token1 = new Token("int");
Token token2 = new Token("int");

// Compare them for equality; this evaluates to true because
// opEquals compares by content, rather than by reference
// equality (the default in D)
assert(token1 == token2);
```
The LexerException
This is a simple exception type which extends the TError exception type (the base error type used within the TLang compiler system).
It is rather simple; the constructor takes in the following (in order of appearance):
- `LexerInterface`
  - The offending instance of the lexer which generated this exception
  - This is so that coordinate information (the \((x, y)\) source-text pointer) can be added to error messages
- `LexerError`
  - An optional parameter which defaults to `LexerError.OTHER`
  - The base reason for the exception
- `string`
  - An optional parameter which defaults to `""`
  - The custom error text
The `LexerError` is an enumeration type comprised of the following members:
```d
/**
 * The specified error which occurred
 */
public enum LexerError
{
    /**
     * If all the characters were
     * exhausted
     */
    EXHAUSTED_CHARACTERS,

    /**
     * Generic error
     */
    OTHER
}
```
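As a hedged sketch of how such an exception might be raised from within a tokenizer (the exact call sites live in the lexer implementations, and the `hasCharacters()` helper here is hypothetical):

```d
// Illustrative only: signal that the input ran out mid-token.
// `this` refers to the current LexerInterface implementation.
if(!hasCharacters())
{
    throw new LexerException(this, LexerError.EXHAUSTED_CHARACTERS, "Ran out of characters whilst lexing");
}
```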
Implementation of the single-pass tokenizer
The current lexer implementation in use is the BasicLexer (available at source/tlang/compiler/lexer/kinds/basic.d), and it is a one-pass lexer. This means that before you use any methods such as nextToken() you must first have called performLex() on it, such that the Token[] can be generated.
This is not the most efficient approach; a streaming lexer is not yet implemented, but it is planned.
Overview
A quick overview of some of the fields which are used for tracking the state of the token building process:
| Name | Type | Purpose |
|---|---|---|
| `sourceCode` | `string` | The whole input program (as a string) to be tokenized |
| `position` | `ulong` | Holds the index of the current character in the string `sourceCode` |
| `currentChar` | `char` | The character at index `position` |
| `tokens` | `Token[]` | The list of currently built tokens |
| `line` | `ulong` | The current line the tokenizer is on (with respect to the source code input) |
| `column` | `ulong` | The current column the tokenizer is on (with respect to the source code input) |
| `currentToken` | `string` | The token string currently being built up, character by character |
The implementation of the lexer, the BasicLexer class, is explained in detail in this section. The lexical analysis is done in one shot via the performLex() method, which attempts to tokenize the input program, returning false on failure and true otherwise. In the successful case the tokens array will be filled with the created tokens, which can later be retrieved via a call to getTokens().
Below is an example usage of the BasicLexer which processes the input source code `new A().l.p.p;`:
```d
/**
 * Test correct handling of dot-operator for
 * non-floating point cases
 *
 * Input: `new A().l.p.p;`
 */
unittest
{
    import std.conv : to; // needed for to!(string)

    string sourceCode = "new A().l.p.p;";
    BasicLexer currentLexer = new BasicLexer(sourceCode);
    currentLexer.performLex();
    gprintln("Collected "~to!(string)(currentLexer.getTokens()));
    assert(currentLexer.getTokens() == [
        new Token("new", 0, 0),
        new Token("A", 0, 0),
        new Token("(", 0, 0),
        new Token(")", 0, 0),
        new Token(".", 0, 0),
        new Token("l.p.p", 0, 0),
        new Token(";", 0, 0)
    ]);
}
```
Using performLex()
This method contains a looping structure which reads character-by-character from the sourceCode string and follows the rules of the grammar, looping whilst there are still characters available for consumption (position < sourceCode.length).
We loop through each character and, dependent on its value, start building new tokens. Certain characters cause a token to finish being built, often because isSplitter(character) is true. A typical token building process consists of the final character being tacked onto the current token build-up, the creation of a new token object and its addition to the tokens list, finishing with flushing the build-up string and incrementing the coordinates. Such a procedure looks something like this:
```d
/* Generate and add the token */
currentToken ~= "'";
currentTokens ~= new Token(currentToken, line, column);

/* Flush the token */
currentToken = "";
column += 2;
position += 2;
```
Character and token availability
Helper functions relating to character and token availability.
| Method name | Return type | Description |
|---|---|---|
| `hasToken()` | `bool` | Returns `true` if there is a token currently being built (i.e. `currentToken.length != 0`), `false` otherwise |
| `isBackward()` | `bool` | Returns `true` if the character pointer can be moved backwards, `false` otherwise |
| `isForward()` | `bool` | Returns `true` if the character pointer can be moved forward, `false` otherwise |
| `isNumericalStr()` | `bool` | Checks whether the build-up, `currentToken`, is a valid numerical string. If the string is empty, or non-empty but contains anything other than digits, it returns `false`; otherwise it returns `true` |
Grammar-wise
These are all the methods which pertain to the construction of tokens based on the different states of the state machine. They follow a common convention: each returns true if there are characters left in the buffer which can still be processed after returning, or false if there are none left.
| Method name | Return type | Description |
|---|---|---|
| `doIdentOrPath()` | `bool` | Processes an identifier with or without a dot-path |
| `doChar()` | `bool` | Tokenizes a character |
| `doString()` | `bool` | Tokenizes a string |
| `doComment()` | `bool` | Processes the various different types of comments |
| `doEscapeCode()` | `bool` | Lexes an escape code; if a valid one is found it is added to the token, else an exception is thrown |
| `doNumber()` | `bool` | Lexes a number: a plain number, a float or a numerically encoded value |
| `doEncoder()` | `bool` | Lexes a numerical encoder |
| `doFloat()` | `bool` | Lexes a floating-point number; the initial part of the number is lexed by the `doNumber()` method |
Buffer management
These are methods for managing the advancement of the lexing pointer, the \((x, y)\) coordinates (used for error reporting) and so forth.
| Method name | Return type | Description |
|---|---|---|
| `flush()` | `void` | Flushes the current token to the token buffer |
| `buildAdvance()` | `bool` | Consumes the current character into the current token; returns `true` on a non-empty buffer |
| `improvedAdvance(int inc = 1, bool shouldFlush = false)` | `bool` | Advances the source code pointer |
| `advanceLine()` | `bool` | Advances the position, line and current token, resetting the column to 1; returns `true` on a non-empty buffer |
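As a hedged sketch of how these primitives compose (this is not the actual implementation, merely an illustration built from the methods in the tables above), consuming a run of characters into a single token might look like:

```d
// Illustrative composition of the buffer-management primitives:
// consume characters into the current token until a splitter is
// reached, then flush the build-up as a completed token
while(isForward() && !isSplitter(currentChar))
{
    buildAdvance();
}
flush();
```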