Feature: Lexer improvements (#3)

* Grammar

- Updated grammar for the numbers formatting

* Grammar

- Aligned `::=`

* Generate

- Updated markdown

* Lexer

- Added character constants source code listing

* Lexical

- Typo fix

* Generated

* Lexical analysis

- Added a section on "Character constants"
- Added a section on "Helper methods"

* Lexical analysis

- Added missing method

* Lexical analysis

- Removed old section on `isSpliter()`
- Removed the old "Others" section
- Documented the Grammar-wise methods

* Lexical analysis

- Documented the buffer management methods

* Lexical analysis

- Removed `stringMode` and `floatMode`

* Lexical analysis

- Documented `LexerException`
- Documented `LexerError`

* Lexical analysis

- Added missing grammar link

* Lexical analysis

- Fixed grammar link

* Lexical analysis

- Removed last TODO
- Fixed typo

* Generated

- Pandoc plot

* Lexical analysis

- Updated heading
- Changed header sizes
This commit is contained in:
Tristan B. Velloza Kildaire 2023-12-27 08:24:19 +02:00 committed by GitHub
parent 85837645bd
commit e182b6ee00
4 changed files with 304 additions and 107 deletions


@ -8,11 +8,11 @@ tokens are built - this set of rules is called a *grammar*.
### Grammar
TODO: fix the link The grammar is described in the [language
section](31-grammar.md) and can be viewed alongside this section for
some context.
The grammar is described in the [language
section](../../language/31-grammar/) and can be viewed alongside this
section for some context.
### Overview of files
### Overview
The source code for the lexical analysis part of the compiler is located
in `source/tlang/compiler/lexer/` which contains a few important module
@ -43,7 +43,79 @@ The API is described in the table below and the file in question is in
| `getColumn()` | `ulong` | Returns the column number the lexer is at |
| `getTokens()` | `Token[]` | Exhausts the lexers token stream and returns all gathered tokens in an array |
##### the `Token`
#### Character constants
For completeness we include the commonly used character constant
definitions. These come in the form of an enumeration type, as shown
below:
``` d
public enum LexerSymbols : char
{
L_PAREN = '(',
R_PAREN = ')',
SEMI_COLON = ';',
COMMA = ',',
L_BRACK = '[' ,
R_BRACK = ']' ,
PLUS = '+' ,
MINUS = '-' ,
FORWARD_SLASH = '/' ,
PERCENT = '%' ,
STAR = '*' ,
AMPERSAND = '&' ,
L_BRACE = '{' ,
R_BRACE = '}' ,
EQUALS = '=' ,
SHEFFER_STROKE = '|' ,
CARET = '^' ,
EXCLAMATION = '!' ,
TILDE = '~' ,
DOT = '.' ,
COLON = ':',
SPACE = ' ',
TAB = '\t',
NEWLINE = '\n',
DOUBLE_QUOTE = '"',
SINGLE_QUOTE = '\'' ,
BACKSLASH = '\\' ,
UNDERSCORE = '_' ,
LESS_THAN = '<' ,
BIGGER_THAN = '>' ,
ESC_NOTHING = '0' ,
ESC_CARRIAGE_RETURN = 'r' ,
ESC_TAB = 't' ,
ESC_NEWLINE = 'n' ,
ESC_BELL = 'a' ,
ENC_BYTE = 'B' ,
ENC_INT = 'I' ,
ENC_LONG = 'L' ,
ENC_WORD = 'W' ,
ENC_UNSIGNED = 'U' ,
ENC_SIGNED = 'S' ,
}
```
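Since the enum's base type is `char`, its members compare directly against plain characters. Below is a minimal illustrative check; the import path assumes the enum lives in the `tlang.compiler.lexer.core.lexer` module alongside the helper methods, and `endsStatement` is a hypothetical helper, not part of any documented API:

``` d
import tlang.compiler.lexer.core.lexer : LexerSymbols;

// Hypothetical helper: does this character end a statement?
bool endsStatement(char c)
{
    // Enum members implicitly convert to their char base type
    return c == LexerSymbols.SEMI_COLON;
}
```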
#### Helper methods
There are also quite a few helper methods which are commonly used
across the lexer implementation and are therefore worth being aware of.
You can find all of these within the `tlang.compiler.lexer.core.lexer`
module.
| Method name | Return type | Description |
|------------------------------------|-------------|-------------------------------------------------------------------------------------------------------|
| `isOperator(char c)` | `bool` | Checks if the provided character is an operator, returning `true` if so |
| `isSplitter(char c)` | `bool` | Checks if the provided character is a splitter, returning `true` if so |
| `isNumericalEncoder_Size(char)` | `bool` | Checks if the provided character is a numerical size encoder |
| `isNumericalEncoder_Signage(char)` | `bool` | Checks if the provided character is a numerical signage encoder |
| `isNumericalEncoder(char)` | `bool` | Checks if the provided character is either a numerical size encoder or signage encoder |
| `isValidEscape_String(char)` | `bool` | Checks if the given character is a valid escape character (something which would have followed a `\`) |
| `isValidDotPrecede(char)`          | `bool`      | Given a character, returns whether it is valid as the character preceding a '.'                       |
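The table above can be made concrete with a few expected results. This sketch assumes the predicates are publicly importable; the expectations follow from the character constants listed earlier:

``` d
import tlang.compiler.lexer.core.lexer;

unittest
{
    // '+' appears in both the operator and splitter sets
    assert(isOperator('+'));
    assert(isSplitter('+'));

    // 'U' is the unsigned marker, i.e. a signage encoder
    assert(isNumericalEncoder_Signage('U'));
    assert(isNumericalEncoder('U'));

    // 'n' is a valid character to follow a backslash in an escape
    assert(isValidEscape_String('n'));
}
```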
#### the `Token`
A `Token` represents, well, a token which is produced in following the
grammar.
@ -68,8 +140,48 @@ Token token2 = new Token("int");
assert(token1 == token2);
```
TODO: Document `LexerException` and `LexerError` (see:
https://deavmi.assigned.network/git/tlang/tlang/src/branch/vardec_varass_dependency/source/tlang/compiler/lexer/core/exceptions.d)
#### the `LexerException`
This is a simple exception type which extends the `TError` exception
type (the base type used within the TLang compiler system).
It is rather simple; the constructor takes in the following (in order of
appearance):
1. `LexerInterface`
- We take in the offending instance of the lexer used which
generated this exception
- This is so that coordinate information (the $(x,y)$ source text
pointer) can be added into error messages
2. `LexerError`
- This is an **optional** parameter which defaults to
`LexerError.OTHER`
- Base reason for the exception
3. `string`
- This is an **optional** parameter which defaults to `""`
- This is the custom error text
The `LexerError` is an enumeration type that is comprised of the
following members:
``` d
/**
* The specified error which occurred
*/
public enum LexerError
{
/**
* If all the characters were
* exhausted
*/
EXHAUSTED_CHARACTERS,
/**
* Generic error
*/
OTHER
}
```
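Based on the constructor order described above, raising such an error from within the lexer might look as follows. This is only a sketch: the module paths are inferred from the file locations mentioned in this section, and `bailOut` is an illustrative name, not an actual method of the lexer:

``` d
import tlang.compiler.lexer.core.lexer : LexerInterface;
import tlang.compiler.lexer.core.exceptions : LexerException, LexerError;

// Sketch: signal that input ended in the middle of a token.
// Both the LexerError and the message arguments are optional.
void bailOut(LexerInterface lexer)
{
    throw new LexerException(lexer, LexerError.EXHAUSTED_CHARACTERS,
                             "input exhausted mid-token");
}
```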
------------------------------------------------------------------------
@ -99,20 +211,12 @@ state of the token building process:
| `column` | `ulong` | Current column the tokenizer is on (with respect to the source code input) |
| `currentToken` | `string` | The token string that is currently being built-up, char-by-char |
There are also some auxiliary flags used for processing particular parts
of the grammar:
| Name | Type | Purpose |
|--------------|--------|---------------------------------------------------------------------------------|
| `stringMode` | `bool` | Whether we are currently building up a string (e.g. `"we are here"`) or not |
| `floatMode` | `bool` | Whether we are currently building up a floating-point literal (e.g. `3.5`) or not |
The implementation of the lexer, the `Lexer` class, is explained in
detail in this section. (TODO: constructor) The lexical analysis is done
one-shot via the `performLex()` method which will attempt to tokenize
the input program, on failure returning `false`, `true` otherwise. In
the successful case the `tokens` array will be filled with the created
tokens and can then later be retrieved via a call to `getTokens()`.
The implementation of the lexer, the `BasicLexer` class, is explained in
detail in this section. The lexical analysis is done one-shot via the
`performLex()` method which will attempt to tokenize the input program,
on failure returning `false`, `true` otherwise. In the successful case
the `tokens` array will be filled with the created tokens and can then
later be retrieved via a call to `getTokens()`.
Below is an example usage of the `BasicLexer` which makes use of it in
order to process the following input source code:
@ -155,12 +259,13 @@ unittest
This method contains a looping structure which will read
character-by-character from the `sourceCode` string and follow the rules
of the grammar (TODO: add link), looping whilst there are still
characters available for consumption (`position < sourceCode.length`).
of the [grammar](../../language/31-grammar/), looping whilst there are
still characters available for consumption
(`position < sourceCode.length`).
We loop through each character and dependent on its value we start
building new tokens, certain characters will cause a token to finish
being built which will sometimes be caused by `isSpliter(character)`
being built which will sometimes be caused by `isSplitter(character)`
being `true`. A typical token building process looks something like the
following, containing the final character to be tacked onto the current
token build up, the creation of a new token object and the addition of
@ -191,39 +296,35 @@ Helper functions relating to character and token availability.
| `isForward()` | `bool` | Returns `true` if we can move the character pointer forward, `false` otherwise. |
| `isNumericalStr()` | `bool` | This method is called in order to check if the build-up, `currentToken`, is a valid numerical string. If the string is empty, then it returns `false`. If the string is non-empty and contains anything other than digits then it returns `false`; otherwise it returns `true`. |
TODO: There are probably some missing but the above are the most
general/used helper methods
#### Grammar-wise
#### isSpliter()
These are all the methods which pertain to the construction of tokens
based on different states of the state machine.
This method checks if the given character is one of the following:
These methods follow a common convention whereby they return `true` if
there are characters left in the buffer which can still be processed
after they return, or `false` if there are none left.
``` d
character == ';' || character == ',' || character == '(' ||
character == ')' || character == '[' || character == ']' ||
character == '+' || character == '-' || character == '/' ||
character == '%' || character == '*' || character == '&' ||
character == '{' || character == '}' || character == '=' ||
character == '|' || character == '^' || character == '!' ||
character == '\n' || character == '~' || character =='.' ||
character == ':';
```
| Method name | Return type | Description |
|-------------------|-------------|-------------------------------------------------------------------------------------------|
| `doIdentOrPath()` | `bool` | Processes an ident with or without a dot-path |
| `doChar()` | `bool` | Tokenizes a character |
| `doString()` | `bool` | Tokenizes a string |
| `doComment()` | `bool` | Processes various different types of comments |
| `doEscapeCode()`  | `bool`      | Lex an escape code. If a valid one is found, add it to the token, else throw an exception |
| `doNumber()`      | `bool`      | Lex a number; this method lexes a plain number, a float, or a numerically encoded number  |
| `doEncoder()` | `bool` | Lex a numerical encoder |
| `doFloat()` | `bool` | Lex a floating point, the initial part of the number is lexed by the `doNumber()` method. |
Whenever this method returns `true` it generally means you should flush
the current token, start a new token, add the offending splitter token
and flush that as well.
#### Buffer management
### Others
These are methods for managing the advancement of the lexing pointer,
the position of $(x, y)$ coordinates (used for error reporting) and so
forth.
TODO: Document the other methods remaining
| Method name | Return type | Description |
|------------------------------------|-------------|-------------|
| `numbericalEncoderSegmentFetch()` | `x` | Desc. |
| `isBuildUpNumerical()` | `x` | Desc. |
| `isNumericalStr(string)` | `x` | Desc. |
| `isSpliter(char)` | `x` | Desc. |
| `isNumericalEncoder(char)` | `x` | Desc. |
| `isNumericalEncoder_Size(char)` | `x` | Desc. |
| `isNumericalEncoder_Signage(char)` | `x` | Desc. |
| `isValidEscape_String(char)` | `x` | Desc. |
| Method name | Return type | Description |
|----------------------------------------------------------|-------------|---------------------------------------------------------------------------------------------------------|
| `flush()` | `void` | Flush the current token to the token buffer. |
| `buildAdvance()` | `bool` | Consume the current char into the current token, returns `true` on non-empty buffer |
| `improvedAdvance(int inc = 1, bool shouldFlush = false)` | `bool` | Advances the source code pointer |
| `advanceLine()` | `bool` | Advance the position, line and current token, reset the column to 1. Returns `true` on non-empty buffer |
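To show how the buffer-management methods cooperate with the grammar-wise ones, here is a hypothetical fragment of handling a splitter character inside the main lexing loop. It is illustrative only and not taken from the implementation:

``` d
// Hypothetical handling of a splitter inside the main lexing loop.
if(isSplitter(character))
{
    // Emit whatever token has been built up so far
    flush();

    // Consume the splitter as a one-character token of its own
    currentToken ~= character;
    flush();

    // Move the source pointer past the splitter
    improvedAdvance();
}
```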


@ -23,17 +23,20 @@ These are the basic types of comments supported.
These are the basic atoms that define literals.
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"
| "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V"
| "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r"
| "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z";
number ::= ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9") [encoder];
encoder ::= "S" | "B" | "W" | "I" | "L" | "UB" | "UW" | "UI" | "UL" | "SB"
digit ::= ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9")
number ::= digit {(digit | underscore)} ["." digit {(digit | underscore)}] [encoder]
encoder ::= "S" | "B" | "W" | "I" | "L" | "UB" | "UW" | "UI" | "UL" | "SB"
| "SW" | "SI" | "SL";
float ::= (number | {number}), ".", (number | {number});
underscore ::= "_"
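Worked examples for the revised `number` production, derived by hand from the rules above (not an exhaustive test suite):

```
123        accepted   digit {digit}
1_000      accepted   underscores may follow the first digit
3.14       accepted   optional fractional part
1_0.5_0UL  accepted   fraction plus the "UL" encoder
_1         rejected   a number must start with a digit
1._5       rejected   the fractional part must start with a digit
```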
### Expressions


@ -4,10 +4,9 @@ Lexical analysis is the process of taking a program as an input string $A$ and s
### Grammar
TODO: fix the link
The grammar is described in the [language section](31-grammar.md) and can be viewed alongside this section for some context.
The grammar is described in the [language section](../../language/31-grammar/) and can be viewed alongside this section for some context.
### Overview of files
### Overview
The source code for the lexical analysis part of the compiler is located in `source/tlang/compiler/lexer/` which contains a few important module and class definitions.
@ -29,7 +28,77 @@ The API is described in the table below and the file in question is in `source/t
| `getColumn()` | `ulong` | Returns the column number the lexer is at |
| `getTokens()` | `Token[]` | Exhausts the lexer's token stream and returns all gathered tokens in an array |
##### the `Token`
#### Character constants
For completeness we include the commonly used character constant definitions. These come in
the form of an enumeration type, as shown below:
```d
public enum LexerSymbols : char
{
L_PAREN = '(',
R_PAREN = ')',
SEMI_COLON = ';',
COMMA = ',',
L_BRACK = '[' ,
R_BRACK = ']' ,
PLUS = '+' ,
MINUS = '-' ,
FORWARD_SLASH = '/' ,
PERCENT = '%' ,
STAR = '*' ,
AMPERSAND = '&' ,
L_BRACE = '{' ,
R_BRACE = '}' ,
EQUALS = '=' ,
SHEFFER_STROKE = '|' ,
CARET = '^' ,
EXCLAMATION = '!' ,
TILDE = '~' ,
DOT = '.' ,
COLON = ':',
SPACE = ' ',
TAB = '\t',
NEWLINE = '\n',
DOUBLE_QUOTE = '"',
SINGLE_QUOTE = '\'' ,
BACKSLASH = '\\' ,
UNDERSCORE = '_' ,
LESS_THAN = '<' ,
BIGGER_THAN = '>' ,
ESC_NOTHING = '0' ,
ESC_CARRIAGE_RETURN = 'r' ,
ESC_TAB = 't' ,
ESC_NEWLINE = 'n' ,
ESC_BELL = 'a' ,
ENC_BYTE = 'B' ,
ENC_INT = 'I' ,
ENC_LONG = 'L' ,
ENC_WORD = 'W' ,
ENC_UNSIGNED = 'U' ,
ENC_SIGNED = 'S' ,
}
```
#### Helper methods
There are also quite a few helper methods which are commonly used across the lexer implementation
and are therefore worth being aware of. You can find all of these within the `tlang.compiler.lexer.core.lexer`
module.
| Method name | Return type | Description |
|---------------------------|---------------|-------------------------------------------------------------------------------|
| `isOperator(char c)` | `bool` | Checks if the provided character is an operator, returning `true` if so |
| `isSplitter(char c)` | `bool` | Checks if the provided character is a splitter, returning `true` if so |
| `isNumericalEncoder_Size(char)` | `bool` | Checks if the provided character is a numerical size encoder |
| `isNumericalEncoder_Signage(char)` | `bool` | Checks if the provided character is a numerical signage encoder |
| `isNumericalEncoder(char)` | `bool` | Checks if the provided character is either a numerical size encoder or signage encoder |
| `isValidEscape_String(char)` | `bool` | Checks if the given character is a valid escape character (something which would have followed a `\`) |
| `isValidDotPrecede(char)` | `bool` | Given a character, returns whether it is valid as the character preceding a '.' |
#### the `Token`
A `Token` represents, well, a token which is produced in following the grammar.
@ -52,7 +121,42 @@ Token token2 = new Token("int");
assert(token1 == token2);
```
TODO: Document `LexerException` and `LexerError` (see: https://deavmi.assigned.network/git/tlang/tlang/src/branch/vardec_varass_dependency/source/tlang/compiler/lexer/core/exceptions.d)
#### the `LexerException`
This is a simple exception type which extends the `TError` exception type (the base type used within the TLang compiler system).
It is rather simple; the constructor takes in the following (in order of appearance):
1. `LexerInterface`
* We take in the offending instance of the lexer used which generated this exception
* This is so that coordinate information (the $(x,y)$ source text pointer) can be added into error messages
2. `LexerError`
* This is an **optional** parameter which defaults to `LexerError.OTHER`
* Base reason for the exception
3. `string`
* This is an **optional** parameter which defaults to `""`
* This is the custom error text
The `LexerError` is an enumeration type that is comprised of the following members:
```{.d}
/**
* The specified error which occurred
*/
public enum LexerError
{
/**
* If all the characters were
* exhausted
*/
EXHAUSTED_CHARACTERS,
/**
* Generic error
*/
OTHER
}
```
---
@ -77,17 +181,8 @@ A quick overview of some of the fields which are used for tracking the state of
| `column` | `ulong` | Current column the tokenizer is on (with respect to the source code input) |
| `currentToken` | `string` | The token string that is currently being built-up, char-by-char |
There are also some auxiliary flags used for processing particular parts of the grammar:
| Name | Type | Purpose |
|-------------------|-----------------|--------------------------------------------------------------------------------|
| `stringMode` | `bool` | Whether we are currently building up a string (e.g. `"we are here"`) or not |
| `floatMode` | `bool` | Whether we are currently building up a floating-point literal (e.g. `3.5`) or not |
The implementation of the lexer, the `Lexer` class, is explained in detail in this section. (TODO: constructor) The lexical analysis is done one-shot via the `performLex()` method which will attempt to tokenize the input program, on failure returning `false`, `true` otherwise. In the successful case the `tokens` array will be filled with the created tokens and can then later be retrieved via a call to `getTokens()`.
The implementation of the lexer, the `BasicLexer` class, is explained in detail in this section. The lexical analysis is done one-shot via the `performLex()` method which will attempt to tokenize the input program, on failure returning `false`, `true` otherwise. In the successful case the `tokens` array will be filled with the created tokens and can then later be retrieved via a call to `getTokens()`.
Below is an example usage of the `BasicLexer` which makes use of it in order to process the following input source code:
@ -127,9 +222,9 @@ unittest
#### Using `performLex()`
This method contains a looping structure which will read character-by-character from the `sourceCode` string and follow the rules of the grammar (TODO: add link), looping whilst there are still characters available for consumption (`position < sourceCode.length`).
This method contains a looping structure which will read character-by-character from the `sourceCode` string and follow the rules of the [grammar](../../language/31-grammar/), looping whilst there are still characters available for consumption (`position < sourceCode.length`).
We loop through each character and dependent on its value we start building new tokens, certain characters will cause a token to finish being built which will sometimes be caused by `isSpliter(character)` being `true`. A typical token building process looks something like the following, containing the final character to be tacked onto the current token build up, the creation of a new token object and the addition of it to the `tokens` list, finishing with flushing the build up string and incrementing the coordinates:
We loop through each character and dependent on its value we start building new tokens, certain characters will cause a token to finish being built which will sometimes be caused by `isSplitter(character)` being `true`. A typical token building process looks something like the following, containing the final character to be tacked onto the current token build up, the creation of a new token object and the addition of it to the `tokens` list, finishing with flushing the build up string and incrementing the coordinates:
A typical token building procedure looks something like this:
@ -156,37 +251,32 @@ Helper functions relating to character and token availability.
| `isNumericalStr()` | `bool` | This method is called in order to check if the build-up, `currentToken`, is a valid numerical string. If the string is empty, then it returns `false`. If the string is non-empty and contains anything other than digits then it returns `false`; otherwise it returns `true`. |
#### Grammar-wise
TODO: There are probably some missing but the above are the most general/used helper methods
These are all the methods which pertain to the construction of tokens based on different states of the state machine.
#### isSpliter()
These methods follow a common convention whereby they return `true` if there are characters left in the buffer
which can still be processed after they return, or `false` if there are none left.
This method checks if the given character is one of the following:
| Method name | Return type | Description |
|---------------------------|---------------|-------------------------------------------------------------------------------|
| `doIdentOrPath()` | `bool` | Processes an ident with or without a dot-path |
| `doChar()` | `bool` | Tokenizes a character |
| `doString()` | `bool` | Tokenizes a string |
| `doComment()` | `bool` | Processes various different types of comments |
| `doEscapeCode()`          | `bool`        | Lex an escape code. If a valid one is found, add it to the token, else throw an exception |
| `doNumber()`              | `bool`        | Lex a number; this method lexes a plain number, a float, or a numerically encoded number  |
| `doEncoder()` | `bool` | Lex a numerical encoder |
| `doFloat()` | `bool` | Lex a floating point, the initial part of the number is lexed by the `doNumber()` method. |
```{.d .numberLines}
character == ';' || character == ',' || character == '(' ||
character == ')' || character == '[' || character == ']' ||
character == '+' || character == '-' || character == '/' ||
character == '%' || character == '*' || character == '&' ||
character == '{' || character == '}' || character == '=' ||
character == '|' || character == '^' || character == '!' ||
character == '\n' || character == '~' || character =='.' ||
character == ':';
```
#### Buffer management
Whenever this method returns `true` it generally means you should flush the current token, start a new token, add the offending splitter token and flush that as well.
These are methods for managing the advancement of the lexing pointer, the position of $(x, y)$ coordinates (used for error
reporting) and so forth.
### Others
TODO: Document the other methods remaining
| Method name | Return type | Description |
|-----------------------------------------|---------------|-------------------------------------------------------------------------------|
| `numbericalEncoderSegmentFetch()` | `x` | Desc. |
| `isBuildUpNumerical()` | `x` | Desc. |
| `isNumericalStr(string)` | `x` | Desc. |
| `isSpliter(char)` | `x` | Desc. |
| `isNumericalEncoder(char)` | `x` | Desc. |
| `isNumericalEncoder_Size(char)` | `x` | Desc. |
| `isNumericalEncoder_Signage(char)` | `x` | Desc. |
| `isValidEscape_String(char)` | `x` | Desc. |
| Method name | Return type | Description |
|---------------------------|---------------|-------------------------------------------------------------------------------|
| `flush()` | `void` | Flush the current token to the token buffer. |
| `buildAdvance()` | `bool` | Consume the current char into the current token, returns `true` on non-empty buffer |
| `improvedAdvance(int inc = 1, bool shouldFlush = false)` | `bool` | Advances the source code pointer |
| `advanceLine()` | `bool` | Advance the position, line and current token, reset the column to 1. Returns `true` on non-empty buffer |


@ -25,17 +25,20 @@ anything ::= ? all ASCII characters excluding newline ?
These are the basic atoms that define literals.
```
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"
letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K"
| "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V"
| "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r"
| "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z";
number ::= ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9") [encoder];
encoder ::= "S" | "B" | "W" | "I" | "L" | "UB" | "UW" | "UI" | "UL" | "SB"
digit ::= ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9")
number ::= digit {(digit | underscore)} ["." digit {(digit | underscore)}] [encoder]
encoder ::= "S" | "B" | "W" | "I" | "L" | "UB" | "UW" | "UI" | "UL" | "SB"
| "SW" | "SI" | "SL";
float ::= (number | {number}), ".", (number | {number});
underscore ::= "_"
```
### Expressions