## Parsing
Once we have generated a list of tokens (instances of `Token`) from the
`Lexer` instance, we need to turn these into a structure that represents
our program's source code, *but* using in-memory data structures which we
can traverse and process at a later stage.
### Overview
The `Parser` class contains several methods for parsing the different
sub-structures of a TLang program, each returning a different data type.
The parser has the ability to move back and forth within the token stream
provided and fetch the current token, along with analysing it to determine
the type of symbol the token represents - known as its `SymbolType` (see
the "Symbol types" section below).
For example, the method `parseIf()` is used to parse if statements; it is
called upon encountering an `if` token and returns an instance of type
`IfStatement`. Then there are methods like `parseBody()` which are
responsible for making several sub-calls to methods such as `parseIf()`,
building up a list of `Statement` instances (the top-type for all parser
nodes).

The entry point to call is `parse()`, which will return an instance of
type `Module`.

!!! info
    The entry point handling may change soon with the advent of proper
    module support
### API
The API exposed by the parser is rather minimal as there isn't much more
to a parser than controlling the token stream pointer (the position in the
token stream), fetching the current token and acting upon the type or
value of said token. Therefore we have the methods summarised below (a
short usage sketch follows this list):
1. `nextToken()`
- Moves the token pointer to the next token
2. `previousToken()`
- Moves the token pointer to the previous token
3. `getCurrentToken()`
- Returns the current `Token` instance at the current token
pointer position
4. `hasTokens()`
- Returns `true` if there are tokens still left in the stream
(i.e. `tokenPtr < tokens.length`), `false` otherwise
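
Using only these methods a parsing routine can already walk the entire
token stream. Below is a hedged sketch, written as if it were a method
inside the `Parser` class; `walkTokens()` itself is hypothetical and not
part of the real parser:
``` d
/* Hypothetical helper, shown only to illustrate the API above */
void walkTokens()
{
    while(hasTokens())
    {
        /* Fetch the token at the current pointer position */
        Token tok = getCurrentToken();

        /* Classify it (see the "Symbol types" section below) */
        SymbolType symbol = getSymbolType(tok);

        /* ... act upon `symbol` or the token's value here ... */

        /* Advance the token pointer */
        nextToken();
    }
}
```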
### Initialization
The initialization of the parser is rather simple: an instance of the
`Parser` class must be instantiated, and the following argument must be
provided to the constructor:
1. `Token[] tokens`
- This is an array of `Token` to be provided to the parser for
parsing. This would have been derived from the `Lexer` via its
`performLex()` and `getTokens()` calls.
A new instance would therefore be created with something akin to:
``` d
// Tokenize the following program
string sourceCode = "int i = 2;";
Lexer lexer = new Lexer(sourceCode);
lexer.performLex();

// Extract the tokens and pass them to the parser
Token[] tokens = lexer.getTokens();
Parser parser = new Parser(tokens);
```
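
From here the parser can be driven to completion via the entry point
described earlier. This continuation is illustrative; only `parse()` and
its `Module` return type are taken from the sections above:
``` d
// Parse the token stream into the module-level AST node
Module program = parser.parse();
```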
### Symbol types
The token stream is effectively a list of instances of `Token` which
consist just of the token itself as a string and the coordinates of the
token (where it occurs). However, some tokens, despite being different
strings, can be of the same type or *syntactical grouping*. For example,
one would agree that the tokens `1.5` and `25.2` are different tokens but
that both are floating-point numbers. This is where the notion of symbol
types comes in.
The enum `SymbolType` in `parsing/symbols/check.d` describes all of the
available types of tokens there are in the grammar of the Tristan
programming language like so:
``` d
public enum SymbolType {
    LE_SYMBOL,
    IDENT_TYPE,
    NUMBER_LITERAL,
    CHARACTER_LITERAL,
    STRING_LITERAL,
    SEMICOLON,
    LBRACE,
    ...
}
```
Given an instance of `Token` one can pass it to the
`getSymbolType(Token)` method which will then return an enum member from
`SymbolType`. When a token has no associated symbol type then
`SymbolType.UNKNOWN` is returned. Now for an example:
``` d
// Create a new token with (0, 0) as its coordinates
Token token = new Token("100", 0, 0);
// Get the symbol type
SymbolType symType = getSymbolType(token);
assert(symType == SymbolType.NUMBER_LITERAL);
```
This assertion would pass as the symbol type of such a token is a number
literal.
#### API
The API for working with and using `SymbolType`s is made available
within `parsing/symbols/check.d` and contains the following methods (a
short usage example follows this list):
1. `isType(string)`
- Returns `true` if the given string (a token) is a built-in type
- Built-in type strings would be:
`byte, ubyte, short, ushort, int, uint, long, ulong, void`
2. `getSymbolType(Token)`
- Returns the `SymbolType` associated with the given `Token`
- If the token is not of a valid type then `SymbolType.UNKNOWN` is
returned
3. `getCharacter(SymbolType)`
- This performs the reverse of `getSymbolType(Token)` in the sense
that you provide it a `SymbolType` and it will return the
corresponding string that is of that type.
- This will work only for back-mapping a sub-section of tokens: you
won't get anything back if you provide `SymbolType.IDENT_TYPE`, as
there are infinite possibilities for that - it is not a fixed token.
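
As a short illustrative sketch of these three routines (the expected
results follow from the descriptions above rather than from verified
output):
``` d
// isType(): recognise built-in type names
assert(isType("int"));

// getSymbolType(): classify a concrete token
Token semicolonTok = new Token(";", 0, 0);
assert(getSymbolType(semicolonTok) == SymbolType.SEMICOLON);

// getCharacter(): back-map a fixed symbol type to its token string
assert(getCharacter(SymbolType.SEMICOLON) == ";");
```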
### Data types
Every node returned by a `parseX()` method is of a certain type and there
are some important types to mention here. The following types are from
either `parsing/data.d` or `parsing/containers.d`.
#### `Statement`
The `Statement` type is the top-type for most parse nodes; it has the
following important methods and fields (a brief parenting sketch follows
this list):
1. `weight`
- This holds a `byte` value which is used when statements are
required to be re-ordered. It defaults to 0, which is the most
prioritised re-ordering value (i.e. a smaller weight means the
statement appears first)
2. `parentOf()`
- This returns an instance of `Container`, specifically the container
that this Statement has been parented to.
- It can be `null` if this Statement was not parented.
3. `parentTo(Container)`
- Set the parenting `Container` of this Statement to the one
provided.
4. `toString()`
- The default string representation method for Statements (unless
overridden) is to show a rolling count which is incremented with
every instantiation of a Statement object.
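
As a brief sketch of the parenting relationship (the helper function
here is hypothetical; only `parentTo(Container)` and `parentOf()` come
from the list above):
``` d
/* Hypothetical helper: attach a statement to a container and
 * then climb back up to it via parentOf() */
void attachExample(Statement stmt, Container owner)
{
    stmt.parentTo(owner);
    assert(stmt.parentOf() is owner);
}
```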
#### `Entity`
The `Entity` type is a sub-type of `Statement` and represents any named
entity, along with its initialization scope and accessor type (TODO:
these are not yet implemented semantically). The following methods and
fields are of note:
1. `this(string)`
- Constructs a new instance of an Entity with the provided name.
2. `getName()`
- Returns the name of the entity.
3. `setAccessorType(AccessorType accessorType)`
- TODO: Describe this
4. `getAccessorType()`
- TODO: Describe this
5. `setModifierType(InitScope initScope)`
- TODO: Describe this
6. `InitScope getModifierType()`
- TODO: Describe this
7. `bool isExternal()`
- If this returns `true` then it signals that this Entity should be
emitted as an external symbol rather than as one found in the
current T module
8. `void makeExternal()`
- Mark this Entity as external
- You will see this used in `parseExtern()` as that is where we
need to mark entities as external for link-time resolution
#### `Container`
The `Container` type is an interface which requires an implementing type
to provide a set of methods. These methods allow the type to *become* a
container by letting one or more instances of `Statement` (i.e. a
`Statement[]`) be contained within it.

It should be noted that the parenting methods are used to climb up the
hierarchy **given** a `Statement` instance, whereas the `Container`
technique is useful for a top-down search for an `Entity` - they are
independent in that sense but can be used together (TODO: double check
but I believe this is right).
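
For a feel of what such an interface amounts to, here is a hedged
sketch; the exact method names below are assumptions and the
authoritative definition lives in `parsing/containers.d`:
``` d
/* A sketch of a Container-style interface; method names are illustrative
 * and may differ from the real definition in parsing/containers.d */
public interface Container
{
    /* Add a single Statement to this container */
    void addStatement(Statement statement);

    /* Add several Statements at once */
    void addStatements(Statement[] statements);

    /* Retrieve the Statements contained so far */
    Statement[] getStatements();
}
```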
### How to parse
The basic flow of the parser involves the following process:
1. Firstly you need an entry point; for us this entry point is the
`parse()` method, which will return an instance of `Module`
representing the module - the TLang program.
2. Every `parseX()` method gets called by another such method dependent
on the current symbol (and sometimes a lookahead)
- For example, sometimes when we come across
`SymbolType.IDENTIFIER` we call `parseName()` which can then
either call `parseFuncCall()`, `parseTypedDeclaration()` or
`parseAssignment()`. This requires a lookahead to check what
follows the identifier because just by itself it is too
ambiguous grammatically.
- After determining what comes next, the token is pushed back using
`previousToken()` and then we proceed into the correct function
- Lookaheads are rare but they do appear in situations like that
(see the sketch after this list)
3. The `parseX()` methods return instances of `Statement` which is the
top type for all parser-generated nodes or *AST nodes*.
4. When you are about to parse a sub-section (like an if statement) of
a bigger syntax group (like a body) you leave the *offending token*
as the current token, then you call the parsing method (in this case
`parseIf()`) and let it handle the call to `nextToken()` - this is
simply the structure of parsing that TLang follows.
5. Upon exiting a `parseX()` method you call `nextToken()` - this
determines whether the method continues parsing or not; if not, then
you return and the caller will continue with that current token and
move on from there.
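
To make the lookahead idea from point 2 concrete, here is a hedged
sketch using only the token stream API from earlier; the method
`peekSymbolAfterIdentifier()` is hypothetical and not the actual body of
`parseName()`:
``` d
/* Hypothetical illustration of a one-token lookahead: peek at the symbol
 * following the current identifier, then rewind so the chosen parseX()
 * method still sees the identifier as the current token */
SymbolType peekSymbolAfterIdentifier()
{
    /* Step over the identifier to look at what follows it */
    nextToken();
    SymbolType following = getSymbolType(getCurrentToken());

    /* Push the token pointer back to the identifier */
    previousToken();

    return following;
}
```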
#### Example of parsing if-statements
We will now look at an example of how we deal with parsing if statements
in our parser, specifically within `parseBody()`. This method begins by
moving us off the offending token that made us call `parseBody()` (hence
the call to `nextToken()`), after which we set up an array of `Statement`
such that we can build up a body of them:
``` d
gprintln("parseBody(): Enter", DebugType.WARNING);
Statement[] statements;
/* Consume the `{` symbol */
nextToken();
```
Now we are within the body. As you can imagine, a body is made up of
several statements, of which we do not know how many there are.
Therefore we set up a loop that will iterate until we run out of tokens:
``` d
while (hasTokens())
{
    ...
}
```
The next thing we want to do is grab the current token and check what
type of symbol it is:
``` d
while (hasTokens())
{
    /* Get the token */
    Token tok = getCurrentToken();
    SymbolType symbol = getSymbolType(tok);
    gprintln("parseBody(): SymbolType=" ~ to!(string)(symbol));
    ...
}
```
Following this we now have several checks that make use of
`getSymbolType(Token)` in order to determine what the token's type is.
In our case, if the token is `if` then we make a call to `parseIf()` and
append the returned `Statement` sub-type to the body of statements
(`Statement[]`):
``` d
while(hasTokens())
{
    ...
    /* If it is a branch */
    else if (symbol == SymbolType.IF)
    {
        statements ~= parseIf();
    }
    ...
}
```
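
The same pattern repeats for every other construct that a body may
contain. For instance, a branch for while loops would look structurally
identical; the `SymbolType.WHILE` member and `parseWhile()` method shown
here are assumptions used purely for illustration:
``` d
while(hasTokens())
{
    ...
    /* If it is a while loop (illustrative; these names are assumptions) */
    else if (symbol == SymbolType.WHILE)
    {
        statements ~= parseWhile();
    }
    ...
}
```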