Parsing
Once we have generated a list of tokens (instances of Token
) from the
Lexer
instance we need to turn these into a structure that represents
our program’s source code but using in-memory data-structures which we
can traverse and process at a later stage.
Overview
The Parser
class contains several methods for parsing different
sub-structures of a TLang program and returning different data types
generated by these methods. The parser has the ability to move back and
forth between the token stream provided and fetch the current token
(along with analysing it to return the type of symbol the token
represents - known as the SymbolType
(TODO: Cite the “Symbol types”
section).
For example, the method parseIf()
is used to parse if statements, it
is called on the occurence of the token of if
. This method returns an
instance of type IfStatement
. Then there are methods like
parseBody()
which is responsible for creating several sub-calls to
methods such as parseIf()
and building up a list of Statement
instances (the top-type for all parser nodes).
The entry point to call is parse()
which will return an instance of
type Module
.
Info
proper module support
API
The API exposed by the parser is rather minimal as there isn’t much to a parser than controlling the token stream pointer (the position in the token stream), fetching the token and acting upon the type or value of said token. Therefore we have the methods summarised below:
nextToken()
- Moves the token pointer to the next token
previousToken()
- Moves the token pointer to the previous token
getCurrentToken()
- Returns the current
Token
instance at the current token pointer position
- Returns the current
hasTokens()
- Returns
true
if there are tokens still left in the stream (i.e.tokenPtr < tokens.length
),false
otherwise
- Returns
Initialization
The initialization of the parser is rather simple, an instance of the
Parser
class must be instantiated, along with this the following
arguments must be provided to the constructor:
Token[] tokens
- This is an array of
Token
to be provided to the parser for parsing. This would have been derived from theLexer
via itsperformLex()
andgetTokens()
call.
- This is an array of
A new instance woud therefore be created with something akin to:
// Tokenize the following program
string sourceCode = "int i = 2;"
Lexer lexer = new Lexer(sourceCode);
lexer.performLex();
// Extract tokens and pass to the lexer
Token[] tokens = lexer.getTokens();
Parser parser = new Parser(tokens);
Symbol types
The token stream is effectively a list of instances of Token
which
consist just of the token itself as a string and the coordinates of the
token (where it occurs). However, some tokens, despite being different
strings, can be of the same type or syntactical grouping. For example
one would agree that both tokens 1.5
and 25.2
are both different
tokens but are both floating points. This is where the notion of symbol
types comes in.
The enum SymbolType
in parsing/symbols/check.d
describes all of the
available types of tokens there are in the grammar of the Tristan
programming language like so:
public enum SymbolType {
LE_SYMBOL,
IDENT_TYPE,
NUMBER_LITERAL,
CHARACTER_LITERAL,
STRING_LITERAL,
SEMICOLON,
LBRACE,
...
}
Given an instance of Token
one can pass it to the
getSymbolType(Token)
method which will then return an enum member from
SymbolType
. When a token has no associated symbol type then
SymbolType.UNKNOWN
is returned. Now for an example:
// Create a new token at with (0, 0) as coordinates
Token token = new Token("100", 0, 0);
// Get the symbol type
SymbolType symType = getSymbolType(token);
assert(symType == SymbolType.NUMBER_LITERAL);
This assertion would pass as the symbol type of such a token is a number literal.
API
The API for working with and using SymbolType
s is made available
within the parsing/data/check.d
and contains the following methods:
isType(string)
- Returns
true
if the given string (a token) is a built-in type - Built-in type strings would be:
byte, ubyte, short, ushort, int, uint, long, ulong, void
- Returns
getSymbolType(Token)
- Returns the
SymbolType
associated with the givenToken
- If the token is not of a valid type then
SymbolType.UNKNOWN
is returned
- Returns the
getCharacter(SymbolType)
- This performs the reverse of
getSymbolType(Token)
in the sense that you provide it aSymbolType
and it will return the corresponding string that is of that type. - This will work only for back-mapping a sub-section of tokens as
you won’t get anything back if you provide
SymbolType.IDENT_TYPE
as there are infinite possibiltiies for that - not a fixed token.
- This performs the reverse of
Data types
Every node returned by a parseX()
is of a certain type and there are
some important types to mention here. The following types are from
either parsing/data.d
or parsing/containers.d
.
Statement
The Statement
type is the top-type for most parse nodes, it has the
following important methods and fields:
weight
- This holds a
byte
value which is used for when statements are required to be re-ordered. It starts default at 0 whereby that is the most prioritized re-ordering value (i.e. smaller means you appear first)
- This holds a
parentOf()
- This returns an instance of
Container
, specifically indicating of which container this Statement is a parent of. - It can be
null
if this Statement was not parented.
- This returns an instance of
parentTo(Container)
- Set the parenting
Container
of this Statement to the one provided.
- Set the parenting
toString()
- The default string representtion method for Statements (unless overridden) is to show a rolling count which is increment with every instantiation of a Statement object.
Entity
The Entity
type is a sub-type of Statement
and represents any named
entity, along with initialization scopes (TODO: these are not yet
implemented semantically and accessor types) (TODO: these are not yet
implemented semantically.) The following methods and fields are to note:
this(string)
- Constructs a new instance of an Entity with the provided name.
getName()
- Returns the name of the entity.
setAccessorType(AccessorType accessorType)
- TODO: Describe this
getAccessorType()
- TODO: Describe this
setModifierType(InitScope initScope)
- TODO: Describe this
InitScope getModifierType()
- TODO: Describe this
bool isExternal()
- If this returns
true
then it is a signal that this Entity should be emitted in a manner pertaining to an external symbol rather than one found in the current T module
- If this returns
void makeExternal()
- Mark this Entity as external
- You will see this used in
parseExtern()
as that is where we need to mark entities as external for link-time resolution
Container
The Container
type is an interface that specifies a certain type to
implement a set of methods. These methods allow the type to become a
container by then allowing one or more instances of Statement
or
rather a Statement[]
to be contained within the container i.e. making
it contain them.
It should be noted that the parenting method is used to climb up the hierachy given a Statement instance, however the Container technique is useful for a top-down search for an Entity - they are independent in that sense but can be used toghether TODO: double check but I believe this is right.
How to parse
The basic flow of the parser involves the following process:
- Firstly you need an entry point, this entry point for us is the
parse()
method which will return an instance ofModule
which represents the module - the TLang program. - Every
parseX()
method gets called by another such method dependent on the current symbol (and sometimes a lookahead)- For example, sometimes when we come across
SymbolType.IDENTIFIER
we callparseName()
which can then either callparseFuncCall()
,parseTypedDeclaration()
orparseAssignment()
. This requires a lookahead to check what follows the identifier because just by itself it is too ambuguous grammatically. - After determining what comes next the token is pushed back using
previousToken()
and then we proceed into the correct function - Lookaheads are rare but they do appear in situations like that
- For example, sometimes when we come across
- The
parseX()
methods return instances ofStatement
which is the top type for all parser-generated nodes or AST nodes. - When you are about to parse a sub-section (like an if statement) of
a bigger syntax group (like a body) you leave the offending token
as the current token, then you call the parsing method (in this case
parseIf()
) and let it handle the call tonextToken()
- this is simply the structure of parsing that TLang follows. - Upon exiting a
parseX()
method you callnextToken()
- this determines whether this method would continue parsing or not - if not then you return and the caller will continue with that current token and move on from there.
Example of parsing if-statements
We will now look at an example of how we deal with parsing if statements
in our parser, specifically within the parseBody()
. The beginning of
this method starts by moving us off the offending token that made us
call parseBody()
(hence the call to nextToken()
). After which we
setup an array of Statement
such that we can build up a body of them:
gprintln("parseBody(): Enter", DebugType.WARNING);
Statement[] statements;
/* Consume the `{` symbol */
nextToken();
Now we are within the body, as you can imagine a body is to be made up of several statements of which we do not know how many there are. Therefore we setup a loop that will iterate till we run out of tokens:
Next thing we want to do if grab the current token and check what type of symbol it is:
while (hasTokens())
{
/* Get the token */
Token tok = getCurrentToken();
SymbolType symbol = getSymbolType(tok);
gprintln("parseBody(): SymbolType=" ~ to!(string)(symbol));
...
}
Following this we now have several checks that make use of
getSymbolType(Token)
in order to determine what the token’s type is
and then in our case if the token is "if"
then we will make a call to
parseIf()
and append the returned Statement-sub-type to the body of
statements (Statement[]
):