Parsing
Once we have generated a list of tokens (instances of Token
) from the
Lexer
instance we need to turn these into a structure that represents
our program’s source code but using in-memory data-structures which we
can traverse and process at a later stage.
Overview
The Parser
class contains several methods for parsing different
sub-structures of a TLang program and returning different data types
generated by these methods. The parser has the ability to move back and
forth between the token stream provided and fetch the current token
(along with analysing it to return the type of symbol the token
represents - known as the SymbolType
(TODO: Cite the “Symbol types”
section).
For example, the method parseIf()
is used to parse if statements, it
is called on the occurence of the token of if
. This method returns an
instance of type IfStatement
. Then there are methods like
parseBody()
which is responsible for creating several sub-calls to
methods such as parseIf()
and building up a list of Statement
instances (the top-type for all parser nodes).
The entry point to call is parse()
which will return an instance of
type Module
.
Info
proper module support
API
The API exposed by the parser is rather minimal as there isn’t much to a parser than controlling the token stream pointer (the position in the token stream), fetching the token and acting upon the type or value of said token. Therefore we have the methods summarised below:
nextToken()
- Moves the token pointer to the next token
previousToken()
- Moves the token pointer to the previous token
getCurrentToken()
- Returns the current
Token
instance at the current token pointer position
- Returns the current
hasTokens()
- Returns
true
if there are tokens still left in the stream (i.e.tokenPtr < tokens.length
),false
otherwise
- Returns
Initialization
The initialization of the parser is rather simple, an instance of the
Parser
class must be instantiated, along with this the following
arguments must be provided to the constructor:
Token[] tokens
- This is an array of
Token
to be provided to the parser for parsing. This would have been derived from theLexer
via itsperformLex()
andgetTokens()
call.
- This is an array of
A new instance woud therefore be created with something akin to:
// Tokenize the following program
string sourceCode = "int i = 2;"
Lexer lexer = new Lexer(sourceCode);
lexer.performLex();
// Extract tokens and pass to the lexer
Token[] tokens = lexer.getTokens();
Parser parser = new Parser(tokens);
Symbol types
The token stream is effectively a list of instances of Token
which
consist just of the token itself as a string and the coordinates of the
token (where it occurs). However, some tokens, despite being different
strings, can be of the same type or syntactical grouping. For example
one would agree that both tokens 1.5
and 25.2
are both different
tokens but are both floating points. This is where the notion of symbol
types comes in.
The enum SymbolType
in parsing/symbols/check.d
describes all of the
available types of tokens there are in the grammar of the Tristan
programming language like so:
public enum SymbolType {
LE_SYMBOL,
IDENT_TYPE,
NUMBER_LITERAL,
CHARACTER_LITERAL,
STRING_LITERAL,
SEMICOLON,
LBRACE,
...
}
Given an instance of Token
one can pass it to the
getSymbolType(Token)
method which will then return an enum member from
SymbolType
. When a token has no associated symbol type then
SymbolType.UNKNOWN
is returned. Now for an example:
// Create a new token at with (0, 0) as coordinates
Token token = new Token("100", 0, 0);
// Get the symbol type
SymbolType symType = getSymbolType(token);
assert(symType == SymbolType.NUMBER_LITERAL);
This assertion would pass as the symbol type of such a token is a number literal.
API
The API for working with and using SymbolType
s is made available
within the parsing/data/check.d
and contains the following methods:
isType(string)
- Returns
true
if the given string (a token) is a built-in type - Built-in type strings would be:
byte, ubyte, short, ushort, int, uint, long, ulong, void
- Returns
getSymbolType(Token)
- Returns the
SymbolType
associated with the givenToken
- If the token is not of a valid type then
SymbolType.UNKNOWN
is returned
- Returns the
getCharacter(SymbolType)
- This performs the reverse of
getSymbolType(Token)
in the sense that you provide it aSymbolType
and it will return the corresponding string that is of that type. - This will work only for back-mapping a sub-section of tokens as
you won’t get anything back if you provide
SymbolType.IDENT_TYPE
as there are infinite possibiltiies for that - not a fixed token.
- This performs the reverse of
Data types
Every node returned by a parseX()
is of a certain type and there are
some important types to mention here. The following types are from
either parsing/data.d
or parsing/containers.d
.
Statement
The Statement
type is the top-type for most parse nodes, it has the
following important methods and fields:
weight
- This holds a
byte
value which is used for when statements are required to be re-ordered. It starts default at 0 whereby that is the most prioritized re-ordering value (i.e. smaller means you appear first)
- This holds a
parentOf()
- This returns an instance of
Container
, specifically indicating of which container this Statement is a parent of. - It can be
null
if this Statement was not parented.
- This returns an instance of
parentTo(Container)
- Set the parenting
Container
of this Statement to the one provided.
- Set the parenting
toString()
- The default string representtion method for Statements (unless overridden) is to show a rolling count which is increment with every instantiation of a Statement object.
Entity
The Entity
type is a sub-type of Statement
and represents any named
entity, along with initialization scopes (TODO: these are not yet
implemented semantically and accessor types) (TODO: these are not yet
implemented semantically.) The following methods and fields are to note:
this(string)
- Constructs a new instance of an Entity with the provided name.
getName()
- Returns the name of the entity.
setAccessorType(AccessorType accessorType)
- TODO: Describe this
getAccessorType()
- TODO: Describe this
setModifierType(InitScope initScope)
- TODO: Describe this
InitScope getModifierType()
- TODO: Describe this
bool isExternal()
- If this returns
true
then it is a signal that this Entity should be emitted in a manner pertaining to an external symbol rather than one found in the current T module
- If this returns
void makeExternal()
- Mark this Entity as external
- You will see this used in
parseExtern()
as that is where we need to mark entities as external for link-time resolution
Container
The Container
type is an interface that specifies a certain type to
implement a set of methods. These methods allow the type to become a
container by then allowing one or more instances of Statement
or
rather a Statement[]
to be contained within the container i.e. making
it contain them.
It should be noted that the parenting method is used to climb up the hierachy given a Statement instance, however the Container technique is useful for a top-down search for an Entity - they are independent in that sense but can be used toghether TODO: double check but I believe this is right.
How to parse
The basic flow of the parser involves the following process:
- Firstly you need an entry point, this entry point for us is the
parse()
method which will return an instance ofModule
which represents the module - the TLang program. - Every
parseX()
method gets called by another such method dependent on the current symbol (and sometimes a lookahead)- For example, sometimes when we come across
SymbolType.IDENTIFIER
we callparseName()
which can then either callparseFuncCall()
,parseTypedDeclaration()
orparseAssignment()
. This requires a lookahead to check what follows the identifier because just by itself it is too ambuguous grammatically. - After determining what comes next the token is pushed back using
previousToken()
and then we proceed into the correct function - Lookaheads are rare but they do appear in situations like that
- For example, sometimes when we come across
- The
parseX()
methods return instances ofStatement
which is the top type for all parser-generated nodes or AST nodes. - When you are about to parse a sub-section (like an if statement) of
a bigger syntax group (like a body) you leave the offending token
as the current token, then you call the parsing method (in this case
parseIf()
) and let it handle the call tonextToken()
- this is simply the structure of parsing that TLang follows. - Upon exiting a
parseX()
method you callnextToken()
- this determines whether this method would continue parsing or not - if not then you return and the caller will continue with that current token and move on from there.
Example of parsing if-statements
We will now look at an example of how we deal with parsing if statements
in our parser, specifically within the parseBody()
. The beginning of
this method starts by moving us off the offending token that made us
call parseBody()
(hence the call to nextToken()
). After which we
setup an array of Statement
such that we can build up a body of them:
gprintln("parseBody(): Enter", DebugType.WARNING);
Statement[] statements;
/* Consume the `{` symbol */
nextToken();
Now we are within the body, as you can imagine a body is to be made up of several statements of which we do not know how many there are. Therefore we setup a loop that will iterate till we run out of tokens:
Next thing we want to do if grab the current token and check what type of symbol it is:
while (hasTokens())
{
/* Get the token */
Token tok = getCurrentToken();
SymbolType symbol = getSymbolType(tok);
gprintln("parseBody(): SymbolType=" ~ to!(string)(symbol));
...
}
Following this we now have several checks that make use of
getSymbolType(Token)
in order to determine what the token’s type is
and then in our case if the token is "if"
then we will make a call to
parseIf()
and append the returned Statement-sub-type to the body of
statements (Statement[]
):
while(hasTokens())
{
...
/* If it is a branch */
else if (symbol == SymbolType.IF)
{
statements ~= parseIf();
}
...
}
Import system
What is a program?
Before we continue we should quickly discuss what is a program. The
Program
type is defined in a rather simple manner. It is a kind-of
Container
(a type ypu shall see described more in detail later) and
hence has the methods for adding or querying Statement
(s) from/in
itself.
What makes a program unique is that it will only allow you to add
Statement
(s) to it which are of the Module
type, and here is where
the definition comes in.
A program is a set of modules
There are also methods that relate to how this is managed but that is
discussed in a later section on the topic of module management. All
you are required to know here is that programs can hold modules.
Notably too, a program is not any kind-of Entity
and hence has no
name associated with it, the first such Entity
within the tree which
does is that of its associated modules.
Determining what to import
We can now move onto the crux of the matter which is “How does the parser manage importing of modules?”.
First we must observe that import
statements are only valid at the
module-level, meaning that you will only ever see a call to
parseImport()
from the code within the parse(string, boolean)
as
follows:
So then, how does this work. Well, compared to other parts of the parser this is one which actually has to maintain state and makes use of multiple parsers in a recursive manner. Therefore it is worth delving deeper into as compared to other topics of parsing which are rather straight forward.
Steps:
The first few steps are rather simple, and are what you would expect from any other parsing method within the parser, but nonetheless they aid us in determining a set of important variables:
- First consume the token
import
- Now expect an identifier kind-of
SymbolType
, i.e. a name, then save and consume - Check if there is a
,
symbol, if so we then loop whilst we have a,
- Each iteration saving the name found (i.e.
a, b, c;
)
- Each iteration saving the name found (i.e.
- Expect a semi-colon (
;
) and consume it
At the end of this we should have a list of modules wanting to be
imported, namely collectedModuleNames
.
The code is attached below:
/* Consume the `import` keyword */
lexer.nextToken();
/* Get the module's name */
expect(SymbolType.IDENT_TYPE, lexer.getCurrentToken());
string moduleName = lexer.getCurrentToken().getToken();
/* Consume the token */
lexer.nextToken();
/* All modules to be imported */
string[] collectedModuleNames = [moduleName];
/* Try process multi-line imports (if any) */
while(getSymbolType(lexer.getCurrentToken()) == SymbolType.COMMA)
{
/* Consume the comma `,` */
lexer.nextToken();
/* Get the module's name */
expect(SymbolType.IDENT_TYPE, lexer.getCurrentToken());
string curModuleName = lexer.getCurrentToken().getToken();
collectedModuleNames ~= curModuleName;
/* Consume the name */
lexer.nextToken();
}
/* Expect a semi-colon and consume it */
expect(SymbolType.SEMICOLON, lexer.getCurrentToken());
lexer.nextToken();
Visiting modules
We now have an array of modules wanting to be imported in the form of
collectedModuleNames
. But what do we do with these now?
Well, we need to obviously open up the modules but that means two
three things:
- How do we map a module name
b
to a filename? - How do we prevent cycles when visiting modules
- How do we add them all to the correct program?
Well, lucky for you the answers are all here - albeit, it did take a lot of time to get this albeit simple-sounding system right.
The mapping of names is discussed in the module management section but
needless to say it is performed by the current Program
’s
ModuleManager
which can be retrieved via this.program.getModMan()
.
Let’s look at how this code works. Recall that we were still busy in
parseImport()
. Well, now we shall enter the following method
doImport(string[])
as follows:
This method first of all starts off by obtaining a few important object instances:
// Print out some information about the current program
Program prog = this.compiler.getProgram();
// Get the module manager
ModuleManager modMan = compiler.getModMan();
Now, we must perform the following steps:
- For every module name
i
incollectedModuleNames
- Find a
ModuleEntry
for namei
- Append this
ModuleEntry
tofoundEnts
- Find a
This step is what satisfies the first 1/3 steps of ours. This maps the incoming module name to a module filename. The code for doing this is shown below:
// Search for all the module entries
ModuleEntry[] foundEnts;
foreach(string mod; modules)
{
gprintln(format("Module wanting to be imported: %s", mod));
// Search for the module entry
ModuleEntry foundEnt = modMan.find(mod);
gprintln("Found module entry: "~to!(string)(foundEnt));
foundEnts ~= foundEnt;
}
Now we have to do the following steps:
- For each module entry
m
infoundEnts
- Check if
m
has been visited, if so, then go back to1
and process the next module entry - If not then mark the entry as visited
- Then read the module’s source code
- Parse the module and obtain a
Module
object - Save the obtained
Module
to the module entrym
- Check if
The code for the above algorithm can be seen below:
// For each module entry, only import
// it if not already in the process
// of being visited
foreach(ModuleEntry modEnt; foundEnts)
{
// Check here if already present, if so,
// then skip
if(prog.isEntryPresent(modEnt))
{
gprintln(format("Not parsing module '%s' as already marked as visited", modEnt));
continue;
}
// Mark it as visited
prog.markEntryAsVisited(modEnt);
// Read in the module's contents
string moduleSource = modMan.readModuleData_throwable(modEnt);
gprintln("Module has "~to!(string)(moduleSource.length)~" many bytes");
// Parse the module
import tlang.compiler.lexer.kinds.basic : BasicLexer;
LexerInterface lexerInterface = new BasicLexer(moduleSource);
(cast(BasicLexer)lexerInterface).performLex();
Parser parser = new Parser(lexerInterface, this.compiler);
Module pMod = parser.parse(modEnt.getPath());
// Map parsed module to its entry
prog.setEntryModule(modEnt, pMod);
}
This whole process is important, especially the aspect whereby we mark a
module as visited prior to finishing parsing it. One needs this because
a module b
might import a module a
which imports a module b
and it
is at this last point that we want the check for visitation to return
true
even though parsing of the b
module has not yet completed.
Then at the end of the entire process we save the obtained module
object to its respective module entry. The act of saving it basically
maps the module’s name to the Module
object itself and also adds
it to the Program
’s body - hence making this module a part of the
larger program.
The first call to parse(string, bool)
Now that we know how all of this fits together it is worth taking a look at where this all starts, the first call to the parser.
If we take a look at the doParser()
method inside the Compiler
object then we can get an idea of how this all starts off.
We firstly construct a new Parser
and provide it an instance of the
Compiler
itself and of course the tokens to parse. The compiler is
important as it contains the Program
instance whereby Module
(s) will
be attached to:
After this we then begin the parsing, notably we set the second
parameter to true
to indicate that the module being parsed is for the
entrypoint (we shall see the effect of this later):
// It is easier to grab the module
// name from inside hence we should probably
// have a default parameter isEntrypoint=false
// and then add it from within
Module modulle = parser.parse(this.inputFilePath, true);
Now, we look to the parse(string, bool)
method inside Parser
. Here
we do some obvious steps of determing the module’s name and then
constructing an empty Module
object to which the body of the module,
of which is to be parsed shortly, shall be added:
Module modulle;
/* Expect `module` and module name and consume them (and `;`) */
expect(SymbolType.MODULE, lexer.getCurrentToken());
lexer.nextToken();
/* Module name may NOT be dotted (TODO: Maybe it should be yeah) */
expect(SymbolType.IDENT_TYPE, lexer.getCurrentToken());
string moduleName = lexer.getCurrentToken().getToken();
lexer.nextToken();
expect(SymbolType.SEMICOLON, lexer.getCurrentToken());
lexer.nextToken();
/* Initialize Module */
modulle = new Module(moduleName);
We then also copy over the file path so that we can associate it with the module itself. This is not actually used anywhere else but it is here for now:
Continuing on we now approach a vital step which relates to the boolean flag we set earlier. We want to add ourselves, the current module, to the program immediately such that we are known to have been visited already. The reason for this is because else only modules we import would be added and we would be skipped (unless added via another import of one of those aforementioned modules):
/**
* If this is an entrypoint module (i.e. one
* specified on the command-line) then store
* it as visited
*/
if(isEntrypoint)
{
gprintln
(
format
(
"parse(): Yes, this IS your entrypoint module '%s' about to be parsed",
moduleName
)
);
ModuleEntry curModEnt = ModuleEntry(moduleFilePath, moduleName);
Program prog = this.compiler.getProgram();
prog.markEntryAsVisited(curModEnt); // TODO: Could not call?
prog.setEntryModule(curModEnt, modulle);
}
The reason the if statement is that because the parse(string, bool)
method will be called several other times for other modules when they
are parsed because the entry point module had an import for them, and
they follow semantics that we already mentioned earlier with the whole
parseImport()
mechanism. Hence this is just a special case that must
be handled.