---
title: Compiling a Functional Language Using C++, Part 13 - Cleanup
date: 2020-09-19T16:14:13-07:00
tags: ["C++", "Functional Languages", "Compilers"]
series: "Compiling a Functional Language using C++"
description: "In this post, we clean up our compiler."
---

In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
and lambda expressions to our compiler. At the end of that post, I mentioned
that before we move on to bigger and better things, I wanted to take a
step back and clean up the compiler. Now is the time to do that.

In particular, I identified four things that could be improved
or cleaned up:

* __Error handling__. We need to stop using `throw 0` and start
using proper exceptions and `assert`. We can also make our errors much
more descriptive by including source locations in the output.
* __Name mangling__. I don't think I got it quite right last
time. Now is the time to clean it up.
* __Code organization__. I think we can benefit from a top-level
class, and a clearer "dependency order" between the various
classes and structures we've defined.
* __Code style__. In particular, I've been lazily using `struct`
in a lot of places. That's not a good idea; it's better
to use `class`, and only expose _some_ fields and methods
to the rest of the code.

### Error Reporting and Handling
The previous post was rather long, which led me to omit
a rather important aspect of the compiler: proper error reporting.
Once again our compiler has instances of `throw 0`, which is a cheap way
of avoiding properly handling a runtime error. Before we move on,
it's best to get rid of such blatantly lazy code.

Our existing exceptions (mostly type errors) can use some work, too.
Even the most descriptive issues our compiler reports -- unification errors --
don't include the crucial information of _where_ the error is. For large
programs, this means having to painstakingly read through the entire file
to try to figure out which subexpression could possibly have an incorrect type.
This is far from the ideal debugging experience.

Addressing all this is a multi-step change in itself. We want to:

* Replace all `throw 0` code with actual exceptions.
* Replace some exceptions that shouldn't be possible for a user to trigger
with assertions.
* Keep track of the source location of each subexpression, so that we
can print it if the subexpression causes an error.
* Be able to print out said source locations at will. This isn't
a _necessity_, but virtually all "big" compilers do this. Instead
of reporting that an error occurs on a particular line, we will
actually print the line.

Let's start with gathering the actual location data.

#### Bison's Locations
Bison actually has some rather nice support for location tracking. It can
automatically assemble the "from" and "to" locations of a nonterminal
from the locations of children, which would be very tedious to write
by hand. We enable this feature using the following option:

{{< codelines "C++" "compiler/13/parser.y" 46 46 >}}

There's just one hitch, though. Sure, Bison can compute bigger
locations from smaller ones, but it must get the smaller ones
from somewhere. Since Bison operates on _tokens_, rather
than _characters_, it effectively doesn't interact with the source
text at all, and can't determine from which line or column a token
originated. The task of determining the locations of input tokens
is delegated to the tokenizer -- Flex, in our case. Flex, on the
other hand, doesn't have a built-in mechanism for tracking
locations. Fortunately, Bison provides a `yy::location` class that
includes most of the needed functionality.

A `yy::location` consists of two source positions, `begin` and `end`,
which themselves are represented using lines and columns. It
also has the following methods:

* `yy::location::columns(int)` advances the `end` position by
the given number of columns, while `begin` stays the same.
If `begin` and `end` both point to the beginning of a token,
then `columns(token_length)` will move `end` to the token's end,
and thus make the whole `location` contain the token.
* `yy::location::lines(int)` behaves similarly to `columns`,
except that it advances `end` by the given number of lines,
rather than columns. It also resets the columns counter to `1`.
* `yy::location::step()` moves `begin` to where `end` is. This
is useful for when we've finished processing a token, and want
to move on to the next one.

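To make the interplay of these three methods concrete, here's a small standalone sketch.
The `mini_position` and `mini_location` types are hypothetical stand-ins that I made up for
illustration; they only model the behavior described in the list above, and are not Bison's
actual `yy::location`.

```cpp
#include <cassert>

// Hypothetical stand-ins for Bison's position and location classes;
// they only model the behavior described in the list above.
struct mini_position {
    int line = 1;
    int column = 1;
};

struct mini_location {
    mini_position begin;
    mini_position end;

    // Advance `end` by `count` columns; `begin` is untouched.
    void columns(int count) { end.column += count; }

    // Advance `end` by `count` lines and reset its column to 1.
    void lines(int count) {
        end.line += count;
        end.column = 1;
    }

    // Start a new token: move `begin` to where `end` is.
    void step() { begin = end; }
};

// Simulate scanning the two-line input "ab\ncd", one token at a time.
inline mini_location scan_example() {
    mini_location loc;
    loc.step(); loc.columns(2);  // token "ab": line 1, columns 1 to 3
    loc.step(); loc.lines(1);    // newline: `end` moves to line 2, column 1
    loc.step(); loc.columns(2);  // token "cd": line 2, columns 1 to 3
    return loc;
}
```

After `scan_example` runs, the location spans the last token only: `begin` sits at line 2,
column 1, and `end` at line 2, column 3.
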
For Flex specifically, `yyleng` has the length of the token
currently being processed. Rather than adding the calls
to `columns` and `step` to every rule, we can define the
`YY_USER_ACTION` macro, which is run before each token
is processed.

{{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}

We'll see why we are using `LOC` instead of something like `location` soon;
for now, you can treat `LOC` as if it were a global variable declared
in the tokenizer. Before processing each token, we ensure that
the `yy::location` has its `begin` and `end` at the same position,
and then advance `end` by `yyleng` columns. This is
{{< sidenote "right" "sufficient-note" "sufficient" >}}
This doesn't hold for all languages. It may be possible for a language
to have tokens that contain <code>\n</code>, in which case,
rather than just using <code>yyleng</code>, we'd need to
add special logic to iterate over the token and detect the line
breaks.<br>
<br>
Also, this requires that the <code>end</code> of the previous token was
correctly computed.
{{< /sidenote >}}
to make `LOC` represent our token's source position. For
the moment, don't worry too much about `drv`; this is the
parsing driver, and we will talk about it shortly.

So now we have a "global" variable `LOC` that gives
us the source position of the current token. To get it
to Bison, we have to pass it as an argument to each
of the `make_TOKEN` calls. Here are a few sample lines
that should give you the general idea:

{{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}

That last line is actually new. Previously, we somehow
got away without explicitly sending the end-of-file token to Bison.
I suspect that this was due to some kind of implicit conversion
of the Flex macro `YY_NULL` into a token; now that we have
to pass a position to every token constructor, such an implicit
conversion is probably impossible.

Now we have Bison computing source locations for each nonterminal.
However, at the moment, we still aren't using them. To change that,
we need to add a `yy::location` argument to each of our `ast` nodes,
as well as to the `pattern` subclasses, `definition_defn` and
`definition_data`. To avoid breaking all the code that creates
AST nodes and definitions outside of the parser, we'll make this
argument optional. Inside of `ast.hpp`, we define a new field as follows:

{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}

Then, we add a constructor to `ast` as follows:

{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}

Note that it's not optional here, since `ast` itself is an
abstract class, and thus will never be constructed directly.
It is in the subclasses of `ast` that we provide a default
value. The change is rather mechanical, but here's an example
from `ast_binop`:

{{< codelines "C++" "compiler/13/ast.hpp" 98 99 >}}

Finally, we tell Bison to pass the computed location
data as an argument when constructing our data structures.
This too is a mechanical change, and I think the following
few lines demonstrate the general idea in sufficient
detail:

{{< codelines "C++" "compiler/13/parser.y" 92 96 >}}

Here, the `@$` symbol is used to reference the current
nonterminal's location data.

#### Line Offsets, File Input, and the Parsing Driver
There are three more challenges with printing out the line
of code where an error occurred. First of all, to
print out a line of code, we need to have that line of code
available to us. We do not currently meet this requirement:
our compiler reads code from `stdin` (as is default for Flex),
and `stdin` doesn't always support rewinding. This, in turn,
means that once Flex has read a character from the input,
it may not be possible to go back and retrieve that character
again.

Second, even if we do have the entire stream or buffer
available to us, to retrieve an offset and length within
that buffer from just a line and column number would be a lot
of work. A naive approach would be to iterate through
the input again, once more keeping track of lines and columns,
and print the desired line once we reach it. However, this
would lead us to redo a lot of work that our tokenizer
is already doing.

Third, Flex's input mechanism, even if it's configured
not to read from `stdin`, uses a global file handle (a `FILE*`) called
`yyin`. However, we're better off minimizing global state (especially
if we want to read, parse, and compile multiple files in
the future). While we're configuring Flex's input mechanism,
we may as well fix this, too.

There are several approaches to fixing the first issue. One possible
way is to store the content of `stdin` into a temporary file. Then,
it's possible to read from the file multiple times by using
the C functions `fseek` and `rewind`. However, since we're
working with files, why not just work directly with the files
created by the user? Instead of reading from `stdin`, we may
as well take in a path to a file via `argv`, and read from there.
Also, instead of `fseek` and `rewind`, we can just read the file
into memory, and access it like a normal character buffer. This
does mean that we could stick with `stdin` after all, but it's more
conventional to read source code from files, anyway.

To address the second issue, we can keep a mapping of line numbers
to their locations in the source buffer. This is rather easy to
maintain using an array: the first element of the array is 0,
which is the beginning of the first line in any source file. From there,
every time we encounter the character `\n`, we can push
the current source location onto the end of the array, marking it as
the beginning of another line. Where exactly we store this
array is as yet unclear, since we're trying to avoid global variables.

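Here's a minimal sketch of that line-offset array, assuming for simplicity that the
whole source is already sitting in a `std::string`. The `compute_line_offsets` name
is made up for this example; in the compiler itself, the offsets are recorded
incrementally as the tokenizer runs, rather than in a separate pass like this one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Record the offset at which each line of `source` begins.
// The first line always starts at offset 0; every '\n' marks the
// start of another line at the offset that follows it.
inline std::vector<std::size_t> compute_line_offsets(const std::string& source) {
    std::vector<std::size_t> offsets{0};
    for (std::size_t i = 0; i < source.size(); i++) {
        if (source[i] == '\n') offsets.push_back(i + 1);
    }
    return offsets;
}
```

For the input `"ab\ncde\nf"`, this yields the offsets 0, 3 and 7. Note that a file ending in
`\n` gets one extra entry for the (empty) line after the final newline.
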
Finally, to begin addressing the third issue, we can use Flex's `reentrant`
option, which makes it so that all of the tokenizer's state is stored in an
opaque `yyscan_t` structure, rather than in global variables. This way,
we can configure `yyin` without setting a global variable, which is a step
in the right direction. We'll work on this momentarily.

Our tokenizing and parsing stack has more global variables
than just those specific to Flex. Among these variables is `global_defs`,
which receives all the top-level function and data type definitions. We
will also need some way of accessing the `yy::location` instance, and
a way of storing our file input in memory. Fortunately, we're not
the only ones to have ever come across the issue of creating non-global
state: the Bison documentation has a
[section in its C++ guide](https://www.gnu.org/software/bison/manual/html_node/Calc_002b_002b-Parsing-Driver.html)
that describes a technique for manipulating
state -- "parsing context", in their words. This technique involves the
creation of a _parsing driver_.

The parsing driver is a class (or struct) that holds all the parse-related
state. We can arrange for this class to be available to our tokenizing
and parsing functions, which will allow us to use it pretty much like we'd
use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
We can define it as follows:

{{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}

There aren't many fields here. The `file_name` string represents
the file that we'll be reading code from. The `location` field
will be accessed by Flex via `get_current_location`. Bison will
store the function and data type definitions it reads into `global_defs`
via `get_global_defs`. Finally, `file_m` will be used to keep track
of the content of the file we're reading, as well as the line offsets
within that file. Notice that a couple of these fields are pointers
that we take by reference in the constructor. The `parse_driver` doesn't
_own_ the global definitions, nor the file manager. They exist outside
of it, and will continue to be used in other ways the `parse_driver`
does not need to know about. Also, the `LOC` variable in Flex is
actually a call to `get_current_location`:

{{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}

The methods of `parse_driver` are rather simple. The majority of
them deal with giving access to the driver's members: the `yy::location`,
the `definition_group`, and the `file_mgr`. The only exception
to this is `operator()`, which we use to actually trigger the parsing process.
We'll make this method return `true` if parsing succeeded, and `false`
otherwise (if, say, the file we tried to read doesn't exist).
Here's its implementation:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}

We try to open the user-specified file, and return `false` if we can't.
After this, we start doing the setup specific to a reentrant
Flex scanner. We declare a `yyscan_t` variable, which
will contain all of Flex's state. Then, we initialize
it using `yylex_init`. Finally, since we can no longer
touch the `yyin` global variable (it doesn't exist),
we have to resort to using a setter function provided by Flex
to configure the tokenizer's input stream.

Next, we construct our Bison-generated parser. Note that
unlike before, we have to pass in two arguments:
`scanner` and `*this`, the latter being of type `parse_driver&`.
We'll come back to how this works in a moment. With
the scanner and parser initialized, we invoke `parser::operator()`,
which actually runs the Flex- and Bison-generated code.
To clean up, we run `yylex_destroy` and `fclose`. Finally,
we call `file_mgr::finalize`, and return. But what
_is_ `file_mgr`?

The `file_mgr` class does two things: it stores the part of the file
that has already been read by Flex in memory, and it keeps track of
where each line in our source file begins within the text. Here is its
definition:

{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}

In this class, the `string_stream` member is used to construct
an `std::string` from the bits of text that Flex reads,
processes, and feeds to the `file_mgr` using the `write` method.
It's more efficient to use a string stream than to concatenate
strings repeatedly. Once Flex is finished processing the file,
the final contents of the `string_stream` are transferred into
the `file_contents` string using the `finalize` method. The `file_offset`
and `line_offsets` fields will be used as we described earlier: each time Flex
encounters the `\n` character, the `file_offset` variable will be pushed
onto the end of the `line_offsets` vector, marking the beginning of
the corresponding line. The methods of the class are as follows:

* `write` will be called from Flex, and will allow us to
record the content of the file we're processing to the `string_stream`.
We've already seen it used in the `YY_USER_ACTION` macro.
* `mark_line` will also be called from Flex, and will mark the current
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
* `finalize` will be called by the `parse_driver` when the parsing
finishes. At this time, the `string_stream` should contain all of
the input file, and this data is transferred to `file_contents`, as
we mentioned above.
* `get_index` and `get_line_end` will be used for converting
`yy::location` instances to offsets within the source code buffer.
* `print_location` will be used for printing errors.
It will print the lines spanned by the given location, with the
location itself colored and underlined if the last argument is `true`.
This will make our errors easier on the eyes.

Let's take a look at their implementations. First, `write`.
For the most part, this method is a proxy for the `write`
method of our `string_stream`:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}

We do, however, also keep track of the `file_offset` variable
here, which ensures we have up-to-date information
regarding our position in the source file. The implementation
of `mark_line` uses this information:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}

The `finalize` method is trivial, and requires little additional
discussion:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}

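To see how these three methods cooperate, here's a stripped-down, standalone sketch.
The `mini_file_mgr` class is hypothetical: it borrows the field names from the text above,
but the `contents` and `offsets` accessors are additions of mine so that the result can be
inspected, and the real class may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// A hypothetical, simplified take on the class described above.
class mini_file_mgr {
    std::ostringstream string_stream;
    std::string file_contents;
    std::size_t file_offset = 0;
    std::vector<std::size_t> line_offsets{0};  // line 1 starts at offset 0

public:
    // Record a chunk of text and remember how far into the file we are.
    void write(const char* buffer, std::size_t len) {
        string_stream.write(buffer, static_cast<std::streamsize>(len));
        file_offset += len;
    }

    // Call right after a newline was written: the next line starts here.
    void mark_line() { line_offsets.push_back(file_offset); }

    // Move the accumulated text into a plain string.
    void finalize() { file_contents = string_stream.str(); }

    // Accessors added for this example only.
    const std::string& contents() const { return file_contents; }
    const std::vector<std::size_t>& offsets() const { return line_offsets; }
};
```

Feeding it `"ab"`, then `"\n"` followed by `mark_line()`, then `"cd"`, and finally calling
`finalize()` leaves us with the contents `"ab\ncd"` and a second line recorded at offset 3.
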
Once we have the line offsets, `get_index` becomes very simple:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}

Here, we use an assertion for the first time. Calling
`get_index` with a negative or zero line doesn't make
any sense, since Bison starts tracking line numbers
at 1. Similarly, asking for a line for which we don't
have a recorded offset is invalid. Neither
of these nonsensical calls to `get_index` can
be caused by the user under normal circumstances;
both indicate the method's misuse by the author of
the compiler (us!). Thus, we terminate the program.

Finally, the implementation of `get_line_end` just finds the
beginning of the next line. We stick to the C convention
of marking 'end' indices exclusive (pointing just past
the end of the array):

{{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}

Since `line_offsets` has as many elements as there are lines,
the last line number would be equal to the vector's size.
When looking up the end of the last line, we can't look for
the beginning of the next line, so instead we return the end of the file.

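The two lookups can be sketched as free functions over a vector of line offsets.
The names `index_of` and `line_end_of`, and the choice to pass the offsets and contents
as parameters, are assumptions made for this example; they are not the signatures
of the compiler's methods.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert a 1-based (line, column) pair into an offset in the buffer.
inline std::size_t index_of(const std::vector<std::size_t>& line_offsets,
                            int line, int column) {
    assert(line > 0);
    assert(static_cast<std::size_t>(line) <= line_offsets.size());
    return line_offsets[line - 1] + column - 1;
}

// Exclusive end of a 1-based line: the start of the next line, or the
// end of the file when `line` is the last one.
inline std::size_t line_end_of(const std::vector<std::size_t>& line_offsets,
                               const std::string& contents, int line) {
    assert(line > 0);
    assert(static_cast<std::size_t>(line) <= line_offsets.size());
    if (static_cast<std::size_t>(line) == line_offsets.size()) {
        return contents.size();
    }
    return line_offsets[line];
}
```

With the contents `"ab\ncde"` and the offsets 0 and 3, line 2, column 2 maps to index 4
(the `d`). Line 1 ends at index 3, which includes its trailing newline, and line 2 ends
at index 6, the end of the file.
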
Next, the `print_location` method prints three sections
of the source file. These are the text "before" the error,
the error itself, and, finally, the text "after" the error.
For example, if an error began on the fifth column of the third
line, and ended on the eighth column of the fourth line, the
"before" section would include the first four columns of the third
line, and the "after" section would be the ninth column onward
on the fourth line. Before and after the error itself,
if the `highlight` argument is true,
we sprinkle the ANSI escape codes to enable and disable
special formatting, respectively. For now, the special
formatting involves underlining the text and making it red.

{{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}

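As a rough illustration of the three-section idea, here's a standalone sketch that
works directly with buffer offsets instead of `yy::location` values. The `print_span`
function and its parameters are hypothetical; the escape sequences are the standard
ANSI codes for "underline, red" and "reset".

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Write text[print_start, print_end) to `to`, wrapping the
// [begin, end) slice in ANSI codes when `highlight` is set.
// All four indices are offsets into `text`.
inline void print_span(std::ostream& to, const std::string& text,
                       std::size_t print_start, std::size_t begin,
                       std::size_t end, std::size_t print_end,
                       bool highlight) {
    assert(print_start <= begin && begin <= end && end <= print_end);
    assert(print_end <= text.size());
    to.write(text.data() + print_start, begin - print_start);  // "before"
    if (highlight) to << "\033[4;31m";                          // underline, red
    to.write(text.data() + begin, end - begin);                 // the error
    if (highlight) to << "\033[0m";                             // reset
    to.write(text.data() + end, print_end - end);               // "after"
}
```

With highlighting turned off, the function simply echoes the requested lines, which is
handy when the location spans too much text to color.
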
Finally, to get the forward declarations for the `yy*` functions
and types, we set the `header-file` option in Flex:

{{< codelines "C++" "compiler/13/scanner.l" 3 3 >}}

We also include this `scanner.hpp` file in our `parse_driver.cpp`:

{{< codelines "C++" "compiler/13/parse_driver.cpp" 2 2 >}}

#### Adding the Driver to Flex and Bison
Bison's C++ language template generates a class called
`yy::parser`. We don't really want to modify this class
in any way: not only is it generated code, but it's
also rather complex. Instead, Bison provides us
with a mechanism to pass more data in to the parser.
This data is made available to all the actions
that the parser runs. Better yet, Bison also attempts
to pass this data on to the tokenizer, which in our
case would mean that whatever data we provide Bison
will also be available to Flex. This is how we'll
allow the two components to access our new `parse_driver`
class. This is also how we'll pass in the `yyscan_t`
that Flex now needs to run its tokenizing code. To
do all this, we use Bison's `%param` option. I'm
going to include a few more lines from `parser.y`,
since they contain the necessary `#include` directives
and a required type definition:

{{< codelines "C++" "compiler/13/parser.y" 1 18 >}}

The `%param` option effectively adds the parameter listed
between the curly braces to the constructor of the generated
`yy::parser`. We've already seen this in the implementation
of our driver, where we passed `scanner` and `*this` as
arguments when creating the parser. The parameters we declare are also passed to the
`yylex` function, which is expected to accept them in the same order.

Since we're adding `parse_driver` as an argument we have to
declare it. However, we can't include the `parse_driver` header
right away because `parse_driver` itself includes the `parser` header:
we'd end up with a circular dependency. Instead, we resort to
forward-declaring the driver class, as well as the `yyscan_t`
structure containing Flex's state.

Adding a parameter to Bison doesn't automatically affect
Flex. To let Flex know that its `yylex` function must now accept
the state and the parsing driver, we have to define the
`YY_DECL` macro. We do this in `parse_driver.hpp`, since
this forward declaration will be used by both Flex
and Bison:

{{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}

#### Improving Exceptions
Now, it's time to add location data (and a little bit more) to our
exceptions. We want to make it possible for exceptions to include
data about where the error occurred, and to print this data to the user.
However, it's also possible for us to have exceptions that simply
do not have that location data. Furthermore, we want to know
whether or not an exception has an associated location; we'd
rather not print an invalid or "default" location when an error
occurs.

In the old days of programming, we could represent the absence
of location data with a `nullptr`, or `NULL`. But not only
does this approach expose us to all kinds of `NULL`-safety
bugs, but it also requires heap allocation! This doesn't
make it sound all that appealing; instead, I think we should
opt for using `std::optional`.

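Here's a small sketch of what an optional location buys us. The `source_span`
and `located_error` types, as well as the `describe` function, are hypothetical
and exist only for this example; `source_span` stands in for `yy::location`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical location type standing in for yy::location.
struct source_span {
    int line;
    int column;
};

// An error that may or may not know where it happened. The optional
// stores the span inline: no heap allocation, and no null pointer.
struct located_error {
    std::string description;
    std::optional<source_span> loc;
};

// Mention the line only when a location is actually present.
inline std::string describe(const located_error& err) {
    if (!err.loc) return err.description;
    return err.description + " (line " + std::to_string(err.loc->line) + ")";
}
```

An error constructed with `std::nullopt` is described by its text alone, while one carrying
a `source_span{2, 5}` gets `" (line 2)"` appended. At no point do we print a made-up
"default" location.
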
Though `std::optional` is standard (as may be obvious from its
namespace), it's a rather recent addition to the C++ STL.
In order to gain access to it, we need to ensure that our
project is compiled using C++17. To this end, we add
the following two lines to our CMakeLists.txt:

{{< codelines "CMake" "compiler/13/CMakeLists.txt" 5 6 >}}

Now, let's add a new base class for all of our compiler errors,
unsurprisingly called `compiler_error`:

{{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}

We'll put some 'common' exception functionality
into the `print_location` and `print_about` methods. If the error
has an associated location, the former method will print that
location to the screen. We don't always want to highlight
the part of the code that caused the error: for instance,
an invalid data type definition may span several lines,
and coloring that whole section of text red would be
too much. To address this, we add the `highlight`
boolean argument, which can be used to switch the
colors on and off. The `print_about` method
will simply print the `what()` message of the exception,
in addition to the "specific" error that occurred (stored
in `description`). Here are the implementations of the
functions:

{{< codelines "C++" "compiler/13/error.cpp" 3 16 >}}

We will also add a `pretty_print` method to all of
our exceptions. This method will handle
all the exception-specific printing logic.
For the generic compiler error, this means
simply printing out the error text and the location:

{{< codelines "C++" "compiler/13/error.cpp" 18 21 >}}

For `type_error`, this logic slightly changes,
enabling colors when printing the location:

{{< codelines "C++" "compiler/13/error.cpp" 27 30 >}}

Finally, for `unification_error`, we also include
the code to print out the two types that our
compiler could not unify:

{{< codelines "C++" "compiler/13/error.cpp" 32 41 >}}

There's a subtle change here. Compared to the previous
type-printing code (which we had in `main`), what
we wrote here deals with "expected" and "actual" types.
The `left` type passed to the exception is printed
first, and is treated like the "correct" type. The
`right` type, on the other hand, is treated
like the "wrong" type that should have been
unifiable with `left`. This will affect the
calling conventions of our unification code.

Now, we can go through and find all the places where
we `throw 0`. One such place was in the data type
definition code, where declaring the same type parameter
twice is invalid. We replace the `0` with a
`compiler_error`:

{{< codelines "C++" "compiler/13/definition.cpp" 66 69 >}}

Not all `throw 0` statements should become exceptions.
For example, here's code from the previous version of
the compiler:

{{< codelines "C++" "compiler/12/definition.cpp" 123 127 >}}

If a definition `def_defn` has a dependency on a "nearby" (declared
in the same group) definition called `dependency`, and if
`dependency` does not exist within the same definition group,
we throw an exception. But this error is impossible
for a user to trigger: the only reason for a variable to appear
in the `nearby_variables` vector is that it was previously
found in the definition group. Here's the code that proves this
(from the current version of the compiler):

{{< codelines "C++" "compiler/13/definition.cpp" 102 106 >}}

Not being able to find the variable in the definition group
is a compiler bug, and should never occur. So, instead
of throwing an exception, we'll use an assertion:

{{< codelines "C++" "compiler/13/definition.cpp" 128 128 >}}

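The distinction between the two kinds of failures can be sketched side by side.
Both functions below are hypothetical and use `std::runtime_error` in place of
our `compiler_error`; the point is only which mechanism goes with which kind of mistake.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// User-triggerable: a program can declare the same type parameter twice,
// so we report the problem with an exception.
inline void check_type_params(const std::vector<std::string>& params) {
    std::set<std::string> seen;
    for (const auto& param : params) {
        if (!seen.insert(param).second) {
            throw std::runtime_error("type variable " + param + " used twice");
        }
    }
}

// Not user-triggerable: callers only pass names they already found in
// `group`, so a failed lookup is a compiler bug, and we assert.
inline int lookup_known(const std::map<std::string, int>& group,
                        const std::string& name) {
    auto it = group.find(name);
    assert(it != group.end());
    return it->second;
}
```

Passing `{"a", "a"}` to `check_type_params` throws, and the caller can catch the exception
and show a message to the user. A failed assertion in `lookup_known`, on the other hand,
aborts the program, which is what we want when the compiler itself is at fault.
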
For more complicated error messages, we can use a `stringstream`.
Here's an example from `parsed_type`:

{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}

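In isolation, the string stream technique looks something like the sketch below.
The `arity_message` function and the wording of its message are made up for
illustration; they are not the compiler's actual error text.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Build a message out of several pieces, mixing strings and numbers.
inline std::string arity_message(const std::string& type_name,
                                 std::size_t expected, std::size_t actual) {
    std::ostringstream os;
    os << "type " << type_name << " expects " << expected
       << " argument(s), but got " << actual;
    return os.str();
}
```

The resulting `std::string` can then be handed to the exception's constructor as its description.
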
In general, this change is also rather mechanical. Before we
move on, to maintain a balance between exceptions and assertions, here
are a couple more assertions from `type_env`:

{{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}}

Once again, it should not be possible for the compiler
to try to generalize the type of a variable that doesn't
exist, nor should generalization occur twice.

While we're on the topic of types, let's talk about
`type_mgr::unify`. In practice, I suspect that a lot of
errors in our compiler will originate from this method.
However, at present, this method does not in any way
track the locations of where a unification error occurred.
To fix this, we add a new `loc` parameter to `unify`,
which we make optional to allow for unification without
a known location. Here's the declaration:

{{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}

The change to the implementation is mechanical and repetitive,
so instead of showing you the whole method, I'll settle for
a couple of lines:

{{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}

We want to make sure that a location provided to the
top-level call to `unify` is also forwarded to the
recursive calls, so we have to explicitly add it
to the call.

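To show why the forwarding matters, here's a deliberately tiny sketch. Everything in it
is hypothetical: `mini_type` has no type variables, so this `unify` is really just a
structural equality check, and `mini_span` stands in for `yy::location`. What it does share
with the real method is the shape of the recursion, with the optional location handed down
at every step.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>

struct mini_span { int line; };  // hypothetical stand-in for yy::location

// A tiny type language: base types like Int, and arrows `left -> right`.
struct mini_type {
    std::string name;                        // used when this is a base type
    std::shared_ptr<mini_type> left, right;  // both set when this is an arrow
    bool is_arrow() const { return left && right; }
};
using mini_type_ptr = std::shared_ptr<mini_type>;

struct mini_unification_error : std::runtime_error {
    std::optional<mini_span> loc;
    explicit mini_unification_error(std::optional<mini_span> l)
        : std::runtime_error("failed to unify types"), loc(l) {}
};

// The location defaults to "unknown", and is forwarded explicitly to each
// recursive call so that nested failures still report it.
inline void unify(const mini_type_ptr& l, const mini_type_ptr& r,
                  const std::optional<mini_span>& loc = std::nullopt) {
    if (l->is_arrow() && r->is_arrow()) {
        unify(l->left, r->left, loc);
        unify(l->right, r->right, loc);
        return;
    }
    if (!l->is_arrow() && !r->is_arrow() && l->name == r->name) return;
    throw mini_unification_error(loc);
}
```

If the recursive calls left out `loc`, a mismatch buried inside an arrow type would be
reported without a location, even though the top-level caller supplied one.
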
We'll also have to update the 'main' code to call the
`pretty_print` methods, but there's another big change
that we're going to make before then. However, once that
change is made, our errors will look a lot better.
Here is what's printed out to the user when a type error
occurs:

```
an error occured while checking the types of the program: failed to unify types
occuring on line 2:
3 + False
the expected type was:
Int
while the actual type was:
Bool
```

Here's an error that was previously a `throw 0` statement in our code:

```
an error occured while compiling the program: type variable a used twice in data type definition.
occuring on line 1:
data Pair a a = { MkPair a a }
```

Now, not only have we eliminated the lazy uses of `throw 0` in our
code, but we've also improved the presentation of the errors
to the user!

### Rethinking Name Mangling
In the previous post, I said the following:

> One more thing. Let’s adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a ‘gotcha’ for anyone working on the compiler, we will always use the mangled names during compilation.

Now that I've had some more time to think about it
(and now that I've returned to the compiler after
a brief hiatus), I think that this was not the right call.
Mangled names make sense when translating to LLVM; we certainly
don't want to declare two LLVM functions
{{< sidenote "right" "mangling-note" "with the same name." >}}
By the way, LLVM has its own name mangling functionality. If you
declare two functions with the same name, they'll appear as
<code>function</code> and <code>function.0</code>. Since LLVM
uses the <code>Function*</code> C++ values to refer to functions,
as long as we keep them separate on <em>our</em> end, things will
work.<br>
<br>
However, in our compiler, name mangling occurs before LLVM is
introduced, at translation time. We could create LLVM functions
at that time, too, and associate them with variables. But then,
our G-machine instructions will be coupled to LLVM, which
would not be as clean.
{{< /sidenote >}}
But things are different for local variables. Our local variables
are graphs on a stack, and are not actually compiled to LLVM
definitions. It doesn't make sense to mangle their names, since
their names aren't present anywhere in the final executable.
It's not even "consistent" to mangle them, since global definitions
are compiled directly to __PushGlobal__ instructions, while local
variables are only referenced through the current `env`.
So, I opted to reverse my decision. We will go back to
placing variable names directly into `env_var`. Here's
an example of this from `global_scope.cpp`:

{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}

Now that we've started using assertions, I also think it's worth
putting our new invariant -- "only global definitions have mangled
names" -- into code:

{{< codelines "C++" "compiler/13/type_env.cpp" 36 45 >}}

Furthermore, we'll _require_ that a global definition
has a mangled name. This way, we can be more confident
that a variable from a __PushGlobal__ instruction
is referencing the right function. To achieve
this, we change `get_mangled_name` to stop
returning the input string if a mangled name was not
found; doing so makes it impossible to check if a mangled
name was explicitly defined. Instead,
we add two assertions. First, if an environment scope doesn't
contain a variable, then it _must_ have a parent.
If it does contain the variable, that variable _must_ have
a mangled name. We end up with the following:

{{< codelines "C++" "compiler/13/type_env.cpp" 47 55 >}}

For this to work, we make one more change. Now that we've
enabled C++17, we have access to `std::optional`. We
can thus represent the presence or absence of mangled
names using an optional field, rather than with the empty string `""`.
I hear that C++ compilers have pretty good
[empty string optimizations](https://www.youtube.com/watch?v=kPR8h4-qZdk),
but nonetheless, I think it makes more sense semantically
to use "absent" (`nullopt`) instead of "empty" (`""`).
Here's the definition of `type_env::variable_data` now:

{{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}}

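To see how the invariant, the two lookup assertions, and the optional field fit together, here is a minimal, self-contained sketch. It is not the code from `type_env.cpp`: the names `scope`, `visibility`, and `bind` are made up for illustration, and the real class also stores a type for each entry.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for the real type_env; only the mangling logic is kept.
enum class visibility { global, local };

struct variable_data {
    visibility vis;
    std::optional<std::string> mangled_name; // nullopt means "never mangled"
};

class scope {
    private:
        const scope* parent;
        std::map<std::string, variable_data> names;

    public:
        explicit scope(const scope* p = nullptr) : parent(p) {}

        void bind(const std::string& name, visibility v) {
            names[name] = variable_data{ v, std::nullopt };
        }

        bool is_global(const std::string& name) const {
            auto it = names.find(name);
            if(it != names.end()) return it->second.vis == visibility::global;
            return parent != nullptr && parent->is_global(name);
        }

        // Invariant: only global definitions may receive a mangled name.
        void set_mangled_name(const std::string& name, const std::string& mangled) {
            auto it = names.find(name);
            assert(it != names.end());
            assert(it->second.vis == visibility::global);
            it->second.mangled_name = mangled;
        }

        const std::string& get_mangled_name(const std::string& name) const {
            auto it = names.find(name);
            if(it == names.end()) {
                // Not bound here, so an enclosing scope must exist.
                assert(parent != nullptr);
                return parent->get_mangled_name(name);
            }
            // Bound here, so a mangled name must have been set explicitly.
            assert(it->second.mangled_name.has_value());
            return *it->second.mangled_name;
        }
};
```

With this shape, a caller has to check `is_global` before asking for a mangled name: requesting the mangled name of a local variable would trip the second assertion in `get_mangled_name`.
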
Since looking up a mangled name for a non-global variable
{{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}}
A very wise human at the very dawn of our species once said,
"make illegal states unrepresentable". Their friends and family were a little
busy making a fire, and didn't really understand what the heck they meant. Now,
we kind of do.<br>
<br>
It's <em>possible</em> for our <code>type_env</code> to include a
<code>variable_data</code> entry that is both global and has no mangled
name. But it doesn't have to be this way. We could define two subclasses
of <code>variable_data</code>, one global and one local,
where only the global one has a <code>mangled_name</code>
field. It would be impossible to reach this assertion failure then.
{{< /sidenote >}} we have to change
`ast_lid::compile` to only call `get_mangled_name` once
it ensures that the variable being compiled is, in fact,
global:

{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}

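The alternative mentioned in the sidenote can be sketched in a few lines. The sidenote suggests subclasses; the sketch below swaps in `std::variant` from C++17 instead, which gives the same guarantee with less code. The names `local_data`, `global_data`, `variable_info`, and `mangled_name_of` are hypothetical and do not appear in the compiler.

```cpp
#include <optional>
#include <string>
#include <variant>

// Hypothetical per-variable records. Only globals carry a mangled name,
// and for them the name is not optional.
struct local_data {};
struct global_data { std::string mangled_name; };

using variable_info = std::variant<local_data, global_data>;

// Returns the mangled name for globals and nothing for locals.
// There is no state in which a global lacks a mangled name.
std::optional<std::string> mangled_name_of(const variable_info& info) {
    if(const auto* global = std::get_if<global_data>(&info)) {
        return global->mangled_name;
    }
    return std::nullopt;
}
```

The cost of this design is that every place that reads a variable's data has to handle both alternatives, which may be why a single struct with an assertion is the more pragmatic choice here.
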
Since all global functions now need to have mangled
names, we run into a bit of a problem. What are
the mangled names of `(+)`, `(-)`, and so on? We could
continue to hardcode them as `plus`, `minus`, etc., but this can
(and currently does!) lead to errors. Consider the following
piece of code:

```
defn plus x y = { x + y }
defn main = { plus 320 6 }
```

We've hardcoded the mangled name of `(+)` to be `plus`. However,
`global_scope` doesn't know about this, so when the actual
`plus` function gets translated, it also gets assigned the
mangled name `plus`. The name is also overwritten in the
`llvm_context`, which effectively means that `(+)` is
now compiled to a call of the user-defined `plus` function.
If we didn't overwrite the name, we would've run into an assertion
failure in this scenario anyway. In short, this example illustrates
an important point: mangling information needs to be available
outside of a `global_scope`. We don't want to do this by having
every function take in a `global_scope` to access the mangling
information; instead, we'll store the mangling information in
a new `mangler` class, which `global_scope` will take as an argument.
The new class is very simple:

{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}

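To make the `plus` collision concrete, here is a small sketch of a shared name generator. The counter-based suffix scheme is an assumption on my part, and `name_mangler` is a hypothetical stand-in rather than the contents of `mangler.hpp`. What matters is that every mangled name, built-in or user-defined, is requested from the same object.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a shared mangler. Every requested name is
// recorded, so a later request for the same name gets a numeric suffix.
class name_mangler {
    private:
        std::map<std::string, int> occurrence_count;

    public:
        std::string new_mangled_name(const std::string& name) {
            // operator[] value-initializes a missing counter to 0.
            int occurrence = occurrence_count[name]++;
            if(occurrence == 0) return name;
            return name + "_" + std::to_string(occurrence);
        }
};
```

If the built-in `(+)` asks for `plus` first, the user's `plus` function from the program above receives a different name (`plus_1` in this sketch), so neither definition overwrites the other.
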
As with `parse_driver`, `global_scope` takes `mangler` by reference
and stores a pointer:

{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}

The implementation of `new_mangled_name` doesn't change, so I'm
not going to show it here. With this new mangling information
in hand, we can now correctly set the mangled names of binary
operators:

{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}

Wait a moment, what's a `compiler`? Let's talk about that next.

### A Top-Level Class
Now that we've moved name mangling out of `global_scope`, we have
to put it somewhere. The same goes for the global definition group
and the file manager that are given to `parse_driver`. The two
classes _make use_ of the other data, but they don't _own it_.
That's why they take it by reference, and store it as a pointer.
They're just temporarily allowed access.

So, what should be the owner of all of these disparate components?
Thus far, that has been the `main` function, or the utility
functions that it calls out to. However, this is sloppy:
we have related data and operations on it, but we don't group
them into an object. We can group all of the components of our
compiler into a `compiler` object, and leave `main.cpp` with
exception printing code.

The definition of the `compiler` class begins with all of the data
structures that we use in the process of compilation:

{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}

There's a loose ordering to these fields. In C++, class members are
initialized in the order they are declared; we therefore want to make
sure that fields that are depended on by other fields are initialized first.
Otherwise, I tried to keep the order consistent with the conceptual path
of the code through the compiler.
* Parsing happens first, so we begin with `parse_driver`, which needs a
`file_manager` (to populate with line information) and a `definition_group`
(to receive the global definitions from the parser).
* We then proceed to typechecking, for which we use a global `type_env_ptr`
(to define the built-in functions and constructors) and a `type_mgr` (to
manage the assignments of type variables).
* Once a program is typechecked, we transform it, eliminating local
function definitions and lambda functions. This is done by storing
newly-emitted global functions into the `global_scope`, which requires a
`mangler` to generate new names for the target functions.
* Finally, to generate LLVM IR, we need our `llvm_context` class.

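The ownership arrangement and the declaration-order rule can be seen together in a small sketch. The classes `name_source`, `scope_user`, and `owner` are hypothetical stand-ins (loosely playing the roles of `mangler`, `global_scope`, and `compiler`); only the pattern carries over.

```cpp
#include <string>

// Stand-in for a component that is owned by the top-level object.
class name_source {
    private:
        int next = 0;

    public:
        std::string fresh() { return "f_" + std::to_string(next++); }
};

// Stand-in for a component that uses name_source without owning it:
// it takes a reference and stores a pointer.
class scope_user {
    private:
        name_source* names;

    public:
        explicit scope_user(name_source& n) : names(&n) {}
        std::string add_function() { return names->fresh(); }
};

// The owner declares `names` before `scope`. Members are initialized in
// declaration order, so `names` is fully constructed by the time its
// address is handed to `scope`.
class owner {
    private:
        name_source names;
        scope_user scope;

    public:
        owner() : names(), scope(names) {}

        // Copying would leave the copy's `scope` pointing at the
        // original's `names`, so copies are disallowed in this sketch.
        owner(const owner&) = delete;
        owner& operator=(const owner&) = delete;

        std::string add_function() { return scope.add_function(); }
};
```

Swapping the two member declarations would still compile (usually with a warning about initializer order), but `scope` would then be constructed before `names`, which is the situation the ordering is meant to avoid.
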
The methods of the compiler are arranged similarly:

{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}

The methods go as follows:

* `add_default_types` adds the built-in types to the `global_env`.
At this point, these types only include `Int`.
* `add_binop_type` adds a single binary operator to the global
type environment. We saw its implementation earlier: it deals
with both binding a type, and setting a mangled name.
* `add_default_function_types` adds the types for each binary operator.
* `parse`, `typecheck`, `translate` and `compile` all do exactly
what they say. In this case, compilation refers to creating G-machine
instructions.
* `create_llvm_binop` creates an internal function that forces the
evaluation of its two arguments, and actually applies the given binary
operator. Recall that the `(+)` in user code constructs a call to this
function, but leaves it unevaluated until it's needed.
* `generate_llvm` converts all the definitions in `global_scope`, which
are at this point compiled into G-machine `instruction`s, into LLVM IR.
* `output_llvm` contains all the code to actually generate an object
file from the LLVM IR.

These functions are mostly taken from part 12's `main.cpp`, and adjusted
to use the `compiler`'s members rather than local definitions or arguments.
You should compare part 12's
[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
file with the
[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
file that we end up with at the end of this post.

Next, we have the compiler's constructor, and its `operator()`. The
latter, analogously to our parsing driver, will trigger the compilation
process. Their implementations are straightforward:

{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}

We also add a couple of methods to give external code access to
some of the compiler's data structures. I omit their (trivial)
implementations, but they have the following signatures:

{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}

With all the compilation code tucked into our new `compiler` class,
`main` becomes very simple. We also finally get to use our exception
pretty printing code:

{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}

With this, we complete our transition to a compiler object.
All that's left is to clean up the code style.

### Keeping Things Private
Hand-writing or generating hundreds of trivial getters and setters
for the fields of a data class (which is standard in the world of Java) seems
absurd to me. So, for most of this project, I stuck with
`struct`s, rather than classes. But this is not a good policy
to apply _everywhere_. I still think it makes sense to make
data structures like `ast` and `type` public-by-default;
however, I _don't_ think that way about classes like `type_mgr`,
`llvm_context`, `type_env`, and `env`. All of these have information
that we should never be accessing directly. Some guard this
information with assertions. In short, it should be protected.

For most classes, the changes are mechanical. For instance, we
can make `type_env` a class simply by changing its declaration,
and marking all of its functions public. This requires a slight
refactoring of a line that used its `parent` field. Here's
what it used to be (in context):

{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}

And here's what it is now:

{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}

Rather than traversing the chain of environments from
the body of the definition, we just use the definition's
own `env_ptr`. This is cleaner and more explicit, and
it helps us not use the private members of `type_env`!

The deal with `env` is about as simple. We just make
it and its two descendants classes, and mark their
methods and constructors public. The same
goes for `global_scope`. To make `type_mgr`
a class, we have to add a new method: `lookup`.
Here's its implementation:

{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}

It's used in `type_var::print` as follows:

{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}

We can't use `resolve` here because it takes (and returns)
a `type_ptr`. If we make it _take_ a `type*`, it won't
be able to return its argument if it's already resolved. If we
allow it to _return_ `type*`, we won't have an owning
reference. We also don't want to duplicate the
method just for this one call. Notice, though, how similar
`type_var::print`/`lookup` and `resolve` are in terms of execution.

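The difference between the two methods is easier to see in a stripped-down model. The sketch below is an assumption-laden simplification: `type` is a plain struct with an `is_var` flag instead of a class hierarchy, and `bind` and `show` are hypothetical helpers. It only illustrates why a name-based `lookup` works from a raw reference while `resolve` needs an owning pointer.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Simplified, hypothetical types: a type is either a variable or a named base type.
struct type {
    std::string name;
    bool is_var;
};
using type_ptr = std::shared_ptr<type>;

class type_mgr {
    private:
        std::map<std::string, type_ptr> types; // bindings of type variables

    public:
        void bind(const std::string& var, type_ptr t) { types[var] = std::move(t); }

        // Needs only a name, so it can be called from code that has
        // a plain reference and no owning pointer to the type.
        type_ptr lookup(const std::string& var) const {
            auto it = types.find(var);
            return it != types.end() ? it->second : nullptr;
        }

        // Needs an owning pointer, because an unbound or non-variable
        // argument is returned as-is.
        type_ptr resolve(type_ptr t) const {
            while(t->is_var) {
                type_ptr next = lookup(t->name);
                if(!next) break;
                t = next;
            }
            return t;
        }
};

// Analogue of a print method: follows bindings using only names.
std::string show(const type_mgr& mgr, const type& t) {
    if(t.is_var) {
        if(type_ptr bound = mgr.lookup(t.name)) return show(mgr, *bound);
    }
    return t.name;
}
```

Both `show` and `resolve` walk the same chain of bindings; they differ only in what they hold on to while doing so.
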
The change for `llvm_context` requires a little more work.
Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
Since we don't want to forward each of the LLVM builder methods,
and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
we'll just provide a getter for the `builder` field. The
same goes for `module`:

{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}

Here's what some of the code from `instruction.cpp` looks like now:

{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}

Right now, the `ctx` field of the `llvm_context` (which contains
the `llvm::LLVMContext`) is only externally used to create
instances of `llvm::BasicBlock`. We'll add a proxy method
for this functionality:

{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}

Finally, `instruction_pushglobal` needs to access the
`llvm::Function` instances that we create in the process
of compilation. We add a new `get_custom_function` method
to support this, which automatically prefixes the function
name with `f_`, much like `create_custom_function`:

{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}

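The overall shape of these changes can be sketched without pulling in LLVM. In the sketch below, `fake_builder` and `fake_function` are stand-ins that have nothing to do with LLVM's types, `context` is a hypothetical class, and the assertion on a missing function is my assumption rather than something the post states.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins so the sketch compiles without LLVM; they are not LLVM types.
struct fake_builder { int instructions_emitted = 0; };
struct fake_function { std::string name; };

class context {
    private:
        fake_builder builder;
        std::map<std::string, fake_function> custom_functions;

    public:
        // One getter instead of forwarding every builder method.
        fake_builder& get_builder() { return builder; }

        // Both methods add the same prefix, so callers only ever
        // deal with the unprefixed name.
        fake_function& create_custom_function(const std::string& name) {
            auto result = custom_functions.emplace("f_" + name, fake_function{ "f_" + name });
            return result.first->second;
        }

        fake_function& get_custom_function(const std::string& name) {
            auto it = custom_functions.find("f_" + name);
            assert(it != custom_functions.end());
            return it->second;
        }
};
```

Keeping the prefix logic inside the class means that code outside it cannot accidentally look a function up under the wrong key.
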
I think that's enough. If we chose to turn more compiler
data structures into classes, I think we would've quickly drowned
in one-line getter and setter methods.

That's all for the cleanup! We've added locations and more errors
to the compiler, stopped throwing `0` in favor of proper exceptions
or assertions, made name mangling more reasonable, fixed a bug with
accidentally shadowing default functions, organized our compilation
process into a `compiler` class, and made more things into classes.
In the next post, I hope to tackle __strings__ and __Input/Output__.
I also think that implementing __modules__ would be a good idea,
though at the moment I don't know too much on the subject. I hope
you'll join me in my future writing!

### Appendix: Optimization
When I started working on the compiler after the previous post,
I went a little overboard. I started working on optimizing the generated programs,
but eventually decided I wasn't doing a
{{< sidenote "right" "good-note" "good enough" >}}
I think authors should feel a certain degree of responsibility
for the content they create. If I do something badly, somebody
else trusts me and learns from it, who knows how much damage I've done.
I try not to do damage.<br>
<br>
If anyone reads what I write, anyway!
{{< /sidenote >}} job to present it to others,
and scrapped that part of the compiler altogether. I'm not
sure if I will try again in the near future. But,
if you're curious about optimization, here are a few avenues
I've explored or thought about:

* __Unboxing numbers__. Right now, numbers are allocated and garbage
collected just like the rest of the graph nodes. This is far from ideal.
We could use pointers to represent numbers, by tagging their most significant
bits on 64-bit CPUs. Rather than allocating a node, the runtime will just
cast a number to a pointer, tag it, and push it on the stack.
* __Converting enumeration data types to numbers__. If no constructor
of a data type takes any arguments, then the tag uniquely identifies
each constructor. Combined with unboxed numbers, this can save unnecessary
allocations and memory accesses.
* __Special treatment for global constants__. It makes sense for
global functions to be converted into LLVM functions, but the
same is not the case for
{{< sidenote "right" "constant-note" "constants." >}}
Yeah, yeah, a constant is just a nullary function. Get
out of here with your pedantry!
{{< /sidenote >}} We can find a way to
initialize global constants once, which would save some work. To
make more constants suitable for this, we could employ the
[monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction).
* __Optimizing stack operations__. If you read through the LLVM IR
we produce, you can see a lot of code that peeks at something twice,
or pops-then-pushes the same value, or does other absurd things. LLVM
isn't aware of the semantics of our stacks, but perhaps we could write an
optimization pass to deal with some of the more blatant instances of
this issue.

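As an illustration of the first idea, here is a rough sketch of tagging the most significant bit. It is not taken from the scrapped optimization work: the function names are hypothetical, numbers are assumed to be 32-bit, and stack slots are modelled as pointer-sized integers rather than actual pointers to keep the example free of pointer casts.

```cpp
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "this sketch assumes a 64-bit target");

// A stack slot is modelled as a pointer-sized integer. The most significant
// bit marks the slot as an unboxed number rather than a node address.
// (Assumption: real node addresses never have this bit set.)
constexpr std::uintptr_t NUMBER_TAG = std::uintptr_t(1) << 63;

inline std::uintptr_t make_unboxed_number(std::int32_t value) {
    // Go through uint32_t so negative values do not sign-extend into the tag bit.
    return NUMBER_TAG | static_cast<std::uint32_t>(value);
}

inline bool is_number(std::uintptr_t slot) {
    return (slot & NUMBER_TAG) != 0;
}

inline std::int32_t number_value(std::uintptr_t slot) {
    // Keep the low 32 bits and reinterpret them as a signed number.
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(slot));
}
```

With a representation like this, pushing a number needs no allocation, and the garbage collector can skip any slot for which `is_number` returns true. Enumeration constructors without arguments could be encoded the same way, using the constructor tag as the value.
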
If you attempt any of these, let me know how it goes, please!