blog-static/content/blog/13_compiler_cleanup/index.md

965 lines
43 KiB
Markdown
Raw Permalink Normal View History

---
title: Compiling a Functional Language Using C++, Part 13 - Cleanup
2020-09-19 16:27:41 -07:00
date: 2020-09-19T16:14:13-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
description: "In this post, we clean up our compiler."
---
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
and lambda expressions to our compiler. At the end of that post, I mentioned
that before we move on to bigger and better things, I wanted to take a
step back and clean up the compiler. Now is the time to do that.
2020-09-18 15:14:34 -07:00
In particular, I identified four things that could be improved
or cleaned up:
* __Error handling__. We need to stop using `throw 0` and start
using `assert`. We can also make our errors much more descriptive
by including source locations in the output.
* __Name mangling__. I don't think I got it quite right last
time. Now is the time to clean it up.
* __Code organization__. I think we can benefit from a top-level
class, and a more clear "dependency order" between the various
classes and structures we've defined.
* __Code style__. In particular, I've been lazily using `struct`
in a lot of places. That's not a good idea; it's better
to use `class`, and only expose _some_ fields and methods
to the rest of the code.
### Error Reporting and Handling
The previous post was rather long, which led me to omit
a rather important aspect of the compiler: proper error reporting.
Once again our compiler has instances of `throw 0`, which is a cheap way
of avoiding properly handling a runtime error. Before we move on,
it's best to get rid of such blatantly lazy code.
Our existing exceptions (mostly type errors) can use some work, too.
Even the most descriptive issues our compiler reports -- unification errors --
don't include the crucial information of _where_ the error is. For large
programs, this means having to painstakingly read through the entire file
to try figure out which subexpression could possibly have an incorrect type.
This is far from the ideal debugging experience.
Addressing all this is a multi-step change in itself. We want to:
* Replace all `throw 0` code with actual exceptions.
* Replace some exceptions that shouldn't be possible for a user to trigger
with assertions.
* Keep track of source locations of each subexpression, so that we may
be able to print it if it causes an error.
* Be able to print out said source locations at will. This isn't
a _necessity_, but virtually all "big" compilers do this. Instead
of reporting that an error occurs on a particular line, we will
actually print the line.
Let's start with gathering the actual location data.
#### Bison's Locations
Bison actually has some rather nice support for location tracking. It can
automatically assemble the "from" and "to" locations of a nonterminal
from the locations of children, which would be very tedious to write
by hand. We enable this feature using the following option:
{{< codelines "C++" "compiler/13/parser.y" 46 46 >}}
There's just one hitch, though. Sure, Bison can compute bigger
locations from smaller ones, but it must get the smaller ones
from somewhere. Since Bison operates on _tokens_, rather
than _characters_, it effectively doesn't interact with the source
text at all, and can't determine from which line or column a token
originated. The task of determining the locations of input tokens
is delegated to the tokenizer -- Flex, in our case. Flex, on the
other hand, doesn't have a built-in mechanism for tracking
locations. Fortunately, Bison provides a `yy::location` class that
includes most of the needed functionality.
A `yy::location` consists of two source positions, `begin` and `end`,
which themselves are represented using lines and columns. It
also has the following methods:
* `yy::location::columns(int)` advances the `end` position by
the given number of columns, while `begin` stays the same.
If `begin` and `end` both point to the beginning of a token,
then `columns(token_length)` will move `end` to the token's end,
and thus make the whole `location` contain the token.
* `yy::location::lines(int)` behaves similarly to `columns`,
except that it advances `end` by the given number of lines,
rather than columns. It also resets the columns counter to `1`.
* `yy::location::step()` moves `begin` to where `end` is. This
is useful for when we've finished processing a token, and want
to move on to the next one.
For Flex specifically, `yyleng` has the length of the token
currently being processed. Rather than adding the calls
to `columns` and `step` to every rule, we can define the
`YY_USER_ACTION` macro, which is run before each token
is processed.
{{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}
We'll see why we are using `LOC` instead of something like `location` soon;
for now, you can treat `LOC` as if it were a global variable declared
in the tokenizer. Before processing each token, we ensure that
the `yy::location` has its `begin` and `end` at the same position,
and then advance `end` by `yyleng` columns. This is
{{< sidenote "right" "sufficient-note" "sufficient" >}}
This doesn't hold for all languages. It may be possible for a language
to have tokens that contain <code>\n</code>, in which case,
rather than just using <code>yyleng</code>, we'd need to
add special logic to iterate over the token and detect the line
breaks.<br>
<br>
Also, this requires that the <code>end</code> of the previous token was
correctly computed.
{{< /sidenote >}}
to make `LOC` represent our token's source position. For
the moment, don't worry too much about `drv`; this is the
parsing driver, and we will talk about it shortly.
So now we have a "global" variable `LOC` that gives
us the source position of the current token. To get it
to Bison, we have to pass it as an argument to each
of the `make_TOKEN` calls. Here are a few sample lines
that should give you the general idea:
{{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}
That last line is actually new. Previously, we somehow
got away without explicitly sending the end-of-file token to Bison.
I suspect that this was due to some kind of implicit conversion
of the Flex macro `YY_NULL` into a token; now that we have
to pass a position to every token constructor, such an implicit
conversion is probably impossible.
Now we have Bison computing source locations for each nonterminal.
However, at the moment, we still aren't using them. To change that,
we need to add a `yy::location` argument to each of our `ast` nodes,
as well as to the `pattern` subclasses, `definition_defn` and
`definition_data`. To avoid breaking all the code that creates
AST nodes and definitions outside of the parser, we'll make this
argument optional. Inside of `ast.hpp`, we define a new field as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}
Then, we add a constructor to `ast` as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}
Note that it's not optional here, since `ast` itself is an
abstract class, and thus will never be constructed directly.
It is in the subclasses of `ast` that we provide a default
value. The change is rather mechanical, but here's an example
from `ast_binop`:
{{< codelines "C++" "compiler/13/ast.hpp" 98 99 >}}
2020-09-11 21:29:49 -07:00
Finally, we tell Bison to pass the computed location
data as an argument when constructing our data structures.
This too is a mechanical change, and I think the following
few lines demonstrate the general idea in sufficient
2020-09-11 21:29:49 -07:00
detail:
{{< codelines "C++" "compiler/13/parser.y" 92 96 >}}
2020-09-11 21:29:49 -07:00
Here, the `@$` character is used to reference the current
nonterminal's location data.
#### Line Offsets, File Input, and the Parsing Driver
There are three more challenges with printing out the line
of code where an error occurred. First of all, to
print out a line of code, we need to have that line of code
available to us. We do not currently meet this requirement:
our compiler reads code form `stdin` (as is default for Flex),
and `stdin` doesn't always support rewinding. This, in turn,
means that once Flex has read a character from the input,
it may not be possible to go back and retrieve that character
again.
Second, even if we do have have the entire stream or buffer
available to us, to retrieve an offset and length within
that buffer from just a line and column number would be a lot
of work. A naive approach would be to iterate through
the input again, once more keeping track of lines and columns,
and print the desired line once we reach it. However, this
would lead us to redo a lot of work that our tokenizer
is already doing.
Third, Flex's input mechanism, even if it it's configured
not to read from `stdin`, uses a global file descriptor called
`yyin`. However, we're better off minimizing global state (especially
if we want to read, parse, and compile multiple files in
the future). While we're configuring Flex's input mechanism,
we may as well fix this, too.
There are several approaches to fixing the first issue. One possible
way is to store the content of `stdin` into a temporary file. Then,
it's possible to read from the file multiple times by using
the C functions `fseek` and `rewind`. However, since we're
working with files, why not just work directly with the files
created by the user? Instead of reading from `stdin`, we may
as well take in a path to a file via `argv`, and read from there.
Also, instead of `fseek` and `rewind`, we can just read the file
into memory, and access it like a normal character buffer. This
does mean that we can stick with `stdin`, but it's more conventional
to read source code from files, anyway.
To address the second issue, we can keep a mapping of line numbers
to their locations in the source buffer. This is rather easy to
maintain using an array: the first element of the array is 0,
which is the beginning of the first line in any source file. From there,
every time we encounter the character `\n`, we can push
the current source location to the top, marking it as
the beginning of another line. Where exactly we store this
array is as yet unclear, since we're trying to avoid global variables.
Finally, to begin addressing the third issue, we can use Flex's `reentrant`
option, which makes it so that all of the tokenizer's state is stored in an
opaque `yyscan_t` structure, rather than in global variables. This way,
we can configure `yyin` without setting a global variable, which is a step
in the right direction. We'll work on this momentarily.
Our tokenizing and parsing stack has more global variables
than just those specific to Flex. Among these variables is `global_defs`,
which receives all the top-level function and data type definitions. We
will also need some way of accessing the `yy::location` instance, and
a way of storing our file input in memory. Fortunately, we're not
the only ones to have ever come across the issue of creating non-global
state: the Bison documentation has a
2020-09-11 21:29:49 -07:00
[section in its C++ guide](https://www.gnu.org/software/bison/manual/html_node/Calc_002b_002b-Parsing-Driver.html)
that describes a technique for manipulating
state -- "parsing context", in their words. This technique involves the
creation of a _parsing driver_.
The parsing driver is a class (or struct) that holds all the parse-related
state. We can arrange for this class to be available to our tokenizing
and parsing functions, which will allow us to use it pretty much like we'd
use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
We can define it as follows:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}
There aren't many fields here. The `file_name` string represents
the file that we'll be reading code from. The `location` field
will be accessed by Flex via `get_current_location`. Bison will
store the function and data type definitions it reads into `global_defs`
via `get_global_defs`. Finally, `file_m` will be used to keep track
of the content of the file we're reading, as well as the line offsets
within that file. Notice that a couple of these fields are pointers
that we take by reference in the constructor. The `parse_driver` doesn't
_own_ the global definitions, nor the file manager. They exist outside
of it, and will continue to be used in other ways the `parse_driver`
does not need to know about. Also, the `LOC` variable in Flex is
actually a call to `get_current_location`:
{{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}
The methods of `parse_driver` are rather simple. The majority of
them deals with giving access to the parser's members: the `yy::location`,
the `definition_group`, and the `file_mgr`. The only exception
to this is `operator()`, which we use to actually trigger the parsing process.
We'll make this method return `true` if parsing succeeded, and `false`
otherwise (if, say, the file we tried to read doesn't exist).
Here's its implementation:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}
2020-09-11 21:29:49 -07:00
We try open the user-specified file, and return `false` if we can't.
After this, we start doing the setup specific to a reentrant
2020-09-11 21:29:49 -07:00
Flex scanner. We declare a `yyscan_t` variable, which
will contain all of Flex's state. Then, we initialize
it using `yylex_init`. Finally, since we can no longer
touch the `yyin` global variable (it doesn't exist),
we have to resort to using a setter function provided by Flex
to configure the tokenizer's input stream.
Next, we construct our Bison-generated parser. Note that
unlike before, we have to pass in two arguments:
`scanner` and `*this`, the latter being of type `parse_driver&`.
We'll come back to how this works in a moment. With
the scanner and parser initialized, we invoke `parser::operator()`,
which actually runs the Flex- and Bison-generated code.
To clean up, we run `yylex_destroy` and `fclose`. Finally,
we call `file_mgr::finalize`, and return. But what
_is_ `file_mgr`?
The `file_mgr` class does two things: it stores the part of the file
that has already been read by Flex in memory, and it keeps track of
where each line in our source file begins within the text. Here is its
definition:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
In this class, the `string_stream` member is used to construct
an `std::string` from the bits of text that Flex reads,
processes, and feeds to the `file_mgr` using the `write` method.
It's more efficient to use a string stream than to concatenate
strings repeatedly. Once Flex is finished processing the file,
the final contents of the `string_stream` are transferred into
the `file_contents` string using the `finalize` method. The `offset`
and `line_offsets` fields will be used as we described earlier: each time Flex
encounters the `\n` character, the `offset` variable will pushed
in top of the `line_offsets` vector, marking the beginning of
the corresponding line. The methods of the class are as follows:
* `write` will be called from Flex, and will allow us to
record the content of the file we're processing to the `string_stream`.
We've already seen it used in the `YY_USER_ACTION` macro.
* `mark_line` will also be called from Flex, and will mark the current
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
* `finalize` will be called by the `parse_driver` when the parsing
finishes. At this time, the `string_stream` should contain all of
the input file, and this data is transferred to `file_contents`, as
we mentioned above.
* `get_index` and `get_line_end` will be used for converting
`yy::location` instances to offsets within the source code buffer.
* `print_location` will be used for printing errors.
It will print the lines spanned by the given location, with the
location itself colored and underlined if the last argument is `true`.
This will make our errors easier on the eyes.
2020-09-11 21:29:49 -07:00
Let's take a look at their implementations. First, `write`.
For the most part, this method is a proxy for the `write`
method of our `string_stream`:
2020-09-11 21:29:49 -07:00
{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}
2020-09-11 21:29:49 -07:00
We do, however, also keep track of the `file_offset` variable
here, which ensures we have up-to-date information
regarding our position in the source file. The implementation
of `mark_line` uses this information:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}
The `finalize` method is trivial, and requires little additional
discussion:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}
2020-09-11 21:29:49 -07:00
Once we have the line offsets, `get_index` becomes very simple:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}
2020-09-11 21:29:49 -07:00
Here, we use an assertion for the first time. Calling
`get_index` with a negative or zero line doesn't make
any sense, since Bison starts tracking line numbers
at 1. Similarly, asking for a line for which we don't
have a recorded offset is invalid. Both
of these nonsensical calls to `get_index` cannot
be caused by the user under normal circumstances,
and indicate the method's misuse by the author of
the compiler (us!). Thus, we terminate the program.
Finally, the implementation of `line_end` just finds the
beginning of the next line. We stick to the C convention
of marking 'end' indices exclusive (pointing just past
the end of the array):
{{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}
2020-09-11 21:29:49 -07:00
Since `line_offsets` has as many elements as there are lines,
the last line number would be equal to the vector's size.
When looking up the end of the last line, we can't look for
the beginning of the next line, so instead we return the end of the file.
Next, the `print_location` method prints three sections
of the source file. These are the text "before" the error,
the error itself, and, finally, the text "after" the error.
For example, if an error began on the fifth column of the third
line, and ended on the eighth column of the fourth line, the
"before" section would include the first four columns of the third
line, and the "after" section would be the ninth column onward
on the fourth line. Before and after the error itself,
if the `highlight` argument is true,
we sprinkle the ANSI escape codes to enable and disable
special formatting, respectively. For now, the special
formatting involves underlining the text and making it red.
{{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}
2020-09-11 21:29:49 -07:00
Finally, to get the forward declarations for the `yy*` functions
and types, we set the `header-file` option in Flex:
{{< codelines "C++" "compiler/13/scanner.l" 3 3 >}}
We also include this `scanner.hpp` file in our `parse_driver.cpp`:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 2 2 >}}
#### Adding the Driver to Flex and Bison
Bison's C++ language template generates a class called
`yy::parser`. We don't really want to modify this class
in any way: not only is it generated code, but it's
also rather complex. Instead, Bison provides us
with a mechanism to pass more data in to the parser.
This data is made available to all the actions
that the parser runs. Better yet, Bison also attempts
to pass this data on to the tokenizer, which in our
case would mean that whatever data we provide Bison
will also be available to Flex. This is how we'll
allow the two components to access our new `parse_driver`
class. This is also how we'll pass in the `yyscan_t`
that Flex now needs to run its tokenizing code. To
do all this, we use Bison's `%param` option. I'm
going to include a few more lines from `parser.y`,
since they contain the necessary `#include` directives
and a required type definition:
{{< codelines "C++" "compiler/13/parser.y" 1 18 >}}
The `%param` option effectively adds the parameter listed
between the curly braces to the constructor of the generated
`yy::parser`. We've already seen this in the implementation
of our driver, where we passed `scanner` and `*this` as
arguments when creating the parser. The parameters we declare are also passed to the
`yylex` function, which is expected to accept them in the same order.
Since we're adding `parse_driver` as an argument we have to
declare it. However, we can't include the `parse_driver` header
right away because `parse_driver` itself includes the `parser` header:
we'd end up with a circular dependency. Instead, we resort to
forward-declaring the driver class, as well as the `yyscan_t`
structure containing Flex's state.
Adding a parameter to Bison doesn't automatically affect
Flex. To let Flex know that its `yylex` function must now accept
the state and the parsing driver, we have to define the
2020-09-11 21:29:49 -07:00
`YY_DECL` macro. We do this in `parse_driver.hpp`, since
this forward declaration will be used by both Flex
and Bison:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}
2020-09-11 21:29:49 -07:00
#### Improving Exceptions
Now, it's time to add location data (and a little bit more) to our
exceptions. We want to make it possible for exceptions to include
data about where the error occurred, and to print this data to the user.
However, it's also possible for us to have exceptions that simply
do not have that location data. Furthermore, we want to know
whether or not an exception has an associated location; we'd
rather not print an invalid or "default" location when an error
occurs.
In the old days of programming, we could represent the absence
of location data with a `nullptr`, or `NULL`. But not only
does this approach expose us to all kind of `NULl`-safety
bugs, but it also requires heap allocation! This doesn't
make it sound all that appealing; instead, I think we should
opt for using `std::optional`.
Though `std::optional` is standard (as may be obvious from its
namespace), it's a rather recent addition to the C++ STL.
In order to gain access to it, we need to ensure that our
project is compiled using C++17. To this end, we add
the following two lines to our CMakeLists.txt:
{{< codelines "CMake" "compiler/13/CMakeLists.txt" 5 6 >}}
Now, let's add a new base class for all of our compiler errors,
unsurprisingly called `compiler_error`:
{{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}
2020-09-11 21:29:49 -07:00
We'll put some 'common' exception functionality
into the `print_location` and `print_about` methods. If the error
has an associated location, the former method will print that
location to the screen. We don't always want to highlight
the part of the code that caused the error: for instance,
an invalid data type definition may span several lines,
and coloring that whole section of text red would be
too much. To address this, we add the `highlight`
boolean argument, which can be used to switch the
colors on and off. The `print_about` method
will simply print the `what()` message of the exception,
in addition to the "specific" error that occurred (stored
in `description`). Here are the implementations of the
functions:
{{< codelines "C++" "compiler/13/error.cpp" 3 16 >}}
We will also add a `pretty_print` method to all of
our exceptions. This method will handle
all the exception-specific printing logic.
For the generic compiler error, this means
simply printing out the error text and the location:
{{< codelines "C++" "compiler/13/error.cpp" 18 21 >}}
For `type_error`, this logic slightly changes,
enabling colors when printing the location:
{{< codelines "C++" "compiler/13/error.cpp" 27 30 >}}
Finally, for `unification_error`, we also include
the code to print out the two types that our
compiler could not unify:
{{< codelines "C++" "compiler/13/error.cpp" 32 41 >}}
There's a subtle change here. Compared to the previous
type-printing code (which we had in `main`), what
we wrote here deals with "expected" and "actual" types.
The `left` type passed to the exception is printed
first, and is treat like the "correct" type. The
`right` type, on the other hand, is treated
like the "wrong" type that should have been
unifiable with `left`. This will affect the
calling conventions of our unification code.
2020-09-11 21:29:49 -07:00
Now, we can go through and find all the places where
we `throw 0`. One such place was in the data type
definition code, where declaring the same type parameter
twice is invalid. We replace the `0` with a
`compiler_error`:
{{< codelines "C++" "compiler/13/definition.cpp" 66 69 >}}
Not all `throw 0` statements should become exceptions.
For example, here's code from the previous version of
the compiler:
{{< codelines "C++" "compiler/12/definition.cpp" 123 127 >}}
If a definition `def_defn` has a dependency on a "nearby" (declared
in the same group) definition called `dependency`, and if
`dependency` does not exist within the same definition group,
we throw an exception. But this error is impossible
for a user to trigger: the only reason for a variable to appear
in the `nearby_variables` vector is that it was previously
found in the definition group. Here's the code that proves this
(from the current version of the compiler):
{{< codelines "C++" "compiler/13/definition.cpp" 102 106 >}}
Not being able to find the variable in the definition group
is a compiler bug, and should never occur. So, instead
of throwing an exception, we'll use an assertion:
{{< codelines "C++" "compiler/13/definition.cpp" 128 128 >}}
For more complicated error messages, we can use a `stringstream`.
Here's an example from `parsed_type`:
{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
In general, this change is also rather mechanical. Before we
move on, to maintain a balance between exceptions and assertions, here
2020-09-11 21:29:49 -07:00
are a couple more assertions from `type_env`:
2020-09-18 14:42:50 -07:00
{{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}}
2020-09-11 21:29:49 -07:00
Once again, it should not be possible for the compiler
to try generalize the type of a variable that doesn't
exist, and nor should generalization occur twice.
While we're on the topic of types, let's talk about
`type_mgr::unify`. In practice, I suspect that a lot of
errors in our compiler will originate from this method.
However, at present, this method does not in any way
track the locations of where a unification error occurred.
To fix this, we add a new `loc` parameter to `unify`,
which we make optional to allow for unification without
a known location. Here's the declaration:
{{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}
2020-09-11 21:29:49 -07:00
The change to the implementation is mechanical and repetitive,
so instead of showing you the whole method, I'll settle for
a couple of lines:
{{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}
2020-09-11 21:29:49 -07:00
We want to make sure that a location provided to the
top-level call to `unify` is also forwarded to the
recursive calls, so we have to explicitly add it
to the call.
We'll also have to update the 'main' code to call the
`pretty_print` methods, but there's another big change
that we're going to make before then. However, once that
change is made, our errors will look a lot better.
Here is what's printed out to the user when a type error
occurs:
2020-09-11 21:29:49 -07:00
```
an error occured while checking the types of the program: failed to unify types
occuring on line 2:
3 + False
the expected type was:
Int
2020-09-11 21:29:49 -07:00
while the actual type was:
Bool
2020-09-11 21:29:49 -07:00
```
Here's an error that was previously a `throw 0` statement in our code:
2020-09-11 21:29:49 -07:00
```
an error occured while compiling the program: type variable a used twice in data type definition.
occuring on line 1:
data Pair a a = { MkPair a a }
```
Now, not only have we eliminated the lazy uses of `throw 0` in our
code, but we've also improved the presentation of the errors
to the user!
### Rethinking Name Mangling
In the previous post, I said the following:
> One more thing. Lets adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a gotcha for anyone working on the compiler, we will always use the mangled names during compilation.
Now that I've had some more time to think about it
(and now that I've returned to the compiler after
a brief hiatus), I think that this was not the right call.
Mangled names make sense when translating to LLVM; we certainly
don't want to declare two LLVM functions
{{< sidenote "right" "mangling-note" "with the same name." >}}
By the way, LLVM has its own name mangling functionality. If you
declare two functions with the same name, they'll appear as
<code>function</code> and <code>function.0</code>. Since LLVM
uses the <code>Function*</code> C++ values to refer to functions,
as long as we keep them seaprate on <em>our</em> end, things will
work.<br>
<br>
However, in our compiler, name mangling occurs before LLVM is
introduced, at translation time. We could create LLVM functions
at that time, too, and associate them with variables. But then,
our G-machine instructions will be coupled to LLVM, which
would not be as clean.
{{< /sidenote >}}
But things are different for local variables. Our local variables
are graphs on a stack, and are not actually compiled to LLVM
definitions. It doesn't make sense to mangle their names, since
their names aren't present anywhere in the final executable.
It's not even "consistent" to mangle them, since global definitions
are compiled directly to __PushGlobal__ instructions, while local
variables are only referenced through the current `env`.
So, I opted to reverse my decision. We will go back to
placing variable names directly into `env_var`. Here's
an example of this from `global_scope.cpp`:
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
Now that we've started using assertions, I also think it's worth
to put our new invariant -- "only global definitions have mangled
names" -- into code:
2020-09-18 14:42:50 -07:00
{{< codelines "C++" "compiler/13/type_env.cpp" 36 45 >}}
Furthermore, we'll _require_ that a global definition
has a mangled name. This way, we can be more confident
that a variable from a __PushGlobal__ instruction
is referencing the right function. To achieve
this, we change `get_mangled_name` to stop
returning the input string if a mangled name was not
found; doing so makes it impossible to check if a mangled
name was explicitly defined. Instead,
2020-09-18 14:42:50 -07:00
we add two assertions. First, if an environment scope doesn't
contain a variable, then it _must_ have a parent.
If it does contain variable, that variable _must_ have
a mangled name. We end up with the following:
{{< codelines "C++" "compiler/13/type_env.cpp" 47 55 >}}
For this to work, we make one more change. Now that we've
enabled C++17, we have access to `std::optional`. We
can thus represent the presence or absence of mangled
names using an optional field, rather than with the empty string `""`.
I hear that C++ compilers have pretty good
[empty string optimizations](https://www.youtube.com/watch?v=kPR8h4-qZdk),
but nonetheless, I think it makes more sense semantically
to use "absent" (`nullopt`) instead of "empty" (`""`).
Here's the definition of `type_env::variable_data` now:
{{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}}
Since looking up a mangled name for non-global variable
{{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}}
A very wise human at the very dawn of our species once said,
"make illegal states unrepresentable". Their friends and family were a little
busy making a fire, and didn't really understand what the heck they meant. Now,
we kind of do.<br>
<br>
It's <em>possible</em> for our <code>type_env</code> to include a
<code>variable_data</code> entry that is both global and has no mangled
name. But it doesn't have to be this way. We could define two subclasses
of <code>variable_data</code>, one global and one local,
where only the global one has a <code>mangled_name</code>
field. It would be impossible to reach this assertion failure then.
{{< /sidenote >}} we have to change
`ast_lid::compile` to only call `get_mangled_name` once
it ensures that the variable being compiled is, in fact,
global:
{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}
Since all global functions now need to have mangled
names, we run into a bit of a problem. What are
the mangled names of `(+)`, `(-)`, and so on? We could
continue to hardcode them as `plus`, `minus`, etc., but this can
(and currently does!) lead to errors. Consider the following
piece of code:
```
defn plus x y = { x + y }
defn main = { plus 320 6 }
```
We've hardcoded the mangled name of `(+)` to be `plus`. However,
`global_scope` doesn't know about this, so when the actual
`plus` function gets translated, it also gets assigned the
mangled name `plus`. The name is also overwritten in the
`llvm_context`, which effectively means that `(+)` is
now compiled to a call of the user-defined `plus` function.
If we didn't overwrite the name, we would've run into an assertion
failure in this scenario anyway. In short, this example illustrates
an important point: mangling information needs to be available
outside of a `global_scope`. We don't want to do this by having
every function take in a `global_scope` to access the mangling
information; instead, we'll store the mangling information in
a new `mangler` class, which `global_scope` will take as an argument.
The new class is very simple:
{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}
As with `parse_driver`, `global_scope` takes `mangler` by reference
and stores a pointer:
{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}
The implementation of `new_mangled_name` doesn't change, so I'm
not going to show it here. With this new mangling information
in hand, we can now correctly set the mangled names of binary
operators:
{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}
Wait a moment, what's a `compiler`? Let's talk about that next.
### A Top-Level Class
Now that we've moved name mangling out of `global_scope`, we have
to put it somewhere. The same goes for global definition group
and the file manager that are given to `parse_driver`. The two
classes _make use_ of the other data, but they don't _own it_.
That's why they take it by reference, and store it as a pointer.
They're just temporarily allowed access.
So, what should be the owner of all of these disparate components?
Thus far, that has been the `main` function, or the utility
functions that it calls out to. However, this is sloppy:
we have related data and operations on it, but we don't group
them into an object. We can group all of the components of our
compiler into a `compiler` object, and leave `main.cpp` with
exception printing code.
The definition of the `compiler` class begins with all of the data
structures that we use in the process of compilation:
{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}
There's a loose ordering to these fields. In C++, class members are
initialized in the order they are declared; we therefore want to make
sure that fields that are depended on by other fields are initialized first.
Otherwise, I tried to keep the order consistent with the conceptual path
of the code through the compiler.
* Parsing happens first, so we begin with `parse_driver`, which needs a
`file_manager` (to populate with line information) and a `definition_group`
(to receive the global definitions from the parser).
* We then proceed to typechecking, for which we use a global `type_env_ptr`
(to define the built-in functions and constructors) and a `type_mgr` (to
manage the assignments of type variables).
* Once a program is typechecked, we transform it, eliminating local
function definitions and lambda functions. This is done by storing
newly-emitted global functions into the `global_scope`, which requires a
`mangler` to generate new names for the target functions.
* Finally, to generate LLVM IR, we need our `llvm_context` class.
The methods of the compiler are arranged similarly:
{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}
The methods go as follows:
* `add_default_types` adds the built-in types to the `global_env`.
At this point, these types only include `Int`.
* `add_binop_type` adds a single binary operator to the global
type environment. We saw its implementation earlier: it deals
with both binding a type, and setting a mangled name.
* `add_default_types` adds the types for each binary operator.
* `parse`, `typecheck`, `translate` and `compile` all do exactly
what they say. In this case, compilation refers to creating G-machine
instructions.
* `create_llvm_binop` creates an internal function that forces the
evaluation of its two arguments, and actually applies the given binary
operator. Recall that the `(+)` in user code constructs a call to this
function, but leaves it unevaluated until it's needed.
* `generate_llvm` converts all the definitions in `global_scope`, which
are at this point compiled into G-machine `instruction`s, into LLVM IR.
* `output_llvm` contains all the code to actually generate an object
file from the LLVM IR.
These functions are mostly taken from part 12's `main.cpp`, and adjusted
to use the `compiler`'s members rather than local definitions or arguments.
You should compare part 12's
[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
file with the
[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
file that we end up with at the end of this post.
Next, we have the compiler's constructor, and its `operator()`. The
latter, analogously to our parsing driver, will trigger the compilation
process. Their implementations are straightforward:
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
We also add a couple of methods to give external code access to
some of the compiler's data structures. I omit their (trivial)
implementations, but they have the following signatures:
{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}
With all the compilation code tucked into our new `compiler` class,
`main` becomes very simple. We also finally get to use our exception
pretty printing code:
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
With this, we complete our transition to a compiler object.
All that's left is to clean up the code style.
### Keeping Things Private
Hand-writing or generating hundreds of trivial getters and setters
for the fields of a data class (which is standard in the world of Java) seems
absurd to me. So, for most of this project, I stuck with
`struct`s, rather than classes. But this is not a good policy
to apply _everywhere_. I still think it makes sense to make
data structures like `ast` and `type` public-by-default;
however, I _don't_ think that way about classes like `type_mgr`,
`llvm_context`, `type_env`, and `env`. All of these have information
that we should never be accessing directly. Some guard this
information with assertions. In short, it should be protected.
For most classes, the changes are mechanical. For instance, we
can make `type_env` a class simply by changing its declaration,
and marking all of its functions public. This requires a slight
refactoring of a line that used its `parent` field. Here's
what it used to be (in context):
{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}
And here's what it is now:
{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}
Rather than traversing the chain of environments from
the body of the definition, we just use the definition's
own `env_ptr`. This is cleaner and more explicit, and
it helps us not use the private members of `type_env`!
The deal with `env` is about as simple. We just make
it and its two descendants classes, and mark their
methods and constructors public. The same
goes for `global_scope`. To make `type_mgr`
a class, we have to add a new method: `lookup`.
Here's its implementation:
{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}
It's used in `type_var::print` as follows:
{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}
We can't use `resolve` here because it takes (and returns)
a `type_ptr`. If we make it _take_ a `type*`, it won't
be able to return its argument if it's already resolved. If we
allow it to _return_ `type*`, we won't have an owning
reference. We also don't want to duplicate the
method just for this one call. Notice, though, how similar
`type_var::print`/`lookup` and `resolve` are in terms of execution.
The change for `llvm_context` requires a little more work.
Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
Since we don't want to forward each of the LLVM builder methods,
and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
we'll just provide a getter for the `builder` field. The
same goes for `module`:
{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}
Here's what some of the code from `instruction.cpp` looks like now:
{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}
Right now, the `ctx` field of the `llvm_context` (which contains
the `llvm::LLVMContext`) is only externally used to create
instances of `llvm::BasicBlock`. We'll add a proxy method
for this functionality:
{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}
Finally, `instruction_pushglobal` needs to access the
`llvm::Function` instances that we create in the process
of compilation. We add a new `get_custom_function` method
to support this, which automatically prefixes the function
name with `f_`, much like `create_custom_function`:
{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}
I think that's enough. If we chose to turn more compiler
data structures into classes, I think we would've quickly drowned
in one-line getter and setter methods.
That's all for the cleanup! We've added locations and more errors
to the compiler, stopped throwing `0` in favor of proper exceptions
or assertions, made name mangling more reasonable, fixed a bug with
accidentally shadowing default functions, organized our compilation
process into a `compiler` class, and made more things into classes.
In the next post, I hope to tackle __strings__ and __Input/Output__.
I also think that implementing __modules__ would be a good idea,
though at the moment I don't know too much on the subject. I hope
you'll join me in my future writing!
### Appendix: Optimization
When I started working on the compiler after the previous post,
I went a little overboard. I started working on optimizing the generated programs,
but eventually decided I wasn't doing a
{{< sidenote "right" "good-note" "good enough" >}}
I think authors should feel a certain degree of responsibility
for the content they create. If I do something badly, somebody
else trusts me and learns from it, who knows how much damage I've done.
I try not to do damage.<br>
<br>
If anyone reads what I write, anyway!
{{< /sidenote >}} job to present it to others,
and scrapped that part of the compiler altogether. I'm not
sure if I will try again in the near future. But,
if you're curious about optimization, here are a few avenues
I've explored or thought about:
* __Unboxing numbers__. Right now, numbers are allocated and garbage
collected just like the rest of the graph nodes. This is far from ideal.
We could use pointers to represent numbers, by tagging their most significant
bits on 64-bit CPUs. Rather than allocating a node, the runtime will just
cast a number to a pointer, tag it, and push it on the stack.
* __Converting enumeration data types to numbers__. If no constructor
of a data type takes any arguments, then the tag uniquely identifies
each constructor. Combined with unboxed numbers, this can save unnecessary
allocations and memory accesses.
* __Special treatment for global constants__. It makes sense for
global functions to be converted into LLVM functions, but the
same is not the case for
{{< sidenote "right" "constant-note" "constants." >}}
Yeah, yeah, a constant is just a nullary function. Get
out of here with your pedantry!
{{< /sidenote >}} We can find a way to
initialize global constants once, which would save some work. To
make more constants suitable for this, we could employ
[monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction).
* __Optimizing stack operations.__ If you read through the LLVM IR
we produce, you can see a lot of code that peeks at something twice,
or pops-then-pushes the same value, or does other absurd things. LLVM
isn't aware of the semantics of our stacks, but perhaps we could write an
optimization pass to deal with some of the more blatant instances of
this issue.
If you attempt any of these, let me know how it goes, please!