Update blog post, switching away from two sections.

Danila Fedorin 2020-09-17 22:35:40 -07:00
parent 7226d66f67
commit 98cac103c4

@ -8,28 +8,26 @@ description: "In this post, we clean up our compiler and add some basic optimiza
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in` In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
and lambda expressions to our compiler. At the end of that post, I mentioned and lambda expressions to our compiler. At the end of that post, I mentioned
that before we move on to bigger and better things, I wanted to take a that before we move on to bigger and better things, I wanted to take a
step back and clean up the compiler. step back and clean up the compiler. Now is the time to do that.
Recently, I got around to doing that. Unfortunately, I also got around to doing In particular, I identified four things that could be improved
a lot more. Furthermore, I managed to make the changes in such a way that I or cleaned up:
can't cleanly separate the 'cleanup' and 'optimization' portions of my work.
This is partially due to the way in which I organize code, where each post
is associated with a version of the compiler with the necessary changes.
Because of all this, instead of making this post about the cleanup, and the
next post about the optimizations, I have to merge them into one.
So, this post is split into two major portions: cleanup, which deals mostly * __Error handling__. We need to stop using `throw 0` and start
with touching up exceptions and improving the 'name mangling' logic, and using `assert`. We can also make our errors much more descriptive
optimizations, which deals with adding special treatment to booleans, by including source locations in the output.
unboxing integers, and implementing more binary operators. * __Name mangling__. I don't think I got it quite right last
time. Now is the time to clean it up.
* __Code organization__. I think we can benefit from a top-level
class, and a clearer "dependency order" between the various
classes and structures we've defined.
* __Code style__. In particular, I've been lazily using `struct`
in a lot of places. That's not a good idea; it's better
to use `class`, and only expose _some_ fields and methods
to the rest of the code.
### Section 1: Cleanup ### Error Reporting and Handling
The previous post was rather long, which led me to omit
The previous post was
{{< sidenote "right" "long-note" "rather long," >}}
Probably not as long as this one, though! I really need to get the
size of my posts under control.
{{< /sidenote >}} which led me to omit
a rather important aspect of the compiler: proper error reporting. a rather important aspect of the compiler: proper error reporting.
Once again our compiler has instances of `throw 0`, which is a cheap way Once again our compiler has instances of `throw 0`, which is a cheap way
of avoiding properly handling a runtime error. Before we move on, of avoiding properly handling a runtime error. Before we move on,
@ -62,7 +60,7 @@ automatically assemble the "from" and "to" locations of a nonterminal
from the locations of children, which would be very tedious to write from the locations of children, which would be very tedious to write
by hand. We enable this feature using the following option: by hand. We enable this feature using the following option:
{{< codelines "C++" "compiler/13/parser.y" 50 50 >}} {{< codelines "C++" "compiler/13/parser.y" 46 46 >}}
There's just one hitch, though. Sure, Bison can compute bigger There's just one hitch, though. Sure, Bison can compute bigger
locations from smaller ones, but it must get the smaller ones locations from smaller ones, but it must get the smaller ones
@ -97,25 +95,27 @@ to `columns` and `step` to every rule, we can define the
`YY_USER_ACTION` macro, which is run before each token `YY_USER_ACTION` macro, which is run before each token
is processed. is processed.
{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}} {{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}
We'll see why we are using `drv` soon; for now, you can treat We'll see why we are using `LOC` instead of something like `location` soon;
`location` as if it were a global variable declared in the for now, you can treat `LOC` as if it were a global variable declared
tokenizer. Before processing each token, we ensure that in the tokenizer. Before processing each token, we ensure that
`location` has its `begin` and `end` at the same position, the `yy::location` has its `begin` and `end` at the same position,
and then advance `end` by `yyleng` columns. This is sufficient and then advance `end` by `yyleng` columns. This is sufficient
to make `location` represent our token's source position. to make `LOC` represent our token's source position. For
the moment, don't worry too much about `drv`; this is the
parse driver, and we will talk about it shortly.
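To make the "step, then advance" bookkeeping concrete, here is a minimal sketch. The `position` and `location` structs below are simplified stand-ins that I made up for illustration; they only mimic the `step` and `columns` operations of Bison's generated `yy::location`, and `advance_over` is a hypothetical helper that plays the role of the `YY_USER_ACTION` hook.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for Bison's position/location classes.
// These are assumptions for illustration, not the generated code.
struct position {
    int line = 1;
    int column = 1;
};

struct location {
    position begin;
    position end;

    // Collapse the range so that it starts where the last token ended.
    void step() { begin = end; }

    // Extend the end of the range by `count` columns.
    void columns(int count = 1) {
        assert(count >= 0);
        end.column += count;
    }
};

// Hypothetical helper doing what the hook does before each token:
// reset the range, then stretch it over the token's text.
location advance_over(location& loc, const std::string& token_text) {
    loc.step();
    loc.columns(static_cast<int>(token_text.size()));
    return loc;
}
```

After each call, `begin` points at the first character of the token and `end` points just past its last character, which is exactly the range we want to hand to Bison.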
So now we have a "global" variable `location` that gives So now we have a "global" variable `LOC` that gives
us the source position of the current token. To get it us the source position of the current token. To get it
to Bison, we have to pass it as an argument to each to Bison, we have to pass it as an argument to each
of the `make_TOKEN` calls. Here are a few sample lines of the `make_TOKEN` calls. Here are a few sample lines
that should give you the general idea: that should give you the general idea:
{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}} {{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}
That last line is actually new. Previously, we somehow That last line is actually new. Previously, we somehow
got away without explicitly sending the EOF token to Bison. got away without explicitly sending the end-of-file token to Bison.
I suspect that this was due to some kind of implicit conversion I suspect that this was due to some kind of implicit conversion
of the Flex macro `YY_NULL` into a token; now that we have of the Flex macro `YY_NULL` into a token; now that we have
to pass a position to every token constructor, such an implicit to pass a position to every token constructor, such an implicit
@ -146,10 +146,10 @@ from `ast_binop`:
Finally, we tell Bison to pass the computed location Finally, we tell Bison to pass the computed location
data as an argument when constructing our data structures. data as an argument when constructing our data structures.
This too is a mechanical change, and I think the following This too is a mechanical change, and I think the following
couple of lines demonstrate the general idea in sufficient few lines demonstrate the general idea in sufficient
detail: detail:
{{< codelines "C++" "compiler/13/parser.y" 107 110 >}} {{< codelines "C++" "compiler/13/parser.y" 92 96 >}}
Here, the `@$` character is used to reference the current Here, the `@$` character is used to reference the current
nonterminal's location data. nonterminal's location data.
@ -189,7 +189,9 @@ working with files, why not just work directly with the files
created by the user? Instead of reading from `stdin`, we may created by the user? Instead of reading from `stdin`, we may
as well take in a path to a file via `argv`, and read from there. as well take in a path to a file via `argv`, and read from there.
Also, instead of `fseek` and `rewind`, we can just read the file Also, instead of `fseek` and `rewind`, we can just read the file
into memory, and access it like a normal character buffer. into memory, and access it like a normal character buffer. This
does mean that we can stick with `stdin`, but it's more conventional
to read source code from files, anyway.
To address the second issue, we can keep a mapping of line numbers To address the second issue, we can keep a mapping of line numbers
to their locations in the source buffer. This is rather easy to to their locations in the source buffer. This is rather easy to
@ -200,7 +202,7 @@ the current source location to the top, marking it as
the beginning of another line. Where exactly we store this the beginning of another line. Where exactly we store this
array is as yet unclear, since we're trying to avoid global variables. array is as yet unclear, since we're trying to avoid global variables.
Finally, begin addressing the third issue, we can use Flex's `reentrant` Finally, to begin addressing the third issue, we can use Flex's `reentrant`
option, which makes it so that all of the tokenizer's state is stored in an option, which makes it so that all of the tokenizer's state is stored in an
opaque `yyscan_t` structure, rather than in global variables. This way, opaque `yyscan_t` structure, rather than in global variables. This way,
we can configure `yyin` without setting a global variable, which is a step we can configure `yyin` without setting a global variable, which is a step
@ -221,50 +223,38 @@ creation of a _parsing driver_.
The parsing driver is a class (or struct) that holds all the parse-related The parsing driver is a class (or struct) that holds all the parse-related
state. We can arrange for this class to be available to our tokenizing state. We can arrange for this class to be available to our tokenizing
and parsing functions, which will allow us to use it pretty much like we'd and parsing functions, which will allow us to use it pretty much like we'd
use a global variable. We can define it as follows: use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
We can define it as follows:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}} {{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}
There are quite a few fields here. The `file_name` string represents There aren't many fields here. The `file_name` string represents
the file that we'll be reading code from. the `string_stream` will the file that we'll be reading code from. The `location` field
be used to back up the contents of source file as Flex reads them; will be accessed by Flex via `get_current_location`. Bison will
once Flex is done, the content of the `string_stream` will be store the function and data type definitions it reads into `global_defs`
saved into the `file_content` string. via `get_global_defs`. Finally, `file_m` will be used to keep track
of the content of the file we're reading, as well as the line offsets
within that file. Notice that a couple of these fields are pointers
that we take by reference in the constructor. The `parse_driver` doesn't
_own_ the global definitions, nor the file manager. They exist outside
of it, and will continue to be used in other ways the `parse_driver`
does not need to know about. Also, the `LOC` variable in Flex is
actually a call to `get_current_location`:
The next three fields deal with tracking source code {{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}
locations. The `location` field will be accessed by Flex
via `drv.location` (where `drv` is a reference to our driver class).
The `file_offset` and `line_offsets` fields will be used to
keep track of where each line begins, as we have discussed above.
Finally, `global_defs` will be the new home of our top-level
definitions.
The methods on `parse_driver` are rather simple, too: The methods of `parse_driver` are rather simple. The majority of
them deal with giving access to the parser's members: the `yy::location`,
the `definition_group`, and the `file_mgr`. The only exception
to this is `operator()`, which we use to actually trigger the parsing process.
We'll make this method return `true` if parsing succeeded, and `false`
otherwise (if, say, the file we tried to read doesn't exist).
Here's its implementation:
* `run_parse` handles the initialization of the tokenizer {{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}
and parser, which includes obtaining the `FILE*` and configuring
Flex to use it. It also handles invoking the parsing code.
We'll make this method return `true` if parsing succeeded,
and `false` otherwise (if, say, the file we tried to read doesn't exist).
* `write` will be called from Flex, and will allow us to
record the content of the file we're processing to the `string_stream`.
We've already seen it used in the `YY_USER_ACTION` macro.
* `mark_line` will also be called from Flex, and will mark the current
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
* `get_index` and `get_line_end` will be used for converting
`yy::location` instances to offsets within the source code buffer.
* `print_location` will be used for printing errors.
It will print the lines spanned by the given location, with the
location itself colored and underlined if the last argument is `true`.
This will make our errors easier on the eyes.
Let's take a look at their implementations. First, `run_parse`:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
We try to open the user-specified file, and return `false` if we can't. We try to open the user-specified file, and return `false` if we can't.
We then initialize `line_offsets` as we discussed above. After After this, we start doing the setup specific to a reentrant
this, we start doing the setup specific to a reentrant
Flex scanner. We declare a `yyscan_t` variable, which Flex scanner. We declare a `yyscan_t` variable, which
will contain all of Flex's state. Then, we initialize will contain all of Flex's state. Then, we initialize
it using `yylex_init`. Finally, since we can no longer it using `yylex_init`. Finally, since we can no longer
@ -279,24 +269,65 @@ We'll come back to how this works in a moment. With
the scanner and parser initialized, we invoke `parser::operator()`, the scanner and parser initialized, we invoke `parser::operator()`,
which actually runs the Flex- and Bison-generated code. which actually runs the Flex- and Bison-generated code.
To clean up, we run `yylex_destroy` and `fclose`. Finally, To clean up, we run `yylex_destroy` and `fclose`. Finally,
we extract the contents of our file into the `file_contents` we call `file_mgr::finalize`, and return. But what
string, and return. _is_ `file_mgr`?
Next, the `write` method. For the most part, this method The `file_mgr` class does two things: it stores the part of the file
is a proxy for the `write` method of our `string_stream`: that has already been read by Flex in memory, and it keeps track of
where each line in our source file begins within the text. Here is its
definition:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}} {{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
In this class, the `string_stream` member is used to construct
an `std::string` from the bits of text that Flex reads,
processes, and feeds to the `file_mgr` using the `write` method.
It's more efficient to use a string stream than to concatenate
strings repeatedly. Once Flex is finished processing the file,
the final contents of the `string_stream` are transferred into
the `file_contents` string using the `finalize` method. The `offset`
and `line_offsets` fields will be used as we described earlier: each time Flex
encounters the `\n` character, the `offset` variable will be pushed
on top of the `line_offsets` vector, marking the beginning of
the corresponding line. The methods of the class are as follows:
* `write` will be called from Flex, and will allow us to
record the content of the file we're processing to the `string_stream`.
We've already seen it used in the `YY_USER_ACTION` macro.
* `mark_line` will also be called from Flex, and will mark the current
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
* `finalize` will be called by the `parse_driver` when the parsing
finishes. At this time, the `string_stream` should contain all of
the input file, and this data is transferred to `file_contents`, as
we mentioned above.
* `get_index` and `get_line_end` will be used for converting
`yy::location` instances to offsets within the source code buffer.
* `print_location` will be used for printing errors.
It will print the lines spanned by the given location, with the
location itself colored and underlined if the last argument is `true`.
This will make our errors easier on the eyes.
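Before looking at the real implementations, it may help to see the bookkeeping in isolation. The following is a rough sketch under my own assumptions, not the code from the post: the class name `simple_file_mgr`, the `get_contents` accessor, and the choice to seed `line_offsets` with `0` for the first line are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified file manager (names are assumptions).
class simple_file_mgr {
    std::ostringstream stream;                 // text received so far
    std::string contents;                      // filled in by finalize()
    std::size_t offset = 0;                    // characters written so far
    std::vector<std::size_t> line_offsets{0};  // line 1 starts at offset 0

public:
    // Record a chunk of text and keep the running offset up to date.
    void write(const char* buffer, std::size_t len) {
        stream.write(buffer, static_cast<std::streamsize>(len));
        offset += len;
    }

    // Called right after a newline: the next line begins at `offset`.
    void mark_line() { line_offsets.push_back(offset); }

    // Move the accumulated text into a plain string.
    void finalize() { contents = stream.str(); }

    // Convert a 1-based (line, column) pair to an index into the text.
    std::size_t get_index(int line, int column) const {
        assert(line > 0 && column > 0);
        assert(static_cast<std::size_t>(line) <= line_offsets.size());
        return line_offsets[static_cast<std::size_t>(line) - 1]
            + static_cast<std::size_t>(column) - 1;
    }

    // Exclusive end of a line: the start of the next line, or the
    // end of the text when asked about the last line.
    std::size_t get_line_end(int line) const {
        assert(line > 0);
        assert(static_cast<std::size_t>(line) <= line_offsets.size());
        if (static_cast<std::size_t>(line) == line_offsets.size()) {
            return contents.size();
        }
        return line_offsets[static_cast<std::size_t>(line)];
    }

    const std::string& get_contents() const { return contents; }
};
```

With the text `"ab\ncde"`, line 2 starts at index 3, so `get_index(2, 1)` is `3`, and `get_line_end(2)` falls back to the size of the text because there is no third line to point at.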
Let's take a look at their implementations. First, `write`.
For the most part, this method is a proxy for the `write`
method of our `string_stream`:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}
We do, however, also keep track of the `file_offset` variable We do, however, also keep track of the `file_offset` variable
here, which ensures we have up-to-date information here, which ensures we have up-to-date information
regarding our position in the source file. The implementation regarding our position in the source file. The implementation
of `mark_line` uses this information: of `mark_line` uses this information:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}} {{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}
The `finalize` method is trivial, and requires little additional
discussion:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}
Once we have the line offsets, `get_index` becomes very simple: Once we have the line offsets, `get_index` becomes very simple:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}} {{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}
Here, we use an assertion for the first time. Calling Here, we use an assertion for the first time. Calling
`get_index` with a negative or zero line doesn't make `get_index` with a negative or zero line doesn't make
@ -313,7 +344,7 @@ beginning of the next line. We stick to the C convention
of marking 'end' indices exclusive (pointing just past of marking 'end' indices exclusive (pointing just past
the end of the array): the end of the array):
{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}} {{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}
Since `line_offsets` has as many elements as there are lines, Since `line_offsets` has as many elements as there are lines,
the last line number would be equal to the vector's size. the last line number would be equal to the vector's size.
@ -333,7 +364,7 @@ we sprinkle the ANSI escape codes to enable and disable
special formatting, respectively. For now, the special special formatting, respectively. For now, the special
formatting involves underlining the text and making it red. formatting involves underlining the text and making it red.
{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}} {{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}
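If the escape codes are unfamiliar, here is a small standalone sketch of the same idea. The function `render_span` and its parameters are hypothetical and only illustrate the technique: print the text before the location, switch on underline and red, print the location, reset, and print the rest.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns text[line_start, line_end) with the
// sub-range [begin, end) wrapped in ANSI escape codes when requested.
std::string render_span(const std::string& text,
        std::size_t line_start, std::size_t begin,
        std::size_t end, std::size_t line_end, bool highlight) {
    assert(line_start <= begin && begin <= end);
    assert(end <= line_end && line_end <= text.size());

    std::ostringstream out;
    out << text.substr(line_start, begin - line_start);
    if (highlight) out << "\033[4;31m";  // 4 = underline, 31 = red
    out << text.substr(begin, end - begin);
    if (highlight) out << "\033[0m";     // 0 = reset all attributes
    out << text.substr(end, line_end - end);
    return out.str();
}
```

Passing `false` for the last argument yields the plain text, which is useful when the output is not going to a terminal.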
Finally, to get the forward declarations for the `yy*` functions Finally, to get the forward declarations for the `yy*` functions
and types, we set the `header-file` option in Flex: and types, we set the `header-file` option in Flex:
@ -386,12 +417,7 @@ the state and the parse driver, we have to define the
this forward declaration will be used by both Flex this forward declaration will be used by both Flex
and Bison: and Bison:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}} {{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}
Finally, we can change our `main.cpp` file to use the
`parse_driver`:
{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
#### Improving Exceptions #### Improving Exceptions
Now, it's time to add location data (and a little bit more) to our Now, it's time to add location data (and a little bit more) to our
@ -421,7 +447,7 @@ the following two lines to our CMakeLists.txt:
Now, let's add a new base class for all of our compiler errors, Now, let's add a new base class for all of our compiler errors,
unsurprisingly called `compiler_error`: unsurprisingly called `compiler_error`:
{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}} {{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}
We'll put some 'common' exception functionality We'll put some 'common' exception functionality
into the `print_location` and `print_about` methods. If the error into the `print_location` and `print_about` methods. If the error
@ -467,11 +493,7 @@ first, and is treat like the "correct" type. The
`right` type, on the other hand, is treated `right` type, on the other hand, is treated
like the "wrong" type that should have been like the "wrong" type that should have been
unifiable with `left`. This will affect the unifiable with `left`. This will affect the
calling conventions of our unification code. In calling conventions of our unification code.
`main`, we remove all our old exception printing code
in favor of calls to `pretty_print`:
{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
Now, we can go through and find all the places where Now, we can go through and find all the places where
we `throw 0`. One such place was in the data type we `throw 0`. One such place was in the data type
@ -513,7 +535,7 @@ In general, this change is also rather mechanical, but, to
maintain a balance between exceptions and assertions, here maintain a balance between exceptions and assertions, here
are a couple more assertions from `type_env`: are a couple more assertions from `type_env`:
{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}} {{< codelines "C++" "compiler/13/type_env.cpp" 76 77 >}}
Once again, it should not be possible for the compiler Once again, it should not be possible for the compiler
to try to generalize the type of a variable that doesn't to try to generalize the type of a variable that doesn't
@ -528,35 +550,34 @@ To fix this, we add a new `loc` parameter to `unify`,
which we make optional to allow for unification without which we make optional to allow for unification without
a known location. Here's the declaration: a known location. Here's the declaration:
{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}} {{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}
The change to the implementation is mechanical and repetitive, The change to the implementation is mechanical and repetitive,
so instead of showing you the whole method, I'll settle for so instead of showing you the whole method, I'll settle for
a couple of lines: a couple of lines:
{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}} {{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}
We want to make sure that a location provided to the We want to make sure that a location provided to the
top-level call to `unify` is also forwarded to the top-level call to `unify` is also forwarded to the
recursive calls, so we have to explicitly add it recursive calls, so we have to explicitly add it
to the call. to the call.
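To illustrate the forwarding, here is a toy version of the pattern. Everything in it is an assumption made for the sake of a small example: `source_loc` stands in for `yy::location`, `toy_type` only has base types and function types (no type variables), and `toy_unify` is not the compiler's `unify`.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>

struct source_loc { int line; int column; };  // stand-in for yy::location

struct toy_type;
using toy_type_ptr = std::shared_ptr<toy_type>;

// Either a named base type (left and right are null) or a function type.
struct toy_type {
    std::string name;
    toy_type_ptr left;
    toy_type_ptr right;
};

struct toy_unification_error : std::runtime_error {
    std::optional<source_loc> loc;
    explicit toy_unification_error(std::optional<source_loc> l)
        : std::runtime_error("failed to unify types"), loc(l) {}
};

toy_type_ptr base(const std::string& n) {
    return std::make_shared<toy_type>(toy_type{n, nullptr, nullptr});
}

toy_type_ptr arrow(toy_type_ptr l, toy_type_ptr r) {
    return std::make_shared<toy_type>(toy_type{"->", l, r});
}

// The location is optional, so callers without one can omit it.
void toy_unify(const toy_type_ptr& l, const toy_type_ptr& r,
        const std::optional<source_loc>& loc = std::nullopt) {
    bool l_arrow = l->left != nullptr;
    bool r_arrow = r->left != nullptr;
    if (l_arrow && r_arrow) {
        // Forward the location explicitly; otherwise the recursive
        // calls would silently fall back to the default (no location).
        toy_unify(l->left, r->left, loc);
        toy_unify(l->right, r->right, loc);
        return;
    }
    if (!l_arrow && !r_arrow && l->name == r->name) return;
    throw toy_unification_error(loc);
}
```

The mismatch in this sketch can be buried arbitrarily deep inside a function type, and the error still carries the location given to the outermost call.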
With all of that done, we can finally stand back and We'll also have to update the 'main' code to call the
marvel at the results of our hard work. Here is what a `pretty_print` methods, but there's another big change
basic unification error looks like now: that we're going to make before then. However, once that
change is made, our errors will look a lot better.
{{< figure src="unification_error.png" caption="The result of a unification error." >}} Here is what's printed out to the user when a type error
occurs:
I used an image to show colors, but here is the content of the error in textual form:
``` ```
an error occured while checking the types of the program: failed to unify types an error occured while checking the types of the program: failed to unify types
occuring on line 2: occuring on line 2:
3 + False 3 + False
the expected type was: the expected type was:
!Int Int
while the actual type was: while the actual type was:
!Bool Bool
``` ```
The exclamation marks in front of the two types are due to some The exclamation marks in front of the two types are due to some
@ -572,3 +593,275 @@ data Pair a a = { MkPair a a }
Now, not only have we eliminated the lazy uses of `throw 0` in our Now, not only have we eliminated the lazy uses of `throw 0` in our
code, but we've also improved the presentation of the errors code, but we've also improved the presentation of the errors
to the user! to the user!
### Rethinking Name Mangling
In the previous post, I said the following:
> One more thing. Lets adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a gotcha for anyone working on the compiler, we will always use the mangled names during compilation.
Now that I've had some more time to think about it
(and now that I've returned to the compiler after
a brief hiatus), I think that this was not the right call.
Mangled names make sense when translating to LLVM; we certainly
don't want to declare two LLVM functions with the same name.
But things are different for local variables. Our local variables
are graphs on a stack, and are not actually compiled to LLVM
definitions. It doesn't make sense to mangle their names, since
their names aren't present anywhere in the final executable.
It's not even "consistent" to mangle them, since global definitions
are compiled directly to __PushGlobal__ instructions, while local
variables are only referenced through the current `env`.
So, I decided to reverse my decision. We will go back to
placing variable names directly onto `env_var`. Here's
an example of this from `global_scope.cpp`:
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
Now that we've started using assertions, I also think it's worth
putting our new invariant -- "only global definitions have mangled
names" -- into code:
{{< codelines "C++" "compiler/13/type_env.cpp" 35 43 >}}
Furthermore, we'll _require_ that a global definition
has a mangled name. This way, we can be more confident
that a variable from a __PushGlobal__ instruction
is referencing the right function. To achieve
this, we change `get_mangled_name` to stop
returning the input string if a mangled name was not
found; now that we _must_ have a mangled name, doing
so is effectively obscuring the error. Instead,
we add another assertion: if an environment scope doesn't
contain a mangled name for a variable, then it _must_
have a parent. We end up with the following:
{{< codelines "C++" "compiler/13/type_env.cpp" 45 51 >}}
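For readers who don't have the source open, the shape of the lookup can be sketched like this. The `toy_env` class below is a hypothetical simplification (it tracks nothing but visibility and mangled names), so treat it as an illustration of the two assertions rather than the real `type_env`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

enum class toy_visibility { global, local };

// Hypothetical, stripped-down environment with a parent scope.
class toy_env {
    struct entry {
        toy_visibility vis;
        std::optional<std::string> mangled_name;
    };
    const toy_env* parent;
    std::map<std::string, entry> names;

public:
    explicit toy_env(const toy_env* p = nullptr) : parent(p) {}

    void bind(const std::string& name, toy_visibility vis) {
        names.insert_or_assign(name, entry{vis, std::nullopt});
    }

    // Invariant: only global definitions may have mangled names.
    void set_mangled_name(const std::string& name, const std::string& mangled) {
        auto it = names.find(name);
        assert(it != names.end());
        assert(it->second.vis == toy_visibility::global);
        it->second.mangled_name = mangled;
    }

    bool is_global(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end()) return it->second.vis == toy_visibility::global;
        return parent != nullptr && parent->is_global(name);
    }

    // No fallback to the input string: if this scope has no mangled
    // name for the variable, a parent scope must exist to ask.
    const std::string& get_mangled_name(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end() && it->second.mangled_name) {
            return *it->second.mangled_name;
        }
        assert(parent != nullptr);
        return parent->get_mangled_name(name);
    }
};
```

Asking this sketch for the mangled name of a local variable walks up to the root scope and then trips the `parent != nullptr` assertion, which is the loud failure we want instead of a silently wrong name.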
Since looking up a mangled name for a non-global variable
will now result in an assertion failure, we have to change
`ast_lid::compile` to only call `get_mangled_name` once
it ensures that the variable being compiled is, in fact,
global:
{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}
Since all global functions now need to have mangled
names, we run into a bit of a problem. What are
the mangled names of `(+)`, `(-)`, and so on? We could
continue to hardcode them as `plus`, `minus`, etc., but this can
(and currently does!) lead to errors. Consider the following
piece of code:
```
defn plus x y = { x + y }
defn main = { plus 320 6 }
```
We've hardcoded the mangled name of `(+)` to be `plus`. However,
`global_scope` doesn't know about this, so when the actual
`plus` function gets translated, it also gets assigned the
mangled name `plus`. The name is also overwritten in the
`llvm_context`, which effectively means that `(+)` is
now compiled to a call of the user-defined `plus` function.
If we didn't overwrite the name, we would've run into an assertion
failure in this scenario anyway. In short, this example illustrates
an important point: mangling information needs to be available
outside of a `global_scope`. We don't want to do this by having
every function take in a `global_scope` to access the mangling
information; instead, we'll store the mangling information in
a new `mangler` class, which `global_scope` will take as an argument.
The new class is very simple:
{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}
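As a rough illustration of why a shared mangler fixes the `plus` collision, here is a sketch of a counter-based one. The class name `toy_mangler` and the `name_N` suffix scheme are assumptions of mine; the post's `new_mangled_name` may number things differently.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical mangler: the first request for a name gets it as-is,
// later requests get a numeric suffix so no two results collide.
class toy_mangler {
    std::map<std::string, int> occurrence_count;

public:
    std::string new_mangled_name(const std::string& str) {
        int occurrence = occurrence_count[str]++;  // 0 on first use
        if (occurrence == 0) return str;
        return str + "_" + std::to_string(occurrence);
    }
};
```

If the built-in `(+)` asks for `plus` first, a later user-defined `plus` receives a different name from the same mangler (`plus_1` in this sketch), so neither definition overwrites the other.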
As with `parse_driver`, `global_scope` takes `mangler` by reference
and stores a pointer:
{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}
The implementation of `new_mangled_name` doesn't change, so I'm
not going to show it here. With this new mangling information
in hand, we can now correctly set the mangled names of binary
operators:
{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}
Wait a moment, what's a `compiler`? Let's talk about that next.
### A Top-Level Class
Now that we've moved name mangling out of `global_scope`, we have
to put it somewhere. The same goes for the global definition group
and the file manager that are given to `parse_driver`. The two
classes _make use_ of the other data, but they don't _own it_.
That's why they take it by reference, and store it as a pointer.
They're just temporarily allowed access.
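The "take by reference, store as a pointer" idiom looks like this in miniature. The names `toy_definitions` and `toy_driver` are hypothetical; the point is only that the driver writes into data that someone else owns.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct toy_definitions {
    std::vector<std::string> names;
};

class toy_driver {
    toy_definitions* defs;  // non-owning: never deleted by the driver

public:
    // Taking a reference means callers cannot pass a null pointer.
    explicit toy_driver(toy_definitions& d) : defs(&d) {}

    void record(const std::string& name) { defs->names.push_back(name); }
};
```

Storing a pointer instead of a reference member keeps the class copy-assignable, while the reference in the constructor documents that the object must exist. The owner still has to outlive the driver, which is one reason to put both in a single enclosing object.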
So, what should be the owner of all of these disparate components?
Thus far, that has been the `main` function, or the utility
functions that it calls out to. However, this is in bad taste:
we have related data and operations on it, but we don't group
them into an object. We can group all of the components of our
compiler into a `compiler` object, and leave `main.cpp` with
exception printing code.
The definition of the `compiler` class begins with all of the data
structures that we use in the process of compilation:
{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}
There's a loose ordering to these fields. In C++, class members are
initialized in the order they are declared; we therefore want to make
sure that fields that are depended on by other fields are initialized first.
Otherwise, I tried to keep the order consistent with the conceptual path
of the code through the compiler.
* Parsing happens first, so we begin with `parse_driver`, which needs a
`file_mgr` (to populate with line information) and a `definition_group`
(to receive the global definitions from the parser).
* We then proceed to typechecking, for which we use a global `type_env_ptr`
(to define the built-in functions and constructors) and a `type_mgr` (to
manage the assignments of type variables).
* Once a program is typechecked, we transform it, eliminating local
function definitions and lambda functions. This is done by storing
newly-emitted global functions into the `global_scope`, which requires a
`mangler` to generate new names for the target functions.
* Finally, to generate LLVM IR, we need our `llvm_context` class.
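The point about initialization order is easy to check with a small experiment. In the sketch below, all the `toy_*` names are made up; the log simply records which constructor ran first.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which constructors run.
struct toy_log {
    std::vector<std::string> events;
};

struct toy_file_mgr {
    explicit toy_file_mgr(toy_log& log) { log.events.push_back("file_mgr"); }
};

struct toy_parse_driver {
    toy_file_mgr* file_m;  // non-owning pointer to the compiler's member
    toy_parse_driver(toy_log& log, toy_file_mgr& mgr) : file_m(&mgr) {
        log.events.push_back("parse_driver");
    }
};

class toy_compiler {
    // Members are constructed in declaration order, so the file manager
    // exists before the driver that points at it.
    toy_file_mgr file_m;
    toy_parse_driver driver;

public:
    explicit toy_compiler(toy_log& log)
        : file_m(log), driver(log, file_m) {}

    const toy_parse_driver& get_driver() const { return driver; }
};
```

Swapping the two member declarations would make `driver` run its constructor against a `file_m` that has not been constructed yet, regardless of how the initializer list is written; most compilers warn about a mismatched initializer list for this reason.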
The methods of the compiler are arranged similarly:
{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}
The methods go as follows:
* `add_default_types` adds the built-in types to the `global_env`.
At this point in the post, these types only include `Int`. However,
in the second section, we'll make `Bool` a built-in type, too.
* `add_binop_type` adds a single binary operator to the global
type environment. We saw its implementation earlier: it deals
with both binding a type, and setting a mangled name.
* `add_default_types` adds the types for each binary operator,
and also for the `True` and `False` constructors (which we will
cover in the second section).
* `parse`, `typecheck`, `translate` and `compile` all do exactly
what they say. In this case, compilation refers to creating G-machine
instructions.
* `create_llvm_binop` creates an internal function that forces the
evaluation of its two arguments, and actually applies the given binary
operator. Recall that the `(+)` in user code constructs a call to this
function, but leaves it unevaluated until it's needed.
* `generate_llvm` converts all the definitions in `global_scope`, which
are at this point compiled into G-machine `instruction`s, into LLVM IR.
* `output_llvm` contains all the code to actually generate an object
file from the LLVM IR.
These functions are mostly taken from part 12's `main.cpp`, and adjusted
to use the `compiler`'s members rather than local definitions or arguments.
You should compare part 12's
[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
file with the
[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
file that we end up with at the end of this post.
Next, we have the compiler's constructor, and its `operator()`. The
latter, analogously to our parse driver, will trigger the compilation
process. Their implementations are straightforward:
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
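
If you want the general shape without the details, here is a rough sketch of a class whose `operator()` runs its stages in a fixed order. It is not the real `compiler`: the class name `mini_compiler`, the `stages_run` accessor, and the stage bodies (which only record that they ran) are assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the real compiler class: each stage only records
// that it ran, which makes the ordering enforced by operator() easy to see.
class mini_compiler {
    private:
        std::string file_path;
        std::vector<std::string> log;
        bool parsed = false;

        void parse() {
            parsed = true;
            log.push_back("parse");
        }

        void typecheck() {
            // An internal invariant, so an assertion rather than an exception.
            assert(parsed && "typecheck requires a parsed program");
            log.push_back("typecheck");
        }

        void translate() { log.push_back("translate"); }
        void compile() { log.push_back("compile"); }

    public:
        explicit mini_compiler(const std::string& path) : file_path(path) {}

        // Calling the object runs every stage, always in the same order.
        void operator()() {
            parse();
            typecheck();
            translate();
            compile();
        }

        const std::vector<std::string>& stages_run() const { return log; }
        const std::string& path() const { return file_path; }
};
```

Because the stages are private, outside code can't call `typecheck` before `parse`; the only entry point is the call operator.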
We also add a couple of methods to give external code access to
some of the compiler's data structures. I omit their (trivial)
implementations, but they have the following signatures:
{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}
With all the compilation code tucked into our new `compiler` class,
`main` becomes very simple. We also finally get to use our exception
pretty printing code:
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
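
The pattern at work is "catch the compiler's exception at the top level, print it nicely, and return a nonzero exit code". A stripped-down sketch of that pattern follows. The `sketch_error` type, its message format, and the `run_and_report` helper are all hypothetical; the real exceptions carry more context (such as source locations) than this one does.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical error type with a pretty_print method.
struct sketch_error : std::exception {
    std::string description;

    explicit sketch_error(std::string d) : description(std::move(d)) {}

    const char* what() const noexcept override { return description.c_str(); }

    void pretty_print(std::ostream& to) const {
        to << "an error occurred: " << description << '\n';
    }
};

// Runs the given work and turns a thrown sketch_error into a printed
// message plus a nonzero exit code, the way a small `main` might.
int run_and_report(const std::function<void()>& work, std::ostream& to) {
    assert(work && "there must be something to run");
    try {
        work();
    } catch (const sketch_error& err) {
        err.pretty_print(to);
        return 1;
    }
    return 0;
}

// Convenience wrapper that captures the report as a string.
std::string report_to_string(const std::function<void()>& work) {
    std::ostringstream out;
    run_and_report(work, out);
    return out.str();
}
```

Keeping the `try`/`catch` in one place means the rest of the code can throw freely without worrying about how the error is eventually displayed.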
That's all for the cleanup! We've added locations and more errors
to the compiler, stopped throwing `0` in favor of proper exceptions
or assertions, made name mangling more reasonable, fixed a bug with
accidentally shadowing default functions, and organized our compilation
process into a `compiler` class.
### Keeping Things Private
Hand-writing or generating hundreds of trivial getters and setters
for the fields of a data class (which is standard in the world of Java) seems
absurd to me. So, for most of this project, I stuck with
`struct`s, rather than classes. But this is not a good policy
to apply _everywhere_. I still think it makes sense to make
data structures like `ast` and `type` public-by-default;
however, I _don't_ think that way about classes like `type_mgr`,
`llvm_context`, `type_env`, and `env`. All of these hold state
that outside code should never access directly, and some already
guard that state with assertions. In short, it should be protected.
For most classes, the changes are mechanical. For instance, we
can make `type_env` a class simply by changing its declaration,
and marking all of its functions public. This requires a slight
refactoring of a line that used its `parent` field. Here's
what it used to be (in context):
{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}
And here's what it is now:
{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}
We always declare the `definition_defn` function in
the `global_env`. Thus, that's the only environment
we need to know about to update the mangled name.
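
To illustrate why hiding `parent` costs us so little, here is a simplified sketch of an environment whose parent pointer is private. It is an assumption-laden toy: `env_sketch` maps names to mangled names only, and its method names are chosen for this example rather than copied from `type_env`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified environment: it tracks mangled names only,
// while the real type_env stores considerably more per name.
class env_sketch {
    private:
        env_sketch* parent;
        std::map<std::string, std::string> mangled_names;

    public:
        explicit env_sketch(env_sketch* p = nullptr) : parent(p) {}

        // A freshly bound name starts out unmangled.
        void bind(const std::string& name) { mangled_names[name] = name; }

        void set_mangled_name(const std::string& name, const std::string& mangled) {
            auto it = mangled_names.find(name);
            // Only names bound in this very environment may be renamed.
            assert(it != mangled_names.end() && "name is not bound here");
            it->second = mangled;
        }

        // Lookup walks up through the enclosing environments, so callers
        // never need to touch `parent` themselves.
        std::string get_mangled_name(const std::string& name) const {
            auto it = mangled_names.find(name);
            if (it != mangled_names.end()) return it->second;
            if (parent) return parent->get_mangled_name(name);
            return name;
        }
};
```

Code that knows a name lives in the global environment can update it there directly, and every nested environment sees the new mangled name through lookup.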
The deal with `env` is about as simple. We just turn
it and its two descendants into classes, and mark their
methods and constructors public. The same
goes for `global_scope`. To make `type_mgr`
a class, we have to add a new method: `lookup`.
Here's its implementation:
{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}
It's used in `type_var::print` as follows:
{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}
We can't use `resolve` here because it takes (and returns)
a `type_ptr`. If we make it _take_ a `type*`, it won't
be able to return its argument if it's already resolved. If we
allow it to _return_ `type*`, we won't have an owning
reference. We also don't want to duplicate the
method just for this one call. Notice, though, how similar
`type_var::print`/`lookup` and `resolve` are in terms of execution.
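
The difference between the two is easier to see side by side. In the sketch below, `lookup` takes a single step from a variable name to whatever it is bound to, while `resolve` follows bindings repeatedly and needs an owning pointer so that it can return its argument unchanged. The types here (`type_sketch`, `type_mgr_sketch`) and the `show` function are simplified stand-ins that I made up for this example, not the compiler's actual definitions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature type: either a named base type or a type variable.
struct type_sketch {
    std::string name;
    bool is_var;
};

using type_sketch_ptr = std::shared_ptr<type_sketch>;

class type_mgr_sketch {
    private:
        // Substitutions: type variable name -> the type it is bound to.
        std::map<std::string, type_sketch_ptr> types;

    public:
        void bind(const std::string& var, type_sketch_ptr t) {
            assert(t && "cannot bind a variable to a null type");
            types[var] = std::move(t);
        }

        // One step: what is this variable bound to, if anything?
        type_sketch_ptr lookup(const std::string& var) const {
            auto it = types.find(var);
            if (it != types.end()) return it->second;
            return nullptr;
        }

        // Many steps: follow bindings until reaching a non-variable or an
        // unbound variable. Taking an owning pointer lets it hand back its
        // argument when there is nothing to follow.
        type_sketch_ptr resolve(type_sketch_ptr t) const {
            assert(t && "cannot resolve a null type");
            while (t->is_var) {
                type_sketch_ptr next = lookup(t->name);
                if (!next) break;
                t = next;
            }
            return t;
        }
};

// Printing starts from a plain reference, not an owning pointer, so it
// recurses through lookup instead of calling resolve.
std::string show(const type_mgr_sketch& mgr, const type_sketch& t) {
    if (!t.is_var) return t.name;
    type_sketch_ptr bound = mgr.lookup(t.name);
    return bound ? show(mgr, *bound) : t.name;
}
```

Both `resolve` and `show` chase the same chain of bindings; they differ only in what they start from and what they hand back.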
The change for `llvm_context` requires a little more work.
Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
Since we don't want to forward each of the LLVM builder methods,
and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
we'll just provide a getter for the `builder` field. The
same goes for `module`:
{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}
Here's what some of the code from `instruction.cpp` looks like now:
{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}
Right now, the `ctx` field of the `llvm_context` (which contains
the `llvm::LLVMContext`) is only externally used to create
instances of `llvm::BasicBlock`. We'll add a proxy method
for this functionality:
{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}
Finally, `instruction_pushglobal` needs to access the
`llvm::Function` instances that we create in the process
of compilation. We add a new `get_custom_function` method
to support this, which automatically prefixes the function
name with `f_`, much like `create_custom_function`:
{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}
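
Putting the `llvm_context` changes together, the class ends up with roughly the following shape. This sketch deliberately avoids LLVM so that it compiles on its own: `builder_sketch` and `function_sketch` are stand-ins for `llvm::IRBuilder<>` and `llvm::Function` and do not mirror the real LLVM API, and the getter name `get_builder` is my assumption for this example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-ins for the LLVM builder and function types; purely illustrative.
struct builder_sketch {
    int instructions_emitted = 0;
};

struct function_sketch {
    std::string name;
};

class context_sketch {
    private:
        builder_sketch builder;
        std::map<std::string, std::unique_ptr<function_sketch>> custom_functions;

    public:
        // A single getter instead of forwarding every builder method.
        builder_sketch& get_builder() { return builder; }

        function_sketch* create_custom_function(const std::string& name) {
            std::unique_ptr<function_sketch>& slot = custom_functions["f_" + name];
            assert(!slot && "function created twice");
            slot.reset(new function_sketch{"f_" + name});
            return slot.get();
        }

        // Callers pass the unprefixed name; the prefix stays an internal detail.
        function_sketch* get_custom_function(const std::string& name) const {
            auto it = custom_functions.find("f_" + name);
            if (it == custom_functions.end()) return nullptr;
            return it->second.get();
        }
};
```

Since both methods add the prefix themselves, no caller ever has to remember that `f_` exists.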
I think that's enough. If we had chosen to turn more compiler
data structures into classes, I think we would've quickly drowned
in one-line getter and setter methods.