Update blog post, switching away from two sections.

2020-09-17 22:35:40 -07:00 · 2020-09-17 22:35:40 -07:00 · 98cac103c4
commit 98cac103c4
parent 7226d66f67
1 changed files with 396 additions and 103 deletions
--- a/content/blog/13_compiler_cleanup_optimization/index.md
+++ b/content/blog/13_compiler_cleanup_optimization/index.md
@ -8,28 +8,26 @@ description: "In this post, we clean up our compiler and add some basic optimiza
 In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
 and lambda expressions to our compiler. At the end of that post, I mentioned
 that before we move on to bigger and better things, I wanted to take a 
-step back and clean up the compiler.
+step back and clean up the compiler. Now is the time to do that.

-Recently, I got around to doing that. Unfortunately, I also got around to doing
-a lot more. Furthermore, I managed to make the changes in such a way that I
-can't cleanly separate the 'cleanup' and 'optimization' portions of my work.
-This is partially due to the way in which I organize code, where each post
-is associated with a version of the compiler with the necessary changes.
-Because of all this, instead of making this post about the cleanup, and the
-next post about the optimizations, I have to merge them into one.
+In particular, I identified four things that could be improved
+or cleaned up:

-So, this post is split into two major portions: cleanup, which deals mostly
-with touching up exceptions and improving the 'name mangling' logic, and
-optimizations, which deals with adding special treatment to booleans,
-unboxing integers, and implementing more binary operators.
+* __Error handling__. We need to stop using `throw 0` and start
+using `assert`. We can also make our errors much more descriptive
+by including source locations in the output.
+* __Name mangling__. I don't think I got it quite right last
+time. Now is the time to clean it up.
+* __Code organization__. I think we can benefit from a top-level
+class, and a clearer "dependency order" between the various
+classes and structures we've defined.
+* __Code style__. In particular, I've been lazily using `struct`
+in a lot of places. That's not a good idea; it's better
+to use `class`, and only expose _some_ fields and methods
+to the rest of the code.

-### Section 1: Cleanup
-
-The previous post was
-{{< sidenote "right" "long-note" "rather long," >}}
-Probably not as long as this one, though! I really need to get the
-size of my posts under control.
-{{< /sidenote >}} which led me to omit
+### Error Reporting and Handling
+The previous post was rather long, which led me to omit
 a rather important aspect of the compiler: proper error reporting.
 Once again our compiler has instances of `throw 0`, which is a cheap way
 of avoiding properly handling a runtime error. Before we move on,
@ -62,7 +60,7 @@ automatically assemble the "from" and "to" locations of a nonterminal
 from the locations of children, which would be very tedious to write
 by hand. We enable this feature using the following option:

-{{< codelines "C++" "compiler/13/parser.y" 50 50 >}}
+{{< codelines "C++" "compiler/13/parser.y" 46 46 >}}

 There's just one hitch, though. Sure, Bison can compute bigger
 locations from smaller ones, but it must get the smaller ones
@ -97,25 +95,27 @@ to `columns` and `step` to every rule, we can define the
 `YY_USER_ACTION` macro, which is run before each token
 is processed.

-{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}}
+{{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}

-We'll see why we are using `drv` soon; for now, you can treat
-`location` as if it were a global variable declared in the
-tokenizer. Before processing each token, we ensure that
-`location` has its `begin` and `end` at the same position,
+We'll see why we are using `LOC` instead of something like `location` soon;
+for now, you can treat `LOC` as if it were a global variable declared 
+in the tokenizer. Before processing each token, we ensure that
+the `yy::location` has its `begin` and `end` at the same position,
 and then advance `end` by `yyleng` columns. This is sufficient
-to make `location` represent our token's source position.
+to make `LOC` represent our token's source position. For
+the moment, don't worry too much about `drv`; this is the
+parse driver, and we will talk about it shortly.
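
To make the `begin`/`end` bookkeeping concrete, here is a minimal sketch of what such a per-token hook does. The `position` and `location` structs are hypothetical stand-ins modelled on `yy::position` and `yy::location` (only `step`, `columns`, and `lines` are mirrored), and `before_token` is an illustrative name rather than anything from the post's scanner.

```cpp
#include <cassert>

// Hypothetical stand-ins for yy::position and yy::location.
struct position {
    int line = 1;
    int column = 1;
};

struct location {
    position begin;
    position end;

    // Collapse the span: the next token starts where the last one ended.
    void step() { begin = end; }
    // Extend the span by `count` columns on the current line.
    void columns(int count) { end.column += count; }
    // Move the end of the span to the start of a later line.
    void lines(int count) { end.line += count; end.column = 1; }
};

// Roughly what a YY_USER_ACTION-style hook does before a token's action
// runs; `token_length` plays the role of yyleng.
inline void before_token(location& loc, int token_length) {
    loc.step();
    loc.columns(token_length);
}
```

After `before_token(loc, 4)` on a fresh `location`, the span covers columns 1 up to (but not including) 5, and the next call starts from column 5.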

-So now we have a "global" variable `location` that gives
+So now we have a "global" variable `LOC` that gives
 us the source position of the current token. To get it
 to Bison, we have to pass it as an argument to each
 of the `make_TOKEN` calls. Here are a few sample lines
 that should give you the general idea:

-{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}}
+{{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}

 That last line is actually new. Previously, we somehow
-got away without explicitly sending the EOF token to Bison.
+got away without explicitly sending the end-of-file token to Bison.
 I suspect that this was due to some kind of implicit conversion
 of the Flex macro `YY_NULL` into a token; now that we have
 to pass a position to every token constructor, such an implicit
@ -146,10 +146,10 @@ from `ast_binop`:
 Finally, we tell Bison to pass the computed location
 data as an argument when constructing our data structures.
 This too is a mechanical change, and I think the following
-couple of lines demonstrate the general idea in sufficient
+few lines demonstrate the general idea in sufficient
 detail:

-{{< codelines "C++" "compiler/13/parser.y" 107 110 >}}
+{{< codelines "C++" "compiler/13/parser.y" 92 96 >}}

 Here, the `@$` character is used to reference the current
 nonterminal's location data.
@ -189,7 +189,9 @@ working with files, why not just work directly with the files
 created by the user? Instead of reading from `stdin`, we may
 as well take in a path to a file via `argv`, and read from there.
 Also, instead of `fseek` and `rewind`, we can just read the file
-into memory, and access it like a normal character buffer.
+into memory, and access it like a normal character buffer. This
+does mean that we can stick with `stdin`, but it's more conventional
+to read source code from files, anyway.

 To address the second issue, we can keep a mapping of line numbers
 to their locations in the source buffer. This is rather easy to
@ -200,7 +202,7 @@ the current source location to the top, marking it as
 the beginning of another line. Where exactly we store this
 array is as yet unclear, since we're trying to avoid global variables.

-Finally, begin addressing the third issue, we can use Flex's `reentrant`
+Finally, to begin addressing the third issue, we can use Flex's `reentrant`
 option, which makes it so that all of the tokenizer's state is stored in an
 opaque `yyscan_t` structure, rather than in global variables. This way,
 we can configure `yyin` without setting a global variable, which is a step
@ -221,50 +223,38 @@ creation of a _parsing driver_.
 The parsing driver is a class (or struct) that holds all the parse-related
 state. We can arrange for this class to be available to our tokenizing
 and parsing functions, which will allow us to use it pretty much like we'd
-use a global variable. We can define it as follows:
+use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
+We can define it as follows:

-{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}}
+{{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}

-There are quite a few fields here. The `file_name` string represents
-the file that we'll be reading code from. the `string_stream` will
-be used to back up the contents of source file as Flex reads them;
-once Flex is done, the content of the `string_stream` will be
-saved into the `file_content` string.
+There aren't many fields here. The `file_name` string represents
+the file that we'll be reading code from. The `location` field
+will be accessed by Flex via `get_current_location`. Bison will
+store the function and data type definitions it reads into `global_defs`
+via `get_global_defs`. Finally, `file_m` will be used to keep track
+of the content of the file we're reading, as well as the line offsets
+within that file. Notice that a couple of these fields are pointers
+that we take by reference in the constructor. The `parse_driver` doesn't
+_own_ the global definitions, nor the file manager. They exist outside
+of it, and will continue to be used in other ways the `parse_driver`
+does not need to know about. Also, the `LOC` variable in Flex is
+actually a call to `get_current_location`:

-The next three fields deal with tracking source code
-locations. The `location` field will be accessed by Flex
-via `drv.location` (where `drv` is a reference to our driver class).
-The `file_offset` and `line_offsets` fields will be used to
-keep track of where each line begins, as we have discussed above.
-Finally, `global_defs` will be the new home of our top-level
-definitions.
+{{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}

-The methods on `parse_driver` are rather simple, too:
+The methods of `parse_driver` are rather simple. Most of
+them deal with giving access to the parser's members: the `yy::location`,
+the `definition_group`, and the `file_mgr`. The only exception
+to this is `operator()`, which we use to actually trigger the parsing process.
+We'll make this method return `true` if parsing succeeded, and `false`
+otherwise (if, say, the file we tried to read doesn't exist). 
+Here's its implementation:

-* `run_parse` handles the initialization of the tokenizer
-and parser, which includes obtaining the `FILE*` and configuring
-Flex to use it. It also handles invoking the parsing code.
-We'll make this method return `true` if parsing succeeded,
-and `false` otherwise (if, say, the file we tried to read doesn't exist).
-* `write` will be called from Flex, and will allow us to
-record the content of the file we're processing to the `string_stream`.
-We've already seen it used in the `YY_USER_ACTION` macro.
-* `mark_line` will also be called from Flex, and will mark the current
-`file_offset` as the beginning of a line by pushing it into `line_offsets`.
-* `get_index` and `get_line_end` will be used for converting
-`yy::location` instances to offsets within the source code buffer.
-* `print_location` will be used for printing errors.
-It will print the lines spanned by the given location, with the
-location itself colored and underlined if the last argument is `true`.
-This will make our errors easier on the eyes.
-
-Let's take a look at their implementations. First, `run_parse`:
-
-{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}

 We try open the user-specified file, and return `false` if we can't.
-We then initialize `line_offsets` as we discussed above. After
-this, we start doing the setup specific to a reentrant
+After this, we start doing the setup specific to a reentrant
 Flex scanner. We declare a `yyscan_t` variable, which
 will contain all of Flex's state. Then, we initialize
 it using `yylex_init`. Finally, since we can no longer
@ -279,24 +269,65 @@ We'll come back to how this works in a moment. With
 the scanner and parser initialized, we invoke `parser::operator()`,
 which actually runs the Flex- and Bison-generated code.
 To clean up, we run `yylex_destroy` and `fclose`. Finally,
-we extract the contents of our file into the `file_contents`
-string, and return.
+we call `file_mgr::finalize`, and return. But what
+_is_ `file_mgr`?

-Next, the `write` method. For the most part, this method
-is a proxy for the `write` method of our `string_stream`:
+The `file_mgr` class does two things: it stores the part of the file
+that has already been read by Flex in memory, and it keeps track of
+where each line in our source file begins within the text. Here is its
+definition:

-{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}}
+{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
+
+In this class, the `string_stream` member is used to construct
+an `std::string` from the bits of text that Flex reads,
+processes, and feeds to the `file_mgr` using the `write` method.
+It's more efficient to use a string stream than to concatenate
+strings repeatedly. Once Flex is finished processing the file,
+the final contents of the `string_stream` are transferred into
+the `file_contents` string using the `finalize` method. The `offset`
+and `line_offsets` fields will be used as we described earlier: each time Flex
+encounters the `\n` character, the `offset` variable will be pushed
+onto the top of the `line_offsets` vector, marking the beginning of
+the corresponding line. The methods of the class are as follows:
+
+* `write` will be called from Flex, and will allow us to
+record the content of the file we're processing to the `string_stream`.
+We've already seen it used in the `YY_USER_ACTION` macro.
+* `mark_line` will also be called from Flex, and will mark the current
+`file_offset` as the beginning of a line by pushing it into `line_offsets`.
+* `finalize` will be called by the `parse_driver` when the parsing
+finishes. At this time, the `string_stream` should contain all of
+the input file, and this data is transferred to `file_contents`, as
+we mentioned above.
+* `get_index` and `get_line_end` will be used for converting
+`yy::location` instances to offsets within the source code buffer.
+* `print_location` will be used for printing errors.
+It will print the lines spanned by the given location, with the
+location itself colored and underlined if the last argument is `true`.
+This will make our errors easier on the eyes.
+
+Let's take a look at their implementations. First, `write`.
+For the most part, this method is a proxy for the `write`
+method of our `string_stream`:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}

 We do, however, also keep track of the `file_offset` variable
 here, which ensures we have up-to-date information
 regarding our position in the source file. The implementation
 of `mark_line` uses this information:

-{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}}
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}
+
+The `finalize` method is trivial, and requires little additional
+discussion:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}

 Once we have the line offsets, `get_index` becomes very simple:

-{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}}
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}

 Here, we use an assertion for the first time. Calling
 `get_index` with a negative or zero line doesn't make
@ -313,7 +344,7 @@ beginning of the next line. We stick to the C convention
 of marking 'end' indices exclusive (pointing just past
 the end of the array):

-{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}}
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}

 Since `line_offsets` has as many elements as there are lines,
 the last line number would be equal to the vector's size.
@ -333,7 +364,7 @@ we sprinkle the ANSI escape codes to enable and disable
 special formatting, respectively. For now, the special
 formatting involves underlining the text and making it red.

-{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}}
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}
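
The pieces described above fit together roughly as in the following sketch. It is not the post's `file_mgr`: the class name `file_mgr_sketch`, the `span` struct (a stand-in for `yy::location`), and the choice to seed `line_offsets` with `0` for the first line are all assumptions made so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for yy::location: 1-based lines and columns,
// with an exclusive end position.
struct span {
    int begin_line, begin_column;
    int end_line, end_column;
};

class file_mgr_sketch {
    std::ostringstream string_stream;          // text seen so far
    std::string file_contents;                 // filled in by finalize()
    std::size_t file_offset = 0;               // characters written so far
    std::vector<std::size_t> line_offsets{0};  // assumption: line 1 starts at 0

    static void emit(std::ostream& out, const char* text, std::size_t len) {
        out.write(text, static_cast<std::streamsize>(len));
    }

public:
    // Record a chunk of text that the tokenizer just consumed.
    void write(const char* buffer, std::size_t len) {
        emit(string_stream, buffer, len);
        file_offset += len;
    }

    // Call after writing a '\n': the next line starts at the current offset.
    void mark_line() { line_offsets.push_back(file_offset); }

    // Move the accumulated text into a plain string.
    void finalize() { file_contents = string_stream.str(); }

    // Convert a 1-based (line, column) pair into a buffer offset.
    std::size_t get_index(int line, int column) const {
        assert(line > 0);
        assert(static_cast<std::size_t>(line) <= line_offsets.size());
        return line_offsets[line - 1] + column - 1;
    }

    // Exclusive end of a line: the start of the next one, or end of buffer.
    std::size_t get_line_end(int line) const {
        if (static_cast<std::size_t>(line) == line_offsets.size())
            return file_contents.size();
        return get_index(line + 1, 1);
    }

    // Print every line touched by `loc`, optionally highlighting the span.
    void print_location(std::ostream& out, const span& loc,
                        bool highlight = true) const {
        std::size_t print_start = get_index(loc.begin_line, 1);
        std::size_t mark_start = get_index(loc.begin_line, loc.begin_column);
        std::size_t mark_end = get_index(loc.end_line, loc.end_column);
        std::size_t print_end = get_line_end(loc.end_line);
        const char* text = file_contents.c_str();
        emit(out, text + print_start, mark_start - print_start);
        if (highlight) out << "\033[4;31m";  // ANSI: underline + red
        emit(out, text + mark_start, mark_end - mark_start);
        if (highlight) out << "\033[0m";     // ANSI: reset formatting
        emit(out, text + mark_end, print_end - mark_end);
    }
};
```

Passing `false` as the last argument to `print_location` leaves out the escape codes, which is handy when the output is not going to a terminal.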

 Finally, to get the forward declarations for the `yy*` functions
 and types, we set the `header-file` option in Flex:
@ -386,12 +417,7 @@ the state and the parse driver, we have to define the
 this forward declaration will be used by both Flex
 and Bison:

-{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}}
-
-Finally, we can change our `main.cpp` file to use the
-`parse_driver`:
-
-{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
+{{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}

 #### Improving Exceptions
 Now, it's time to add location data (and a little bit more) to our
@ -421,7 +447,7 @@ the following two lines to our CMakeLists.txt:
 Now, let's add a new base class for all of our compiler errors,
 unsurprisingly called `compiler_error`:

-{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}}
+{{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}

 We'll put some 'common' exception functionality
 into the `print_location` and `print_about` methods. If the error
@ -467,11 +493,7 @@ first, and is treat like the "correct" type. The
 `right` type, on the other hand, is treated
 like the "wrong" type that should have been
 unifiable with `left`. This will affect the
-calling conventions of our unification code. In
-`main`, we remove all our old exception printing code
-in favor of calls to `pretty_print`:
-
-{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
+calling conventions of our unification code.

 Now, we can go through and find all the places where
 we `throw 0`. One such place was in the data type
@ -513,7 +535,7 @@ In general, this change is also rather mechanical, but, to
 maintain a balance between exceptions and assertions, here
 are a couple more assertions from `type_env`:

-{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}}
+{{< codelines "C++" "compiler/13/type_env.cpp" 76 77 >}}

 Once again, it should not be possible for the compiler
 to try generalize the type of a variable that doesn't
@ -528,35 +550,34 @@ To fix this, we add a new `loc` parameter to `unify`,
 which we make optional to allow for unification without
 a known location. Here's the declaration:

-{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}}
+{{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}

 The change to the implementation is mechanical and repetitive,
 so instead of showing you the whole method, I'll settle for
 a couple of lines:

-{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}}
+{{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}

 We want to make sure that a location provided to the
 top-level call to `unify` is also forwarded to the
 recursive calls, so we have to explicitly add it
 to the call.
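
As a rough illustration of the optional, forwarded location, consider the sketch below. It only compares types structurally and has no type variables, so it is not real unification; `type_sketch`, `source_location`, and `unification_error_sketch` are hypothetical names, and `std::optional` requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct source_location { int line; };  // stand-in for yy::location

// Hypothetical error type that remembers where the mismatch was found.
struct unification_error_sketch {
    std::string left, right;
    std::optional<source_location> loc;
};

// Hypothetical type representation: a name applied to zero or more arguments.
struct type_sketch {
    std::string name;
    std::vector<type_sketch> args;
};

// The location defaults to "unknown", so callers without one can omit it.
void unify(const type_sketch& l, const type_sketch& r,
           const std::optional<source_location>& loc = std::nullopt) {
    if (l.name != r.name || l.args.size() != r.args.size())
        throw unification_error_sketch{l.name, r.name, loc};
    for (std::size_t i = 0; i < l.args.size(); i++)
        unify(l.args[i], r.args[i], loc);  // forward the location explicitly
}
```

If the recursive call dropped its third argument, a mismatch found deep inside a type would be reported without a location, even though the top-level caller supplied one.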

-With all of that done, we can finally stand back and
-marvel at the results of our hard work. Here is what a
-basic unification error looks like now:
-
-{{< figure src="unification_error.png" caption="The result of a unification error." >}}
-
-I used an image to show colors, but here is the content of the error in textual form:
+We'll also have to update the 'main' code to call the
+`pretty_print` methods, but there's another big change
+that we're going to make before then. However, once that
+change is made, our errors will look a lot better.
+Here is what's printed out to the user when a type error
+occurs:

 ```
 an error occured while checking the types of the program: failed to unify types
 occuring on line 2:
    3 + False
 the expected type was:
-  !Int
+  Int
 while the actual type was:
-  !Bool
+  Bool
 ```

 The exclamation marks in front of the two types are due to some
@ -572,3 +593,275 @@ data Pair a a = { MkPair a a }
 Now, not only have we eliminated the lazy uses of `throw 0` in our
 code, but we've also improved the presentation of the errors
 to the user!
+
+### Rethinking Name Mangling
+In the previous post, I said the following:
+
+> One more thing. Let’s adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a ‘gotcha’ for anyone working on the compiler, we will always use the mangled names during compilation.
+
+Now that I've had some more time to think about it
+(and now that I've returned to the compiler after
+a brief hiatus), I think that this was not the right call.
+Mangled names make sense when translating to LLVM; we certainly
+don't want to declare two LLVM functions with the same name.
+But things are different for local variables. Our local variables
+are graphs on a stack, and are not actually compiled to LLVM
+definitions. It doesn't make sense to mangle their names, since
+their names aren't present anywhere in the final executable.
+It's not even "consistent" to mangle them, since global definitions
+are compiled directly to __PushGlobal__ instructions, while local
+variables are only referenced through the current `env`.
+So, I decided to reverse my decision. We will go back to
+placing variable names directly onto `env_var`. Here's
+an example of this from `global_scope.cpp`:
+
+{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
+
+Now that we've started using assertions, I also think it's worth
+putting our new invariant -- "only global definitions have mangled
+names" -- into code:
+
+{{< codelines "C++" "compiler/13/type_env.cpp" 35 43 >}}
+
+Furthermore, we'll _require_ that a global definition
+has a mangled name. This way, we can be more confident
+that a variable from a __PushGlobal__ instruction
+is referencing the right function. To achieve
+this, we change `get_mangled_name` to stop
+returning the input string if a mangled name was not
+found; now that we _must_ have a mangled name, doing
+so is effectively obscuring the error. Instead,
+we add another assertion: if an environment scope doesn't
+contain a mangled name for a variable, then it _must_
+have a parent. We end up with the following:
+
+{{< codelines "C++" "compiler/13/type_env.cpp" 45 51 >}}
+
+Since looking up a mangled name for a non-global variable
+will now result in an assertion failure, we have to change
+`ast_lid::compile` to only call `get_mangled_name` once
+it ensures that the variable being compiled is, in fact,
+global:
+
+{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}
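
The invariants from this section can be sketched in isolation as follows. `scope_sketch` is a hypothetical, stripped-down environment that tracks only visibility and mangled names (the real `type_env` also tracks types), and the use of an empty string to mean "no mangled name yet" is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class visibility { global, local };

// Hypothetical, simplified scope; not the post's type_env.
class scope_sketch {
    struct variable {
        visibility vis;
        std::string mangled_name;  // empty until one is assigned
    };
    std::map<std::string, variable> names;
    const scope_sketch* parent;

public:
    explicit scope_sketch(const scope_sketch* p = nullptr) : parent(p) {}

    void bind(const std::string& name, visibility v) {
        names[name] = variable{v, ""};
    }

    bool is_global(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end()) return it->second.vis == visibility::global;
        return parent != nullptr && parent->is_global(name);
    }

    // Invariant: only global definitions may receive a mangled name.
    void set_mangled_name(const std::string& name, const std::string& mangled) {
        auto it = names.find(name);
        assert(it != names.end());
        assert(it->second.vis == visibility::global);
        it->second.mangled_name = mangled;
    }

    // No fallback to the input string: a missing name is a compiler bug.
    const std::string& get_mangled_name(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end()) {
            assert(!it->second.mangled_name.empty());
            return it->second.mangled_name;
        }
        assert(parent != nullptr);
        return parent->get_mangled_name(name);
    }
};
```

A caller in the position of `ast_lid::compile` would check `is_global(name)` first and only then ask for the mangled name, since asking for the mangled name of a local variable trips an assertion.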
+
+Since all global functions now need to have mangled
+names, we run into a bit of a problem. What are
+the mangled names of `(+)`, `(-)`, and so on? We could
+continue to hardcode them as `plus`, `minus`, etc., but this can
+(and currently does!) lead to errors. Consider the following
+piece of code:
+
+```
+defn plus x y = { x + y }
+defn main = { plus 320 6 }
+```
+
+We've hardcoded the mangled name of `(+)` to be `plus`. However,
+`global_scope` doesn't know about this, so when the actual
+`plus` function gets translated, it also gets assigned the
+mangled name `plus`. The name is also overwritten in the
+`llvm_context`, which effectively means that `(+)` is
+now compiled to a call of the user-defined `plus` function.
+If we didn't overwrite the name, we would've run into an assertion
+failure in this scenario anyway. In short, this example illustrates
+an important point: mangling information needs to be available
+outside of a `global_scope`. We don't want to do this by having
+every function take in a `global_scope` to access the mangling
+information; instead, we'll store the mangling information in
+a new `mangler` class, which `global_scope` will take as an argument.
+The new class is very simple:
+
+{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}
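
To see how a shared mangler avoids the `plus` collision, here is a small sketch. The post does not show `new_mangled_name`, so the scheme below (hand out the name unchanged the first time, then append `_1`, `_2`, and so on) is an assumption, as is the class name `mangler_sketch`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch; the numeric-suffix scheme is an assumption.
class mangler_sketch {
    std::map<std::string, int> occurrence_count;

public:
    std::string new_mangled_name(const std::string& name) {
        // operator[] value-initializes the count to 0 on first use.
        int occurrence = occurrence_count[name]++;
        if (occurrence == 0) return name;
        return name + "_" + std::to_string(occurrence);
    }
};
```

If the built-in `(+)` requests `plus` first, a later user-defined `plus` function receives `plus_1` from the same mangler, so the two no longer clash.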
+
+As with `parse_driver`, `global_scope` takes `mangler` by reference
+and stores a pointer:
+
+{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}
+
+The implementation of `new_mangled_name` doesn't change, so I'm
+not going to show it here. With this new mangling information
+in hand, we can now correctly set the mangled names of binary
+operators:
+
+{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}
+
+Wait a moment, what's a `compiler`? Let's talk about that next.
+
+### A Top-Level Class
+Now that we've moved name mangling out of `global_scope`, we have
+to put it somewhere. The same goes for the global definition group
+and the file manager that are given to `parse_driver`. The two
+classes _make use_ of the other data, but they don't _own it_.
+That's why they take it by reference, and store it as a pointer.
+They're just temporarily allowed access.
+
+So, what should be the owner of all of these disparate components?
+Thus far, that has been the `main` function, or the utility
+functions that it calls out to. However, this is in bad taste:
+we have related data and operations on it, but we don't group
+them into an object. We can group all of the components of our
+compiler into a `compiler` object, and leave `main.cpp` with
+only the exception printing code.
+
+The definition of the `compiler` class begins with all of the data
+structures that we use in the process of compilation:
+
+{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}
+
+There's a loose ordering to these fields. In C++, class members are
+initialized in the order they are declared; we therefore want to make
+sure that fields that are depended on by other fields are initialized first.
+Otherwise, I tried to keep the order consistent with the conceptual path
+of the code through the compiler.
+* Parsing happens first, so we begin with `parse_driver`, which needs a 
+`file_mgr` (to populate with line information) and a `definition_group`
+(to receive the global definitions from the parser).
+* We then proceed to typechecking, for which we use a global `type_env_ptr`
+(to define the built-in functions and constructors) and a `type_mgr` (to
+manage the assignments of type variables).
+* Once a program is typechecked, we transform it, eliminating local
+function definitions and lambda functions. This is done by storing
+newly-emitted global functions into the `global_scope`, which requires a
+`mangler` to generate new names for the target functions.
+* Finally, to generate LLVM IR, we need our `llvm_context` class.
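
The ownership arrangement and the declaration-order concern can be shown in miniature. Everything in this sketch is hypothetical (`line_table`, `definition_list`, `driver_sketch`, `compiler_sketch`); it only demonstrates one owner holding the data and a user holding non-owning pointers to it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature versions of the compiler's components.
struct definition_list { std::vector<std::string> names; };
struct line_table { std::vector<int> offsets; };

// A user of the data: takes references, stores non-owning pointers.
class driver_sketch {
    line_table* lines;
    definition_list* defs;

public:
    driver_sketch(line_table& l, definition_list& d) : lines(&l), defs(&d) {}

    void run() {
        lines->offsets.push_back(0);
        defs->names.push_back("main");
    }
};

// The owner: members are initialized in declaration order, so the data
// that `driver` points at is declared (and therefore constructed) first.
class compiler_sketch {
    line_table lines;
    definition_list defs;
    driver_sketch driver;

public:
    compiler_sketch() : driver(lines, defs) {}
    // The driver holds pointers into this object, so a copy's driver
    // would still point at the original's data.
    compiler_sketch(const compiler_sketch&) = delete;
    compiler_sketch& operator=(const compiler_sketch&) = delete;

    void operator()() { driver.run(); }
    const definition_list& definitions() const { return defs; }
};
```

Declaring `driver` above `lines` and `defs` would still compile, but the driver would then be handed references to members that have not been constructed yet, which is why the order of the fields matters.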
+
+The methods of the compiler are arranged similarly:
+
+{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}
+
+The methods go as follows:
+
+* `add_default_types` adds the built-in types to the `global_env`.
+At this point in the post, these types only include `Int`. However,
+in the second section, we'll make `Bool` a built-in type, too.
+* `add_binop_type` adds a single binary operator to the global
+type environment. We saw its implementation earlier: it deals
+with both binding a type, and setting a mangled name.
+* `add_default_function_types` adds the types for each binary operator,
+and also for the `True` and `False` constructors (which we will
+cover in the second section).
+* `parse`, `typecheck`, `translate` and `compile` all do exactly
+what they say. In this case, compilation refers to creating G-machine
+instructions.
+* `create_llvm_binop` creates an internal function that forces the
+evaluation of its two arguments, and actually applies the given binary
+operator. Recall that the `(+)` in user code constructs a call to this
+function, but leaves it unevaluated until it's needed.
+* `generate_llvm` converts all the definitions in `global_scope`, which
+are at this point compiled into G-machine `instruction`s, into LLVM IR.
+* `output_llvm` contains all the code to actually generate an object
+file from the LLVM IR.
+
+These functions are mostly taken from part 12's `main.cpp`, and adjusted
+to use the `compiler`'s members rather than local definitions or arguments.
+You should compare part 12's
+[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
+file with the 
+[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
+file that we end up with at the end of this post.
+
+Next, we have the compiler's constructor, and its `operator()`. The
+latter, analogously to our parse driver, will trigger the compilation
+process. Their implementations are straightforward:
+
+{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
+
+We also add a couple of methods to give external code access to
+some of the compiler's data structures. I omit their (trivial)
+implementations, but they have the following signatures:
+
+{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}
+
+With all the compilation code tucked into our new `compiler` class,
+`main` becomes very simple. We also finally get to use our exception
+pretty printing code:
+
+{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
+
+That's all for the cleanup! We've added locations and more errors
+to the compiler, stopped throwing `0` in favor of proper exceptions
+or assertions, made name mangling more reasonable, fixed a bug with
+accidentally shadowing default functions, and organized our compilation
+process into a `compiler` class.
+
+### Keeping Things Private
+Hand-writing or generating hundreds of trivial getters and setters
+for the fields of a data class (which is standard in the world of Java) seems
+absurd to me. So, for most of this project, I stuck with
+`struct`s, rather than classes. But this is not a good policy
+to apply _everywhere_. I still think it makes sense to make
+data structures like `ast` and `type` public-by-default;
+however, I _don't_ think that way about classes like `type_mgr`,
+`llvm_context`, `type_env`, and `env`. All of these have information
+that we should never be accessing directly. Some guard this
+information with assertions. In short, it should be protected.
+
+For most classes, the changes are mechanical. For instance, we
+can make `type_env` a class simply by changing its declaration,
+and marking all of its functions public. This requires a slight
+refactoring of a line that used its `parent` field. Here's
+what it used to be (in context):
+
+{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}
+
+And here's what it is now:
+
+{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}
+
+We always declare the `definition_defn` function in
+the `global_env`. Thus, that's the only environment
+we need to know about to update the mangled name.
+
+The deal with `env` is about as simple. We just turn
+it and its two descendants into classes, and mark their
+methods and constructors public. The same
+goes for `global_scope`. To make `type_mgr`
+a class, we have to add a new method: `lookup`.
+Here's its implementation:
+
+{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}
+
+It's used in `type_var::print` as follows:
+
+{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}
+
+We can't use `resolve` here because it takes (and returns)
+a `type_ptr`. If we make it _take_ a `type*`, it won't
+be able to return its argument if it's already resolved. If we
+allow it to _return_ `type*`, we won't have an owning
+reference. We also don't want to duplicate the
+method just for this one call. Notice, though, how similar
+`type_var::print`/`lookup` and `resolve` are in terms of execution.
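
The tension between `resolve` and `lookup` is easier to see in a stripped-down form. In the sketch below, `type_node`, `type_mgr_sketch`, and the free function `show` are hypothetical simplifications: types are just names with an "is a variable" flag, and `show` stands in for `type_var::print`.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct type_node {
    std::string name;
    bool is_var;
};
using type_ptr = std::shared_ptr<type_node>;

// Hypothetical, simplified type manager.
class type_mgr_sketch {
    std::map<std::string, type_ptr> types;  // variable name -> bound type

public:
    void bind(const std::string& var, type_ptr t) { types[var] = std::move(t); }

    // Returns what a variable is bound to, or nullptr if it is unbound.
    type_ptr lookup(const std::string& var) const {
        auto it = types.find(var);
        return it != types.end() ? it->second : nullptr;
    }

    // Needs an owning pointer: when `t` is already resolved, it is
    // returned as-is.
    type_ptr resolve(type_ptr t) const {
        while (t->is_var) {
            type_ptr next = lookup(t->name);
            if (!next) break;
            t = next;
        }
        return t;
    }
};

// A printing routine only has a plain reference to the node, so it cannot
// call resolve; it follows the chain one lookup at a time instead.
std::string show(const type_node& t, const type_mgr_sketch& mgr) {
    if (t.is_var) {
        type_ptr bound = mgr.lookup(t.name);
        if (bound) return show(*bound, mgr);
    }
    return t.name;
}
```

Both `resolve` and `show` walk the same chain of bindings; the difference is only in what kind of handle to the starting type each one has.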
+
+The change for `llvm_context` requires a little more work.
+Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
+Since we don't want to forward each of the LLVM builder methods,
+and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
+we'll just provide a getter for the `builder` field. The
+same goes for `module`:
+
+{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}
+
+Here's what some of the code from `instruction.cpp` looks like now:
+
+{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}
+
+Right now, the `ctx` field of the `llvm_context` (which contains
+the `llvm::LLVMContext`) is only externally used to create
+instances of `llvm::BasicBlock`. We'll add a proxy method
+for this functionality:
+
+{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}
+
+Finally, `instruction_pushglobal` needs to access the
+`llvm::Function` instances that we create in the process
+of compilation. We add a new `get_custom_function` method
+to support this, which automatically prefixes the function
+name with `f_`, much like `create_custom_function`:
+
+{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}
+
+I think that's enough. If we chose to turn more compiler
+data structures into classes, I think we would've quickly drowned
+in one-line getter and setter methods.