Update blog post, switching away from two sections.
parent
7226d66f67
commit
98cac103c4
@ -8,28 +8,26 @@ description: "In this post, we clean up our compiler and add some basic optimiza
|
||||
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
|
||||
and lambda expressions to our compiler. At the end of that post, I mentioned
|
||||
that before we move on to bigger and better things, I wanted to take a
|
||||
step back and clean up the compiler.
|
||||
step back and clean up the compiler. Now is the time to do that.
|
||||
|
||||
Recently, I got around to doing that. Unfortunately, I also got around to doing
|
||||
a lot more. Furthermore, I managed to make the changes in such a way that I
|
||||
can't cleanly separate the 'cleanup' and 'optimization' portions of my work.
|
||||
This is partially due to the way in which I organize code, where each post
|
||||
is associated with a version of the compiler with the necessary changes.
|
||||
Because of all this, instead of making this post about the cleanup, and the
|
||||
next post about the optimizations, I have to merge them into one.
|
||||
In particular, I identified three things that could be improved
|
||||
or cleaned up:
|
||||
|
||||
So, this post is split into two major portions: cleanup, which deals mostly
|
||||
with touching up exceptions and improving the 'name mangling' logic, and
|
||||
optimizations, which deals with adding special treatment to booleans,
|
||||
unboxing integers, and implementing more binary operators.
|
||||
* __Error handling__. We need to stop using `throw 0` and start
|
||||
using `assert`. We can also make our errors much more descriptive
|
||||
by including source locations in the output.
|
||||
* __Name mangling__. I don't think I got it quite right last
|
||||
time. Now is the time to clean it up.
|
||||
* __Code organization__. I think we can benefit from a top-level
|
||||
class, and a more clear "dependency order" between the various
|
||||
classes and structures we've defined.
|
||||
* __Code style__. In particular, I've been lazily using `struct`
|
||||
in a lot of places. That's not a good idea; it's better
|
||||
to use `class`, and only expose _some_ fields and methods
|
||||
to the rest of the code.
|
||||
|
||||
### Section 1: Cleanup
|
||||
|
||||
The previous post was
|
||||
{{< sidenote "right" "long-note" "rather long," >}}
|
||||
Probably not as long as this one, though! I really need to get the
|
||||
size of my posts under control.
|
||||
{{< /sidenote >}} which led me to omit
|
||||
### Error Reporting and Handling
|
||||
The previous post was rather long, which led me to omit
|
||||
a rather important aspect of the compiler: proper error reporting.
|
||||
Once again our compiler has instances of `throw 0`, which is a cheap way
|
||||
of avoiding properly handling a runtime error. Before we move on,
|
||||
@ -62,7 +60,7 @@ automatically assemble the "from" and "to" locations of a nonterminal
|
||||
from the locations of children, which would be very tedious to write
|
||||
by hand. We enable this feature using the following option:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parser.y" 50 50 >}}
|
||||
{{< codelines "C++" "compiler/13/parser.y" 46 46 >}}
|
||||
|
||||
There's just one hitch, though. Sure, Bison can compute bigger
|
||||
locations from smaller ones, but it must get the smaller ones
|
||||
@ -97,25 +95,27 @@ to `columns` and `step` to every rule, we can define the
|
||||
`YY_USER_ACTION` macro, which is run before each token
|
||||
is processed.
|
||||
|
||||
{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}}
|
||||
{{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}
|
||||
|
||||
We'll see why we are using `drv` soon; for now, you can treat
|
||||
`location` as if it were a global variable declared in the
|
||||
tokenizer. Before processing each token, we ensure that
|
||||
`location` has its `begin` and `end` at the same position,
|
||||
We'll see why we are using `LOC` instead of something like `location` soon;
|
||||
for now, you can treat `LOC` as if it were a global variable declared
|
||||
in the tokenizer. Before processing each token, we ensure that
|
||||
the `yy::location` has its `begin` and `end` at the same position,
|
||||
and then advance `end` by `yyleng` columns. This is sufficient
|
||||
to make `location` represent our token's source position.
|
||||
to make `LOC` represent our token's source position. For
|
||||
the moment, don't worry too much about `drv`; this is the
|
||||
parse driver, and we will talk about it shortly.
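To make the `begin`/`end` bookkeeping concrete, here is a minimal sketch of what happens before each token. The `mini_position` and `mini_location` types and the `advance_for_token` helper are hypothetical stand-ins that I'm assuming for illustration; they imitate the `step` and `columns` operations of Bison's generated `yy::location`, and are not the generated code itself.

```cpp
#include <cassert>

// Hypothetical stand-in for a Bison-style position: a line and a column.
struct mini_position {
    int line = 1;
    int column = 1;
};

// Hypothetical stand-in for a Bison-style location: a [begin, end) range.
struct mini_location {
    mini_position begin;
    mini_position end;

    // Collapse the range so that it starts where the previous token ended.
    void step() { begin = end; }
    // Extend the end of the range by `count` columns.
    void columns(int count) { end.column += count; }
};

// What the per-token action boils down to: reset, then cover `length` characters.
inline void advance_for_token(mini_location& loc, int length) {
    assert(length >= 0);
    loc.step();
    loc.columns(length);
}
```

Calling `advance_for_token` once per token, with the token's length, leaves the location spanning exactly that token, which is the property the scanner relies on.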
|
||||
|
||||
So now we have a "global" variable `location` that gives
|
||||
So now we have a "global" variable `LOC` that gives
|
||||
us the source position of the current token. To get it
|
||||
to Bison, we have to pass it as an argument to each
|
||||
of the `make_TOKEN` calls. Here are a few sample lines
|
||||
that should give you the general idea:
|
||||
|
||||
{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}}
|
||||
{{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}
|
||||
|
||||
That last line is actually new. Previously, we somehow
|
||||
got away without explicitly sending the EOF token to Bison.
|
||||
got away without explicitly sending the end-of-file token to Bison.
|
||||
I suspect that this was due to some kind of implicit conversion
|
||||
of the Flex macro `YY_NULL` into a token; now that we have
|
||||
to pass a position to every token constructor, such an implicit
|
||||
@ -146,10 +146,10 @@ from `ast_binop`:
|
||||
Finally, we tell Bison to pass the computed location
|
||||
data as an argument when constructing our data structures.
|
||||
This too is a mechanical change, and I think the following
|
||||
couple of lines demonstrate the general idea in sufficient
|
||||
few lines demonstrate the general idea in sufficient
|
||||
detail:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parser.y" 107 110 >}}
|
||||
{{< codelines "C++" "compiler/13/parser.y" 92 96 >}}
|
||||
|
||||
Here, the `@$` character is used to reference the current
|
||||
nonterminal's location data.
|
||||
@ -189,7 +189,9 @@ working with files, why not just work directly with the files
|
||||
created by the user? Instead of reading from `stdin`, we may
|
||||
as well take in a path to a file via `argv`, and read from there.
|
||||
Also, instead of `fseek` and `rewind`, we can just read the file
|
||||
into memory, and access it like a normal character buffer.
|
||||
into memory, and access it like a normal character buffer. This
|
||||
does mean that we can stick with `stdin`, but it's more conventional
|
||||
to read source code from files, anyway.
|
||||
|
||||
To address the second issue, we can keep a mapping of line numbers
|
||||
to their locations in the source buffer. This is rather easy to
|
||||
@ -200,7 +202,7 @@ the current source location to the top, marking it as
|
||||
the beginning of another line. Where exactly we store this
|
||||
array is as yet unclear, since we're trying to avoid global variables.
|
||||
|
||||
Finally, begin addressing the third issue, we can use Flex's `reentrant`
|
||||
Finally, to begin addressing the third issue, we can use Flex's `reentrant`
|
||||
option, which makes it so that all of the tokenizer's state is stored in an
|
||||
opaque `yyscan_t` structure, rather than in global variables. This way,
|
||||
we can configure `yyin` without setting a global variable, which is a step
|
||||
@ -221,50 +223,38 @@ creation of a _parsing driver_.
|
||||
The parsing driver is a class (or struct) that holds all the parse-related
|
||||
state. We can arrange for this class to be available to our tokenizing
|
||||
and parsing functions, which will allow us to use it pretty much like we'd
|
||||
use a global variable. We can define it as follows:
|
||||
use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
|
||||
We can define it as follows:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}
|
||||
|
||||
There are quite a few fields here. The `file_name` string represents
|
||||
the file that we'll be reading code from. The `string_stream` will
|
||||
be used to back up the contents of the source file as Flex reads them;
|
||||
once Flex is done, the content of the `string_stream` will be
|
||||
saved into the `file_content` string.
|
||||
There aren't many fields here. The `file_name` string represents
|
||||
the file that we'll be reading code from. The `location` field
|
||||
will be accessed by Flex via `get_current_location`. Bison will
|
||||
store the function and data type definitions it reads into `global_defs`
|
||||
via `get_global_defs`. Finally, `file_m` will be used to keep track
|
||||
of the content of the file we're reading, as well as the line offsets
|
||||
within that file. Notice that a couple of these fields are pointers
|
||||
that we take by reference in the constructor. The `parse_driver` doesn't
|
||||
_own_ the global definitions, nor the file manager. They exist outside
|
||||
of it, and will continue to be used in other ways the `parse_driver`
|
||||
does not need to know about. Also, the `LOC` variable in Flex is
|
||||
actually a call to `get_current_location`:
|
||||
|
||||
The next three fields deal with tracking source code
|
||||
locations. The `location` field will be accessed by Flex
|
||||
via `drv.location` (where `drv` is a reference to our driver class).
|
||||
The `file_offset` and `line_offsets` fields will be used to
|
||||
keep track of where each line begins, as we have discussed above.
|
||||
Finally, `global_defs` will be the new home of our top-level
|
||||
definitions.
|
||||
{{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}
|
||||
|
||||
The methods on `parse_driver` are rather simple, too:
|
||||
The methods of `parse_driver` are rather simple. The majority of
|
||||
them deal with giving access to the parser's members: the `yy::location`,
|
||||
the `definition_group`, and the `file_mgr`. The only exception
|
||||
to this is `operator()`, which we use to actually trigger the parsing process.
|
||||
We'll make this method return `true` if parsing succeeded, and `false`
|
||||
otherwise (if, say, the file we tried to read doesn't exist).
|
||||
Here's its implementation:
|
||||
|
||||
* `run_parse` handles the initialization of the tokenizer
|
||||
and parser, which includes obtaining the `FILE*` and configuring
|
||||
Flex to use it. It also handles invoking the parsing code.
|
||||
We'll make this method return `true` if parsing succeeded,
|
||||
and `false` otherwise (if, say, the file we tried to read doesn't exist).
|
||||
* `write` will be called from Flex, and will allow us to
|
||||
record the content of the file we're processing to the `string_stream`.
|
||||
We've already seen it used in the `YY_USER_ACTION` macro.
|
||||
* `mark_line` will also be called from Flex, and will mark the current
|
||||
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
|
||||
* `get_index` and `get_line_end` will be used for converting
|
||||
`yy::location` instances to offsets within the source code buffer.
|
||||
* `print_location` will be used for printing errors.
|
||||
It will print the lines spanned by the given location, with the
|
||||
location itself colored and underlined if the last argument is `true`.
|
||||
This will make our errors easier on the eyes.
|
||||
|
||||
Let's take a look at their implementations. First, `run_parse`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}
|
||||
|
||||
We try to open the user-specified file, and return `false` if we can't.
|
||||
We then initialize `line_offsets` as we discussed above. After
|
||||
this, we start doing the setup specific to a reentrant
|
||||
After this, we start doing the setup specific to a reentrant
|
||||
Flex scanner. We declare a `yyscan_t` variable, which
|
||||
will contain all of Flex's state. Then, we initialize
|
||||
it using `yylex_init`. Finally, since we can no longer
|
||||
@ -279,24 +269,65 @@ We'll come back to how this works in a moment. With
|
||||
the scanner and parser initialized, we invoke `parser::operator()`,
|
||||
which actually runs the Flex- and Bison-generated code.
|
||||
To clean up, we run `yylex_destroy` and `fclose`. Finally,
|
||||
we extract the contents of our file into the `file_contents`
|
||||
string, and return.
|
||||
we call `file_mgr::finalize`, and return. But what
|
||||
_is_ `file_mgr`?
|
||||
|
||||
Next, the `write` method. For the most part, this method
|
||||
is a proxy for the `write` method of our `string_stream`:
|
||||
The `file_mgr` class does two things: it stores the part of the file
|
||||
that has already been read by Flex in memory, and it keeps track of
|
||||
where each line in our source file begins within the text. Here is its
|
||||
definition:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
|
||||
|
||||
In this class, the `string_stream` member is used to construct
|
||||
an `std::string` from the bits of text that Flex reads,
|
||||
processes, and feeds to the `file_mgr` using the `write` method.
|
||||
It's more efficient to use a string stream than to concatenate
|
||||
strings repeatedly. Once Flex is finished processing the file,
|
||||
the final contents of the `string_stream` are transferred into
|
||||
the `file_contents` string using the `finalize` method. The `offset`
|
||||
and `line_offsets` fields will be used as we described earlier: each time Flex
|
||||
encounters the `\n` character, the `offset` variable will be pushed
|
||||
on top of the `line_offsets` vector, marking the beginning of
|
||||
the corresponding line. The methods of the class are as follows:
|
||||
|
||||
* `write` will be called from Flex, and will allow us to
|
||||
record the content of the file we're processing to the `string_stream`.
|
||||
We've already seen it used in the `YY_USER_ACTION` macro.
|
||||
* `mark_line` will also be called from Flex, and will mark the current
|
||||
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
|
||||
* `finalize` will be called by the `parse_driver` when the parsing
|
||||
finishes. At this time, the `string_stream` should contain all of
|
||||
the input file, and this data is transferred to `file_contents`, as
|
||||
we mentioned above.
|
||||
* `get_index` and `get_line_end` will be used for converting
|
||||
`yy::location` instances to offsets within the source code buffer.
|
||||
* `print_location` will be used for printing errors.
|
||||
It will print the lines spanned by the given location, with the
|
||||
location itself colored and underlined if the last argument is `true`.
|
||||
This will make our errors easier on the eyes.
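Taken together, the bookkeeping described in this list might look roughly like the following sketch. The class name `sketch_file_mgr`, the `contents` accessor, and all of the method bodies are my assumptions for illustration; only the method names and their described behavior come from the text, and `print_location` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch of a file manager; names mirror the post, bodies are guesses.
class sketch_file_mgr {
    private:
        std::ostringstream string_stream;
        std::string file_contents;
        std::size_t offset = 0;
        // Line 1 always begins at offset 0.
        std::vector<std::size_t> line_offsets{0};

    public:
        // Record a chunk of text that the tokenizer has just consumed.
        void write(const char* buffer, std::size_t len) {
            string_stream.write(buffer, static_cast<std::streamsize>(len));
            offset += len;
        }

        // Called right after a newline is written: the next line starts here.
        void mark_line() { line_offsets.push_back(offset); }

        // Move the accumulated text into a plain string once reading is done.
        void finalize() { file_contents = string_stream.str(); }

        // Convert a 1-based (line, column) pair into a buffer offset.
        std::size_t get_index(int line, int column) const {
            assert(line > 0 && column > 0);
            assert(static_cast<std::size_t>(line) <= line_offsets.size());
            return line_offsets.at(line - 1) + column - 1;
        }

        // Exclusive end of a line: the start of the next one, or the end of the file.
        std::size_t get_line_end(int line) const {
            assert(line > 0);
            assert(static_cast<std::size_t>(line) <= line_offsets.size());
            if (static_cast<std::size_t>(line) == line_offsets.size()) return file_contents.size();
            return line_offsets.at(line);
        }

        // Hypothetical accessor, added so the sketch can be inspected.
        const std::string& contents() const { return file_contents; }
};
```

The assertions guard against programmer errors such as a non-positive line number, which is the same role assertions play elsewhere in this post.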
|
||||
|
||||
Let's take a look at their implementations. First, `write`.
|
||||
For the most part, this method is a proxy for the `write`
|
||||
method of our `string_stream`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}
|
||||
|
||||
We do, however, also keep track of the `file_offset` variable
|
||||
here, which ensures we have up-to-date information
|
||||
regarding our position in the source file. The implementation
|
||||
of `mark_line` uses this information:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}
|
||||
|
||||
The `finalize` method is trivial, and requires little additional
|
||||
discussion:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}
|
||||
|
||||
Once we have the line offsets, `get_index` becomes very simple:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}
|
||||
|
||||
Here, we use an assertion for the first time. Calling
|
||||
`get_index` with a negative or zero line doesn't make
|
||||
@ -313,7 +344,7 @@ beginning of the next line. We stick to the C convention
|
||||
of marking 'end' indices exclusive (pointing just past
|
||||
the end of the array):
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}
|
||||
|
||||
Since `line_offsets` has as many elements as there are lines,
|
||||
the last line number would be equal to the vector's size.
|
||||
@ -333,7 +364,7 @@ we sprinkle the ANSI escape codes to enable and disable
|
||||
special formatting, respectively. For now, the special
|
||||
formatting involves underlining the text and making it red.
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}
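The highlighting itself is easy to isolate. Below is a small sketch of the "sprinkle escape codes around a range" idea; `highlight_range` is a hypothetical helper of my own, not a function from the compiler. It relies only on the standard ANSI SGR codes `4` (underline), `31` (red foreground), and `0` (reset).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Wrap text[begin, end) in ANSI escape codes: underline (4) and red (31).
inline std::string highlight_range(const std::string& text, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= text.size());
    const std::string on = "\033[4;31m";   // enable underline + red
    const std::string off = "\033[0m";     // reset all formatting
    return text.substr(0, begin) + on + text.substr(begin, end - begin) + off + text.substr(end);
}
```

A terminal that understands ANSI codes would render only the wrapped range in red with an underline; the rest of the line is printed as-is.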
|
||||
|
||||
Finally, to get the forward declarations for the `yy*` functions
|
||||
and types, we set the `header-file` option in Flex:
|
||||
@ -386,12 +417,7 @@ the state and the parse driver, we have to define the
|
||||
this forward declaration will be used by both Flex
|
||||
and Bison:
|
||||
|
||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}}
|
||||
|
||||
Finally, we can change our `main.cpp` file to use the
|
||||
`parse_driver`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
|
||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}
|
||||
|
||||
#### Improving Exceptions
|
||||
Now, it's time to add location data (and a little bit more) to our
|
||||
@ -421,7 +447,7 @@ the following two lines to our CMakeLists.txt:
|
||||
Now, let's add a new base class for all of our compiler errors,
|
||||
unsurprisingly called `compiler_error`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}}
|
||||
{{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}
|
||||
|
||||
We'll put some 'common' exception functionality
|
||||
into the `print_location` and `print_about` methods. If the error
|
||||
@ -467,11 +493,7 @@ first, and is treat like the "correct" type. The
|
||||
`right` type, on the other hand, is treated
|
||||
like the "wrong" type that should have been
|
||||
unifiable with `left`. This will affect the
|
||||
calling conventions of our unification code. In
|
||||
`main`, we remove all our old exception printing code
|
||||
in favor of calls to `pretty_print`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
|
||||
calling conventions of our unification code.
|
||||
|
||||
Now, we can go through and find all the places where
|
||||
we `throw 0`. One such place was in the data type
|
||||
@ -513,7 +535,7 @@ In general, this change is also rather mechanical, but, to
|
||||
maintain a balance between exceptions and assertions, here
|
||||
are a couple more assertions from `type_env`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}}
|
||||
{{< codelines "C++" "compiler/13/type_env.cpp" 76 77 >}}
|
||||
|
||||
Once again, it should not be possible for the compiler
|
||||
to try to generalize the type of a variable that doesn't
|
||||
@ -528,35 +550,34 @@ To fix this, we add a new `loc` parameter to `unify`,
|
||||
which we make optional to allow for unification without
|
||||
a known location. Here's the declaration:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}}
|
||||
{{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}
|
||||
|
||||
The change to the implementation is mechanical and repetitive,
|
||||
so instead of showing you the whole method, I'll settle for
|
||||
a couple of lines:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}}
|
||||
{{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}
|
||||
|
||||
We want to make sure that a location provided to the
|
||||
top-level call to `unify` is also forwarded to the
|
||||
recursive calls, so we have to explicitly add it
|
||||
to the call.
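Here is a toy illustration of why the location has to be forwarded by hand. Everything in it (`toy_type`, `toy_loc`, `toy_unify`, `toy_unification_error`) is a hypothetical simplification, not the compiler's actual `unify`; in particular, I model "optional" with a pointer that defaults to `nullptr`, which is just one way to do it.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified types: either a named base type or a function type.
struct toy_type;
using toy_type_ptr = std::shared_ptr<toy_type>;

struct toy_type {
    std::string name;          // used when this is a base type
    toy_type_ptr left, right;  // both set when this is a function type
    bool is_arrow() const { return left && right; }
};

struct toy_loc { int line; };

// Error that remembers where unification was requested (0 means "unknown").
struct toy_unification_error : std::runtime_error {
    int line;
    explicit toy_unification_error(int l)
        : std::runtime_error("failed to unify types"), line(l) {}
};

inline toy_type_ptr base(const std::string& n) {
    return std::make_shared<toy_type>(toy_type{n, nullptr, nullptr});
}
inline toy_type_ptr arrow(toy_type_ptr l, toy_type_ptr r) {
    return std::make_shared<toy_type>(toy_type{"", std::move(l), std::move(r)});
}

// The location is optional; a null pointer stands for "no known location".
inline void toy_unify(const toy_type_ptr& l, const toy_type_ptr& r, const toy_loc* loc = nullptr) {
    if (l->is_arrow() && r->is_arrow()) {
        // Forward the caller's location explicitly; the default would lose it.
        toy_unify(l->left, r->left, loc);
        toy_unify(l->right, r->right, loc);
        return;
    }
    if (!l->is_arrow() && !r->is_arrow() && l->name == r->name) return;
    throw toy_unification_error(loc ? loc->line : 0);
}
```

If the recursive calls omitted `loc`, a mismatch found deep inside a function type would be reported without a source position, even though the top-level caller supplied one.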
|
||||
|
||||
With all of that done, we can finally stand back and
|
||||
marvel at the results of our hard work. Here is what a
|
||||
basic unification error looks like now:
|
||||
|
||||
{{< figure src="unification_error.png" caption="The result of a unification error." >}}
|
||||
|
||||
I used an image to show colors, but here is the content of the error in textual form:
|
||||
We'll also have to update the 'main' code to call the
|
||||
`pretty_print` methods, but there's another big change
|
||||
that we're going to make before then. However, once that
|
||||
change is made, our errors will look a lot better.
|
||||
Here is what's printed out to the user when a type error
|
||||
occurs:
|
||||
|
||||
```
|
||||
an error occured while checking the types of the program: failed to unify types
|
||||
occuring on line 2:
|
||||
3 + False
|
||||
the expected type was:
|
||||
!Int
|
||||
Int
|
||||
while the actual type was:
|
||||
!Bool
|
||||
Bool
|
||||
```
|
||||
|
||||
The exclamation marks in front of the two types are due to some
|
||||
@ -572,3 +593,275 @@ data Pair a a = { MkPair a a }
|
||||
Now, not only have we eliminated the lazy uses of `throw 0` in our
|
||||
code, but we've also improved the presentation of the errors
|
||||
to the user!
|
||||
|
||||
### Rethinking Name Mangling
|
||||
In the previous post, I said the following:
|
||||
|
||||
> One more thing. Let’s adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a ‘gotcha’ for anyone working on the compiler, we will always use the mangled names during compilation.
|
||||
|
||||
Now that I've had some more time to think about it
|
||||
(and now that I've returned to the compiler after
|
||||
a brief hiatus), I think that this was not the right call.
|
||||
Mangled names make sense when translating to LLVM; we certainly
|
||||
don't want to declare two LLVM functions with the same name.
|
||||
But things are different for local variables. Our local variables
|
||||
are graphs on a stack, and are not actually compiled to LLVM
|
||||
definitions. It doesn't make sense to mangle their names, since
|
||||
their names aren't present anywhere in the final executable.
|
||||
It's not even "consistent" to mangle them, since global definitions
|
||||
are compiled directly to __PushGlobal__ instructions, while local
|
||||
variables are only referenced through the current `env`.
|
||||
So, I decided to reverse my decision. We will go back to
|
||||
placing variable names directly onto `env_var`. Here's
|
||||
an example of this from `global_scope.cpp`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
|
||||
|
||||
Now that we've started using assertions, I also think it's worth
|
||||
putting our new invariant -- "only global definitions have mangled
|
||||
names" -- into code:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type_env.cpp" 35 43 >}}
|
||||
|
||||
Furthermore, we'll _require_ that a global definition
|
||||
has a mangled name. This way, we can be more confident
|
||||
that a variable from a __PushGlobal__ instruction
|
||||
is referencing the right function. To achieve
|
||||
this, we change `get_mangled_name` to stop
|
||||
returning the input string if a mangled name was not
|
||||
found; now that we _must_ have a mangled name, doing
|
||||
so is effectively obscuring the error. Instead,
|
||||
we add another assertion: if an environment scope doesn't
|
||||
contain a mangled name for a variable, then it _must_
|
||||
have a parent. We end up with the following:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type_env.cpp" 45 51 >}}
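For readers who want the two invariants side by side, here is a self-contained sketch of a scoped environment that enforces them. `toy_env`, `toy_visibility`, and `bind` are hypothetical names I'm assuming for illustration; the real `type_env` carries more data (types, for one) and may differ in its details.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class toy_visibility { global, local };

// Hypothetical sketch of a scoped environment that tracks mangled names.
class toy_env {
    private:
        struct entry {
            toy_visibility vis;
            std::string mangled_name;  // empty until one is assigned
        };
        std::map<std::string, entry> names;
        const toy_env* parent;

    public:
        explicit toy_env(const toy_env* p = nullptr) : parent(p) {}

        void bind(const std::string& name, toy_visibility vis) {
            names[name] = entry{vis, ""};
        }

        // Invariant: only global definitions may have mangled names.
        void set_mangled_name(const std::string& name, const std::string& mangled) {
            auto it = names.find(name);
            assert(it != names.end());
            assert(it->second.vis == toy_visibility::global);
            it->second.mangled_name = mangled;
        }

        bool is_global(const std::string& name) const {
            auto it = names.find(name);
            if (it != names.end()) return it->second.vis == toy_visibility::global;
            return parent != nullptr && parent->is_global(name);
        }

        // No silent fallback to the input string: a missing name is a bug.
        const std::string& get_mangled_name(const std::string& name) const {
            auto it = names.find(name);
            if (it != names.end() && !it->second.mangled_name.empty()) {
                return it->second.mangled_name;
            }
            // If this scope has no mangled name, it must have a parent to ask.
            assert(parent != nullptr);
            return parent->get_mangled_name(name);
        }
};
```

With this shape, a caller is expected to check `is_global` first and only then ask for the mangled name; asking for the mangled name of a local variable trips the assertion instead of quietly returning something plausible.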
|
||||
|
||||
Since looking up a mangled name for a non-global variable
|
||||
will now result in an assertion failure, we have to change
|
||||
`ast_lid::compile` to only call `get_mangled_name` once
|
||||
it ensures that the variable being compiled is, in fact,
|
||||
global:
|
||||
|
||||
{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}
|
||||
|
||||
Since all global functions now need to have mangled
|
||||
names, we run into a bit of a problem. What are
|
||||
the mangled names of `(+)`, `(-)`, and so on? We could
|
||||
continue to hardcode them as `plus`, `minus`, etc., but this can
|
||||
(and currently does!) lead to errors. Consider the following
|
||||
piece of code:
|
||||
|
||||
```
|
||||
defn plus x y = { x + y }
|
||||
defn main = { plus 320 6 }
|
||||
```
|
||||
|
||||
We've hardcoded the mangled name of `(+)` to be `plus`. However,
|
||||
`global_scope` doesn't know about this, so when the actual
|
||||
`plus` function gets translated, it also gets assigned the
|
||||
mangled name `plus`. The name is also overwritten in the
|
||||
`llvm_context`, which effectively means that `(+)` is
|
||||
now compiled to a call of the user-defined `plus` function.
|
||||
If we didn't overwrite the name, we would've run into an assertion
|
||||
failure in this scenario anyway. In short, this example illustrates
|
||||
an important point: mangling information needs to be available
|
||||
outside of a `global_scope`. We don't want to do this by having
|
||||
every function take in a `global_scope` to access the mangling
|
||||
information; instead, we'll store the mangling information in
|
||||
a new `mangler` class, which `global_scope` will take as an argument.
|
||||
The new class is very simple:
|
||||
|
||||
{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}
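To see how a shared name registry avoids the `plus` collision, consider the sketch below. The numbering scheme in `toy_mangler` (first use keeps the name, later uses get a `_N` suffix) is an assumption on my part, and `toy_scope` with its `add_function` method is a hypothetical stand-in for `global_scope`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical mangler: the numbering scheme below is an assumption.
class toy_mangler {
    private:
        std::map<std::string, int> occurrence_count;

    public:
        // The first request for a name returns it unchanged; later ones get a suffix.
        std::string new_mangled_name(const std::string& name) {
            int seen = occurrence_count[name]++;
            if (seen == 0) return name;
            return name + "_" + std::to_string(seen);
        }
};

// A scope that uses a mangler it does not own: taken by reference, kept as a pointer.
class toy_scope {
    private:
        toy_mangler* mng;

    public:
        explicit toy_scope(toy_mangler& m) : mng(&m) {}
        std::string add_function(const std::string& name) {
            return mng->new_mangled_name(name);
        }
};
```

Because the built-in operator and the user's function both go through the same `toy_mangler`, whichever one asks second receives a different name, and neither can overwrite the other.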
|
||||
|
||||
As with `parse_driver`, `global_scope` takes `mangler` by reference
|
||||
and stores a pointer:
|
||||
|
||||
{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}
|
||||
|
||||
The implementation of `new_mangled_name` doesn't change, so I'm
|
||||
not going to show it here. With this new mangling information
|
||||
in hand, we can now correctly set the mangled names of binary
|
||||
operators:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}
|
||||
|
||||
Wait a moment, what's a `compiler`? Let's talk about that next.
|
||||
|
||||
### A Top-Level Class
|
||||
Now that we've moved name mangling out of `global_scope`, we have
|
||||
to put it somewhere. The same goes for the global definition group
|
||||
and the file manager that are given to `parse_driver`. The two
|
||||
classes _make use_ of the other data, but they don't _own it_.
|
||||
That's why they take it by reference, and store it as a pointer.
|
||||
They're just temporarily allowed access.
|
||||
|
||||
So, what should be the owner of all of these disparate components?
|
||||
Thus far, that has been the `main` function, or the utility
|
||||
functions that it calls out to. However, this is in bad taste:
|
||||
we have related data and operations on it, but we don't group
|
||||
them into an object. We can group all of the components of our
|
||||
compiler into a `compiler` object, and leave `main.cpp` with
|
||||
exception printing code.
|
||||
|
||||
The definition of the `compiler` class begins with all of the data
|
||||
structures that we use in the process of compilation:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}
|
||||
|
||||
There's a loose ordering to these fields. In C++, class members are
|
||||
initialized in the order they are declared; we therefore want to make
|
||||
sure that fields that are depended on by other fields are initialized first.
|
||||
Otherwise, I tried to keep the order consistent with the conceptual path
|
||||
of the code through the compiler.
|
||||
* Parsing happens first, so we begin with `parse_driver`, which needs a
|
||||
`file_mgr` (to populate with line information) and a `definition_group`
|
||||
(to receive the global definitions from the parser).
|
||||
* We then proceed to typechecking, for which we use a global `type_env_ptr`
|
||||
(to define the built-in functions and constructors) and a `type_mgr` (to
|
||||
manage the assignments of type variables).
|
||||
* Once a program is typechecked, we transform it, eliminating local
|
||||
function definitions and lambda functions. This is done by storing
|
||||
newly-emitted global functions into the `global_scope`, which requires a
|
||||
`mangler` to generate new names for the target functions.
|
||||
* Finally, to generate LLVM IR, we need our `llvm_context` class.
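The ownership and ordering rules are easier to see in miniature. In the sketch below, all of the `toy_*` types are hypothetical and far simpler than the real ones; the point is only that the owner declares the owned data before the component that points into it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical components; only the ownership pattern matters here.
struct toy_file_mgr { std::string contents; };
struct toy_definitions { std::vector<std::string> names; };

// Uses, but does not own, the two structures above.
class toy_driver {
    private:
        toy_file_mgr* file_m;
        toy_definitions* defs;

    public:
        toy_driver(toy_file_mgr& f, toy_definitions& d) : file_m(&f), defs(&d) {}
        void run(const std::string& source) {
            file_m->contents = source;
            defs->names.push_back("main");
        }
};

class toy_compiler {
    private:
        // Declaration order is initialization order, so the owned data comes
        // before the driver that points into it.
        toy_file_mgr file_m;
        toy_definitions global_defs;
        toy_driver driver;

    public:
        toy_compiler() : driver(file_m, global_defs) {}
        // Copying would leave the driver pointing into another object's members.
        toy_compiler(const toy_compiler&) = delete;
        toy_compiler& operator=(const toy_compiler&) = delete;

        void operator()(const std::string& source) { driver.run(source); }
        const toy_file_mgr& get_file_manager() const { return file_m; }
        const toy_definitions& get_definitions() const { return global_defs; }
};
```

Had `driver` been declared first, its constructor would have received references to members that were not yet constructed; storing their addresses would still work, but any use of them during construction would be undefined behavior.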
|
||||
|
||||
The methods of the compiler are arranged similarly:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}
|
||||
|
||||
The methods go as follows:
|
||||
|
||||
* `add_default_types` adds the built-in types to the `global_env`.
|
||||
At this point in the post, these types only include `Int`. However,
|
||||
in the second section, we'll make `Bool` a built-in type, too.
|
||||
* `add_binop_type` adds a single binary operator to the global
|
||||
type environment. We saw its implementation earlier: it deals
|
||||
with both binding a type, and setting a mangled name.
|
||||
* `add_default_function_types` adds the types for each binary operator,
|
||||
and also for the `True` and `False` constructors (which we will
|
||||
cover in the second section).
|
||||
* `parse`, `typecheck`, `translate` and `compile` all do exactly
|
||||
what they say. In this case, compilation refers to creating G-machine
|
||||
instructions.
|
||||
* `create_llvm_binop` creates an internal function that forces the
|
||||
evaluation of its two arguments, and actually applies the given binary
|
||||
operator. Recall that the `(+)` in user code constructs a call to this
|
||||
function, but leaves it unevaluated until it's needed.
|
||||
* `generate_llvm` converts all the definitions in `global_scope`, which
|
||||
are at this point compiled into G-machine `instruction`s, into LLVM IR.
|
||||
* `output_llvm` contains all the code to actually generate an object
|
||||
file from the LLVM IR.
|
||||
|
||||
These functions are mostly taken from part 12's `main.cpp`, and adjusted
|
||||
to use the `compiler`'s members rather than local definitions or arguments.
|
||||
You should compare part 12's
|
||||
[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
|
||||
file with the
|
||||
[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
|
||||
file that we end up with at the end of this post.
|
||||
|
||||
Next, we have the compiler's constructor, and its `operator()`. The
|
||||
latter, analogously to our parse driver, will trigger the compilation
|
||||
process. Their implementations are straightforward:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
|
||||
|
||||
We also add a couple of methods to give external code access to
|
||||
some of the compiler's data structures. I omit their (trivial)
|
||||
implementations, but they have the following signatures:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}
|
||||
|
||||
With all the compilation code tucked into our new `compiler` class,
|
||||
`main` becomes very simple. We also finally get to use our exception
|
||||
pretty printing code:
|
||||
|
||||
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
|
||||
|
||||
That's all for the cleanup! We've added locations and more errors
|
||||
to the compiler, stopped throwing `0` in favor of proper exceptions
|
||||
or assertions, made name mangling more reasonable, fixed a bug with
|
||||
accidentally shadowing default functions, and organized our compilation
|
||||
process into a `compiler` class.
|
||||
|
||||
### Keeping Things Private
|
||||
Hand-writing or generating hundreds of trivial getters and setters
|
||||
for the fields of a data class (which is standard in the world of Java) seems
|
||||
absurd to me. So, for most of this project, I stuck with
|
||||
`struct`s, rather than classes. But this is not a good policy
|
||||
to apply _everywhere_. I still think it makes sense to make
|
||||
data structures like `ast` and `type` public-by-default;
|
||||
however, I _don't_ think that way about classes like `type_mgr`,
|
||||
`llvm_context`, `type_env`, and `env`. All of these have information
|
||||
that we should never be accessing directly. Some guard this
|
||||
information with assertions. In short, it should be protected.
|
||||
|
||||
For most classes, the changes are mechanical. For instance, we
|
||||
can make `type_env` a class simply by changing its declaration,
|
||||
and marking all of its functions public. This requires a slight
|
||||
refactoring of a line that used its `parent` field. Here's
|
||||
what it used to be (in context):
|
||||
|
||||
{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}
|
||||
|
||||
And here's what it is now:
|
||||
|
||||
{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}
|
||||
|
||||
We always declare the `definition_defn` function in
|
||||
the `global_env`. Thus, that's the only environment
|
||||
we need to know about to update the mangled name.
|
||||
|
||||
The deal with `env` is about as simple. We just make
|
||||
it and its two descendants classes, and mark their
|
||||
methods and constructors public. The same
|
||||
goes for `global_scope`. To make `type_mgr`
|
||||
a class, we have to add a new method: `lookup`.
|
||||
Here's its implementation:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}
|
||||
|
||||
It's used in `type_var::print` as follows:
|
||||
|
||||
{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}
|
||||
|
||||
We can't use `resolve` here because it takes (and returns)
|
||||
a `type_ptr`. If we make it _take_ a `type*`, it won't
|
||||
be able to return its argument if it's already resolved. If we
|
||||
allow it to _return_ `type*`, we won't have an owning
|
||||
reference. We also don't want to duplicate the
|
||||
method just for this one call. Notice, though, how similar
|
||||
`type_var::print`/`lookup` and `resolve` are in terms of execution.
|
||||
|
||||
The change for `llvm_context` requires a little more work.
|
||||
Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
|
||||
Since we don't want to forward each of the LLVM builder methods,
|
||||
and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
|
||||
we'll just provide a getter for the `builder` field. The
|
||||
same goes for `module`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}
|
||||
|
||||
Here's what some of the code from `instruction.cpp` looks like now:
|
||||
|
||||
{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}
|
||||
|
||||
Right now, the `ctx` field of the `llvm_context` (which contains
|
||||
the `llvm::LLVMContext`) is only externally used to create
|
||||
instances of `llvm::BasicBlock`. We'll add a proxy method
|
||||
for this functionality:
|
||||
|
||||
{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}
|
||||
|
||||
Finally, `instruction_pushglobal` needs to access the
|
||||
`llvm::Function` instances that we create in the process
|
||||
of compilation. We add a new `get_custom_function` method
|
||||
to support this, which automatically prefixes the function
|
||||
name with `f_`, much like `create_custom_function`:
|
||||
|
||||
{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}
|
||||
|
||||
I think that's enough. If we chose to turn more compiler
|
||||
data structures into classes, I think we would've quickly drowned
|
||||
in one-line getter and setter methods.
|
||||
|