Update blog post, switching away from two sections.
parent
7226d66f67
commit
98cac103c4
|
@ -8,28 +8,26 @@ description: "In this post, we clean up our compiler and add some basic optimiza
|
||||||
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
|
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in`
|
||||||
and lambda expressions to our compiler. At the end of that post, I mentioned
|
and lambda expressions to our compiler. At the end of that post, I mentioned
|
||||||
that before we move on to bigger and better things, I wanted to take a
|
that before we move on to bigger and better things, I wanted to take a
|
||||||
step back and clean up the compiler.
|
step back and clean up the compiler. Now is the time to do that.
|
||||||
|
|
||||||
Recently, I got around to doing that. Unfortunately, I also got around to doing
|
In particular, I identified four things that could be improved
|
||||||
a lot more. Furthermore, I managed to make the changes in such a way that I
|
or cleaned up:
|
||||||
can't cleanly separate the 'cleanup' and 'optimization' portions of my work.
|
|
||||||
This is partially due to the way in which I organize code, where each post
|
|
||||||
is associated with a version of the compiler with the necessary changes.
|
|
||||||
Because of all this, instead of making this post about the cleanup, and the
|
|
||||||
next post about the optimizations, I have to merge them into one.
|
|
||||||
|
|
||||||
So, this post is split into two major portions: cleanup, which deals mostly
|
* __Error handling__. We need to stop using `throw 0` and start
|
||||||
with touching up exceptions and improving the 'name mangling' logic, and
|
using `assert`. We can also make our errors much more descriptive
|
||||||
optimizations, which deals with adding special treatment to booleans,
|
by including source locations in the output.
|
||||||
unboxing integers, and implementing more binary operators.
|
* __Name mangling__. I don't think I got it quite right last
|
||||||
|
time. Now is the time to clean it up.
|
||||||
|
* __Code organization__. I think we can benefit from a top-level
|
||||||
|
class, and a clearer "dependency order" between the various
|
||||||
|
classes and structures we've defined.
|
||||||
|
* __Code style__. In particular, I've been lazily using `struct`
|
||||||
|
in a lot of places. That's not a good idea; it's better
|
||||||
|
to use `class`, and only expose _some_ fields and methods
|
||||||
|
to the rest of the code.
|
||||||
|
|
||||||
### Section 1: Cleanup
|
### Error Reporting and Handling
|
||||||
|
The previous post was rather long, which led me to omit
|
||||||
The previous post was
|
|
||||||
{{< sidenote "right" "long-note" "rather long," >}}
|
|
||||||
Probably not as long as this one, though! I really need to get the
|
|
||||||
size of my posts under control.
|
|
||||||
{{< /sidenote >}} which led me to omit
|
|
||||||
a rather important aspect of the compiler: proper error reporting.
|
a rather important aspect of the compiler: proper error reporting.
|
||||||
Once again our compiler has instances of `throw 0`, which is a cheap way
|
Once again our compiler has instances of `throw 0`, which is a cheap way
|
||||||
of avoiding properly handling a runtime error. Before we move on,
|
of avoiding properly handling a runtime error. Before we move on,
|
||||||
|
@ -62,7 +60,7 @@ automatically assemble the "from" and "to" locations of a nonterminal
|
||||||
from the locations of children, which would be very tedious to write
|
from the locations of children, which would be very tedious to write
|
||||||
by hand. We enable this feature using the following option:
|
by hand. We enable this feature using the following option:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parser.y" 50 50 >}}
|
{{< codelines "C++" "compiler/13/parser.y" 46 46 >}}
|
||||||
|
|
||||||
There's just one hitch, though. Sure, Bison can compute bigger
|
There's just one hitch, though. Sure, Bison can compute bigger
|
||||||
locations from smaller ones, but it must get the smaller ones
|
locations from smaller ones, but it must get the smaller ones
|
||||||
|
@ -97,25 +95,27 @@ to `columns` and `step` to every rule, we can define the
|
||||||
`YY_USER_ACTION` macro, which is run before each token
|
`YY_USER_ACTION` macro, which is run before each token
|
||||||
is processed.
|
is processed.
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}}
|
{{< codelines "C++" "compiler/13/scanner.l" 12 14 >}}
|
||||||
|
|
||||||
We'll see why we are using `drv` soon; for now, you can treat
|
We'll see why we are using `LOC` instead of something like `location` soon;
|
||||||
`location` as if it were a global variable declared in the
|
for now, you can treat `LOC` as if it were a global variable declared
|
||||||
tokenizer. Before processing each token, we ensure that
|
in the tokenizer. Before processing each token, we ensure that
|
||||||
`location` has its `begin` and `end` at the same position,
|
the `yy::location` has its `begin` and `end` at the same position,
|
||||||
and then advance `end` by `yyleng` columns. This is sufficient
|
and then advance `end` by `yyleng` columns. This is sufficient
|
||||||
to make `location` represent our token's source position.
|
to make `LOC` represent our token's source position. For
|
||||||
|
the moment, don't worry too much about `drv`; this is the
|
||||||
|
parse driver, and we will talk about it shortly.
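
To make the `begin`/`end` bookkeeping concrete, here is a small stand-alone sketch. It does not use Bison's generated `yy::location`; the `position_sketch` and `location_sketch` structs are simplified stand-ins written for this illustration, and `advance_over` is a hypothetical helper that plays the role of the per-token action described above.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for Bison's position/location types (assumption:
// only the pieces needed to illustrate the idea are modelled).
struct position_sketch {
    int line = 1;
    int column = 1;
};

struct location_sketch {
    position_sketch begin;
    position_sketch end;

    // Collapse the location so that it starts where the last token ended.
    void step() { begin = end; }

    // Extend the end of the location by `count` columns.
    void columns(int count) {
        assert(count >= 0);
        end.column += count;
    }
};

// Hypothetical helper: what the per-token action does with a token's text.
inline void advance_over(location_sketch& loc, const std::string& token_text) {
    loc.step();
    loc.columns(static_cast<int>(token_text.size()));
}
```

Calling `advance_over` once per token keeps `begin` at the token's first column and `end` just past its last one, which is the same exclusive-end convention used later for line offsets.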
|
||||||
|
|
||||||
So now we have a "global" variable `location` that gives
|
So now we have a "global" variable `LOC` that gives
|
||||||
us the source position of the current token. To get it
|
us the source position of the current token. To get it
|
||||||
to Bison, we have to pass it as an argument to each
|
to Bison, we have to pass it as an argument to each
|
||||||
of the `make_TOKEN` calls. Here are a few sample lines
|
of the `make_TOKEN` calls. Here are a few sample lines
|
||||||
that should give you the general idea:
|
that should give you the general idea:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}}
|
{{< codelines "C++" "compiler/13/scanner.l" 40 43 >}}
|
||||||
|
|
||||||
That last line is actually new. Previously, we somehow
|
That last line is actually new. Previously, we somehow
|
||||||
got away without explicitly sending the EOF token to Bison.
|
got away without explicitly sending the end-of-file token to Bison.
|
||||||
I suspect that this was due to some kind of implicit conversion
|
I suspect that this was due to some kind of implicit conversion
|
||||||
of the Flex macro `YY_NULL` into a token; now that we have
|
of the Flex macro `YY_NULL` into a token; now that we have
|
||||||
to pass a position to every token constructor, such an implicit
|
to pass a position to every token constructor, such an implicit
|
||||||
|
@ -146,10 +146,10 @@ from `ast_binop`:
|
||||||
Finally, we tell Bison to pass the computed location
|
Finally, we tell Bison to pass the computed location
|
||||||
data as an argument when constructing our data structures.
|
data as an argument when constructing our data structures.
|
||||||
This too is a mechanical change, and I think the following
|
This too is a mechanical change, and I think the following
|
||||||
couple of lines demonstrate the general idea in sufficient
|
few lines demonstrate the general idea in sufficient
|
||||||
detail:
|
detail:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parser.y" 107 110 >}}
|
{{< codelines "C++" "compiler/13/parser.y" 92 96 >}}
|
||||||
|
|
||||||
Here, the `@$` character is used to reference the current
|
Here, the `@$` character is used to reference the current
|
||||||
nonterminal's location data.
|
nonterminal's location data.
|
||||||
|
@ -189,7 +189,9 @@ working with files, why not just work directly with the files
|
||||||
created by the user? Instead of reading from `stdin`, we may
|
created by the user? Instead of reading from `stdin`, we may
|
||||||
as well take in a path to a file via `argv`, and read from there.
|
as well take in a path to a file via `argv`, and read from there.
|
||||||
Also, instead of `fseek` and `rewind`, we can just read the file
|
Also, instead of `fseek` and `rewind`, we can just read the file
|
||||||
into memory, and access it like a normal character buffer.
|
into memory, and access it like a normal character buffer. This
|
||||||
|
does mean that we can stick with `stdin`, but it's more conventional
|
||||||
|
to read source code from files, anyway.
|
||||||
|
|
||||||
To address the second issue, we can keep a mapping of line numbers
|
To address the second issue, we can keep a mapping of line numbers
|
||||||
to their locations in the source buffer. This is rather easy to
|
to their locations in the source buffer. This is rather easy to
|
||||||
|
@ -200,7 +202,7 @@ the current source location to the top, marking it as
|
||||||
the beginning of another line. Where exactly we store this
|
the beginning of another line. Where exactly we store this
|
||||||
array is as yet unclear, since we're trying to avoid global variables.
|
array is as yet unclear, since we're trying to avoid global variables.
|
||||||
|
|
||||||
Finally, begin addressing the third issue, we can use Flex's `reentrant`
|
Finally, to begin addressing the third issue, we can use Flex's `reentrant`
|
||||||
option, which makes it so that all of the tokenizer's state is stored in an
|
option, which makes it so that all of the tokenizer's state is stored in an
|
||||||
opaque `yyscan_t` structure, rather than in global variables. This way,
|
opaque `yyscan_t` structure, rather than in global variables. This way,
|
||||||
we can configure `yyin` without setting a global variable, which is a step
|
we can configure `yyin` without setting a global variable, which is a step
|
||||||
|
@ -221,50 +223,38 @@ creation of a _parsing driver_.
|
||||||
The parsing driver is a class (or struct) that holds all the parse-related
|
The parsing driver is a class (or struct) that holds all the parse-related
|
||||||
state. We can arrange for this class to be available to our tokenizing
|
state. We can arrange for this class to be available to our tokenizing
|
||||||
and parsing functions, which will allow us to use it pretty much like we'd
|
and parsing functions, which will allow us to use it pretty much like we'd
|
||||||
use a global variable. We can define it as follows:
|
use a global variable. This is the `drv` that we saw in `YY_USER_ACTION`.
|
||||||
|
We can define it as follows:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.hpp" 36 54 >}}
|
||||||
|
|
||||||
There are quite a few fields here. The `file_name` string represents
|
There aren't many fields here. The `file_name` string represents
|
||||||
the file that we'll be reading code from. the `string_stream` will
|
the file that we'll be reading code from. The `location` field
|
||||||
be used to back up the contents of source file as Flex reads them;
|
will be accessed by Flex via `get_current_location`. Bison will
|
||||||
once Flex is done, the content of the `string_stream` will be
|
store the function and data type definitions it reads into `global_defs`
|
||||||
saved into the `file_content` string.
|
via `get_global_defs`. Finally, `file_m` will be used to keep track
|
||||||
|
of the content of the file we're reading, as well as the line offsets
|
||||||
|
within that file. Notice that a couple of these fields are pointers
|
||||||
|
that we take by reference in the constructor. The `parse_driver` doesn't
|
||||||
|
_own_ the global definitions, nor the file manager. They exist outside
|
||||||
|
of it, and will continue to be used in other ways the `parse_driver`
|
||||||
|
does not need to know about. Also, the `LOC` variable in Flex is
|
||||||
|
actually a call to `get_current_location`:
|
||||||
|
|
||||||
The next three fields deal with tracking source code
|
{{< codelines "C++" "compiler/13/scanner.l" 15 15 >}}
|
||||||
locations. The `location` field will be accessed by Flex
|
|
||||||
via `drv.location` (where `drv` is a reference to our driver class).
|
|
||||||
The `file_offset` and `line_offsets` fields will be used to
|
|
||||||
keep track of where each line begins, as we have discussed above.
|
|
||||||
Finally, `global_defs` will be the new home of our top-level
|
|
||||||
definitions.
|
|
||||||
|
|
||||||
The methods on `parse_driver` are rather simple, too:
|
The methods of `parse_driver` are rather simple. The majority of
|
||||||
|
them deal with giving access to the parser's members: the `yy::location`,
|
||||||
|
the `definition_group`, and the `file_mgr`. The only exception
|
||||||
|
to this is `operator()`, which we use to actually trigger the parsing process.
|
||||||
|
We'll make this method return `true` if parsing succeeded, and `false`
|
||||||
|
otherwise (if, say, the file we tried to read doesn't exist).
|
||||||
|
Here's its implementation:
|
||||||
|
|
||||||
* `run_parse` handles the initialization of the tokenizer
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 48 60 >}}
|
||||||
and parser, which includes obtaining the `FILE*` and configuring
|
|
||||||
Flex to use it. It also handles invoking the parsing code.
|
|
||||||
We'll make this method return `true` if parsing succeeded,
|
|
||||||
and `false` otherwise (if, say, the file we tried to read doesn't exist).
|
|
||||||
* `write` will be called from Flex, and will allow us to
|
|
||||||
record the content of the file we're processing to the `string_stream`.
|
|
||||||
We've already seen it used in the `YY_USER_ACTION` macro.
|
|
||||||
* `mark_line` will also be called from Flex, and will mark the current
|
|
||||||
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
|
|
||||||
* `get_index` and `get_line_end` will be used for converting
|
|
||||||
`yy::location` instances to offsets within the source code buffer.
|
|
||||||
* `print_location` will be used for printing errors.
|
|
||||||
It will print the lines spanned by the given location, with the
|
|
||||||
location itself colored and underlined if the last argument is `true`.
|
|
||||||
This will make our errors easier on the eyes.
|
|
||||||
|
|
||||||
Let's take a look at their implementations. First, `run_parse`:
|
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
|
|
||||||
|
|
||||||
We try to open the user-specified file, and return `false` if we can't.
|
We try to open the user-specified file, and return `false` if we can't.
|
||||||
We then initialize `line_offsets` as we discussed above. After
|
After this, we start doing the setup specific to a reentrant
|
||||||
this, we start doing the setup specific to a reentrant
|
|
||||||
Flex scanner. We declare a `yyscan_t` variable, which
|
Flex scanner. We declare a `yyscan_t` variable, which
|
||||||
will contain all of Flex's state. Then, we initialize
|
will contain all of Flex's state. Then, we initialize
|
||||||
it using `yylex_init`. Finally, since we can no longer
|
it using `yylex_init`. Finally, since we can no longer
|
||||||
|
@ -279,24 +269,65 @@ We'll come back to how this works in a moment. With
|
||||||
the scanner and parser initialized, we invoke `parser::operator()`,
|
the scanner and parser initialized, we invoke `parser::operator()`,
|
||||||
which actually runs the Flex- and Bison-generated code.
|
which actually runs the Flex- and Bison-generated code.
|
||||||
To clean up, we run `yylex_destroy` and `fclose`. Finally,
|
To clean up, we run `yylex_destroy` and `fclose`. Finally,
|
||||||
we extract the contents of our file into the `file_contents`
|
we call `file_mgr::finalize`, and return. But what
|
||||||
string, and return.
|
_is_ `file_mgr`?
|
||||||
|
|
||||||
Next, the `write` method. For the most part, this method
|
The `file_mgr` class does two things: it stores the part of the file
|
||||||
is a proxy for the `write` method of our `string_stream`:
|
that has already been read by Flex in memory, and it keeps track of
|
||||||
|
where each line in our source file begins within the text. Here is its
|
||||||
|
definition:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
|
||||||
|
|
||||||
|
In this class, the `string_stream` member is used to construct
|
||||||
|
an `std::string` from the bits of text that Flex reads,
|
||||||
|
processes, and feeds to the `file_mgr` using the `write` method.
|
||||||
|
It's more efficient to use a string stream than to concatenate
|
||||||
|
strings repeatedly. Once Flex is finished processing the file,
|
||||||
|
the final contents of the `string_stream` are transferred into
|
||||||
|
the `file_contents` string using the `finalize` method. The `offset`
|
||||||
|
and `line_offsets` fields will be used as we described earlier: each time Flex
|
||||||
|
encounters the `\n` character, the `offset` variable will be pushed
|
||||||
|
on top of the `line_offsets` vector, marking the beginning of
|
||||||
|
the corresponding line. The methods of the class are as follows:
|
||||||
|
|
||||||
|
* `write` will be called from Flex, and will allow us to
|
||||||
|
record the content of the file we're processing to the `string_stream`.
|
||||||
|
We've already seen it used in the `YY_USER_ACTION` macro.
|
||||||
|
* `mark_line` will also be called from Flex, and will mark the current
|
||||||
|
`file_offset` as the beginning of a line by pushing it into `line_offsets`.
|
||||||
|
* `finalize` will be called by the `parse_driver` when the parsing
|
||||||
|
finishes. At this time, the `string_stream` should contain all of
|
||||||
|
the input file, and this data is transferred to `file_contents`, as
|
||||||
|
we mentioned above.
|
||||||
|
* `get_index` and `get_line_end` will be used for converting
|
||||||
|
`yy::location` instances to offsets within the source code buffer.
|
||||||
|
* `print_location` will be used for printing errors.
|
||||||
|
It will print the lines spanned by the given location, with the
|
||||||
|
location itself colored and underlined if the last argument is `true`.
|
||||||
|
This will make our errors easier on the eyes.
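
Before looking at the real code, a condensed sketch may help show how these methods fit together. The member names follow the prose above, but the bodies are my own assumptions rather than the code in the repository; `print_location` is left out, and `contents` is a hypothetical accessor added only so the result can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Illustrative sketch only, not the repository's `file_mgr`.
class file_mgr_sketch {
    private:
        std::ostringstream string_stream;
        std::string file_contents;
        std::size_t offset = 0;
        std::vector<std::size_t> line_offsets;

    public:
        // The first line always begins at offset 0.
        file_mgr_sketch() { line_offsets.push_back(0); }

        // Record text handed over by the tokenizer and track our position.
        void write(const char* buffer, std::size_t len) {
            string_stream.write(buffer, static_cast<std::streamsize>(len));
            offset += len;
        }

        // Call right after a '\n' has been written: the next line starts here.
        void mark_line() { line_offsets.push_back(offset); }

        // Move the accumulated text into a plain string once reading is done.
        void finalize() { file_contents = string_stream.str(); }

        // Convert a 1-based line and column into an offset into the buffer.
        std::size_t get_index(int line, int column) const {
            assert(line > 0 && column > 0);
            assert(static_cast<std::size_t>(line) <= line_offsets.size());
            return line_offsets[line - 1] + static_cast<std::size_t>(column - 1);
        }

        // Exclusive end of a line: the start of the next line, or the end
        // of the file when `line` is the last one.
        std::size_t get_line_end(int line) const {
            if (static_cast<std::size_t>(line) == line_offsets.size()) {
                return file_contents.size();
            }
            return get_index(line + 1, 1);
        }

        // Hypothetical accessor, present only for this example.
        const std::string& contents() const { return file_contents; }
};
```

With this layout, the text of line `n` is the half-open range from `get_index(n, 1)` to `get_line_end(n)`, newline included for every line but the last.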
|
||||||
|
|
||||||
|
Let's take a look at their implementations. First, `write`.
|
||||||
|
For the most part, this method is a proxy for the `write`
|
||||||
|
method of our `string_stream`:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 9 12 >}}
|
||||||
|
|
||||||
We do, however, also keep track of the `file_offset` variable
|
We do, however, also keep track of the `file_offset` variable
|
||||||
here, which ensures we have up-to-date information
|
here, which ensures we have up-to-date information
|
||||||
regarding our position in the source file. The implementation
|
regarding our position in the source file. The implementation
|
||||||
of `mark_line` uses this information:
|
of `mark_line` uses this information:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 14 16 >}}
|
||||||
|
|
||||||
|
The `finalize` method is trivial, and requires little additional
|
||||||
|
discussion:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 18 20 >}}
|
||||||
|
|
||||||
Once we have the line offsets, `get_index` becomes very simple:
|
Once we have the line offsets, `get_index` becomes very simple:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 22 25 >}}
|
||||||
|
|
||||||
Here, we use an assertion for the first time. Calling
|
Here, we use an assertion for the first time. Calling
|
||||||
`get_index` with a negative or zero line doesn't make
|
`get_index` with a negative or zero line doesn't make
|
||||||
|
@ -313,7 +344,7 @@ beginning of the next line. We stick to the C convention
|
||||||
of marking 'end' indices exclusive (pointing just past
|
of marking 'end' indices exclusive (pointing just past
|
||||||
the end of the array):
|
the end of the array):
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 27 30 >}}
|
||||||
|
|
||||||
Since `line_offsets` has as many elements as there are lines,
|
Since `line_offsets` has as many elements as there are lines,
|
||||||
the last line number would be equal to the vector's size.
|
the last line number would be equal to the vector's size.
|
||||||
|
@ -333,7 +364,7 @@ we sprinkle the ANSI escape codes to enable and disable
|
||||||
special formatting, respectively. For now, the special
|
special formatting, respectively. For now, the special
|
||||||
formatting involves underlining the text and making it red.
|
formatting involves underlining the text and making it red.
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.cpp" 32 46 >}}
|
||||||
|
|
||||||
Finally, to get the forward declarations for the `yy*` functions
|
Finally, to get the forward declarations for the `yy*` functions
|
||||||
and types, we set the `header-file` option in Flex:
|
and types, we set the `header-file` option in Flex:
|
||||||
|
@ -386,12 +417,7 @@ the state and the parse driver, we have to define the
|
||||||
this forward declaration will be used by both Flex
|
this forward declaration will be used by both Flex
|
||||||
and Bison:
|
and Bison:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}}
|
{{< codelines "C++" "compiler/13/parse_driver.hpp" 56 58 >}}
|
||||||
|
|
||||||
Finally, we can change our `main.cpp` file to use the
|
|
||||||
`parse_driver`:
|
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
|
|
||||||
|
|
||||||
#### Improving Exceptions
|
#### Improving Exceptions
|
||||||
Now, it's time to add location data (and a little bit more) to our
|
Now, it's time to add location data (and a little bit more) to our
|
||||||
|
@ -421,7 +447,7 @@ the following two lines to our CMakeLists.txt:
|
||||||
Now, let's add a new base class for all of our compiler errors,
|
Now, let's add a new base class for all of our compiler errors,
|
||||||
unsurprisingly called `compiler_error`:
|
unsurprisingly called `compiler_error`:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}}
|
{{< codelines "C++" "compiler/13/error.hpp" 10 26 >}}
|
||||||
|
|
||||||
We'll put some 'common' exception functionality
|
We'll put some 'common' exception functionality
|
||||||
into the `print_location` and `print_about` methods. If the error
|
into the `print_location` and `print_about` methods. If the error
|
||||||
|
@ -467,11 +493,7 @@ first, and is treat like the "correct" type. The
|
||||||
`right` type, on the other hand, is treated
|
`right` type, on the other hand, is treated
|
||||||
like the "wrong" type that should have been
|
like the "wrong" type that should have been
|
||||||
unifiable with `left`. This will affect the
|
unifiable with `left`. This will affect the
|
||||||
calling conventions of our unification code. In
|
calling conventions of our unification code.
|
||||||
`main`, we remove all our old exception printing code
|
|
||||||
in favor of calls to `pretty_print`:
|
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
|
|
||||||
|
|
||||||
Now, we can go through and find all the places where
|
Now, we can go through and find all the places where
|
||||||
we `throw 0`. One such place was in the data type
|
we `throw 0`. One such place was in the data type
|
||||||
|
@ -513,7 +535,7 @@ In general, this change is also rather mechanical, but, to
|
||||||
maintain a balance between exceptions and assertions, here
|
maintain a balance between exceptions and assertions, here
|
||||||
are a couple more assertions from `type_env`:
|
are a couple more assertions from `type_env`:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}}
|
{{< codelines "C++" "compiler/13/type_env.cpp" 76 77 >}}
|
||||||
|
|
||||||
Once again, it should not be possible for the compiler
|
Once again, it should not be possible for the compiler
|
||||||
to try to generalize the type of a variable that doesn't
|
to try to generalize the type of a variable that doesn't
|
||||||
|
@ -528,35 +550,34 @@ To fix this, we add a new `loc` parameter to `unify`,
|
||||||
which we make optional to allow for unification without
|
which we make optional to allow for unification without
|
||||||
a known location. Here's the declaration:
|
a known location. Here's the declaration:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}}
|
{{< codelines "C++" "compiler/13/type.hpp" 92 92 >}}
|
||||||
|
|
||||||
The change to the implementation is mechanical and repetitive,
|
The change to the implementation is mechanical and repetitive,
|
||||||
so instead of showing you the whole method, I'll settle for
|
so instead of showing you the whole method, I'll settle for
|
||||||
a couple of lines:
|
a couple of lines:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}}
|
{{< codelines "C++" "compiler/13/type.cpp" 121 122 >}}
|
||||||
|
|
||||||
We want to make sure that a location provided to the
|
We want to make sure that a location provided to the
|
||||||
top-level call to `unify` is also forwarded to the
|
top-level call to `unify` is also forwarded to the
|
||||||
recursive calls, so we have to explicitly add it
|
recursive calls, so we have to explicitly add it
|
||||||
to the call.
|
to the call.
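
The forwarding is easy to get wrong, so here is a rough, self-contained sketch of the pattern. Everything in it is a stand-in: `loc_sketch` replaces `yy::location`, `type_sketch` has no type variables (so `unify_sketch` is really just a structural comparison), and the use of `std::optional` for the defaulted parameter is my assumption about one reasonable way to make the location optional. It needs C++17.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Stand-in for `yy::location`; only a line number is kept.
struct loc_sketch { int line; };

// Minimal type representation: a named base type or a function type.
struct type_sketch {
    std::string name;                          // used by base types
    std::shared_ptr<type_sketch> left, right;  // set for function types
    bool is_arrow() const { return left && right; }
};
using type_ptr_sketch = std::shared_ptr<type_sketch>;

inline type_ptr_sketch base(const std::string& name) {
    auto t = std::make_shared<type_sketch>();
    t->name = name;
    return t;
}

inline type_ptr_sketch arrow(type_ptr_sketch l, type_ptr_sketch r) {
    auto t = std::make_shared<type_sketch>();
    t->left = std::move(l);
    t->right = std::move(r);
    return t;
}

struct unification_error_sketch {
    type_ptr_sketch left, right;
    std::optional<loc_sketch> loc;
};

// `loc` defaults to "no location", so callers without one still compile.
inline void unify_sketch(const type_ptr_sketch& l, const type_ptr_sketch& r,
                         const std::optional<loc_sketch>& loc = std::nullopt) {
    assert(l && r);
    if (l->is_arrow() && r->is_arrow()) {
        // Forward `loc` explicitly; otherwise the recursive calls would
        // fall back to the default and lose the position.
        unify_sketch(l->left, r->left, loc);
        unify_sketch(l->right, r->right, loc);
        return;
    }
    if (l->is_arrow() != r->is_arrow() || l->name != r->name) {
        throw unification_error_sketch{l, r, loc};
    }
}
```

If the two recursive calls omitted `loc`, an error raised deep inside a function type would carry no position, even though the top-level caller supplied one.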
|
||||||
|
|
||||||
With all of that done, we can finally stand back and
|
We'll also have to update the 'main' code to call the
|
||||||
marvel at the results of our hard work. Here is what a
|
`pretty_print` methods, but there's another big change
|
||||||
basic unification error looks like now:
|
that we're going to make before then. However, once that
|
||||||
|
change is made, our errors will look a lot better.
|
||||||
{{< figure src="unification_error.png" caption="The result of a unification error." >}}
|
Here is what's printed out to the user when a type error
|
||||||
|
occurs:
|
||||||
I used an image to show colors, but here is the content of the error in textual form:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
an error occured while checking the types of the program: failed to unify types
|
an error occured while checking the types of the program: failed to unify types
|
||||||
occuring on line 2:
|
occuring on line 2:
|
||||||
3 + False
|
3 + False
|
||||||
the expected type was:
|
the expected type was:
|
||||||
!Int
|
Int
|
||||||
while the actual type was:
|
while the actual type was:
|
||||||
!Bool
|
Bool
|
||||||
```
|
```
|
||||||
|
|
||||||
The exclamation marks in front of the two types are due to some
|
The exclamation marks in front of the two types are due to some
|
||||||
|
@ -572,3 +593,275 @@ data Pair a a = { MkPair a a }
|
||||||
Now, not only have we eliminated the lazy uses of `throw 0` in our
|
Now, not only have we eliminated the lazy uses of `throw 0` in our
|
||||||
code, but we've also improved the presentation of the errors
|
code, but we've also improved the presentation of the errors
|
||||||
to the user!
|
to the user!
|
||||||
|
|
||||||
|
### Rethinking Name Mangling
|
||||||
|
In the previous post, I said the following:
|
||||||
|
|
||||||
|
> One more thing. Let’s adopt the convention of storing mangled names into the compilation environment. This way, rather than looking up mangled names only for global functions, which would be a ‘gotcha’ for anyone working on the compiler, we will always use the mangled names during compilation.
|
||||||
|
|
||||||
|
Now that I've had some more time to think about it
|
||||||
|
(and now that I've returned to the compiler after
|
||||||
|
a brief hiatus), I think that this was not the right call.
|
||||||
|
Mangled names make sense when translating to LLVM; we certainly
|
||||||
|
don't want to declare two LLVM functions with the same name.
|
||||||
|
But things are different for local variables. Our local variables
|
||||||
|
are graphs on a stack, and are not actually compiled to LLVM
|
||||||
|
definitions. It doesn't make sense to mangle their names, since
|
||||||
|
their names aren't present anywhere in the final executable.
|
||||||
|
It's not even "consistent" to mangle them, since global definitions
|
||||||
|
are compiled directly to __PushGlobal__ instructions, while local
|
||||||
|
variables are only referenced through the current `env`.
|
||||||
|
So, I decided to reverse my decision. We will go back to
|
||||||
|
placing variable names directly onto `env_var`. Here's
|
||||||
|
an example of this from `global_scope.cpp`:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
|
||||||
|
|
||||||
|
Now that we've started using assertions, I also think it's worth
|
||||||
|
putting our new invariant -- "only global definitions have mangled
|
||||||
|
names" -- into code:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/type_env.cpp" 35 43 >}}
|
||||||
|
|
||||||
|
Furthermore, we'll _require_ that a global definition
|
||||||
|
has a mangled name. This way, we can be more confident
|
||||||
|
that a variable from a __PushGlobal__ instruction
|
||||||
|
is referencing the right function. To achieve
|
||||||
|
this, we change `get_mangled_name` to stop
|
||||||
|
returning the input string if a mangled name was not
|
||||||
|
found; now that we _must_ have a mangled name, doing
|
||||||
|
so is effectively obscuring the error. Instead,
|
||||||
|
we add another assertion: if an environment scope doesn't
|
||||||
|
contain a mangled name for a variable, then it _must_
|
||||||
|
have a parent. We end up with the following:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/type_env.cpp" 45 51 >}}
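
For readers without the repository at hand, the lookup can be approximated as follows. This is a pared-down sketch under my own assumptions: `scope_sketch` is a hypothetical type that stores nothing but a parent pointer and a map of mangled names, whereas the real `type_env` holds considerably more.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, minimal environment scope used only for illustration.
struct scope_sketch {
    scope_sketch* parent = nullptr;
    std::map<std::string, std::string> mangled_names;

    const std::string& get_mangled_name(const std::string& name) const {
        auto it = mangled_names.find(name);
        if (it != mangled_names.end()) return it->second;
        // No fallback to `name` itself: running out of scopes means the
        // compiler asked for a name that was never mangled, which is a bug.
        assert(parent != nullptr);
        return parent->get_mangled_name(name);
    }
};
```

The assertion turns what used to be a silently wrong answer (the unmangled input string) into an immediate failure at the point of the mistake.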
|
||||||
|
|
||||||
|
Since looking up a mangled name for a non-global variable
|
||||||
|
will now result in an assertion failure, we have to change
|
||||||
|
`ast_lid::compile` to only call `get_mangled_name` once
|
||||||
|
it ensures that the variable being compiled is, in fact,
|
||||||
|
global:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/ast.cpp" 58 63 >}}
|
||||||
|
|
||||||
|
Since all global functions now need to have mangled
|
||||||
|
names, we run into a bit of a problem. What are
|
||||||
|
the mangled names of `(+)`, `(-)`, and so on? We could
|
||||||
|
continue to hardcode them as `plus`, `minus`, etc., but this can
|
||||||
|
(and currently does!) lead to errors. Consider the following
|
||||||
|
piece of code:
|
||||||
|
|
||||||
|
```
|
||||||
|
defn plus x y = { x + y }
|
||||||
|
defn main = { plus 320 6 }
|
||||||
|
```
|
||||||
|
|
||||||
|
We've hardcoded the mangled name of `(+)` to be `plus`. However,
|
||||||
|
`global_scope` doesn't know about this, so when the actual
|
||||||
|
`plus` function gets translated, it also gets assigned the
|
||||||
|
mangled name `plus`. The name is also overwritten in the
|
||||||
|
`llvm_context`, which effectively means that `(+)` is
|
||||||
|
now compiled to a call of the user-defined `plus` function.
|
||||||
|
If we didn't overwrite the name, we would've run into an assertion
|
||||||
|
failure in this scenario anyway. In short, this example illustrates
|
||||||
|
an important point: mangling information needs to be available
|
||||||
|
outside of a `global_scope`. We don't want to do this by having
|
||||||
|
every function take in a `global_scope` to access the mangling
|
||||||
|
information; instead, we'll store the mangling information in
|
||||||
|
a new `mangler` class, which `global_scope` will take as an argument.
|
||||||
|
The new class is very simple:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/mangler.hpp" 5 11 >}}
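To make the idea concrete without the rest of the compiler, here is a rough, self-contained sketch of a mangler. It is not the code from `mangler.hpp`: the class name `name_mangler_sketch`, the `occurrence_count` field, and the `_1`, `_2` suffix scheme are all assumptions made for illustration. The point is only that a single object hands out every mangled name, so the built-in `(+)` and a user-defined `plus` can never be given the same one.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the post's mangler: every request for a
// mangled name goes through one object, so no name is handed out twice.
class name_mangler_sketch {
    private:
        // How many times each source-level name has been requested so far.
        std::map<std::string, int> occurrence_count;
    public:
        std::string new_mangled_name(const std::string& str) {
            // operator[] default-initializes the count to 0 on first use.
            int seen = occurrence_count[str]++;
            // The first request keeps the name as-is (assumed scheme);
            // later requests get a numeric suffix.
            if(seen == 0) return str;
            return str + "_" + std::to_string(seen);
        }
};
```

With a scheme like this, registering `plus` for the built-in operator first and then for the user's function would yield two distinct names, rather than silently overwriting one with the other.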
|
||||||
|
|
||||||
|
As with `parse_driver`, `global_scope` takes `mangler` by reference
|
||||||
|
and stores a pointer:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/global_scope.hpp" 50 50 >}}
|
||||||
|
|
||||||
|
The implementation of `new_mangled_name` doesn't change, so I'm
|
||||||
|
not going to show it here. With this new mangling information
|
||||||
|
in hand, we can now correctly set the mangled names of binary
|
||||||
|
operators:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.cpp" 22 27 >}}
|
||||||
|
|
||||||
|
Wait a moment, what's a `compiler`? Let's talk about that next.
|
||||||
|
|
||||||
|
### A Top-Level Class
|
||||||
|
Now that we've moved name mangling out of `global_scope`, we have
|
||||||
|
to put it somewhere. The same goes for the global definition group
|
||||||
|
and the file manager that are given to `parse_driver`. The two
|
||||||
|
classes _make use_ of the other data, but they don't _own it_.
|
||||||
|
That's why they take it by reference, and store it as a pointer.
|
||||||
|
They're just temporarily allowed access.
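The "take by reference, store as a pointer" convention can be sketched in isolation. The types below (`definition_list_sketch`, `driver_sketch`) are hypothetical stand-ins, not the real `definition_group` or `parse_driver`; they only show how a component can use data that something else owns.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a piece of shared compiler state.
struct definition_list_sketch {
    std::vector<std::string> names;
};

// A component that uses the shared state without owning it.
class driver_sketch {
    private:
        // Non-owning pointer: this class never allocates or deletes it.
        definition_list_sketch* defs;
    public:
        // Taking a reference means callers cannot pass "nothing";
        // we store its address for later use.
        explicit driver_sketch(definition_list_sketch& d) : defs(&d) {}

        void record(const std::string& name) {
            defs->names.push_back(name);
        }
};
```

The catch is that whoever owns the data has to outlive the component that borrows it, which is exactly why it matters where the owner lives.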
|
||||||
|
|
||||||
|
So, what should be the owner of all of these disparate components?
|
||||||
|
Thus far, that has been the `main` function, or the utility
|
||||||
|
functions that it calls out to. However, this is in bad taste:
|
||||||
|
we have related data and operations on it, but we don't group
|
||||||
|
them into an object. We can group all of the components of our
|
||||||
|
compiler into a `compiler` object, and leave `main.cpp` with
|
||||||
|
exception printing code.
|
||||||
|
|
||||||
|
The definition of the `compiler` class begins with all of the data
|
||||||
|
structures that we use in the process of compilation:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.hpp" 12 20 >}}
|
||||||
|
|
||||||
|
There's a loose ordering to these fields. In C++, class members are
|
||||||
|
initialized in the order they are declared; we therefore want to make
|
||||||
|
sure that fields that are depended on by other fields are initialized first.
|
||||||
|
Otherwise, I tried to keep the order consistent with the conceptual path
|
||||||
|
of the code through the compiler.
|
||||||
|
* Parsing happens first, so we begin with `parse_driver`, which needs a
|
||||||
|
`file_manager` (to populate with line information) and a `definition_group`
|
||||||
|
(to receive the global definitions from the parser).
|
||||||
|
* We then proceed to typechecking, for which we use a global `type_env_ptr`
|
||||||
|
(to define the built-in functions and constructors) and a `type_mgr` (to
|
||||||
|
manage the assignments of type variables).
|
||||||
|
* Once a program is typechecked, we transform it, eliminating local
|
||||||
|
function definitions and lambda functions. This is done by storing
|
||||||
|
newly-emitted global functions into the `global_scope`, which requires a
|
||||||
|
`mangler` to generate new names for the target functions.
|
||||||
|
* Finally, to generate LLVM IR, we need our `llvm_context` class.
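The point about initialization order is easy to check with a toy example. This is only a sketch, with made-up types (`init_log`, `tracked_member`, `order_sketch`) that have nothing to do with the real `compiler` class: each member records its name into a log as it is constructed, and the log ends up in declaration order.

```cpp
#include <string>
#include <vector>

// Records the order in which members are constructed.
struct init_log {
    std::vector<std::string> order;
};

// A member that announces its own construction to the log.
struct tracked_member {
    tracked_member(init_log& log, const std::string& name) {
        log.order.push_back(name);
    }
};

struct order_sketch {
    // Declared first, so it is constructed before the members that use it.
    init_log log;
    tracked_member first;
    tracked_member second;

    // Members are constructed in the order they are declared above,
    // regardless of how this list is written. Keeping the two in sync
    // avoids the reordering warning that GCC and Clang can emit.
    order_sketch() : first(log, "first"), second(log, "second") {}
};
```

If `log` were declared after `first`, the constructor of `first` would be handed a reference to an object that hasn't been constructed yet. The same reasoning is why a field that another field depends on has to come earlier in the class.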
|
||||||
|
|
||||||
|
The methods of the compiler are arranged similarly:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.hpp" 22 31 >}}
|
||||||
|
|
||||||
|
The methods go as follows:
|
||||||
|
|
||||||
|
* `add_default_types` adds the built-in types to the `global_env`.
|
||||||
|
At this point in the post, these types only include `Int`. However,
|
||||||
|
in the second section, we'll make `Bool` a built-in type, too.
|
||||||
|
* `add_binop_type` adds a single binary operator to the global
|
||||||
|
type environment. We saw its implementation earlier: it deals
|
||||||
|
with both binding a type, and setting a mangled name.
|
||||||
|
* `add_default_function_types` adds the types for each binary operator,
|
||||||
|
and also for the `True` and `False` constructors (which we will
|
||||||
|
cover in the second section).
|
||||||
|
* `parse`, `typecheck`, `translate` and `compile` all do exactly
|
||||||
|
what they say. In this case, compilation refers to creating G-machine
|
||||||
|
instructions.
|
||||||
|
* `create_llvm_binop` creates an internal function that forces the
|
||||||
|
evaluation of its two arguments, and actually applies the given binary
|
||||||
|
operator. Recall that the `(+)` in user code constructs a call to this
|
||||||
|
function, but leaves it unevaluated until it's needed.
|
||||||
|
* `generate_llvm` converts all the definitions in `global_scope`, which
|
||||||
|
are at this point compiled into G-machine `instruction`s, into LLVM IR.
|
||||||
|
* `output_llvm` contains all the code to actually generate an object
|
||||||
|
file from the LLVM IR.
|
||||||
|
|
||||||
|
These functions are mostly taken from part 12's `main.cpp`, and adjusted
|
||||||
|
to use the `compiler`'s members rather than local definitions or arguments.
|
||||||
|
You should compare part 12's
|
||||||
|
[`main.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/12/main.cpp)
|
||||||
|
file with the
|
||||||
|
[`compiler.cpp`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/13/compiler.cpp)
|
||||||
|
file that we end up with at the end of this post.
|
||||||
|
|
||||||
|
Next, we have the compiler's constructor, and its `operator()`. The
|
||||||
|
latter, analogously to our parse driver, will trigger the compilation
|
||||||
|
process. Their implementations are straightforward:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
|
||||||
|
|
||||||
|
We also add a couple of methods to give external code access to
|
||||||
|
some of the compiler's data structures. I omit their (trivial)
|
||||||
|
implementations, but they have the following signatures:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.hpp" 35 36 >}}
|
||||||
|
|
||||||
|
With all the compilation code tucked into our new `compiler` class,
|
||||||
|
`main` becomes very simple. We also finally get to use our exception
|
||||||
|
pretty printing code:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
|
||||||
|
|
||||||
|
That's all for the cleanup! We've added locations and more errors
|
||||||
|
to the compiler, stopped throwing `0` in favor of proper exceptions
|
||||||
|
or assertions, made name mangling more reasonable, fixed a bug with
|
||||||
|
accidentally shadowing default functions, and organized our compilation
|
||||||
|
process into a `compiler` class.
|
||||||
|
|
||||||
|
### Keeping Things Private
|
||||||
|
Hand-writing or generating hundreds of trivial getters and setters
|
||||||
|
for the fields of a data class (which is standard in the world of Java) seems
|
||||||
|
absurd to me. So, for most of this project, I stuck with
|
||||||
|
`struct`s, rather than classes. But this is not a good policy
|
||||||
|
to apply _everywhere_. I still think it makes sense to make
|
||||||
|
data structures like `ast` and `type` public-by-default;
|
||||||
|
however, I _don't_ think that way about classes like `type_mgr`,
|
||||||
|
`llvm_context`, `type_env`, and `env`. All of these have information
|
||||||
|
that we should never be accessing directly. Some guard this
|
||||||
|
information with assertions. In short, it should be protected.
|
||||||
|
|
||||||
|
For most classes, the changes are mechanical. For instance, we
|
||||||
|
can make `type_env` a class simply by changing its declaration,
|
||||||
|
and marking all of its functions public. This requires a slight
|
||||||
|
refactoring of a line that used its `parent` field. Here's
|
||||||
|
what it used to be (in context):
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/12/main.cpp" 57 60 >}}
|
||||||
|
|
||||||
|
And here's what it is now:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/compiler.cpp" 55 58 >}}
|
||||||
|
|
||||||
|
We always declare the `definition_defn` function in
|
||||||
|
the `global_env`. Thus, that's the only environment
|
||||||
|
we need to know about to update the mangled name.
|
||||||
|
|
||||||
|
The deal with `env` is about as simple. We just make
|
||||||
|
it and its two descendants into classes, and mark their
|
||||||
|
methods and constructors public. The same
|
||||||
|
goes for `global_scope`. To make `type_mgr`
|
||||||
|
a class, we have to add a new method: `lookup`.
|
||||||
|
Here's its implementation:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/type.cpp" 81 85 >}}
|
||||||
|
|
||||||
|
It's used in `type_var::print` as follows:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/type.cpp" 28 35 >}}
|
||||||
|
|
||||||
|
We can't use `resolve` here because it takes (and returns)
|
||||||
|
a `type_ptr`. If we make it _take_ a `type*`, it won't
|
||||||
|
be able to return its argument if it's already resolved. If we
|
||||||
|
allow it to _return_ `type*`, we won't have an owning
|
||||||
|
reference. We also don't want to duplicate the
|
||||||
|
method just for this one call. Notice, though, how similar
|
||||||
|
`type_var::print`/`lookup` and `resolve` are in terms of execution.
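For readers without the source open, the shape of a non-owning lookup can be sketched roughly like this. Everything here is a simplified assumption: `type_sketch` is a flat struct rather than the real `type` hierarchy, `bind` and `print_var` are hypothetical helpers, and the lookup is a single step rather than a chain of type variables.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, flattened stand-in for a type.
struct type_sketch {
    std::string name;
};
using type_sketch_ptr = std::shared_ptr<type_sketch>;

class type_mgr_sketch {
    private:
        // The manager owns the types that variables are bound to.
        std::map<std::string, type_sketch_ptr> types;
    public:
        void bind(const std::string& var, type_sketch_ptr t) {
            types[var] = std::move(t);
        }

        // Non-owning lookup: a raw pointer if the variable is bound,
        // nullptr otherwise. Callers only borrow the result.
        const type_sketch* lookup(const std::string& var) const {
            auto it = types.find(var);
            return it == types.end() ? nullptr : it->second.get();
        }
};

// Printing a variable: show what it is bound to, or fall back
// to the variable's own name if it is still unbound.
std::string print_var(const type_mgr_sketch& mgr, const std::string& var) {
    const type_sketch* bound = mgr.lookup(var);
    return bound ? bound->name : var;
}
```

Returning a raw pointer is fine here because printing only needs to look at the type for a moment; nothing keeps the pointer around after the call returns.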
|
||||||
|
|
||||||
|
The change for `llvm_context` requires a little more work.
|
||||||
|
Right now, `ctx.builder` is used a _lot_ in `instruction.cpp`.
|
||||||
|
Since we don't want to forward each of the LLVM builder methods,
|
||||||
|
and since it feels weird to make `llvm_context` extend `llvm::IRBuilder`,
|
||||||
|
we'll just provide a getter for the `builder` field. The
|
||||||
|
same goes for `module`:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/llvm_context.hpp" 46 47 >}}
|
||||||
|
|
||||||
|
Here's what some of the code from `instruction.cpp` looks like now:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/instruction.cpp" 144 145 >}}
|
||||||
|
|
||||||
|
Right now, the `ctx` field of the `llvm_context` (which contains
|
||||||
|
the `llvm::LLVMContext`) is only externally used to create
|
||||||
|
instances of `llvm::BasicBlock`. We'll add a proxy method
|
||||||
|
for this functionality:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/llvm_context.cpp" 174 176 >}}
|
||||||
|
|
||||||
|
Finally, `instruction_pushglobal` needs to access the
|
||||||
|
`llvm::Function` instances that we create in the process
|
||||||
|
of compilation. We add a new `get_custom_function` method
|
||||||
|
to support this, which automatically prefixes the function
|
||||||
|
name with `f_`, much like `create_custom_function`:
|
||||||
|
|
||||||
|
{{< codelines "C++" "compiler/13/llvm_context.cpp" 292 294 >}}
|
||||||
|
|
||||||
|
I think that's enough. If we had chosen to turn more compiler
|
||||||
|
data structures into classes, I think we would've quickly drowned
|
||||||
|
in one-line getter and setter methods.
|
||||||
|
|