title | date | tags | description | |||
---|---|---|---|---|---|---|
Compiling a Functional Language Using C++, Part 13 - More Improvements | 2020-09-10T18:50:02-07:00 |
|
In this post, we clean up our compiler and add some basic optimizations. |
In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added let/in
and lambda expressions to our compiler. At the end of that post, I mentioned
that before we move on to bigger and better things, I wanted to take a
step back and clean up the compiler.
Recently, I got around to doing that. Unfortunately, I also got around to doing a lot more. Furthermore, I managed to make the changes in such a way that I can't cleanly separate the 'cleanup' and 'optimization' portions of my work. This is partially due to the way in which I organize code, where each post is associated with a version of the compiler with the necessary changes. Because of all this, instead of making this post about the cleanup, and the next post about the optimizations, I have to merge them into one.
So, this post is split into two major portions: cleanup, which deals mostly with touching up exceptions and improving the 'name mangling' logic, and optimizations, which deals with adding special treatment to booleans, unboxing integers, and implementing more binary operators.
Section 1: Cleanup
The previous post was
{{< sidenote "right" "long-note" "rather long," >}}
Probably not as long as this one, though! I really need to get the
size of my posts under control.
{{< /sidenote >}} which led me to omit
a rather important aspect of the compiler: proper error reporting.
Once again, our compiler has instances of `throw 0`, which is a cheap way
of avoiding properly handling a runtime error. Before we move on,
it's best to get rid of such blatantly lazy code.
Our existing exceptions (mostly type errors) can use some work, too. Even the most descriptive issues our compiler reports -- unification errors -- don't include the crucial information of where the error is. For large programs, this means having to painstakingly read through the entire file to try to figure out which subexpression could possibly have an incorrect type. This is far from the ideal debugging experience.
Addressing all this is a multi-step change in itself. We want to:
- Replace all `throw 0` code with actual exceptions.
- Replace some exceptions that shouldn't be possible for a user to trigger with assertions.
- Keep track of the source location of each subexpression, so that we can print it if it causes an error.
- Be able to print out said source locations at will. This isn't a necessity, but virtually all "big" compilers do this. Instead of only reporting that an error occurs on a particular line, we will actually print the line.
Let's start with gathering the actual location data.
Bison's Locations
Bison actually has some rather nice support for location tracking. It can automatically assemble the "from" and "to" locations of a nonterminal from the locations of children, which would be very tedious to write by hand. We enable this feature using the following option:
{{< codelines "C++" "compiler/13/parser.y" 50 50 >}}
There's just one hitch, though. Sure, Bison can compute bigger
locations from smaller ones, but it must get the smaller ones
from somewhere. Since Bison operates on tokens, rather
than characters, it effectively doesn't interact with the source
text at all, and can't determine from which line or column a token
originated. The task of determining the locations of input tokens
is delegated to the tokenizer -- Flex, in our case. Flex, on the
other hand, doesn't have a built-in mechanism for tracking
locations. Fortunately, Bison provides a `yy::location` class that
includes most of the needed functionality.
A `yy::location` consists of `begin` and `end` source positions,
which themselves are represented using lines and columns. It
also has the following methods:
- `yy::location::columns(int)` advances the `end` position by the given number of columns, while `begin` stays the same. If `begin` and `end` both point to the beginning of a token, then `columns(token_length)` will move `end` to the token's end, and thus make the whole `location` contain the token.
- `yy::location::lines(int)` behaves similarly to `columns`, except that it advances `end` by the given number of lines, rather than columns.
- `yy::location::step()` moves `begin` to where `end` is. This is useful for when we've finished processing a token, and want to move on to the next one.
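To make the behavior of these three methods concrete, here's a minimal sketch of a stand-in type. The names `mini_position`, `mini_location` and `advance_over_token` are hypothetical and made up for illustration; the real `yy::location` is generated by Bison and does more. As far as I understand, Bison's version also resets the column to 1 when advancing lines, which the sketch imitates.

```cpp
#include <cassert>

// Hypothetical stand-in for a Bison position: 1-based line and column.
struct mini_position {
    int line = 1;
    int column = 1;
};

// Hypothetical stand-in for yy::location, for illustration only.
struct mini_location {
    mini_position begin;
    mini_position end;

    // Advance `end` by `count` columns; `begin` stays where it is.
    void columns(int count = 1) { end.column += count; }

    // Advance `end` by `count` lines; the column restarts at 1.
    void lines(int count = 1) {
        if (count != 0) {
            end.line += count;
            end.column = 1;
        }
    }

    // Move `begin` to where `end` is, ready for the next token.
    void step() { begin = end; }
};

// Mimics what a scanner would do for each token:
// collapse the location, then widen it by the token's length.
inline mini_location advance_over_token(mini_location loc, int token_length) {
    assert(token_length >= 0);
    loc.step();
    loc.columns(token_length);
    return loc;
}
```

Starting from a default `mini_location` and calling `advance_over_token` with lengths 3 and then 2 leaves us with a location that begins at column 4 and ends at column 6 of line 1.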
For Flex specifically, `yyleng` has the length of the token
currently being processed. Rather than adding the calls
to `columns` and `step` to every rule, we can define the
`YY_USER_ACTION` macro, which is run before each token
is processed.
{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}}
We'll see why we are using drv
soon; for now, you can treat
location
as if it were a global variable declared in the
tokenizer. Before processing each token, we ensure that
location
has its begin
and end
at the same position,
and then advance end
by yyleng
columns. This is sufficient
to make location
represent our token's source position.
So now we have a "global" variable location
that gives
us the source position of the current token. To get it
to Bison, we have to pass it as an argument to each
of the make_TOKEN
calls. Here are a few sample lines
that should give you the general idea:
{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}}
That last line is actually new. Previously, we somehow
got away without explicitly sending the EOF token to Bison.
I suspect that this was due to some kind of implicit conversion
of the Flex macro YY_NULL
into a token; now that we have
to pass a position to every token constructor, such an implicit
conversion is probably impossible.
Now we have Bison computing source locations for each nonterminal.
However, at the moment, we still aren't using them. To change that,
we need to add a yy::location
argument to each of our ast
nodes,
as well as to the pattern
subclasses, definition_defn
and
definition_data
. To avoid breaking all the code that creates
AST nodes and definitions outside of the parser, we'll make this
argument optional. Inside of ast.hpp
, we define it as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}
Then, we add a constructor to ast
as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}
Note that the argument has no default value here, since `ast` itself is an
abstract class, and thus will never be constructed directly.
It is in the subclasses of ast
that we provide a default
value. The change is rather mechanical, but here's an example
from ast_binop
:
{{< codelines "C++" "compiler/13/ast.hpp" 98 99 >}}
Finally, we tell Bison to pass the computed location data as an argument when constructing our data structures. This too is a mechanical change, and I think the following couple of lines demonstrate the general idea in sufficient detail:
{{< codelines "C++" "compiler/13/parser.y" 107 110 >}}
Here, `@$` is used to reference the current
nonterminal's location data.
Line Offsets, File Input, and the Parse Driver
There are three more challenges with printing out the line
of code where an error occurred. First of all, to
print out a line of code, we need to have that line of code
available to us. We do not currently meet this requirement:
our compiler reads code from `stdin` (as is default for Flex),
and `stdin` doesn't always support rewinding. This, in turn,
means that once Flex has read a character from the input,
it may not be possible to go back and retrieve that character
again.
Second, even if we do have the entire stream or buffer available to us, retrieving an offset and length within that buffer from just a line and column number would be a lot of work. A naive approach would be to iterate through the input again, once more keeping track of lines and columns, and print the desired line once we reach it. However, this would lead us to redo a lot of work that our tokenizer is already doing.
Third, Flex's input mechanism, even if it's configured
not to read from `stdin`, uses a global file handle called
`yyin`. However, we're better off minimizing global state (especially
if we want to read, parse, and compile multiple files in
the future). While we're configuring Flex's input mechanism,
we may as well fix this, too.
There are several approaches to fixing the first issue. One possible
way is to store the content of stdin
into a temporary file. Then,
it's possible to read from the file multiple times by using
the C functions fseek
and rewind
. However, since we're
working with files, why not just work directly with the files
created by the user? Instead of reading from stdin
, we may
as well take in a path to a file via argv
, and read from there.
Also, instead of fseek
and rewind
, we can just read the file
into memory, and access it like a normal character buffer.
To address the second issue, we can keep a mapping of line numbers
to their locations in the source buffer. This is rather easy to
maintain using an array: the first element of the array is 0,
which is the beginning of the first line in any source file. From there,
every time we encounter the character `\n`, we can push
the current source location onto the end of the array, marking it as
the beginning of another line. Where exactly we store this
array is as yet unclear, since we're trying to avoid global variables.
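As a rough sketch of this bookkeeping, here's a standalone function that computes the array for a buffer that is already in memory. The name `compute_line_offsets` is hypothetical, and this is a simplification: the approach described in this post builds the same array incrementally while the tokenizer runs, rather than in a separate pass.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the offset at which each line of `source` begins.
// Element i holds the start of line i + 1, since line numbers are 1-based.
std::vector<std::size_t> compute_line_offsets(const std::string& source) {
    std::vector<std::size_t> offsets;
    offsets.push_back(0); // the first line always begins at offset 0
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\n') {
            // The next line begins just past the newline character.
            offsets.push_back(i + 1);
        }
    }
    return offsets;
}
```

For the buffer `"ab\ncde\n\nf"`, this yields the offsets 0, 3, 7 and 8, one for each of the four lines.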
Finally, to begin addressing the third issue, we can use Flex's reentrant
option, which makes it so that all of the tokenizer's state is stored in an
opaque yyscan_t
structure, rather than in global variables. This way,
we can configure yyin
without setting a global variable, which is a step
in the right direction. We'll work on this momentarily.
Our tokenizing and parsing stack has more global variables
than just those specific to Flex. Among these variables is global_defs
,
which receives all the top-level function and data type definitions. We
will also need some way of accessing the yy::location
instance, and
a way of storing our file input in memory. Fortunately, we're not
the only ones to have ever come across the issue of creating non-global
state: the Bison documentation has a
section in its C++ guide
that describes a technique for manipulating
state -- "parsing context", in their words. This technique involves the
creation of a parsing driver.
The parsing driver is a class (or struct) that holds all the parse-related state. We can arrange for this class to be available to our tokenizing and parsing functions, which will allow us to use it pretty much like we'd use a global variable. We can define it as follows:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}}
There are quite a few fields here. The `file_name` string represents
the file that we'll be reading code from. The `string_stream` will
be used to back up the contents of the source file as Flex reads them;
once Flex is done, the content of the `string_stream` will be
saved into the `file_content` string.
The next three fields deal with tracking source code
locations. The location
field will be accessed by Flex
via drv.location
(where drv
is a reference to our driver class).
The file_offset
and line_offsets
fields will be used to
keep track of where each line begins, as we have discussed above.
Finally, global_defs
will be the new home of our top-level
definitions.
The methods on parse_driver
are rather simple, too:
- `run_parse` handles the initialization of the tokenizer and parser, which includes obtaining the `FILE*` and configuring Flex to use it. It also handles invoking the parsing code. We'll make this method return `true` if parsing succeeded, and `false` otherwise (if, say, the file we tried to read doesn't exist).
- `write` will be called from Flex, and will allow us to record the content of the file we're processing to the `string_stream`. We've already seen it used in the `YY_USER_ACTION` macro.
- `mark_line` will also be called from Flex, and will mark the current `file_offset` as the beginning of a line by pushing it into `line_offsets`.
- `get_index` and `get_line_end` will be used for converting `yy::location` instances to offsets within the source code buffer.
- `print_location` will be used for printing errors. It will print the lines spanned by the given location, with the location itself colored and underlined if the last argument is `true`. This will make our errors easier on the eyes.
Let's take a look at their implementations. First, run_parse
:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
We try to open the user-specified file, and return `false`
if we can't.
We then initialize line_offsets
as we discussed above. After
this, we start doing the setup specific to a reentrant
Flex scanner. We declare a yyscan_t
variable, which
will contain all of Flex's state. Then, we initialize
it using yylex_init
. Finally, since we can no longer
touch the yyin
global variable (it doesn't exist),
we have to resort to using a setter function provided by Flex
to configure the tokenizer's input stream.
Next, we construct our Bison-generated parser. Note that
unlike before, we have to pass in two arguments:
scanner
and *this
, the latter being of type parse_driver&
.
We'll come back to how this works in a moment. With
the scanner and parser initialized, we invoke parser::operator()
,
which actually runs the Flex- and Bison-generated code.
To clean up, we run yylex_destroy
and fclose
. Finally,
we extract the contents of our file into the file_contents
string, and return.
Next, the write
method. For the most part, this method
is a proxy for the write
method of our string_stream
:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}}
We do, however, also keep track of the file_offset
variable
here, which ensures we have up-to-date information
regarding our position in the source file. The implementation
of mark_line
uses this information:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}}
Once we have the line offsets, get_index
becomes very simple:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}}
Here, we use an assertion for the first time. Calling
get_index
with a negative or zero line doesn't make
any sense, since Bison starts tracking line numbers
at 1. Similarly, asking for a line for which we don't
have a recorded offset is invalid. Both
of these nonsensical calls to get_index
cannot
be caused by the user under normal circumstances,
and indicate the method's misuse by the author of
the compiler (us!). Thus, we terminate the program.
Finally, the implementation of `get_line_end` just finds the
beginning of the next line. We stick to the C convention
of marking 'end' indices exclusive (pointing just past
the end of the array):
{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}}
Since line_offsets
has as many elements as there are lines,
the last line number would be equal to the vector's size.
When looking up the end of the last line, we can't look for
the beginning of the next line, so instead we return the end of the file.
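To tie the two lookups together, here's a minimal, self-contained sketch. The `line_table` struct is a hypothetical, stripped-down stand-in for the driver, and I'm assuming 1-based lines and columns (matching Bison's convention); the real methods may differ in their details.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the driver's lookup state.
struct line_table {
    std::string contents;                  // the entire source file
    std::vector<std::size_t> line_offsets; // start offset of each line

    // Offset of a 1-based (line, column) pair within `contents`.
    std::size_t get_index(int line, int column) const {
        // Misuse here is a compiler bug, not a user error, so we assert.
        assert(line > 0);
        assert(column > 0);
        assert(static_cast<std::size_t>(line) <= line_offsets.size());
        return line_offsets[line - 1] + static_cast<std::size_t>(column - 1);
    }

    // Exclusive end of a 1-based line: the start of the next line,
    // or the end of the file when asked about the last line.
    std::size_t get_line_end(int line) const {
        assert(line > 0);
        assert(static_cast<std::size_t>(line) <= line_offsets.size());
        if (static_cast<std::size_t>(line) == line_offsets.size()) {
            return contents.size();
        }
        return line_offsets[line];
    }
};
```

With the contents `"ab\ncde\nf"` and offsets 0, 3 and 7, the second line spans offsets 3 (inclusive) to 7 (exclusive), and the third line ends at 8, the size of the buffer.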
Next, the print_location
method prints three sections
of the source file. These are the text "before" the error,
the error itself, and, finally, the text "after" the error.
For example, if an error began on the fifth column of the third
line, and ended on the eighth column of the fourth line, the
"before" section would include the first four columns of the third
line, and the "after" section would be the ninth column onward
on the fourth line. Before and after the error itself,
if the highlight
argument is true,
we sprinkle the ANSI escape codes to enable and disable
special formatting, respectively. For now, the special
formatting involves underlining the text and making it red.
{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}}
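The three-section idea can also be shown in isolation. The following is only a sketch under a few assumptions: `highlight_range` is a hypothetical helper that works on plain offsets and returns a string instead of printing, and I'm using the common ANSI sequences `\033[4;31m` (underline and red) and `\033[0m` (reset).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns `text` with the half-open range [begin, end)
// wrapped in ANSI escape codes when `highlight` is true.
std::string highlight_range(const std::string& text, std::size_t begin,
                            std::size_t end, bool highlight) {
    assert(begin <= end && end <= text.size());
    std::string result = text.substr(0, begin);   // the "before" section
    if (highlight) result += "\033[4;31m";        // underline + red on
    result += text.substr(begin, end - begin);    // the error itself
    if (highlight) result += "\033[0m";           // formatting off
    result += text.substr(end);                   // the "after" section
    return result;
}
```

For instance, highlighting offsets 4 through 9 of `"3 + False\n"` wraps only the word `False` in the escape codes, and passing `false` as the last argument returns the text unchanged.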
Finally, to get the forward declarations for the yy*
functions
and types, we set the header-file
option in Flex:
{{< codelines "C++" "compiler/13/scanner.l" 3 3 >}}
We also include this scanner.hpp
file in our parse_driver.cpp
:
{{< codelines "C++" "compiler/13/parse_driver.cpp" 2 2 >}}
Adding the Driver to Flex and Bison
Bison's C++ language template generates a class called
yy::parser
. We don't really want to modify this class
in any way: not only is it generated code, but it's
also rather complex. Instead, Bison provides us
with a mechanism to pass more data in to the parser.
This data is made available to all the actions
that the parser runs. Better yet, Bison also attempts
to pass this data on to the tokenizer, which in our
case would mean that whatever data we provide Bison
will also be available to Flex. This is how we'll
allow the two components to access our new parse_driver
class. This is also how we'll pass in the yyscan_t
that Flex now needs to run its tokenizing code. To
do all this, we use Bison's %param
option. I'm
going to include a few more lines from parser.y
,
since they contain the necessary #include
directives
and a required type definition:
{{< codelines "C++" "compiler/13/parser.y" 1 18 >}}
The %param
option effectively adds the parameter listed
between the curly braces to the constructor of the generated
yy::parser
. We've already seen this in the implementation
of our driver, where we passed scanner
and *this
as
arguments when creating the parser. The parameters we declare are also passed to the
yylex
function, which is expected to accept them in the same order.
Since we're adding parse_driver
as an argument we have to
declare it. However, we can't include the parse_driver
header
right away because parse_driver
itself includes the parser
header:
we'd end up with a circular dependency. Instead, we resort to
forward-declaring the driver class, as well as the yyscan_t
structure containing Flex's state.
Adding a parameter to Bison doesn't automatically affect
Flex. To let Flex know that its yylex
function must now accept
the state and the parse driver, we have to define the
YY_DECL
macro. We do this in parse_driver.hpp
, since
this forward declaration will be used by both Flex
and Bison:
{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}}
Finally, we can change our main.cpp
file to use the
parse_driver
:
{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
Improving Exceptions
Now, it's time to add location data (and a little bit more) to our exceptions. We want to make it possible for exceptions to include data about where the error occurred, and to print this data to the user. However, it's also possible for us to have exceptions that simply do not have that location data. Furthermore, we want to know whether or not an exception has an associated location; we'd rather not print an invalid or "default" location when an error occurs.
In the old days of programming, we could represent the absence
of location data with a `nullptr`, or `NULL`. But not only
does this approach expose us to all kinds of `NULL`-safety
bugs, but it also requires heap allocation! This doesn't
make it sound all that appealing; instead, I think we should
opt for using `std::optional`.
Though `std::optional` is standard (as may be obvious from its
namespace), it's a rather recent addition to the C++ standard library.
In order to gain access to it, we need to ensure that our
project is compiled using C++17. To this end, we add
the following two lines to our CMakeLists.txt:
{{< codelines "CMake" "compiler/13/CMakeLists.txt" 5 6 >}}
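As a rough sketch of the idea, here's how `std::optional` can model a location that may or may not be present. Both `source_span` (a stand-in for `yy::location`) and `located_error` are hypothetical names used only for this example, and the code needs C++17.

```cpp
#include <exception>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for yy::location.
struct source_span {
    int begin_line;
    int end_line;
};

// Hypothetical error type carrying a description and, possibly, a location.
class located_error : public std::exception {
    std::string description;
    std::optional<source_span> loc;

public:
    explicit located_error(std::string d,
                           std::optional<source_span> l = std::nullopt)
        : description(std::move(d)), loc(std::move(l)) {}

    const char* what() const noexcept override { return description.c_str(); }

    // Describes where the error happened; empty if the location is unknown,
    // so that we never print a made-up "default" location.
    std::string where() const {
        if (!loc) return "";
        return "occurring on line " + std::to_string(loc->begin_line);
    }
};
```

No heap allocation or null pointer is involved: the optional stores the location inline, and `if (!loc)` is the only check we need before using it.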
Now, let's add a new base class for all of our compiler errors,
unsurprisingly called compiler_error
:
{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}}
We'll put some 'common' exception functionality
into the print_location
and print_about
methods. If the error
has an associated location, the former method will print that
location to the screen. We don't always want to highlight
the part of the code that caused the error: for instance,
an invalid data type definition may span several lines,
and coloring that whole section of text red would be
too much. To address this, we add the highlight
boolean argument, which can be used to switch the
colors on and off. The print_about
method
will simply print the what()
message of the exception,
in addition to the "specific" error that occurred (stored
in description
). Here are the implementations of the
functions:
{{< codelines "C++" "compiler/13/error.cpp" 3 16 >}}
We will also add a pretty_print
method to all of
our exceptions. This method will handle
all the exception-specific printing logic.
For the generic compiler error, this means
simply printing out the error text and the location:
{{< codelines "C++" "compiler/13/error.cpp" 18 21 >}}
For type_error
, this logic slightly changes,
enabling colors when printing the location:
{{< codelines "C++" "compiler/13/error.cpp" 27 30 >}}
Finally, for unification_error
, we also include
the code to print out the two types that our
compiler could not unify:
{{< codelines "C++" "compiler/13/error.cpp" 32 41 >}}
There's a subtle change here. Compared to the previous
type-printing code (which we had in main
), what
we wrote here deals with "expected" and "actual" types.
The `left` type passed to the exception is printed
first, and is treated like the "correct" type. The
`right` type, on the other hand, is treated
like the "wrong" type that should have been
unifiable with `left`. This will affect the
calling conventions of our unification code. In
main
, we remove all our old exception printing code
in favor of calls to pretty_print
:
{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
Now, we can go through and find all the places where
we throw 0
. One such place was in the data type
definition code, where declaring the same type parameter
twice is invalid. We replace the 0
with a
compiler_error
:
{{< codelines "C++" "compiler/13/definition.cpp" 66 69 >}}
Not all throw 0
statements should become exceptions.
For example, here's code from the previous version of
the compiler:
{{< codelines "C++" "compiler/12/definition.cpp" 123 127 >}}
If a definition def_defn
has a dependency on a "nearby" (declared
in the same group) definition called dependency
, and if
dependency
does not exist within the same definition group,
we throw an exception. But this error is impossible
for a user to trigger: the only reason for a variable to appear
in the nearby_variables
vector is that it was previously
found in the definition group. Here's the code that proves this
(from the current version of the compiler):
{{< codelines "C++" "compiler/13/definition.cpp" 102 106 >}}
Not being able to find the variable in the definition group is a compiler bug, and should never occur. So, instead of throwing an exception, we'll use an assertion:
{{< codelines "C++" "compiler/13/definition.cpp" 128 128 >}}
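The distinction between the two kinds of failure can be sketched as follows. This is not the compiler's code: `check_type_params` and `lookup_registered` are hypothetical functions, and `std::invalid_argument` stands in for our own exception type.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// A user can trigger this by writing a bad program, so it's an exception.
void check_type_params(const std::vector<std::string>& params) {
    std::set<std::string> seen;
    for (const auto& param : params) {
        if (!seen.insert(param).second) {
            throw std::invalid_argument(
                "type variable " + param + " used twice in data type definition.");
        }
    }
}

// A user cannot trigger this: the name is only ever looked up after it was
// registered. A missing entry means the compiler has a bug, so we assert.
int lookup_registered(const std::map<std::string, int>& group,
                      const std::string& name) {
    auto it = group.find(name);
    assert(it != group.end() && "name missing from its definition group");
    return it->second;
}
```

An exception can be caught and reported nicely to the user, while a failed assertion stops the program on the spot, which is what we want when our own invariants are broken.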
For more complicated error messages, we can use a stringstream
.
Here's an example from parsed_type
:
{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
In general, this change is also rather mechanical, but, to
maintain a balance between exceptions and assertions, here
are a couple more assertions from type_env
:
{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}}
Once again, it should not be possible for the compiler to try to generalize the type of a variable that doesn't exist, nor should generalization occur twice.
While we're on the topic of types, let's talk about
type_mgr::unify
. In practice, I suspect that a lot of
errors in our compiler will originate from this method.
However, at present, this method does not track
where a unification error occurred.
To fix this, we add a new loc
parameter to unify
,
which we make optional to allow for unification without
a known location. Here's the declaration:
{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}}
The change to the implementation is mechanical and repetitive, so instead of showing you the whole method, I'll settle for a couple of lines:
{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}}
We want to make sure that a location provided to the
top-level call to unify
is also forwarded to the
recursive calls, so we have to explicitly add it
to the call.
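Here's a toy sketch of why the explicit forwarding matters. Everything in it is hypothetical: the `toy_type` representation has no type variables, so `toy_unify` is really just a structural comparison, and the location is reduced to an optional line number. The point is only the shape of the recursive calls.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy type: a named base type, or a function type
// when both `left` and `right` are set.
struct toy_type;
using toy_type_ptr = std::shared_ptr<toy_type>;
struct toy_type {
    std::string name;
    toy_type_ptr left;
    toy_type_ptr right;
};

toy_type_ptr toy_base(std::string name) {
    return std::make_shared<toy_type>(toy_type{std::move(name), nullptr, nullptr});
}

toy_type_ptr toy_arrow(toy_type_ptr l, toy_type_ptr r) {
    return std::make_shared<toy_type>(toy_type{"", std::move(l), std::move(r)});
}

// Hypothetical error carrying an optional line number.
struct toy_unification_error : std::runtime_error {
    std::optional<int> line;
    explicit toy_unification_error(std::optional<int> l)
        : std::runtime_error("failed to unify types"), line(l) {}
};

void toy_unify(const toy_type_ptr& l, const toy_type_ptr& r,
               const std::optional<int>& loc = std::nullopt) {
    assert(l && r);
    bool l_arrow = l->left != nullptr;
    bool r_arrow = r->left != nullptr;
    if (l_arrow && r_arrow) {
        // Pass `loc` along explicitly; relying on the default argument
        // here would silently drop the location for nested mismatches.
        toy_unify(l->left, r->left, loc);
        toy_unify(l->right, r->right, loc);
        return;
    }
    if (!l_arrow && !r_arrow && l->name == r->name) return;
    throw toy_unification_error(loc);
}
```

Unifying `Int -> Int` with `Int -> Bool` at line 2 fails inside the second recursive call, and the resulting error still reports line 2 because the location was forwarded.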
With all of that done, we can finally stand back and marvel at the results of our hard work. Here is what a basic unification error looks like now:
{{< figure src="unification_error.png" caption="The result of a unification error." >}}
I used an image to show colors, but here is the content of the error in textual form:
an error occured while checking the types of the program: failed to unify types
occuring on line 2:
3 + False
the expected type was:
!Int
while the actual type was:
!Bool
The exclamation marks in front of the two types are due to some
changes from section 2. Here's an error that was previously
a throw 0
statement in our code:
an error occured while compiling the program: type variable a used twice in data type definition.
occuring on line 1:
data Pair a a = { MkPair a a }
Now, not only have we eliminated the lazy uses of throw 0
in our
code, but we've also improved the presentation of the errors
to the user!