Refactor errors and update post draft.

2020-09-11 21:29:49 -07:00
parent 725958137a
commit 6b8d3b0f8a
8 changed files with 431 additions and 42 deletions
--- a/content/blog/13_compiler_cleanup_optimization/index.md
+++ b/content/blog/13_compiler_cleanup_optimization/index.md
@@ -62,7 +62,7 @@ automatically assemble the "from" and "to" locations of a nonterminal
 from the locations of children, which would be very tedious to write
 by hand. We enable this feature using the following option:

-{{< codelines "text" "compiler/13/parser.y" 50 50 >}}
+{{< codelines "C++" "compiler/13/parser.y" 50 50 >}}

 There's just one hitch, though. Sure, Bison can compute bigger
 locations from smaller ones, but it must get the smaller ones
@@ -143,6 +143,17 @@ from `ast_binop`:

 {{< codelines "C++" "compiler/13/ast.hpp" 98 99 >}}

+Finally, we tell Bison to pass the computed location
+data as an argument when constructing our data structures.
+This too is a mechanical change, and I think the following
+couple of lines demonstrate the general idea in sufficient
+detail:
+
+{{< codelines "C++" "compiler/13/parser.y" 107 110 >}}
+
+Here, the `@$` character is used to reference the current
+nonterminal's location data.
+
 #### Line Offsets, File Input, and the Parse Driver
 There are three more challenges with printing out the line
 of code where an error occurred. First of all, to
@@ -202,7 +213,8 @@ will also need some way of accessing the `yy::location` instance, and
 a way of storing our file input in memory. Fortunately, we're not
 the only ones to have ever come across the issue of creating non-global
 state: the Bison documentation has a
-[section in its C++ guide](https://www.gnu.org/software/bison/manual/html_node/Calc_002b_002b-Parsing-Driver.html) that describes a technique for manipulating
+[section in its C++ guide](https://www.gnu.org/software/bison/manual/html_node/Calc_002b_002b-Parsing-Driver.html)
+that describes a technique for manipulating
 state -- "parsing context", in their words. This technique involves the
 creation of a _parsing driver_.

@@ -211,4 +223,352 @@ state. We can arrange for this class to be available to our tokenizing
 and parsing functions, which will allow us to use it pretty much like we'd
 use a global variable. We can define it as follows:

-{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}
+{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 37 >}}
+
+There are quite a few fields here. The `file_name` string represents
+the file that we'll be reading code from. the `string_stream` will
+be used to back up the contents of source file as Flex reads them;
+once Flex is done, the content of the `string_stream` will be
+saved into the `file_content` string.
+
+The next three fields deal with tracking source code
+locations. The `location` field will be accessed by Flex
+via `drv.location` (where `drv` is a reference to our driver class).
+The `file_offset` and `line_offsets` fields will be used to
+keep track of where each line begins, as we have discussed above.
+Finally, `global_defs` will be the new home of our top-level
+definitions.
+
+The methods on `parse_driver` are rather simple, too:
+
+* `run_parse` handles the initialization of the tokenizer
+and parser, which includes obtaining the `FILE*` and configuring
+Flex to use it. It also handles invoking the parsing code.
+We'll make this method return `true` if parsing succeeded,
+and `false` otherwise (if, say, the file we tried to read doesn't exist).
+* `write` will be called from Flex, and will allow us to
+record the content of the file we're processing to the `string_stream`.
+We've already seen it used in the `YY_USER_ACTION` macro.
+* `mark_line` will also be called from Flex, and will mark the current
+`file_offset` as the beginning of a line by pushing it into `line_offsets`.
+* `get_index` and `get_line_end` will be used for converting
+`yy::location` instances to offsets within the source code buffer.
+* `print_location` will be used for printing errors.
+It will print the lines spanned by the given location, with the
+location itself colored and underlined if the last argument is `true`.
+This will make our errors easier on the eyes.
+
+Let's take a look at their implementations. First, `run_parse`:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 5 18 >}}
+
+We try open the user-specified file, and return `false` if we can't.
+We then initialize `line_offsets` as we discussed above. After
+this, we start doing the setup specific to a reentrant
+Flex scanner. We declare a `yyscan_t` variable, which
+will contain all of Flex's state. Then, we initialize
+it using `yylex_init`. Finally, since we can no longer
+touch the `yyin` global variable (it doesn't exist),
+we have to resort to using a setter function provided by Flex
+to configure the tokenizer's input stream.
+
+Next, we construct our Bison-generated parser. Note that
+unlike before, we have to pass in two arguments:
+`scanner` and `*this`, the latter being of type `parse_driver&`.
+We'll come back to how this works in a moment. With
+the scanner and parser initialized, we invoke `parser::operator()`,
+which actually runs the Flex- and Bison-generated code.
+To clean up, we run `yylex_destroy` and `fclose`. Finally,
+we extract the contents of our file into the `file_contents`
+string, and return.
+
+Next, the `write` method. For the most part, this method
+is a proxy for the `write` method of our `string_stream`:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 20 23 >}}
+
+We do, however, also keep track of the `file_offset` variable
+here, which ensures we have up-to-date information
+regarding our position in the source file. The implementation
+of `mark_line` uses this information:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 25 27 >}}
+
+Once we have the line offsets, `get_index` becomes very simple:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 29 32 >}}
+
+Here, we use an assertion for the first time. Calling
+`get_index` with a negative or zero line doesn't make
+any sense, since Bison starts tracking line numbers
+at 1. Similarly, asking for a line for which we don't
+have a recorded offset is invalid. Both
+of these nonsensical calls to `get_index` cannot
+be caused by the user under normal circumstances,
+and indicate the method's misuse by the author of
+the compiler (us!). Thus, we terminate the program.
+
+Finally, the implementation of `line_end` just finds the
+beginning of the next line. We stick to the C convention
+of marking 'end' indices exclusive (pointing just past
+the end of the array):
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 34 37 >}}
+
+Since `line_offsets` has as many elements as there are lines,
+the last line number would be equal to the vector's size.
+When looking up the end of the last line, we can't look for
+the beginning of the next line, so instead we return the end of the file.
+
+Next, the `print_location` method prints three sections
+of the source file. These are the text "before" the error,
+the error itself, and, finally, the text "after" the error.
+For example, if an error began on the fifth column of the third
+line, and ended on the eighth column of the fourth line, the
+"before" section would include the first four columns of the third
+line, and the "after" section would be the ninth column onward
+on the fourth line. Before and after the error itself,
+if the `highlight` argument is true,
+we sprinkle the ANSI escape codes to enable and disable
+special formatting, respectively. For now, the special
+formatting involves underlining the text and making it red.
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 39 53 >}}
+
+Finally, to get the forward declarations for the `yy*` functions
+and types, we set the `header-file` option in Flex:
+
+{{< codelines "C++" "compiler/13/scanner.l" 3 3 >}}
+
+We also include this `scanner.hpp` file in our `parse_driver.cpp`:
+
+{{< codelines "C++" "compiler/13/parse_driver.cpp" 2 2 >}}
+
+#### Adding the Driver to Flex and Bison
+Bison's C++ language template generates a class called
+`yy::parser`. We don't really want to modify this class
+in any way: not only is it generated code, but it's
+also rather complex. Instead, Bison provides us
+with a mechanism to pass more data in to the parser.
+This data is made available to all the actions
+that the parser runs. Better yet, Bison also attempts
+to pass this data on to the tokenizer, which in our
+case would mean that whatever data we provide Bison
+will also be available to Flex. This is how we'll
+allow the two components to access our new `parse_driver`
+class. This is also how we'll pass in the `yyscan_t`
+that Flex now needs to run its tokenizing code. To
+do all this, we use Bison's `%param` option. I'm
+going to include a few more lines from `parser.y`,
+since they contain the necessary `#include` directives
+and a required type definition:
+
+{{< codelines "C++" "compiler/13/parser.y" 1 18 >}}
+
+The `%param` option effectively adds the parameter listed
+between the curly braces to the constructor of the generated
+`yy::parser`. We've already seen this in the implementation
+of our driver, where we passed `scanner` and `*this` as
+arguments when creating the parser. The parameters we declare are also passed to the
+`yylex` function, which is expected to accept them in the same order.
+
+Since we're adding `parse_driver` as an argument we have to
+declare it. However, we can't include the `parse_driver` header
+right away because `parse_driver` itself includes the `parser` header:
+we'd end up with a circular dependency. Instead, we resort to
+forward-declaring the driver class, as well as the `yyscan_t`
+structure containing Flex's state.
+
+Adding a parameter to Bison doesn't automatically affect
+Flex. To let Flex know that its `yylex` function must now accept
+the state and the parse driver, we have to define the
+`YY_DECL` macro. We do this in `parse_driver.hpp`, since
+this forward declaration will be used by both Flex
+and Bison:
+
+{{< codelines "C++" "compiler/13/parse_driver.hpp" 39 41 >}}
+
+Finally, we can change our `main.cpp` file to use the
+`parse_driver`:
+
+{{< codelines "C++" "compiler/13/main.cpp" 178 186 >}}
+
+#### Improving Exceptions
+Now, it's time to add location data (and a little bit more) to our
+exceptions. We want to make it possible for exceptions to include
+data about where the error occurred, and to print this data to the user.
+However, it's also possible for us to have exceptions that simply
+do not have that location data. Furthermore, we want to know
+whether or not an exception has an associated location; we'd
+rather not print an invalid or "default" location when an error
+occurs.
+
+In the old days of programming, we could represent the absence
+of location data with a `nullptr`, or `NULL`. But not only
+does this approach expose us to all kind of `NULl`-safety
+bugs, but it also requires heap allocation! This doesn't
+make it sound all that appealing; instead, I think we should
+opt for using `std::optional`.
+
+Though `std::optional` is standard (as may be obvious from its
+namespace), it's a rather recent addition to the C++ STL.
+In order to gain access to it, we need to ensure that our
+project is compiled using C++17. To this end, we add
+the following two lines to our CMakeLists.txt:
+
+{{< codelines "CMake" "compiler/13/CMakeLists.txt" 5 6 >}}
+
+Now, let's add a new base class for all of our compiler errors,
+unsurprisingly called `compiler_error`:
+
+{{< codelines "C++" "compiler/13/error.hpp" 8 23 >}}
+
+We'll put some 'common' exception functionality
+into the `print_location` and `print_about` methods. If the error
+has an associated location, the former method will print that
+location to the screen. We don't always want to highlight
+the part of the code that caused the error: for instance,
+an invalid data type definition may span several lines,
+and coloring that whole section of text red would be
+too much. To address this, we add the `highlight`
+boolean argument, which can be used to switch the
+colors on and off. The `print_about` method
+will simply print the `what()` message of the exception,
+in addition to the "specific" error that occurred (stored
+in `description`). Here are the implementations of the
+functions:
+
+{{< codelines "C++" "compiler/13/error.cpp" 3 16 >}}
+
+We will also add a `pretty_print` method to all of
+our exceptions. This method will handle
+all the exception-specific printing logic.
+For the generic compiler error, this means
+simply printing out the error text and the location:
+
+{{< codelines "C++" "compiler/13/error.cpp" 18 21 >}}
+
+For `type_error`, this logic slightly changes,
+enabling colors when printing the location:
+
+{{< codelines "C++" "compiler/13/error.cpp" 27 30 >}}
+
+Finally, for `unification_error`, we also include
+the code to print out the two types that our
+compiler could not unify:
+
+{{< codelines "C++" "compiler/13/error.cpp" 32 41 >}}
+
+There's a subtle change here. Compared to the previous
+type-printing code (which we had in `main`), what
+we wrote here deals with "expected" and "actual" types.
+The `left` type passed to the exception is printed
+first, and is treat like the "correct" type. The
+`right` type, on the other hand, is treated
+like the "wrong" type that should have been
+unifiable with `left`. This will affect the
+calling conventions of our unification code. In
+`main`, we remove all our old exception printing code
+in favor of calls to `pretty_print`:
+
+{{< codelines "C++" "compiler/13/main.cpp" 207 213 >}}
+
+Now, we can go through and find all the places where
+we `throw 0`. One such place was in the data type
+definition code, where declaring the same type parameter
+twice is invalid. We replace the `0` with a 
+`compiler_error`:
+
+{{< codelines "C++" "compiler/13/definition.cpp" 66 69 >}}
+
+Not all `throw 0` statements should become exceptions.
+For example, here's code from the previous version of
+the compiler:
+
+{{< codelines "C++" "compiler/12/definition.cpp" 123 127 >}}
+
+If a definition `def_defn` has a dependency on a "nearby" (declared
+in the same group) definition called `dependency`, and if
+`dependency` does not exist within the same definition group,
+we throw an exception. But this error is impossible
+for a user to trigger: the only reason for a variable to appear
+in the `nearby_variables` vector is that it was previously
+found in the definition group. Here's the code that proves this
+(from the current version of the compiler):
+
+{{< codelines "C++" "compiler/13/definition.cpp" 102 106 >}}
+
+Not being able to find the variable in the definition group
+is a compiler bug, and should never occur. So, instead
+of throwing an exception, we'll use an assertion:
+
+{{< codelines "C++" "compiler/13/definition.cpp" 128 128 >}}
+
+For more complicated error messages, we can use a `stringstream`.
+Here's an example from `parsed_type`:
+
+{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
+
+In general, this change is also rather mechanical, but, to
+maintain a balance between exceptions and assertions, here
+are a couple more assertions from `type_env`:
+
+{{< codelines "C++" "compiler/13/type_env.cpp" 77 78 >}}
+
+Once again, it should not be possible for the compiler
+to try generalize the type of a variable that doesn't
+exist, and nor should generalization occur twice.
+
+While we're on the topic of types, let's talk about
+`type_mgr::unify`. In practice, I suspect that a lot of
+errors in our compiler will originate from this method.
+However, at present, this method does not in any way
+track the locations of where a unification error occurred.
+To fix this, we add a new `loc` parameter to `unify`,
+which we make optional to allow for unification without
+a known location. Here's the declaration:
+
+{{< codelines "C++" "compiler/13/type.hpp" 101 101 >}}
+
+The change to the implementation is mechanical and repetitive,
+so instead of showing you the whole method, I'll settle for
+a couple of lines:
+
+{{< codelines "C++" "compiler/13/type.cpp" 119 121 >}}
+
+We want to make sure that a location provided to the
+top-level call to `unify` is also forwarded to the
+recursive calls, so we have to explicitly add it
+to the call.
+
+With all of that done, we can finally stand back and
+marvel at the results of our hard work. Here is what a
+basic unification error looks like now:
+
+{{< figure src="unification_error.png" caption="The result of a unification error." >}}
+
+I used an image to show colors, but here is the content of the error in textual form:
+
+```
+an error occured while checking the types of the program: failed to unify types
+occuring on line 2:
+    3 + False
+the expected type was:
+  !Int
+while the actual type was:
+  !Bool
+```
+
+The exclamation marks in front of the two types are due to some
+changes from section 2. Here's an error that was previously
+a `throw 0` statement in our code:
+
+```
+an error occured while compiling the program: type variable a used twice in data type definition.
+occuring on line 1:
+data Pair a a = { MkPair a a }
+```
+
+Now, not only have we eliminated the lazy uses of `throw 0` in our
+code, but we've also improved the presentation of the errors
+to the user!