diff --git a/content/blog/13_compiler_cleanup_optimization/index.md b/content/blog/13_compiler_cleanup_optimization/index.md new file mode 100644 index 0000000..aba465b --- /dev/null +++ b/content/blog/13_compiler_cleanup_optimization/index.md @@ -0,0 +1,214 @@ +--- +title: Compiling a Functional Language Using C++, Part 13 - More Improvements +date: 2020-09-10T18:50:02-07:00 +tags: ["C and C++", "Functional Languages", "Compilers"] +description: "In this post, we clean up our compiler and add some basic optimizations." +--- + +In [part 12]({{< relref "12_compiler_let_in_lambda" >}}), we added `let/in` +and lambda expressions to our compiler. At the end of that post, I mentioned +that before we move on to bigger and better things, I wanted to take a +step back and clean up the compiler. + +Recently, I got around to doing that. Unfortunately, I also got around to doing +a lot more. Furthermore, I managed to make the changes in such a way that I +can't cleanly separate the 'cleanup' and 'optimization' portions of my work. +This is partially due to the way in which I organize code, where each post +is associated with a version of the compiler with the necessary changes. +Because of all this, instead of making this post about the cleanup, and the +next post about the optimizations, I have to merge them into one. + +So, this post is split into two major portions: cleanup, which deals mostly +with touching up exceptions and improving the 'name mangling' logic, and +optimizations, which deals with adding special treatment to booleans, +unboxing integers, and implementing more binary operators. + +### Section 1: Cleanup + +The previous post was +{{< sidenote "right" "long-note" "rather long," >}} +Probably not as long as this one, though! I really need to get the +size of my posts under control. +{{< /sidenote >}} which led me to omit +a rather important aspect of the compiler: proper error reporting. 
+Once again our compiler has instances of `throw 0`, which is a cheap way +of avoiding properly handling a runtime error. Before we move on, +it's best to get rid of such blatantly lazy code. + +Our existing exceptions (mostly type errors) can use some work, too. +Even the most descriptive issues our compiler reports -- unification errors -- +don't include the crucial information of _where_ the error is. For large +programs, this means having to painstakingly read through the entire file +to try to figure out which subexpression could possibly have an incorrect type. +This is far from the ideal debugging experience. + +Addressing all this is a multi-step change in itself. We want to: + +* Replace all `throw 0` code with actual exceptions. +* Replace some exceptions that shouldn't be possible for a user to trigger +with assertions. +* Keep track of source locations of each subexpression, so that we may +be able to print it if it causes an error. +* Be able to print out said source locations at will. This isn't +a _necessity_, but virtually all "big" compilers do this. Instead +of reporting that an error occurs on a particular line, we will +actually print the line. + +Let's start with gathering the actual location data. + +#### Bison's Locations +Bison actually has some rather nice support for location tracking. It can +automatically assemble the "from" and "to" locations of a nonterminal +from the locations of children, which would be very tedious to write +by hand. We enable this feature using the following option: + +{{< codelines "text" "compiler/13/parser.y" 50 50 >}} + +There's just one hitch, though. Sure, Bison can compute bigger +locations from smaller ones, but it must get the smaller ones +from somewhere. Since Bison operates on _tokens_, rather +than _characters_, it effectively doesn't interact with the source +text at all, and can't determine from which line or column a token +originated.
+The task of determining the locations of input tokens +is delegated to the tokenizer -- Flex, in our case. Flex, on the +other hand, doesn't have a built-in mechanism for tracking +locations. Fortunately, Bison provides a `yy::location` class that +includes most of the needed functionality. + +A `yy::location` consists of `begin` and `end` source positions, +which themselves are represented using lines and columns. It +also has the following methods: + +* `yy::location::columns(int)` advances the `end` position by +the given number of columns, while `begin` stays the same. +If `begin` and `end` both point to the beginning of a token, +then `columns(token_length)` will move `end` to the token's end, +and thus make the whole `location` contain the token. +* `yy::location::lines(int)` behaves similarly to `columns`, +except that it advances `end` by the given number of lines, +rather than columns. +* `yy::location::step()` moves `begin` to where `end` is. This +is useful for when we've finished processing a token, and want +to move on to the next one. + +For Flex specifically, `yyleng` has the length of the token +currently being processed. Rather than adding the calls +to `columns` and `step` to every rule, we can define the +`YY_USER_ACTION` macro, which is run before each token +is processed. + +{{< codelines "C++" "compiler/13/scanner.l" 12 12 >}} + +We'll see why we are using `drv` soon; for now, you can treat +`location` as if it were a global variable declared in the +tokenizer. Before processing each token, we ensure that +`location` has its `begin` and `end` at the same position, +and then advance `end` by `yyleng` columns. This is sufficient +to make `location` represent our token's source position. + +So now we have a "global" variable `location` that gives +us the source position of the current token. To get it +to Bison, we have to pass it as an argument to each +of the `make_TOKEN` calls.
+Here are a few sample lines +that should give you the general idea: + +{{< codelines "C++" "compiler/13/scanner.l" 41 44 >}} + +That last line is actually new. Previously, we somehow +got away without explicitly sending the EOF token to Bison. +I suspect that this was due to some kind of implicit conversion +of the Flex macro `YY_NULL` into a token; now that we have +to pass a position to every token constructor, such an implicit +conversion is probably impossible. + +Now we have Bison computing source locations for each nonterminal. +However, at the moment, we still aren't using them. To change that, +we need to add a `yy::location` argument to each of our `ast` nodes, +as well as to the `pattern` subclasses, `definition_defn` and +`definition_data`. To avoid breaking all the code that creates +AST nodes and definitions outside of the parser, we'll make this +argument optional. Inside of `ast.hpp`, we define it as follows: + +{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}} + +Then, we add a constructor to `ast` as follows: + +{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}} + +Note that the argument has no default value here, since `ast` itself is an +abstract class, and thus will never be constructed directly. +It is in the subclasses of `ast` that we provide a default +value. The change is rather mechanical, but here's an example +from `ast_binop`: + +{{< codelines "C++" "compiler/13/ast.hpp" 98 99 >}} + +#### Line Offsets, File Input, and the Parse Driver +There are three more challenges with printing out the line +of code where an error occurred. First of all, to +print out a line of code, we need to have that line of code +available to us. We do not currently meet this requirement: +our compiler reads code from `stdin` (as is the default for Flex), +and `stdin` doesn't always support rewinding. This, in turn, +means that once Flex has read a character from the input, +it may not be possible to go back and retrieve that character +again.
+ +Second, even if we do have the entire stream or buffer +available to us, to retrieve an offset and length within +that buffer from just a line and column number would be a lot +of work. A naive approach would be to iterate through +the input again, once more keeping track of lines and columns, +and print the desired line once we reach it. However, this +would lead us to redo a lot of work that our tokenizer +is already doing. + +Third, Flex's input mechanism, even if it's configured +not to read from `stdin`, uses a global file descriptor called +`yyin`. However, we're better off minimizing global state (especially +if we want to read, parse, and compile multiple files in +the future). While we're configuring Flex's input mechanism, +we may as well fix this, too. + +There are several approaches to fixing the first issue. One possible +way is to store the content of `stdin` into a temporary file. Then, +it's possible to read from the file multiple times by using +the C functions `fseek` and `rewind`. However, since we're +working with files, why not just work directly with the files +created by the user? Instead of reading from `stdin`, we may +as well take in a path to a file via `argv`, and read from there. +Also, instead of `fseek` and `rewind`, we can just read the file +into memory, and access it like a normal character buffer. + +To address the second issue, we can keep a mapping of line numbers +to their locations in the source buffer. This is rather easy to +maintain using an array: the first element of the array is 0, +which is the beginning of any line in any source file. From there, +every time we encounter the character `\n`, we can push +the current source location to the top, marking it as +the beginning of another line. Where exactly we store this +array is as yet unclear, since we're trying to avoid global variables.
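To make the line-offset idea concrete, here is a minimal sketch of such an array over an in-memory source buffer. The names `compute_line_offsets` and `get_line` are hypothetical and are not taken from the compiler's code. As a simplification, the sketch scans a finished buffer in a single pass, whereas the approach described above records each offset as the tokenizer encounters a `\n`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: record the offset at which each line of `source` begins.
// The first line always begins at offset 0; every '\n' marks the start of
// another line at the position right after it.
std::vector<std::size_t> compute_line_offsets(const std::string& source) {
    std::vector<std::size_t> offsets;
    offsets.push_back(0);
    for (std::size_t i = 0; i < source.size(); i++) {
        if (source[i] == '\n') offsets.push_back(i + 1);
    }
    return offsets;
}

// Hypothetical helper: return the text of a 1-indexed line,
// without its trailing newline.
std::string get_line(const std::string& source,
                     const std::vector<std::size_t>& offsets,
                     std::size_t line) {
    // A bad line number would be a bug in the caller, not a user error,
    // so an assertion is used rather than an exception.
    assert(line >= 1 && line <= offsets.size());
    std::size_t begin = offsets[line - 1];
    // The line ends just before the '\n' that precedes the next line's start,
    // or at the end of the buffer if this is the last line.
    std::size_t end = line < offsets.size() ? offsets[line] - 1 : source.size();
    return source.substr(begin, end - begin);
}
```

With the offsets in hand, looking up a line is a constant-time index into the array followed by a `substr`, so no second pass over the input is needed when reporting an error.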
+ +Finally, to begin addressing the third issue, we can use Flex's `reentrant` +option, which makes it so that all of the tokenizer's state is stored in an +opaque `yyscan_t` structure, rather than in global variables. This way, +we can configure `yyin` without setting a global variable, which is a step +in the right direction. We'll work on this momentarily. + +Our tokenizing and parsing stack has more global variables +than just those specific to Flex. Among these variables is `global_defs`, +which receives all the top-level function and data type definitions. We +will also need some way of accessing the `yy::location` instance, and +a way of storing our file input in memory. Fortunately, we're not +the only ones to have ever come across the issue of creating non-global +state: the Bison documentation has a +[section in its C++ guide](https://www.gnu.org/software/bison/manual/html_node/Calc_002b_002b-Parsing-Driver.html) that describes a technique for manipulating +state -- "parsing context", in their words. This technique involves the +creation of a _parsing driver_. + +The parsing driver is a class (or struct) that holds all the parse-related +state. We can arrange for this class to be available to our tokenizing +and parsing functions, which will allow us to use it pretty much like we'd +use a global variable. We can define it as follows: + +{{< codelines "C++" "compiler/13/parse_driver.hpp" 14 34 >}}