From 9f77f07ed2c92c624d89ad8b23ead93a2935566f Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Sat, 19 Sep 2020 16:14:07 -0700 Subject: [PATCH] Finish 13th part of the compiler series. --- content/blog/13_compiler_cleanup/index.md | 149 +++++++++++++++++----- 1 file changed, 116 insertions(+), 33 deletions(-) diff --git a/content/blog/13_compiler_cleanup/index.md b/content/blog/13_compiler_cleanup/index.md index d0a50d4..d7b2d1a 100644 --- a/content/blog/13_compiler_cleanup/index.md +++ b/content/blog/13_compiler_cleanup/index.md @@ -70,11 +70,11 @@ than _characters_, it effectively doesn't interact with the source text at all, and can't determine from which line or column a token originated. The task of determining the locations of input tokens is delegated to the tokenizer -- Flex, in our case. Flex, on the -other hand, doesn't doesn't have a built-in mechanism for tracking +other hand, doesn't have a built-in mechanism for tracking locations. Fortunately, Bison provides a `yy::location` class that includes most of the needed functionality. -A `yy::location` consists of `begin` and `end` source position, +A `yy::location` consists of two source positions, `begin` and `end`, which themselves are represented using lines and columns. It also has the following methods: @@ -85,7 +85,7 @@ then `columns(token_length)` will move `end` to the token's end, and thus make the whole `location` contain the token. * `yy::location::lines(int)` behaves similarly to `columns`, except that it advances `end` by the given number of lines, -rather than columns. +rather than columns. It also resets the columns counter to `1`. * `yy::location::step()` moves `begin` to where `end` is. This is useful for when we've finished processing a token, and want to move on to the next one. @@ -102,10 +102,20 @@ We'll see why we are using `LOC` instead of something like `location` soon; for now, you can treat `LOC` as if it were a global variable declared in the tokenizer. 
Before processing each token, we ensure that the `yy::location` has its `begin` and `end` at the same position, -and then advance `end` by `yyleng` columns. This is sufficient +and then advance `end` by `yyleng` columns. This is +{{< sidenote "right" "sufficient-note" "sufficient" >}} +This doesn't hold for all languages. It may be possible for a language +to have tokens that contain \n, in which case, +rather than just using yyleng, we'd need to +add special logic to iterate over the token and detect the line +breaks.
+
+Also, this requires that the end of the previous token was +correctly computed. +{{< /sidenote >}} to make `LOC` represent our token's source position. For the moment, don't worry too much about `drv`; this is the -parse driver, and we will talk about it shortly. +parsing driver, and we will talk about it shortly. So now we have a "global" variable `LOC` that gives us the source position of the current token. To get it @@ -128,7 +138,7 @@ we need to add a `yy::location` argument to each of our `ast` nodes, as well as to the `pattern` subclasses, `definition_defn` and `definition_data`. To avoid breaking all the code that creates AST nodes and definitions outside of the parser, we'll make this -argument optional. Inside of `ast.hpp`, we define it as follows: +argument optional. Inside of `ast.hpp`, we define a new field as follows: {{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}} @@ -136,7 +146,7 @@ Then, we add a constructor to `ast` as follows: {{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}} -Note that it's not default here, since `ast` itself is an +Note that it's not optional here, since `ast` itself is an abstract class, and thus will never be constructed directly. It is in the subclasses of `ast` that we provide a default value. The change is rather mechanical, but here's an example @@ -155,7 +165,7 @@ detail: Here, the `@$` character is used to reference the current nonterminal's location data. -#### Line Offsets, File Input, and the Parse Driver +#### Line Offsets, File Input, and the Parsing Driver There are three more challenges with printing out the line of code where an error occurred. First of all, to print out a line of code, we need to have that line of code @@ -197,7 +207,7 @@ to read source code from files, anyway. To address the second issue, we can keep a mapping of line numbers to their locations in the source buffer. 
This is rather easy to maintain using an array: the first element of the array is 0, -which is the beginning of any line in any source file. From there, +which is the beginning of the first line in any source file. From there, every time we encounter the character `\n`, we can push the current source location to the top, marking it as the beginning of another line. Where exactly we store this @@ -413,7 +423,7 @@ structure containing Flex's state. Adding a parameter to Bison doesn't automatically affect Flex. To let Flex know that its `yylex` function must now accept -the state and the parse driver, we have to define the +the state and the parsing driver, we have to define the `YY_DECL` macro. We do this in `parse_driver.hpp`, since this forward declaration will be used by both Flex and Bison: @@ -532,8 +542,8 @@ Here's an example from `parsed_type`: {{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}} -In general, this change is also rather mechanical, but, to -maintain a balance between exceptions and assertions, here +In general, this change is also rather mechanical. Before we +move on, to maintain a balance between exceptions and assertions, here are a couple more assertions from `type_env`: {{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}} @@ -581,9 +591,7 @@ while the actual type was: Bool ``` -The exclamation marks in front of the two types are due to some -changes from section 2. Here's an error that was previously -a `throw 0` statement in our code: +Here's an error that was previously a `throw 0` statement in our code: ``` an error occured while compiling the program: type variable a used twice in data type definition. @@ -604,7 +612,21 @@ Now that I've had some more time to think about it (and now that I've returned to the compiler after a brief hiatus), I think that this was not the right call. Mangled names make sense when translating to LLVM; we certainly -don't want to declare two LLVM functions with the same name. 
+don't want to declare two LLVM functions +{{< sidenote "right" "mangling-note" "with the same name." >}} +By the way, LLVM has its own name mangling functionality. If you +declare two functions with the same name, they'll appear as +function and function.0. Since LLVM +uses the Function* C++ values to refer to functions, +as long as we keep them separate on our end, things will +work.
+
+However, in our compiler, name mangling occurs before LLVM is +introduced, at translation time. We could create LLVM functions +at that time, too, and associate them with variables. But then, +our G-machine instructions will be coupled to LLVM, which +would not be as clean. +{{< /sidenote >}} But things are different for local variables. Our local variables are graphs on a stack, and are not actually compiled to LLVM definitions. It doesn't make sense to mangle their names, since @@ -612,8 +634,8 @@ their names aren't present anywhere in the final executable. It's not even "consistent" to mangle them, since global definitions are compiled directly to __PushGlobal__ instructions, while local variables are only referenced through the current `env`. -So, I decided to reverse my decision. We will go back to -placing variable names directly onto `env_var`. Here's +So, I opted to reverse my decision. We will go back to +placing variable names directly into `env_var`. Here's an example of this from `global_scope.cpp`: {{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}} @@ -630,8 +652,8 @@ that a variable from a __PushGlobal__ instruction is referencing the right function. To achieve this, we change `get_mangled_name` to stop returning the input string if a mangled name was not -found; now that we _must_ have a mangled name, doing -so is effectively obscuring the error. Instead, +found; doing so makes it impossible to check if a mangled +name was explicitly defined. Instead, we add two assertions. First, if an environment scope doesn't contain a variable, then it _must_ have a parent. 
If it does contain variable, that variable _must_ have @@ -652,7 +674,19 @@ Here's the definition of `type_env::variable_data` now: {{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}} Since looking up a mangled name for non-global variable -will now result in an assertion failure, we have to change +{{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}} +A very wise human at the very dawn of our species once said, +"make illegal states unrepresentable". Their friends and family were a little +busy making a fire, and didn't really understand what the heck they meant. Now, +we kind of do.
+
+It's possible for our type_env to include a +variable_data entry that is both global and has no mangled +name. But it doesn't have to be this way. We could define two subclasses +of variable_data, one global and one local, +where only the global one has a mangled_name +field. It would be impossible to reach this assertion failure then. +{{< /sidenote >}} we have to change `ast_lid::compile` to only call `get_mangled_name` once it ensures that the variable being compiled is, in fact, global: @@ -712,7 +746,7 @@ They're just temporarily allowed access. So, what should be the owner of all of these disparate components? Thus far, that has been the `main` function, or the utility -functions that it calls out to. However, this is in bad taste: +functions that it calls out to. However, this is sloppy: we have related data and operations on it, but we don't group them into an object. We can group all of the components of our compiler into a `compiler` object, and leave `main.cpp` with @@ -747,14 +781,11 @@ The methods of the compiler are arranged similarly: The methods go as follows: * `add_default_types` adds the built-in types to the `global_env`. -At this point in the post, these types only include `Int`. However, -in the second section, we'll make `Bool` a built-in type, too. +At this point, these types only include `Int`. * `add_binop_type` adds a single binary operator to the global type environment. We saw its implementation earlier: it deals with both binding a type, and setting a mangled name. -* `add_default_types` adds the types for each binary operator, -and also for the `True` and `False` constructors (which we will -cover in the second section). +* `add_default_types` adds the types for each binary operator. * `parse`, `typecheck`, `translate` and `compile` all do exactly what they say. In this case, compilation refers to creating G-machine instructions. @@ -776,7 +807,7 @@ file with the file that we end up with at the end of this post. 
Next, we have the compiler's constructor, and its `operator()`. The -latter, analogously to our parse driver, will trigger the compilation +latter, analogously to our parsing driver, will trigger the compilation process. Their implementations are straightforward: {{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}} @@ -793,11 +824,8 @@ pretty printing code: {{< codelines "C++" "compiler/13/main.cpp" 11 27 >}} -That's all for the cleanup! We've added locations and more errors -the compiler, stopped throwing `0` in favor of proper exceptions -or assertions, made name mangling more reasonable, fixed a bug with -accidentally shadowing default functions, and organized our compilation -process into a `compiler` class. +With this, we complete our transition to a compiler object. +All that's left is to clean up the code style. ### Keeping Things Private Hand-writing or generating hundreds of trivial getters and setters @@ -880,3 +908,58 @@ name with `f_`, much like `create_custom_function`: I think that's enough. If we chose to turn more compiler data structures into classes, I think we would've quickly drowned in one-line getter and setter methods. + +That's all for the cleanup! We've added locations and more errors +to the compiler, stopped throwing `0` in favor of proper exceptions +or assertions, made name mangling more reasonable, fixed a bug with +accidentally shadowing default functions, organized our compilation +process into a `compiler` class, and made more things into classes. +In the next post, I hope to tackle __strings__ and __Input/Output__. +I also think that implementing __modules__ would be a good idea, +though at the moment I don't know too much on the subject. I hope +you'll join me in my future writing! + +### Appendix: Optimization +When I started working on the compiler after the previous post, +I went a little overboard. 
I started working on optimizing the generated programs, +but eventually decided I wasn't doing a +{{< sidenote "right" "good-note" "good enough" >}} +I think authors should feel a certain degree of responsibility +for the content they create. If I do something badly, somebody +else trusts me and learns from it, who knows how much damage I've done. +I try not to do damage.
+
+If anyone reads what I write, anyway! +{{< /sidenote >}} job to present it to others, +and scrapped that part of the compiler altogether. I'm not +sure if I will try again in the near future. But, +if you're curious about optimization, here are a few avenues +I've explored or thought about: + +* __Unboxing numbers__. Right now, numbers are allocated and garbage +collected just like the rest of the graph nodes. This is far from ideal. +We could use pointers to represent numbers, by tagging their most significant +bits on 64-bit CPUs. Rather than allocating a node, the runtime will just +cast a number to a pointer, tag it, and push it on the stack. +* __Converting enumeration data types to numbers__. If no constructor +of a data type takes any arguments, then the tag uniquely identifies +each constructor. Combined with unboxed numbers, this can save unnecessary +allocations and memory accesses. +* __Special treatment for global constants__. It makes sense for +global functions to be converted into LLVM functions, but the +same is not the case for +{{< sidenote "right" "constant-note" "constants." >}} +Yeah, yeah, a constant is just a nullary function. Get +out of here with your pedantry! +{{< /sidenote >}} We can find a way to +initialize global constants once, which would save some work. To +make more constants suitable for this, we could employ the +[monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction). +* __Optimizing stack operations__. If you read through the LLVM IR +we produce, you can see a lot of code that peeks at something twice, +or pops-then-pushes the same value, or does other absurd things. LLVM +isn't aware of the semantics of our stacks, but perhaps we could write an +optimization pass to deal with some of the more blatant instances of +this issue. + +If you attempt any of these, let me know how it goes, please!
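
To make the "unboxing numbers" idea a little more concrete, here is a minimal sketch of tagging the most significant bit of a pointer-sized value. It is not code from the compiler or its runtime: `node_base` is a hypothetical stand-in for the runtime's node type, the names `box_num`, `is_num` and `unbox_num` are made up for illustration, and the whole thing assumes a 64-bit platform on which real node pointers never have bit 63 set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime's graph node type; only pointers
// to it are used here, so it can stay incomplete.
struct node_base;

static_assert(sizeof(std::uintptr_t) == 8, "this sketch assumes 64-bit pointers");

// Assumption: real node pointers never have bit 63 set on the target platform.
constexpr std::uintptr_t NUM_TAG = std::uintptr_t(1) << 63;
constexpr std::uintptr_t SIGN_BIT = std::uintptr_t(1) << 62;

// Pack a number into a pointer-sized value instead of allocating a node.
inline node_base* box_num(std::int64_t n) {
    // Only 63 bits of payload remain once the tag bit is taken.
    assert(n >= -(std::int64_t(1) << 62) && n < (std::int64_t(1) << 62));
    std::uintptr_t bits = static_cast<std::uintptr_t>(n) & ~NUM_TAG;
    return reinterpret_cast<node_base*>(bits | NUM_TAG);
}

// A value is an unboxed number exactly when the tag bit is set.
inline bool is_num(const node_base* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & NUM_TAG) != 0;
}

// Recover the number from a tagged value.
inline std::int64_t unbox_num(const node_base* p) {
    assert(is_num(p));
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p) & ~NUM_TAG;
    // Sign-extend from bit 62 so that negative numbers round-trip.
    if (bits & SIGN_BIT) bits |= NUM_TAG;
    return static_cast<std::int64_t>(bits);
}
```

With something like this, pushing a number would amount to pushing `box_num(n)` onto the stack, and any code that follows pointers (the garbage collector in particular) would presumably need to check `is_num` first and leave tagged values alone. The cost is one bit of range: numbers are limited to 63 bits.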