Finish 13th part of the compiler series.
				| @ -70,11 +70,11 @@ than _characters_, it effectively doesn't interact with the source | ||||
| text at all, and can't determine from which line or column a token | ||||
| originated. The task of determining the locations of input tokens | ||||
| is delegated to the tokenizer -- Flex, in our case. Flex, on the | ||||
| other hand, doesn't doesn't have a built-in mechanism for tracking | ||||
| other hand, doesn't have a built-in mechanism for tracking | ||||
| locations. Fortunately, Bison provides a `yy::location` class that | ||||
| includes most of the needed functionality. | ||||
| 
 | ||||
| A `yy::location` consists of `begin` and `end` source position, | ||||
| A `yy::location` consists of two source positions, `begin` and `end`, | ||||
| which themselves are represented using lines and columns. It | ||||
| also has the following methods: | ||||
| 
 | ||||
| @ -85,7 +85,7 @@ then `columns(token_length)` will move `end` to the token's end, | ||||
| and thus make the whole `location` contain the token. | ||||
| * `yy::location::lines(int)` behaves similarly to `columns`, | ||||
| except that it advances `end` by the given number of lines, | ||||
| rather than columns. | ||||
| rather than columns. It also resets the columns counter to `1`. | ||||
| * `yy::location::step()` moves `begin` to where `end` is. This | ||||
| is useful for when we've finished processing a token, and want | ||||
| to move on to the next one. | ||||
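
As a rough illustration of how `columns`, `lines`, and `step` interact, here is a minimal sketch. It uses simplified stand-ins for Bison's generated classes, not the real `yy::location` (which also carries a file name and guards against underflow). The helper `next_token` is hypothetical and only mirrors the "step, then advance by the token length" pattern described above.

```cpp
#include <cassert>

// Simplified stand-in for Bison's position class (an assumption for this sketch).
struct position {
    int line = 1;
    int column = 1;
};

// Simplified stand-in for yy::location: a pair of positions.
struct location {
    position begin;
    position end;

    // Advance `end` by `count` columns on the current line.
    void columns(int count = 1) { end.column += count; }

    // Advance `end` by `count` lines, resetting its column counter to 1.
    void lines(int count = 1) {
        if (count != 0) {
            end.line += count;
            end.column = 1;
        }
    }

    // Move `begin` to where `end` is, ready for the next token.
    void step() { begin = end; }
};

// Hypothetical helper: make `loc` cover the next token of `length` characters.
inline location next_token(location loc, int length) {
    loc.step();
    loc.columns(length);
    return loc;
}
```

After `next_token(loc, 3)` on a fresh location, `begin` stays at column 1 and `end` sits at column 4, so the location spans the three-character token.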
| @ -102,10 +102,20 @@ We'll see why we are using `LOC` instead of something like `location` soon; | ||||
| for now, you can treat `LOC` as if it were a global variable declared  | ||||
| in the tokenizer. Before processing each token, we ensure that | ||||
| the `yy::location` has its `begin` and `end` at the same position, | ||||
| and then advance `end` by `yyleng` columns. This is sufficient | ||||
| and then advance `end` by `yyleng` columns. This is | ||||
| {{< sidenote "right" "sufficient-note" "sufficient" >}} | ||||
| This doesn't hold for all languages. It may be possible for a language | ||||
| to have tokens that contain <code>\n</code>, in which case, | ||||
| rather than just using <code>yyleng</code>, we'd need to | ||||
| add special logic to iterate over the token and detect the line | ||||
| breaks.<br> | ||||
| <br> | ||||
| Also, this requires that the <code>end</code> of the previous token was | ||||
| correctly computed. | ||||
| {{< /sidenote >}} | ||||
| to make `LOC` represent our token's source position. For | ||||
| the moment, don't worry too much about `drv`; this is the | ||||
| parse driver, and we will talk about it shortly. | ||||
| parsing driver, and we will talk about it shortly. | ||||
| 
 | ||||
| So now we have a "global" variable `LOC` that gives | ||||
| us the source position of the current token. To get it | ||||
| @ -128,7 +138,7 @@ we need to add a `yy::location` argument to each of our `ast` nodes, | ||||
| as well as to the `pattern` subclasses, `definition_defn` and | ||||
| `definition_data`. To avoid breaking all the code that creates | ||||
| AST nodes and definitions outside of the parser, we'll make this | ||||
| argument optional. Inside of `ast.hpp`, we define it as follows: | ||||
| argument optional. Inside of `ast.hpp`, we define a new field as follows: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}} | ||||
| 
 | ||||
| @ -136,7 +146,7 @@ Then, we add a constructor to `ast` as follows: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}} | ||||
| 
 | ||||
| Note that it's not default here, since `ast` itself is an | ||||
| Note that it's not optional here, since `ast` itself is an | ||||
| abstract class, and thus will never be constructed directly. | ||||
| It is in the subclasses of `ast` that we provide a default | ||||
| value. The change is rather mechanical, but here's an example | ||||
| @ -155,7 +165,7 @@ detail: | ||||
| Here, the `@$` character is used to reference the current | ||||
| nonterminal's location data. | ||||
| 
 | ||||
| #### Line Offsets, File Input, and the Parse Driver | ||||
| #### Line Offsets, File Input, and the Parsing Driver | ||||
| There are three more challenges with printing out the line | ||||
| of code where an error occurred. First of all, to | ||||
| print out a line of code, we need to have that line of code | ||||
| @ -197,7 +207,7 @@ to read source code from files, anyway. | ||||
| To address the second issue, we can keep a mapping of line numbers | ||||
| to their locations in the source buffer. This is rather easy to | ||||
| maintain using an array: the first element of the array is 0, | ||||
| which is the beginning of any line in any source file. From there, | ||||
| which is the beginning of the first line in any source file. From there, | ||||
| every time we encounter the character `\n`, we can push | ||||
| the current source location to the top, marking it as | ||||
| the beginning of another line. Where exactly we store this | ||||
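
The line-offset array described in this hunk could be sketched as follows. This is a minimal sketch under assumptions: the function names `compute_line_offsets` and `get_line` are hypothetical, the whole source is assumed to be available as one `std::string`, and lines are indexed from 0 here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the mapping from line index to the offset where that line begins.
// The first line always begins at offset 0; every '\n' starts a new line
// at the offset right after it.
inline std::vector<std::size_t> compute_line_offsets(const std::string& source) {
    std::vector<std::size_t> offsets{0};
    for (std::size_t i = 0; i < source.size(); i++) {
        if (source[i] == '\n') offsets.push_back(i + 1);
    }
    return offsets;
}

// Extract one line (0-indexed, including its trailing '\n' if present).
inline std::string get_line(const std::string& source,
                            const std::vector<std::size_t>& offsets,
                            std::size_t line) {
    std::size_t begin = offsets.at(line);
    std::size_t end = line + 1 < offsets.size() ? offsets[line + 1] : source.size();
    return source.substr(begin, end - begin);
}
```

With the offsets in hand, printing the line on which an error occurred only needs the line number from the token's location.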
| @ -413,7 +423,7 @@ structure containing Flex's state. | ||||
| 
 | ||||
| Adding a parameter to Bison doesn't automatically affect | ||||
| Flex. To let Flex know that its `yylex` function must now accept | ||||
| the state and the parse driver, we have to define the | ||||
| the state and the parsing driver, we have to define the | ||||
| `YY_DECL` macro. We do this in `parse_driver.hpp`, since | ||||
| this forward declaration will be used by both Flex | ||||
| and Bison: | ||||
| @ -532,8 +542,8 @@ Here's an example from `parsed_type`: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}} | ||||
| 
 | ||||
| In general, this change is also rather mechanical, but, to | ||||
| maintain a balance between exceptions and assertions, here | ||||
| In general, this change is also rather mechanical. Before we | ||||
| move on, to maintain a balance between exceptions and assertions, here | ||||
| are a couple more assertions from `type_env`: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}} | ||||
| @ -581,9 +591,7 @@ while the actual type was: | ||||
|   Bool | ||||
| ``` | ||||
| 
 | ||||
| The exclamation marks in front of the two types are due to some | ||||
| changes from section 2. Here's an error that was previously | ||||
| a `throw 0` statement in our code: | ||||
| Here's an error that was previously a `throw 0` statement in our code: | ||||
| 
 | ||||
| ``` | ||||
| an error occured while compiling the program: type variable a used twice in data type definition. | ||||
| @ -604,7 +612,21 @@ Now that I've had some more time to think about it | ||||
| (and now that I've returned to the compiler after | ||||
| a brief hiatus), I think that this was not the right call. | ||||
| Mangled names make sense when translating to LLVM; we certainly | ||||
| don't want to declare two LLVM functions with the same name. | ||||
| don't want to declare two LLVM functions | ||||
| {{< sidenote "right" "mangling-note" "with the same name." >}} | ||||
| By the way, LLVM has its own name mangling functionality. If you | ||||
| declare two functions with the same name, they'll appear as | ||||
| <code>function</code> and <code>function.0</code>. Since LLVM | ||||
| uses the <code>Function*</code> C++ values to refer to functions, | ||||
| as long as we keep them separate on <em>our</em> end, things will | ||||
| work.<br> | ||||
| <br> | ||||
| However, in our compiler, name mangling occurs before LLVM is | ||||
| introduced, at translation time. We could create LLVM functions | ||||
| at that time, too, and associate them with variables. But then, | ||||
| our G-machine instructions will be coupled to LLVM, which | ||||
| would not be as clean. | ||||
| {{< /sidenote >}} | ||||
| But things are different for local variables. Our local variables | ||||
| are graphs on a stack, and are not actually compiled to LLVM | ||||
| definitions. It doesn't make sense to mangle their names, since | ||||
| @ -612,8 +634,8 @@ their names aren't present anywhere in the final executable. | ||||
| It's not even "consistent" to mangle them, since global definitions | ||||
| are compiled directly to __PushGlobal__ instructions, while local | ||||
| variables are only referenced through the current `env`. | ||||
| So, I decided to reverse my decision. We will go back to | ||||
| placing variable names directly onto `env_var`. Here's | ||||
| So, I opted to reverse my decision. We will go back to | ||||
| placing variable names directly into `env_var`. Here's | ||||
| an example of this from `global_scope.cpp`: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}} | ||||
| @ -630,8 +652,8 @@ that a variable from a __PushGlobal__ instruction | ||||
| is referencing the right function. To achieve | ||||
| this, we change `get_mangled_name` to stop | ||||
| returning the input string if a mangled name was not | ||||
| found; now that we _must_ have a mangled name, doing | ||||
| so is effectively obscuring the error. Instead, | ||||
| found; doing so makes it impossible to check if a mangled | ||||
| name was explicitly defined. Instead, | ||||
| we add two assertions. First, if an environment scope doesn't | ||||
| contain a variable, then it _must_ have a parent.  | ||||
| If it does contain a variable, that variable _must_ have | ||||
| @ -652,7 +674,19 @@ Here's the definition of `type_env::variable_data` now: | ||||
| {{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}} | ||||
| 
 | ||||
| Since looking up a mangled name for a non-global variable | ||||
| will now result in an assertion failure, we have to change | ||||
| {{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}} | ||||
| A very wise human at the very dawn of our species once said, | ||||
| "make illegal states unrepresentable". Their friends and family were a little | ||||
| busy making a fire, and didn't really understand what the heck they meant. Now, | ||||
| we kind of do.<br> | ||||
| <br> | ||||
| It's <em>possible</em> for our <code>type_env</code> to include a | ||||
| <code>variable_data</code> entry that is both global and has no mangled | ||||
| name. But it doesn't have to be this way. We could define two subclasses | ||||
| of <code>variable_data</code>, one global and one local, | ||||
| where only the global one has a <code>mangled_name</code> | ||||
| field. It would be impossible to reach this assertion failure then. | ||||
| {{< /sidenote >}} we have to change | ||||
| `ast_lid::compile` to only call `get_mangled_name` once | ||||
| it ensures that the variable being compiled is, in fact, | ||||
| global: | ||||
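
To illustrate the two assertions and the "check that it is global first" rule, here is a minimal, hypothetical sketch. The `env` and `variable_data` types below are pared-down stand-ins invented for this example, not the post's actual `type_env`. The real class stores more information and may differ in its details.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified per-variable record.
struct variable_data {
    bool is_global = false;
    std::optional<std::string> mangled_name;
};

// Hypothetical, simplified environment scope with an optional parent.
struct env {
    std::map<std::string, variable_data> names;
    const env* parent = nullptr;

    // Walk up the scopes to find out whether `name` is a global.
    bool is_global(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end()) return it->second.is_global;
        return parent != nullptr && parent->is_global(name);
    }

    // No silent fallback to the input string: a missing variable or a
    // missing mangled name is treated as a bug in the compiler itself.
    const std::string& get_mangled_name(const std::string& name) const {
        auto it = names.find(name);
        if (it == names.end()) {
            assert(parent != nullptr);  // not here, so it must be further up
            return parent->get_mangled_name(name);
        }
        assert(it->second.mangled_name.has_value());  // found, so it must be mangled
        return *it->second.mangled_name;
    }
};
```

A caller in the spirit of `ast_lid::compile` would test `is_global(name)` first and only then ask for the mangled name, so that local variables never reach the assertions.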
| @ -712,7 +746,7 @@ They're just temporarily allowed access. | ||||
| 
 | ||||
| So, what should be the owner of all of these disparate components? | ||||
| Thus far, that has been the `main` function, or the utility | ||||
| functions that it calls out to. However, this is in bad taste: | ||||
| functions that it calls out to. However, this is sloppy: | ||||
| we have related data and operations on it, but we don't group | ||||
| them into an object. We can group all of the components of our | ||||
| compiler into a `compiler` object, and leave `main.cpp` with | ||||
| @ -747,14 +781,11 @@ The methods of the compiler are arranged similarly: | ||||
| The methods go as follows: | ||||
| 
 | ||||
| * `add_default_types` adds the built-in types to the `global_env`. | ||||
| At this point in the post, these types only include `Int`. However, | ||||
| in the second section, we'll make `Bool` a built-in type, too. | ||||
| At this point, these types only include `Int`.  | ||||
| * `add_binop_type` adds a single binary operator to the global | ||||
| type environment. We saw its implementation earlier: it deals | ||||
| with both binding a type, and setting a mangled name. | ||||
| * `add_default_types` adds the types for each binary operator, | ||||
| and also for the `True` and `False` constructors (which we will | ||||
| cover in the second section). | ||||
| * `add_default_types` adds the types for each binary operator. | ||||
| * `parse`, `typecheck`, `translate` and `compile` all do exactly | ||||
| what they say. In this case, compilation refers to creating G-machine | ||||
| instructions. | ||||
| @ -776,7 +807,7 @@ file with the | ||||
| file that we end up with at the end of this post. | ||||
| 
 | ||||
| Next, we have the compiler's constructor, and its `operator()`. The | ||||
| latter, analogously to our parse driver, will trigger the compilation | ||||
| latter, analogously to our parsing driver, will trigger the compilation | ||||
| process. Their implementations are straightforward: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}} | ||||
| @ -793,11 +824,8 @@ pretty printing code: | ||||
| 
 | ||||
| {{< codelines "C++" "compiler/13/main.cpp" 11 27 >}} | ||||
| 
 | ||||
| That's all for the cleanup! We've added locations and more errors | ||||
| the compiler, stopped throwing `0` in favor of proper exceptions | ||||
| or assertions, made name mangling more reasonable, fixed a bug with | ||||
| accidentally shadowing default functions, and organized our compilation | ||||
| process into a `compiler` class. | ||||
| With this, we complete our transition to a compiler object. | ||||
| All that's left is to clean up the code style. | ||||
| 
 | ||||
| ### Keeping Things Private | ||||
| Hand-writing or generating hundreds of trivial getters and setters | ||||
| @ -880,3 +908,58 @@ name with `f_`, much like `create_custom_function`: | ||||
| I think that's enough. If we chose to turn more compiler | ||||
| data structures into classes, I think we would've quickly drowned | ||||
| in one-line getter and setter methods. | ||||
| 
 | ||||
| That's all for the cleanup! We've added locations and more errors | ||||
| to the compiler, stopped throwing `0` in favor of proper exceptions | ||||
| or assertions, made name mangling more reasonable, fixed a bug with | ||||
| accidentally shadowing default functions, organized our compilation | ||||
| process into a `compiler` class, and made more things into classes. | ||||
| In the next post, I hope to tackle __strings__ and __Input/Output__. | ||||
| I also think that implementing __modules__ would be a good idea, | ||||
| though at the moment I don't know too much on the subject. I hope | ||||
| you'll join me in my future writing! | ||||
| 
 | ||||
| ### Appendix: Optimization | ||||
| When I started working on the compiler after the previous post, | ||||
| I went a little overboard. I started working on optimizing the generated programs, | ||||
| but eventually decided I wasn't doing a | ||||
| {{< sidenote "right" "good-note" "good enough" >}} | ||||
| I think authors should feel a certain degree of responsibility | ||||
| for the content they create. If I do something badly, and somebody | ||||
| else trusts me and learns from it, who knows how much damage I've done. | ||||
| I try not to do damage.<br> | ||||
| <br> | ||||
| If anyone reads what I write, anyway! | ||||
| {{< /sidenote >}} job to present it to others, | ||||
| and scrapped that part of the compiler altogether. I'm not | ||||
| sure if I will try again in the near future. But, | ||||
| if you're curious about optimization, here are a few avenues | ||||
| I've explored or thought about: | ||||
| 
 | ||||
| * __Unboxing numbers__. Right now, numbers are allocated and garbage | ||||
| collected just like the rest of the graph nodes. This is far from ideal. | ||||
| We could use pointers to represent numbers, by tagging their most significant | ||||
| bits on 64-bit CPUs. Rather than allocating a node, the runtime will just | ||||
| cast a number to a pointer, tag it, and push it on the stack. | ||||
| * __Converting enumeration data types to numbers__. If no constructor | ||||
| of a data type takes any arguments, then the tag uniquely identifies | ||||
| each constructor. Combined with unboxed numbers, this can save unnecessary | ||||
| allocations and memory accesses. | ||||
| * __Special treatment for global constants__. It makes sense for | ||||
| global functions to be converted into LLVM functions, but the | ||||
| same is not the case for | ||||
| {{< sidenote "right" "constant-note" "constants." >}} | ||||
| Yeah, yeah, a constant is just a nullary function. Get | ||||
| out of here with your pedantry! | ||||
| {{< /sidenote >}} We can find a way to | ||||
| initialize global constants once, which would save some work. To | ||||
| make more constants suitable for this, we could employ | ||||
| the [monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction). | ||||
| * __Optimizing stack operations__. If you read through the LLVM IR | ||||
| we produce, you can see a lot of code that peeks at something twice, | ||||
| or pops-then-pushes the same value, or does other absurd things. LLVM | ||||
| isn't aware of the semantics of our stacks, but perhaps we could write an | ||||
| optimization pass to deal with some of the more blatant instances of | ||||
| this issue. | ||||
| 
 | ||||
| If you attempt any of these, let me know how it goes, please! | ||||
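
For the first idea in the list, unboxing numbers, a tagging scheme might look roughly like the sketch below. Everything here is an assumption made for illustration: the names `make_num`, `is_num`, and `get_num` are hypothetical, a stack slot is modeled as a plain 64-bit word rather than a real pointer, and the scheme relies on user-space addresses leaving the most significant bit clear, which is common on 64-bit platforms but not guaranteed. Numbers are limited to 63 bits as a result.

```cpp
#include <cassert>
#include <cstdint>

// A stack slot is modeled as a 64-bit word that holds either a node
// address (top bit clear, by assumption) or a tagged number (top bit set).
using slot = std::uint64_t;

constexpr slot NUM_TAG = slot(1) << 63;
constexpr slot SIGN_BIT = slot(1) << 62;  // sign bit of the 63-bit payload

// Store a number directly in the slot instead of allocating a node.
inline slot make_num(std::int64_t n) {
    return (static_cast<slot>(n) & ~NUM_TAG) | NUM_TAG;
}

// Distinguish tagged numbers from ordinary node addresses.
inline bool is_num(slot s) { return (s & NUM_TAG) != 0; }

// Recover the number, sign-extending the 63-bit payload back to 64 bits.
inline std::int64_t get_num(slot s) {
    slot payload = s & ~NUM_TAG;
    if (payload & SIGN_BIT) payload |= NUM_TAG;
    return static_cast<std::int64_t>(payload);
}
```

The runtime and garbage collector would then have to check `is_num` before following a slot as a pointer, which is where most of the real work in this optimization would lie.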
|  | ||||