Finish 13th part of the compiler series.

Danila Fedorin 2020-09-19 16:14:07 -07:00
parent 04ab1a137c
commit 9f77f07ed2
1 changed files with 116 additions and 33 deletions


@ -70,11 +70,11 @@ than _characters_, it effectively doesn't interact with the source
text at all, and can't determine from which line or column a token
originated. The task of determining the locations of input tokens
is delegated to the tokenizer -- Flex, in our case. Flex, on the
other hand, doesn't doesn't have a built-in mechanism for tracking
other hand, doesn't have a built-in mechanism for tracking
locations. Fortunately, Bison provides a `yy::location` class that
includes most of the needed functionality.
A `yy::location` consists of `begin` and `end` source position,
A `yy::location` consists of two source positions, `begin` and `end`,
which themselves are represented using lines and columns. It
also has the following methods:
@ -85,7 +85,7 @@ then `columns(token_length)` will move `end` to the token's end,
and thus make the whole `location` contain the token.
* `yy::location::lines(int)` behaves similarly to `columns`,
except that it advances `end` by the given number of lines,
rather than columns.
rather than columns. It also resets the columns counter to `1`.
* `yy::location::step()` moves `begin` to where `end` is. This
is useful for when we've finished processing a token, and want
to move on to the next one.
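
To make the three methods above concrete, here is a minimal sketch. It does not use Bison's generated header; `position`, `location`, and `advance_over_token` are simplified, hypothetical stand-ins that only mimic the behavior described in the list (the real `yy::location` has more functionality).

```cpp
#include <cassert>

// Simplified stand-ins for Bison's yy::position and yy::location.
// The real classes are generated by Bison; these only mimic the
// three methods described in the list above.
struct position {
    int line = 1;
    int column = 1;
};

struct location {
    position begin;
    position end;

    // Move `begin` to where `end` is, so a new token can start there.
    void step() { begin = end; }

    // Advance `end` by `count` columns on the current line.
    void columns(int count = 1) { end.column += count; }

    // Advance `end` by `count` lines and reset the column counter to 1.
    void lines(int count = 1) {
        assert(count >= 0);
        if (count > 0) {
            end.line += count;
            end.column = 1;
        }
    }
};

// Hypothetical helper: what a tokenizer would do before handling a
// token that is `length` characters long (Flex's yyleng).
inline void advance_over_token(location& loc, int length) {
    loc.step();
    loc.columns(length);
}
```

Under these assumptions, calling `advance_over_token(loc, 4)` on a fresh `location` leaves `begin` at column 1 and `end` at column 5, so the location spans exactly the four-character token.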
@ -102,10 +102,20 @@ We'll see why we are using `LOC` instead of something like `location` soon;
for now, you can treat `LOC` as if it were a global variable declared
in the tokenizer. Before processing each token, we ensure that
the `yy::location` has its `begin` and `end` at the same position,
and then advance `end` by `yyleng` columns. This is sufficient
and then advance `end` by `yyleng` columns. This is
{{< sidenote "right" "sufficient-note" "sufficient" >}}
This doesn't hold for all languages. It may be possible for a language
to have tokens that contain <code>\n</code>, in which case,
rather than just using <code>yyleng</code>, we'd need to
add special logic to iterate over the token and detect the line
breaks.<br>
<br>
Also, this requires that the <code>end</code> of the previous token was
correctly computed.
{{< /sidenote >}}
to make `LOC` represent our token's source position. For
the moment, don't worry too much about `drv`; this is the
parse driver, and we will talk about it shortly.
parsing driver, and we will talk about it shortly.
So now we have a "global" variable `LOC` that gives
us the source position of the current token. To get it
@ -128,7 +138,7 @@ we need to add a `yy::location` argument to each of our `ast` nodes,
as well as to the `pattern` subclasses, `definition_defn` and
`definition_data`. To avoid breaking all the code that creates
AST nodes and definitions outside of the parser, we'll make this
argument optional. Inside of `ast.hpp`, we define it as follows:
argument optional. Inside of `ast.hpp`, we define a new field as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}
@ -136,7 +146,7 @@ Then, we add a constructor to `ast` as follows:
{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}
Note that it's not default here, since `ast` itself is an
Note that it's not optional here, since `ast` itself is an
abstract class, and thus will never be constructed directly.
It is in the subclasses of `ast` that we provide a default
value. The change is rather mechanical, but here's an example
@ -155,7 +165,7 @@ detail:
Here, the `@$` symbol is used to reference the current
nonterminal's location data.
#### Line Offsets, File Input, and the Parse Driver
#### Line Offsets, File Input, and the Parsing Driver
There are three more challenges with printing out the line
of code where an error occurred. First of all, to
print out a line of code, we need to have that line of code
@ -197,7 +207,7 @@ to read source code from files, anyway.
To address the second issue, we can keep a mapping of line numbers
to their locations in the source buffer. This is rather easy to
maintain using an array: the first element of the array is 0,
which is the beginning of any line in any source file. From there,
which is the beginning of the first line in any source file. From there,
every time we encounter the character `\n`, we can push
the current source location to the top, marking it as
the beginning of another line. Where exactly we store this
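
The line-offset bookkeeping described in this hunk could be sketched roughly as follows. This is not the post's actual implementation: `line_index`, `write`, and `get_line` are hypothetical names, and the sketch assumes that the recorded "current source location" is the offset just past each `\n`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not the post's actual class: records where each
// line of a source buffer begins.
struct line_index {
    std::string source;
    // The first line of any source file begins at offset 0.
    std::vector<std::size_t> line_offsets{0};

    // Append a chunk of text, recording the offset just past every '\n'
    // as the beginning of another line.
    void write(const std::string& chunk) {
        for (char c : chunk) {
            source.push_back(c);
            if (c == '\n') line_offsets.push_back(source.size());
        }
    }

    // Return the text of a 1-indexed line, without its trailing newline.
    std::string get_line(std::size_t line) const {
        assert(line >= 1 && line <= line_offsets.size());
        std::size_t begin = line_offsets[line - 1];
        std::size_t end = line < line_offsets.size()
            ? line_offsets[line] - 1  // stop just before the '\n'
            : source.size();          // last line runs to the end of the buffer
        return source.substr(begin, end - begin);
    }
};
```

With the offsets recorded this way, printing the line on which an error occurred is a single `substr` call, with no need to rescan the source text.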
@ -413,7 +423,7 @@ structure containing Flex's state.
Adding a parameter to Bison doesn't automatically affect
Flex. To let Flex know that its `yylex` function must now accept
the state and the parse driver, we have to define the
the state and the parsing driver, we have to define the
`YY_DECL` macro. We do this in `parse_driver.hpp`, since
this forward declaration will be used by both Flex
and Bison:
@ -532,8 +542,8 @@ Here's an example from `parsed_type`:
{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
In general, this change is also rather mechanical, but, to
maintain a balance between exceptions and assertions, here
In general, this change is also rather mechanical. Before we
move on, to maintain a balance between exceptions and assertions, here
are a couple more assertions from `type_env`:
{{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}}
@ -581,9 +591,7 @@ while the actual type was:
Bool
```
The exclamation marks in front of the two types are due to some
changes from section 2. Here's an error that was previously
a `throw 0` statement in our code:
Here's an error that was previously a `throw 0` statement in our code:
```
an error occured while compiling the program: type variable a used twice in data type definition.
@ -604,7 +612,21 @@ Now that I've had some more time to think about it
(and now that I've returned to the compiler after
a brief hiatus), I think that this was not the right call.
Mangled names make sense when translating to LLVM; we certainly
don't want to declare two LLVM functions with the same name.
don't want to declare two LLVM functions
{{< sidenote "right" "mangling-note" "with the same name." >}}
By the way, LLVM has its own name mangling functionality. If you
declare two functions with the same name, they'll appear as
<code>function</code> and <code>function.0</code>. Since LLVM
uses the <code>Function*</code> C++ values to refer to functions,
as long as we keep them separate on <em>our</em> end, things will
work.<br>
<br>
However, in our compiler, name mangling occurs before LLVM is
introduced, at translation time. We could create LLVM functions
at that time, too, and associate them with variables. But then,
our G-machine instructions will be coupled to LLVM, which
would not be as clean.
{{< /sidenote >}}
But things are different for local variables. Our local variables
are graphs on a stack, and are not actually compiled to LLVM
definitions. It doesn't make sense to mangle their names, since
@ -612,8 +634,8 @@ their names aren't present anywhere in the final executable.
It's not even "consistent" to mangle them, since global definitions
are compiled directly to __PushGlobal__ instructions, while local
variables are only referenced through the current `env`.
So, I decided to reverse my decision. We will go back to
placing variable names directly onto `env_var`. Here's
So, I opted to reverse my decision. We will go back to
placing variable names directly into `env_var`. Here's
an example of this from `global_scope.cpp`:
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
@ -630,8 +652,8 @@ that a variable from a __PushGlobal__ instruction
is referencing the right function. To achieve
this, we change `get_mangled_name` to stop
returning the input string if a mangled name was not
found; now that we _must_ have a mangled name, doing
so is effectively obscuring the error. Instead,
found; doing so makes it impossible to check if a mangled
name was explicitly defined. Instead,
we add two assertions. First, if an environment scope doesn't
contain a variable, then it _must_ have a parent.
If it does contain a variable, that variable _must_ have
@ -652,7 +674,19 @@ Here's the definition of `type_env::variable_data` now:
{{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}}
Since looking up a mangled name for a non-global variable
will now result in an assertion failure, we have to change
{{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}}
A very wise human at the very dawn of our species once said,
"make illegal states unrepresentable". Their friends and family were a little
busy making a fire, and didn't really understand what the heck they meant. Now,
we kind of do.<br>
<br>
It's <em>possible</em> for our <code>type_env</code> to include a
<code>variable_data</code> entry that is both global and has no mangled
name. But it doesn't have to be this way. We could define two subclasses
of <code>variable_data</code>, one global and one local,
where only the global one has a <code>mangled_name</code>
field. It would be impossible to reach this assertion failure then.
{{< /sidenote >}} we have to change
`ast_lid::compile` to only call `get_mangled_name` once
it ensures that the variable being compiled is, in fact,
global:
@ -712,7 +746,7 @@ They're just temporarily allowed access.
So, what should be the owner of all of these disparate components?
Thus far, that has been the `main` function, or the utility
functions that it calls out to. However, this is in bad taste:
functions that it calls out to. However, this is sloppy:
we have related data and operations on it, but we don't group
them into an object. We can group all of the components of our
compiler into a `compiler` object, and leave `main.cpp` with
@ -747,14 +781,11 @@ The methods of the compiler are arranged similarly:
The methods go as follows:
* `add_default_types` adds the built-in types to the `global_env`.
At this point in the post, these types only include `Int`. However,
in the second section, we'll make `Bool` a built-in type, too.
At this point, these types only include `Int`.
* `add_binop_type` adds a single binary operator to the global
type environment. We saw its implementation earlier: it deals
with both binding a type, and setting a mangled name.
* `add_default_types` adds the types for each binary operator,
and also for the `True` and `False` constructors (which we will
cover in the second section).
* `add_default_types` adds the types for each binary operator.
* `parse`, `typecheck`, `translate` and `compile` all do exactly
what they say. In this case, compilation refers to creating G-machine
instructions.
@ -776,7 +807,7 @@ file with the
file that we end up with at the end of this post.
Next, we have the compiler's constructor, and its `operator()`. The
latter, analogously to our parse driver, will trigger the compilation
latter, analogously to our parsing driver, will trigger the compilation
process. Their implementations are straightforward:
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
@ -793,11 +824,8 @@ pretty printing code:
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
That's all for the cleanup! We've added locations and more errors
the compiler, stopped throwing `0` in favor of proper exceptions
or assertions, made name mangling more reasonable, fixed a bug with
accidentally shadowing default functions, and organized our compilation
process into a `compiler` class.
With this, we complete our transition to a compiler object.
All that's left is to clean up the code style.
### Keeping Things Private
Hand-writing or generating hundreds of trivial getters and setters
@ -880,3 +908,58 @@ name with `f_`, much like `create_custom_function`:
I think that's enough. If we chose to turn more compiler
data structures into classes, I think we would've quickly drowned
in one-line getter and setter methods.
That's all for the cleanup! We've added locations and more errors
to the compiler, stopped throwing `0` in favor of proper exceptions
or assertions, made name mangling more reasonable, fixed a bug with
accidentally shadowing default functions, organized our compilation
process into a `compiler` class, and made more things into classes.
In the next post, I hope to tackle __strings__ and __Input/Output__.
I also think that implementing __modules__ would be a good idea,
though at the moment I don't know too much on the subject. I hope
you'll join me in my future writing!
### Appendix: Optimization
When I started working on the compiler after the previous post,
I went a little overboard. I started working on optimizing the generated programs,
but eventually decided I wasn't doing a
{{< sidenote "right" "good-note" "good enough" >}}
I think authors should feel a certain degree of responsibility
for the content they create. If I do something badly, somebody
else trusts me and learns from it, then who knows how much damage I've done.
I try not to do damage.<br>
<br>
If anyone reads what I write, anyway!
{{< /sidenote >}} job to present it to others,
and scrapped that part of the compiler altogether. I'm not
sure if I will try again in the near future. But,
if you're curious about optimization, here are a few avenues
I've explored or thought about:
* __Unboxing numbers__. Right now, numbers are allocated and garbage
collected just like the rest of the graph nodes. This is far from ideal.
We could use pointers to represent numbers, by tagging their most significant
bits on 64-bit CPUs. Rather than allocating a node, the runtime will just
cast a number to a pointer, tag it, and push it on the stack.
* __Converting enumeration data types to numbers__. If no constructor
of a data type takes any arguments, then the tag uniquely identifies
each constructor. Combined with unboxed numbers, this can save unnecessary
allocations and memory accesses.
* __Special treatment for global constants__. It makes sense for
global functions to be converted into LLVM functions, but the
same is not the case for
{{< sidenote "right" "constant-note" "constants." >}}
Yeah, yeah, a constant is just a nullary function. Get
out of here with your pedantry!
{{< /sidenote >}} We can find a way to
initialize global constants once, which would save some work. To
make more constants suitable for this, we could employ the
[monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction).
* __Optimizing stack operations.__ If you read through the LLVM IR
we produce, you can see a lot of code that peeks at something twice,
or pops-then-pushes the same value, or does other absurd things. LLVM
isn't aware of the semantics of our stacks, but perhaps we could write an
optimization pass to deal with some of the more blatant instances of
this issue.
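
To give a flavor of the first idea, here is a rough sketch of tagging numbers so that they can share a word with pointers. It is only an illustration, not code from the compiler's runtime: `NUMBER_TAG`, `box_number`, `is_number`, and `unbox_number` are hypothetical names, and the sketch assumes that real pointers on the target platform never have their most significant bit set.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: real pointers never have the top bit set on the target
// platform, so that bit can mark a word as "number, not pointer".
constexpr std::uint64_t NUMBER_TAG = std::uint64_t{1} << 63;

// Pack a number into a pointer-sized word. Only the low 63 bits survive,
// so values outside the 63-bit signed range are truncated.
inline std::uint64_t box_number(std::int64_t value) {
    return (static_cast<std::uint64_t>(value) & ~NUMBER_TAG) | NUMBER_TAG;
}

// Check whether a word holds a tagged number rather than a node pointer.
inline bool is_number(std::uint64_t word) {
    return (word & NUMBER_TAG) != 0;
}

// Recover the number, sign-extending from bit 62 into bit 63.
inline std::int64_t unbox_number(std::uint64_t word) {
    assert(is_number(word));
    std::uint64_t payload = word & ~NUMBER_TAG;
    if (payload & (std::uint64_t{1} << 62)) payload |= NUMBER_TAG;
    std::int64_t result;
    std::memcpy(&result, &payload, sizeof result);
    return result;
}
```

The cost of this scheme is one bit of range per number; in exchange, pushing a number on the stack would need no allocation at all.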
If you attempt any of these, let me know how it goes, please!