Finish 13th part of the compiler series.
parent
04ab1a137c
commit
9f77f07ed2
|
@ -70,11 +70,11 @@ than _characters_, it effectively doesn't interact with the source
|
||||||
text at all, and can't determine from which line or column a token
|
text at all, and can't determine from which line or column a token
|
||||||
originated. The task of determining the locations of input tokens
|
originated. The task of determining the locations of input tokens
|
||||||
is delegated to the tokenizer -- Flex, in our case. Flex, on the
|
is delegated to the tokenizer -- Flex, in our case. Flex, on the
|
||||||
other hand, doesn't doesn't have a built-in mechanism for tracking
|
other hand, doesn't have a built-in mechanism for tracking
|
||||||
locations. Fortunately, Bison provides a `yy::location` class that
|
locations. Fortunately, Bison provides a `yy::location` class that
|
||||||
includes most of the needed functionality.
|
includes most of the needed functionality.
|
||||||
|
|
||||||
A `yy::location` consists of `begin` and `end` source position,
|
A `yy::location` consists of two source positions, `begin` and `end`,
|
||||||
which themselves are represented using lines and columns. It
|
which themselves are represented using lines and columns. It
|
||||||
also has the following methods:
|
also has the following methods:
|
||||||
|
|
||||||
|
@ -85,7 +85,7 @@ then `columns(token_length)` will move `end` to the token's end,
|
||||||
and thus make the whole `location` contain the token.
|
and thus make the whole `location` contain the token.
|
||||||
* `yy::location::lines(int)` behaves similarly to `columns`,
|
* `yy::location::lines(int)` behaves similarly to `columns`,
|
||||||
except that it advances `end` by the given number of lines,
|
except that it advances `end` by the given number of lines,
|
||||||
rather than columns.
|
rather than columns. It also resets the columns counter to `1`.
|
||||||
* `yy::location::step()` moves `begin` to where `end` is. This
|
* `yy::location::step()` moves `begin` to where `end` is. This
|
||||||
is useful for when we've finished processing a token, and want
|
is useful for when we've finished processing a token, and want
|
||||||
to move on to the next one.
|
to move on to the next one.
|
||||||
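To make the three methods above concrete, here is a minimal sketch of their behavior. The `sketch_position` and `sketch_location` types are hypothetical stand-ins written for illustration; they are not Bison's actual `yy::location`, and they only model the begin/end bookkeeping described in the list.

```cpp
#include <cassert>

// Hypothetical stand-in for a source position: a 1-based line and column.
struct sketch_position {
    int line = 1;
    int column = 1;
};

// Hypothetical stand-in for a location: a begin/end pair of positions.
struct sketch_location {
    sketch_position begin;
    sketch_position end;

    // Advance `end` by `count` columns.
    void columns(int count = 1) { end.column += count; }

    // Advance `end` by `count` lines and reset its column counter to 1.
    void lines(int count = 1) {
        if (count != 0) {
            end.line += count;
            end.column = 1;
        }
    }

    // Move `begin` to where `end` is, ready for the next token.
    void step() { begin = end; }
};

// Track the input "ab\ncd" as three tokens: `ab`, a newline, and `cd`.
inline sketch_location track_example() {
    sketch_location loc;
    loc.step(); loc.columns(2);  // "ab" spans 1:1 to 1:3
    loc.step(); loc.lines(1);    // the newline moves `end` to 2:1
    assert(loc.end.column == 1); // the column counter was reset
    loc.step(); loc.columns(2);  // "cd" spans 2:1 to 2:3
    return loc;
}
```

Calling `step()` before each token and then `columns(yyleng)` (or `lines(1)` for a newline) is the same rhythm the tokenizer follows with `LOC`.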
|
@ -102,10 +102,20 @@ We'll see why we are using `LOC` instead of something like `location` soon;
|
||||||
for now, you can treat `LOC` as if it were a global variable declared
|
for now, you can treat `LOC` as if it were a global variable declared
|
||||||
in the tokenizer. Before processing each token, we ensure that
|
in the tokenizer. Before processing each token, we ensure that
|
||||||
the `yy::location` has its `begin` and `end` at the same position,
|
the `yy::location` has its `begin` and `end` at the same position,
|
||||||
and then advance `end` by `yyleng` columns. This is sufficient
|
and then advance `end` by `yyleng` columns. This is
|
||||||
|
{{< sidenote "right" "sufficient-note" "sufficient" >}}
|
||||||
|
This doesn't hold for all languages. It may be possible for a language
|
||||||
|
to have tokens that contain <code>\n</code>, in which case,
|
||||||
|
rather than just using <code>yyleng</code>, we'd need to
|
||||||
|
add special logic to iterate over the token and detect the line
|
||||||
|
breaks.<br>
|
||||||
|
<br>
|
||||||
|
Also, this requires that the <code>end</code> of the previous token was
|
||||||
|
correctly computed.
|
||||||
|
{{< /sidenote >}}
|
||||||
to make `LOC` represent our token's source position. For
|
to make `LOC` represent our token's source position. For
|
||||||
the moment, don't worry too much about `drv`; this is the
|
the moment, don't worry too much about `drv`; this is the
|
||||||
parse driver, and we will talk about it shortly.
|
parsing driver, and we will talk about it shortly.
|
||||||
|
|
||||||
So now we have a "global" variable `LOC` that gives
|
So now we have a "global" variable `LOC` that gives
|
||||||
us the source position of the current token. To get it
|
us the source position of the current token. To get it
|
||||||
|
@ -128,7 +138,7 @@ we need to add a `yy::location` argument to each of our `ast` nodes,
|
||||||
as well as to the `pattern` subclasses, `definition_defn` and
|
as well as to the `pattern` subclasses, `definition_defn` and
|
||||||
`definition_data`. To avoid breaking all the code that creates
|
`definition_data`. To avoid breaking all the code that creates
|
||||||
AST nodes and definitions outside of the parser, we'll make this
|
AST nodes and definitions outside of the parser, we'll make this
|
||||||
argument optional. Inside of `ast.hpp`, we define it as follows:
|
argument optional. Inside of `ast.hpp`, we define a new field as follows:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}
|
{{< codelines "C++" "compiler/13/ast.hpp" 16 16 >}}
|
||||||
|
|
||||||
|
@ -136,7 +146,7 @@ Then, we add a constructor to `ast` as follows:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}
|
{{< codelines "C++" "compiler/13/ast.hpp" 18 18 >}}
|
||||||
|
|
||||||
Note that it's not default here, since `ast` itself is an
|
Note that it's not optional here, since `ast` itself is an
|
||||||
abstract class, and thus will never be constructed directly.
|
abstract class, and thus will never be constructed directly.
|
||||||
It is in the subclasses of `ast` that we provide a default
|
It is in the subclasses of `ast` that we provide a default
|
||||||
value. The change is rather mechanical, but here's an example
|
value. The change is rather mechanical, but here's an example
|
||||||
|
@ -155,7 +165,7 @@ detail:
|
||||||
Here, the `@$` character is used to reference the current
|
Here, the `@$` character is used to reference the current
|
||||||
nonterminal's location data.
|
nonterminal's location data.
|
||||||
|
|
||||||
#### Line Offsets, File Input, and the Parse Driver
|
#### Line Offsets, File Input, and the Parsing Driver
|
||||||
There are three more challenges with printing out the line
|
There are three more challenges with printing out the line
|
||||||
of code where an error occurred. First of all, to
|
of code where an error occurred. First of all, to
|
||||||
print out a line of code, we need to have that line of code
|
print out a line of code, we need to have that line of code
|
||||||
|
@ -197,7 +207,7 @@ to read source code from files, anyway.
|
||||||
To address the second issue, we can keep a mapping of line numbers
|
To address the second issue, we can keep a mapping of line numbers
|
||||||
to their locations in the source buffer. This is rather easy to
|
to their locations in the source buffer. This is rather easy to
|
||||||
maintain using an array: the first element of the array is 0,
|
maintain using an array: the first element of the array is 0,
|
||||||
which is the beginning of any line in any source file. From there,
|
which is the beginning of the first line in any source file. From there,
|
||||||
every time we encounter the character `\n`, we can push
|
every time we encounter the character `\n`, we can push
|
||||||
the current source location to the top, marking it as
|
the current source location to the top, marking it as
|
||||||
the beginning of another line. Where exactly we store this
|
the beginning of another line. Where exactly we store this
|
||||||
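The line-offset array described in this hunk can be sketched as follows. This is only an illustration under stated assumptions: `line_index`, `feed`, `line_start`, and `get_line` are hypothetical names, and the post's own code may store and update the offsets differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: records the buffer offset at which each line begins.
class line_index {
    std::vector<std::size_t> offsets{0};  // the first line starts at offset 0
    std::size_t position = 0;             // number of characters seen so far

public:
    // Scan a chunk of source text, pushing an offset after every '\n'.
    void feed(const std::string& chunk) {
        for (char c : chunk) {
            ++position;
            if (c == '\n') offsets.push_back(position);
        }
    }

    // Offset at which the given 1-based line begins.
    std::size_t line_start(std::size_t line) const {
        assert(line >= 1 && line <= offsets.size());
        return offsets[line - 1];
    }

    std::size_t line_count() const { return offsets.size(); }
};

// Extract a 1-based line from the source buffer, without its trailing '\n'.
inline std::string get_line(const std::string& source, const line_index& idx,
                            std::size_t line) {
    std::size_t start = idx.line_start(line);
    std::size_t end = line < idx.line_count()
        ? idx.line_start(line + 1) - 1  // stop just before the '\n'
        : source.size();                // the last line runs to the buffer's end
    return source.substr(start, end - start);
}
```

With an index like this, printing the offending line for an error only needs the line number stored in the token's location.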
|
@ -413,7 +423,7 @@ structure containing Flex's state.
|
||||||
|
|
||||||
Adding a parameter to Bison doesn't automatically affect
|
Adding a parameter to Bison doesn't automatically affect
|
||||||
Flex. To let Flex know that its `yylex` function must now accept
|
Flex. To let Flex know that its `yylex` function must now accept
|
||||||
the state and the parse driver, we have to define the
|
the state and the parsing driver, we have to define the
|
||||||
`YY_DECL` macro. We do this in `parse_driver.hpp`, since
|
`YY_DECL` macro. We do this in `parse_driver.hpp`, since
|
||||||
this forward declaration will be used by both Flex
|
this forward declaration will be used by both Flex
|
||||||
and Bison:
|
and Bison:
|
||||||
|
@ -532,8 +542,8 @@ Here's an example from `parsed_type`:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
|
{{< codelines "C++" "compiler/13/parsed_type.cpp" 16 23 >}}
|
||||||
|
|
||||||
In general, this change is also rather mechanical, but, to
|
In general, this change is also rather mechanical. Before we
|
||||||
maintain a balance between exceptions and assertions, here
|
move on, to maintain a balance between exceptions and assertions, here
|
||||||
are a couple more assertions from `type_env`:
|
are a couple more assertions from `type_env`:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}}
|
{{< codelines "C++" "compiler/13/type_env.cpp" 81 82 >}}
|
||||||
|
@ -581,9 +591,7 @@ while the actual type was:
|
||||||
Bool
|
Bool
|
||||||
```
|
```
|
||||||
|
|
||||||
The exclamation marks in front of the two types are due to some
|
Here's an error that was previously a `throw 0` statement in our code:
|
||||||
changes from section 2. Here's an error that was previously
|
|
||||||
a `throw 0` statement in our code:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
an error occured while compiling the program: type variable a used twice in data type definition.
|
an error occured while compiling the program: type variable a used twice in data type definition.
|
||||||
|
@ -604,7 +612,21 @@ Now that I've had some more time to think about it
|
||||||
(and now that I've returned to the compiler after
|
(and now that I've returned to the compiler after
|
||||||
a brief hiatus), I think that this was not the right call.
|
a brief hiatus), I think that this was not the right call.
|
||||||
Mangled names make sense when translating to LLVM; we certainly
|
Mangled names make sense when translating to LLVM; we certainly
|
||||||
don't want to declare two LLVM functions with the same name.
|
don't want to declare two LLVM functions
|
||||||
|
{{< sidenote "right" "mangling-note" "with the same name." >}}
|
||||||
|
By the way, LLVM has its own name mangling functionality. If you
|
||||||
|
declare two functions with the same name, they'll appear as
|
||||||
|
<code>function</code> and <code>function.0</code>. Since LLVM
|
||||||
|
uses the <code>Function*</code> C++ values to refer to functions,
|
||||||
|
as long as we keep them separate on <em>our</em> end, things will
|
||||||
|
work.<br>
|
||||||
|
<br>
|
||||||
|
However, in our compiler, name mangling occurs before LLVM is
|
||||||
|
introduced, at translation time. We could create LLVM functions
|
||||||
|
at that time, too, and associate them with variables. But then,
|
||||||
|
our G-machine instructions will be coupled to LLVM, which
|
||||||
|
would not be as clean.
|
||||||
|
{{< /sidenote >}}
|
||||||
But things are different for local variables. Our local variables
|
But things are different for local variables. Our local variables
|
||||||
are graphs on a stack, and are not actually compiled to LLVM
|
are graphs on a stack, and are not actually compiled to LLVM
|
||||||
definitions. It doesn't make sense to mangle their names, since
|
definitions. It doesn't make sense to mangle their names, since
|
||||||
|
@ -612,8 +634,8 @@ their names aren't present anywhere in the final executable.
|
||||||
It's not even "consistent" to mangle them, since global definitions
|
It's not even "consistent" to mangle them, since global definitions
|
||||||
are compiled directly to __PushGlobal__ instructions, while local
|
are compiled directly to __PushGlobal__ instructions, while local
|
||||||
variables are only referenced through the current `env`.
|
variables are only referenced through the current `env`.
|
||||||
So, I decided to reverse my decision. We will go back to
|
So, I opted to reverse my decision. We will go back to
|
||||||
placing variable names directly onto `env_var`. Here's
|
placing variable names directly into `env_var`. Here's
|
||||||
an example of this from `global_scope.cpp`:
|
an example of this from `global_scope.cpp`:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
|
{{< codelines "C++" "compiler/13/global_scope.cpp" 6 8 >}}
|
||||||
|
@ -630,8 +652,8 @@ that a variable from a __PushGlobal__ instruction
|
||||||
is referencing the right function. To achieve
|
is referencing the right function. To achieve
|
||||||
this, we change `get_mangled_name` to stop
|
this, we change `get_mangled_name` to stop
|
||||||
returning the input string if a mangled name was not
|
returning the input string if a mangled name was not
|
||||||
found; now that we _must_ have a mangled name, doing
|
found; doing so makes it impossible to check if a mangled
|
||||||
so is effectively obscuring the error. Instead,
|
name was explicitly defined. Instead,
|
||||||
we add two assertions. First, if an environment scope doesn't
|
we add two assertions. First, if an environment scope doesn't
|
||||||
contain a variable, then it _must_ have a parent.
|
contain a variable, then it _must_ have a parent.
|
||||||
If it does contain a variable, that variable _must_ have
|
If it does contain a variable, that variable _must_ have
|
||||||
|
@ -652,7 +674,19 @@ Here's the definition of `type_env::variable_data` now:
|
||||||
{{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}}
|
{{< codelines "C++" "compiler/13/type_env.hpp" 16 25 >}}
|
||||||
|
|
||||||
Since looking up a mangled name for non-global variable
|
Since looking up a mangled name for non-global variable
|
||||||
will now result in an assertion failure, we have to change
|
{{< sidenote "right" "unrepresentable-note" "will now result in an assertion failure," >}}
|
||||||
|
A very wise human at the very dawn of our species once said,
|
||||||
|
"make illegal states unrepresentable". Their friends and family were a little
|
||||||
|
busy making a fire, and didn't really understand what the heck they meant. Now,
|
||||||
|
we kind of do.<br>
|
||||||
|
<br>
|
||||||
|
It's <em>possible</em> for our <code>type_env</code> to include a
|
||||||
|
<code>variable_data</code> entry that is both global and has no mangled
|
||||||
|
name. But it doesn't have to be this way. We could define two subclasses
|
||||||
|
of <code>variable_data</code>, one global and one local,
|
||||||
|
where only the global one has a <code>mangled_name</code>
|
||||||
|
field. It would be impossible to reach this assertion failure then.
|
||||||
|
{{< /sidenote >}} we have to change
|
||||||
`ast_lid::compile` to only call `get_mangled_name` once
|
`ast_lid::compile` to only call `get_mangled_name` once
|
||||||
it ensures that the variable being compiled is, in fact,
|
it ensures that the variable being compiled is, in fact,
|
||||||
global:
|
global:
|
||||||
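The sidenote's idea of making the "global without a mangled name" state unrepresentable could look roughly like the sketch below. It swaps the two subclasses the sidenote proposes for a C++17 `std::variant`, which is a different technique with the same goal. Every name here (`local_variable`, `global_variable`, `sketch_env`, `as_global`) and the mangled name `f_1` used with it are hypothetical; none of this is the post's actual `type_env`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>

// Only global variables carry a mangled name; locals have no such field at all.
struct local_variable {};
struct global_variable { std::string mangled_name; };
using variable_kind = std::variant<local_variable, global_variable>;

// Hypothetical, much-simplified environment with a parent scope.
struct sketch_env {
    std::map<std::string, variable_kind> names;
    const sketch_env* parent = nullptr;

    // Find a variable in this scope or an enclosing one.
    const variable_kind& lookup(const std::string& name) const {
        auto it = names.find(name);
        if (it != names.end()) return it->second;
        assert(parent != nullptr);  // an unknown name must live in a parent scope
        return parent->lookup(name);
    }

    // Returns the global's data, or nullptr if the variable is local.
    const global_variable* as_global(const std::string& name) const {
        return std::get_if<global_variable>(&lookup(name));
    }
};
```

A caller in the spirit of `ast_lid::compile` would branch on the result of `as_global`: a non-null pointer gives access to `mangled_name`, while a null pointer means the variable is local and there is no mangled name to ask for.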
|
@ -712,7 +746,7 @@ They're just temporarily allowed access.
|
||||||
|
|
||||||
So, what should be the owner of all of these disparate components?
|
So, what should be the owner of all of these disparate components?
|
||||||
Thus far, that has been the `main` function, or the utility
|
Thus far, that has been the `main` function, or the utility
|
||||||
functions that it calls out to. However, this is in bad taste:
|
functions that it calls out to. However, this is sloppy:
|
||||||
we have related data and operations on it, but we don't group
|
we have related data and operations on it, but we don't group
|
||||||
them into an object. We can group all of the components of our
|
them into an object. We can group all of the components of our
|
||||||
compiler into a `compiler` object, and leave `main.cpp` with
|
compiler into a `compiler` object, and leave `main.cpp` with
|
||||||
|
@ -747,14 +781,11 @@ The methods of the compiler are arranged similarly:
|
||||||
The methods go as follows:
|
The methods go as follows:
|
||||||
|
|
||||||
* `add_default_types` adds the built-in types to the `global_env`.
|
* `add_default_types` adds the built-in types to the `global_env`.
|
||||||
At this point in the post, these types only include `Int`. However,
|
At this point, these types only include `Int`.
|
||||||
in the second section, we'll make `Bool` a built-in type, too.
|
|
||||||
* `add_binop_type` adds a single binary operator to the global
|
* `add_binop_type` adds a single binary operator to the global
|
||||||
type environment. We saw its implementation earlier: it deals
|
type environment. We saw its implementation earlier: it deals
|
||||||
with both binding a type, and setting a mangled name.
|
with both binding a type, and setting a mangled name.
|
||||||
* `add_default_types` adds the types for each binary operator,
|
* `add_default_types` adds the types for each binary operator.
|
||||||
and also for the `True` and `False` constructors (which we will
|
|
||||||
cover in the second section).
|
|
||||||
* `parse`, `typecheck`, `translate` and `compile` all do exactly
|
* `parse`, `typecheck`, `translate` and `compile` all do exactly
|
||||||
what they say. In this case, compilation refers to creating G-machine
|
what they say. In this case, compilation refers to creating G-machine
|
||||||
instructions.
|
instructions.
|
||||||
|
@ -776,7 +807,7 @@ file with the
|
||||||
file that we end up with at the end of this post.
|
file that we end up with at the end of this post.
|
||||||
|
|
||||||
Next, we have the compiler's constructor, and its `operator()`. The
|
Next, we have the compiler's constructor, and its `operator()`. The
|
||||||
latter, analogously to our parse driver, will trigger the compilation
|
latter, analogously to our parsing driver, will trigger the compilation
|
||||||
process. Their implementations are straightforward:
|
process. Their implementations are straightforward:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
|
{{< codelines "C++" "compiler/13/compiler.cpp" 131 145 >}}
|
||||||
|
@ -793,11 +824,8 @@ pretty printing code:
|
||||||
|
|
||||||
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
|
{{< codelines "C++" "compiler/13/main.cpp" 11 27 >}}
|
||||||
|
|
||||||
That's all for the cleanup! We've added locations and more errors
|
With this, we complete our transition to a compiler object.
|
||||||
the compiler, stopped throwing `0` in favor of proper exceptions
|
All that's left is to clean up the code style.
|
||||||
or assertions, made name mangling more reasonable, fixed a bug with
|
|
||||||
accidentally shadowing default functions, and organized our compilation
|
|
||||||
process into a `compiler` class.
|
|
||||||
|
|
||||||
### Keeping Things Private
|
### Keeping Things Private
|
||||||
Hand-writing or generating hundreds of trivial getters and setters
|
Hand-writing or generating hundreds of trivial getters and setters
|
||||||
|
@ -880,3 +908,58 @@ name with `f_`, much like `create_custom_function`:
|
||||||
I think that's enough. If we chose to turn more compiler
|
I think that's enough. If we chose to turn more compiler
|
||||||
data structures into classes, I think we would've quickly drowned
|
data structures into classes, I think we would've quickly drowned
|
||||||
in one-line getter and setter methods.
|
in one-line getter and setter methods.
|
||||||
|
|
||||||
|
That's all for the cleanup! We've added locations and more errors
|
||||||
|
to the compiler, stopped throwing `0` in favor of proper exceptions
|
||||||
|
or assertions, made name mangling more reasonable, fixed a bug with
|
||||||
|
accidentally shadowing default functions, organized our compilation
|
||||||
|
process into a `compiler` class, and made more things into classes.
|
||||||
|
In the next post, I hope to tackle __strings__ and __Input/Output__.
|
||||||
|
I also think that implementing __modules__ would be a good idea,
|
||||||
|
though at the moment I don't know too much on the subject. I hope
|
||||||
|
you'll join me in my future writing!
|
||||||
|
|
||||||
|
### Appendix: Optimization
|
||||||
|
When I started working on the compiler after the previous post,
|
||||||
|
I went a little overboard. I started working on optimizing the generated programs,
|
||||||
|
but eventually decided I wasn't doing a
|
||||||
|
{{< sidenote "right" "good-note" "good enough" >}}
|
||||||
|
I think authors should feel a certain degree of responsibility
|
||||||
|
for the content they create. If I do something badly, somebody
|
||||||
|
else trusts me and learns from it, who knows how much damage I've done.
|
||||||
|
I try not to do damage.<br>
|
||||||
|
<br>
|
||||||
|
If anyone reads what I write, anyway!
|
||||||
|
{{< /sidenote >}} job to present it to others,
|
||||||
|
and scrapped that part of the compiler altogether. I'm not
|
||||||
|
sure if I will try again in the near future. But,
|
||||||
|
if you're curious about optimization, here are a few avenues
|
||||||
|
I've explored or thought about:
|
||||||
|
|
||||||
|
* __Unboxing numbers__. Right now, numbers are allocated and garbage
|
||||||
|
collected just like the rest of the graph nodes. This is far from ideal.
|
||||||
|
We could use pointers to represent numbers, by tagging their most significant
|
||||||
|
bits on 64-bit CPUs. Rather than allocating a node, the runtime will just
|
||||||
|
cast a number to a pointer, tag it, and push it on the stack.
|
||||||
|
* __Converting enumeration data types to numbers__. If no constructor
|
||||||
|
of a data type takes any arguments, then the tag uniquely identifies
|
||||||
|
each constructor. Combined with unboxed numbers, this can save unnecessary
|
||||||
|
allocations and memory accesses.
|
||||||
|
* __Special treatment for global constants__. It makes sense for
|
||||||
|
global functions to be converted into LLVM functions, but the
|
||||||
|
same is not the case for
|
||||||
|
{{< sidenote "right" "constant-note" "constants." >}}
|
||||||
|
Yeah, yeah, a constant is just a nullary function. Get
|
||||||
|
out of here with your pedantry!
|
||||||
|
{{< /sidenote >}} We can find a way to
|
||||||
|
initialize global constants once, which would save some work. To
|
||||||
|
make more constants suitable for this, we could employ
|
||||||
|
[monomorphism restriction](https://wiki.haskell.org/Monomorphism_restriction).
|
||||||
|
* __Optimizing stack operations.__ If you read through the LLVM IR
|
||||||
|
we produce, you can see a lot of code that peeks at something twice,
|
||||||
|
or pops-then-pushes the same value, or does other absurd things. LLVM
|
||||||
|
isn't aware of the semantics of our stacks, but perhaps we could write an
|
||||||
|
optimization pass to deal with some of the more blatant instances of
|
||||||
|
this issue.
|
||||||
|
|
||||||
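As a rough illustration of the first idea, unboxing numbers, here is a sketch of tagging the most significant bit of a 64-bit word. It assumes that real node pointers never have their top bit set (typical for user-space addresses on current 64-bit platforms, but not guaranteed), that integers fit in 63 bits, and that signed right shift is arithmetic (guaranteed since C++20, common practice before). The function names are hypothetical and are not part of the compiler's runtime.

```cpp
#include <cassert>
#include <cstdint>

// The most significant bit marks a word as an unboxed number rather than a pointer.
constexpr std::uint64_t TAG_BIT = std::uint64_t{1} << 63;

// Pack a 63-bit signed integer into a tagged word; no node is allocated.
inline std::uint64_t box_small_int(std::int64_t n) {
    // The value must fit in 63 bits, since one bit is spent on the tag.
    assert(n >= -(std::int64_t{1} << 62) && n < (std::int64_t{1} << 62));
    return (static_cast<std::uint64_t>(n) & ~TAG_BIT) | TAG_BIT;
}

// Check whether a word on the stack is an unboxed number.
inline bool is_small_int(std::uint64_t word) {
    return (word & TAG_BIT) != 0;
}

// Recover the integer: shift the tag out, then shift back to sign-extend.
inline std::int64_t unbox_small_int(std::uint64_t word) {
    return static_cast<std::int64_t>(word << 1) >> 1;
}
```

The garbage collector would also need to check `is_small_int` before following a stack entry, since a tagged word does not point to a node.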
|
If you attempt any of these, let me know how it goes, please!
|
||||||
|
|