---
title: Compiling a Functional Language Using C++, Part 4 - Small Improvements
date: 2019-08-06T14:26:38-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
---
We've done quite a big push in the previous post. We defined
type rules for our language, implemented unification,
and then used unification to enforce these rules for
our program. The post was pretty long, and even then we
weren't able to fit quite everything into it.
For instance, we threw 0 whenever an error occurred. This
gives us no indication of what actually went wrong. We should
probably define an exception class, one that can contain
information about the error, and report it to the user.
Also, when there's no error, our compiler doesn't
really tell us anything at all about the code besides
the number of definitions. We probably want to see the types
of these definitions, or at least some intermediate information.
At the very least, we want to have the __ability__ to see
this information.
Finally, we have no build system. We are creating more
and more source files, and so far (unless you've taken
initiative), we've been compiling them by hand. We want
to only compile source files that have changed,
and we want to have a standard definition of how to
build our program.
### Printing Syntax Trees
Let's start by printing the trees we get from our parser.
This is long overdue - we've had no way to verify the structure
of what our parser returned to us since Part 2. We'll print
the trees top-down, with the children of a node
indented one block further than the node itself. For this,
we'll make a new virtual function with the signature:
```C++
virtual void print(int indent, std::ostream& to) const;
```
We'll include a similar printing function into our
pattern struct, too:
```C++
virtual void print(std::ostream& to) const;
```
Let's take a look at the implementation. For `ast_int`,
`ast_lid`, and `ast_uid`:
{{< codelines "C++" "compiler/04/ast.cpp" 19 22 >}}
{{< codelines "C++" "compiler/04/ast.cpp" 28 31 >}}
{{< codelines "C++" "compiler/04/ast.cpp" 37 40 >}}
With `ast_binop` things get a bit more interesting.
We call `print` recursively on the children of the
`binop` node:
{{< codelines "C++" "compiler/04/ast.cpp" 46 51 >}}
The same idea for `ast_app`:
{{< codelines "C++" "compiler/04/ast.cpp" 67 72 >}}
Finally, just like `ast_case::typecheck` called
`pattern::match`, `ast_case::print` calls `pattern::print`:
{{< codelines "C++" "compiler/04/ast.cpp" 84 93 >}}
We follow the same implementation strategy for patterns,
but we don't need indentation, or recursion:
{{< codelines "C++" "compiler/04/ast.cpp" 115 117 >}}
{{< codelines "C++" "compiler/04/ast.cpp" 123 128 >}}
In `main`, let's print the bodies of each function we receive from the parser:
{{< codelines "C++" "compiler/04/main.cpp" 47 56 >}}
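To make the indentation scheme concrete, here's a minimal, self-contained sketch of the same idea. It is not the code from `ast.cpp`: the `node`, `int_node`, and `binop_node` types, the `print_indent` helper, and the two-space indentation width are all assumptions made for illustration.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: writes two spaces per indentation level.
void print_indent(int n, std::ostream& to) {
    while (n-- > 0) to << "  ";
}

// Hypothetical stand-in for the AST base class.
struct node {
    virtual ~node() = default;
    virtual void print(int indent, std::ostream& to) const = 0;
};

struct int_node : node {
    int value;
    explicit int_node(int v) : value(v) {}
    void print(int indent, std::ostream& to) const override {
        print_indent(indent, to);
        to << "INT: " << value << '\n';
    }
};

struct binop_node : node {
    char op;
    std::unique_ptr<node> left, right;
    binop_node(char o, std::unique_ptr<node> l, std::unique_ptr<node> r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
    void print(int indent, std::ostream& to) const override {
        print_indent(indent, to);
        to << "BINOP: " << op << '\n';
        // Children are printed one level deeper than their parent.
        left->print(indent + 1, to);
        right->print(indent + 1, to);
    }
};

// Builds the tree for 1 + 2 and renders it to a string.
std::string render_example() {
    binop_node tree('+', std::make_unique<int_node>(1), std::make_unique<int_node>(2));
    std::ostringstream out;
    tree.print(0, out);
    return out.str();
}
```

Calling `render_example()` produces the `BINOP: +` line followed by the two `INT` lines, each indented by one level.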
### Printing Types
Types are another thing that we want to be able to inspect, so let's
add a similar print method to them:
```C++
virtual void print(const type_mgr& mgr, std::ostream& to) const;
```
We need the type manager so we can follow substitutions.
The implementation is simple enough:
{{< codelines "C++" "compiler/04/type.cpp" 6 24 >}}
Let's also print out the types we infer. We'll make it a separate loop
at the bottom of the `typecheck_program` function, because it's mostly just
for debugging purposes:
{{< codelines "C++" "compiler/04/main.cpp" 34 38 >}}
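If it's unclear why printing needs the type manager, the following sketch may help. It assumes a simplified manager (`type_mgr_sketch`) that maps type variable names directly to their bindings, and hypothetical `ty`, `ty_base`, `ty_var`, and `ty_arr` types; the real classes in `type.cpp` may differ. The arrow format simply mimics the output shown later in this post.

```cpp
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

struct type_mgr_sketch;

struct ty {
    virtual ~ty() = default;
    virtual void print(const type_mgr_sketch& mgr, std::ostream& to) const = 0;
};
using ty_ptr = std::shared_ptr<ty>;

// Assumed stand-in for the type manager: variable name -> bound type.
struct type_mgr_sketch {
    std::map<std::string, ty_ptr> bindings;
};

struct ty_base : ty {
    std::string name;
    explicit ty_base(std::string n) : name(std::move(n)) {}
    void print(const type_mgr_sketch&, std::ostream& to) const override { to << name; }
};

struct ty_var : ty {
    std::string name;
    explicit ty_var(std::string n) : name(std::move(n)) {}
    void print(const type_mgr_sketch& mgr, std::ostream& to) const override {
        auto it = mgr.bindings.find(name);
        if (it != mgr.bindings.end()) {
            it->second->print(mgr, to); // follow the substitution
        } else {
            to << name;                 // still unbound: print the variable itself
        }
    }
};

struct ty_arr : ty {
    ty_ptr left, right;
    ty_arr(ty_ptr l, ty_ptr r) : left(std::move(l)), right(std::move(r)) {}
    void print(const type_mgr_sketch& mgr, std::ostream& to) const override {
        left->print(mgr, to);
        to << " -> (";
        right->print(mgr, to);
        to << ")";
    }
};

// Convenience wrapper that renders a type to a string.
std::string show(const type_mgr_sketch& mgr, const ty_ptr& t) {
    std::ostringstream out;
    t->print(mgr, out);
    return out.str();
}
```

With an empty manager, the type `a -> Int` prints as `a -> (Int)`. Once `a` is bound to `List`, the very same type prints as `List -> (Int)`, which is why the manager has to be passed in.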
### Fixing Bugs
We actually discover not one, but two bugs in our implementation thanks
to the output we get from printing trees and types.
Observe the output for `works3.txt`:
```
length l:
CASE:
Nil
INT: 0
*: Int -> (Int -> (Int))
+: Int -> (Int -> (Int))
-: Int -> (Int -> (Int))
/: Int -> (Int -> (Int))
Cons: List -> (Int -> (List))
Nil: List
length: List -> (Int)
2
```
First, we're missing the `Cons` branch. The culprit is `parser.y`, specifically
this line:
```C++
: branches branch { $$ = std::move($1); $1.push_back(std::move($2)); }
```
Notice that we move our list of branches out of `$1`. However, when we
`push_back`, we use `$1` again. That's wrong! We need to `push_back`
to `$$` instead:
{{< codelines "C++" "compiler/04/parser.y" 110 110 >}}
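The same mistake is easy to reproduce outside of Bison. In the sketch below, `append_buggy` and `append_fixed` are hypothetical functions that stand in for the two versions of the grammar action, with `branches` playing the role of `$1` and `result` the role of `$$`. Branches are plain strings here, purely for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Mirrors the buggy action: the list is moved into the result first,
// and the new branch is then pushed onto the moved-from source.
std::vector<std::string> append_buggy(std::vector<std::string> branches, std::string branch) {
    std::vector<std::string> result = std::move(branches);
    branches.push_back(std::move(branch)); // lands in the wrong vector and is lost
    return result;
}

// Mirrors the fixed action: the new branch goes into the result.
std::vector<std::string> append_fixed(std::vector<std::string> branches, std::string branch) {
    std::vector<std::string> result = std::move(branches);
    result.push_back(std::move(branch));
    return result;
}
```

Starting from a list that contains only `Nil`, `append_buggy` returns a one-element list no matter what we try to add, while `append_fixed` returns both `Nil` and `Cons`. This matches the missing branch in the output above.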
Next, observe that `Cons` has type `List -> Int -> List`. That's not right,
since `Int` comes first in our definition. The culprit is this fragment of code:
```C++
for(auto& type_name : constructor->types) {
type_ptr type = type_ptr(new type_base(type_name));
full_type = type_ptr(new type_arr(type, full_type));
}
```
Remember how we built the function type backwards in Part 3? We have to do the same here.
We replace the fragment with the proper reverse iteration:
{{< codelines "C++" "compiler/04/definition.cpp" 37 40 >}}
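To see why the direction of iteration matters, here's a small sketch that builds the printed form of a constructor's type from strings instead of `type_ptr` values. The `arrow`, `build_forward`, and `build_reverse` functions are hypothetical helpers, not part of the compiler.

```cpp
#include <string>
#include <vector>

// Formats a function type the same way our type printer does.
std::string arrow(const std::string& from, const std::string& to) {
    return from + " -> (" + to + ")";
}

// Wraps the result type in parameters in declaration order.
// The last parameter ends up outermost, which reverses the order.
std::string build_forward(const std::vector<std::string>& params, const std::string& result) {
    std::string full = result;
    for (const auto& param : params) full = arrow(param, full);
    return full;
}

// Wraps the result type in parameters from last to first,
// so that the first parameter ends up outermost.
std::string build_reverse(const std::vector<std::string>& params, const std::string& result) {
    std::string full = result;
    for (auto it = params.rbegin(); it != params.rend(); ++it) full = arrow(*it, full);
    return full;
}
```

For a constructor with parameters `Int` and `List` that returns `List`, `build_forward` gives `List -> (Int -> (List))`, the incorrect type we saw in the output, while `build_reverse` gives `Int -> (List -> (List))`.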
### Throwing Exceptions
Throwing 0 is never a good idea. Such an exception doesn't contain any information
that we may find useful in debugging, nor any information that would benefit
the users of the compiler. Instead, let's define our own exception classes,
and throw them instead. We'll make two:
{{< codeblock "C++" "compiler/04/error.hpp" >}}
Only one function needs to be implemented, and it's pretty boring:
{{< codeblock "C++" "compiler/04/error.cpp" >}}
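For readers following along without the repository, a hierarchy of this shape could look roughly like the sketch below. The `_sketch` names are hypothetical, the two types are stored as plain strings rather than as type pointers, and returning the description from `what()` is an assumption about what the single implemented function does.

```cpp
#include <exception>
#include <string>
#include <utility>

// General type error: carries a human-readable description.
struct type_error_sketch : std::exception {
    std::string description;
    explicit type_error_sketch(std::string d) : description(std::move(d)) {}
    const char* what() const noexcept override { return description.c_str(); }
};

// Unification failure: additionally remembers the two types involved.
struct unification_error_sketch : type_error_sketch {
    std::string left, right; // strings in this sketch, for simplicity
    unification_error_sketch(std::string l, std::string r)
        : type_error_sketch("failed to unify types"),
          left(std::move(l)), right(std::move(r)) {}
};
```

Because the unification error derives from the general one, a handler for `type_error_sketch` also catches unification failures, and a more specific handler can still get at `left` and `right`.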
It's time to throw these instead of 0. Let's take a look at the places
we do so.
First, we throw 0 in `type.cpp`, in the `type_mgr::unify` method. This is
where our `unification_error` comes in. The error will
contain the two types that we failed to unify, which we will
later report to the user:
{{< codelines "C++" "compiler/04/type.cpp" 91 91 >}}
Next up, we have a few throws in `ast.cpp`. The first is in `op_string`, but
we will simply replace it with `return "??"`, which will be caught later on
(either way, the case expression falling through would be a compiler bug,
since the user has no way of providing an invalid binary operator). The
first throw we actually need to address is in `ast_binop::typecheck`, in the case
that we don't find a type for a binary operator. We report this
directly:
{{< codelines "C++" "compiler/04/ast.cpp" 57 57 >}}
We will introduce a new exception into `ast_case::typecheck`. Previously,
we simply passed the type of the expression to be case analyzed into
the pattern matching method. However, since we don't want
case analysis on functions, we ensure that the type of the expression
is `type_base`. If not, we report this:
{{< codelines "C++" "compiler/04/ast.cpp" 107 110 >}}
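A check like this is commonly written with `dynamic_cast`. The sketch below shows the idea on hypothetical `ty`, `ty_base`, and `ty_arr` types. It throws a plain `std::runtime_error` with a made-up message, and it skips resolving type variables through the type manager, which a real check would have to do first.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct ty { virtual ~ty() = default; };
using ty_ptr = std::shared_ptr<ty>;

struct ty_base : ty {
    std::string name;
    explicit ty_base(std::string n) : name(std::move(n)) {}
};

struct ty_arr : ty {
    ty_ptr left, right;
    ty_arr(ty_ptr l, ty_ptr r) : left(std::move(l)), right(std::move(r)) {}
};

// Returns the base type under analysis, or throws if the
// expression has any other kind of type (such as a function).
const ty_base& require_base(const ty_ptr& t) {
    const auto* base = dynamic_cast<const ty_base*>(t.get());
    if (!base) throw std::runtime_error("attempting case analysis of non-data type");
    return *base;
}
```

Passing a `List` type returns it unchanged, while passing `Int -> Int` throws.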
The next exception is in `pattern_constr::match`. It occurs
when the pattern has a constructor we don't recognize, and
that's exactly what we report:
{{< codelines "C++" "compiler/04/ast.cpp" 132 134 >}}
The next exception occurs in a loop, when we bind
types for each of the constructor pattern's variables.
We throw when we are unable to cast the remaining
constructor type to a `type_arr`. Conceptually,
this means that the pattern wants to apply the
constructor to more parameters than it actually
takes:
{{< codelines "C++" "compiler/04/ast.cpp" 138 138 >}}
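Here's a rough sketch of that loop in isolation. The `ty` types, the `bind_params` function, and the error message are all hypothetical, and the real code throws our own exception class rather than `std::runtime_error`. Each pattern variable consumes one arrow from the constructor's type, and we fail as soon as there is no arrow left to consume.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct ty { virtual ~ty() = default; };
using ty_ptr = std::shared_ptr<ty>;

struct ty_base : ty {
    std::string name;
    explicit ty_base(std::string n) : name(std::move(n)) {}
};

struct ty_arr : ty {
    ty_ptr left, right;
    ty_arr(ty_ptr l, ty_ptr r) : left(std::move(l)), right(std::move(r)) {}
};

// Binds each pattern variable to the next parameter type of the constructor.
std::map<std::string, ty_ptr> bind_params(ty_ptr ctor_type, const std::vector<std::string>& vars) {
    std::map<std::string, ty_ptr> bound;
    for (const auto& var : vars) {
        auto* arr = dynamic_cast<ty_arr*>(ctor_type.get());
        // No arrow left: the pattern names more parameters than the constructor takes.
        if (!arr) throw std::runtime_error("too many parameters in constructor pattern");
        bound[var] = arr->left;
        ty_ptr next = arr->right; // copy before replacing the type that owns arr
        ctor_type = std::move(next);
    }
    return bound;
}
```

With `Cons` typed as `Int -> (List -> (List))`, a pattern with two variables binds the first to `Int` and the second to `List`, while a pattern with three variables throws.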
We remove the last throw at the bottom of `pattern_constr::match`.
This is because once unification succeeds, we know
that the return type of the pattern is a base type since
we know the type of the case expression is a base type
(we know this because we added that check to `ast_case::typecheck`).
Finally, let's catch and report these exceptions. We could do it
in `typecheck_program`, but I think doing so in `main` is neater.
Since printing types requires a `type_mgr`, we'll move the
declarations of both `type_mgr` and `type_env` to the top of
main, and pass them to `typecheck_program` as parameters. Then,
we can surround the call to `typecheck_program` with
try/catch:
{{< codelines "C++" "compiler/04/main.cpp" 57 69 >}}
We use some [ANSI escape codes](https://en.wikipedia.org/wiki/ANSI_escape_code)
to color the types in the case of a unification error.
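As a rough illustration of both the handler ordering and the coloring, consider the sketch below. The `mismatch` exception, the `colored` helper, and `run_and_report` are hypothetical names, and the specific colors (34 for blue, 32 for green) are an arbitrary choice rather than the ones used in `main.cpp`.

```cpp
#include <exception>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical error carrying the two types that failed to unify.
struct mismatch : std::runtime_error {
    std::string left, right;
    mismatch(std::string l, std::string r)
        : std::runtime_error("failed to unify types"),
          left(std::move(l)), right(std::move(r)) {}
};

// Wraps text in an ANSI SGR color code, then resets the terminal color.
std::string colored(const std::string& text, int code) {
    return "\033[" + std::to_string(code) + "m" + text + "\033[0m";
}

// Runs an action and turns any error into a message for the user.
template <typename F>
std::string run_and_report(F&& action) {
    std::ostringstream out;
    try {
        action();
        out << "ok";
    } catch (const mismatch& err) {
        // The most specific handler must come first.
        out << err.what() << ": " << colored(err.left, 34)
            << " vs " << colored(err.right, 32);
    } catch (const std::exception& err) {
        out << err.what();
    }
    return out.str();
}
```

If the handlers were swapped, the `std::exception` handler would catch every `mismatch` too, and we would never get to print the two types.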
### Setting up CMake
We will set up CMake as our build system. This would be extremely easy
if not for Flex and Bison, but it's not hard either way. We start with the usual:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 1 2 >}}
Next, we want to set up Flex and Bison. CMake provides a module for each of them:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 4 5 >}}
We now have access to commands that allow us to tell CMake about our parser
and tokenizer (or scanner). We use them as follows:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 6 12 >}}
We also want CMake to know that the scanner needs the parser's header file
in order to compile. We add this dependency:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 13 13 >}}
Finally, we add our source code to a CMake target. We use
the `BISON_parser_OUTPUTS` and `FLEX_scanner_OUTPUTS` variables to
pass in the source files generated by Flex and Bison:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 15 23 >}}
Almost there! `parser.cpp` will be generated in the `build` directory
during an out-of-source build, and so will `parser.hpp`. When building,
`parser.cpp` will try to look for `ast.hpp`, and `main.cpp` will look for
`parser.hpp`. We want them to be able to find each other, so we
add both the source directory and the build (binary) directory to
the list of include directories:
{{< codelines "CMake" "compiler/04/CMakeLists.txt" 24 25 >}}
That's it for CMake! Let's try our build:
```
cmake -S . -B build
cd build && make -j8
```
### Updated Code
We've made a lot of changes to the codebase, and I've only shown snippets of the code
so far. If you'd like to see the whole codebase, you can go to my site's git repository
and check out [the code so far](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/04).
Having taken this little break, it's time for our next push. We will define
how our programs will be evaluated in [Part 5 - Execution]({{< relref "05_compiler_execution.md" >}}).