From 03581f7af0971c1034626bcf68236243357ea607 Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Wed, 12 Jun 2019 21:05:30 -0700 Subject: [PATCH] Add design document --- DESIGN.md | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 DESIGN.md diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 0000000..ac2a515 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,87 @@ +# Lily Design +Lily is a lazily evaluated functional language based on basic graph reduction. The graph construction is performed via a G-Machine (an abstract virtual stack machine), and is then mapped to LLVM constructs. Due to the fact that the G-Machine uses a stack, some stack operations are very common, and are written in C in a provided "runtime" file. + +## Compiling The Lily Compiler +On my local machine, I use CMake with LLVM's config program. This doesn't seem to work on the server, so I have written a makefile that should be able to compile lily just the same. + +## Compiling Lily Code +To compile a lily file in code.lily: +``` +make # Build the compiler +./lily < code.lily +gcc runtime.c lily.o -o prog +./prog +``` +Lily requires a "main" function to link properly, which has to have 0 parameters. Lily will only print the result of evaluating the program if it is a number. + +## Quirks +1) The stack size is fixed to 16. That's pretty small, and might cause assertion errors. +2) Not having a main function will cause a linker error. +3) If a main function has arguments, this will cause a stack assertion error. + +## Stages of the Compilation Process +### LALR(1) Parsing Via Pegasus +Pegasus is a project I worked on during fall term, which implements a language agnostic LALR parser generator. Since unlike Bioson, Pegasus is not available on OSU servers (or most other machines, for that matter, as it is a homebrew program), I provide a pre-generated parser file, as well as a source grammar file. A limitaton (or rather, a design choice) of pegasus is that it does not perform semantic actions, and simply generates a parse tree. Code generated by pegasus is in `src/parser.c`. After a parse tree is generated, it is converted into an Abstract Syntax Tree in `src/parser.cpp`. +### Type Checking and Inference +Lily does not require the user to speciy types for variables. As such, it needs to infer these types from the code, and verify that programs are type-correct, as this enables more efficient code generation later. For instance, the function +``` +defn add x y = { x + y } +``` +Is inferred to be a function of `Int->Int->Int`, as `(+)` requires its operands to be numbers. Lily's numbers are closed under all binary operators (that is, all binary operators are of type `Int->Int->Int`). Type checking is done using simple substitution. When a function is created and the types of the parameters are not known, it is assumed to be `? -> ?`. Then, as more information is provided (for instance, when a parameter is used in an addition, implying that it is a number), the gaps are filled in. +### G-Machine Code Generation +A G-Machine is used to quickly construct graphs. The evaluation of functional programs is frequently backed by graph reduction, and every time a function is called a new instance of a graph is to be constructed. Since this always follows the same steps (the function body is always the same structure), this can be optimized into a series of stack-based operations similar in a way to converting arithmetic to reverse polish notation. Lily's compiler uses an "environment" to keep track of the positions of variables on the ever-changing stack, which is represented simply by a linked list, with each node increasing the distance from the top of the stack. Generated instructions may look something like the following: +``` +push 1 +push 1 +pushglobal 2 @add +mkapp +mkapp +``` +In this case, a variable from the stack at offset 1 is pushed onto the stack, followed by the variable that was formerly at offset 0 (now at offset 1 after the first variable was pushed). Then, a "global variable" is pushed. mkapp pops two values from the stack and converts them into a graph node representing function application. Running mkapp twice results in the application of the function @add to two parameters. +### LLVM Code Generation +G-Machine instructions can fairly easy be converted into a sequence of LLVM IR instructions. The only problen is that LLVM is not stack-based, while the G-Machine is. This leads to a large amount of `push`/`pop` calls in the generated code, even ones that could potentially be optimized. Because `push`, `pop`, and memory allocation instructions are so common, and because they are always the same, they are placed in a separate C file called `runtime.c`. Stubs for these functions are generated on the LLVM side, allowing the IR to call these functions without knowing their body. These calls are resolved when the program is linked. + +Because graph nodes can be of several different types, they are best represented using a tagged union. However, since LLVM does not support tagged unions (or unions in general), we simply declare several structs that are semantically not related. For instance: +``` +struct node_parent { + char tag; +}; + +struct node_num { + char tag; + int value; +}; +``` +These structs all have the same "initial structure", allowing them to be, in a way, compatible with each other. The "tag" of a node is checked to properly cast a node (LLVM's pointer cast is used for this). To get the fields in the struct, LLVM's GetElementPtr instruction is used, which allows the IR to load / store a value inside the struct. The most complicated instructions are `op`, which unwraps two number nodes and adds their values, `eq`, which unwraps two number nodes, compares their values, and pushes depending on that value either a "True" or "False", and `jump`, which is used for case analysis. + +### Linking +Finally, the `runtime.c` library expects there to be a function `main` in the Lily source code, which takes no parameters. This function will be used as the entry point - after the runtime sets up the stack, the execution will be transferred to the LLVM-generated code (which will frequently call back to C to use malloc or to manipulate the stack). + +## Sample Programs +The most basic Lily program can just return 0: +``` +main = { 0 } +``` +This is boring. We can use addition and subtraction to spice it up! +``` +main = { 1 - 1 + 3 } +``` +Lily supports arbitrary data type definitions (if you've used Haskell, these should be familliar). For instance, we can define and use a `Maybe` data structure as follows: +``` +data Maybe = { Ok(Int), Nothing } +defn maybeOr42 m = { + case m of { + Ok(i) -> { i } + Nothing -> { 42 } + } +} +defn main = { maybeOr42 (Ok 3) } +``` +Notice how during declaration, constructors for data types have the form "C(P1, P2)", while in an expression, they are treated like normal functions. + +Finally, Lily has lazy evaluation, allowing for the construction of infinite data structures: +``` +data IntList = { Nil, Cons(Int, IntList) } +defn ones = { Cons 1 ones } +``` +This will not take any computation time unless something else uses the values from the list.