Add design document

2019-06-12 21:05:30 -07:00 · 2019-06-12 21:05:30 -07:00 · 03581f7af0
commit 03581f7af0
parent 7ee084385c
1 changed files with 87 additions and 0 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@ -0,0 +1,87 @@
+# Lily Design
+Lily is a lazily evaluated functional language based on basic graph reduction. The graph construction is performed via a G-Machine (an abstract virtual stack machine), and is then mapped to LLVM constructs. Due to the fact that the G-Machine uses a stack, some stack operations are very common, and are written in C in a provided "runtime" file.
+
+## Compiling The Lily Compiler
+On my local machine, I use CMake with LLVM's config program. This doesn't seem to work on the server, so I have written a makefile that should be able to compile lily just the same.
+
+## Compiling Lily Code
+To compile a lily file in code.lily:
+```
+make # Build the compiler
+./lily < code.lily
+gcc runtime.c lily.o -o prog
+./prog
+```
+Lily requires a "main" function to link properly, which has to have 0 parameters. Lily will only print the result of evaluating the program if it is a number.
+
+## Quirks
+1) The stack size is fixed to 16. That's pretty small, and might cause assertion errors.
+2) Not having a main function will cause a linker error.
+3) If a main function has arguments, this will cause a stack assertion error.
+
+## Stages of the Compilation Process
+### LALR(1) Parsing Via Pegasus
+Pegasus is a project I worked on during fall term, which implements a language agnostic LALR parser generator. Since unlike Bioson, Pegasus is not available on OSU servers (or most other machines, for that matter, as it is a homebrew program), I provide a pre-generated parser file, as well as a source grammar file. A limitaton (or rather, a design choice) of pegasus is that it does not perform semantic actions, and simply generates a parse tree. Code generated by pegasus is in `src/parser.c`. After a parse tree is generated, it is converted into an Abstract Syntax Tree in `src/parser.cpp`. 
+### Type Checking and Inference
+Lily does not require the user to speciy types for variables. As such, it needs to infer these types from the code, and verify that programs are type-correct, as this enables more efficient code generation later. For instance, the function
+```
+defn add x y = { x + y }
+```
+Is inferred to be a function of `Int->Int->Int`, as `(+)` requires its operands to be numbers. Lily's numbers are closed under all binary operators (that is, all binary operators are of type `Int->Int->Int`). Type checking is done using simple substitution. When a function is created and the types of the parameters are not known, it is assumed to be `? -> ?`. Then, as more information is provided (for instance, when a parameter is used in an addition, implying that it is a number), the gaps are filled in.
+### G-Machine Code Generation
+A G-Machine is used to quickly construct graphs. The evaluation of functional programs is frequently backed by graph reduction, and every time a function is called a new instance of a graph is to be constructed. Since this always follows the same steps (the function body is always the same structure), this can be optimized into a series of stack-based operations similar in a way to converting arithmetic to reverse polish notation. Lily's compiler uses an "environment" to keep track of the positions of variables on the ever-changing stack, which is represented simply by a linked list, with each node increasing the distance from the top of the stack. Generated instructions may look something like the following:
+```
+push 1
+push 1
+pushglobal 2 @add
+mkapp
+mkapp
+```
+In this case, a variable from the stack at offset 1 is pushed onto the stack, followed by the variable that was formerly at offset 0 (now at offset 1 after the first variable was pushed). Then, a "global variable" is pushed. mkapp pops two values from the stack and converts them into a graph node representing function application. Running mkapp twice results in the application of the function @add to two parameters.
+### LLVM Code Generation
+G-Machine instructions can fairly easy be converted into a sequence of LLVM IR instructions. The only problen is that LLVM is not stack-based, while the G-Machine is. This leads to a large amount of `push`/`pop` calls in the generated code, even ones that could potentially be optimized. Because `push`, `pop`, and memory allocation instructions are so common, and because they are always the same, they are placed in a separate C file called `runtime.c`. Stubs for these functions are generated on the LLVM side, allowing the IR to call these functions without knowing their body. These calls are resolved when the program is linked.
+
+Because graph nodes can be of several different types, they are best represented using a tagged union. However, since LLVM does not support tagged unions (or unions in general), we simply declare several structs that are semantically not related. For instance:
+```
+struct node_parent {
+    char tag;
+};
+
+struct node_num {
+    char tag;
+    int value;
+};
+```
+These structs all have the same "initial structure", allowing them to be, in a way, compatible with each other. The "tag" of a node is checked to properly cast a node (LLVM's pointer cast is used for this). To get the fields in the struct, LLVM's GetElementPtr instruction is used, which allows the IR to load / store a value inside the struct. The most complicated instructions are `op`, which unwraps two number nodes and adds their values, `eq`, which unwraps two number nodes, compares their values, and pushes depending on that value either a "True" or "False", and `jump`, which is used for case analysis.
+
+### Linking
+Finally, the `runtime.c` library expects there to be a function `main` in the Lily source code, which takes no parameters. This function will be used as the entry point - after the runtime sets up the stack, the execution will be transferred to the LLVM-generated code (which will frequently call back to C to use malloc or to manipulate the stack).
+
+## Sample Programs
+The most basic Lily program can just return 0:
+```
+main = { 0 }
+```
+This is boring. We can use addition and subtraction to spice it up!
+```
+main = { 1 - 1 + 3 }
+```
+Lily supports arbitrary data type definitions (if you've used Haskell, these should be familliar). For instance, we can define and use a `Maybe` data structure as follows:
+```
+data Maybe = { Ok(Int), Nothing }
+defn maybeOr42 m = {
+    case m of {
+        Ok(i) -> { i }
+        Nothing -> { 42 }
+    }
+}
+defn main = { maybeOr42 (Ok 3) }
+```
+Notice how during declaration, constructors for data types have the form "C(P1, P2)", while in an expression, they are treated like normal functions.
+
+Finally, Lily has lazy evaluation, allowing for the construction of infinite data structures:
+```
+data IntList = { Nil, Cons(Int, IntList) }
+defn ones = { Cons 1 ones }
+```
+This will not take any computation time unless something else uses the values from the list.