lily/DESIGN.md

# Lily Design
Lily is a lazily evaluated functional language based on basic graph reduction. The graph construction is performed via a G-Machine (an abstract virtual stack machine), and is then mapped to LLVM constructs. Due to the fact that the G-Machine uses a stack, some stack operations are very common, and are written in C in a provided "runtime" file.

## Compiling The Lily Compiler
On my local machine, I use CMake with LLVM's config program. This doesn't seem to work on the server, so I have written a makefile that should be able to compile lily just the same.

## Compiling Lily Code
To compile a lily file in code.lily:
```
make # Build the compiler
./lily < code.lily
gcc runtime.c lily.o -o prog
./prog
```
Lily requires a "main" function to link properly, which has to have 0 parameters. Lily will only print the result of evaluating the program if it is a number.

## Quirks
1) The stack size is fixed to 16. That's pretty small, and might cause assertion errors.
2) Not having a main function will cause a linker error.
3) If a main function has arguments, this will cause a stack assertion error.

## Stages of the Compilation Process
### LALR(1) Parsing Via Pegasus
Pegasus is a project I worked on during fall term, which implements a language agnostic LALR parser generator. Since unlike Bioson, Pegasus is not available on OSU servers (or most other machines, for that matter, as it is a homebrew program), I provide a pre-generated parser file, as well as a source grammar file. A limitaton (or rather, a design choice) of pegasus is that it does not perform semantic actions, and simply generates a parse tree. Code generated by pegasus is in `src/parser.c`. After a parse tree is generated, it is converted into an Abstract Syntax Tree in `src/parser.cpp`. 
### Type Checking and Inference
Lily does not require the user to speciy types for variables. As such, it needs to infer these types from the code, and verify that programs are type-correct, as this enables more efficient code generation later. For instance, the function
```
defn add x y = { x + y }
```
Is inferred to be a function of `Int->Int->Int`, as `(+)` requires its operands to be numbers. Lily's numbers are closed under all binary operators (that is, all binary operators are of type `Int->Int->Int`). Type checking is done using simple substitution. When a function is created and the types of the parameters are not known, it is assumed to be `? -> ?`. Then, as more information is provided (for instance, when a parameter is used in an addition, implying that it is a number), the gaps are filled in.
### G-Machine Code Generation
A G-Machine is used to quickly construct graphs. The evaluation of functional programs is frequently backed by graph reduction, and every time a function is called a new instance of a graph is to be constructed. Since this always follows the same steps (the function body is always the same structure), this can be optimized into a series of stack-based operations similar in a way to converting arithmetic to reverse polish notation. Lily's compiler uses an "environment" to keep track of the positions of variables on the ever-changing stack, which is represented simply by a linked list, with each node increasing the distance from the top of the stack. Generated instructions may look something like the following:
```
push 1
push 1
pushglobal 2 @add
mkapp
mkapp
```
In this case, a variable from the stack at offset 1 is pushed onto the stack, followed by the variable that was formerly at offset 0 (now at offset 1 after the first variable was pushed). Then, a "global variable" is pushed. mkapp pops two values from the stack and converts them into a graph node representing function application. Running mkapp twice results in the application of the function @add to two parameters.
### LLVM Code Generation
G-Machine instructions can fairly easy be converted into a sequence of LLVM IR instructions. The only problen is that LLVM is not stack-based, while the G-Machine is. This leads to a large amount of `push`/`pop` calls in the generated code, even ones that could potentially be optimized. Because `push`, `pop`, and memory allocation instructions are so common, and because they are always the same, they are placed in a separate C file called `runtime.c`. Stubs for these functions are generated on the LLVM side, allowing the IR to call these functions without knowing their body. These calls are resolved when the program is linked.

Because graph nodes can be of several different types, they are best represented using a tagged union. However, since LLVM does not support tagged unions (or unions in general), we simply declare several structs that are semantically not related. For instance:
```
struct node_parent {
    char tag;
};

struct node_num {
    char tag;
    int value;
};
```
These structs all have the same "initial structure", allowing them to be, in a way, compatible with each other. The "tag" of a node is checked to properly cast a node (LLVM's pointer cast is used for this). To get the fields in the struct, LLVM's GetElementPtr instruction is used, which allows the IR to load / store a value inside the struct. The most complicated instructions are `op`, which unwraps two number nodes and adds their values, `eq`, which unwraps two number nodes, compares their values, and pushes depending on that value either a "True" or "False", and `jump`, which is used for case analysis.

### Linking
Finally, the `runtime.c` library expects there to be a function `main` in the Lily source code, which takes no parameters. This function will be used as the entry point - after the runtime sets up the stack, the execution will be transferred to the LLVM-generated code (which will frequently call back to C to use malloc or to manipulate the stack).

## Sample Programs
The most basic Lily program can just return 0:
```
main = { 0 }
```
This is boring. We can use addition and subtraction to spice it up!
```
main = { 1 - 1 + 3 }
```
Lily supports arbitrary data type definitions (if you've used Haskell, these should be familliar). For instance, we can define and use a `Maybe` data structure as follows:
```
data Maybe = { Ok(Int), Nothing }
defn maybeOr42 m = {
    case m of {
        Ok(i) -> { i }
        Nothing -> { 42 }
    }
}
defn main = { maybeOr42 (Ok 3) }
```
Notice how during declaration, constructors for data types have the form "C(P1, P2)", while in an expression, they are treated like normal functions.

Finally, Lily has lazy evaluation, allowing for the construction of infinite data structures:
```
data IntList = { Nil, Cons(Int, IntList) }
defn ones = { Cons 1 ones }
```
This will not take any computation time unless something else uses the values from the list.
Add design document 2019-06-12 21:05:30 -07:00			`# Lily Design`
			`Lily is a lazily evaluated functional language based on basic graph reduction. The graph construction is performed via a G-Machine (an abstract virtual stack machine), and is then mapped to LLVM constructs. Due to the fact that the G-Machine uses a stack, some stack operations are very common, and are written in C in a provided "runtime" file.`

			`## Compiling The Lily Compiler`
			`On my local machine, I use CMake with LLVM's config program. This doesn't seem to work on the server, so I have written a makefile that should be able to compile lily just the same.`

			`## Compiling Lily Code`
			`To compile a lily file in code.lily:`
			```
			`make # Build the compiler`
			`./lily < code.lily`
			`gcc runtime.c lily.o -o prog`
			`./prog`
			```
			`Lily requires a "main" function to link properly, which has to have 0 parameters. Lily will only print the result of evaluating the program if it is a number.`

			`## Quirks`
			`1) The stack size is fixed to 16. That's pretty small, and might cause assertion errors.`
			`2) Not having a main function will cause a linker error.`
			`3) If a main function has arguments, this will cause a stack assertion error.`

			`## Stages of the Compilation Process`
			`### LALR(1) Parsing Via Pegasus`
			Pegasus is a project I worked on during fall term, which implements a language agnostic LALR parser generator. Since unlike Bioson, Pegasus is not available on OSU servers (or most other machines, for that matter, as it is a homebrew program), I provide a pre-generated parser file, as well as a source grammar file. A limitaton (or rather, a design choice) of pegasus is that it does not perform semantic actions, and simply generates a parse tree. Code generated by pegasus is in `src/parser.c`. After a parse tree is generated, it is converted into an Abstract Syntax Tree in `src/parser.cpp`.
			`### Type Checking and Inference`
			`Lily does not require the user to speciy types for variables. As such, it needs to infer these types from the code, and verify that programs are type-correct, as this enables more efficient code generation later. For instance, the function`
			```
			`defn add x y = { x + y }`
			```
			Is inferred to be a function of `Int->Int->Int`, as `(+)` requires its operands to be numbers. Lily's numbers are closed under all binary operators (that is, all binary operators are of type `Int->Int->Int`). Type checking is done using simple substitution. When a function is created and the types of the parameters are not known, it is assumed to be `? -> ?`. Then, as more information is provided (for instance, when a parameter is used in an addition, implying that it is a number), the gaps are filled in.
			`### G-Machine Code Generation`
			A G-Machine is used to quickly construct graphs. The evaluation of functional programs is frequently backed by graph reduction, and every time a function is called a new instance of a graph is to be constructed. Since this always follows the same steps (the function body is always the same structure), this can be optimized into a series of stack-based operations similar in a way to converting arithmetic to reverse polish notation. Lily's compiler uses an "environment" to keep track of the positions of variables on the ever-changing stack, which is represented simply by a linked list, with each node increasing the distance from the top of the stack. Generated instructions may look something like the following:
			```
			`push 1`
			`push 1`
			`pushglobal 2 @add`
			`mkapp`
			`mkapp`
			```
			`In this case, a variable from the stack at offset 1 is pushed onto the stack, followed by the variable that was formerly at offset 0 (now at offset 1 after the first variable was pushed). Then, a "global variable" is pushed. mkapp pops two values from the stack and converts them into a graph node representing function application. Running mkapp twice results in the application of the function @add to two parameters.`
			`### LLVM Code Generation`
			G-Machine instructions can fairly easy be converted into a sequence of LLVM IR instructions. The only problen is that LLVM is not stack-based, while the G-Machine is. This leads to a large amount of `push`/`pop` calls in the generated code, even ones that could potentially be optimized. Because `push`, `pop`, and memory allocation instructions are so common, and because they are always the same, they are placed in a separate C file called `runtime.c`. Stubs for these functions are generated on the LLVM side, allowing the IR to call these functions without knowing their body. These calls are resolved when the program is linked.

			`Because graph nodes can be of several different types, they are best represented using a tagged union. However, since LLVM does not support tagged unions (or unions in general), we simply declare several structs that are semantically not related. For instance:`
			```
			`struct node_parent {`
			`char tag;`
			`};`

			`struct node_num {`
			`char tag;`
			`int value;`
			`};`
			```
			These structs all have the same "initial structure", allowing them to be, in a way, compatible with each other. The "tag" of a node is checked to properly cast a node (LLVM's pointer cast is used for this). To get the fields in the struct, LLVM's GetElementPtr instruction is used, which allows the IR to load / store a value inside the struct. The most complicated instructions are `op`, which unwraps two number nodes and adds their values, `eq`, which unwraps two number nodes, compares their values, and pushes depending on that value either a "True" or "False", and `jump`, which is used for case analysis.

			`### Linking`
			Finally, the `runtime.c` library expects there to be a function `main` in the Lily source code, which takes no parameters. This function will be used as the entry point - after the runtime sets up the stack, the execution will be transferred to the LLVM-generated code (which will frequently call back to C to use malloc or to manipulate the stack).

			`## Sample Programs`
			`The most basic Lily program can just return 0:`
			```
			`main = { 0 }`
			```
			`This is boring. We can use addition and subtraction to spice it up!`
			```
			`main = { 1 - 1 + 3 }`
			```
			Lily supports arbitrary data type definitions (if you've used Haskell, these should be familliar). For instance, we can define and use a `Maybe` data structure as follows:
			```
			`data Maybe = { Ok(Int), Nothing }`
			`defn maybeOr42 m = {`
			`case m of {`
			`Ok(i) -> { i }`
			`Nothing -> { 42 }`
			`}`
			`}`
			`defn main = { maybeOr42 (Ok 3) }`
			```
			`Notice how during declaration, constructors for data types have the form "C(P1, P2)", while in an expression, they are treated like normal functions.`

			`Finally, Lily has lazy evaluation, allowing for the construction of infinite data structures:`
			```
			`data IntList = { Nil, Cons(Int, IntList) }`
			`defn ones = { Cons 1 ones }`
			```
			`This will not take any computation time unless something else uses the values from the list.`