blog-static/content/blog/00_cs325_languages_hw1.md

512 lines
21 KiB
Markdown

---
title: A Language for an Assignment - Homework 1
date: 2019-12-27T23:27:09-08:00
tags: ["Haskell", "Python", "Algorithms"]
---
On a rainy Oregon day, I was walking between classes with a group of friends.
We were discussing the various ways to obfuscate solutions to the weekly
homework assignments in our Algorithms course: replace every `if` with
a ternary expression, use single variable names, put everything on one line.
I said:
> The
{{< sidenote "right" "chad-note" "chad" >}}
This is in reference to a meme, <a href="https://knowyourmeme.com/memes/virgin-vs-chad">Virgin vs Chad</a>.
A "chad" characteristic is masculine or "alpha" to the point of absurdity.
{{< /sidenote >}} move would be to make your own, different language for every homework assignment.
It was required of us to use
{{< sidenote "left" "python-note" "Python" >}}
A friend suggested making a Haskell program
that generates Python-based interpreters for languages. While that would be truly
absurd, I'll leave <em>this</em> challenge for another day.
{{< /sidenote >}} for our solutions, so that was the first limitation on this challenge.
Someone suggested to write the languages in Haskell, since that's what we used
in our Programming Languages class. So the final goal ended up:
* For each of the 10 homework assignments in CS325 - Analysis of Algorithms,
* Create a Haskell program that translates a language into,
* A valid Python program that works (nearly) out of the box and passes all the test cases.
It may not be worth it to create a whole
{{< sidenote "right" "general-purpose-note" "general-purpose" >}}
A general purpose language is one that's designed to be used in various
domains. For instance, C++ is a general-purpose language because it can
be used for embedded systems, GUI programs, and pretty much anything else.
This is in contrast to a domain-specific language, such as Game Maker Language,
which is aimed at a much narrower set of uses.
{{< /sidenote >}} language for each problem,
but nowhere in the challenge did we say that it had to be general-purpose. In
fact, some interesting design thinking can go into designing a domain-specific
language for a particular assignment. So let's jump right into it, and make
a language for the first homework assignment.
### Homework 1
There are two problems in Homework 1. Here they are, verbatim:
{{< codelines "text" "cs325-langs/hws/hw1.txt" 32 38 >}}
And the second:
{{< codelines "text" "cs325-langs/hws/hw1.txt" 47 68 >}}
We want to make a language __specifically__ for these two tasks (one of which
is split into many tasks). What common things can we isolate? I see two:
First, __all the problems deal with lists__. This may seem like a trivial observation,
but these two problems are the __only__ thing we use our language for. We have
list access,
{{< sidenote "right" "filterting-note" "list filtering" >}}
Quickselect is a variation on quicksort, which itself
finds all the "lesser" and "greater" elements in the input array.
{{< /sidenote >}} and list creation. That should serve as a good base!
If you squint a little bit, __all the problems are recursive with the same base case__.
Consider the first few lines of `search`, implemented naively:
```Python
def search(xs, k):
if xs == []:
return false
```
How about `sorted`? Take a look:
```Python
def sorted(xs):
if xs == []:
return []
```
I'm sure you see the picture. But it will take some real mental gymnastics to twist the
rest of the problems into this shape. What about `qselect`, for instance? There's two
cases for what it may return:
* `None` or equivalent if the index is out of bounds (we give it `4` an a list `[1, 2]`).
* A number if `qselect` worked.
The test cases never provide a concrete example of what should be returned from
`qselect` in the first case, so we'll interpret it like
{{< sidenote "right" "undefined-note" "undefined behavior" >}}
For a quick sidenote about undefined behavior, check out how
C++ optimizes the <a href="https://godbolt.org/z/3skK9j">Collatz Conjecture function</a>.
Clang doesn't know whether or not the function will terminate (whether the Collatz Conjecture
function terminates is an <a href="https://en.wikipedia.org/wiki/Collatz_conjecture">unsolved problem</a>),
but functions that don't terminate are undefined behavior. There's only one other way the function
returns, and that's with "1". Thus, clang optimizes the entire function to a single "return 1" call.
{{< /sidenote >}} in C++:
we can do whatever we want. So, let's allow it to return `[]` in the `None` case.
This makes this base case valid:
```Python
def qselect(xs, k):
if xs == []:
return []
```
"Oh yeah, now it's all coming together." With one more observation (which will come
from a piece I haven't yet shown you!), we'll be able to generalize this base case.
The observation is this section in the assignment:
{{< codelines "text" "cs325-langs/hws/hw1.txt" 83 98 >}}
The real key is the part about "returning the `[]` where x should be inserted". It so
happens that when the list given to the function is empty, the number should be inserted
precisely into that list. Thus:
```Python
def _search(xs, k):
if xs == []:
return xs
```
The same works for `qselect`:
```Python
def qselect(xs, k):
if xs == []:
return xs
```
And for sorted, too:
```Python
def sorted(xs):
if xs == []:
return xs
```
There are some functions that are exceptions, though:
```Python
def insert(xs, k):
# We can't return early here!
# If we do, we'll never insert anything.
```
Also:
```Python
def search(xs, k):
# We have to return true or false, never
# an empty list.
```
So, whenever we __don't__ return a list, we don't want to add a special case.
We arrive at the following common base case: __whenever a function returns a list, if its first argument
is the empty list, the first argument is immediately returned__.
We've largely exhasuted the conclusiosn we can draw from these problems. Let's get to designing a language.
### A Silly Language
Let's start by visualizing our goals. Without base cases, the solution to `_search`
would be something like this:
{{< codelines "text" "cs325-langs/sols/hw1.lang" 11 14 >}}
Here we have an __`if`-expression__. It has to have an `else`, and evaluates to the value
of the chosen branch. That is, `if true then 0 else 1` evaluates to `0`, while
`if false then 0 else 1` evaluates to `1`. Otherwise, we follow the binary tree search
algorithm faithfully.
Using this definition of `_search`, we can define `search` pretty easily:
{{< codelines "text" "cs325-langs/sols/hw1.lang" 17 17 >}}
Let's use Haskell's `(++)` operator for concatentation. This will help us understand
when the user is operating on lists, and when they're not. With this, `sorted` becomes:
{{< codelines "text" "cs325-langs/sols/hw1.lang" 16 16 >}}
Let's go for `qselect` now. We'll introduce a very silly language feature for this
problem:
{{< sidenote "right" "selector-note" "list selectors" >}}
You've probably never heard of list selectors, and for a good reason:
this is a <em>terrible</em> language feature. I'll go in more detail
later, but I wanted to make this clear right away.
{{< /sidenote >}}. We observe that `qselect` aims to partition the list into
other lists. We thus add the following pieces of syntax:
```
~xs -> {
pivot <- xs[rand]!
left <- xs[#0 <= pivot]
...
} -> ...
```
There are three new things here.
1. The actual "list selector": `~xs -> { .. } -> ...`. Between the curly braces
are branches which select parts of the list and assign them to new variables.
Thus, `pivot <- xs[rand]!` assigns the element at a random index to the variable `pivot`.
the `!` at the end means "after taking this out of `xs`, delete it from `xs`". The
syntax {{< sidenote "right" "curly-note" "starts with \"~\"" >}}
An observant reader will note that there's no need for the "xs" after the "~".
The idea was to add a special case syntax to reference the "selected list", but
I ended up not bothering. So in fact, this part of the syntax is useless.
{{< /sidenote >}} to make it easier to parse.
2. The `rand` list access syntax. `xs[rand]` is a special case that picks a random
element from `xs`.
3. The `xs[#0 <= pivot]` syntax. This is another special case that selects all elements
from `xs` that match the given predicate (where `#0` is replaced with each element in `xs`).
The big part of qselect is to not evaluate `right` unless you have to. So, we shouldn't
eagerly evaluate the list selector. We also don't want something like `right[|right|-1]` to evaluate
`right` twice. So we settle on
{{< sidenote "right" "lazy-note" "lazy evaluation" >}}
Lazy evaluation means only evaluating an expression when we need to. Thus,
although we might encounter the expression for <code>right</code>, we
only evaluate it when the time comes. Lazy evaluation, at least
the way that Haskell has it, is more specific: an expression is evaluated only
once, or not at all.
{{</ sidenote >}}.
Ah, but the `!` marker introduces
{{< sidenote "left" "side-effect-note" "side effects" >}}
A side effect is a term frequently used when talking about functional programming.
Evaluating the expression <code>xs[rand]!</code> doesn't just get a random element,
it also changes <em>something else</em>. In this case, that something else is
the <code>xs</code> list.
{{< /sidenote >}}. So we can't just evaluate these things all willy-nilly.
So, let's make it so that each expression in the selector list requires the ones above it. Thus,
`left` will require `pivot`, and `right` will require `left` and `pivot`. So,
lazily evaluated, ordered expressions. The whole `qselect` becomes:
{{< codelines "text" "cs325-langs/sols/hw1.lang" 1 9 >}}
We've now figured out all the language constructs. Let's start working on
some implementation!
#### Implementation
It would be silly of me to explain every detail of creating a language in Haskell
in this post; this is neither the purpose of the post, nor is it plausible
to do this without covering monads, parser combinators, grammars, abstract syntax
trees, and more. So, instead, I'll discuss the _interesting_ parts of the
implementation.
##### Temporary Variables
Our language is expression-based, yes. A function is a single,
arbitrarily complex expression (involving `if/else`, list
selectors, and more). So it would make sense to translate
a function to a single, arbitrarily complex Python expression.
However, the way we've designed our language makes it
not-so-suitable for converting to a single expression! For
instance, consider `xs[rand]`. We need to compute the list,
get its length, generate a random number, and then access
the corresponding element in the list. We use the list
here twice, and simply repeating the expression would not
be very smart: we'd be evaluating twice. So instead,
we'll use a variable, assign the list to that variable,
and then access that variable multiple times.
To be extra safe, let's use a fresh temporary variable
every time we need to store something. The simplest
way is to simply maintain a counter of how many temporary
variables we've already used, and generate a new variable
by prepending the word "temp" to that number. We start
with `temp0`, then `temp1`, and so on. To keep a counter,
we can use a state monad:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 269 269 >}}
Don't worry about the `Map.Map String [String]`, we'll get to that in a bit.
For now, all we have to worry about is the second element of the tuple,
the integer counting how many temporary variables we've used. We can
get the current temporary variable as follows:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 271 274 >}}
We can also get a fresh temporary variable like this:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 276 279 >}}
Now, the
{{< sidenote "left" "code-note" "code" >}}
Since we are translating an expression, we must have the result of
the translation yield an Python expression we can use in generating
larger Python expressions. However, as we've seen, we occasionally
have to use statements. Thus, the <code>translateExpr</code> function
returns a <code>Translator ([Py.PyStmt], Py.PyExpr)</code>.
{{< /sidenote >}}for generating a random list access looks like
{{< sidenote "right" "ast-note" "this:" >}}
The <code>Py.*</code> constructors are a part of a Python AST module I quickly
threw together. I won't showcase it here, but you can always look at the
source code for the blog (which includes this project)
<a href="https://dev.danilafe.com/Web-Projects/blog-static">here</a>.
{{< /sidenote >}}
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 364 369 >}}
##### Implementing "lazy evaluation"
Lazy evaluation in functional programs usually arises from
{{< sidenote "right" "graph-note" "graph reduction" >}}
Graph reduction, more specifically the <em>Spineless,
Tagless G-machine</em> is at the core of the Glasgow Haskell
Compiler (GHC). Simon Peyton Jones' earlier book,
<em>Implementing Functional Languages: a tutorial</em>
details an earlier version of the G-machine.
{{< /sidenote >}}. However, Python is neither
functional nor graph-based, and we only lazily
evaluate list selectors. Thus, we'll have to do
some work to get our lazy evaluation to work as we desire.
Here's what I came up with:
1. It's difficult to insert Python statements where they are
needed: we'd have to figure out in which scope each variable
has already been declared, and in which scope it's yet
to be assigned.
2. Instead, we can use a Python dictionary, called `cache`,
and store computed versions of each variable in the cache.
3. It's pretty difficult to check if a variable
is in the cache, compute it if not, and then return the
result of the computation, in one expression. This is
true, unless that single expression is a function call, and we have a dedicated
function that takes no arguments, computes the expression if needed,
and uses the cache otherwise. We choose this route.
4. We have already promised that we'd evaluate all the selected
variables above a given variable before evaluating the variable
itself. So, each function will first call (and therefore
{{< sidenote "right" "force-note" "force" >}}
Forcing, in this case, comes from the context of lazy evaluation. To
force a variable or an expression is to tell the program to compute its
value, even though it may have been putting it off.
{{< /sidenote >}}) the functions
generated for variables declared above the function's own variable.
5. To keep track of all of this, we use the already-existing state monad
as a reader monad (that is, we clear the changes we make to the monad
after we're done translating the list selector). This is where the `Map.Map String [String]`
comes from.
The `Map.Map String [String]` keeps track of variables that will be lazily computed,
and also of the dependencies of each variable (the variables that need
to be access before the variable itself). We compute such a map for
each selector as follows:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 337 337 >}}
We update the existing map using `Map.union`:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 338 338 >}}
And, after we're done generating expressions in the body of this selector,
we clear it to its previous value `vs`:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 341 341 >}}
We generate a single selector as follows:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 307 320 >}}
This generates a function definition statement, which we will examine in
generated Python code later on.
Solving the problem this way also introduces another gotcha: sometimes,
a variable is produced by a function call, and other times the variable
is just a Python variable. We write this as follows:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 322 327 >}}
##### Special Case Insertion
This is a silly language for a single homework assignment. I'm not
planning to implement Hindley-Milner type inference, or anything
of that sort. For the purpose of this language, things will be
either a list, or not a list. And as long as a function __can__ return
a list, it can also return the list from its base case. Thus,
that's all we will try to figure out. The checking code is so
short that we can include the whole snippet at once:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 258 266 >}}
`mergePossibleType`
{{< sidenote "right" "bool-identity-note" "figures out" >}}
An observant reader will note that this is just a logical
OR function. It's not, however, good practice to use
booleans for types that have two constructors with no arguments.
Check out this <a href="https://programming-elm.com/blog/2019-05-20-solving-the-boolean-identity-crisis-part-1/">
Elm-based article</a> about this, which the author calls the
Boolean Identity Crisis.
{{< /sidenote >}}, given two possible types for an
expression, the final type for the expression.
There's only one real trick to this. Sometimes, like in
`_search`, the only time we return something _known_ to be a list, that
something is `xs`. Since we're making a list manipulation language,
let's __assume the first argument to the function is a list__, and
__use this information to determine expression types__. We guess
types in a very basic manner otherwise: If you use the concatenation
operator, or a list literal, then obviously we're working on a list.
If you're returning the first argument of the function, that's also
a list. Otherwise, it could be anything.
My Haskell linter actually suggested a pretty clever way of writing
the whole "add a base case if this function returns a list" code.
Check it out:
{{< codelines "Haskell" "cs325-langs/src/LanguageOne.hs" 299 305 >}}
Specifically, look at the line with `let fastReturn = ...`. It
uses a list comprehension: we take a parameter `p` from the list of
parameter `ps`, but only produce the statements for the base case
if the possible type computed using `p` is `List`.
### The Output
What kind of beast have we created? Take a look for yourself:
```Python
def qselect(xs,k):
if xs==[]:
return xs
cache = {}
def pivot():
if ("pivot") not in (cache):
cache["pivot"] = xs.pop(0)
return cache["pivot"]
def left():
def temp2(arg):
out = []
for arg0 in arg:
if arg0<=pivot():
out.append(arg0)
return out
pivot()
if ("left") not in (cache):
cache["left"] = temp2(xs)
return cache["left"]
def right():
def temp3(arg):
out = []
for arg0 in arg:
if arg0>pivot():
out.append(arg0)
return out
left()
pivot()
if ("right") not in (cache):
cache["right"] = temp3(xs)
return cache["right"]
if k>(len(left())+1):
temp4 = qselect(right(), k-len(left())-1)
else:
if k==(len(left())+1):
temp5 = [pivot()]
else:
temp5 = qselect(left(), k)
temp4 = temp5
return temp4
def _search(xs,k):
if xs==[]:
return xs
if xs[1]==k:
temp6 = xs
else:
if xs[1]>k:
temp8 = _search(xs[0], k)
else:
temp8 = _search(xs[2], k)
temp6 = temp8
return temp6
def sorted(xs):
if xs==[]:
return xs
return sorted(xs[0])+[xs[1]]+sorted(xs[2])
def search(xs,k):
return len(_search(xs, k))!=0
def insert(xs,k):
return _insert(k, _search(xs, k))
def _insert(k,xs):
if k==[]:
return k
if len(xs)==0:
temp16 = xs
temp16.append([])
temp17 = temp16
temp17.append(k)
temp18 = temp17
temp18.append([])
temp15 = temp18
else:
temp15 = xs
return temp15
```
It's...horrible! All the `tempX` variables, __three layers of nested function declarations__, hardcoded cache access. This is not something you'd ever want to write.
Even to get this code, I had to come up with hacks __in a language I created__.
The first is the hack is to make the `qselect` function use the `xs == []` base
case. This doesn't happen by default, because `qselect` doesn't return a list!
To "fix" this, I made `qselect` return the number it found, wrapped in a
list literal. This is not up to spec, and would require another function
to unwrap this list.
While `qselect` was struggling with not having the base case, `insert` had
a base case it didn't need: `insert` shouldn't return the list itself
when it's empty, it should insert into it! However, when we use the `<<`
list insertion operator, the language infers `insert` to be a list-returning
function itself, inserting into an empty list will always fail. So, we
make a function `_insert`, which __takes the arguments in reverse__.
The base case will still be generated, but the first argument (against
which the base case is checked) will be a number, so the `k == []` check
will always fail.
That concludes this post. I'll be working on more solutions to homework
assignments in self-made languages, so keep an eye out!