blog-static/content/blog/haskell_catamorphisms.md

18 KiB

title date tags
Generalizing Folds in Haskell 2022-04-22T12:19:22-07:00
Haskell

Have you encountered Haskell's foldr function? Did you know that you can use it to express any function on a list? What's more, there's a way to derive similar functions for {{< sidenote "right" "positive-note" "a large class of data types in Haskell." >}} Specifically, this is the class of inductive types. {{< /sidenote >}} This is precisely the focus of this post. Before we get into the details, it's good to review the underlying concepts in a more familiar setting: functions.

Recursive Functions

Let's start off with a little bit of a warmup, and take a look at a simple recursive function: length. Here's a straightforward definition:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 4 6 >}}

Haskell is nice because it allows for clean definitions of recursive functions; length can just reference itself in its definition, and everything works out in the end. In the underlying lambda calculus, though, a function definition doesn't come with a name -- you only get anonymous functions via the lambda abstraction. There's no way for such functions to just refer to themselves by their name in their body. But the lambda calculus is Turing complete, so something is making recursive definitions possible.

The trick is to rewrite your recursive function in such a way that instead of calling itself by its name (which, with anonymous functions, is hard to come by), it receives a reference to itself as an argument. As a concrete example:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 8 10 >}}

This new function can easily me anonymous; if we enable the LambdaCase extension, we can write it using only lambda functions as:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 12 14 >}}

This function is not equivalent to length, however. It expects "itself", or a function which has type [a] -> Int, to be passed in as the first argument. Once fed this rec argument, though, lengthF returns a length function. Let's try feed it something, then!

lengthF _something

But if lengthF produces a length function when given this something, why can't we feed this newly-produced length function back to it?

lengthF (lengthF _something)

And again:

lengthF (lengthF (lengthF _something))

If we kept going with this process infinitely, we'd eventually have what we need:

{{< latex >}} \text{length} = \text{lengthF}(\text{lengthF}(\text{lengthF}(...))) {{< /latex >}}

But hey, the stuff inside the first set of parentheses is still an infinite sequence of applications of the function \(\text{lengthF}\), and we have just defined this to be \(\text{length}\). Thus, we can rewrite the above equation as:

{{< latex >}} \text{length} = \text{lengthF}(\text{length}) {{< /latex >}}

What we have just discovered is that the actual function that we want, length, is a fixed point of the non-recursive function lengthF. Fortunately, Haskell comes with a function that can find such a fixed point. It's defined like this:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 16 16 >}}

This definition is as declarative as can be; fix returns the \(x\) such that \(x = f(x)\). With this, we finally write:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 18 18 >}}

Loading up the file in GHCi, and running the above function, we get exactly the expected results.

ghci> Main.length' [1,2,3]
3

You may be dissatisfied with the way we handled fix here; we went through and pretended that we didn't have recursive function definitions, but then used a recursive let-expression in the body fix! This is a valid criticism, so I'd like to briefly talk about how fix is used in the context of the lambda calculus.

In the untyped typed lambda calculus, we can just define a term that behaves like fix does. The most common definition is the \(Y\) combinator, defined as follows:

{{< latex >}} Y = \lambda f. (\lambda x. f (x x)) (\lambda x. f (x x )) {{< /latex >}}

When applied to a function, this combinator goes through the following evaluation steps:

{{< latex >}} Y f = f (Y f) = f (f (Y f)) =\ ... {{< /latex >}}

This is the exact sort of infinite series of function applications that we saw above with \(\text{lengthF}\).

Recursive Data Types

We have now seen how we can rewrite a recursive function as a fixed point of some non-recursive function. Another cool thing we can do, though, is to transform recursive data types in a similar manner! Let's start with something pretty simple.

{{< codelines "Haskell" "catamorphisms/Cata.hs" 20 20 >}}

Just like we did with functions, we can extract the recursive occurrences of MyList into a parameter.

{{< codelines "Haskell" "catamorphisms/Cata.hs" 21 21 >}}

Just like lengthF, MyListF isn't really a list. We can't write a function sum :: MyListF -> Int. MyListF requires something as an argument, and once given that, produces a type of integer lists. Once again, let's try feeding it:

MyListF a

From the definition, we can clearly see that a is where the "rest of the list" is in the original MyList. So, let's try fill a with a list that we can get out of MyListF:

MyListF (MyListF a)

And again:

MyListF (MyListF (MyListF a))

Much like we used a fix function to turn our lengthF into length, we need a data type, which we'll call Fix (and which has been implemented before). Here's the definition:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 23 23 >}}

Looking past the constructors and accessors, we might write the above in pseudo-Haskell as follows:

newtype Fix f = f (Fix f)

This is just like the lambda calculus \(Y\) combinator above! Unfortunately, we do have to deal with the cruft induced by the constructors here. Thus, to write down the list [1,2,3] using MyListF, we'd have to produce the following:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 25 26 >}}

This is actually done in practice when using some approaches to help address the expression problem; however, it's quite unpleasant to write code in this way, so we'll set it aside.

Let's go back to our infinite chain of type applications. We've a similar pattern before, with \(\text{length}\) and \(\text{lengthF}\). Just like we did then, it seems like we might be able to write something like the following:

{{< latex >}} \begin{aligned} & \text{MyList} = \text{MyListF}(\text{MyListF}(\text{MyListF}(...))) \ \Leftrightarrow\ & \text{MyList} = \text{MyListF}(\text{MyList}) \end{aligned} {{< /latex >}}

In something like Haskell, though, the above is not quite true. MyListF is a non-recursive data type, with a different set of constructors to MyList; they aren't really equal. Instead of equality, though, we use the next-best thing: isomorphism.

{{< latex >}} \text{MyList} \cong \text{MyListF}(\text{MyList}) {{< /latex >}}

Two types are isomorphic when there exist a {{< sidenote "right" "fix-isomorphic-note" "pair of functions, f and g," >}} Let's a look at the types of Fix and unFix, by the way. Suppose that we did define MyList to be Fix MyListF. Let's specialize the f type parameter of Fix to MyListF for a moment, and check:

In one direction, Fix :: MyListF MyList -> MyList
And in the other, unFix :: MyList -> MyListF MyList

The two mutual inverses f and g fall out of the definition of the Fix data type! If we didn't have to deal with the constructor cruft, this would be more ergonomic than writing our own myIn and myOut functions. {{< /sidenote >}} that take you from one type to the other (and vice versa), such that applying \(f\) after \(g\), or \(g\) after \(f\), gets you right back where you started. That is, \(f\) and \(g\) need to be each other's inverses. For our specific case, let's call the two functions myOut and myIn (I'm matching the naming in this paper). They are not hard to define:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 28 34 >}}

By the way, when a data type is a fixed point of some other, non-recursive type constructor, this second type constructor is called a base functor. We can verify that MyListF is a functor by providing an instance (which is rather straightforward):

{{< codelines "Haskell" "catamorphisms/Cata.hs" 36 38 >}}

Recursive Functions with Base Functors

One neat thing you can do with a base functor is define recursive functions on the actual data type!

Let's go back to the very basics. When we write recursive functions, we try to think of it as solving a problem, assuming that we are given solutions to the sub-problems that make it up. In the more specific case of recursive functions on data types, we think of it as performing a given operation, assuming that we know how to perform this operation on the smaller pieces of the data structure. Some quick examples:

  1. When writing a sum function on a list, we assume we know how to find the sum of the list's tail (sum xs), and add to it the current element (x+). Of course, if we're looking at a part of a data structure that's not recursive, we don't need to perform any work on its constituent pieces.
    sum [] = 0
    sum (x:xs) = x + sum xs
    
  2. When writing a function to invert a binary tree, we assume that we can invert the left and right children of a non-leaf node. We might write:
    invert Leaf = Leaf
    invert (Node l r) = Node (invert r) (invert l)
    

What does this have to do with base functors? Well, recall how we arrived at MyListF from MyList: we replaced every occurrence of MyList in the definition with a type parameter a. Let me reiterate: wherever we had a sub-list in our definition, we replaced it with a. The a in MyListF marks the locations where we would have to use recursion if we were to define a function on MyList.

What if instead of a stand-in for the list type (as it was until now), we use a to represent the result of the recursive call on that sub-list? To finish computing the sum of the list, then, the following would suffice:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 40 42 >}}

Actually, this is enough to define the whole sum function. First things first, let's use myOut to unpack one level of the Mylist type:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 28 28 >}}

We know that MyListF is a functor; we can thus use fmap sum to compute the sum of the remaining list:

fmap mySum :: MyListF MyList -> MyListF Int

Finally, we can use our mySumF to handle the last addition:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 40 40 >}}

Let's put all of these together:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 44 45 >}}

Notice, though, that the exact same approach would work for any function with type:

MyListF a -> a

We can thus write a generalized version of mySum that, instead of using mySumF, uses some arbitrary function f with the aforementioned type:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 47 48 >}}

Let's use myCata to write a few other functions:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 50 60 >}}

It's just a foldr!

When you write a function with the type MyListF a -> a, you are actually providing two things: a "base case" element of type a, for when you match MyNilF, and a "combining function" with type Int -> a -> a, for when you match MyConsF. We can thus define:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 64 66 >}}

We could also go in the opposite direction, by writing:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 68 69 >}}

Hey, what was it that we said about types with two functions between them, which are inverses of each other? That's right, MyListF a -> a and (a, Int -> a -> a) are isomorphic. The function myCata, and the "traditional" definition of foldr are equivalent!

Base Functors for All!

We've been playing with MyList for a while now, but it's kind of getting boring: it's just a list of integers! Furthermore, we're not really getting anything out of this new "generalization" procedure -- foldr is part of the standard library, and we've just reinvented the wheel.

But you see, we haven't quite. This is because, while we've only been working with MyListF, the base functor for MyList, our approach works for any recursive data type, provided an out function. Let's define a type class, Cata, which pairs a data type a with its base functor f, and specifies how to "unpack" a:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 71 72 >}}

We can now provide a more generic version of our myCata, one that works for all types with a base functor:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 74 75 >}}

Clearly, MyList and MyListF are one instance of this type class:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 77 78 >}}

We can also write a base functor for Haskell's built-in list type, [a]:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 80 84 >}}

We can use our cata function for regular lists to define a generic sum:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 86 89 >}}

It works perfectly:

ghci> Main.sum [1,2,3]
6
ghci> Main.sum [1,2,3.0]
6.0
ghci> Main.sum [1,2,3.0,-1]
5.0

What about binary trees, which served as our second example of a recursive data structure? We can do that, too:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 91 96 >}}

Given this, here's an implementation of that invert function we mentioned earlier:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 98 101 >}}

Degenerate Cases

Actually, the data types we consider don't have to be recursive. We can apply the same procedure of replacing recursive occurrences in a data type's definition with a new type parameter to Maybe; the only difference is that now the new parameter will not be used!

{{< codelines "Haskell" "catamorphisms/Cata.hs" 103 107 >}}

And then we can define a function on Maybe using cata:

{{< codelines "Haskell" "catamorphisms/Cata.hs" 109 112 >}}

This isn't really useful, since we're still pattern matching on a type that looks identical to Maybe itself. There is one reason that I bring it up, though. Remember how foldr was equivalent to cata for MyList, because defining a function MyListF a -> a was the same as providing a base case a and a "combining function" Int -> a -> a? Well, defining a function MaybeF x a -> a is the same as providing a base case a (for NothingF) and a handler for the contained value, x -> a. So we might imagine the foldr function for Maybe to have type:

maybeFold :: a -> (x -> a) -> Maybe x -> a

This is exactly the function maybe from Data.Maybe! Hopefully you can follow a similar process in your head to arrive at "fold" functions for Either and Bool. Indeed, there are functions that correspond to these data types in the Haskell standard library, named either and bool. Much like fold can be used to represent any function on lists, maybe, either, and bool can be used to represent any function on their corresponding data types. I think that's neat.

What About Foldable?

If you've been around the Haskell ecosystem, you may know the Foldable type class. Isn't this exactly what we've been working towards here? No, not at all. Take a look at how the documentation describes Data.Foldable:

The Foldable class represents data structures that can be reduced to a summary value one element at a time.

One at a time, huh? Take a look at the signature of foldMap, which is sufficient for an instance of Foldable:

foldMap :: Monoid m => (a -> m) -> t a -> m

A Monoid is just a type with an associative binary operation that has an identity element. Then, foldMap simply visits the data structure in order, and applies this binary operation pairwise to each monoid produced via f. Alas, this function is not enough to be able to implement something like inverting a binary tree; there are different configurations of binary tree that, when visited in-order, result in the same sequence of elements. For example:

ghci> fold (Node "Hello" Leaf (Node ", " Leaf (Node "World!" Leaf Leaf)))
"Hello, World!"
ghci> fold (Node "Hello" (Node ", " Leaf Leaf) (Node "World!" Leaf Leaf))
"Hello, World!"

As far as fold (which is just foldMap id) is concerned, the two trees are equivalent. They are very much not equivalent for the purposes of inversion! Thus, whereas Foldable helps us work with list-like data types, the Cata type class lets us express any function on a recursive data type similarly to how we'd do it with foldr and lists.

Catamorphisms

Why is the type class called Cata, and the function cata? Well, a function that performs a computation by recursively visiting the data structure is called a catamorphism. Indeed, foldr f b, for function f an "base value" b is an example of a list catamorophism. It's a fancy word, and there are some fancier descriptions of what it is, especially when you step into category theory (check out the Wikipedia entry if you want to know what I mean). However, for our purposes, a catamorphism is just a generalization of foldr from lists to any data type!