// Advent of Code 2022, Day 7: Traversing Directories // authors: ["Daniel Fedorin"] // summary: "A solution to day seven of AoC 2022, introducing classes and memory management." // tags: ["Advent of Code 2022"] // date: 2022-12-07 /* Welcome to day 7 of Chapel's Advent of Code 2022 series. We're over halway through the twelve days of Chapel AoC! In case you haven't been following the series, check out the introductory [Advent of Code 2022: Twelve Days of Chapel](../aoc2022-day00-intro/) article for more context. */ /* ### The Task at Hand and My Approach In [today's puzzle](https://adventofcode.com/2022/day/7), we are given a list of terminal-like commands ( [`ls`](https://man7.org/linux/man-pages/man1/ls.1.html) and [`cd`](https://man7.org/linux/man-pages/man1/cd.1p.html) ), as well as output corresponding to running these commands. The commands explore a fictional file system, which can have files (objects with size) as well as directories that group files and other (sub-)directories. The problem then asks to compute the sizes of each folder, and to total up the sizes of all folders that are smaller than a particular threshold. The tree-like nature of the file system does not make it amenable to representations based on arrays, lists, or maps alone. The trouble with these data types is that they're flat. Our input could have arbitrary levels of nested directories. However, arrays, lists, and maps cannot have such arbitrary nesting --- we'd need something like a list of lists of lists of... We could, of course, use the `map` and `list` data types to represent the file system with some sort of [adjacency list](https://en.wikipedia.org/wiki/Adjacency_list). However, such an implementation would be somewhat clunky and hard to use. Instead, in this article I use a different tool from the repertoire of Chapel language features, one we haven't seen so far: classes. Specifically, I use a class, `Dir`, to represent directories in the file system, and build up a tree of these directories while reading the input. I then create an iterator over this tree that computes and yields the sizes of the folders. From there, it's easy to pick out all directory sizes smaller than the threshold and sum them up. **If you skip right to your favorite parts of a movie, here's a full solution for the day:** {{< whole_file_min >}} And now, on to the explanation train. Before the train departs, let's import a few of the modules we'll use today. `IO` is a permanent fixture in our solutions (we always need to read input!), and `List` is a familiar face. The only newcomer here is `Map`, which helps us associate keys with values, much like a dictionary in Python, a hash in Ruby, or a map in C++. We'll use maps and lists for storing the various files and directories on the file system. */ use IO, Map, List; /* With that, our train's first stop: classes! ### Classes in Chapel Like in most languages, classes in Chapel are a way to group together related pieces of data. Up until now, we've used tuples for this purpose. Tuples, however, have a couple of limitations when it comes to solving today's Advent of Code problem: * We can't name a tuple's elements. Whenever you make and use a tuple, it is up to _you_ to remember the order of the elements within it, and what each element represents. * Tuples can be nested, but their precise element types, including nesting depth, must be known at compile-time. As a result, tuples aren’t flexible enough to support the arbitrary levels of nesting that would be required by a program that didn’t know the directory structure _a priori_ (e.g. one that was reading it from disk). We simply don’t have the information at compile-time to describe the tuple’s types and “shape”. Classes have neither of these limitations. They do, however, need to be explicitly created within Chapel code. For example, one might create a class to store information about a person: ```Chapel class person { var firstName, lastName: string; } ``` We've seen plenty of `var` statements used to create variables; when used within a class, `var` declares a _member variable_ (also known as a _field_) for the class. Our `person` contains two pieces of data in its fields: the person's first name (`firstName`) and last name (`lastName`). With that class definition in hand, we can create instances of the `person` class using the `new` keyword. ```Chapel var biggestCandyFan = new person("Daniel", "Fedorin"); ``` As usual, we can rely on type inference to only write the type `person` once; Chapel figures out that `biggestCandyFan` is a `person`. Now, it's easy to get the various fields back out of a class: ```Chapel writeln("The biggest fan of candy is ", biggestCandyFan.firstName); ``` Believe it or not, we've already seen enough of classes to see how to represent nested data structures. The key observation is that classes have names, which means that we can create fields that refer back to instances of the same class. Here's an example of what I mean, in the form of a modified `person` class: ```Chapel {hl_lines=3} class person { var firstName, lastName: string; var children: list(owned person); } ``` The highlighted line is new. We've added a list of children to our person. These children are themselves instances of `person`, which means they too can have children of their own. _Et voilà_ - we've got a nested data structure! #### Memory Management Strategies You probably noticed that `children`'s type is `list(owned person)` --- note the `owned`. This keyword is an indication of the way that memory is allocated and maintained for classes: their _memory management_. To create a class, a Chapel program asks for some memory from the computer (_allocates_ it). This memory is kept by the program until the instance of a class is no longer needed, at which point it's _deallocated_/_freed_. The challenge is knowing when a class is no longer needed! This is where _memory management strategies_, like `owned`, come in. We don't need to get too deep into the various memory management strategies in today's post. {{< details summary="**(If you're curious, here's a brief description of each strategy...)**" >}} * When using the `owned` strategy, a class instance has one "owner" variable. The instance is only around as long as this owner exists. As soon as the owner disappears, the class instance is deallocated. In some cases --- though we won't be covering them today --- ownership can be transferred from one variable to another, but no two values can own the same class instance at the same time. Other variables can still refer to an `owned` class instance, but they must _borrow_ it, creating, for example, a `borrowed person`. Borrows do not affect the lifetime of class or when it is deallocated. * When using the `shared` strategy, Chapel keeps track of how many places still have variables that refer to a particular instance of a class. This is typically called a _reference count_. Each time a variable is created or changed to refer to a class instance, the instance's reference count increases. When that variable goes out of scope and disappears, the reference count decreases. Finally, when the reference count reaches zero (no more variables refer to the class instance), there's no point in keeping it around anymore, and its memory is deallocated. As is the case with `owned`, other variables can borrow `shared` class instances. Such borrows do not affect the reference count at all, and therefore don't influence when the instance is freed. * When using the `unmanaged` strategy, you're promising to manually free the memory later, using the `delete` keyword. This is very similar to how `new`/`delete` work in classic C++. {{< /details >}} So, the `owned` keyword in our `children` list means we've opted for the `owned` memory management strategy. The implication of this is that when a "parent" person is deallocated, so are all of its children (since the person class, through its `children` list, owns each child). If we aren't planning on sharing our data, `owned` is the preferred strategy. This is because it precludes the need for some bookkeeping, which often makes a difference in terms of performance. The added benefit to using `owned`, in my personal view, is that it's easier to figure out when something will be deleted --- there's no chance of some other variable, elsewhere in my program, preventing a class instance's deallocation. #### Methods Remember how I said that classes can be used to group together pieces of related data? Well, they can do more than that. They can also group together operations on this data, in the form of _methods_. For instance, we could add the following definition **inside** the `class` declaration for our `person`: ```Chapel class person { // ... as before proc getGreeting() { return "Hello, " + this.firstName + "!"; } } ``` Just like fields can be thought of as `var`s that are associated with a particular class instance, methods can be thought of as _procedures_ associated with a particular class instance. Thus, methods behave pretty much exactly like the `proc`s we've seen so far, with the notable difference of being able to access that class instance through the `this` keyword. For example, inside the body of a method like `getGreeting` above, `this.firstName` gets us the person's first name, and `this.lastName` would get us their last name. We can call methods using the dot syntax: ```Chapel // Prints "Hello, Daniel!" writeln(biggestCandyFan.getGreeting()); ``` Methods are a powerful tool for abstraction; rather than writing external code that refers to the various fields of a class, we can put that logic inside of methods, and avoid exposing it to the rest of the world. A person writing `.getGreeting()` will not need to know how a name is represented in the `person` class. Another sort of method is a _type method_ (sometimes referred to as a _static method_ in other languages). Rather than being called on an instance of a person, like `biggestCandyFan` or `daniel`, it's called on the class itself. For instance: ```Chapel class person { // ... as before proc type createBiggestCandyFan() { return new person("Daniel", "Fedorin"); } } var biggestCandyFan = person.createBiggestCandyFan(); ``` Methods like this have the benefit of being associated with a particular class. This means that another class can have its own `createBiggestCandyFan()` method, and there won't be any confusion or problems arising from trying to figure out which is which. Perhaps dogs (represented by a hypothetical `dog` class) have a biggest candy fan, too! ```Chapel var biggestCandyFan = person.createBiggestCandyFan(); var biggestCandyFanDog = dog.createBiggestCandyFan(); ``` ### A `Dir` Class to Represent Directories Back to the solution. The class I use for tracking directories is actually not too different from our modified `person` class above. Each directory {{< sidenote right "dir-firstname-note" "will have a name" >}} Despite the recent media noise about ChatGPT, directories have not yet been granted personhood, and do not have both first and last names. {{< /sidenote >}} as well as a collection of files and directories it contains. */ class Dir { var name: string; var files = new map(string, int); var dirs = new list(owned Dir); /* Since files have no additional information to them besides their size, I decided to represent them as a map --- a directory's `files` field associates each file's name with that file's size. The subdirectories are represented just like the `children` field from our `person` record, as a list of owned `Dir`s. There are a few more things I want to add to `Dir`; the first is a way to read our directory from our puzzle input. #### Reading the File System with the `fromInput` Type Method For reasons of abstraction and avoiding conflicts, I put the code for creating a directory from user input into a type method on `Dir`. Within this method, I include the now-familiar code for reading from the input using `readLine`, until we run out of lines. */ proc type fromInput(name: string): owned Dir { var line: string; var newDir = new Dir(name); while readLine(line, stripNewline = true) { /* Notice that I'm accepting the name for the directory as a string formal and initializing a new variable `newDir` with that name. Notice also that I don't need to provide the `files` and `dirs` as arguments to `new Dir` --- they have default values in the class definition. By default, `new` uses the `owned` memory management strategy. For the time being, the `newDir` variable owns our directory-under-construction. We're reading lines now; all that's left is to figure out what to do with them. The first case is that of `$ cd ..`. When we see that line, it means that we're done looking at the current directory; none of the subsequent `ls` lines will be meant for us. Thus, we break out of the input `while`-loop. */ if line == "$ cd .." { break; /* If the `cd` command is used, but its argument isn't `..`, we're being asked to descend into a sub-directory of our current `newDir`. In this case, we call the `fromInput` method again, recursively, to create a subdirectory of the current one. This call will keep consuming lines from the input until the sub-directory has been processed, at which point it will return it to us. We'll immediately append this sub-directory to the `newDir.dirs` list, which becomes the sub-directory's new owner. Recall that we need to give `fromInput` the name of the new sub-directory. We can figure out the name by slicing the string starting after the `$ cd` prefix. Since I want to get the rest of the characters after the prefix, I leave the end of my range unbounded, which makes the slice go until the characters run out at the end of the string. If you're feeling shaky on lists and `append`, check out our [day 5 article]({{< relref "aoc2022-day05-cratestacks" >}}#moving-crates-within-an-array-of-lists). If you want a little refresher on slicing, we first covered it on [day 3]({{< relref "aoc2022-day03-rucksacks" >}}/ranges-and-slicing). */ } else if line.startsWith("$ cd ") { param cdPrefix = "$ cd "; const dirName = line[cdPrefix.size..]; newDir.dirs.append(Dir.fromInput(dirName)); /* As it turns out, all that's left is to handle files. We already get directory names from `cd`, so there's no reason to worry about lines starting with `dir`. The `ls` command itself always precedes the list of files and directories; by itself, it provides us no additional information. Thus, our last case is a line that's neither `dir` nor `ls`. Such a line is a file, so its format will be a number followed by the file's name. I use the `partition` method on the line to split it into three pieces: the part before the space, the space itself, and the part after the space. After that, I can just update the `newDir` map, associating the file called `name` with its size. I use an integer cast to convert `size` (a string) to a number. */ } else if !line.startsWith("$ ls") && !line.startsWith("dir") { const (size, _, name) = line.partition(" "); newDir.files[name] = size : int; /* That's it for the loop! Once the loop stops running, we know we're done processing the directory. All that remains is to return it. Returning an `owned` value from a function or method transfers ownership to whatever code calls the function or method. */ } } return newDir; } /* One more thing: I have explicitly annotated the return type of `fromInput` to be `owned Dir` to let Chapel know that I'm using the `owned` memory management strategy. This might just be the first return type annotation we've written so far. Up until now, Chapel has been able to deduce the return types of our procedures and iterators automatically. However, here, because we are using recursion, it needs just a little bit of help: determining the types in the body of `fromInput` requires knowing the type of `fromInput`! The manual type annotation helps break that loop. */ /* #### An Iterator Method for Listing Directory Sizes Let's recap. What we have now is a data structure, `Dir`, which represents the directory tree. We also have a type method, `Dir.fromInput` that converts our puzzle input into this data structure. What's left? The way I see it, the problem is composed of three pieces: 1. Go through all of the directory sizes... 2. ... ignoring those that are above a certain threshold ... 3. ... and sum them. Over the past week, we've gotten really good at summing things! In Chapel, we can just use `+reduce` to compute the sum of something iterable, so there's point number three. For point two, it turns out that those [loop expressions]({{< relref "aoc2022-day06-packets" >}}#parallel-loop-expressions) from yesterday can be used to filter out elements like so: ```Chapel [for i in iterable] if someCondition then i ``` Putting these two pieces together, we might write something like: ```Chapel + reduce [for size in directorySizes] if size < 1000000 then size ``` That `directorySizes` is the only "fictional" piece of the solution. Perhaps we can make our `Dir` tree support an iterator of directory sizes? Then, we'd have our answer. In my solution, I do just that. Methods on classes don't have to be procedures --- they can also be iterators. There's only one complication. We want our iterator method to yield the sizes of _all_ of the various sub-directories within a `Dir` including sub-directories of sub-directories. That's because we have to sum them all up as per the problem statement. However, when _computing_ the size of a directory, we don't want to include sub-sub-directories in our counting: the direct sub-directories already include the sizes of their own contents. To make this work, I added a `parentSize` formal to the iterator method, which represents a reference to the parent directory's size. When it's done yielding its own size, as well as the sizes of the sub-directories, the iterator method will add its own size to its parent's. Here's the implementation of the iterator method; I'll talk about it in more detail below. */ iter dirSizes(ref parentSize = 0): int { // Compute sizes from files only. var size = + reduce files.values(); for subDir in dirs { // Yield directory sizes from the dir. for subSize in subDir.dirSizes(size) do yield subSize; } yield size; parentSize += size; } /* The first thing this method does is create a new variable, `size`, representing the current directory's size. It's initialized to the sum of all the file sizes. However, at this point, that's not the whole size --- we also need to figure out how much data is stored in the subdirectories. I use a `for` loop over the `dirs` list to examine each sub-directory of the current folder in turn. Each of these sub-directories is its own full-fledged `Dir`, so we can call its `dirSizes` method. This gives us an iterator of all directory sizes from `subDir`. I simply yield them from the parent iterator, making it yield the sizes of all directories, including nested ones. Notice that I also provide `size` as the argument to the recursive call to `dirSizes`: the inner for-loop serves the double purpose of yielding directory sizes and finishing computing the current folder's size. Once all of the sub-directory sizes have been yielded, the `size` variable includes all the files in the folder, including nested ones. Thus, I use it to yield the size of the current folder. I also add `size` to `parentSize`. That concludes our `Dir` class! */ /* {{< skip >}} ```Chapel iter these(param tag: iterKind): (string, int) where tag == iterKind.standalone { var size = + reduce files.values(); coforall dir in dirs with (+ reduce size) { // Yield directory sizes from the dir. forall subSize in dir do yield subSize; // Count its size for our size. size += dir.size; } yield (name, size); this.size = size; } ``` {{< /skip >}} */ } /* ### Putting It All Together With our `Dir` class complete, we can finally make use of it in our code. The first thing we need to do is read our file system from the input; this is accomplished using the `fromInput` method. */ var rootFolder = Dir.fromInput("/"); /* Next up, we can use that `+reduce` expression I described above. I use a new variable, `rootSize`, to represent the size of the top-level directory. After the call to `dirSizes` completes, it will be set to the total size of the root directory, i.e., the total disk usage. */ var rootSize = 0; writeln(+ reduce [size in rootFolder.dirSizes(rootSize)] if size < 100000 then size); /* I could've omitted the argument to `dirSizes` --- notice from the method's signature that I provide a default value for `parentSize`. ```Chapel iter dirSizes(ref parentSize = 0): int { ``` However, knowing `rootSize` lets us easily compute the amount of space we need to free up (for part 2 of today's problem). */ const toDelete = rootSize - 40000000; // = 30000000 - (70000000 - rootSize) /* We can now re-use our `dirSizes` stream to check every directory size again, this time looking for the smallest folder that meets a certain threshold. A `min` reduction takes care of this: */ writeln(min reduce [size in rootFolder.dirSizes()] if size >= toDelete then size); /* And there's the solution to part 2, as well! */ /* ### Summary This concludes today's description of my solution. This time, I introduced Chapel's classes --- defining them, creating fields and adding methods. We got a little taste of memory management strategies and ownership, though I deliberately kept it light to avoid introducing too many new concepts. Admittedly, today's solution is (for the most part) serial. Although the `+reduce` expression that computes the initial `size` of a directory from its `files` is eligible for parallelization, the `dirSizes` iterator is not. The main reason for this is that the interaction between recursive parallel iterators and reductions is, at the time of writing, unimplemented. Nevertheless, I think that using even a serial iterator has _yielded_ an elegant solution (pun intended). If you wanted to write a parallel version, I'd advise creating a new, non-iterator method on `Dir` that solves just part 1 of today's puzzle. This method could return a tuple of two elements, perhaps `sumSmallSizes` and `dirSize`; then, a simple `forall` loop over `dirs` (and judicious use of reduce intents, which are described in our [day 4 article]({{< relref "aoc2022-day04-ranges" >}}third-solution-parallel-approach)) will let you compute the answer in parallel. Thanks for reading! Please feel free to ask any questions or post any comments you have in the new [Blog Category](https://chapel.discourse.group/c/blog/21) of Chapel's Discourse Page. */