From 53172351df9b60184daba7d4171b7bcd04b85a40 Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Thu, 8 Dec 2022 22:33:30 -0800 Subject: [PATCH] Bring day 7 up to date with the blog version --- day7.chpl | 507 +++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 460 insertions(+), 47 deletions(-) diff --git a/day7.chpl b/day7.chpl index 11da9e0..9bc7a83 100644 --- a/day7.chpl +++ b/day7.chpl @@ -14,7 +14,7 @@ /* ### The Task at Hand and My Approach - In today's puzzle, we are given a list of terminal-like commands ( + In [today's puzzle](https://adventofcode.com/2022/day/7), we are given a list of terminal-like commands ( [`ls`](https://man7.org/linux/man-pages/man1/ls.1.html) and [`cd`](https://man7.org/linux/man-pages/man1/cd.1p.html) ), as well as output corresponding to running these commands. The commands explore a fictional file system, which can have files (objects with size) @@ -23,35 +23,420 @@ sizes of all folders that are smaller than a particular threshold. The tree-like nature of the file system does not make it amenable to - representations based on arrays, lists, and maps alone. The trouble with - these data types is that they're flat. Our input could -- and will -- have arbitrary + representations based on arrays, lists, or maps alone. The trouble with + these data types is that they're flat. Our input could have arbitrary levels of nested directories. However, arrays, lists, and maps cannot have - such arbitrary nesting -- we'd need something like a list of lists of lists... + such arbitrary nesting --- we'd need something like a list of lists of lists of... We could, of course, use the `map` and `list` data types to represent the file system with some sort of [adjacency list](https://en.wikipedia.org/wiki/Adjacency_list). However, such an implementation would be somewhat clunky and hard to use. - Instead, we'll use a different tool from the repertoire of Chapel language - features, one we haven't seen so far: classes. Much like in most languages, - classes are a way to group together related pieces of data. up until now, - we've used tuples for this purpose. + Instead, in this article I use a different tool from the repertoire of Chapel language + features, one we haven't seen so far: classes. Specifically, I use a class, `Dir`, to represent + directories in the file system, and build up a tree of these directories + while reading the input. I then create an iterator over this tree that + computes and yields the sizes of the folders. From there, it's easy to + pick out all directory sizes smaller than the threshold and sum them up. + **If you skip right to your favorite parts of a movie, here's a full solution for the day:** + {{< whole_file_min >}} + + And now, on to the explanation train. Before the train departs, let's import + a few of the modules we'll use today. `IO` is a permanent fixture in our + solutions (we always need to read input!), and `List` is a familiar face. + The only newcomer here is `Map`, which helps us associate keys with values, + much like a dictionary in Python, a hash in Ruby, or a map in C++. + We'll use maps and lists for storing the various files and directories + on the file system. */ use IO, Map, List; -class TreeNode { +/* + With that, our train's first stop: classes! + + ### Classes in Chapel + + Like in most languages, classes in Chapel are a way to group together related + pieces of data. Up until now, we've used tuples for this purpose. Tuples, + however, have a couple of limitations when it comes to solving today's + Advent of Code problem: + + * We can't name a tuple's elements. Whenever you make and use a tuple, + it is up to _you_ to remember the order of the elements within it, and + what each element represents. + * Tuples can be nested, but their precise element types, including + nesting depth, must be known at compile-time. As a result, tuples aren’t + flexible enough to support the arbitrary levels of nesting that would be + required by a program that didn’t know the directory structure _a priori_ + (e.g. one that was reading it from disk). We simply don’t have the + information at compile-time to describe the tuple’s types and “shape”. + + Classes have neither of these limitations. They do, however, need to be + explicitly created within Chapel code. For example, one might create a + class to store information about a person: + + ```Chapel + class person { + var firstName, lastName: string; + } + ``` + + We've seen plenty of `var` statements used to create variables; when used + within a class, `var` declares a _member variable_ (also known as a _field_) + for the class. Our `person` contains two pieces of data in its fields: the + person's first name (`firstName`) and last name (`lastName`). + + With that class definition in hand, we can create instances of the `person` class + using the `new` keyword. + + ```Chapel + var biggestCandyFan = new person("Daniel", "Fedorin"); + ``` + + As usual, we can rely on type inference to only write the type `person` once; + Chapel figures out that `biggestCandyFan` is a `person`. Now, it's easy to get + the various fields back out of a class: + + ```Chapel + writeln("The biggest fan of candy is ", biggestCandyFan.firstName); + ``` + + Believe it or not, we've already seen enough of classes to see how to represent + nested data structures. The key observation is that classes have names, which + means that we can create fields that refer back to instances of the same class. Here's + an example of what I mean, in the form of a modified `person` class: + + ```Chapel {hl_lines=3} + class person { + var firstName, lastName: string; + var children: list(owned person); + } + ``` + + The highlighted line is new. We've added a list of children to our person. + These children are themselves instances of `person`, which means they too + can have children of their own. _Et voilà_ - we've got a nested data structure! + + #### Memory Management Strategies + You probably noticed that `children`'s type is `list(owned person)` --- + note the `owned`. This keyword is an indication of the way that memory is + allocated and maintained for classes: their _memory management_. To create + a class, a Chapel program asks for some memory from the computer (_allocates_ it). + This memory is kept by the program until the instance of a class is no longer + needed, at which point it's _deallocated_/_freed_. The challenge is knowing when + a class is no longer needed! This is where _memory management strategies_, + like `owned`, come in. + + We don't need to get too deep into the various memory management strategies + in today's post. + + {{< details summary="**(If you're curious, here's a brief description of each strategy...)**" >}} + * When using the `owned` strategy, a class instance has one "owner" variable. + The instance is only around as long as this owner exists. + As soon as the owner disappears, the class instance is deallocated. + In some cases --- though we won't be covering them today --- ownership can + be transferred from one variable to another, but no two values can + own the same class instance at the same time. + + Other variables can still refer to an `owned` class instance, but they must _borrow_ it, + creating, for example, a `borrowed person`. Borrows do not affect the + lifetime of class or when it is deallocated. + * When using the `shared` strategy, Chapel keeps track of how many places + still have variables that refer to a particular instance of a class. This + is typically called a _reference count_. Each time a variable is created + or changed to refer to a class instance, the instance's reference count + increases. When that variable goes out of scope and disappears, the + reference count decreases. Finally, when the reference count reaches + zero (no more variables refer to the class instance), there's no point + in keeping it around anymore, and its memory is deallocated. + + As is the case with `owned`, other variables can borrow `shared` class instances. + Such borrows do not affect the reference count at all, and therefore don't + influence when the instance is freed. + * When using the `unmanaged` strategy, you're promising to manually free + the memory later, using the `delete` keyword. This is very similar to + how `new`/`delete` work in classic C++. + {{< /details >}} + + So, the `owned` keyword in our `children` list means we've opted for the + `owned` memory management strategy. The implication of this is that + when a "parent" person is deallocated, so are all of its children + (since the person class, through its `children` list, owns each child). + If we aren't planning on sharing our data, `owned` is the preferred strategy. This is because + it precludes the need for some bookkeeping, which + often makes a difference in terms of performance. The added benefit to using + `owned`, in my personal view, is that it's easier to figure out when something + will be deleted --- there's no chance of some other variable, elsewhere in my program, + preventing a class instance's deallocation. + + #### Methods + Remember how I said that classes can be used to group together pieces + of related data? Well, they can do more than that. They can also group + together operations on this data, in the form of _methods_. For instance, + we could add the following definition **inside** the `class` declaration + for our `person`: + + ```Chapel + class person { + // ... as before + + proc getGreeting() { + return "Hello, " + this.firstName + "!"; + } + } + ``` + + Just like fields can be thought of as `var`s that are associated with a particular + class instance, methods can be thought of as _procedures_ associated with + a particular class instance. Thus, methods behave pretty much exactly + like the `proc`s we've seen so far, with the notable difference of being able to + access that class instance through the `this` keyword. + For example, inside the body of a method like `getGreeting` above, + `this.firstName` gets us the person's first name, and `this.lastName` would + get us their last name. + + We can call methods using the dot syntax: + + ```Chapel + // Prints "Hello, Daniel!" + writeln(biggestCandyFan.getGreeting()); + ``` + + Methods are a powerful tool for abstraction; rather than writing external code + that refers to the various fields of a class, we can put that logic + inside of methods, and avoid exposing it to the rest of the world. A person + writing `.getGreeting()` will not need to know how a name is represented + in the `person` class. + + Another sort of method is a _type method_ (sometimes referred to as + a _static method_ in other languages). Rather than being called on + an instance of a person, like `biggestCandyFan` or `daniel`, it's called + on the class itself. For instance: + + ```Chapel + class person { + // ... as before + + proc type createBiggestCandyFan() { + return new person("Daniel", "Fedorin"); + } + } + + var biggestCandyFan = person.createBiggestCandyFan(); + ``` + + Methods like this have the benefit of being associated with a particular class. + This means that another class can have its own `createBiggestCandyFan()` + method, and there won't be any confusion or problems arising from trying + to figure out which is which. Perhaps dogs (represented by a hypothetical + `dog` class) have a biggest candy fan, too! + + ```Chapel + var biggestCandyFan = person.createBiggestCandyFan(); + var biggestCandyFanDog = dog.createBiggestCandyFan(); + ``` + + ### A `Dir` Class to Represent Directories + Back to the solution. The class I use for tracking directories is actually not too different + from our modified `person` class above. Each directory + {{< sidenote right "dir-firstname-note" "will have a name" >}} + Despite the recent media noise about ChatGPT, directories have not yet + been granted personhood, and do not have both first and last names. + {{< /sidenote >}} + as well as a collection of files and directories it contains. +*/ + +class Dir { var name: string; var files = new map(string, int); - var dirs = new list(owned TreeNode); + var dirs = new list(owned Dir); - proc init(name: string) { - this.name = name; +/* + Since files have no + additional information to them besides their size, I decided to represent + them as a map --- a directory's `files` field associates each file's name + with that file's size. The subdirectories are represented just like + the `children` field from our `person` record, as a list of owned `Dir`s. + + There are a few more things I want to add to `Dir`; + the first is a way to read our directory from our puzzle input. + + #### Reading the File System with the `fromInput` Type Method + For reasons of abstraction and avoiding conflicts, I put + the code for creating a directory from user input into a type method on `Dir`. Within + this method, I include the now-familiar code for reading from the + input using `readLine`, until we run out of lines. +*/ + + proc type fromInput(name: string): owned Dir { + var line: string; + var newDir = new Dir(name); + + while readLine(line, stripNewline = true) { + /* + Notice that I'm accepting the name for the + directory as a string formal and initializing a new variable `newDir` with that name. + Notice also that I don't need to provide the `files` and `dirs` + as arguments to `new Dir` --- they have default values in the + class definition. By default, `new` uses the `owned` memory management + strategy. For the time being, the `newDir` variable owns our + directory-under-construction. + + We're reading lines now; all that's left is to figure out what to do + with them. The first case is that of `$ cd ..`. When we see that line, + it means that we're done looking at the current directory; none + of the subsequent `ls` lines will be meant for us. Thus, we break + out of the input `while`-loop. + */ + if line == "$ cd .." { + break; + /* + If the `cd` command is used, but its argument isn't `..`, we're being + asked to descend into a sub-directory of our current `newDir`. + In this case, we call the `fromInput` method again, recursively, + to create a subdirectory of the current one. This + call will keep consuming lines from the input until the sub-directory + has been processed, at which point it will return it to us. We'll + immediately append this sub-directory to the `newDir.dirs` list, + which becomes the sub-directory's new owner. + + Recall that we need to give `fromInput` the name of the new + sub-directory. We can figure out the name by slicing the string + starting after the `$ cd` prefix. Since I want to get the rest of the + characters after the prefix, I leave the end of my range unbounded, which + makes the slice go until the characters run out at the end of the string. + If you're feeling shaky on lists and `append`, check out our [day 5 article]({{< relref "aoc2022-day05-cratestacks" >}}#moving-crates-within-an-array-of-lists). + If you want a little refresher on slicing, we first covered it on [day 3]({{< relref "aoc2022-day03-rucksacks" >}}/ranges-and-slicing). + + */ + } else if line.startsWith("$ cd ") { + param cdPrefix = "$ cd "; + const dirName = line[cdPrefix.size..]; + newDir.dirs.append(Dir.fromInput(dirName)); + /* + As it turns out, all that's left is to handle files. We already get + directory names from `cd`, so there's no reason to worry about + lines starting with `dir`. The `ls` command itself always precedes + the list of files and directories; by itself, it provides us no + additional information. Thus, our last case is a line that's neither + `dir` nor `ls`. Such a line is a file, so its format will be a number + followed by the file's name. + + I use the `partition` method on the line to split it into three + pieces: the part before the space, the space itself, and the part + after the space. After that, I can just update the `newDir` map, + associating the file called `name` with its size. I use an integer cast + to convert `size` (a string) to a number. + */ + } else if !line.startsWith("$ ls") && !line.startsWith("dir") { + const (size, _, name) = line.partition(" "); + newDir.files[name] = size : int; + /* + That's it for the loop! Once the loop stops running, we know we're done + processing the directory. All that remains is to return it. Returning + an `owned` value from a function or method transfers ownership to whatever + code calls the function or method. + */ + } + } + return newDir; } + /* + One more thing: I have explicitly annotated the + return type of `fromInput` to be `owned Dir` to let Chapel know + that I'm using the `owned` memory management strategy. This might just + be the first return type annotation we've written so far. Up until now, + Chapel has been able to deduce the return types of our procedures + and iterators automatically. However, here, because we are using + recursion, it needs just a little bit of help: determining the types + in the body of `fromInput` requires knowing the type of `fromInput`! + The manual type annotation helps break that loop. + */ + + /* + #### An Iterator Method for Listing Directory Sizes + Let's recap. What we have now is a data structure, `Dir`, which represents + the directory tree. We also have a type method, `Dir.fromInput` that + converts our puzzle input into this data structure. What's left? + + The way I see it, the problem is composed of three pieces: + + 1. Go through all of the directory sizes... + 2. ... ignoring those that are above a certain threshold ... + 3. ... and sum them. + + Over the past week, we've gotten really good at summing things! In + Chapel, we can just use `+reduce` to compute the sum of something + iterable, so there's point number three. For point two, it turns out that + those [loop expressions]({{< relref "aoc2022-day06-packets" >}}#parallel-loop-expressions) + from yesterday can be used to filter out elements like so: + + ```Chapel + [for i in iterable] if someCondition then i + ``` + + Putting these two pieces together, we might write something like: + ```Chapel + + reduce [for size in directorySizes] if size < 1000000 then size + ``` + + That `directorySizes` is the only "fictional" piece of the solution. + Perhaps we can make our `Dir` tree support an iterator of directory sizes? + Then, we'd have our answer. + + In my solution, I do just that. Methods on classes don't have to be procedures --- + they can also be iterators. There's only one complication. We want our + iterator method to yield the sizes of _all_ of the various sub-directories + within a `Dir` including sub-directories of sub-directories. That's because + we have to sum them all up as per the problem statement. However, when + _computing_ the size of a directory, we don't want to include sub-sub-directories + in our counting: the direct sub-directories already include the sizes of + their own contents. To make this work, I added a `parentSize` formal to + the iterator method, which represents a reference to the parent directory's + size. When it's done yielding its own size, as well as the sizes of the + sub-directories, the iterator method will add its own size to its parent's. + + Here's the implementation of the iterator method; I'll talk about it in + more detail below. + */ + iter dirSizes(ref parentSize = 0): int { + // Compute sizes from files only. + var size = + reduce files.values(); + for subDir in dirs { + // Yield directory sizes from the dir. + for subSize in subDir.dirSizes(size) do yield subSize; + } + yield size; + parentSize += size; + } + /* + The first thing this method does is create a new variable, `size`, + representing the current directory's size. It's initialized to the sum + of all the file sizes. However, at this point, that's not the whole size --- + we also need to figure out how much data is stored in the subdirectories. + + I use a `for` loop over the `dirs` list to examine each sub-directory + of the current folder in turn. Each of these sub-directories is its + own full-fledged `Dir`, so we can call its `dirSizes` + method. This gives us an iterator of all directory sizes from `subDir`. + I simply yield them from the parent iterator, making it yield + the sizes of all directories, including nested ones. Notice that I also + provide `size` as the argument to the recursive call to `dirSizes`: + the inner for-loop serves the double purpose of yielding directory sizes + and finishing computing the current folder's size. + + Once all of the sub-directory sizes have been yielded, the `size` variable + includes all the files in the folder, including nested ones. Thus, I use it to yield + the size of the current folder. I also add `size` to `parentSize`. + + That concludes our `Dir` class! + */ /* + {{< skip >}} ```Chapel iter these(param tag: iterKind): (string, int) where tag == iterKind.standalone { var size = + reduce files.values(); @@ -65,46 +450,74 @@ class TreeNode { this.size = size; } ``` + {{< /skip >}} */ - iter dirSizes(ref parentSize = 0): (string, int) { - var size = + reduce files.values(); - for dir in dirs { - // Yield directory sizes from the dir. - for subSize in dir.dirSizes(size) do yield subSize; - } - yield (name, size); - parentSize += size; - } - - proc type fromInput(name: string, readFrom): owned TreeNode { - var line: string; - var newDir = new TreeNode(name); - - while readFrom.readLine(line, stripNewline = true) { - if line == "$ cd .." { - break; - } else if line.startsWith("$ cd ") { - const dirName = line["$ cd ".size..]; - newDir.dirs.append(TreeNode.fromInput(dirName, readFrom)); - } else if !line.startsWith("$ ls") { - const (sizeOrDir, _, name) = line.partition(" "); - if sizeOrDir == "dir" { - // Ignore directories, we'll `cd` into them. - } else { - newDir.files[name] = sizeOrDir : int; - } - } - } - return newDir; - } } -var rootFolder = TreeNode.fromInput("", stdin); +/* + ### Putting It All Together + With our `Dir` class complete, we can finally make use of it in our code. + The first thing we need to do is read our file system from the input; + this is accomplished using the `fromInput` method. +*/ +var rootFolder = Dir.fromInput("/"); + +/* + Next up, we can use that `+reduce` expression I described above. I use + a new variable, `rootSize`, to represent the size of the top-level directory. + After the call to `dirSizes` completes, it will be set to the total size of + the root directory, i.e., the total disk usage. */ var rootSize = 0; -writeln(+ reduce [(_, size) in rootFolder.dirSizes(rootSize)] if size < 100000 then size); +writeln(+ reduce [size in rootFolder.dirSizes(rootSize)] if size < 100000 then size); -const toDelete = rootSize - 40000000; -writeln(min reduce [(_, size) in rootFolder.dirSizes()] if size >= toDelete then size); +/* + I could've omitted the argument to `dirSizes` --- notice from the method's + signature that I provide a default value for `parentSize`. + + ```Chapel + iter dirSizes(ref parentSize = 0): int { + ``` + + However, knowing `rootSize` lets us easily compute the amount of space we need + to free up (for part 2 of today's problem). + */ +const toDelete = rootSize - 40000000; // = 30000000 - (70000000 - rootSize) + +/* + We can now re-use our `dirSizes` stream to check every directory size again, + this time looking for the smallest folder that meets a certain threshold. + A `min` reduction takes care of this: + */ +writeln(min reduce [size in rootFolder.dirSizes()] if size >= toDelete then size); + +/* And there's the solution to part 2, as well! */ + +/* + ### Summary + This concludes today's description of my solution. This time, I introduced + Chapel's classes --- defining them, creating fields and adding methods. We got + a little taste of memory management strategies and ownership, though I deliberately + kept it light to avoid introducing too many new concepts. + + Admittedly, today's solution is (for the most part) serial. Although the + `+reduce` expression that computes the initial `size` of a directory from + its `files` is eligible for parallelization, the `dirSizes` iterator is not. The main + reason for this is that the interaction between recursive parallel iterators and + reductions is, at the time of writing, unimplemented. + Nevertheless, I think that using even a serial iterator has _yielded_ an elegant + solution (pun intended). + + If you wanted to write a parallel version, I'd advise creating a new, + non-iterator method on `Dir` that solves just part 1 of today's puzzle. + This method could return a tuple of two elements, perhaps `sumSmallSizes` + and `dirSize`; then, a simple `forall` loop over `dirs` (and judicious use of reduce intents, + which are described in our [day 4 article]({{< relref "aoc2022-day04-ranges" >}}third-solution-parallel-approach)) + will let you compute the answer in parallel. + + Thanks for reading! Please feel free + to ask any questions or post any comments you have in the new [Blog + Category](https://chapel.discourse.group/c/blog/21) of Chapel's + Discourse Page. */