AdventOfCode-2022/day7.chpl
2022-12-08 22:33:30 -08:00

524 lines
24 KiB
Chapel
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

// Advent of Code 2022, Day 7: Traversing Directories
// authors: ["Daniel Fedorin"]
// summary: "A solution to day seven of AoC 2022, introducing classes and memory management."
// tags: ["Advent of Code 2022"]
// date: 2022-12-07
/*
Welcome to day 7 of Chapel's Advent of Code 2022 series. We're over halway
through the twelve days of Chapel AoC! In case you haven't been following
the series, check out the introductory [Advent of Code 2022: Twelve
Days of Chapel](../aoc2022-day00-intro/) article for more context.
*/
/*
### The Task at Hand and My Approach
In [today's puzzle](https://adventofcode.com/2022/day/7), we are given a list of terminal-like commands (
[`ls`](https://man7.org/linux/man-pages/man1/ls.1.html) and [`cd`](https://man7.org/linux/man-pages/man1/cd.1p.html)
), as well as output corresponding to running these commands. The commands
explore a fictional file system, which can have files (objects with size)
as well as directories that group files and other (sub-)directories. The
problem then asks to compute the sizes of each folder, and to total up the
sizes of all folders that are smaller than a particular threshold.
The tree-like nature of the file system does not make it amenable to
representations based on arrays, lists, or maps alone. The trouble with
these data types is that they're flat. Our input could have arbitrary
levels of nested directories. However, arrays, lists, and maps cannot have
such arbitrary nesting --- we'd need something like a list of lists of lists of...
We could, of course, use the `map` and `list` data types to represent the
file system with some sort of [adjacency list](https://en.wikipedia.org/wiki/Adjacency_list).
However, such an implementation would be somewhat clunky and hard to use.
Instead, in this article I use a different tool from the repertoire of Chapel language
features, one we haven't seen so far: classes. Specifically, I use a class, `Dir`, to represent
directories in the file system, and build up a tree of these directories
while reading the input. I then create an iterator over this tree that
computes and yields the sizes of the folders. From there, it's easy to
pick out all directory sizes smaller than the threshold and sum them up.
**If you skip right to your favorite parts of a movie, here's a full solution for the day:**
{{< whole_file_min >}}
And now, on to the explanation train. Before the train departs, let's import
a few of the modules we'll use today. `IO` is a permanent fixture in our
solutions (we always need to read input!), and `List` is a familiar face.
The only newcomer here is `Map`, which helps us associate keys with values,
much like a dictionary in Python, a hash in Ruby, or a map in C++.
We'll use maps and lists for storing the various files and directories
on the file system.
*/
use IO, Map, List;
/*
With that, our train's first stop: classes!
### Classes in Chapel
Like in most languages, classes in Chapel are a way to group together related
pieces of data. Up until now, we've used tuples for this purpose. Tuples,
however, have a couple of limitations when it comes to solving today's
Advent of Code problem:
* We can't name a tuple's elements. Whenever you make and use a tuple,
it is up to _you_ to remember the order of the elements within it, and
what each element represents.
* Tuples can be nested, but their precise element types, including
nesting depth, must be known at compile-time. As a result, tuples arent
flexible enough to support the arbitrary levels of nesting that would be
required by a program that didnt know the directory structure _a priori_
(e.g. one that was reading it from disk). We simply dont have the
information at compile-time to describe the tuples types and “shape”.
Classes have neither of these limitations. They do, however, need to be
explicitly created within Chapel code. For example, one might create a
class to store information about a person:
```Chapel
class person {
var firstName, lastName: string;
}
```
We've seen plenty of `var` statements used to create variables; when used
within a class, `var` declares a _member variable_ (also known as a _field_)
for the class. Our `person` contains two pieces of data in its fields: the
person's first name (`firstName`) and last name (`lastName`).
With that class definition in hand, we can create instances of the `person` class
using the `new` keyword.
```Chapel
var biggestCandyFan = new person("Daniel", "Fedorin");
```
As usual, we can rely on type inference to only write the type `person` once;
Chapel figures out that `biggestCandyFan` is a `person`. Now, it's easy to get
the various fields back out of a class:
```Chapel
writeln("The biggest fan of candy is ", biggestCandyFan.firstName);
```
Believe it or not, we've already seen enough of classes to see how to represent
nested data structures. The key observation is that classes have names, which
means that we can create fields that refer back to instances of the same class. Here's
an example of what I mean, in the form of a modified `person` class:
```Chapel {hl_lines=3}
class person {
var firstName, lastName: string;
var children: list(owned person);
}
```
The highlighted line is new. We've added a list of children to our person.
These children are themselves instances of `person`, which means they too
can have children of their own. _Et voilà_ - we've got a nested data structure!
#### Memory Management Strategies
You probably noticed that `children`'s type is `list(owned person)` ---
note the `owned`. This keyword is an indication of the way that memory is
allocated and maintained for classes: their _memory management_. To create
a class, a Chapel program asks for some memory from the computer (_allocates_ it).
This memory is kept by the program until the instance of a class is no longer
needed, at which point it's _deallocated_/_freed_. The challenge is knowing when
a class is no longer needed! This is where _memory management strategies_,
like `owned`, come in.
We don't need to get too deep into the various memory management strategies
in today's post.
{{< details summary="**(If you're curious, here's a brief description of each strategy...)**" >}}
* When using the `owned` strategy, a class instance has one "owner" variable.
The instance is only around as long as this owner exists.
As soon as the owner disappears, the class instance is deallocated.
In some cases --- though we won't be covering them today --- ownership can
be transferred from one variable to another, but no two values can
own the same class instance at the same time.
Other variables can still refer to an `owned` class instance, but they must _borrow_ it,
creating, for example, a `borrowed person`. Borrows do not affect the
lifetime of class or when it is deallocated.
* When using the `shared` strategy, Chapel keeps track of how many places
still have variables that refer to a particular instance of a class. This
is typically called a _reference count_. Each time a variable is created
or changed to refer to a class instance, the instance's reference count
increases. When that variable goes out of scope and disappears, the
reference count decreases. Finally, when the reference count reaches
zero (no more variables refer to the class instance), there's no point
in keeping it around anymore, and its memory is deallocated.
As is the case with `owned`, other variables can borrow `shared` class instances.
Such borrows do not affect the reference count at all, and therefore don't
influence when the instance is freed.
* When using the `unmanaged` strategy, you're promising to manually free
the memory later, using the `delete` keyword. This is very similar to
how `new`/`delete` work in classic C++.
{{< /details >}}
So, the `owned` keyword in our `children` list means we've opted for the
`owned` memory management strategy. The implication of this is that
when a "parent" person is deallocated, so are all of its children
(since the person class, through its `children` list, owns each child).
If we aren't planning on sharing our data, `owned` is the preferred strategy. This is because
it precludes the need for some bookkeeping, which
often makes a difference in terms of performance. The added benefit to using
`owned`, in my personal view, is that it's easier to figure out when something
will be deleted --- there's no chance of some other variable, elsewhere in my program,
preventing a class instance's deallocation.
#### Methods
Remember how I said that classes can be used to group together pieces
of related data? Well, they can do more than that. They can also group
together operations on this data, in the form of _methods_. For instance,
we could add the following definition **inside** the `class` declaration
for our `person`:
```Chapel
class person {
// ... as before
proc getGreeting() {
return "Hello, " + this.firstName + "!";
}
}
```
Just like fields can be thought of as `var`s that are associated with a particular
class instance, methods can be thought of as _procedures_ associated with
a particular class instance. Thus, methods behave pretty much exactly
like the `proc`s we've seen so far, with the notable difference of being able to
access that class instance through the `this` keyword.
For example, inside the body of a method like `getGreeting` above,
`this.firstName` gets us the person's first name, and `this.lastName` would
get us their last name.
We can call methods using the dot syntax:
```Chapel
// Prints "Hello, Daniel!"
writeln(biggestCandyFan.getGreeting());
```
Methods are a powerful tool for abstraction; rather than writing external code
that refers to the various fields of a class, we can put that logic
inside of methods, and avoid exposing it to the rest of the world. A person
writing `.getGreeting()` will not need to know how a name is represented
in the `person` class.
Another sort of method is a _type method_ (sometimes referred to as
a _static method_ in other languages). Rather than being called on
an instance of a person, like `biggestCandyFan` or `daniel`, it's called
on the class itself. For instance:
```Chapel
class person {
// ... as before
proc type createBiggestCandyFan() {
return new person("Daniel", "Fedorin");
}
}
var biggestCandyFan = person.createBiggestCandyFan();
```
Methods like this have the benefit of being associated with a particular class.
This means that another class can have its own `createBiggestCandyFan()`
method, and there won't be any confusion or problems arising from trying
to figure out which is which. Perhaps dogs (represented by a hypothetical
`dog` class) have a biggest candy fan, too!
```Chapel
var biggestCandyFan = person.createBiggestCandyFan();
var biggestCandyFanDog = dog.createBiggestCandyFan();
```
### A `Dir` Class to Represent Directories
Back to the solution. The class I use for tracking directories is actually not too different
from our modified `person` class above. Each directory
{{< sidenote right "dir-firstname-note" "will have a name" >}}
Despite the recent media noise about ChatGPT, directories have not yet
been granted personhood, and do not have both first and last names.
{{< /sidenote >}}
as well as a collection of files and directories it contains.
*/
class Dir {
var name: string;
var files = new map(string, int);
var dirs = new list(owned Dir);
/*
Since files have no
additional information to them besides their size, I decided to represent
them as a map --- a directory's `files` field associates each file's name
with that file's size. The subdirectories are represented just like
the `children` field from our `person` record, as a list of owned `Dir`s.
There are a few more things I want to add to `Dir`;
the first is a way to read our directory from our puzzle input.
#### Reading the File System with the `fromInput` Type Method
For reasons of abstraction and avoiding conflicts, I put
the code for creating a directory from user input into a type method on `Dir`. Within
this method, I include the now-familiar code for reading from the
input using `readLine`, until we run out of lines.
*/
proc type fromInput(name: string): owned Dir {
var line: string;
var newDir = new Dir(name);
while readLine(line, stripNewline = true) {
/*
Notice that I'm accepting the name for the
directory as a string formal and initializing a new variable `newDir` with that name.
Notice also that I don't need to provide the `files` and `dirs`
as arguments to `new Dir` --- they have default values in the
class definition. By default, `new` uses the `owned` memory management
strategy. For the time being, the `newDir` variable owns our
directory-under-construction.
We're reading lines now; all that's left is to figure out what to do
with them. The first case is that of `$ cd ..`. When we see that line,
it means that we're done looking at the current directory; none
of the subsequent `ls` lines will be meant for us. Thus, we break
out of the input `while`-loop.
*/
if line == "$ cd .." {
break;
/*
If the `cd` command is used, but its argument isn't `..`, we're being
asked to descend into a sub-directory of our current `newDir`.
In this case, we call the `fromInput` method again, recursively,
to create a subdirectory of the current one. This
call will keep consuming lines from the input until the sub-directory
has been processed, at which point it will return it to us. We'll
immediately append this sub-directory to the `newDir.dirs` list,
which becomes the sub-directory's new owner.
Recall that we need to give `fromInput` the name of the new
sub-directory. We can figure out the name by slicing the string
starting after the `$ cd` prefix. Since I want to get the rest of the
characters after the prefix, I leave the end of my range unbounded, which
makes the slice go until the characters run out at the end of the string.
If you're feeling shaky on lists and `append`, check out our [day 5 article]({{< relref "aoc2022-day05-cratestacks" >}}#moving-crates-within-an-array-of-lists).
If you want a little refresher on slicing, we first covered it on [day 3]({{< relref "aoc2022-day03-rucksacks" >}}/ranges-and-slicing).
*/
} else if line.startsWith("$ cd ") {
param cdPrefix = "$ cd ";
const dirName = line[cdPrefix.size..];
newDir.dirs.append(Dir.fromInput(dirName));
/*
As it turns out, all that's left is to handle files. We already get
directory names from `cd`, so there's no reason to worry about
lines starting with `dir`. The `ls` command itself always precedes
the list of files and directories; by itself, it provides us no
additional information. Thus, our last case is a line that's neither
`dir` nor `ls`. Such a line is a file, so its format will be a number
followed by the file's name.
I use the `partition` method on the line to split it into three
pieces: the part before the space, the space itself, and the part
after the space. After that, I can just update the `newDir` map,
associating the file called `name` with its size. I use an integer cast
to convert `size` (a string) to a number.
*/
} else if !line.startsWith("$ ls") && !line.startsWith("dir") {
const (size, _, name) = line.partition(" ");
newDir.files[name] = size : int;
/*
That's it for the loop! Once the loop stops running, we know we're done
processing the directory. All that remains is to return it. Returning
an `owned` value from a function or method transfers ownership to whatever
code calls the function or method.
*/
}
}
return newDir;
}
/*
One more thing: I have explicitly annotated the
return type of `fromInput` to be `owned Dir` to let Chapel know
that I'm using the `owned` memory management strategy. This might just
be the first return type annotation we've written so far. Up until now,
Chapel has been able to deduce the return types of our procedures
and iterators automatically. However, here, because we are using
recursion, it needs just a little bit of help: determining the types
in the body of `fromInput` requires knowing the type of `fromInput`!
The manual type annotation helps break that loop.
*/
/*
#### An Iterator Method for Listing Directory Sizes
Let's recap. What we have now is a data structure, `Dir`, which represents
the directory tree. We also have a type method, `Dir.fromInput` that
converts our puzzle input into this data structure. What's left?
The way I see it, the problem is composed of three pieces:
1. Go through all of the directory sizes...
2. ... ignoring those that are above a certain threshold ...
3. ... and sum them.
Over the past week, we've gotten really good at summing things! In
Chapel, we can just use `+reduce` to compute the sum of something
iterable, so there's point number three. For point two, it turns out that
those [loop expressions]({{< relref "aoc2022-day06-packets" >}}#parallel-loop-expressions)
from yesterday can be used to filter out elements like so:
```Chapel
[for i in iterable] if someCondition then i
```
Putting these two pieces together, we might write something like:
```Chapel
+ reduce [for size in directorySizes] if size < 1000000 then size
```
That `directorySizes` is the only "fictional" piece of the solution.
Perhaps we can make our `Dir` tree support an iterator of directory sizes?
Then, we'd have our answer.
In my solution, I do just that. Methods on classes don't have to be procedures ---
they can also be iterators. There's only one complication. We want our
iterator method to yield the sizes of _all_ of the various sub-directories
within a `Dir` including sub-directories of sub-directories. That's because
we have to sum them all up as per the problem statement. However, when
_computing_ the size of a directory, we don't want to include sub-sub-directories
in our counting: the direct sub-directories already include the sizes of
their own contents. To make this work, I added a `parentSize` formal to
the iterator method, which represents a reference to the parent directory's
size. When it's done yielding its own size, as well as the sizes of the
sub-directories, the iterator method will add its own size to its parent's.
Here's the implementation of the iterator method; I'll talk about it in
more detail below.
*/
iter dirSizes(ref parentSize = 0): int {
// Compute sizes from files only.
var size = + reduce files.values();
for subDir in dirs {
// Yield directory sizes from the dir.
for subSize in subDir.dirSizes(size) do yield subSize;
}
yield size;
parentSize += size;
}
/*
The first thing this method does is create a new variable, `size`,
representing the current directory's size. It's initialized to the sum
of all the file sizes. However, at this point, that's not the whole size ---
we also need to figure out how much data is stored in the subdirectories.
I use a `for` loop over the `dirs` list to examine each sub-directory
of the current folder in turn. Each of these sub-directories is its
own full-fledged `Dir`, so we can call its `dirSizes`
method. This gives us an iterator of all directory sizes from `subDir`.
I simply yield them from the parent iterator, making it yield
the sizes of all directories, including nested ones. Notice that I also
provide `size` as the argument to the recursive call to `dirSizes`:
the inner for-loop serves the double purpose of yielding directory sizes
and finishing computing the current folder's size.
Once all of the sub-directory sizes have been yielded, the `size` variable
includes all the files in the folder, including nested ones. Thus, I use it to yield
the size of the current folder. I also add `size` to `parentSize`.
That concludes our `Dir` class!
*/
/*
{{< skip >}}
```Chapel
iter these(param tag: iterKind): (string, int) where tag == iterKind.standalone {
var size = + reduce files.values();
coforall dir in dirs with (+ reduce size) {
// Yield directory sizes from the dir.
forall subSize in dir do yield subSize;
// Count its size for our size.
size += dir.size;
}
yield (name, size);
this.size = size;
}
```
{{< /skip >}}
*/
}
/*
### Putting It All Together
With our `Dir` class complete, we can finally make use of it in our code.
The first thing we need to do is read our file system from the input;
this is accomplished using the `fromInput` method.
*/
var rootFolder = Dir.fromInput("/");
/*
Next up, we can use that `+reduce` expression I described above. I use
a new variable, `rootSize`, to represent the size of the top-level directory.
After the call to `dirSizes` completes, it will be set to the total size of
the root directory, i.e., the total disk usage. */
var rootSize = 0;
writeln(+ reduce [size in rootFolder.dirSizes(rootSize)] if size < 100000 then size);
/*
I could've omitted the argument to `dirSizes` --- notice from the method's
signature that I provide a default value for `parentSize`.
```Chapel
iter dirSizes(ref parentSize = 0): int {
```
However, knowing `rootSize` lets us easily compute the amount of space we need
to free up (for part 2 of today's problem).
*/
const toDelete = rootSize - 40000000; // = 30000000 - (70000000 - rootSize)
/*
We can now re-use our `dirSizes` stream to check every directory size again,
this time looking for the smallest folder that meets a certain threshold.
A `min` reduction takes care of this:
*/
writeln(min reduce [size in rootFolder.dirSizes()] if size >= toDelete then size);
/* And there's the solution to part 2, as well! */
/*
### Summary
This concludes today's description of my solution. This time, I introduced
Chapel's classes --- defining them, creating fields and adding methods. We got
a little taste of memory management strategies and ownership, though I deliberately
kept it light to avoid introducing too many new concepts.
Admittedly, today's solution is (for the most part) serial. Although the
`+reduce` expression that computes the initial `size` of a directory from
its `files` is eligible for parallelization, the `dirSizes` iterator is not. The main
reason for this is that the interaction between recursive parallel iterators and
reductions is, at the time of writing, unimplemented.
Nevertheless, I think that using even a serial iterator has _yielded_ an elegant
solution (pun intended).
If you wanted to write a parallel version, I'd advise creating a new,
non-iterator method on `Dir` that solves just part 1 of today's puzzle.
This method could return a tuple of two elements, perhaps `sumSmallSizes`
and `dirSize`; then, a simple `forall` loop over `dirs` (and judicious use of reduce intents,
which are described in our [day 4 article]({{< relref "aoc2022-day04-ranges" >}}third-solution-parallel-approach))
will let you compute the answer in parallel.
Thanks for reading! Please feel free
to ask any questions or post any comments you have in the new [Blog
Category](https://chapel.discourse.group/c/blog/21) of Chapel's
Discourse Page. */