diff --git a/content/blog/00_compiler_intro.md b/content/blog/00_compiler_intro.md
index 5635f3f..01b5513 100644
--- a/content/blog/00_compiler_intro.md
+++ b/content/blog/00_compiler_intro.md
@@ -7,7 +7,7 @@ draft: true
 During my last academic term, I was enrolled in a compilers course.
 We had a final project - develop a compiler for a basic Python subset,
 using LLVM. It was a little boring - virtually nothing about the compiler
-was __not__ covered in class, and it felt more like putting two puzzles
+was __not__ covered in class, and it felt more like putting two puzzle
 pieces together than building a real project.
 
 Instead, I chose to implement a compiler for a functional programming language,
diff --git a/content/blog/01_compiler_tokenizing.md b/content/blog/01_compiler_tokenizing.md
index 976c478..d508f61 100644
--- a/content/blog/01_compiler_tokenizing.md
+++ b/content/blog/01_compiler_tokenizing.md
@@ -48,7 +48,7 @@ are fairly simple - one or more digits is an integer, a few letters together
 are a variable name. In order to be able to efficiently break text up
 into such tokens, we restrict ourselves to __regular languages__. A language
 is defined as a set of strings (potentially infinite), and a regular
-language for which we can write a __regular expression__ to check if
+language is one for which we can write a __regular expression__ to check if
 a string is in the set. Regular expressions are a way of representing patterns
 that a string has to match. We define regular expressions
 as follows:
@@ -77,7 +77,7 @@ Let's see some examples. An integer, such as 326, can be represented with \\([0-
 This means, one or more characters between 0 or 9. Some (most) regex implementations
 have a special symbol for \\([0-9]\\), written as \\(\\setminus d\\). A variable, starting with a lowercase
 letter and containing lowercase or uppercase letters after it,
-can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
+can be written as \\(\[a-z\]([a-zA-Z]+)?\\). Again, most regex implementations provide
 a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
 
 So how does one go about checking if a regular expression matches a string? An efficient way is to
@@ -115,8 +115,8 @@ represent numbers directly into numbers, and do other small tasks.
 
 So, what tokens do we have? From our arithmetic definition, we see that we have integers.
 Let's use the regex `[0-9]+` for those. We also have the operators `+`, `-`, `*`, and `/`.
-`-` is simple enough: the corresponding regex is `-`. We need to
-preface our `/`, `+` and `*` with a backslash, though, since they happen to also be modifiers
+The regex for `-` is simple enough: it's just `-`. However, we need to
+preface our `/`, `+` and `*` with a backslash, since they happen to also be modifiers
 in flex's regular expressions: `\/`, `\+`, `\*`.
 
 Let's also represent some reserved keywords. We'll say that `defn`, `data`, `case`, and `of`
diff --git a/content/blog/02_compiler_parsing.md b/content/blog/02_compiler_parsing.md
index cda5718..15b5da4 100644
--- a/content/blog/02_compiler_parsing.md
+++ b/content/blog/02_compiler_parsing.md
@@ -38,7 +38,7 @@ $$
 In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
 of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
 a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will
-first try to match \\(alpha\\), then \\(beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
+first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
 that nonterminal following the same strategy. However, this leaves a flaw - For instance, consider the grammar
 $$
 \\begin{align}
@@ -105,7 +105,7 @@ A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
 A\_{add} & \\rightarrow A\_{mult}
 \\end{align}
 $$
-The first rule matches another addition, added to the result of another addition. We use the addition in the body
+The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body
 because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
 subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
 of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,
@@ -150,7 +150,7 @@ What's the last \\(C\\)? We also want a "thing" to be a case expression. Here ar
 $$
 \\begin{align}
 C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B\\} \\\\\\
-L\_B & \\rightarrow R \\; , \\; L\_B \\\\\\
+L\_B & \\rightarrow R \\; L\_B \\\\\\
 L\_B & \\rightarrow R \\\\\\
 R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\
 N & \\rightarrow \\text{lowerVar} \\\\\\