diff --git a/content/blog/01_compiler_tokenizing.md b/content/blog/01_compiler_tokenizing.md
index e2980db..1069375 100644
--- a/content/blog/01_compiler_tokenizing.md
+++ b/content/blog/01_compiler_tokenizing.md
@@ -55,31 +55,31 @@ patterns that a string has to match. We define regular expressions
 as follows:

 * Any character is a regular expression that matches that character. Thus,
-\\(a\\) is a regular expression (from now shortened to regex) that matches
+\(a\) is a regular expression (from now shortened to regex) that matches
 the character 'a', and nothing else.
-* \\(r_1r_2\\), or the concatenation of \\(r_1\\) and \\(r_2\\), is
-a regular expression that matches anything matched by \\(r_1\\), followed
-by anything that matches \\(r_2\\). For instance, \\(ab\\), matches
+* \(r_1r_2\), or the concatenation of \(r_1\) and \(r_2\), is
+a regular expression that matches anything matched by \(r_1\), followed
+by anything that matches \(r_2\). For instance, \(ab\), matches
 the character 'a' followed by the character 'b' (thus matching "ab").
-* \\(r_1|r_2\\) matches anything that is either matched by \\(r_1\\) or
-\\(r_2\\). Thus, \\(a|b\\) matches the character 'a' or the character 'b'.
-* \\(r_1?\\) matches either an empty string, or anything matched by \\(r_1\\).
-* \\(r_1+\\) matches one or more things matched by \\(r_1\\). So,
-\\(a+\\) matches "a", "aa", "aaa", and so on.
-* \\((r_1)\\) matches anything that matches \\(r_1\\). This is mostly used
+* \(r_1|r_2\) matches anything that is either matched by \(r_1\) or
+\(r_2\). Thus, \(a|b\) matches the character 'a' or the character 'b'.
+* \(r_1?\) matches either an empty string, or anything matched by \(r_1\).
+* \(r_1+\) matches one or more things matched by \(r_1\). So,
+\(a+\) matches "a", "aa", "aaa", and so on.
+* \((r_1)\) matches anything that matches \(r_1\). This is mostly used
 to group things together in more complicated expressions.
-* \\(.\\) matches any character.
+* \(.\) matches any character.
-More powerful variations of regex also include an "any of" operator, \\([c_1c_2c_3]\\),
-which is equivalent to \\(c_1|c_2|c_3\\), and a "range" operator, \\([c_1-c_n]\\), which
-matches all characters in the range between \\(c_1\\) and \\(c_n\\), inclusive.
+More powerful variations of regex also include an "any of" operator, \([c_1c_2c_3]\),
+which is equivalent to \(c_1|c_2|c_3\), and a "range" operator, \([c_1-c_n]\), which
+matches all characters in the range between \(c_1\) and \(c_n\), inclusive.

-Let's see some examples. An integer, such as 326, can be represented with \\([0-9]+\\).
+Let's see some examples. An integer, such as 326, can be represented with \([0-9]+\).
 This means, one or more characters between 0 or 9. Some (most) regex implementations
-have a special symbol for \\([0-9]\\), written as \\(\\setminus d\\). A variable,
+have a special symbol for \([0-9]\), written as \(\setminus d\). A variable,
 starting with a lowercase letter and containing lowercase or uppercase letters after it,
-can be written as \\(\[a-z\]([a-zA-Z]+)?\\). Again, most regex implementations provide
-a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
+can be written as \([a-z]([a-zA-Z]+)?\). Again, most regex implementations provide
+a special operator for \((r_1+)?\), written as \(r_1*\).

 So how does one go about checking if a regular expression matches a string? An efficient way is to first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression