From 43a72533f5f0f51e4d802f5b375ce0b5b0950e03 Mon Sep 17 00:00:00 2001
From: Danila Fedorin
Date: Mon, 5 Aug 2019 00:39:54 -0700
Subject: [PATCH] Finish the draft of the parsing post

---
 content/blog/02_compiler_parsing.md | 70 +++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/content/blog/02_compiler_parsing.md b/content/blog/02_compiler_parsing.md
index 2bdeb3d..faba8b7 100644
--- a/content/blog/02_compiler_parsing.md
+++ b/content/blog/02_compiler_parsing.md
@@ -233,3 +233,73 @@ another expression.
 
 Finally, we get to writing our Bison file, `parser.y`. Here's what I come up with:
 {{< rawblock "compiler_parser.y" >}}
+
+There are a few things to note here. First of all, the __parser__ is the "source of truth" regarding what tokens exist in our language.
+We have a list of `%token` declarations, each of which corresponds to a regular expression in our scanner.
+
+Next, observe that there's
+a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression
+with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present
+in the parser, too. Each rule can produce a value, which we call a __semantic value__. For type safety, we allow
+each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \\(A_{add}\\) must
+produce an expression. We specify the type of each nonterminal using `%type` directives. The types of terminals
+are specified when they're declared.
+
+Next, we must recognize that Bison was originally made for C, rather than C++. In order to allow the parser
+to store and operate on semantic values of various types, the canonical solution back then was to
+use a C `union`. Unions are great, but for C++, they're more trouble than they're worth: a union won't construct
+or destroy members with non-trivial constructors and destructors for us! This means that stuff like `std::unique_ptr` and `std::string` is effectively off limits as
+a semantic value. But we'd much rather use them! The solution is to:
+
+1. Specify the language to be C++, rather than C.
+2. Enable the `variant` API feature, which uses a lightweight `std::variant` alternative in place of a union.
+3. Enable the creation of token constructors, which we will use in Flex.
+
+In order to be able to use the variant-based API, we also need to change the Flex `yylex` function
+to return `yy::parser::symbol_type`. You can see this in our forward declaration of `yylex`.
+
+Now that we've made these changes, it's time to hook up Flex to all this. Here's a new version
+of the Flex scanner, with all necessary modifications:
+{{< rawblock "compiler_scanner_bison.l" >}}
+
+The two key ideas are that we overrode the default signature of `yylex` by changing the
+`YY_DECL` preprocessor macro, and used the `yy::parser::make_*` token constructors
+to return a `symbol_type` rather than an `int`.
+
+Finally, let's write a `main` function so that we can at least check for segmentation faults
+and other obvious mistakes:
+{{< codeblock "C++" "compiler_main.cpp" >}}
+
+Now, we can compile and run the code:
+```
+flex -o compiler_scanner.cpp compiler_scanner_bison.l
+bison -o compiler_parser.cpp -d compiler_parser.y
+g++ -c -o scanner.o compiler_scanner.cpp
+g++ -c -o parser.o compiler_parser.cpp
+g++ compiler_main.cpp parser.o scanner.o
+```
+We used the `-d` option for Bison to generate the `compiler_parser.hpp` header file,
+which exports our token declarations and token creation functions, allowing
+us to use them in Flex.
+
+At last, we can feed some code to the parser from `stdin`. Let's try it:
+```
+./a.out
+defn main = { add 320 6 }
+defn add x y = { x + y }
+```
+The program prints `2`, indicating two declarations were made. Let's try something obviously
+wrong:
+```
+./a.out
+}{
+```
+We are told an error occurred. Excellent!
+
+There are still a number of flaws with our parser:
+
+1. We're missing the data declaration, from both our C++ source and from the Bison grammar.
+2. We don't print errors properly.
+3. We also have no way of verifying our tree was built correctly.
+
+This post is getting a little long, so we will revisit the parser in the next one. See you then!
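
The union-versus-variant point made in the post can be sketched in plain C++17. This is not the code Bison generates: it uses the standard library's `std::variant` as a stand-in for Bison's own variant type, and `SemanticValue`, `make_name`, `make_number`, and `describe` are hypothetical names chosen only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

// Hypothetical semantic value: either an identifier name or an owned
// integer literal. Both alternatives have non-trivial constructors and
// destructors, so a plain union would leave their lifetimes to us.
using SemanticValue = std::variant<std::string, std::unique_ptr<int>>;

// Build a value holding an identifier (illustrative helper).
SemanticValue make_name(std::string name) {
    return SemanticValue{std::move(name)};
}

// Build a value holding a heap-allocated number (illustrative helper).
SemanticValue make_number(int n) {
    return SemanticValue{std::make_unique<int>(n)};
}

// Describe whichever alternative is currently stored.
std::string describe(const SemanticValue& value) {
    if (const auto* name = std::get_if<std::string>(&value)) {
        return "name:" + *name;
    }
    const auto& number = std::get<std::unique_ptr<int>>(value);
    assert(number != nullptr);  // a moved-from pointer would be null
    return "number:" + std::to_string(*number);
}
```

Calling `describe(make_number(320))` yields `"number:320"`. The variant tracks which alternative is active and runs the matching destructor when it goes out of scope, which is the bookkeeping a union would leave to us.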