commit 31de8c902fc52a862b716aea258efda872fe2183
Author: Rob Hess
Date:   Wed Apr 17 16:46:07 2019 -0700

    Add complete assignment and starter code.

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..951656f
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+scan
+scanner.cpp
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..2bbf4a9
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,10 @@
+all: scan
+
+scan: main.cpp scanner.cpp
+	g++ main.cpp scanner.cpp -o scan
+
+scanner.cpp: scanner.l
+	flex -o scanner.cpp scanner.l
+
+clean:
+	rm -f scan scanner.cpp
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e55b4b0
--- /dev/null
+++ b/README.md
@@ -0,0 +1,132 @@
+# Assignment 2
+**Due by 11:59pm on Monday, 5/13/2019**
+
+**Demo due by 11:59pm on Monday, 5/27/2019**
+
+In this assignment we'll work on a parser for a small subset of the language Python. In particular, we'll use the widely-used parser generator Bison to implement a parser that, when combined with the scanner we wrote in assignment 1, will perform syntax-directed translation from Python to C/C++.
+
+There are a few major parts to this assignment, described below. To get you started, you are provided with a Flex scanner specification in `scanner.l` that solves the problem defined in assignment 1. There is also a makefile that specifies compilation for the scanner. A simple `main()` function for the scanner is written in `main.cpp`.
+
+## 1. Modify the scanner to work with Bison
+
+Flex and Bison are designed to easily integrate with each other, but you'll still need to make some modifications to the scanner specification to make it and the parser work together. These modifications will be easiest to do in stages:
+
+1. Set up a basic Bison parser definition (say in `parser.y`) with no nonterminals. The main thing you'll need to do is write `%token` directives to specify all of the terminals you'll use in your grammar. These terminals will correspond directly to the syntactic categories we recognized with the scanner in assignment 1, e.g. `IDENTIFIER`, `FLOAT`, `WHILE`, `PLUS`, etc. To write these `%token` directives, you'll need to figure out what data type(s) you'll use to represent the different program constructs in the representation output by the parser. Remember, our end goal with this project is to output C/C++ code corresponding to the Python code being parsed.
+
+2. Once you have your parser definition started, add a compilation step for it in the makefile. The goal at this point is to generate a header file containing integer values for all of the terminals/syntactic categories, so you can include that header file in the scanner and start returning these integer values instead of just printing out syntactic categories. To generate this header file, add the `-d` option to your `bison` command, e.g.:
+   ```
+   bison -d -o parser.cpp parser.y
+   ```
+   This will generate two files, `parser.cpp` and `parser.hpp`, the latter of which is the header file to include in the scanner definition.
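+   For illustration, a minimal `parser.y` at this stage might look something like the sketch below. It is only a starting point, and the choice of semantic value type (a `std::string*` carried in a `%union` here) is an assumption you are free to change:
+   ```
+   %{
+   #include <iostream>
+   #include <string>
+   extern int yylex();
+   void yyerror(const char* err);
+   %}
+
+   %union {
+     std::string* str;
+   }
+
+   %token <str> IDENTIFIER FLOAT INTEGER BOOLEAN
+   %token INDENT DEDENT NEWLINE
+   %token AND BREAK DEF ELIF ELSE FOR IF NOT OR RETURN WHILE
+   %token ASSIGN PLUS MINUS TIMES DIVIDEDBY
+   %token EQ NEQ GT GTE LT LTE
+   %token LPAREN RPAREN COMMA COLON
+
+   %start program
+
+   %%
+
+   /* Placeholder so the file compiles; real rules come in part 2 below. */
+   program : ;
+
+   %%
+
+   void yyerror(const char* err) {
+     std::cerr << "Error: " << err << std::endl;
+   }
+   ```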
+3. Now, make the scanner return syntactic categories instead of printing them. For Python, this sounds easier than it actually is. In particular, under the default setup, a Bison-generated parser exists as a function `yyparse()`. This function repeatedly calls the scanning function `yylex()` that's generated by our Flex specification, and, on each call, it expects `yylex()` to return the integer code for the syntactic category of the next word in the source program.
+
+   Thus, you might be tempted to simply replace all of the `cout` statements in the scanner that print syntactic categories with `return` statements that just return those syntactic categories instead. This would work in all but a few cases. In particular, there are a few situations where a single call to the scanner could generate *multiple* tokens. Specifically, when a program is dedented by multiple levels at once, we need our scanner to be able to return multiple `DEDENT` tokens from a single Flex rule. This cannot be done with a simple return statement. There are at least two ways to solve this problem:
+
+   1. **Use a queue to store tokens to return.** Every time a token is generated in the scanner, place it into a queue instead of returning it. Then, at the beginning of each call to `yylex()`, first check the queue to see if there are any tokens waiting to be returned. If there are, simply dequeue the first token and return it. Note that under this approach, you may need to do some extra work to be able to return the *lexeme* along with each syntactic category that's returned, since this is needed for some syntactic categories like `IDENTIFIER`. One possibility would be to store lexeme/syntactic category pairs in your queue (a sketch of this idea appears at the end of this section).
+
+   2. **Implement a push parser.** The default model implemented by a Bison-generated parser is to "pull" tokens from the scanner by calling `yylex()` each time a new token is needed. A push parser reverses these roles so that `yylex()` is called only once and now "pushes" a token to the parser each time a new token becomes available. It does this by calling the function `yypush_parse()` with the new token passed as an argument. Under the push-parsing paradigm, it doesn't matter if the scanner generates multiple tokens at a time, since each one can be pushed to the parser in turn. You can read more about how push parsers work in Bison here:
+
+      https://www.gnu.org/software/bison/manual/bison.html#Push-Decl
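+A minimal sketch of the token-queue idea is shown below. It is illustrative only: the queue, the pair type, and the helper name are not part of the provided starter code, and it presumes `parser.hpp` has already been included for the token codes.
+```c++
+/* In the definitions section of scanner.l, alongside _indent_stack. */
+#include <queue>
+#include <string>
+#include <utility>
+
+/*
+ * Each entry pairs a token's integer category (from parser.hpp) with its
+ * lexeme, so categories like IDENTIFIER can hand their text to the parser.
+ */
+std::queue<std::pair<int, std::string> > _token_queue;
+
+/*
+ * Scanner rules call this instead of returning directly; e.g. the
+ * dedent-handling rule can call it once per DEDENT it needs to emit.
+ */
+void enqueueToken(int category, const std::string& lexeme) {
+  _token_queue.push(std::make_pair(category, lexeme));
+}
+```
+Then, at the top of each call to `yylex()` (the code block at the beginning of the rules section is a natural place for this), check `_token_queue` first: if it's non-empty, pop the front entry, copy its lexeme into your parser's semantic value (e.g. `yylval`), and `return` its category before scanning any new input.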
+## 2. Implement grammar rules to recognize Python constructs
+
+Once your scanner is able to generate one token at a time, either via a token queue or via push parsing, you are ready to write some grammar rules to recognize constructs in the Python language. At this point, you don't need to worry about attaching actions to these rules. You can just get your grammar in place.
+
+The grammar you write will need to recognize a simplified subset of Python. In particular, your grammar should be able to recognize a program comprised of the following kinds of statements (a sketch of one possible high-level rule structure appears after this list):
+
+* **Assignment statements.** These are statements where the value of an expression is assigned to a specific variable, e.g.:
+  ```python
+  circumference = pi * 2 * r
+  ```
+  In the subset of Python we'll implement, no assignment statement will span more than a single line of code, and each statement will be terminated by a newline (i.e. lines won't be broken with a `\` character, as they can be in actual Python syntax). The expression on the right-hand side of the assignment can be any valid expression involving identifiers, floats, integers, or booleans and the following operators: `+`, `-`, `*`, `/`, `==`, `!=`, `>`, `>=`, `<`, `<=`, `not`. Expressions may also contain parentheses `()`.
+
+* **If-elif-else statements.** In Python these look like the following:
+  ```python
+  if a:
+      x = 2 * y
+  elif b <= 7:
+      x = 3 * y
+  else:
+      x = 4 * y
+  ```
+  Of course, the `elif` and `else` parts are both optional. The statement could also include more `elif` blocks. Importantly, all of the statements to be executed for each of the `if`, `elif`, and `else` conditions are indented to the same level. In other words, each block is contained within a matching `INDENT`/`DEDENT` pair. Also, for this assignment, every one of these blocks will be preceded by a newline. In other words, another statement cannot be included on the same line as the `if`, the `elif`, or the `else`. For this assignment, the conditions for `if` and `elif` statements can be any valid expression or any boolean combination of expressions using the `and` and `or` operators.
+
+* **While statements.** These are similar to `if` statements, e.g.:
+  ```python
+  while i < 10:
+      i = i + 1
+  ```
+  Again, the block of statements to be executed in each iteration of the while loop will be contained within a matching `INDENT`/`DEDENT` pair and will be separated from the `while` statement with a newline. Likewise, the termination conditions for `while` statements can be any valid expression or any boolean combination of expressions using the `and` and `or` operators.
+
+* **Break statements.** These simply consist of the keyword `break` followed by a newline, i.e.:
+  ```python
+  break
+  ```
+
+For this assignment, some things you specifically *do not* need to worry about are:
+ * For loops.
+ * Function definitions and function calls.
+ * Arrays and dictionaries.
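+One possible high-level shape for such a grammar is sketched below. This is illustrative only: the nonterminal names are made up, and `expression`, `condition`, `elif_blocks`, and `else_block` are deliberately left undefined (so this will not run through Bison as-is). You will still need to define those, handle operator precedence, and resolve any conflicts yourself.
+```
+program
+  : program statement
+  | statement
+  ;
+
+statement
+  : assign_statement
+  | if_statement
+  | while_statement
+  | break_statement
+  ;
+
+assign_statement
+  : IDENTIFIER ASSIGN expression NEWLINE
+  ;
+
+block
+  : INDENT program DEDENT
+  ;
+
+if_statement
+  : IF condition COLON NEWLINE block elif_blocks else_block
+  ;
+
+while_statement
+  : WHILE condition COLON NEWLINE block
+  ;
+
+break_statement
+  : BREAK NEWLINE
+  ;
+```
+Remember that `elif_blocks` and `else_block` would each need an empty alternative so that the `elif` and `else` parts remain optional.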
+## 3. Assign actions to your grammar rules to perform syntax-directed translation into C/C++
+
+Syntax-directed translation is essentially compilation by the parser. In other words, in a syntax-directed translation scheme, the parser directly outputs the target program. Our goal for this assignment is to perform syntax-directed translation from Python into C or C++. In other words, your parser must output a working C/C++ program that performs the same computation as the input Python program.
+
+Once you have your grammar defined, you can begin to attach actions to your rules to perform the syntax-directed translation. The easiest approach here will be to use the information you gain from the rules of your grammar about constructs recognized in the source program to generate corresponding C/C++ language strings for those constructs. In this way, at the end of the parse, your grammar's goal symbol will refer to a string containing the entire translated target program.
+
+A few things to consider while you're performing the syntax-directed translation (a sketch of one possible action is included after this list):
+
+* Your parser should generate a *working* C/C++ program, so it will need to contain boilerplate things like `#include` statements and a `main()` function. It will probably be easiest if you don't worry about adding things like `#include <iostream>` or wrapping your target program within a `main()` function until the parse is complete. If your parser simply translates a sequence of Python statements into a corresponding sequence of C/C++ statements, you can wrap this translated sequence in a `main()` function at the very end.
+
+* In order to generate a working C/C++ program, you'll also need a variable declaration for each variable used in the program. To do this, you can maintain a simple symbol table, where each variable identifier is stored when it's first encountered. When your parse is finished, you can simply iterate through the identifiers stored in the symbol table and, for each identifier, generate a variable declaration at the top of your `main()` function.
+
+  There are a couple of simplifying assumptions you can make for the purposes of this assignment to make this a little easier:
+
+  * Every variable will appear as the left-hand side of an assignment statement before it is used anywhere else.
+
+  * All variables can be scoped to the `main()` function. You don't need to worry about scoping variables within blocks (e.g. inside of an `if` block).
+
+  * All variables can have the same type, e.g. `double` or `float`.
+
+* So you can tell what's happening with your translated code, you should also generate one `printf()`/`cout` statement at the end of your `main()` function for each variable to print the value of that variable at the end of the execution of the translated program. For example, say you have the following simple Python program:
+  ```python
+  five = 2 + 2
+  ```
+  If you are translating to C++, your parser should output a program that looks like this (though you don't need to match the indentation of this program; it's included only for clarity):
+  ```c++
+  #include <iostream>
+  int main() {
+      double five;
+      five = 2 + 2;
+      std::cout << "five: " << five << std::endl;
+  }
+  ```
+
+* If the source program contains one or more syntax errors, you should not output a target program. Instead, you should report at least the first encountered syntax error.
+
+* Don't worry about indentation in your generated target program. Everything can be unindented.
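+As a concrete (but purely illustrative) example of the string-building approach, the action for an assignment statement might look something like the sketch below. It assumes the `%union` from part 1, `%type <str>` declarations for `expression` and `assign_statement`, and a global `std::set` acting as the symbol table; none of these names are required by the assignment.
+```
+/* In the declarations section of parser.y: */
+#include <set>
+#include <string>
+std::set<std::string> _symbols;
+
+/* The corresponding rule and action in the rules section: */
+assign_statement
+  : IDENTIFIER ASSIGN expression NEWLINE {
+      /* Remember the variable so a declaration can be emitted later. */
+      _symbols.insert(*$1);
+      /* Build the translated C/C++ statement from the pieces. */
+      $$ = new std::string(*$1 + " = " + *$3 + ";\n");
+      delete $1;
+      delete $3;
+    }
+  ;
+```
+When the goal symbol's action (or `main()`) runs at the end of the parse, it can iterate over `_symbols` to emit a `double` declaration and a closing `std::cout` statement for each variable, wrapping the accumulated statement string in the `main()` boilerplate described above.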
+Once you get your translation fully working, you should be able to use `gcc`/`g++` to compile and run the generated target program, provided the source program contains no syntax errors.
+
+## 4. Make sure your makefile fully generates your parser
+
+You should be able to type `make` to generate an executable parser from your scanner and parser specifications.
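+For reference, an extended makefile might end up looking something like the sketch below. The file and target names here are assumptions (adjust them to whatever you actually use), and recipe lines must be indented with tabs:
+```
+all: parse
+
+parse: main.cpp parser.cpp scanner.cpp
+	g++ main.cpp parser.cpp scanner.cpp -o parse
+
+parser.cpp: parser.y
+	bison -d -o parser.cpp parser.y
+
+scanner.cpp: scanner.l parser.cpp
+	flex -o scanner.cpp scanner.l
+
+clean:
+	rm -f parse parser.cpp parser.hpp scanner.cpp
+```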
+## Testing your parser
+
+There are some simple Python programs you may use for testing your parser included in the `testing_code/` directory. Some of these programs (i.e. `p*.py`) are syntactically valid, and your parser should be able to translate them successfully. There are example translations for these programs included in the `example_output/` directory. Some of the programs in `testing_code/` (i.e. `error*.py`) contain various syntax errors. Your parser should fail to translate these programs.
+
+## Submission
+
+We'll be using GitHub Classroom for this assignment, and you will submit your assignment via GitHub. Make sure your completed files are committed and pushed by the assignment's deadline to the master branch of the GitHub repo that was created for you by GitHub Classroom. A good way to check whether your files are safely submitted is to look at the master branch of your assignment repo on the github.com website (i.e. https://github.com/osu-cs480-sp19/assignment-2-YourGitHubUsername/). If your changes show up there, you can consider your files submitted.
+
+## Grading criteria
+
+The TAs will grade your assignment by compiling and running it on one of the ENGR servers, e.g. `flip.engr.oregonstate.edu`, so you should make sure your code works as expected there. `bison` and `flex` are installed on the ENGR servers. If your code does not compile and run on the ENGR servers, the TAs will deduct at least 25 points from your score.
+
+This assignment is worth 100 points total, broken down as follows:
+ * 30 points: scanner is modified to correctly return tokens to the parser
+ * 30 points: grammar rules are correctly set up for the subset of Python described above
+ * 35 points: parser successfully performs syntax-directed translation, as described above
+ * 5 points: makefile is specified to fully generate an executable parser
diff --git a/example_output/p1.cpp b/example_output/p1.cpp
new file mode 100644
index 0000000..642ad5f
--- /dev/null
+++ b/example_output/p1.cpp
@@ -0,0 +1,27 @@
+#include <iostream>
+int main() {
+double circle_area;
+double circle_circum;
+double pi;
+double r;
+double sphere_surf_area;
+double sphere_vol;
+
+/* Begin program */
+
+pi = 3.1415;
+r = 8.0;
+circle_area = pi * r * r;
+circle_circum = pi * 2 * r;
+sphere_vol = (4.0 / 3.0) * pi * r * r * r;
+sphere_surf_area = 4 * pi * r * r;
+
+/* End program */
+
+std::cout << "circle_area: " << circle_area << std::endl;
+std::cout << "circle_circum: " << circle_circum << std::endl;
+std::cout << "pi: " << pi << std::endl;
+std::cout << "r: " << r << std::endl;
+std::cout << "sphere_surf_area: " << sphere_surf_area << std::endl;
+std::cout << "sphere_vol: " << sphere_vol << std::endl;
+}
diff --git a/example_output/p2.cpp b/example_output/p2.cpp
new file mode 100644
index 0000000..b38bbca
--- /dev/null
+++ b/example_output/p2.cpp
@@ -0,0 +1,34 @@
+#include <iostream>
+int main() {
+double a;
+double b;
+double x;
+double y;
+double z;
+
+/* Begin program */
+
+a = true;
+b = false;
+x = 7;
+if (a) {
+x = 5;
+if (b) {
+y = 4;
+} else {
+y = 2;
+}
+}
+z = (x * 3 * 7) / y;
+if (z > 10) {
+y = 5;
+}
+
+/* End program */
+
+std::cout << "a: " << a << std::endl;
+std::cout << "b: " << b << std::endl;
+std::cout << "x: " << x << std::endl;
+std::cout << "y: " << y << std::endl;
+std::cout << "z: " << z << std::endl;
+}
diff --git a/example_output/p3.cpp b/example_output/p3.cpp
new file mode 100644
index 0000000..65efa2d
--- /dev/null
+++ b/example_output/p3.cpp
@@ -0,0 +1,35 @@
+#include <iostream>
+int main() {
+double f;
+double f0;
+double f1;
+double fi;
+double i;
+double n;
+
+/* Begin program */
+
+n = 6;
+f0 = 0;
+f1 = 1;
+i = 0;
+while (true) {
+fi = f0 + f1;
+f0 = f1;
+f1 = fi;
+i = i + 1;
+if (i >= n) {
+break;
+}
+}
+f = f0;
+
+/* End program */
+
+std::cout << "f: " << f << std::endl;
+std::cout << "f0: " << f0 << std::endl;
+std::cout << "f1: " << f1 << std::endl;
+std::cout << "fi: " << fi << std::endl;
+std::cout << "i: " << i << std::endl;
+std::cout << "n: " << n << std::endl;
+}
diff --git a/main.cpp b/main.cpp
new file mode 100644
index 0000000..09c092a
--- /dev/null
+++ b/main.cpp
@@ -0,0 +1,5 @@
+extern int yylex();
+
+int main() {
+  return yylex();
+}
diff --git a/scanner.l b/scanner.l
new file mode 100644
index 0000000..3a98a1d
--- /dev/null
+++ b/scanner.l
@@ -0,0 +1,173 @@
+/*
+ * Lexer definition for simplified Python syntax.
+ */
+
+/*
+ * Since we're only parsing 1 file, we don't need to have yywrap() (plus,
+ * having it included messes up compilation).
+ */
+%option noyywrap
+
+%option yylineno
+
+%{
+#include <iostream>
+#include <stack>
+#include <cstdlib>
+
+/*
+ * We'll use this stack to keep track of indentation level, as described in
+ * the Python docs:
+ *
+ * https://docs.python.org/3/reference/lexical_analysis.html#indentation
+ */
+std::stack<int> _indent_stack;
+%}
+
+%%
+
+%{
+  /*
+   * These lines go at the top of the lexing function. We only want to
+   * initialize the indentation level stack once by pushing a 0 onto it (the
+   * indentation stack should never be empty, except immediately after it is
+   * created).
+   */
+  if (_indent_stack.empty()) {
+    _indent_stack.push(0);
+  }
+%}
+
+^[ \t]*\r?\n   { /* Skip blank lines */ }
+
+^[ \t]*#.*\r?\n   { /* Skip whole-line comments. */ }
+
+#.*$   { /* Skip comments on the same line as a statement. */ }
+
+^[ \t]+   {
+    /*
+     * Handle indentation as described in Python docs linked above.
+     * Note that this pattern treats leading spaces and leading tabs
+     * equivalently, which could cause some unexpected behavior if
+     * they're combined in a single line. For the purposes of this
+     * project, that's OK.
+     */
+    if (_indent_stack.top() < yyleng) {
+      /*
+       * If the current indentation level is greater than the
+       * previous indentation level (stored at the top of the stack),
+       * then emit an INDENT and push the new indentation level onto
+       * the stack.
+       */
+      std::cout << "INDENT" << std::endl;
+      _indent_stack.push(yyleng);
+    } else {
+      /*
+       * If the current indentation level is less than or equal to
+       * the previous indentation level, pop indentation levels off
+       * the stack until the top is equal to the current indentation
+       * level. Emit a DEDENT for each element popped from the stack.
+       */
+      while (!_indent_stack.empty() && _indent_stack.top() != yyleng) {
+        _indent_stack.pop();
+        std::cout << "DEDENT" << std::endl;
+      }
+
+      /*
+       * If we popped everything off the stack, that means the
+       * current indentation level didn't match any on the stack,
+       * which is an indentation error.
+       */
+      if (_indent_stack.empty()) {
+        std::cerr << "Error: Incorrect indentation on line "
+          << yylineno << std::endl;
+        return 1;
+      }
+    }
+  }
+
+^[^ \t\r\n]+   {
+    /*
+     * If we find a line that's not indented, pop all indentation
+     * levels off the stack, and emit a DEDENT for each one. Then,
+     * call REJECT, so the next rule matching this token is also
+     * applied.
+     */
+    while (_indent_stack.top() != 0) {
+      _indent_stack.pop();
+      std::cout << "DEDENT" << std::endl;
+    }
+    REJECT;
+  }
+
+\r?\n   {
+    std::cout << "NEWLINE" << std::endl;
+  }
+
+<<EOF>>   {
+    /*
+     * If we reach the end of the file, pop all indentation levels
+     * off the stack, and emit a DEDENT for each one.
+     */
+    while (_indent_stack.top() != 0) {
+      _indent_stack.pop();
+      std::cout << "DEDENT" << std::endl;
+    }
+    yyterminate();
+  }
+
+[ \t]   { /* Ignore spaces that haven't been handled above. */ }
+"and"      { std::cout << "AND\t\t" << yytext << std::endl; }
+"break"    { std::cout << "BREAK\t\t" << yytext << std::endl; }
+"def"      { std::cout << "DEF\t\t" << yytext << std::endl; }
+"elif"     { std::cout << "ELIF\t\t" << yytext << std::endl; }
+"else"     { std::cout << "ELSE\t\t" << yytext << std::endl; }
+"for"      { std::cout << "FOR\t\t" << yytext << std::endl; }
+"if"       { std::cout << "IF\t\t" << yytext << std::endl; }
+"not"      { std::cout << "NOT\t\t" << yytext << std::endl; }
+"or"       { std::cout << "OR\t\t" << yytext << std::endl; }
+"return"   { std::cout << "RETURN\t\t" << yytext << std::endl; }
+"while"    { std::cout << "WHILE\t\t" << yytext << std::endl; }
+
+"True"     { std::cout << "BOOLEAN\t\t" << true << std::endl; }
+"False"    { std::cout << "BOOLEAN\t\t" << false << std::endl; }
+
+[a-zA-Z_][a-zA-Z0-9_]*   {
+    std::cout << "IDENTIFIER\t" << yytext << std::endl;
+  }
+
+-?[0-9]*"."[0-9]+   {
+    std::cout << "FLOAT\t\t" << atof(yytext) << std::endl;
+  }
+
+-?[0-9]+   {
+    std::cout << "INTEGER\t\t" << atoi(yytext) << std::endl;
+  }
+
+"="   { std::cout << "ASSIGN\t\t" << yytext << std::endl; }
+"+"   { std::cout << "PLUS\t\t" << yytext << std::endl; }
+"-"   { std::cout << "MINUS\t\t" << yytext << std::endl; }
+"*"   { std::cout << "TIMES\t\t" << yytext << std::endl; }
+"/"   { std::cout << "DIVIDEDBY\t" << yytext << std::endl; }
+
+"=="  { std::cout << "EQ\t\t" << yytext << std::endl; }
+"!="  { std::cout << "NEQ\t\t" << yytext << std::endl; }
+">"   { std::cout << "GT\t\t" << yytext << std::endl; }
+">="  { std::cout << "GTE\t\t" << yytext << std::endl; }
+"<"   { std::cout << "LT\t\t" << yytext << std::endl; }
+"<="  { std::cout << "LTE\t\t" << yytext << std::endl; }
+
+"("   { std::cout << "LPAREN\t\t" << yytext << std::endl; }
+")"   { std::cout << "RPAREN\t\t" << yytext << std::endl; }
+
+","   { std::cout << "COMMA\t\t" << yytext << std::endl; }
+":"   { std::cout << "COLON\t\t" << yytext << std::endl; }
+
+.   {
+    std::cerr << "Unrecognized token on line " << yylineno << ": "
+      << yytext << std::endl;
+    return 1;
+  }
+
+%%
diff --git a/testing_code/error1.py b/testing_code/error1.py
new file mode 100644
index 0000000..ed13ad9
--- /dev/null
+++ b/testing_code/error1.py
@@ -0,0 +1,2 @@
+# This file contains an invalid character.
+a = 2 $ 8
diff --git a/testing_code/error2.py b/testing_code/error2.py
new file mode 100644
index 0000000..52daa15
--- /dev/null
+++ b/testing_code/error2.py
@@ -0,0 +1,3 @@
+# This file contains an invalid assignment statement.
+a = 2
+a b = a * 5
diff --git a/testing_code/error3.py b/testing_code/error3.py
new file mode 100644
index 0000000..18c5610
--- /dev/null
+++ b/testing_code/error3.py
@@ -0,0 +1,4 @@
+# This file contains invalid indentation.
+if True:
+    a = 3
+      b = a * 4
diff --git a/testing_code/p1.py b/testing_code/p1.py
new file mode 100644
index 0000000..259c4af
--- /dev/null
+++ b/testing_code/p1.py
@@ -0,0 +1,6 @@
+pi = 3.1415
+r = 8.0
+circle_area = pi * r * r
+circle_circum = pi * 2 * r
+sphere_vol = (4.0 / 3.0) * pi * r * r * r
+sphere_surf_area = 4 * pi * r * r
diff --git a/testing_code/p2.py b/testing_code/p2.py
new file mode 100644
index 0000000..4e4c538
--- /dev/null
+++ b/testing_code/p2.py
@@ -0,0 +1,14 @@
+a = True
+b = False
+x = 7
+if a:
+    x = 5
+    if b:
+        y = 4
+    else:
+        y = 2
+
+z = (x * 3 * 7) / y
+
+if z > 10:
+    y = 5
diff --git a/testing_code/p3.py b/testing_code/p3.py
new file mode 100644
index 0000000..4c64468
--- /dev/null
+++ b/testing_code/p3.py
@@ -0,0 +1,14 @@
+# This program computes and returns the n'th Fibonacci number.
+n = 6
+f0 = 0
+f1 = 1
+i = 0
+while True:
+    fi = f0 + f1
+    f0 = f1
+    f1 = fi
+    i = i + 1
+    if i >= n:
+        break
+
+f = f0