Project 5: Code Generation

Optional checkpoint due Monday, Apr 15, 2024 at 8pm ET

Final deadline Monday, Apr 22, 2024 at 8pm ET

A PDF version of this document is located here.

In this project, you will implement a code generator for uC, a small language in the C family. The main purpose of this project is to gain an understanding of metaprogramming, how one language can be translated into another, and to get practice with using macros and templates. A secondary goal is to gain experience working with an existing code base, with complex code written by someone else.

This project does not rely on Project 4 in any way. We will only be testing this project with uC code that is semantically correct, so semantic analysis will be disabled. You may start with either your Project 4 implementation or a fresh copy of the Project 4 starter files as a basis, and we will only be providing distribution code for files that are different from or in addition to Project 4.

Since this project allows reuse of code from Project 4, if you choose to work in a different partnership than for Project 4, you must start from a fresh copy of the Project 4 starter files. You will not need to reimplement any part of Project 4 before proceeding with this project.

Please see the syllabus for partnership rules. As a reminder, you may not share any part of your solution outside of your partnership. This includes both code and test cases.

The project is divided into multiple phases that are described below.

Reported Time to Complete the Project

The following is the time students reported they spent on the project in Winter 2023.

Time to complete project 1. 1% less than 5 hours, 11% 5-9 hours, 38% 10-14 hours, 34% 15-19 hours, 13% 20-24 hours, 1% 25-30 hours, 3% greater than 30 hours

These data are for planning purposes only. We do not consider the exact time spent on the project to be reflective of anyone’s learning or ability. Rather, completing the project regardless of how much time it takes is what is important to achieve the learning goals of the project.

Optional Checkpoint

The checkpoint consists of achieving at least 30% of the points on the public and private test cases before the checkpoint deadline. Your grade for the checkpoint will be computed as the better of:

  • \(min(0.3, score) / 0.3\), where \(score\) is the fraction of autograded points earned by your best submission before the checkpoint deadline.

  • \(finalScore\), where \(finalScore\) is the fraction of autograded points earned by your best submission before the final deadline.

Thus, completing the checkpoint is optional. However, doing it will work to your benefit, since you can guarantee full credit on the 20% of the project points dedicated to the checkpoint.

For this project, passing the public test cases for Phases 1-3 plus the style checks is enough to guarantee full credit on the checkpoint.

Overview

In Project 4, you implemented a semantic analyzer for uC, which performed multiple analysis phases over a source program. Your main task in Project 5 is to implement the final code-generation phase, which will generate C++ code from an abstract syntax tree (AST). Since the compiler is generating C++ code rather than machine code, it is a source-to-source compiler.

As an example, consider the following uC program:

void main(string[] args) {
  println("Hello world!");
}

This can be translated into the following C++ code (which, modulo some minor spacing, is the actual code produced by the staff solution to this project):

#include "defs.hpp"
#include "ref.hpp"
#include "array.hpp"
#include "library.hpp"
#include "expr.hpp"

namespace uc {

  // Forward type declarations

  // Forward function declarations

  UC_PRIMITIVE(void)
    UC_FUNCTION(main)(UC_ARRAY(UC_PRIMITIVE(string)) UC_VAR(args));

  // Full type definitions

  // Full function definitions

  UC_PRIMITIVE(void)
    UC_FUNCTION(main)(UC_ARRAY(UC_PRIMITIVE(string)) UC_VAR(args)) {
    UC_FUNCTION(println)("Hello world!"s);
  }

} // namespace uc

int main(int argc, char **argv) {
  uc::UC_ARRAY(uc::UC_PRIMITIVE(string)) args =
    uc::uc_make_array_of<uc::UC_PRIMITIVE(string)>();
  for (int i = 1; i < argc; i++) {
    uc::uc_array_push(args, uc::UC_PRIMITIVE(string)(argv[i]), -1);
  }
  uc::UC_FUNCTION(main)(args);
  return 0;
}

First, there are some includes of library files written in C++, which define built-in uC functions and types as well as macros and templates to be used by the generated code. The generated code itself is placed in the uc namespace to avoid clashing with other code. Forward declarations of types and functions come first, allowing them to be used before their definition as required by uC. Then there are full type and function definitions. The generated code uses macros to mangle names into identifiers that will not clash with C++ names or with each other, in the case of names belonging to different categories. Finally, a C++ main() function converts command-line arguments into the format expected by uC and calls the uC main() function.

The compiler is run as follows to generate C++ code:

$ python3 ucc.py -C <source file>

This parses the source file and generates C++ code, saving it to a file. You can then compile the resulting C++ code as follows:

$ g++ --std=c++20 -I. -o <executable> <generated file>

The -I. argument tells g++ to search the current directory for header files and is necessary if the generated file is not in the current directory.

For the example above, we can run:

$ python3 ucc.py -C tests/hello.uc
$ g++ --std=c++20 -I. -o hello.exe tests/hello.cpp
$ ./hello.exe
Hello world!

The compiler implementation is divided into several Python and C++ files, described below.

Distribution Code

Use the following commands to download and unpack the distribution code:

$ wget https://eecs390.github.io/project-uc/backend/starter-files.tar.gz
$ tar xzf starter-files.tar.gz

Start by looking over the distribution code, which consists of the following files:

File or directory

Purpose

What you need to do with it

defs.hpp

Library

Read and use it

ref.hpp

Library

Read and use it

array.hpp

Library

Read and use it

library.hpp

Library

Read it

expr.hpp

Library

Read, use, and modify it

Makefile

Testing

Run it with make

tests/

Testing

Read, use, and modify it

life.uc

Testing

Read, use, and modify it

You will also need to use and modify code from Project 4. You will specifically need to modify ucbase.py, ucstmt.py, and ucexpr.py. You will also need to modify the ucbackend.py file included in the Project 4 distribution.

Driver Files

The top-level entry point of the compiler is ucc.py. It opens the given uC source file, invokes the parser to produce an AST, and then invokes each of the compiler phases on the AST. If the -S command-line argument is present, it only performs semantic analysis on the AST without any code generation. Otherwise, if the -C argument is present, it disables semantic analysis and invokes the code-generation phases on the AST to generate C++ code to a file. If the -A command-line argument is present, it performs both semantic analysis and code generation.

The interface for the backend compiler phases is defined in ucbackend.py. If all backend phases are enabled, the distribution code calls gen_header() to generate the header for a generated program, which includes library files and opens the uc namespace. Then each backend phase is invoked in turn:

  1. gen_type_decls(), which is responsible for generating forward declarations for each user-defined type

  2. gen_function_decls(), which generates forward declarations for each user-defined function

  3. gen_type_defs(), which produces definitions for each user-defined type

  4. gen_function_defs(), which is responsible for producing definitions for each user-defined function

Then the distribution code calls gen_footer(), which closes the uc namespace and generates a C++ main() function that calls the uC main(). You will need to place your code for invoking code generation on the AST within the top-level functions for each phase.

The ucc.py driver also provides a --backend-phase command-line argument to limit which backend phases run. For example, invoking ucc.py with --backend-phase=3 runs only Phases 1 through 3 of the backend, without calling gen_header() or gen_footer(). Our testing framework makes use of this to test the early phases of code generation.

Unlike in Project 4, it is up to you how to structure your code for code generation. You may add whatever member functions you need to ASTNode, and then call them from the top-level functions for each phase.

A PhaseContext object, defined in uccontext.py, provides a print() method to print to an output file. The distribution code sets the output file appropriately for code generation, and you should use a context’s print method when generating code. If the optional argument indent=True is given, it prepends the output with the context’s current indentation string. At the beginning of code generation, indentation is set to two spaces. You may wish to change the indentation level at various points to produce more readable output code.

All arguments to a context’s print() method other than indent are passed on to the standard, top-level print() function. The following are some combinations of indentation and line endings that can be accomplished with a call to print() on a context ctx:

  • ctx.print(args): print arguments to the output file, with a trailing newline

  • ctx.print(args, end=''): print arguments to the output file, but without a trailing newline

  • ctx.print(args, indent=True): print the current indentation followed by the arguments to the output file, with a trailing newline

  • ctx.print(args, indent=True, end=''): print the current indentation followed by the arguments to the output file, but without a trailing newline

We recommend cloning a context when increasing the indentation level. The following creates a new context with two additional spaces of indentation:

new_ctx = ctx.clone()
new_ctx.indent += '  '
# alternatively, do the above in one step
new_ctx = ctx.clone(indent='  ')

We recommend using Python 3 string formatting or Python 3.6 f-strings to construct output text.

Library Files

The header files provided in the distribution are C++ library files that are included in a generated uC program. They implement useful macros and templates that can be used by the code generator, as well as definitions for built-in uC types and functions. The preamble code generated by ucbackend.code_gen() includes the library headers from the output C++ file.

The most basic macro definitions are in defs.hpp. It defines name-mangling macros that convert uC names to distinct C++ names. These are:

  • UC_PRIMITIVE: mangles the name of a primitive type

  • UC_TYPEDEF: mangles the name of a user-defined type, as used when generating the struct definition of the type

  • UC_REFERENCE: mangles the name of a user-defined type and wraps it in a uc_reference, providing the reference semantics required for user-defined types

  • UC_ARRAY: denotes an array of the given element type, which must itself already be mangled

  • UC_FUNCTION: mangles the name of a function

  • UC_VAR: mangles the name of a variable

Make sure to apply the appropriate macro when generating code for one of the entities above. Within the uC compiler, BaseTypeNameNode objects define a mangle() method that you can use to mangle type names.

The library file ref.hpp defines a uc_reference template that implements reference semantics, as well as memory management using reference counting. It uses std::shared_ptr in order to perform the latter. Except for constructing a null reference, you will not have to create a uc_reference object directly in the compiler. Instead, use the uc_construct() function template to allocate an object and wrap it in a reference. You will need to explicitly instantiate the template, such as uc_construct<UC_REFERENCE(foo)>(...). To construct a null reference of a specific type, use the default constructor for a uc_reference of the appropriate type.

The file array.hpp implements operations on arrays. An array can be constructed with the uc_construct() function template defined in ref.hpp, explicitly instantiated with the array’s type. The templates uc_array_length(), uc_array_push(), uc_array_pop(), and uc_array_index() in array.hpp implement the corresponding array operations.

The file library.hpp defines the appropriate aliases for primitive types as well as the built-in library functions. You will not have to use any of these directly.

The file expr.hpp provides a definition for the uc_id() function template, and it is also intended to implement function overloads for the polymorphic operations in Phase 5. You will need to fill in the code for expr.hpp and use it when generating code for the polymorphic operations. The uc_add() overloads should add two items together. You will need to provide overloads for the combinations of types that may be added in uC. Do not repeat code; use function templates where possible.

Testing Framework

A basic testing framework is implemented in the Makefile. There are separate targets for each of the phases:

  1. make phase1: Run only Phase 1 of code generation. Compile the output with a provided *_phase1.cpp test file using g++ -c, so that g++ does not attempt to link the resulting object file.

  2. make phase2: Run Phases 1 and 2 of code generation. Compile the output with a provided *_phase2.cpp test file using g++ -c.

  3. make phase3: Run up to Phase 3 of code generation. Compile the output with a provided *_phase3.cpp test file using g++, producing an executable. Run the resulting executable.

  4. make phase4: Run all phases of code generation on tests that only require up to Phase 4. Compile the output using g++, producing an executable. Run the resulting executable with command-line arguments 20 10 5 2, save the output to a file with extension .run, and compare against the expected output in the .run.correct file.

  5. make phase5: Run all phases of code generation on tests that require up to Phase 5. Compile the output using g++, producing an executable. Run the resulting executable with command-line arguments 20 10 5 2 and compare the results to the expected output.

You may need to change the value of the PYTHON and CXX variables if the Python executable is not in the path or not called python3, or if you want to use a different C++ compiler. You can do so from the command line:

$ make PYTHON=<your python here> CXX=<your C++ compiler here> ...

Test Cases

The following basic test cases are provided in the tests directory, with output from running the resulting executable.

Test

Description

default.uc

Test default constructors for user-defined types.

equality.uc

Test equality and inequality comparisons.

hello.uc

Simple “hello world” example.

length_field.uc

Test accessing the length field of an array or an object.

literals.uc

Test literals and addition between strings and other types.

null_access.uc

Test accessing a field through a null reference.

particle.uc

Complex example of a uC program.

use_before_decl.uc

Test using types or functions before their declaration, which should be valid.

These are only a limited set of test cases. You will need to write extensive test cases of your own to ensure your compiler is correct.

You will not be able to directly compile the generated C++ code from a uC program until you complete Phase 4, since you will need to be able to generate the uC main() function. However, for each test case, we provide phase-specific test files in the form of <test>_phase{1,2,3}.cpp. These files directly test the functionality required up to the associated phase. For Phases 1 and 2, they ensure that the generated declarations are correct. For Phase 3, the test files determine whether or not type definitions are correct. Use the provided Makefile to run these tests, as described in Testing Framework.

You may find it useful to examine the generated C++ code with macros expanded, so that you can reason about its correctness with respect to the C++ language. We recommend using a command such as the following to run the code through the C preprocessor:

$ cat defs.hpp tests/default.cpp | grep -v '#include' | g++ -E - > tests/default.E.cpp

This command works as follows:

  • It invokes the cat utility to print the contents of defs.hpp (which contains the relevant macro definitions) and tests/default.cpp (the generated code for tests/default.uc) to standard output.

  • It pipes the output through grep -v, which excludes lines that contain the given sequence. We strip out #include lines so that only the code from tests/default.cpp gets preprocessed.

  • It then invokes g++ -E to run the preprocessor. The - command-line argument tells g++ to run on the contents of standard input.

  • Finally, the results are stored in tests/default.E.cpp, which you can then examine by hand.

We have included a Makefile rule to run this command:

$ make tests/default.preprocess
cat defs.hpp tests/default.cpp | grep -v '#include' | g++ -E - > tests/default.E.cpp

You can replace default with another test name to run the corresponding generated output through the processor. Make sure that you run the uC compiler first (either directly using python3 ucc.py -C <test file> or through one of the Makefile targets) to generate the output first.

You may find it helpful to run the preprocessed output through an automatic formatter before examining it. Either use your IDE for this, or an external tool such as clang-format. (You can install it via brew install clang-format on MacOS if you have Homebrew, apt install clang-format on Ubuntu [you may need to use sudo], npm install -g clang-format if you have npm, or even through pip via pip3 install clang-format.)

Phase 1: Generating Type Declarations

The semantics of uC allow types and functions to be used before their definition. In order to support this in the generated C++ code, forward declarations must be made for each user-defined type and function. In addition, forward declarations of types should be made before those of functions, since a function may name a user-defined type in its return or parameter types.

Thus, the first step is to generate forward declarations for user-defined types. As an example, consider the following uC type definition:

struct foo {
  int x;
  double y
};

This should result in a C++ forward declaration as follows:

struct UC_TYPEDEF(foo);

Since this is a forward declaration, this is not a definition for the resulting struct.

Modify gen_type_decls() in ucbackend.py so that it runs your code for this phase.

Phase 2: Generating Function Declarations

The second step is to generate forward declarations for user-defined functions. As an example, consider the following uC function:

void main(string[] args) {
  println("Hello world!");
}

This should result in a C++ forward declaration as follows:

UC_PRIMITIVE(void)
  UC_FUNCTION(main)(UC_ARRAY(UC_PRIMITIVE(string)) UC_VAR(args));

The return and parameter types should be mangled appropriately. You will find the mangle() method of BaseTypeNameNode objects useful for this purpose. The parameter names are optional in a forward declaration, but if you choose to generate them, make sure to mangle them with the UC_VAR macro.

Modify gen_function_decls() in ucbackend.py to run your code for this phase.

Phase 3: Generating Type Definitions

Full type definitions must appear before function definitions, since in C++, the contents of a class or struct are not accessible until after the definition. Thus, the next step is to generate definitions for each user-defined type.

Unlike in the previous phases, we are not providing full examples of generated code for this or future phases. The goal is to reason about what code is necessary to translate between the languages, using your knowledge of both uC and C++, rather than just pattern matching.

In creating an instance of a user-defined uC type, it is legal to provide a single argument for each field, or to provide no arguments at all. The latter results in default initialization, which initializes primitive types to 0, false, or an empty string, and references to user-defined types to null. Consider the following type:

struct bar {
  int x;
  foo f;
};

Both of the following initializations are valid:

bar b1 = new bar(3, new foo(4, 5));
bar b2 = new bar();

The former explicitly initializes each field to the corresponding argument, so that b1.x is 3 and b1.f is a reference to the newly created foo object. The latter performs default initialization, resulting in b2.x being 0 and b2.f null.

You will need to support both possibilities for initialization. Default initialization in uC is equivalent to value initialization in C++. You will need to generate code to value initialize fields in cases that require default initialization from uC.

User-defined types also must support equality comparisons. In C++20, you only have to overload operator==(), and you can default it – the defaulted operator does member-by-member comparisons. Use a signature like the following, so that comparisons can be done on r-values, which bind to const l-value references in C++:

struct UC_TYPEDEF(bar) {
  UC_PRIMITIVE(boolean) operator==(const UC_TYPEDEF(bar) &rhs) const ...
};

Modify gen_type_defs() in ucbackend.py to invoke the code for this phase.

Phase 4: Generating Function Definitions

The next step is to generate full definitions for each user-defined function.

Make sure to mangle the names of uC variables with UC_VAR, both in variable definitions as well as in variable and field accesses.

Most uC statements and expressions have a one-to-one correspondence with C++ statements and expressions, and in those cases, you’ll find that you’ll be able to generate C++ code that is nearly identical to uC code. The semantics of most uC expressions, in terms of their order of operations and their types, are designed to be identical to their C++ counterparts.

For a string literal, use the s suffix so that it is a std::string literal rather than a const char *:

"Hello world!"s

For call nodes, mangle the function name using the UC_FUNCTION macro.

For allocation nodes, you should make use of the library template uc_construct<T>(). You will need to explicitly provide the first template argument, which should be the mangled name of the underlying uC type, when calling it.

For field accesses, the receiver is always of uc_reference type. Since uc_reference derives from shared_ptr, it supports the indirect member-access operator ->, which you can use to access a field. However, the receiver needs to be checked for null before it can be dereferenced via ->. Use uc_null_check() (defined in ref.hpp) to do so:

uc_null_check(<receiver>, <position of AST node>)-><mangled field>

For array indexing, you will find the library template uc_array_index() useful. The position of the corresponding AST node should be passed in as the last argument – it is used for reporting runtime errors.

For unary and binary operations, you should place parentheses around the expression in order to preserve the precedence and associativity encoded in the structure of the AST. For example, consider the following uC code:

(3 - 4) * 5

This results in the AST structure in Figure 1.

_images/prec.svg

Figure 1 AST structure that encodes associativity and precedence.

The parenthesization is encoded in the structure of the AST itself, so that 3 and 4 are grouped together under the MinusNode. In order to preserve this in the generated code, emit parentheses around every binary and unary operation (that has a C++ counterpart):

((3 - 4) * 5)

In uC, addition is a polymorphic operation, as it can be applied to numerical types as well as strings. Thus, its implementation is deferred to Phase 5.

Use the uc_id() function template in expr.hpp to implement the ID operator (#).

For array push and pop operations, you will find the uc_array_push() and uc_array_pop() library templates useful.

Modify gen_function_defs() in ucbackend.py to execute the code for this phase.

Phase 5: Polymorphic Operations

The final compiler phase is to write implementations for the polymorphic operation of adding two items. Since we are not running the type computation from Project 4, we cannot resolve the operation within the compiler itself, as we do not know the types of the operands. Instead, we will rely on the target C++ compiler to resolve the operation for us.

The operands of an addition may be both of numerical type, both of string type, or one of string type and the other of numerical or boolean type. To support this, define overloads for uc_add() in expr.hpp. Then generate code to call uc_add() for an addition. You may find std::to_string() useful for converting numerical types to strings. However, you cannot use it for booleans, which should be converted to the strings true or false, not 1 and 0. Instead, you will need to define specific uc_add() overloads for booleans that perform the correct string conversion.

Do not repeat code if not necessary for the overloads in this phase. Instead, use templates where possible. You should have no more than six overloads for uc_add().

You may run into a case where the C++ compiler considers a call to be ambiguous because two or more overloads are equally applicable. In such a case, you can resolve the ambiguity by providing a more specialized overload that will be preferred over the ambiguous ones.

Phase 6 (Optional): Writing a uC Program

Implement a simulation of Conway’s Game of Life in uC. Your implementation should be on a finite grid. Edge cells have fewer neighbors but should otherwise follow the same rules as any other cell.

Complete the implementation of the simulate() function in life.uc. You may define any helper functions and structs you need.

When printing the grid, you should first print a line consisting of \(cols + 2\) dashes (-). Each row should then start and end with a pipe (|), each live cell should be printed as an asterisk (*), and each dead cell should be printed as a space. Finally, print another line consisting of \(cols + 2\) dashes followed by another empty line.

We have provided a test case in the main() function and the expected output in life_test.out. Compile and run your code with:

$ python3 ucc.py -C life_test.uc
$ g++ --std=c++20 -I. -o life.exe life.cpp
$ ./life.exe

Alternatively, you can use the Makefile to compile and test life.uc:

$ make life

This phase is optional and will not be graded.

Testing and Evaluation

We have provided a small number of test cases in the distribution code. However, the given test cases only cover a small number of possible uC constructs, so you should write extensive tests of your own.

We will evaluate your compiler based on whether the generated code is a valid C++ translation whose behavior matches that expected from the uC source. Thus, we will run a valid uC source file through your compiler to produce the output C++ code. For Phases 1, 2, and 3, we will combine your output with our own C++ test code that relies on the output being correct. For Phases 1 and 2, we will compile with g++ -c, without linking, to ensure that the declarations are correct. For Phase 3, our test code will use assert statements to test that your generated type definitions are correct. For Phases 4 and 5, we will compile your generated code with g++, and run it and check that the output matches what is expected.

Your generated C++ code itself does not have to exactly match ours. However, its behavior must match the behavior of our C++ code.

We will not test your compiler with erroneous uC code.

For Phases 4 and 5, we will run your generated code through valgrind to ensure that fields are properly default initialized. The provided Makefile invokes valgrind if it is present. This can be disabled (e.g. if your valgrind installation is broken) by setting VALGRIND="":

$ make phase5 VALGRIND=""

Grading

The grade breakdown for this project is as follows:

  • 20% checkpoint

  • 80% final deadline autograded

We will not hand grade this project, nor will we grade your test cases. However, we are still requiring that you submit your test cases, as they are part of your solution.

You are required to adhere to the coding practices in the course style guide. We will use the automated tools listed there to evaluate your code. You can run the style checks yourself as described in the guide.

Submission

All code that you write for the interpreter must be placed in the files ucbackend.py, ucbase.py, ucexpr.py, ucfunctions.py, ucstmt.py, uctypes.py, or expr.hpp. We will test all seven files together, so you are free to change interfaces that are internal to these files. You may not change any part of the interface that is used by ucc.py or ucfrontend.py.

Submit all of ucbackend.py, ucbase.py, ucexpr.py, ucfunctions.py, ucstmt.py, uctypes.py, expr.hpp, and any of your own test files to the autograder before the deadline. We suggest including a README.txt describing how to run your test cases.

Frequently Asked Questions

  • Q: Are we supposed to output the same code as what’s in default_phase1.cpp, etc.?

    A: No. That’s a test file for Phase 1, and it gets compiled together with your output to produce a test executable for that phase. You’re supposed to output code as described in this project spec. For the default.uc test case in Phase 1, your output should be something like:

    struct UC_TYPEDEF(foo);
    struct UC_TYPEDEF(bar);
    struct UC_TYPEDEF(baz);
    
  • Q: Should we write all of our code for Phases 1-4 in ucbackend.py?

    A: No. Project 4 demonstrated the proper way to structure code that operates on an AST, and you should follow a similar organization in this project. Take a look at ucfrontend.py for reference. You are allowed to add methods to ASTNode as needed.

  • Q: I’m getting compiler errors when my generated code is compiled with g++. How do I figure out what’s wrong?

    A: Try running the generated code through the preprocessor as described in Test Cases, which will expand the macros. The following example is for default.uc after Phase 1:

    $ cat defs.hpp tests/default.cpp | grep -v '#include' | g++ -E -
    # 1 "<stdin>"
    # 1 "<built-in>" 1
    # 1 "<built-in>" 3
    # 368 "<built-in>" 3
    # 1 "<command line>" 1
    # 1 "<built-in>" 2
    # 1 "<stdin>" 2
    # 46 "<stdin>"
      struct ucp_t_foo;
      struct ucp_t_bar;
      struct ucp_t_baz;