PyVa Compiler

This project was part of the module Programming Paradigms of my bachelor. The project was about writing a compiler in Java (or Haskell if preferred) for a realistic, although non-existent processor called Sprockell. A hardware level simulator was provided to us in Haskell. I worked on this project together with Rifqi. This is the project which I enjoyed the most during the bachelor. The PyVa compiler source code can be found here, it also contains the report which describes the functionality and the implementation of the project in detail.

Features

The compiler we ended up with is called 'PyVa' and, as the name suggests, refers to a combination of Python and Java. The PyVa programming language supports the following features:

Sample program

The following sample programs illustrate how PyVa may be used.

                    
def void bubbleSort(int[] A, int n) {
    boolean swapped = true;
    while (swapped) {
        swapped = false;
        for (int i = 0; i < n; i = i + 1) {
            if (A[i - 1] > A[i]) {
                int temp = A[i];
                A[i] = A[i - 1];
                A[i - 1] = temp;
                swapped = true;
            }
        }
        n = n - 1;
    }
}

int len = 7;
int[] arr = [4,8,42,1,2,3,69];
bubbleSort(arr, len);

for (int i = 0; i < len; i = i + 1) {
    print arr[i];
}
                    
                    

def record LinkedList {
    LinkedList next; int v;
}

LinkedList l = new LinkedList;
l.v = 2;

def void append(LinkedList l, int v) {
    LinkedList current = l;
    while (current.next != null) {
        current = current.next;
    }
    LinkedList node = new LinkedList;
    node.v = v;
    current.next = node;
}

append(l, 4);
append(l, 8);
append(l, 16);
print l.next.next.v; // prints 8
print l.next.next.next.v; // prints 16



int[] intArray = [1, 2, 3, 4, 5];   // initialized with values
int[] sizedArray = new int[5];      // all elements 0 or null with size 5
char[][] 2dChar = [[‘P’,’y’],[‘V’,‘a’],[‘!’],[]];   //outer size 4
boolean[][][] 3dBool = new boolean[3][2][];
3dBool = new boolean[4][2][2];
print 2dChar[0][1];                 // prints ‘y’
char[] string = “PyVa!”;            // string literals are considered char[]

def record Coo {
    int x; int y;
}

Coo coordinate = new Coo;
coordinate.x = -39;
coordinate.y = 69;

Coo[] coordinates = [coordinate, new Coo];
print coordinates[0].x; // print -39
print coordinate.x == coordinates[0].x; // prints true


Implementation details

ANTLR

The PyVa language is defined using ANTLR. In ANTLR you can define lexer and grammar rules with a .g4 file, see below for the PyVa language description.


grammar PyVa;

/* Program consists of many declarations */
program: decl*;

/* We call definitions, variable assignments, if, while and for statements declarations */
decl: DEF funcType ID LPAR params RPAR LBRACE program RBRACE                    #funcdecl // def void printInt(int i) {...}
    | block                                                                     #blockdecl // {...}
    | vardeclare SEMI                                                           #vardecl // int i = a = 0; i = a = 1;
    | IF LPAR expr RPAR block (ELSE block)?                                     #ifdecl // if (i > 1) {...}
    | WHILE LPAR expr RPAR block                                                #whiledecl // while (i < 10) {...}
    | FOR LPAR vardeclare SEMI expr SEMI vardeclare RPAR LBRACE program RBRACE  #fordecl // for (int i = 0; i < 10; i = i + 1) {...}
    | funcall SEMI                                                              #funcalldecl // printInt(1);
    | RETURN expr SEMI                                                          #returndecl // return 1;
    | THREAD ID block                                                           #threaddecl // thread pp {...}
    | JOIN ID SEMI                                                              #joindecl // join pp;
    | SYNC block                                                                #syncdecl // sync {...}
    | DEF RECORD identifier LBRACE recordvartpye* RBRACE                        #recorddecl // def record Coor {int x; int y;}
    | PRINT expr SEMI                                                           #printdecl // print true;
    ;


/* Record declaration */
recordvartpye
    : identifier ID SEMI
    ;

/* A scope block is just many declarations surrounded by curly braces */
block
    : LBRACE program RBRACE
    ;

/* A variable declaration. the first one is for reassignments and the second one is for declarations of new variables */
vardeclare
    : idCall (ASS idCall)* ASS expr         #typelessvardecl // i = a = 2;
    | identifier ID (ASS ID)* ASS expr      #typevardecl // int i = a = 2;
    ;

arrayInit: OBRACK arrayValue? CBRACK                        #valuedArrayInit // [[1], [2], [3]] == int[3][1] regarding size
    | NEW identifier (OBRACK expr CBRACK)+ (OBRACK CBRACK)* #sizedArrayInit // int[4][3][2][], cannot do int[4][][2]
    ;

/* Array initialization with direct values */
arrayValue: (expr (COMMA expr)*); // [1, 2, 3]

/* Accessing an ID. It can look like normal variables (i) or accessings of variables */
idCall: ID accesser*; // e.g. i[2], i.x, i.x[0][2].y

/* The accessers of idCall */
accesser: DOT ID // .x
    | (OBRACK expr CBRACK)+ // [1][2][6/2]
    ;

/* Parameter instances of function definition */
params
    : (identifier ID (COMMA identifier ID)*)? // int i, char c
    ;

/* Primitive types */
primitiveType: BOOLEAN | INT | CHAR | NULL | VOID;

/* The list of possible types. These are the types declared on the left of variable names when declaring new variables. */
identifier: primitiveType | arrayType | compoundType;

/* Array Types */
arrayType: BOOLEAN (OBRACK CBRACK)+
    | INT (OBRACK CBRACK)+
    | CHAR (OBRACK CBRACK)+
    | ID (OBRACK CBRACK)+ //ID refers to user defined identifier here
    ;

/* Custom Types for threads or records */
compoundType: ID | THREAD;

/* Function types can by any type that identifier can allow, or it can be void */
funcType
    : identifier    #typeFuncType
    | VOID          #voidFuncType
    ;

/* All right hand side of assignments are expressions. Expressions can also be used for conditions, parameters, etc. */
expr: prfOp expr        #prfExpr // !true, -2
    | expr multOp expr  #multExpr // 2 * 2, 4 / 2
    | expr plusOp expr  #plusExpr // 2 + 2, 2 - 2
    | expr compOp expr  #compExpr // 2 < 3, 2 <= 3, 4 == 4, 4 != 4, 5 > 4, 5 >= 5
    | expr boolOp expr  #boolExpr // true || false, true && true
    | LPAR expr RPAR    #parExpr // (2 + 2)
    | funcall           #funcallExpr // mod(11, 7)
    | idCall            #idExpr // i
    | arrayInit         #arrayInitExpr // [1, 2, 3], new int[3]
    | CHR               #charExpr // 'a'
    | NUM               #numExpr // 2
    | TRUE              #trueExpr // true
    | FALSE             #falseExpr // false
    | STRING            #stringExpr // "PyVa"
    | NEW ID            #recordExpr // new Coor
    | NULL              #nullExpr // null
    ;

/* A function call expression */
funcall
    : ID LPAR (expr (COMMA expr)*)? RPAR
    ;

prfOp: MINUS    #minusPrfOp
    | NOT       #notPrfOp
    ;

multOp: STAR | SLASH;

plusOp: PLUS | MINUS;

boolOp: AND | OR;

compOp: LE | LT | GE | GT | EQ | NE;

PRINT:  'print';
CLASS:  'class';
NEW:    'new';
IF:     'if';
ELSE:   'else';
FOR:    'for';
WHILE:  'while';
RETURN: 'return';
NOT:    '!';
BOOLEAN:'boolean';
VOID:   'void';
CHAR:   'char';
NULL: 'null';
INT:    'int';
TRUE :  'true';
FALSE:  'false';
OR:     '||';
AND:    '&&';
DEF:    'def';
ASS:    '=';
COLON:  ':';
COMMA:  ',';
DOT:    '.';
DQUOTE: '"';
SQUOTE: ['];
EQ:     '==';
GE:     '>=';
GT:     '>';
LE:     '<=';
LBRACE: '{';
LPAR:   '(';
LT:     '<';
MINUS:  '-';
NE:     '!=';
PLUS:   '+';
RBRACE: '}';
RPAR:   ')';
SEMI:   ';';
SLASH:  '/';
STAR:   '*';
OBRACK: '[';
CBRACK: ']';
THREAD: 'thread';
JOIN:   'join';
SYNC:   'sync';
OCOM:   '/*';
CCOM:   '*/';
DOUBLESLASH: '//';
RECORD: 'record';

ID: LETTER (LETTER | DIGIT)*;
NUM: DIGIT (DIGIT)*;
STRING: DQUOTE .*? DQUOTE;
CHR: SQUOTE .*? SQUOTE;
BOOL: TRUE | FALSE;

WS: [ \t\r\n]+ -> skip;
COMMENT: (OCOM ~[*]* CCOM) -> skip;
LINECOMMENT: DOUBLESLASH ~[\n]* -> skip;

fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];

Compiler Stages

The PyVa compiler consists of four stages, namely type and function discovery, type and function implementation, semantics checking and code generation. This is achieved by using ANTLR's generated visitor and listener classes based on our language specification.

Discovery Stage: During the discovery stage, all user-defined types and functions are listed. This phase will fail if types and/or functions are declared twice in the same scope, but also when the user tries to override one of the primitive types (int, boolean, char, thread). This was implemented in order to support hoisting, i.e. support the use of types and functions that are defined later in the same scope.

Implementation Stage: During the implementation stage, types and functions are made concrete. In the discovery stage we only find the identifiers for the types and functions but during this stage we actually fill in the definitions of the types and functions. What types does this new type contain? What type is returned by this function? What parameters does it take? There are numerous reasons why this stage might fail. One of them is that the user is using a type that is not defined anywhere in the enclosing scopes. Another reason could be that there are overlapping identifiers for the fields in a struct.

Semantics Checking Stage: During the semantics checking stage, everything regarding semantics is being checked. This stage makes sure that the correct types are being assigned to the specified variables. You cannot, for example, assign 'null' to a primitive integer. Other things to consider are that a function must have a return statement in all its execution paths. For example the function defined below is not accepted as this function might not return anything.


def int f() {
    if (false) {
        return 0;
    }
    // returns nothing here
}

Code Generation Stage: During the code generation stage, we actually generate the Sprill code that can be run by the Sprockell processor. The class responsible for generating the code is huge. With ~1200 lines of code (without documentation) we are able to generate the resulting Sprockell program from an input file. In comparison to all other stages, the generator is implemented with an ANTLR visitor such that we have full control over the children that are being visited. Although definitely not ideal, the generator simply keeps appending instructions to a string builder in order to create the file. In the meantime we have to keep track of some very specific instruction positions in order to find, for example, the access point of a function.