Using the Compiler

First, you don't need to use the compiler at all. You can do everything by writing programs for your emulator in its assembly language. The compiler is just provided to make experiments easier.

Secondly, you certainly don't have to understand how the compiler works. It is just there as a tool.

The compiler consists of three files: compiler.cpp, defs.cpp, and defs.h, all in the directory /home/demo which is accessible to everyone on rabbit, just copy (cp) those files into your own directory, compile them, and you're ready to go. That directory also contains a file called emu, which is a "reference" emulator, executable only. If you're unsure how I intended certain instructions to behave, you can use that emulator to find out. Just run it, and enter the command "help" to see what it can do. Your emulators don't have to behave or look exactly like this reference one.

Compile the compiler like this:

CC -c defs.cpp
CC -c compiler.cpp
CC compiler.o defs.o -o cmpl

Then you will be able to use the command cmpl to run the compiler.

The compiler does not support all of C, but a reasonable subset useful for the kinds of experiments you are most likely to be doing. I called the language C-- ("C minus minus") because it had to be called something.

`cmpl filename`	compiles the C-- program in the named file. If the filename ends with `.cmm`, you don't have to type the .cmm, it is assumed.
`cmpl -p file`	Print, with proper indentation and extra parentheses, the program while compiling it. This is useful if you think the compiler isn't seeing the program the way you intended it.
`cmpl -l file`	Shows the assembly code instructions being produced.
`cmpl -n file`	This option prevents the use of labels. The compiler works out the exact addresses, and puts them in numeric form in the assembly code.

Multiple options may be combined, e.g. cmpl -lp filename.

The compiler produces two output files, both with names based on the name of the input file. If the input file is called file.cmm, the output files will be called file.ass and file.exe. The .exe file is not very useful to you. It contains the fully assembled version of the program (i.e. binary executable codes), with a few numbers describing where in memory it should be loaded. Your emulators don't have to understand .exe files. The reference emulator is happy with them, the load command is used to load a .exe file for execution.

The .ass file contains an assembly language program ready for processing by your emulator. In it, semicolons ';' are used to start comments: your emulators should ignore everything on any input line after a semicolon. The .ass file contains each C-- statement from the original program as a comment immediately before the assembly code that it produced. This is to make it easier for you to follow the program.

The .ass file also makes use of two Pseudo-Ops, things that look like assembly code instructions, but are really commands to the assembler. Both of them have names beginning with a dot '.' to make them stand out.

.LOC N tells the assembler that the next instruction should be stored in memory location N. Programs usually begin with .LOC 512 to ensure that page 0 is not used.

.DATA N tells the assembler to store the number N in the next memory location, instead of an assembled instruction. This is used to initialise variables. If a program uses memory location 1025 for a variable, at the end of the assebly code there will be the sequence .LOC 1024; .DATA 0.

The C-- Language

The language that the compiler understands is a subset of C, with a few simple additions to support shared memory and multiple processes. The basic statements are: Assignment, If-Else, While, For, Print, Read.

Assignments, If-Elses, Whiles, and Fors are exactly as they are in normal C. Expressions used may consist of variables, integers, character constants, parentheses, the following binary operators: (+ - * / == != < > <= >= && ||), and the following unary operators: (+ - ! &). The following assignment operators are also allowed: (= += -= *= /=). The & operator is only provided so that you can print the address of a variable; pointers are not supported.

Print has been made into a simple statement, to avoid the complexities of printf formats. After the keyword "print" there may be a list of values separated by commas and terminated by a semicolon. The values may be variables, constants, expressions, or strings. Strings are surrounded by quotes (") in the normal way; The print statement is the only place where strings are used in C--. Inside strings ("x") or character constants ('x'), the following escape characters are understood: (\n \r \t).

Read has also been made into a simple statement to avoid the complexities of scanf. After the keyword "read" there may be a list of one or more variable names, separated by commas and terminated by a semicolon.

A group of statements may be surrounded by curly brackets { } to make a single compound statement exactly as in C.

As in C, all variables must be declared before they are used. As C-- does not have functions, there is no such thing as a global variable. A declaration statement consists of "int" or "char" followed by a list of variable names, separated by commas and terminated by a semicolon. Just like in C, except that initialisation is not allowed. There are no arrays.

Generally, a C-- program just consists of a sequence of statements. They do not need to be enclosed in { and }, but can be.

Sample Programs

These sample programs do not use shared memory or multiple processes, which are described next.

A program to print out an un-aligned, ugly 12x12 multiplication table:

int a, b;
for (a=1; a<=12; a+=1)
{ for (b=1; b<=12; b+=1)
    print ' ', a*b;
  print '\n'; }

This is the output it produces:

 1 2 3 4 5 6 7 8 9 10 11 12
 2 4 6 8 10 12 14 16 18 20 22 24
 3 6 9 12 15 18 21 24 27 30 33 36
 4 8 12 16 20 24 28 32 36 40 44 48
 5 10 15 20 25 30 35 40 45 50 55 60
 6 12 18 24 30 36 42 48 54 60 66 72
 7 14 21 28 35 42 49 56 63 70 77 84
 8 16 24 32 40 48 56 64 72 80 88 96
 9 18 27 36 45 54 63 72 81 90 99 108
 10 20 30 40 50 60 70 80 90 100 110 120
 11 22 33 44 55 66 77 88 99 110 121 132
 12 24 36 48 60 72 84 96 108 120 132 144

A program to calculate a factorial:

int n, x, f;
f=1;
print "enter a number: ";
read n;
x=n;
while (x>0)
{ f=f*x;
  x=x-1; }
print "the factorial of ", n, " is ", f, "\n";

and the output it produces

enter a number: 7
the factorial of 7 is 5040

This is the contents of the .ass file produced when the first sample (multiplication table) is compiled:

     .LOC    512
;  int a, b;
;  int a_0 at 1025
;  int b_0 at 1026
;  for (a=1; a<=12; a+=1) ...
     LOADC   1
     STORE   1025
L1:
;  a<=12
     LOAD    1025
     CMPC    12
     JPOS    L2     ; = 537
;  for (b=1; b<=12; b+=1) ...
     LOADC   1
     STORE   1026
L3:
;  b<=12
     LOAD    1026
     CMPC    12
     JPOS    L4     ; = 531
;  print ' ', a*b;
     LOADC   32
     OUT   
     LOAD    1025
     MUL     1026
     OUTD  
;  b+=1
     LOAD    1026
     ADDC    1
     STORE   1026
;  end of for
     JUMP    L3     ; = 519
L4:
;  print '\n';
     LOADC   10
     OUT   
;  a+=1
     LOAD    1025
     ADDC    1
     STORE   1025
;  end of for
     JUMP    L1     ; = 514
L2:
     EXIT    0
     .LOC    1025
     .DATA   0
     .DATA   0

Multiple Processes

A program may define a number of processes. These are a bit like normal C functions in appearance, and should come before the statements of the main program. A process definition consists of the keyword process followed by a name, followed by a list of statements inside curly brackets { }. Defining a process does not automatically execute the process, just as defining a function doesn't automatically call that function.

A process may be started by using the fork function. Unlike in unix, the fork function takes one parameter, which must be the name of a previously defined process. The idea is that fork simply starts up new time-shared process, exactly identical to the original one, (with a private copy of all of its memory pages), but running the named process. The original process continues executing the next statement in the normal way.

Here is a sample of a program that uses multiple processes, but not in a very useful way:

process one
{ int i;
  for (i=1; i<8; i+=1) 
    print i, '\n';
  print "one done\n"; }

process two
{ int i;
  for (i=100; i<700; i+=100)
    print i, '\n';
  print "two done\n"; }

int i;
fork(one);
fork(two);
for (i=20; i<28; i+=1)
  print i, '\n';

And here is the output it produces when run:

If you look carefully, you can see that the numbers 1, 2, 3, 4, 5, 6, 7 all appear in the right order, as do the numbers 100, 200, 300, 400, 500, 600, and the numbers 20, 21, 22, 23, 24, 25, 26, 27. However, the numbers from the three sequences are all interleaved in a random way, as are the characters from the two strings.

Shared Memory

A variable declaration may be preceded by the sequence shared "name", where name is a string of up to four characters used to identify a page of shared memory. If two processes declare variables using the same share name, those variables will be allocated in the same physical page, and can therefore be used to communicate between the two processes.

This sample program is a misguided attempt at communications between two processes:

process one
{ shared "pig" int square;         
  int i;
  for (i=0; i<10; i+=1)
  { square=i*i;
    sleep(); }                    
  square=99; }

process two
{ shared "pig" int value;
  sleep(); 
  while (value!=99)
    print value, '\n'; }
    
fork(one);
fork(two);

The variable 'square' in the first process and the variable 'value' in the second process occupy the same locations in physical memory, and so are really just two names for the same shared variable. The first process generates the squares 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 one-by-one and puts them in the shared variable. The second process prints the values it finds in that shared variable. The problem is that there is no proper synchronisation between the processes. The sleep() function simply generates a SLEEP emulator instruction, which makes the current process surrender the rest of its turn on the CPU. This output was produced by the program:

As you can see, it contains the right numbers, but with repetitions, because there is no way for the second process to know when a new number has been produced.

Semaphores

A semaphore is simply a shared variable that is used in a very special way, to protect a critical section. The two operations that may be performed on a semaphore are P() and V(). The compiler implements these functions as SemaphoreP(s) and SemaphoreV(s), where s is the semaphore variable itself.

The compiler provided to you compiles these two operations deliberately incorrectly, because part of your assignment is to work out the correct implementation for your emulator.

The following program is a corrected version of the previous program. It uses a shared variable called ready to say whether or not there is a new square waiting to be printed, and a semaphore to properly protect accesses to that shared variable.

process one
{ shared "pig" int s, square, ready;
  int i;
  for (i=0; i<10; i+=1)
  { while (ready) sleep();
    semaphoreP(s);
    square=i*i;
    ready=1;
    SemaphoreV(s); } }

process two
{ shared "pig" int s, value, ready;
  int a;
  while (1)
  { while (!ready) sleep();
    semaphoreP(s);
    a=value;
    ready=0;
    SemaphoreV(s);
    print a, '\n'; } }

fork(one);
fork(two);

Note that read-only accesses to a shared variable (such as looking at ready in the while loop conditions) do not necessarily need to be protected. This is the output it would produce if the compiler produced the correct code for semaphoreP and semaphoreV:

But note that in this version process two would never terminate.

Correcting the Semaphore Code

For the sake of easy experimentation, I have made the part of the compiler that is responsible for translating semaphoreP and semaphoreV more-or-less understandable. To correct it, edit compiler.cpp, and search for the function called compile_semaphore_p, and you will see the following definitions. The function new_label is used to prepare a label that can be used in jump instructions. The function generate actually produces the assembly code: its first parameter is the instruction's op-code, and the second is its operand (zero if none), the third operand is the number of operands (0 or 1) and the fourth is a letter indicating how the optional operand is to be treated, 'N'=normally, 'L'=as a label.

Beside the functions, I have included a sample of the assembly language produced, so you can see what does what, and work out how to correct the functions:

void compile_semaphore_p(int semaphore_address)
{ int again=new_label();                      //          ; set aside L1 as label 'again'
  int cont=new_label();                       //          ; set aside L2 as label 'cont'
  generate(I_TAS, semaphore_address, 1, 'N'); //  TAS  s  ; s is the sempahore
  define_label(again);                        //L1:
  generate(I_ADDC, 1, 1, 'N');                //  ADDC 1
  generate(I_JNEG, cont, 1, 'L');             //  JNEG L2
  generate(I_JUMP, again, 1, 'L');            //  JUMP L1
  define_label(cont); }                       //L2:

void compile_semaphore_v(int semaphore_address)
{ generate(I_LOADC, 3, 1, 'N');               //  LOADC 3
  generate(I_ADD, semaphore_address, 1, 'N'); //  ADD   s
  generate(I_HALT, 0, 0, 'N'); }              //  HALT

For debugging purposes, the following little program allows us to see exactly the code produced for these functions. Don't try to run this code, just look at the .ass file.

int s;
print 97;
semaphoreP(s);
print 98;
semaphoreV(s);
print 99;

The print statements are just there to clearly separate the program into the P and V parts. This is the assembly code produced, you can clearly see how it matches the two compiler functions shown above:

     .LOC    512
;  int s;
;  int s_0 at 1025
;  print 97;
     LOADC   97
     OUTD  
;  semaphoreP(s);
     TAS     1025
L1:
     ADDC    1
     JNEG    L2     ; = 518
     JUMP    L1     ; = 515
L2:
;  print 98;
     LOADC   98
     OUTD  
;  semaphoreV(s);
     LOADC   3
     ADD     1025
     HALT  
;  print 99;
     LOADC   99
     OUTD  
     EXIT    0
     .LOC    1025
     .DATA   0