Your own virtual machine and assembler

Don't panic! This is a long description but that doesn't mean the assignment is
long. There is just a lot of detail because most of you are new to this kind of
thing, so I won't take anything for granted. No detailed requirements are hidden
in the descriptions. Just create an assembler and virtual machine with at least
the instrctions given below and make sure it can run a variety of programs.
A substantial part is also for extra credit only.

What I call the first, second, ... step are really just recommendations. You can split
the task up as you see fit. But remember that if you type the whole thing at once it
will be very difficult to locate the sources of the enormous number of inevitable errors.

For clarity, definitions of terms
   Assembly Language: the human readable form of the computer's internal language
   Machine Code: The binary version of A.L. that the computer's electronics obey
   (those two are often used interchangeably becuase they are so closely related)
   Assembler: the program that converts assembly language into binary machine code
   Virtual Machine: software that behaves exactly like a real computer and runs the M.C.

As a reminder, there will be three separate areas of memory in the virtual machine,
they may be as simple as arrays of a reasonably large size. One (an array of unsigned
ints) will store the executable code, another (an array of ordinary ints) will store
the values of the program's variables, and the third (an array of strings) is a bit
of a cheat but it will allow your programs to produce comprehensible output.

Here, I type assembly language instructions in capital letters, that is just to make
them stand out as what they are, you do not have to do the same.

First step

   Make a program that can read and decode an assembly language program. The format
   of the assembly language is ...
      blank lines and lines beginning with # are to be ignored
      lines beginning with a space are instructions. All instructions begin with
          an "opcode" and may specify a register and a numeric operand as in ...
              STOP
              JUMP XXX         XXX is the numeric operand, no register has been given
              ADD  RRR XXX     RRR is a register name, XXX is a numeric operand
                               there is no way to provide a register but no numeric operand.
          opcodes, registers, and numeric operands will never have spaces within them.
          You should choose easily recognisable names for your registers, but their names
          must feature a number to identify the register, R3 or even just 3 are reasonable
          choices. numeric operands will be either names or numbers (ints only).
      any line beginning with anything other that space or # is a "directive". It is
      not an executable part of your program, it tells the assembler to do something.

   You must decide what the opcodes are and how they are named, but there must be enough
   for useful programs to be created. At a minimum there must be ...

      Arithmetic instructions for addition, subtraction, multiplication, and division.
         All specify a register and an operand, and result in the value stored in the
         register being changed accordingly. Arithmetic instructions must all have three
         forms, one in which the numeric operand represents a number, one in which the
         numeric operand represents an address / memory location, and the value stored
         there is the value used to modify the register, and one form in which the
         numeric operand is actually a register number. Instructions must have names
         that strongly suggest what they do. The three forms must have slightly different
         names or your assembler will be in trouble. 
         for example, you could choose
             MULN  RRR  123     to mean multiply the register by 123, and
             MULM  RRR  123     to multiply the resister by whatever is stored in
                                memory location 123.
             MULR  RRR  3       to mean multiply register RRR by whatever is stored
                                in register 3.

      Jump instructions in at least four forms: unconditional in which no register
         is given and execution causes your program to change course so that the next
         instruction comes from the position indicated by the numeric operand.
         Conditional jumps make the change of course only happen if the provided
         register satisfies a particular condition: less than zero, equal to zero,
         or greater than zero. Naturally all four (or more) jump instructions must
         have names that strongly suggest that a jump is going to happen, and any
         condition required.

      A stop instruction that when executed makes the program stop running. Useful
         recommendation: make stop be the instruction that has opcode 0, that way
         if you forget to put a stop at the end of a program, it will probably
         stop anyway.

      Three "load" instructions that specify a register and an operand. They set
         the register to a new value. Just like for arithmetic, one version provides
         the new value directly, another gives the address in memory in which the
         new value is to be found, and the third takes the new value from another register.

      A "store" instruction that overwrites the memory location given by the numeric
         operand with the register's value. The opposite of load.

      At least four input/output instructions, three of them use a register but ignore
         their numeric operand. One waits for the user to type an int value, and sets
         the register to that value. One does the opposite - prints the value of the
         register for the user to see. The third and fourth cause a string to be
         printed. Strings will be built into your assembly language program and
         will be represented by a number. In one of the instructions that number is
         the numeric operand, in the other it comes from the register.
      
      Of course you may add more to make the virtual machine more useful but be
      careful to get the requirements right first.

   Each line of the file that represents an instruction will create a single 32 bit
   int that represents the whole instruction and stores it in the next free location
   in the program code memory.

   The opcodes will all have numeric values that become part of the 32 bit instruction.
   Nothing clever is required for this, you could have a const array of strings in
   your program, the strings are the opcodes and their position in the array gives
   their numeric value.
   
   You must have suitably named const ints to record the positions. For instance if
   "MULN" happens to be at position 11 in that array, you might define
       const int ... ... , op_muln = 11, ..., ... ;
   Trying to remember what the numbers are would be hopeless. Sensible names for them
   is the only option.

   The opcode's value, register number, and numeric operand must all be squeezed
   into the one 32 bit instruction. The number of bits for each of the three
   parts is easy to determine. Set aside enough bits for the largest of your opcode
   values, then enough bits to specify a register number (go for 8 or 16 registers,
   less than 8 can be difficult, and more than 16 is just a waste) so 8 registers
   would need three bits, 16 would need 4. The remainder of the 32 bits give the
   numeric operand.

   For this first stage of your creation, your assembler should just print the
   things it calculates, first as three separate values, then as the combined 32 bit
   instruction value. For example, if you have a LOADN (code 3) instruction for putting
   a numeric value in a register, a MULN (code 11) instruction for multiplying by 
   the operand, a STORE (code 4) operation for storing the register value in memory,
   and a STOP (code 21) instruction, and you have a total of 29 different instruction
   codes (they would fit in 5 bits), and 16 registers, this little bit of program
        LOADN R2 66
        MULN  R2 99
        STORE R2 5
        STOP
   when run it would work out what 66 * 99 is, and store it in data memory location
   5, but we're not there yet. The numeric values for the three components of each
   instruction would be
        3, 2, 66
        11, 2, 99
        4, 2, 5
        21, 0, 0
   in binary, stretched to the appropriate number of bits, that is
        00011, 0010, 00000000000000001000000
        01011, 0010, 00000000000000001100011
        00100, 0010, 00000000000000000000101
        10101, 0000, 00000000000000000000000
   as 32 bit values in binary, they are trivially
        00011001000000000000000001000000
        01011001000000000000000001100011
        00100001000000000000000000000101
        10101000000000000000000000000000
   as ints printed in decimal they would be
        419430464, 1493172323, 553648133, 2818572288, but you would never waste time
   working that out, it is better to look at them in hexadecimal. Incidentally, the
   fourth instruction has a 1 in its most significant bit, so it would naturally be
   printed as a negative, -1476395008, so you should use unsigned ints to avoid
   confusion.
   If that were the entire assembly language program, those four values would be
   stored in positions 0, 1, 2, and 3 of the code memory array. But for this first
   stage we only want to check that the values are correct, so I would expect the
   output to be something like
        00011, 0010, 00000000000000001000000
        00011001000000000000000001000000
        01011, 0010, 00000000000000001100011
        01011001000000000000000001100011
        00100, 0010, 00000000000000000000101
        00100001000000000000000000000101
        10101, 0000, 00000000000000000000000
        10101000000000000000000000000000
   displaying everything in binary will make it very easy to check correctness.
   Hexadecimal would be not quite so easy but much more compact.

Second step

   We will add a directive and assemble a program. If a line in the assembly
   language program file begins with the word "label" that will be followed by
   another word that will be recorded as the name for the position of the next
   (as yet unseen) instruction in program code memory. This is easy to handle,
   just give your assembler a vector "names" of strings and a vector "places"
   of ints, and also count the instructions as they are encountered. If your
   program started as above
        LOADN R2 66
        MULN  R2 99
        STORE R2 5
      LABEL end
        STOP
   then when the LABEL end directive is seen you will have already seen three
   instructions, so "end" would be the first string in names, and 3 would be
   the first int in places. Later on, if we see an instruction that says
        JUMP end
   your assembler would search the vectors and know that the numeric operand
   of this JUMP instruction should be 3.

   This means that your program must become a "two pass" assembler, which is
   the normal thing. It means that it reads the assembly language program file
   twice. On the first pass it simply counts instructions and puts the right
   values in the names and places vector, no conversions to binary are done.

   On the second pass you already know the locations/numbers associated with
   all labels, so do not re-process them, just use the two vectors as they
   are. This is when you convert the program into its binary form as unsigned
   ints and store them is the program code memory.

Third step

   At some point, you should turn off the printing of the instruction parts
   we had at the end of step one, but make them conditional so that it will
   be easy to put them back if things go wrong in the future.

   Two more directives, easy after you have done the second step. One directive
   consists of the word DATA followed by another word and then an int value.
   e.g.   DATA count 35
   This is a declaration of a variable, and you'll need a vector of strings
   for this. Whenever you see a DATA directive, push the name onto this new
   vector and store the initial value, 35, in the corresponding position in
   the data memory area. Now, if your assembler sees a word in an instrucion
   and it doesn't match any label, search the new variables vector, and the
   position in the vector tells you the correct memory location for this
   variable. For example ...
      if your program has already seen three DATA directives and then encounters
   DATA size 17 during the first pass, "size" will be at position 3 in the
   vector and 17 will be at position 3 in the data memory. In the second
   pass when an instruction like 
      LOAD R1 size
   size will be replaced by 3 in the instruction code, and that's all that
   is needed for (global) variables.

   The second new directive is called STRING. Once you see STRING at the
   beginning of a line, take the following word, replace all of its underlines
   _ with spaces, and push it onto the vector of output strings I mentioned
   right at the top. Having _ represent a space just means that you have a
   simple way to read the whole string. It is the assembly language programmer's
   responsibility to keep not of what order the strings appeared in. Now
   the two mysterious instructions for printing strings can do their jobs. 
   Let's say the first was called OUTSN (OUTput String by Number) and OUTSR
   (OUTput String from Register) Then on execution, the instrcution
      OUTSN 4
   would cause the 4th recorded string to be printed, as would the sequence
      LOADN R2 4
      OUTSR R2 0
   Pick on a character that isn't of much use in ordinary strings, something
   like \ or ~, and make it stand for newline, but only when your assembly
   language program is actually running. If you had
      STRING was_the_answer~
   then "was the answer~" would be in the strings memory, but when your
   program executes an OUTSN or OUTSR for that string, it would be printed
   with a real newline \n at the end instead of the ~.

   To recap, after this third step, your two-pass assembler will be creating
   a binary executable program in the program code memory area as well as
   handling the new directives.

Fourth step

   This probably seems as though it would be the hardest, but it shouldn't be.
   Run the program that your assembler just put together. You just need to
   create an int variable (I'll call it PC) initial value 0, to say where the
   next instruction to be executed can be found. Running a program requires
   just this sequence:
      1. instr = code_memory[PC];
      2. PC += 1;
      3. split instr into its three parts opcode, reg, and number, or whatever
   you want tcall them. Beware, "register" has a special meaning in C++.
      4. if (opcode == op_loadn)
           ...
         else if (opcode == op_loadr)
           ...
         else
           ...

   You have already got everything set up in the program code and data and
   strings memory area, all that's needed is an array of ints for the registers
   and you're ready to go.

   Once again, start off by making error tracing easy. Every time an instruction
   is executed, print it in a readable way (opcode printed as MULN rather than 11)
   and print the values of the register it uses and any memory location used
   just before and just after executing the instruction. No need to be clever
   there: if the numeric operand is N, and N is between 0 and the size of the
   data memory array, print the value stored there. Don't waste time working
   out whether or not N is actually going to be used as an address.

   Make it easy to turn that tracing on and off so that you don't have your
   program's proper output hidden by all of that.

Lots of tests, you must be confident that everything works.

Some samples, the meanings of the opcodes should be obvious from the examples
I gave above.

Add up the squares of all the numbers between 0 and what the user says:

STRING number?_
STRING the_sum_is_
STRING ~ 
DATA n 0
   LOADN R2 0    (R2 will keep the running total)
   OUTSN 0
   READN R1 0    (the 0 is useless but required)
LABEL loop
   STORE R1 n    (keep N safe)
   JZER  R1 end  (we're adding up the squares in reverse order so 0 means done)
   MULR  R1 1    (multiply R1 by R1)
   ADDR  R2 1    (add register 1 to the running total)
   LOADM R1 n    (recover saved n ready for the loop because R1 has been squared)
   SUBN  R1 1
   JUMP  loop
LABEL end
   OUTSN 1
   OUTR  R2 0
   OUTSN 2
   STOP

An (ugly) multiplication table up to 12. Throughout this, R3 is used as a temporary

STRING _
STRING ~
   LOADN R1 1     (R1 is for the row)
LABEL outer
   LOADN R2 1     (R2 is the the column)
LABEL inner
   LOADR R3 1     (R3 = R1, the row number)
   MULR  R3 2     (R3 *= R2, the column)
   OUTR  R3 0
   OUTSN 0
   ADDN  R2 1     (increment column for next time round the loop)
   LOADR R3 2     (preparing to compare column number (R2) with 13)
   SUBN  R3 13
   JNEG  R3 inner (column - 13 is negative means more columns to come)
   OUTSN 1
   ADDN  R1 1     (increment the row for the outer loop)
   LOADR R3 1     (preparing to compare row with 13)
   SUBN  R3 13
   JNEG  R3 outer
   STOP

Extra Credit possibilities:

   Make functions work. This involves using the part of the data memory that is
   not occupied by the program's variables/data to make a stack, and creating
   another speical variable, like PC, that records where the end of the stack is.
   I'll call it SP for Stack Pointer. To push a value onto the stack, just add
   one to SP and store the value at the memory location that SP now indicates.
   To pop something from the stack it is just the reverse - take the value in the
   memory location that SP is indicating then subtract one from it (from SP that is).
   You would also need to invent a few new instructions, but they are not complicated.
   One of them is a push instruction whose operand is always an address in the
   data memory area, and it does exactly what you just saw. The other is a pop
   instruction, the value removed from the stack is stored in the memory location
   given by the instruction's operand. For example, this sequence
      PUSH 2
      PUSH 7
      POP  2
      POP  7
   would have the effect of swapping the values of the variables in locations 2
   and 7.

   Functions are just stretches of code with a LABEL at the beginning that is the
   function's name. There are two essential obvious things to making functions
   work. One is that there must be a way to get back to where the program was
   before the call when the end of the function is reached. The other is that
   local variables that the function uses must not overwrite existing variables.
   The assembly language programmer is responsible for all of those things, you
   do not have to automate those actions or anything like that.

   Both of those needs are satisfied by the stack and the few new instructions.
   One of the new instructions is for calling a function (it's name is often
   CALL or something very close to that). It is exactly the same as the unconditional
   jump instruction except that it pushes the old value of PC onto the stack
   as well as changing it to the jump address (i.e. the beginning of the function).
   At the end of a function there is the counterpart to call, a "return" instruction.
   It has no register and no operand. All it does is pop a value from the stack
   and set PC equal to it. If you have been using the stack in the proper way,
   the value popped will be the old PC value that the call instruction pushed,
   so the program will jump back to the instruction after that call.

   Local variables: the way a modern programming language manages local variables
   if more complicated than this, but this will do the trick. Before every call
   instruction the programmer adds a sequence of "push" instruction, one for
   every variable that the calling function is using. This means that their
   values are safe on the stack so the called function can do anything it wants.
   After the call instruction the programmer adds a sequence of pop instructions
   the same as the pushes but in reverse order, so the saved values of those
   variables have been restored.

   This could of course be turned inside out: the caller does nothing but call
   the function. The function's first instructions are to push the variables
   that it is going to use, and just before the return instruction it pops them
   back to their original values.

   Parameters and return value? This is often handled by programmers agreeing
   on the details of a very simple scheme: just before calling a function, put
   the first parameter in register R1, the second in R2, and so on. So long as
   the people who write the function and the caller know the scheme it works
   very easily. Return value is the same: another agreement between programmers,
   perhaps functions always store their result (if there is one) in R1 just 
   before the return instruction.

An example. A factorial function and its caller

   STRING ~
   LABEL factorial
      LOADN R2 1      (we'll build the result in R2 and use R1 to drive the loop)
   LABEL loop
      JZER  R1 done   (jump if the paramter is 0, all finished)
      MULR  R2 R1     (result so far multiplied by N)
      SUBN  R1 1      (decrement N)
      JUMP  loop
   LABEL done
      LOADR R1 2      (put the result in R2 where it is expected)
      RET

   ...
   ...
   ...                (this shows that function being used. I'll pretend that
   ...                 varibles x, count, and fido are currently in use. We
   ...                 want to set fido to the factorial of x)
   LOAD  R1 x         (put the parameter where it is expected to be)
   PUSH  x
   PUSH  count
   PUSH  fido
   CALL  factorial
   POP   fido         (the factorial function didn't use the three saved variables
   POP   count         but in general the programmer would have no way of knowing
   POP   x             that)
   STORE R1 fido      (put the result where it is wanted)
   OUTR  R1 0
   OUTSN 0
   STOP