Your own virtual machine and assembler Don't panic! This is a long description but that doesn't mean the assignment is long. There is just a lot of detail because most of you are new to this kind of thing, so I won't take anything for granted. No detailed requirements are hidden in the descriptions. Just create an assembler and virtual machine with at least the instrctions given below and make sure it can run a variety of programs. A substantial part is also for extra credit only. What I call the first, second, ... step are really just recommendations. You can split the task up as you see fit. But remember that if you type the whole thing at once it will be very difficult to locate the sources of the enormous number of inevitable errors. For clarity, definitions of terms Assembly Language: the human readable form of the computer's internal language Machine Code: The binary version of A.L. that the computer's electronics obey (those two are often used interchangeably becuase they are so closely related) Assembler: the program that converts assembly language into binary machine code Virtual Machine: software that behaves exactly like a real computer and runs the M.C. As a reminder, there will be three separate areas of memory in the virtual machine, they may be as simple as arrays of a reasonably large size. One (an array of unsigned ints) will store the executable code, another (an array of ordinary ints) will store the values of the program's variables, and the third (an array of strings) is a bit of a cheat but it will allow your programs to produce comprehensible output. Here, I type assembly language instructions in capital letters, that is just to make them stand out as what they are, you do not have to do the same. First step Make a program that can read and decode an assembly language program. The format of the assembly language is ... blank lines and lines beginning with # are to be ignored lines beginning with a space are instructions. All instructions begin with an "opcode" and may specify a register and a numeric operand as in ... STOP JUMP XXX XXX is the numeric operand, no register has been given ADD RRR XXX RRR is a register name, XXX is a numeric operand there is no way to provide a register but no numeric operand. opcodes, registers, and numeric operands will never have spaces within them. You should choose easily recognisable names for your registers, but their names must feature a number to identify the register, R3 or even just 3 are reasonable choices. numeric operands will be either names or numbers (ints only). any line beginning with anything other that space or # is a "directive". It is not an executable part of your program, it tells the assembler to do something. You must decide what the opcodes are and how they are named, but there must be enough for useful programs to be created. At a minimum there must be ... Arithmetic instructions for addition, subtraction, multiplication, and division. All specify a register and an operand, and result in the value stored in the register being changed accordingly. Arithmetic instructions must all have three forms, one in which the numeric operand represents a number, one in which the numeric operand represents an address / memory location, and the value stored there is the value used to modify the register, and one form in which the numeric operand is actually a register number. Instructions must have names that strongly suggest what they do. The three forms must have slightly different names or your assembler will be in trouble. for example, you could choose MULN RRR 123 to mean multiply the register by 123, and MULM RRR 123 to multiply the resister by whatever is stored in memory location 123. MULR RRR 3 to mean multiply register RRR by whatever is stored in register 3. Jump instructions in at least four forms: unconditional in which no register is given and execution causes your program to change course so that the next instruction comes from the position indicated by the numeric operand. Conditional jumps make the change of course only happen if the provided register satisfies a particular condition: less than zero, equal to zero, or greater than zero. Naturally all four (or more) jump instructions must have names that strongly suggest that a jump is going to happen, and any condition required. A stop instruction that when executed makes the program stop running. Useful recommendation: make stop be the instruction that has opcode 0, that way if you forget to put a stop at the end of a program, it will probably stop anyway. Three "load" instructions that specify a register and an operand. They set the register to a new value. Just like for arithmetic, one version provides the new value directly, another gives the address in memory in which the new value is to be found, and the third takes the new value from another register. A "store" instruction that overwrites the memory location given by the numeric operand with the register's value. The opposite of load. At least four input/output instructions, three of them use a register but ignore their numeric operand. One waits for the user to type an int value, and sets the register to that value. One does the opposite - prints the value of the register for the user to see. The third and fourth cause a string to be printed. Strings will be built into your assembly language program and will be represented by a number. In one of the instructions that number is the numeric operand, in the other it comes from the register. Of course you may add more to make the virtual machine more useful but be careful to get the requirements right first. Each line of the file that represents an instruction will create a single 32 bit int that represents the whole instruction and stores it in the next free location in the program code memory. The opcodes will all have numeric values that become part of the 32 bit instruction. Nothing clever is required for this, you could have a const array of strings in your program, the strings are the opcodes and their position in the array gives their numeric value. You must have suitably named const ints to record the positions. For instance if "MULN" happens to be at position 11 in that array, you might define const int ... ... , op_muln = 11, ..., ... ; Trying to remember what the numbers are would be hopeless. Sensible names for them is the only option. The opcode's value, register number, and numeric operand must all be squeezed into the one 32 bit instruction. The number of bits for each of the three parts is easy to determine. Set aside enough bits for the largest of your opcode values, then enough bits to specify a register number (go for 8 or 16 registers, less than 8 can be difficult, and more than 16 is just a waste) so 8 registers would need three bits, 16 would need 4. The remainder of the 32 bits give the numeric operand. For this first stage of your creation, your assembler should just print the things it calculates, first as three separate values, then as the combined 32 bit instruction value. For example, if you have a LOADN (code 3) instruction for putting a numeric value in a register, a MULN (code 11) instruction for multiplying by the operand, a STORE (code 4) operation for storing the register value in memory, and a STOP (code 21) instruction, and you have a total of 29 different instruction codes (they would fit in 5 bits), and 16 registers, this little bit of program LOADN R2 66 MULN R2 99 STORE R2 5 STOP when run it would work out what 66 * 99 is, and store it in data memory location 5, but we're not there yet. The numeric values for the three components of each instruction would be 3, 2, 66 11, 2, 99 4, 2, 5 21, 0, 0 in binary, stretched to the appropriate number of bits, that is 00011, 0010, 00000000000000001000000 01011, 0010, 00000000000000001100011 00100, 0010, 00000000000000000000101 10101, 0000, 00000000000000000000000 as 32 bit values in binary, they are trivially 00011001000000000000000001000000 01011001000000000000000001100011 00100001000000000000000000000101 10101000000000000000000000000000 as ints printed in decimal they would be 419430464, 1493172323, 553648133, 2818572288, but you would never waste time working that out, it is better to look at them in hexadecimal. Incidentally, the fourth instruction has a 1 in its most significant bit, so it would naturally be printed as a negative, -1476395008, so you should use unsigned ints to avoid confusion. If that were the entire assembly language program, those four values would be stored in positions 0, 1, 2, and 3 of the code memory array. But for this first stage we only want to check that the values are correct, so I would expect the output to be something like 00011, 0010, 00000000000000001000000 00011001000000000000000001000000 01011, 0010, 00000000000000001100011 01011001000000000000000001100011 00100, 0010, 00000000000000000000101 00100001000000000000000000000101 10101, 0000, 00000000000000000000000 10101000000000000000000000000000 displaying everything in binary will make it very easy to check correctness. Hexadecimal would be not quite so easy but much more compact. Second step We will add a directive and assemble a program. If a line in the assembly language program file begins with the word "label" that will be followed by another word that will be recorded as the name for the position of the next (as yet unseen) instruction in program code memory. This is easy to handle, just give your assembler a vector "names" of strings and a vector "places" of ints, and also count the instructions as they are encountered. If your program started as above LOADN R2 66 MULN R2 99 STORE R2 5 LABEL end STOP then when the LABEL end directive is seen you will have already seen three instructions, so "end" would be the first string in names, and 3 would be the first int in places. Later on, if we see an instruction that says JUMP end your assembler would search the vectors and know that the numeric operand of this JUMP instruction should be 3. This means that your program must become a "two pass" assembler, which is the normal thing. It means that it reads the assembly language program file twice. On the first pass it simply counts instructions and puts the right values in the names and places vector, no conversions to binary are done. On the second pass you already know the locations/numbers associated with all labels, so do not re-process them, just use the two vectors as they are. This is when you convert the program into its binary form as unsigned ints and store them is the program code memory. Third step At some point, you should turn off the printing of the instruction parts we had at the end of step one, but make them conditional so that it will be easy to put them back if things go wrong in the future. Two more directives, easy after you have done the second step. One directive consists of the word DATA followed by another word and then an int value. e.g. DATA count 35 This is a declaration of a variable, and you'll need a vector of strings for this. Whenever you see a DATA directive, push the name onto this new vector and store the initial value, 35, in the corresponding position in the data memory area. Now, if your assembler sees a word in an instrucion and it doesn't match any label, search the new variables vector, and the position in the vector tells you the correct memory location for this variable. For example ... if your program has already seen three DATA directives and then encounters DATA size 17 during the first pass, "size" will be at position 3 in the vector and 17 will be at position 3 in the data memory. In the second pass when an instruction like LOAD R1 size size will be replaced by 3 in the instruction code, and that's all that is needed for (global) variables. The second new directive is called STRING. Once you see STRING at the beginning of a line, take the following word, replace all of its underlines _ with spaces, and push it onto the vector of output strings I mentioned right at the top. Having _ represent a space just means that you have a simple way to read the whole string. It is the assembly language programmer's responsibility to keep not of what order the strings appeared in. Now the two mysterious instructions for printing strings can do their jobs. Let's say the first was called OUTSN (OUTput String by Number) and OUTSR (OUTput String from Register) Then on execution, the instrcution OUTSN 4 would cause the 4th recorded string to be printed, as would the sequence LOADN R2 4 OUTSR R2 0 Pick on a character that isn't of much use in ordinary strings, something like \ or ~, and make it stand for newline, but only when your assembly language program is actually running. If you had STRING was_the_answer~ then "was the answer~" would be in the strings memory, but when your program executes an OUTSN or OUTSR for that string, it would be printed with a real newline \n at the end instead of the ~. To recap, after this third step, your two-pass assembler will be creating a binary executable program in the program code memory area as well as handling the new directives. Fourth step This probably seems as though it would be the hardest, but it shouldn't be. Run the program that your assembler just put together. You just need to create an int variable (I'll call it PC) initial value 0, to say where the next instruction to be executed can be found. Running a program requires just this sequence: 1. instr = code_memory[PC]; 2. PC += 1; 3. split instr into its three parts opcode, reg, and number, or whatever you want tcall them. Beware, "register" has a special meaning in C++. 4. if (opcode == op_loadn) ... else if (opcode == op_loadr) ... else ... You have already got everything set up in the program code and data and strings memory area, all that's needed is an array of ints for the registers and you're ready to go. Once again, start off by making error tracing easy. Every time an instruction is executed, print it in a readable way (opcode printed as MULN rather than 11) and print the values of the register it uses and any memory location used just before and just after executing the instruction. No need to be clever there: if the numeric operand is N, and N is between 0 and the size of the data memory array, print the value stored there. Don't waste time working out whether or not N is actually going to be used as an address. Make it easy to turn that tracing on and off so that you don't have your program's proper output hidden by all of that. Lots of tests, you must be confident that everything works. Some samples, the meanings of the opcodes should be obvious from the examples I gave above. Add up the squares of all the numbers between 0 and what the user says: STRING number?_ STRING the_sum_is_ STRING ~ DATA n 0 LOADN R2 0 (R2 will keep the running total) OUTSN 0 READN R1 0 (the 0 is useless but required) LABEL loop STORE R1 n (keep N safe) JZER R1 end (we're adding up the squares in reverse order so 0 means done) MULR R1 1 (multiply R1 by R1) ADDR R2 1 (add register 1 to the running total) LOADM R1 n (recover saved n ready for the loop because R1 has been squared) SUBN R1 1 JUMP loop LABEL end OUTSN 1 OUTR R2 0 OUTSN 2 STOP An (ugly) multiplication table up to 12. Throughout this, R3 is used as a temporary STRING _ STRING ~ LOADN R1 1 (R1 is for the row) LABEL outer LOADN R2 1 (R2 is the the column) LABEL inner LOADR R3 1 (R3 = R1, the row number) MULR R3 2 (R3 *= R2, the column) OUTR R3 0 OUTSN 0 ADDN R2 1 (increment column for next time round the loop) LOADR R3 2 (preparing to compare column number (R2) with 13) SUBN R3 13 JNEG R3 inner (column - 13 is negative means more columns to come) OUTSN 1 ADDN R1 1 (increment the row for the outer loop) LOADR R3 1 (preparing to compare row with 13) SUBN R3 13 JNEG R3 outer STOP Extra Credit possibilities: Make functions work. This involves using the part of the data memory that is not occupied by the program's variables/data to make a stack, and creating another speical variable, like PC, that records where the end of the stack is. I'll call it SP for Stack Pointer. To push a value onto the stack, just add one to SP and store the value at the memory location that SP now indicates. To pop something from the stack it is just the reverse - take the value in the memory location that SP is indicating then subtract one from it (from SP that is). You would also need to invent a few new instructions, but they are not complicated. One of them is a push instruction whose operand is always an address in the data memory area, and it does exactly what you just saw. The other is a pop instruction, the value removed from the stack is stored in the memory location given by the instruction's operand. For example, this sequence PUSH 2 PUSH 7 POP 2 POP 7 would have the effect of swapping the values of the variables in locations 2 and 7. Functions are just stretches of code with a LABEL at the beginning that is the function's name. There are two essential obvious things to making functions work. One is that there must be a way to get back to where the program was before the call when the end of the function is reached. The other is that local variables that the function uses must not overwrite existing variables. The assembly language programmer is responsible for all of those things, you do not have to automate those actions or anything like that. Both of those needs are satisfied by the stack and the few new instructions. One of the new instructions is for calling a function (it's name is often CALL or something very close to that). It is exactly the same as the unconditional jump instruction except that it pushes the old value of PC onto the stack as well as changing it to the jump address (i.e. the beginning of the function). At the end of a function there is the counterpart to call, a "return" instruction. It has no register and no operand. All it does is pop a value from the stack and set PC equal to it. If you have been using the stack in the proper way, the value popped will be the old PC value that the call instruction pushed, so the program will jump back to the instruction after that call. Local variables: the way a modern programming language manages local variables if more complicated than this, but this will do the trick. Before every call instruction the programmer adds a sequence of "push" instruction, one for every variable that the calling function is using. This means that their values are safe on the stack so the called function can do anything it wants. After the call instruction the programmer adds a sequence of pop instructions the same as the pushes but in reverse order, so the saved values of those variables have been restored. This could of course be turned inside out: the caller does nothing but call the function. The function's first instructions are to push the variables that it is going to use, and just before the return instruction it pops them back to their original values. Parameters and return value? This is often handled by programmers agreeing on the details of a very simple scheme: just before calling a function, put the first parameter in register R1, the second in R2, and so on. So long as the people who write the function and the caller know the scheme it works very easily. Return value is the same: another agreement between programmers, perhaps functions always store their result (if there is one) in R1 just before the return instruction. An example. A factorial function and its caller STRING ~ LABEL factorial LOADN R2 1 (we'll build the result in R2 and use R1 to drive the loop) LABEL loop JZER R1 done (jump if the paramter is 0, all finished) MULR R2 R1 (result so far multiplied by N) SUBN R1 1 (decrement N) JUMP loop LABEL done LOADR R1 2 (put the result in R2 where it is expected) RET ... ... ... (this shows that function being used. I'll pretend that ... varibles x, count, and fido are currently in use. We ... want to set fido to the factorial of x) LOAD R1 x (put the parameter where it is expected to be) PUSH x PUSH count PUSH fido CALL factorial POP fido (the factorial function didn't use the three saved variables POP count but in general the programmer would have no way of knowing POP x that) STORE R1 fido (put the result where it is wanted) OUTR R1 0 OUTSN 0 STOP