Variables

This covers: the connection between a variable and the memory location that it occupies, as seen by the compiler and the running program. It covers the concepts of scope and extent, the purpose of a linker, how a program accesses its variables.

A program creates a mapping from the names of variables (or more accurately, the names of identifiers) to their values. In most cases, this is a two part mapping. The compiler produces its own mapping from identifier names to a set of static attributes (these are all the things that the compiler could reasonably know about an identifier: its type, its address, and so on) in its symbol table. A running program maintains the other part of the mapping, from addresses to values, simply by storing appropriate values in particular memory locations.

Terminology (informal definitions)
Identifier A programmer-defined name, not limited to variable names, but also including function names, data-structure names, and other things.
Address Memory is considered to be a huge array of identical storage locations. The address is just a number used to identify a particular one.
Attribute (of an identifier) one of the pieces of information that may be known about an identifier, including what kind of identifier it is (variable, function name, etc), what its type is (int, float, void->int, etc), its value, its address in memory, etc.
Static (referring to an attribute) one that does not change, and can be known by the compiler. In most languages, this describes type and address.
Dynamic (referring to an attribute) one that can change, and can only be known when the program is running. In most cases, this only covers value.

       When a compiler processes a program, it reads through it in an orderly, top-to-bottom manner. When it reaches a declaration, it keeps a note of any information provided in a special data structure called the symbol table. At appropriate times the compiler may discard outdated information from that table.
       Consider what happens when the compiler processes this C program:
           1- int iii,jjj;
           2- 
           3- void fff(void)
           4- { int xxx,yyy;
           5-   ...
           6-   iii=jjj*72;
           7-   xxx=yyy+1;
           8-   ... }
           9-
          10- int mmm;
          11- 
          12- int ggg(void)
          13- { float iii;
          14-   ...
          15-   iii=jjj+1;
          16-   ... }
Initially, the symbol table is empty. On reaching line 1, it is noted that iii and jjj are now known, and that they are both ints, a primitive compiler might also decide what addresses in memory they will occupy. A modern compiler recognises that this file might be just part of a larger program that will be linked together later. Other parts of the program (that will be, or have been, separately compiled) will also create an unknown number of their own global variables, so it is not possible to know what memory addresses will still be available.
       This presents quite a problem. The usual solution, which is not very efficient but does at least work, is to have the compiler produce not a proper executable file, but instead a compromise called an Object File. An object file contains blocks of proper executable code, together with some extra notations, including plain text identifier names. Once all the files that make up a program have been compiled, their object files are read by a linker. A linker can see exactly how much space is needed in total for global variables and everything else, so it can allocate particular addresses for things. Armed with this information, the linker combines all the blocks of executable code from all the object files ento one real executable file. Using the plain text labels, it can link all the uses of a global variable with its one real definition.

So, in summary, the compiler produces an object file, which contains executable code for the program, but with any references to global things (variables or functions) appearing by name instead of in their executable binary form. The linker later works out the correct addresses and substitutes them in. This results in a lot of extra work, but solves the problem of not being able to know addresses of global variables, and also makes globals truly global, so that they can be accessed even from other files that form part of the same program.

Getting back to the example program, and making up a simple assembly language, we might expect the function fff to be translated to assembly code thus:

           1- iii: bytes 4
              jjj: bytes 4
           3- fff:
           4-     ...
           5-     ...
           6-     MUL   jjj, #72, iii
           7-     ADD   yyy, #1, xxx
           8-     ...
                  RET
If we imagine that the machine code translations of the assembly code mnemonics are (in hexadecimal) MUL=3A, ADD=3B, RET=44, one byte constant operand=F7, memory address operand=FF, and note that 72 in hexadecimal is 48, the object file produced would contain the following mix of executable code and special notations:
           1- <"iii" is here>, 00, 00, 00, 00,
              <"jjj" is here>, 00, 00, 00, 00
           3- <"fff" is here>,
           4- ...
           5- ...
           6- 3A, FF, <address of "jjj">, F7, 48, FF, <address of "iii">
           7- 3B, ......., F7, 01, .......
           8- ...
              44
The text enclosed in pointy brackets <...> represents notations left in the object file for processing by the linker. The linker can work out what all the addresses really will be, and substitute in exact numbers as requested. The large numbers of dots on line 7 represent the fact that we haven't yet seen how to deal with local variables.
       So, by the time the compiler has reached the beginning of line 4, it has output the following codes to the object file it is creating:
           <"iii" is here>, 00, 00, 00, 00,
           <"jjj" is here>, 00, 00, 00, 00
           <"fff" is here>
and its symbol table will contain three pieces of information:
          iii   int          globalvar
          jjj   int          globalvar
          fff   void->void   function
which is all it needs to know about those three identifiers, given that it will be referring to them by name.
       When it comes to the declaration int xxx,yyy; on line 4, things are different. xxx and yyy are local variables, belonging solely to the function fff. Nothing defined in any other file should even know they exist, so there is no need for the linker to be told about them.
       Compilers handle local variables entirely by themselves, without involving the linker at all. This is possible because a single function is never split over a number of files and compiled separately. When the compiler reaches a function definition, it can see the whole of the definition, and therefore knows exactly how much space is required for local variables, so it can allocate addresses itself.
       Almost.

       Local variables "live" (i.e. occupy memory) in the stack. The stack is an expandable area of memory, usually bounded above by the highest addresses in the CPU's address space and and bounded below by the value currently stored in the special CPU register known as the stack pointer (SP). Stacks usually start at high addresses and grow downwards. To make more spce on the stack, just decrement SP. To release stack space back to the system, increment SP.
       On entering a function (as a result of a function call, of course), the stack pointer is decremented by the number of bytes required to hold all the local variables (if we assume 4 bytes per integer, that means 8 bytes in the example). The eight bytes thus added to the stack become the homes of the function's local variables. Because the stack might grow again at some point, the value of the stack pointer is immediately copied into another special purpose CPU register called the Frame pointer (FP). All accesses to local variables are made relative to FP.
       When a function exits, the space used for its locals is released by reincrementing the stack pointer. The frame pointer must also be returned to its original value so that the calling function (the one about to be returned to) can still find its locals. This is usually done by saving the old FP on the stack as soon as a new function is entered, and getting it back just before exitting. This adds an extra 4 bytes to the stack space requirement for each function.

       In the example, at line 4, the compiler notes that 8 bytes are required for local variables, so a total of 12 bytes of stack are required. It generates these three instructions:
           4-     SUB   #12, SP
                  MOV   FP, 0(SP)
                  MOV   SP, FP
The second instruction is saving the frame pointer's value on the stack. The notation 0(SP) [the 0 may be any number, the SP may be any register name] indicates a special form of addressing understood by virtually all CPUs. It means: take the contents of the register SP (this will be the address of the lowest byte of the stack), add 0 to it (no effect), and use the result as an address (or pointer). So the frame pointer is stored wherever the stack pointer points. The third instruction copies the contents of the stack pointer into the frame pointer (so now FP points to the place where the old FP was saved).

At the end of any function, there should also be three instructions: one to recover the old FP, one to reduce the stack to its original size, and one to jump back to the caller:

           8-     ADD   #12, SP
                  MOV   0(FP), FP
                  RET
Also at line 4, the compiler adds to the symbol table information about the two new variables, so the symbol table now contains:
          iii   int          globalvar
          jjj   int          globalvar
          fff   void->void   function
          xxx   int          localvar    offset=4
          yyy   int          localvar    offset=8
The new entries contain nothing unusual, except for the offset=N part. The stack has been enlarged by 12 bytes, to make room for the saved FP and these two ne variables. Each of those three items probably require 4 bytes. As the saved FP occupies the very beginning of the new stack portion (or Stack Frame), the first new variable will be offset by 4 bytes from the beginning of the frame, and the second by 8. "Offset=4" means that xxx is stored 4 bytes from the beginning of the current stack frame.

       We have already seen that when we reach line 6, dealing entirely with global variables "iii=jjj*72", the compiler has in mind the assembly code "MUL jjj, #72, iii" when it adds the bytes "3A, FF, <address of "jjj">, F7, 48, FF, <address of "iii">" to the object file.
       When the compiler reaches line 7, which deals with local variables: "xxx=yyy+1", it knows that xxx starts 4 bytes from the beginning of the stack frame, and yyy starts 8 bytes from the beginning of the stack frame, so it can use the assembly code "ADD 8(FP), #1, 4(FP)" to represent it, and add to the object file the bytes: "3B, FE, 08, F7, 01, FE, 04" (assuming that the hexadecimal code FE represents addressing relative to the frame pointer). This does not leave any notes for the linker at all. This is another small reason why local variables are better than globals when all other considerations are equal.

In summary, the function:

           3- void fff(void)
           4- { int xxx,yyy;
           5-   ...
           6-   iii=jjj*72;
           7-   xxx=yyy+1;
           8-   ... }
produces the assembly/object code:
           3-  <"fff" is here>
           4-  SUB   #12, SP
               MOV   FP, 0(SP)
               MOV   SP, FP
           5-   ...
           6-  MUL   <address of "jjj">, #72, <address of "iii">
           7-  ADD   8(FP), #1, 4(FP)
           8-  ...
               ADD   #12, SP
               MOV   0(FP), FP
               RET
        After processing the end of the function, the compiler removes from the symbol table any entries describing its local variables. It then continues to process the rest of the program in the normal way.
       Eventually, line 13 is reached: "float iii", which declares a local variable with the same name as an already existing global. This presents no problem. A new entry is added to the symbol table exactly as before, and the symbol table behaves as a stack, so when inside the function ggg any access to iii will find the newest entry, describing it as a local float. When ggg is finished, the new entry for iii is removed, but the old one describing it as a global int is still there, completely unchanged. When the end of the program is reached, the symbol table contains:
          iii   int          globalvar
          jjj   int          globalvar
          fff   void->void   function
          mmm   int          globalvar
          ggg   void->int    function

       The variable mmm is not quite as global as iii and jjj. The function fff can not possibly make any use of mmm, because there is no mention of mmm in the symbol table when fff is being processed. On the other hand, both fff and ggg can make full use of jjj. The int variable iii is again different. It is global, but can not be accessed inside ggg for a different reason: its declaration has been hidden by a local declaration using the same name.

       iii, jjj, and mmm are all global variables, but their availabilities are different. Obviously the common distinction between global and local isn't enough to describe everything that can happen.
       The Scope of a variable is the portion of the program in which it could reasonably be said to exist. The rules of C and Pascal say that the scope of a variable begins exactly where its name first appears in a declaration, and stretches to the end of the block that the declaration appeared in (for "top level" declarations which are not inside any block, the end of the scope is the end of the file). Algol's rule is that the scope of a variable covers the whole block in which the declaration appeared. If our example program were translated to Algol, the variable mmm would be accessible everywhere, even inside fff. The scope of a variable is very much a static attribute. The simple rule is that a variable is "in scope" exactly when it is in the compiler's symbol table. The Algol rules demand that a compiler first scan a block to preprocess any declarations before translating any of the statements it contains.
       The Visibility of a variable is the portion of the program in which it can be accessed. The visibility of a variable is always a subset of its scope. A variable is visible exactly when it is in scope, and no newer variable with the same name is also in scope.
       The concepts of scope and visibility are enough to explain the differences between iii, jjj, and mmm.

       Variables also have an Extent (sometimes called a Lifetime, terms can get confusing), which may be either Static or Dynamic; these terms describe the existence of memory allocated for the variable's use. A static variable is given an allocation of memory when the program is being initialised, before it really starts to run, and keep that memory until the program terminates. All global variables are static.
       A dynamic variable is given an allocation of memory at some appropriate time during the execution of a program, and loses it later. The process of allocation and deallocation of memory for dynamic variables may be repeated many times. All of the local variables in the example program are dynamic. In Pascal, local variables can only be dynamic; in C and Algol they can be either, it is the programmer's choice; in Fortran they can only be static.

       It is very easy to confuse scope and extent. They are very similar concepts, and seem to be identical in most cases. In reality, they can't possibly be identical, because they talk about completely different things. The scope of a variable is a region of the program; the extent of a variable is a period of time. The extent of a variable is very often the period of time that the program is executing inside that variable's scope, but does not have to be.

       In this C program:
          1- int a;
          2-
          3- void e(void)
          4- { ... }
          5-
          6- void f(void)
          7- { int b;
          8-   static int c;
          9-   ...
         10-   ... }
         11-
         12- int d;
         13- ...
       The variable a has as its scope the entire program from line 1 onwards. Its declaration is not inside any block, so its scope stretches from its point of declaration to the end of the program. Its extent is the whole run of the program; it is allocated memory when the program starts, and keeps it until it ends.
       The variable b's scope is the entire function f from line 6 to line 10. Outside of the function f the compiler will not even admit that b exists. The extent of b is a particular activation of f. Whenever f is called, memory is allocated for b, and it keeps that memory only until that call returns. An important thing to think about is what happens if f calls e, which it is perfectly entitled to do. e is outside the scope of b, so the compiler is unaware of b's existence, and it can't possibly be used. On the other hand, f has been called, and has not yet exitted; when e finishes, control will return to f which will expect to find all its variables intact. So, the memory allocated for b must remain allocated. Therefore that particular call of e is within the extent of b.
       On the other hand, if e is called directly from another function, perhaps main, it will be outside both the scope and the extent of b.
       The variable c is different. Its scope is still the function f from line 8 to line 10, but its extent is the whole run of the program. That is what the C keyword static means. Although c is totally private to f, and other functions can have their own variables called c without causing interference, it still "lives" for ever. Its memory is allocated at the same time as the global variables', and is not released until the program finishes.
       If the function f leaves data in b and c before exitting, then the next time it is called, it can absolutely rely on that information still being in c, but whatever it left in b will be gone.
       The variable d is global in the usual sense, although its scope is slightly reduced, being from its declaration point, line 12, to the end of the program. Its extent is still the entire run of the program. All static variables are allocated memory before the program really starts to run. So again it is possible to jump in and out of the scope of d, and occasionally lose the ability to access it, but it is not possible to jump in and out of its extent.
       If f were to be called recursively, there would be multiple versions of b, all with the same scope, but different extents. There would only be one version of c.

This brings us to an important and deceptively non-trivial question: What is a variable? Obviously a variable is not just a name that may be used in an assignment. Nor can it be a memory location: variables can cease to exist, but memory locations are permanent. The only workable answer is that a variable is a two-part object, consisting of the association between a declaration and a memory location.
       If a memory location that was used for a dynamic variable gets recycled after the end of that variable's extent, and is reused as the location of another variable, it is still very definitely a different variable. Similarly, if a function is called recursively, many versions of its dynamic variables are brought into existence. Each was created from the same declaration, but occupies a different memory location. Again, they are all clearly different variables.

We may also ask "how global is a global variable?". If a program is split into a number of files, each compiled separately, how do the global variables behave? Fortran, Pascal, and Algol have no standardised support for separate compilation, so the question is left up to the designers of individual compilers (if it is answered at all). In C, obscure linguistic tricks are used.
       In C, as in other languages, all global (i.e. top-level) variables have static extent. Programmers do not put the keyword static in front of a top level declaration to indicate static extent, it is automatic. However, it is possible to put the keyword static in front of a global variable declaration, in which case it means something else all together. A normal top-level variable declaration, such as int i;, not inside any block, produces a global static variable whose declaration is "exported to the linker". This means that it can be accessed from other separately compiled units of the same program when they are linked together. Adding the word static, producing static int i; still produces a global static variable, but prevents its name from being exported to the linker, so it will be a different variable from any other global integer called i in other separately compiled program parts.

Language Specifics
       C: Declarations that appear inside a block (between { and }) have as their scope the region between their declaration and the } that terminates that block. Declarations that are not inside any block (i.e. Top-Level declarations) have as their scope the whole program from their declaration point onwards. All globals have static extent. Locals have dynamic extent (limited to the lifetime of the block they were declared in) unless the declaration begins with static.
       Pascal: Declarations that appear inside a block (between begin and end) have as their scope the region between their declaration and the end that terminates that block. Top-level declarations (not inside any block) have as their scope the whole program from their declaration point onwards. All globals have static extent. All locals have dynamic extent (limited to the lifetime of the block they were declared in).
       Algol 60: All declarations appear inside a block. An Algol program is just one big block. All declarations have as their scope the entire block that they are declared inside (not just from their declaration point on). The extent of any variable is the lifetime (period of time between entry and exit) of their block; if a variable's block is the whole program, then it is effectively static, although technically, all variables are dynamic in extent. The one exception is that if a variable is preceeded by the keyword own, the variables it creates have true static extent.
       Fortran: All variables have as their scope the entire program unit in which they appear. Program units are whole subroutines, whole functions, or the main program. There is no such thing as a global variable (although they may be simulated with "common blocks"). All variables have static extent.