Variables
This covers: the connection between a variable and the memory location
that it occupies, as seen by the compiler and the running program. It covers the
concepts of scope and extent, the purpose of a linker, how a program accesses
its variables.
A program creates a mapping from the names of variables (or more accurately, the names
of identifiers) to their values. In most cases, this is a two part mapping. The compiler
produces its own mapping from identifier names to a set of static attributes (these are all
the things that the compiler could reasonably know about an identifier: its type, its address,
and so on) in its symbol table. A running program maintains the other part of the mapping,
from addresses to values, simply by storing appropriate values in particular memory locations.
Terminology (informal definitions)
Identifier | A programmer-defined name, not limited to variable names, but also including
function names, data-structure names, and other things. |
Address | Memory is considered to be a huge array of identical storage locations. The
address is just a number used to identify a particular one. |
Attribute | (of an identifier) one of the pieces of information that may be known about
an identifier, including what kind of identifier it is (variable, function name, etc), what
its type is (int, float, void->int, etc), its value, its address in memory, etc. |
Static | (referring to an attribute) one that does not change, and can be known by the
compiler. In most languages, this describes type and address. |
Dynamic | (referring to an attribute) one that can change, and can only be known when
the program is running. In most cases, this only covers value. |
When a compiler processes a program, it reads through it in an orderly, top-to-bottom manner.
When it reaches a declaration, it keeps a note of any information provided in a special data structure
called the symbol table. At appropriate times the compiler may discard outdated information from
that table.
Consider what happens when the compiler processes this C program:
1- int iii,jjj;
2-
3- void fff(void)
4- { int xxx,yyy;
5- ...
6- iii=jjj*72;
7- xxx=yyy+1;
8- ... }
9-
10- int mmm;
11-
12- int ggg(void)
13- { float iii;
14- ...
15- iii=jjj+1;
16- ... }
Initially, the symbol table is empty. On reaching line 1, it is noted that iii and jjj
are now known, and that they are both ints, a primitive compiler might also decide what addresses
in memory they will occupy. A modern compiler recognises that this file might
be just part of a larger program that will be linked together later. Other parts of the program
(that will be, or have been, separately compiled) will also create an unknown number of their own
global variables, so it is not possible to know what memory addresses will still be available.
This presents quite a problem. The usual solution,
which is not very efficient but does at least work,
is to have the compiler produce not a proper executable file, but instead a compromise called an
Object File. An object file contains blocks of proper executable code, together with some extra
notations, including plain text identifier names. Once all the files that make up a program have been
compiled, their object files are read by a linker. A linker can see exactly how much space is needed in
total for global variables and everything else, so it can allocate particular addresses for things.
Armed with this information, the linker combines all the blocks of executable code from all the object
files ento one real executable file. Using the plain text labels, it can link all the uses of a global
variable with its one real definition.
So, in summary, the compiler produces an object file, which contains executable code for the program,
but with any references to global things (variables or functions) appearing by name instead of in their
executable binary form. The linker later works out the correct addresses and substitutes them in. This
results in a lot of extra work, but solves the problem of not being able to know addresses of global
variables, and also makes globals truly global, so that they can be accessed even from other files
that form part of the same program.
Getting back to the example program, and making up a simple assembly language, we might expect the function
fff to be translated to assembly code thus:
1- iii: bytes 4
jjj: bytes 4
3- fff:
4- ...
5- ...
6- MUL jjj, #72, iii
7- ADD yyy, #1, xxx
8- ...
RET
If we imagine that the machine code translations of the assembly code mnemonics are (in hexadecimal)
MUL=3A, ADD=3B, RET=44, one byte constant operand=F7, memory address operand=FF, and note that
72 in hexadecimal is 48, the object file
produced would contain the following mix of executable code and special notations:
1- <"iii" is here>, 00, 00, 00, 00,
<"jjj" is here>, 00, 00, 00, 00
3- <"fff" is here>,
4- ...
5- ...
6- 3A, FF, <address of "jjj">, F7, 48, FF, <address of "iii">
7- 3B, ......., F7, 01, .......
8- ...
44
The text enclosed in pointy brackets <...> represents notations left in the object file
for processing by the linker. The linker can work out what all the addresses really will be,
and substitute in exact numbers as requested. The large numbers of dots on line 7 represent
the fact that we haven't yet seen how to deal with local variables.
So, by the time the compiler has reached the beginning of line 4, it has output the following codes to the
object file it is creating:
<"iii" is here>, 00, 00, 00, 00,
<"jjj" is here>, 00, 00, 00, 00
<"fff" is here>
and its symbol table will contain three pieces of information:
iii int globalvar
jjj int globalvar
fff void->void function
which is all it needs to know about those three identifiers, given that it will be referring to
them by name.
When it comes to the declaration int xxx,yyy; on line 4, things are different. xxx and
yyy are local variables, belonging solely to the function fff. Nothing defined in any other
file should even know they exist, so there is no need for the linker to be told about them.
Compilers handle local variables entirely by themselves, without involving the linker at all.
This is possible because a single function is never split over a number of files and compiled separately.
When the compiler reaches a function definition, it can see the whole of the definition, and therefore
knows exactly how much space is required for local variables, so it can allocate addresses itself.
Almost.
Local variables "live" (i.e. occupy memory) in the stack. The stack is an expandable area of memory,
usually bounded above by the highest addresses in the CPU's address space and
and bounded below by the value currently stored in
the special CPU register known as the stack pointer (SP). Stacks usually start at high addresses and
grow downwards. To make more spce on the stack, just decrement SP. To release stack space back to the
system, increment SP.
On entering a function (as a result
of a function call, of course), the stack pointer is decremented
by the number of bytes required to hold all the local variables (if we assume 4 bytes per integer,
that means 8 bytes in the example). The eight bytes thus added to the stack become the homes of
the function's local variables. Because the stack might grow again at some point, the value of the stack
pointer is immediately copied into another special purpose CPU register called the Frame pointer (FP).
All accesses to local variables are made relative to FP.
When a function exits, the space used for its locals
is released by reincrementing the stack pointer.
The frame pointer must also be returned to its original value so that the calling function (the one about
to be returned to) can still find its locals. This is usually done by saving the old FP on the stack
as soon as a new function is entered, and getting it back just before exitting. This adds an extra 4 bytes
to the stack space requirement for each function.
In the example, at line 4, the compiler notes that 8 bytes are required for local variables, so a total of
12 bytes of stack are required. It generates these three instructions:
4- SUB #12, SP
MOV FP, 0(SP)
MOV SP, FP
The second instruction is saving the frame pointer's value on the stack. The notation 0(SP) [the 0
may be any number, the SP may be any register name] indicates a special form of addressing understood
by virtually all CPUs. It means: take the contents of the register SP (this will be the address of the
lowest byte of the stack), add 0 to it (no effect), and use the result as an address (or pointer). So
the frame pointer is stored wherever the stack pointer points. The third instruction copies the contents of
the stack pointer into the frame pointer (so now FP points to the place where the old FP was saved).
At the end of any function, there should also be three instructions: one to recover the old FP, one to
reduce the stack to its original size, and one to jump back to the caller:
8- ADD #12, SP
MOV 0(FP), FP
RET
Also at line 4, the compiler adds to the symbol table information about the two new variables, so the
symbol table now contains:
iii int globalvar
jjj int globalvar
fff void->void function
xxx int localvar offset=4
yyy int localvar offset=8
The new entries contain nothing unusual, except for the offset=N part. The stack has been enlarged by
12 bytes, to make room for the saved FP and these two ne variables. Each of those three items probably
require 4 bytes. As the saved FP occupies the very beginning of the new stack portion (or Stack Frame),
the first new variable will be offset by 4 bytes from the beginning of the frame, and the second by 8.
"Offset=4" means that xxx is stored 4 bytes from the beginning of the current stack frame.
We have already seen that when we reach line 6, dealing entirely with global variables "iii=jjj*72",
the compiler has in mind the assembly code "MUL jjj, #72, iii" when it adds the bytes
"3A, FF, <address of "jjj">, F7, 48, FF, <address of "iii">" to the object file.
When the compiler reaches line 7, which deals with local variables: "xxx=yyy+1", it knows that
xxx starts 4 bytes from the beginning of the stack frame, and yyy starts 8 bytes from the
beginning of the stack frame, so it can use the assembly code "ADD 8(FP), #1, 4(FP)" to represent
it, and add to the object file the bytes: "3B, FE, 08, F7, 01, FE, 04" (assuming that the
hexadecimal code FE represents addressing relative to the frame pointer). This does not leave any
notes for the linker at all. This is another small reason why local variables are better than globals
when all other considerations are equal.
In summary, the function:
3- void fff(void)
4- { int xxx,yyy;
5- ...
6- iii=jjj*72;
7- xxx=yyy+1;
8- ... }
produces the assembly/object code:
3- <"fff" is here>
4- SUB #12, SP
MOV FP, 0(SP)
MOV SP, FP
5- ...
6- MUL <address of "jjj">, #72, <address of "iii">
7- ADD 8(FP), #1, 4(FP)
8- ...
ADD #12, SP
MOV 0(FP), FP
RET
After processing the end of the function, the compiler removes from the symbol table any entries
describing its local variables. It then continues to process the rest of the program in the normal way.
Eventually, line 13 is reached: "float iii", which declares a local variable with the same name
as an already existing global. This presents no problem. A new entry is added to the symbol table exactly
as before, and the symbol table behaves as a stack, so when inside the function ggg any access
to iii will find the newest entry, describing it as a local float. When ggg is finished,
the new entry for iii is removed, but the old one describing it as a global int is still there,
completely unchanged. When the end of the program is reached, the symbol table contains:
iii int globalvar
jjj int globalvar
fff void->void function
mmm int globalvar
ggg void->int function
The variable mmm is not quite as global as iii and jjj. The function fff
can not possibly make any use of mmm, because there is no mention of mmm in the symbol
table when fff is being processed. On the other hand, both fff and ggg can
make full use of jjj. The int variable iii is again different. It is global, but
can not be accessed inside ggg for a different reason: its declaration has been hidden by
a local declaration using the same name.
iii, jjj, and mmm are all global variables, but their availabilities are
different. Obviously the common distinction between global and local isn't enough to describe
everything that can happen.
The Scope of a variable is the portion of the program in which it could reasonably be said to
exist. The rules of C and Pascal say that the scope of a variable begins exactly where its name first
appears in a declaration, and stretches to the end of the block that the declaration appeared in (for
"top level" declarations which are not inside any block, the end of the scope is the end of the file).
Algol's rule is that the scope of a variable covers the whole block in which the declaration appeared.
If our example program were translated to Algol, the variable mmm would be accessible everywhere,
even inside fff. The scope of a variable is very much a static attribute. The simple rule is
that a variable is "in scope" exactly when it is in the compiler's symbol table. The Algol rules demand
that a compiler first scan a block to preprocess any declarations before translating any of the
statements it contains.
The Visibility of a variable is the portion of the program in which it can be accessed.
The visibility of a variable is always a subset of its scope. A variable is visible exactly when
it is in scope, and no newer variable with the same name is also in scope.
The concepts of scope and visibility are enough to explain the differences between iii,
jjj, and mmm.
Variables also have an Extent (sometimes called a Lifetime, terms can get
confusing), which may be either Static or Dynamic; these terms describe
the existence of memory allocated for the variable's use. A static variable is given
an allocation of memory when the program is being initialised, before it really starts
to run, and keep that memory until the program terminates. All
global variables are static.
A dynamic variable is given an allocation of memory at some appropriate time during the
execution of a program, and loses it later. The process of allocation and deallocation
of memory for dynamic variables may be repeated many times. All of the local variables
in the example program are dynamic. In Pascal, local variables can only be dynamic;
in C and Algol they can be either, it is the programmer's choice; in Fortran they can
only be static.
It is very easy to confuse scope and extent. They are very similar concepts, and seem to
be identical in most cases. In reality, they can't possibly be identical, because they
talk about completely different things. The scope of a variable is a region of the program;
the extent of a variable is a period of time. The extent of a variable is very often
the period of time that the program is executing inside that variable's scope, but does not
have to be.
In this C program:
1- int a;
2-
3- void e(void)
4- { ... }
5-
6- void f(void)
7- { int b;
8- static int c;
9- ...
10- ... }
11-
12- int d;
13- ...
The variable a
has as its scope the entire program from line 1 onwards.
Its declaration is not inside any block, so its scope stretches from its point
of declaration to the end of the program. Its extent is the whole run of the program;
it is allocated memory when the program starts, and keeps it until it ends.
The variable b's
scope is the entire function f from line 6 to line
10. Outside of the function f the compiler will not even admit that b
exists. The extent of b is a particular activation of f. Whenever
f is called, memory is allocated for b, and it keeps that memory only
until that call returns. An important thing to think about is what happens if f
calls e, which it is perfectly entitled to do. e is outside the scope
of b, so the compiler is unaware of b's existence, and it can't possibly be
used. On the other hand, f has been called, and has not yet exitted; when e
finishes, control will return to f which will expect to find all its variables
intact. So, the memory allocated for b must remain allocated. Therefore that
particular call of e is within the extent of b.
On the other hand, if e
is called directly from another function, perhaps
main, it will be outside both the scope and the extent of b.
The variable c
is different. Its scope is still the function f
from line 8 to line 10, but its extent is the whole run of the program. That is what
the C keyword static means. Although c is totally private to
f, and other functions can have their own variables called c
without causing interference, it still "lives" for ever. Its memory is allocated
at the same time as the global variables', and is not released until the program finishes.
If the function f leaves data in b and c before exitting,
then the next time it is called, it can absolutely rely on that information still
being in c, but whatever it left in b will be gone.
The variable d
is global in the usual sense, although its scope is slightly
reduced, being from its declaration point, line 12, to the end of the program. Its
extent is still the entire run of the program. All static variables are allocated
memory before the program really starts to run. So again it is possible to jump in
and out of the scope of d, and occasionally lose the ability to access it,
but it is not possible to jump in and out of its extent.
If f were to be called recursively, there would be multiple versions of
b, all with the same scope, but different extents. There would only be
one version of c.
This brings us to an important and deceptively non-trivial question: What is a variable?
Obviously a variable is not just a name that may be used in an assignment.
Nor can it be a memory location: variables can cease to exist, but memory locations
are permanent. The only workable answer is that a variable is a two-part object, consisting
of the association between a declaration and a memory location.
If a memory location that was used for a dynamic variable gets recycled after the end
of that variable's extent, and is reused as the location of another variable, it is still
very definitely a different variable. Similarly, if a function is called recursively,
many versions of its dynamic variables are brought into existence. Each was created from
the same declaration, but occupies a different memory location. Again, they are all clearly
different variables.
We may also ask "how global is a global variable?". If a program is split into a number
of files, each compiled separately, how do the global variables behave?
Fortran, Pascal, and Algol have no standardised support for separate compilation,
so the question is left up to the designers of individual compilers (if it is answered
at all). In C, obscure linguistic tricks are used.
In C, as in other languages, all global
(i.e. top-level) variables have static extent.
Programmers do not put the keyword static in front of a top level declaration
to indicate static extent, it is automatic. However, it is possible to put the keyword
static in front of a global variable declaration, in which case it means
something else all together. A normal top-level variable declaration, such as
int i;, not inside any block, produces a global static variable whose
declaration is "exported to the linker". This means that it can be accessed from
other separately compiled units of the same program when they are linked together.
Adding the word static, producing static int i; still produces a
global static variable, but prevents its name from being exported to the linker,
so it will be a different variable from any other global integer called i
in other separately compiled program parts.
Language Specifics
C: Declarations that appear inside a block (between { and })
have as their scope the region between their declaration and the }
that terminates that block. Declarations
that are not inside any block (i.e. Top-Level declarations) have as their scope the
whole program from their declaration point onwards. All globals have static extent.
Locals have dynamic extent (limited to the lifetime of the block they were declared in)
unless the declaration begins with static.
Pascal: Declarations that appear inside a block (between begin and
end) have as their scope the region between their
declaration and the end that terminates that block. Top-level declarations
(not inside any block) have as their scope the
whole program from their declaration point onwards. All globals have static extent.
All locals have dynamic extent (limited to the lifetime of the block they were declared in).
Algol 60: All declarations appear inside a block. An Algol program is just one big
block. All declarations have as their scope the entire block that they are declared inside
(not just from their declaration point on). The extent of any variable is the lifetime
(period of time between entry and exit) of their block; if a variable's block is the
whole program, then it is effectively static, although technically, all variables
are dynamic in extent. The one exception is that if a variable is preceeded by the keyword
own, the variables it creates have true static extent.
Fortran: All variables have as their scope the entire program unit in which they
appear. Program units are whole subroutines, whole functions, or the main program. There
is no such thing as a global variable (although they may be simulated with "common blocks").
All variables have static extent.