The drive to remove redundancies is widely seen as a good thing. Christoph Knabe sees how it has influenced programming languages.
Since the beginning of programming, redundancies in source code have prevented maintenance and reuse. By ‘redundancy’ we mean that the same concept is expressed in several locations in the source code. Over the last 50 years the efforts to avoid redundancies [ Wikipedia ] have inspired a large number of programming constructs. This relationship is often not obvious to the normal programmer. Examples include relative addressing, symbolic addressing, formula translation, parameterizable subroutines, control structures, middle-testing loops, symbolic constants, preprocessor features, array initialization, user defined data types, information hiding, genericity, exception handling, inheritance, dynamic dispatch, aspect oriented programming, functional programming, and even program generators and relational databases. These constructs are discussed using examples from 14 widely used programming languages. Whosoever understands the common concept is well equipped for the future.
What can the Zuse computer say today?
Architecture of the Zuse 22 | |||||||||||
|
In 1971 my (high) school inherited a 12-year-old Zuse 22 and I learned programming on it. In the rapidly moving computer domain we would usually consider using such obsolete technology to be a waste of time. But this has provided me with the background for a survey of the programming techniques developed over the last 50 years. The Zuse 22 of the German computer pioneer Konrad Zuse was one of the first mass produced computers in the world (55 machines was a lot in those days!). It had a highly economical construction and was programmed in a machine level language: the Freiburgian Code. The example program in table 1 adds the natural numbers from n decrementing to 1, and prints the result (tested by the Z22 simulator of Wolfgang Pavel [ Pavel ]). In practice only the contents of the Instruction column were punched onto paper tape and read by the computer as a program. The Address column indicates into which word the instruction was stored, and the Comment column corresponds to comments in contemporary languages.
Address | Instruction | Comment |
---|---|---|
T2048T | T ransport the following to words 2048 ff. | |
2048 | 10' | i : Initial value for i is n, here the natural number 10. |
2049 | 0' | sum : Initial value is the natural number 0. |
2050 | B2049 | B ring the sum into the accu(mulator). |
2051 | A2048 | A dd i to the accu. |
2052 | U2049 | Store ( U mspeichern) accu to sum. |
2053 | B2048 | B ring i into the accu. |
2054 | SC1 | S ubtract the Constant value 1 from the accu. |
2055 | U2048 | Store ( U mspeichern) accu to i. |
2056 | PPE2050 | If accu P ositive E xecute from (go to) 2050 |
2057 | B2049 | B ring sum into the accu. |
2058 | D | Print ( D rucke) accu. |
2059 | Z0 | Stopp |
E2050E | E xecute now from 2050 |
The instructions can have symbolic, combinable operation letters :
B
=Bring,
A
=Add,
S
=Subtract,
T
=Transport,
U
=Umspeichern (store to),
C
=Const-Value,
PP
=if Positive,
E
=Execute from (go to),
D
=Drucken (print),
Z
=Stop. But the addressing was purely numeric with absolute storage addresses. Here the variables
i
and
sum
are stored at the addresses 2048 and 2049 respectively. The algorithm itself is stored from address 2050, where we jump back using the instruction
PPE2050
, if the value of
i
is still positive.
Redundancies appear here in the addresses: 2048 for i appears 4 times, 2049 for sum 3 times, 2050 for the beginning of the program and the loop twice explicitly and once implicitly (two cells after where the tape content is stored). As a consequence the program is neither relocatable in the working storage nor simply extendable. So there are big difficulties in its maintenance.
Relative and symbolic addressing
Progress came later for the transistorized Zuse 23 by the development of ‘Relative Addressing’. This enabled a programmer to write a subroutine as if it was located at address 0. A certain prefix instruction told the loading program to store the actual start address in a load-time base register, which was usually register 26. Appending
A26
to an address caused the loading program to add the content of register 26 to the value to form an absolute address before storing the instruction to be executed later. So when using relative addressing the conditional jump instruction to the beginning of the program in table 1 would be
PPE2A26
instead of
PPE2050
. By this means the program has become relocatable. Relative addressing was still very economic: it did not need more resources than the register 26 at load time.
True, relative addressing facilitates relocating a subroutine in working storage, but inside the subroutine it is
as inflexible as absolute addressing. If we wanted to extend the example by inserting a prefix action before the calculation loop, we would have to shift the relative jump goal
2A26
, too. Thus ‘symbolic addressing’ was introduced with the Zuse 23 (sold from 1961). See the German programming manual for the Z23 [
Zuse23
] p. 65ff. The Z23 loading program substituted each bracketed identifier of up to 5 characters by its address. The program from table 1 could be rewritten with the symbolic addresses
(I)
,
(SUM)
, and
(BEGIN)
as in Listing 1.
T2048T (I) 10' (SUM) 0' (BEGIN) B(SUM) A(I) U(SUM) B(I) SC1 U(I) PPE(BEGIN) B(SUM) D Z0 E(BEGIN)E |
Listing 1 |
Now it is possible to insert further instructions at any place without destroying the program. This improvement was such a big one that the assembler for the Siemens 2002 was named after this technique, PROSA (Programming with Symbolic Addresses). Necessary resources for symbolic addressing were an addressing program and a symbol table.
We are now acquainted with the most common procedure for redundancy elimination: The redundant code part gets a name and we use that name instead of the former, redundant code parts.
Formula translation
In the beginning, technical and scientific calculations dominated computer applications. But as we can see in the words 2050…2053 of the Z22 example, a simple addition needed three instructions (bring, add, store). For the formula ( a + b )*( a - b ) we would need about 7 instructions. So the need quickly arose for simplifying formula calculations. This was enabled by formulae with variable identifiers, literals, operator signs, operator priorities, and parentheses. The programming language FORTRAN got its name by this feature (Formula Translator). Fortran I was defined in 1956. At that time there was no possibility of defining your own operators.
Subroutines
If you needed an instruction sequence several times, on the Zuse 22 you could jump there by using the call instruction
F
from different program locations. Besides doing the actual jump, this instruction loaded a ‘jump back’ instruction into register 5. That is why you had to begin each subroutine by copying the contents of register 5 to the end of the subroutine using a U-instruction. This assured a jump back to the calling location when reaching the end of the subroutine.
But often you don’t need identical, but only similar processing. In this case Freiburgian Code had the convention of reserving cells before the subroutine for arguments and results. These had to be filled by the caller before the call instruction and retrieved afterwards, respectively. Then it had to jump to the first instruction of the subroutine, which always had to be
B5
followed by a
U
with the address of the last instruction cell of the subroutine. So the program from listing 1, converted into a Zuse23 subroutine with entry address
SUMUP
for summing up the integer numbers from 1 to
n
, is shown in listing 2.
T2048T (N) 10' (SUM) 0' (SUMUP) B5 U(BACK) B(SUM) A(N) U(SUM) B(N) SC1 U(N) PPE(SUMUP) (BACK)Z0 |
Listing 2 |
In order to print the sum of the numbers from 1 to 20, you could call
SUMUP
as follows:
BC20 U(N) F(SUMUP) B(SUM) D
While this had to be obeyed as a convention on the Zuse 23, nowadays it is automated in all higher programming languages by the concept of a subroutine call with an argument list and return value. FORTRAN II and Algol introduced the concept of named, parameterized subroutines around 1958. The possibilites for redundancy elimination were enormous and gave rise to the style of procedural programming. A subroutine in FORTRAN II for summing up the numbers from 1 to
n
by a counting loop is shown in listing 3. Identifiers starting with
I
to
N
are considered integers.
C COMPUTES THE SUM OF THE NUMBERS FROM 1 TO N FUNCTION ISUMUP(N) ISUM = 0 DO 99 I = 1, N 99 ISUM = ISUM + I ISUMUP = ISUM RETURN END |
Listing 3 |
Control structures
The building blocks by which programs got their high flexibility and reusability were conditional jumps such as the
PPE
instruction of the Zuse 22. Conditional forward jumps could be used to implement conditional branches. A conditonal backward jump could be used to implement repetition, which would terminate when the condition no longer held. By combining these facilities you could construct arbitrarily complex algorithms. As conditional branches and limited repetitions with or without a control variable were needed frequently, the popular programming languages introduced such constructs. In FORTRAN I (1957) you could sum up the integer numbers from 1 to
n
by the following counting loop:
ISUM = 0 DO 99 I = 1, N 99 ISUM = ISUM + I C HERE ISUM CONTAINS THE SUM OF THE NUMBERS C FROM 1 TO N
FORTRAN had half-symbolic addressing. A statement could be identified by a freely electable number, a so-called ‘label’. The statement
DO 99 I = 1, N
incremented the control variable
I
from 1 to
N
, and each time all statements up to and including the statement labeled by 99 are repeatedly executed. For branching FORTRAN I offered the arithmetic
IF
:
IF
(expression) negativeLabel,zeroLabel,positiveLabel
This statement jumps to one of the three enumerated labels depending on the sign of the expression result.
Algol 60 had already introduced nowadays common control structures:
-
A multi-branches cascadable
IF
:if
condthen
exprelse
expr -
A universal loop with control variable, for example:
-
for i := 1 step 1 until 100 do print(i)
prints the numbers from 1 to 100 -
for i := 1, i*2 while i<2000 do print(i)
prints the powers of 2 from 1 to 1024.
-
Modern control structures with the purpose of being able to completely avoid jump instructions were introduced by Pascal (1970). Pascal distinguished the pre-testing
WHILE
-
DO
-loop, the post-testing
REPEAT
-
UNTIL
-loop, and the counting
FOR
-
DO
-loop. Nevertheless Pascal still contained the
GOTO
statement, as non-local termination could not be done otherwise
Unfortunately the
WHILE
-loop, planned for repetitions where the number of iterations is not known in advance, implies redundancies in the following use case, which occurs extremely often in practice: We want to process an unknown number of elements, maybe even none. The code in listing 4 reads positive numbers and prints their sum. The procedure
readNumber
is understood to deliver -1 if it can’t find any further number. The keywords are typed in upper case, although this is not important in Pascal.
VAR x: integer; sum: integer := 0; BEGIN readNumber(x); WHILE x>0 DO BEGIN sum := sum + x; readNumber(x); END; writeln; writeln('The sum is ', sum, '.'); END |
Listing 4 |
We see, that the procedure
readNumber
has to be called redundantly: Once before the
WHILE
-loop, the second time at the end of the loop body, in order to prepare the variable
x
for the next test of the
WHILE
-condition.
That is why C (1973), Modula-2 (1978), and Ada (1980) introduced the possibility of leaving an arbitrary loop by a special jump instruction, in particular an endless loop. Using this we could solve the above task without redundancies (C syntax, listing 5).
int sum=0, x; for(;;){ readNumber(&x); if (x<=0) break; sum += x; } |
Listing 5 |
This corresponds to the general pattern for processing an unknown-in-advance number of records in a middle-testing loop (listing 6).
initialize for(;;){ retrieve if (notSuccessful) break; process } |
Listing 6 |
Recursion : Recursion is when a function calls itself either directly or indirectly. The ability to do so was introduced by LISP and Algol at around the same time, but the then predominant languages FORTRAN and COBOL did not allow recursion. Some tasks, e.g. traversing trees, are most elegantly expressed by recursion. Recursion does not directly contribute to the elimination of redundancies. If the recursive call is the last statement of a function, the compiler can replace it by a simple, storage-efficient loop. Compilers of functional programming languages regularly do this optimization.
User-defined control structures
The ability for programmers to define their own control structures was born in LISP (1958), as in this language instructions and data were notated in the same form, the so-called S-expressions. They also had the same internal representation. For example,
(TIMES N 7)
on the one hand means the multiplication of
N
by 7. On the other hand it is simply a list with the elements
TIMES
,
N
, and
7
. A function in LISP I, which was defined as
FEXPR
instead of the usual
EXPR
, had a special property. It evaluated its passed arguments not before executing its function body, but left this until the explicit usage of the function
EVAL
in the function body. So a function body could arbitrarily control the frequency and order of evaluation of its arguments.
Later on when the power of this concept was found out, LISP dialects introduced comfortable notations to define
FEXPR
s. For example in MacLISP you could define an
if
-
then
-
else
construct, which was not contained in early LISP, by the traditional
COND
as follows:
(defun IF fexpr (args) (let ((predicate (car args)) (then (cadr args)) (else (caddr args))) (cond ((eval predicate) (eval then)) (t (eval else)))))
The
IF
function gets all arguments unevaluated as a list with the name
ARGS
.
LET
assigns the individual arguments to the values
PREDICATE
,
THEN
, and
ELSE
.
COND
evaluates
THEN
or
ELSE
depending on the evaluation of
PREDICATE
. Example from Steele and Gabriel [
LISP
].
As a lover of redundancy-free solutions I missed middle-testing loops in the modern object-functional language Scala (2001). But I could define one myself (listing 7).
def loopGetTestProcess[Item] (getItem: => Item) (shouldContinue: Item=>Boolean) (processItem: Item=>Unit) { var item = getItem while(shouldContinue(item)){ processItem(item) item = getItem } } |
Listing 7 |
The function
loopGetTestProcess
has 3 parameter lists. In the first it expects an expression
getItem
for obtaining a next item, in the second a boolean function
shouldContinue
for judging the success of the obtainment, and in the third a function
processItem
for processing the obtained item. By using the generic parameter
[Item]
it is assured that the types fit together. The redundant call of
getItem
is well encapsulated in the function body and is not visible in the using source code.
As syntactic sugar in Scala you can write an argument list of length 1 with curly brackets thus enabling the following usage:
def printLines(in: java.io.BufferedReader){ loopGetTestProcess(in.readLine())(_!=null){ println } }
In Java this could be done by
break;
and would look like in listing 8.
void printLines(final java.io.BufferedReader in){ for(;;){ final String line = in.readLine(); if(line==null)break; println(line); } } |
Listing 8 |
The big achievement in Scala is that the programmer has defined this control structure himself and can define others too.
Constants
In working storage every cell is modifiable. That is why there were no constants in Freiburgian code. By around 1960, the predominant higher programming languages FORTRAN, Algol, and COBOL had only variables and literals, usable in expressions. e.g. the literal
2
in the FORTRAN expression
A**2 + 2*A*B + B**2
with
**
meaning ‘to the power of’. A redundancy problem arises, if you must use the same literal several times. In COBOL 66 you could also use literals to dimension a variable, for example a west-German ZIP code, which consisted of four digits, as
ZIP PICTURE 9(4)
. But if you had done it this way at several locations in your program code, the reorganisation of the German postal systems in 1993 obliged you to change all these locations to five digits:
ZIP PICTURE 9(5)
. In the early higher programming languages there was no possibility to declare variable or array sizes free of redundancy.
Pascal (1970) solved this problem very cleanly by declarable, symbolic constants, which could be used for dimensioning arrays, as well. E.g.:
CONST zipLength: integer := 4; TYPE ZipCode = PACKED ARRAY[zipLength] OF character; VAR clientZip: ZipCode; BEGIN FOR i := 1 TO zipLength DO write(clientZip[i]);
Pascal also introduced the concept of user-defined data types (here the type
ZipCode
). Along with the symbolic dimensioning constants this enabled you to define sizes for a whole software system free of redundancies.
C (1973) solved the dimensioning problem less elegantly, as you had to use preprocessor macros instead of symbolic constants. But on the other hand you could even do it without explicit size constants, if you used the
sizeof
operator. So the same thing expressed in C:
typedef char ZipCode[4]; ZipCode clientZip; int i=0; for(; i<sizeof(clientZip); i++){ putchar(clientZip[i]); }
C++ (1983) introduced the ability to declare frozen variables at any location in the program code by the keyword
const
and by that to build calculated ‘constants’. Beginning with Java (1995) it became normal to dimension arrays only at runtime, or to work with the growable collections.
Preprocessor features
Often a preprocessor was used in order to avoid redundancies. So it is possible to include declarations, which are needed in the same wording in several compilation units, from one single file. This technique was used in COBOL (
COPY
), FORTRAN (
INCLUDE
), and C (
#include
). In this way you could build a data abstraction module in C. You had to write the function declarations of the module in its header file, and the function definitions along with the managed module variables in its implementation file. As an example see the header file
stack.h
for a stack of characters. The lines starting with
#
contain preprocessor directives.
#ifndef stack_h #define stack_h #define MAX_STACK_SIZE 10 void stack_push(char item); char stack_pop(void); #endif
Every client of this stack has to include the header file stack.h and then you can call the functions declared therein:
#include "stack.h" ... stack_push('X');
About 1985–1995 this was the predominant modularization technique in industry, before it was superseded by object orientation.
Stringize Operator
: The C/C++ preprocessor contains the
#
operator, which delivers the name of a given macro argument as a string. With this you can program without redundancies simple dialogs, e.g. for test purposes. So you can use in C++ the following macro
PROMPT_READ
in order to print the name of the passed variable, and to read a value into it afterwards:
#define PROMPT_READ(var) \ { cout << #var << "? "; cin >> var;}
Using this macro you can program a console dialog as in listing 9.
string name; int age; PROMPT_READ(name); PROMPT_READ(age); cout << name << " is " << age << " years old." << endl; |
Listing 9 |
A typical dialog usage could look as follows:
name? Ulf age? 22 Ulf is 22 years old.
Generally preprocessors are considered an inelegant solution. That is why the majority of cases for which you used the preprocessor in C or C++, are solvable in Java by its own language constructs without a preprocessor. However in Java the following is not possible:
-
determination of the name of a variable as by the
stringize
operator in C -
delegation of the abortion of a method execution and some accompanying action, e.g.
errno=FAILURE; return;
to another method -
delegation of a certain pattern of exception handling to another method.
If we had C-like macros in Java we could write
TRANSACTION(myAction1(); myAction2())
which would be useful to transform a statement sequence into a database transaction by
#define TRANSACTION(stmts) \ {try{stmts; commit();}catch(Exception e) \ {rollback(e);}}
Array initialization
Pascal introduced redundancy-free array dimensioning, but initialization still had a problem. As an example we want to declare and initialize an array in Pascal, which contains codes for three allowed actions. Even in modern GNU Pascal you have to indicate the array size 3:
VAR actions: ARRAY[1..3] OF char = ('G','P','D');
But the size is redundant in relation to the number of initialization elements. The more actions you need, the more difficult becomes the manual numbering.
C introduced to derive an array’s size from the length of its initialization list:
static char actions[] = {'G','P','D'};
. Java inherited this useful feature.
Summary and prospects
The early higher programming languages introduced the possibility of avoiding redundancies by extraction, parameterization, and naming of a repeated code snippet. Such code snippets could be: addresses, values, size indications, declarations, and statement sequences. Some code patterns were invoked by syntactical units of the programming language, e.g. loops. In the better case the programmer could give a freely electable name to a code pattern. If a programming language helped to eliminate redundancies better than a competing language, this was a relevant advantage in the battle for dissemination.
In the 2nd part of this article we will deal with the more modern techniques ‘information hiding’, genericity, exception handling, object-oriented, aspect-oriented, and functional programming as well as domain specific languages and relational data bases, and how they all contribute to redundancy elimination.
References
[LISP] Steele/Gabriel: The Evolution of LISP , p. 9: http://www.dreamsongs.com/Files/HOPL2-Uncut.pdf
[Pavel] http://www.wpavel.de/zuse/simu/
[Wikipedia] http://en.wikipedia.org/wiki/Don%27t_repeat_yourself
[Zuse23] http://www.weblearn.hs-bremen.de/risse/RST/WS04/Zuse23/Z23Programmierungsanleitung.pdf