Fortran, the language that CONGEN is implemented in, suffers from a lack of flexibility with respect to the storage of data. The structuring of data is limited to scalars, complex numbers, and arrays. For many programming tasks, there is a need for more complicated structures for holding data. For example, in CONGEN, the protein structure file, the PSF, consists of a number of arrays and scalars that collectively describe the structure of a macromolecule. It would very useful to be able to refer to the PSF as a whole as opposed to referring to all of its constituents.
This inability to refer to a number of related scalars and arrays by one name results in a number of problems. Passing the information to a subroutine requires either using a large number of parameters or common blocks.
Using large numbers of parameters makes subroutine calls very bulky and increases the chances of errors resulting from the mistyping of a parameter. On the order of 100 parameters are required to call the analysis subroutine.
Common blocks could be used to pass information to a subroutine. However, these require allocating all needed space at compile time. Also, having more than one instance of a data structure is difficult. For example, the analysis facility requires a second PSF, coordinate set, and other data structures. Without solving this data structure problem, we need to establish another huge set of scalars and arrays. Further, as the program is changed, this second set would have been modified as well which would have a great potential for programming errors. These problems required a better method be introduced to handle the use of data structures in Fortran.
Having decided to implement a more succinct method for using data structures, we must consider other design criteria which would be useful to meet. First of all, the elements of each data structure must convenient to access, and the clarity of the program must not be unduly restricted. Next, since we have a mechanism for dynamic storage allocation, data structures should use only as much space as they need and should be able to grow. Third, the sharing of readonly data is desirable.
Given these design goals, how are they implemented for use in CONGEN? First, the heaps are ideally suited for this task as it allows completely arbitrary use of storage. Scalars are assumed to require one word of storage on the heap. Arrays will take as many words as their dimension and data type requires. Character strings are stored in the character string heap.
To keep track of the location and data types of elements in a
data structure, three arrays; the base, length, and type arrays; are
used. The base array stores the index of the first word or character of
each element in the data structure. The length array gives either the
total number of integers required for the element or the total number of
characters, depending on the data type. The type array specifies
the data type of the element. If the type array contains a positive
number, this indicates that the element is a character string, and the
number in the type array specifies the length of the character strings.
Otherwise, if the number is equal to the parameter,
the element is numeric. Any other values are illegal.
Multiple instances of a data structure
require multiple sets of arrays.
The naming scheme for the base, length, and type arrays is to prefix a
`B', `L', or an `T', respectively,
in front of the data structure name.
To identify the word in the base array which corresponds to an element in the data structure, we use a common block which is named for the data structure. If the data structure contains n elements, the common block of indices contains n+1 variables. The first variable gives the size of the base array needed to hold one instance of the data structure. The remaining variables all have the same name as the element in the data structure. Before proceeding any further, an example is required.
Consider a simplified and slightly contrived version of the
data structure for storing
the non-bonded list. Two arrays;
INBLO; six scalars;
one character string,
NAME are needed. We will use the
NBND, for the data structure. The base, length, and type arrays
for this data
structure are called
and the following declarations in a
subroutine would suffice to use parts of the non-bonded list for a
IMPLICIT INTEGER(A-Z) # include "heap.fcm" COMMON /INBND/ SNBND,JNB,INBLO,NNNB,CTONNB,CTOFNB,CUTNB, 2 WMIN,NBOPT,NAME INTEGER BNBND(SNBND)
To make reference to the scalar
NBOPT, one would write
To make reference to the character string,
NAME, one would write
CHEAP(BNBND(NAME))(1:TNBND(NAME)). The substring reference
serves to specify a legitimate length for the name, and would be accessible
in a called subroutine as the length of the string.
Three points must be noted before we continue. First, we can
store any type of data in these data structures, provided that the
numeric and character data are identified and separated.
All we must know is how
much space they require, which is provided by the
PRECISION variables require two integers
per variable on most machines, and one integer on the Cray.
Second, we ensure that the index common blocks are the
same from one subroutine to the next by using an
#include statement which is processed by the Fortran
C preprocessor, see Fcpp. An
statement takes a file name and inserts
the contents of the file in the place of the
#include statement. By
setting up files which contain the common block definitions and using
#include statements exclusively for getting the common block, we can
ensure that the common blocks are the same.
Third, in general, one cannot access the data directly. The
cases where direct access is possible is for numeric data whose size is
the same as an
INTEGER, such as
Since the heap is declared as
INTEGER, we make a direct
reference. Single precision real scalars can be accessed directly using
RHEAP, and logicals can be reference using
References to any other parts of the data structure can be
done using two general techniques.
The first of these reference techniques is to call a subroutine
with the desired element of the data structure as a parameter. In the
subroutine, one declares the parameter for what it is, and one can make
references as desired. For example, if we wish to access the
in the non-bonded data structure, we can make a subroutine call as
and declare in the subroutine
SUBROUTINE SUB(JNB) INTEGER JNB(1)
and make references to any element of the
JNB array. This is the same
method as described for stacks. The overhead involved with this scheme
is that one must create a subroutine for the references.
The second technique which has a much higher overhead, but
doesn't require a new subroutine is to use referencing and storage
functions which have been written for CONGEN and which are found in the
source file array.flx. The referencing functions take an array
and indices into it and return the value of the specified element. For
example, the function
REFI accepts two arguments, an integer
A, and an index,
A(I). The storage functions take an array, indices, and a value
and store the value into specified element. For example, the subroutine
STORI takes an integer array,
A; its length,
N; and index,
I; and a value,
VALUE, and executes the following
A(I)=VALUE. The higher overhead of this method results
from the fact that a subroutine call is required for each array
Next, what subroutines and functions are provided to assist in the use of these data structures? There are many which cover most of the needs of CONGEN.
STRTDT is used to initialize the common blocks.
Whenever a new data structure is defined or an old one is changed in
STRTDT must be changed to accommodate this.
SETUPI, is used to initialize the
index common blocks. It depends on being able to equivalence the integer
variables in the index common block to an integer array by using
a subroutine call.
INITDT is used to initialize a base array. Before
a base array can be used, it must be initialized to the free state. The
free state means that all bases are set a value which cannot point into
the heap. This value, known as the free value, is a sufficient large
negative number such that any references using it will cause an access
violation which is useful in catching programming errors before they do
ALLDT is used to allocate space for a data
structure on the heaps. It is called with filled length and type arrays and a
free base array. The contents of the length array is summed to get the
space requirements for the data structure.
are then called to get
space on the heaps for the data structure. Finally, the bases are all set
to point to their part of data structure. Lengths of zero are permitted.
In these cases the bases for the element will be left as the free value.
This will catch any illegal references.
FREEDT is used to free a data structure. The data
structure must be allocated before it can be freed. The base array is
set to the free value.
The logical function
USEDDT is used to determine if a base array
is free or is pointing to a data structure. It returns
.TRUE. if the
base array passed to it is in use.
RESIZE is used to change the size of any of the elements in a
data structure. It is called with a base, length, and type array and another
trio of arrays which give elements to be changed and the new sizes. Note that
only the lengths of character string elements can be changed, data types of
elements may not be.
RESIZE allocates space for a copy of the data structure with new sizes;
copies the old data into the new structure; and then frees the old data
WRITDT may be used to write a data structure
in binary form to a file. The trio of arrays must be specified along
with a Fortran unit number,
ICNTRL information, and a
READDT will read what
DUPLDT is used to duplicate a data structure. It simply
makes a complete copy of the arrays and the data stored in the heaps.
With the subroutines used to implement these data structures explained, we can return to a discussion of other implementation details. The next issue is the question of sharing data. Because the base arrays are pointers, there is nothing to prevent two base arrays from pointing to the same data. Obviously, this can lead to great problems if this occurs unintentionally, but one can save space and underscore the identity of data if we allow sharing of read-only data.
Implementing sharing is very easy with the above scheme. We reserve the first element of the base array to be a pointer to the reference count for a particular instance of a data structure. When the data structure is allocated, the reference count is set to 1. If we assign another data structure to this instance by copying its bases, the reference count is incremented by 1. If we free the data structure, the reference count is decremented by one. When the reference count drops to zero, the space on the heap is freed. The use of the first element of the base arrays has an additional benefit because the length of the reference count is 1 word. This fact ensures that the first word in the base array cannot have a free value when the base array is in use.
ASNGDT, implements the assignment of data
structures which are to shared. It will take a base and length array and
copies them to another base and length array. The reference count for
the data structure will be incremented by one.
The final detail to be discussed is the interface between this scheme for manipulating data structures and the usual scheme. Currently, the analysis subroutine and non-bonded list uses this more concise means for handling data structures; the rest of CONGEN works with the separate arrays and scalars. When the analysis subroutine is invoked, the data structures must be built. Copying subroutines exist for the five data structures passed to the analysis subroutine. These copying subroutines will take the arrays and scalars which comprise a data structure and build a data structure on the heap for it. Once the copying operation is complete, the actual analysis subroutine can be called and it can freely use the data structures existing on the heap.
The copying operation presents a naming problem in that the indices kept in the index common blocks have the same names as the arrays and scalars. To circumvent this problem, a second set of index common blocks is kept for all data structures which are not isolated in the analysis subroutine. This second set of index common blocks have the same common block names as the first, but the names of the indices are prefixed with `I'.
A good example of the use of these data structures is the
NBONDS and some of its subsidiary routines in the file,