Dynamic Data Structures - Congen V2.2.3327

Next: Applicability, Previous: Heaps, Up: Storage Management

31.3 Data Structures on the Heap in CONGEN

Fortran, the language that CONGEN is implemented in, suffers from a lack of flexibility with respect to the storage of data. The structuring of data is limited to scalars, complex numbers, and arrays. For many programming tasks, there is a need for more complicated structures for holding data. For example, in CONGEN, the protein structure file, the PSF, consists of a number of arrays and scalars that collectively describe the structure of a macromolecule. It would very useful to be able to refer to the PSF as a whole as opposed to referring to all of its constituents.

This inability to refer to a number of related scalars and arrays by one name results in a number of problems. Passing the information to a subroutine requires either using a large number of parameters or common blocks.

Using large numbers of parameters makes subroutine calls very bulky and increases the chances of errors resulting from the mistyping of a parameter. On the order of 100 parameters are required to call the analysis subroutine.

Common blocks could be used to pass information to a subroutine. However, these require allocating all needed space at compile time. Also, having more than one instance of a data structure is difficult. For example, the analysis facility requires a second PSF, coordinate set, and other data structures. Without solving this data structure problem, we need to establish another huge set of scalars and arrays. Further, as the program is changed, this second set would have been modified as well which would have a great potential for programming errors. These problems required a better method be introduced to handle the use of data structures in Fortran.

Having decided to implement a more succinct method for using data structures, we must consider other design criteria which would be useful to meet. First of all, the elements of each data structure must convenient to access, and the clarity of the program must not be unduly restricted. Next, since we have a mechanism for dynamic storage allocation, data structures should use only as much space as they need and should be able to grow. Third, the sharing of readonly data is desirable.

Given these design goals, how are they implemented for use in CONGEN? First, the heaps are ideally suited for this task as it allows completely arbitrary use of storage. Scalars are assumed to require one word of storage on the heap. Arrays will take as many words as their dimension and data type requires. Character strings are stored in the character string heap.

To keep track of the location and data types of elements in a data structure, three arrays; the base, length, and type arrays; are used. The base array stores the index of the first word or character of each element in the data structure. The length array gives either the total number of integers required for the element or the total number of characters, depending on the data type. The type array specifies the data type of the element. If the type array contains a positive number, this indicates that the element is a character string, and the number in the type array specifies the length of the character strings. Otherwise, if the number is equal to the parameter, DT_NUMERIC, then the element is numeric. Any other values are illegal. Multiple instances of a data structure require multiple sets of arrays. The naming scheme for the base, length, and type arrays is to prefix a `B', `L', or an `T', respectively, in front of the data structure name.

To identify the word in the base array which corresponds to an element in the data structure, we use a common block which is named for the data structure. If the data structure contains n elements, the common block of indices contains n+1 variables. The first variable gives the size of the base array needed to hold one instance of the data structure. The remaining variables all have the same name as the element in the data structure. Before proceeding any further, an example is required.

31.3.1 Example – The Non-bonded List Data Structure

Consider a simplified and slightly contrived version of the data structure for storing the non-bonded list. Two arrays; JNB and INBLO; six scalars; NNNB, CTONNB, CTOFNB, CUTNB, WMIN, and NBOPT; and one character string, NAME are needed. We will use the name, NBND, for the data structure. The base, length, and type arrays for this data structure are called BNBND, LNBND, and TNBND, respectively, and the following declarations in a subroutine would suffice to use parts of the non-bonded list for a calculation:

           IMPLICIT INTEGER(A-Z)
     #     include "heap.fcm"
           COMMON /INBND/ SNBND,JNB,INBLO,NNNB,CTONNB,CTOFNB,CUTNB,
          2               WMIN,NBOPT,NAME
           INTEGER BNBND(SNBND)

To make reference to the scalar NBOPT, one would write HEAP(BNBND(NBOPT)). To make reference to the character string, NAME, one would write CHEAP(BNBND(NAME))(1:TNBND(NAME)). The substring reference serves to specify a legitimate length for the name, and would be accessible in a called subroutine as the length of the string.

Three points must be noted before we continue. First, we can store any type of data in these data structures, provided that the numeric and character data are identified and separated. All we must know is how much space they require, which is provided by the FSIZEOF function, see Stacks. For example, DOUBLE PRECISION variables require two integers per variable on most machines, and one integer on the Cray.

Second, we ensure that the index common blocks are the same from one subroutine to the next by using an #include statement which is processed by the Fortran C preprocessor, see Fcpp. An #include statement takes a file name and inserts the contents of the file in the place of the #include statement. By setting up files which contain the common block definitions and using #include statements exclusively for getting the common block, we can ensure that the common blocks are the same.

Third, in general, one cannot access the data directly. The cases where direct access is possible is for numeric data whose size is the same as an INTEGER, such as REALs and LOGICALs. Since the heap is declared as INTEGER, we make a direct reference. Single precision real scalars can be accessed directly using RHEAP, and logicals can be reference using QHEAP. References to any other parts of the data structure can be done using two general techniques.

The first of these reference techniques is to call a subroutine with the desired element of the data structure as a parameter. In the subroutine, one declares the parameter for what it is, and one can make references as desired. For example, if we wish to access the JNB array in the non-bonded data structure, we can make a subroutine call as follows:

           CALL SUB(HEAP(BNBND(JNB)))

and declare in the subroutine

           SUBROUTINE SUB(JNB)
           INTEGER JNB(1)

and make references to any element of the JNB array. This is the same method as described for stacks. The overhead involved with this scheme is that one must create a subroutine for the references.

The second technique which has a much higher overhead, but doesn't require a new subroutine is to use referencing and storage functions which have been written for CONGEN and which are found in the source file array.flx. The referencing functions take an array and indices into it and return the value of the specified element. For example, the function REFI accepts two arguments, an integer array, A, and an index, I. REFI returns A(I). The storage functions take an array, indices, and a value and store the value into specified element. For example, the subroutine STORI takes an integer array, A; its length, N; and index, I; and a value, VALUE, and executes the following statement: A(I)=VALUE. The higher overhead of this method results from the fact that a subroutine call is required for each array reference.

31.3.2 Useful Subprograms

Next, what subroutines and functions are provided to assist in the use of these data structures? There are many which cover most of the needs of CONGEN.

31.3.2.1 `STRTDT` and `SETUPI`

The subroutine STRTDT is used to initialize the common blocks. Whenever a new data structure is defined or an old one is changed in size, STRTDT must be changed to accommodate this. STRTDT calls SETUPI.

The subroutine, SETUPI, is used to initialize the index common blocks. It depends on being able to equivalence the integer variables in the index common block to an integer array by using a subroutine call.

31.3.2.2 `INITDT`

The subroutine INITDT is used to initialize a base array. Before a base array can be used, it must be initialized to the free state. The free state means that all bases are set a value which cannot point into the heap. This value, known as the free value, is a sufficient large negative number such that any references using it will cause an access violation which is useful in catching programming errors before they do subtle damage.

31.3.2.3 `ALLDT`

The subroutine ALLDT is used to allocate space for a data structure on the heaps. It is called with filled length and type arrays and a free base array. The contents of the length array is summed to get the space requirements for the data structure. ALLHP and ALLCHP are then called to get space on the heaps for the data structure. Finally, the bases are all set to point to their part of data structure. Lengths of zero are permitted. In these cases the bases for the element will be left as the free value. This will catch any illegal references.

31.3.2.4 `FREEDT`

The subroutine FREEDT is used to free a data structure. The data structure must be allocated before it can be freed. The base array is set to the free value.

31.3.2.5 `USEDDT`

The logical function USEDDT is used to determine if a base array is free or is pointing to a data structure. It returns .TRUE. if the base array passed to it is in use.

31.3.2.6 `RESIZE`

RESIZE is used to change the size of any of the elements in a data structure. It is called with a base, length, and type array and another trio of arrays which give elements to be changed and the new sizes. Note that only the lengths of character string elements can be changed, data types of elements may not be. RESIZE allocates space for a copy of the data structure with new sizes; copies the old data into the new structure; and then frees the old data structure.

31.3.2.7 `READDT` and `WRITDT`

READDT and WRITDT may be used to write a data structure in binary form to a file. The trio of arrays must be specified along with a Fortran unit number, HDR and ICNTRL information, and a title. READDT will read what WRITDT creates.

31.3.2.8 `DUPLDT`

DUPLDT is used to duplicate a data structure. It simply makes a complete copy of the arrays and the data stored in the heaps.

31.3.3 Sharing Data and `ASGNDT`

With the subroutines used to implement these data structures explained, we can return to a discussion of other implementation details. The next issue is the question of sharing data. Because the base arrays are pointers, there is nothing to prevent two base arrays from pointing to the same data. Obviously, this can lead to great problems if this occurs unintentionally, but one can save space and underscore the identity of data if we allow sharing of read-only data.

Implementing sharing is very easy with the above scheme. We reserve the first element of the base array to be a pointer to the reference count for a particular instance of a data structure. When the data structure is allocated, the reference count is set to 1. If we assign another data structure to this instance by copying its bases, the reference count is incremented by 1. If we free the data structure, the reference count is decremented by one. When the reference count drops to zero, the space on the heap is freed. The use of the first element of the base arrays has an additional benefit because the length of the reference count is 1 word. This fact ensures that the first word in the base array cannot have a free value when the base array is in use.

The subroutine, ASNGDT, implements the assignment of data structures which are to shared. It will take a base and length array and copies them to another base and length array. The reference count for the data structure will be incremented by one.

31.3.4 Switching to Data Structures on the Heap

The final detail to be discussed is the interface between this scheme for manipulating data structures and the usual scheme. Currently, the analysis subroutine and non-bonded list uses this more concise means for handling data structures; the rest of CONGEN works with the separate arrays and scalars. When the analysis subroutine is invoked, the data structures must be built. Copying subroutines exist for the five data structures passed to the analysis subroutine. These copying subroutines will take the arrays and scalars which comprise a data structure and build a data structure on the heap for it. Once the copying operation is complete, the actual analysis subroutine can be called and it can freely use the data structures existing on the heap.

The copying operation presents a naming problem in that the indices kept in the index common blocks have the same names as the arrays and scalars. To circumvent this problem, a second set of index common blocks is kept for all data structures which are not isolated in the analysis subroutine. This second set of index common blocks have the same common block names as the first, but the names of the indices are prefixed with `I'.

A good example of the use of these data structures is the subroutine NBONDS and some of its subsidiary routines in the file, nbonds.flx.

31.3 Data Structures on the Heap in CONGEN

31.3.1 Example – The Non-bonded List Data Structure

31.3.2 Useful Subprograms

31.3.2.1 STRTDT and SETUPI

31.3.2.2 INITDT

31.3.2.3 ALLDT

31.3.2.4 FREEDT

31.3.2.5 USEDDT

31.3.2.6 RESIZE

31.3.2.7 READDT and WRITDT

31.3.2.8 DUPLDT

31.3.3 Sharing Data and ASGNDT