Overview of Directed Searching

Next: Parallel Processing, Previous: Sidechain Overview, Up: Overview of Conformational Search

12.1.3 Overview of Directed Searching

The process of conformational search can be viewed as a search of tree, where a tree means a graph with no cycles. See the following figure:

                              o <-- Root
                             /|\
                            / | \
                           /  |  \
                          /   |   \
                         /    |    \
                        o     o     o           Degree of Freedom 1
                       /|     |\     \
                      / |     | \     \
                     /  |     |  \     \
                    /   |     |   \     \
                   /    |     |    \     \
                  o     o     o     o     o     Degree of Freedom 2

In this analysis, the root of the tree represents the system at the beginning of the search process, where no degrees of freedom have been sampled. The next level represents the search process after the first degree of freedom has been sampled. In the above figure, there are three nodes at this level which signifies that the good conformations were found. The third level in the tree represents the effects of sampling the second degree of freedom, and so on.

Each node in the tree represents the result of a particular set of samplings for the degrees of freedom. Leaf nodes are those nodes which exist at the very end of the tree, and represent complete conformations, where every degree of freedom has been completely sampled. Nodes in between the root and leaves are partial conformations. Any node which has been sampled is called an expanded node, and any node which has not been sampled is called an open node.

The goal of a conformational search is to find the lowest energy leaf node. An exhaustive search of the tree can be programmed easily by simply following down the leftmost branch of the tree until either a leaf is hit, or a conformation which is blocked because of bad contacts or other problems. The program then backs up to the previous level, and examines the next node over, and again searches down towards the leaves. Effectively, the search proceeds from left to right across the entire tree, and every leaf is eventually generated. This is a depth-first search. Because each degree of freedom increases the number the nodes by an approximately constant factor, the time for an exhaustive search is exponential in the number of degrees of freedom.

In principle, knowledge of the path through the search tree to the best leaf would make it theoretically possible to find the best leaf in time linear with the number of degrees of freedom. Such knowledge is generally not available in advance. However, in some situations, information about the partial conformations can provide a guide towards finding the best leaf. For example, if CONGEN is used to reconstruct the backbone of a protein from the alpha carbon coordinates, then the RMS deviation of the partial conformations from the known alpha carbon coordinates is an excellent guide to a good quality fit.

The use of information about partial conformations is the basis for the directed search options. When these options are used, energies or RMS deviations for conformations at each node in the tree are calculated. The program maintains a sorted index of the open nodes in the tree, and it samples degrees of freedom in an order that depends on the order of the energies or RMS deviations.

The evaluations of the open nodes can be affected by the use of the EINHERIT and EIMMEDIATE keywords for particular degrees of freedom. EINHERIT causes all nodes expanded for a particular degree of freedom to inherit the evaluation of their father. EIMMEDIATE sets the evaluation of nodes at this level to the smallest possible value which forces the program to expand nodes at this level ahead of all others. This is useful when you do not want the energies or RMS's of this degree of freedom to influence the directed search.

Three selection strategies exist at present. The first strategy, the evaluation strategy which is selected by the EVAL keyword, always selects the lowest energy or RMS deviation node. It has a serious drawback when a degree of freedom results in conformers whose energy or RMS deviation is raised. In that case, the program will first expand all nodes before that degree of freedom before moving on towards the leaves.

The second strategy, the deepening evaluation strategy which is selected by the DEVAL keyword, is intended to circumvent the drawback of the EVAL strategy. Here, CONGEN maintains a sorted index of open nodes for each degree of freedom. When the search begins, all the indices are empty except for the first degree of freedom which contains just the root node. The program then cycles through each degree of freedom, and selects the lowest energy or RMS deviation node at that level for expansion. If there are no open nodes available, or if the search reaches the leaves, then it cycles back to the first degree of freedom. This cycling forces the program to make progress toward the leaves.

The third strategy, the mixed method selected by the MIX keyword, is a combination of EVAL and DEVAL. The MIX strategy maintains both types of indices, and simply alternates between each rule for the selection of nodes to expand. So far, it is the best directed search strategy.

It is possible to use the directed search methods to generate random structures. The trick here is to use random numbers for the evaluation of nodes rather than energy values. There are two possible approaches to the generation of random structures, generating multiple conformations in a single CONGEN run or making multiple runs of CONGEN with each generating one conformation. In the first case, the EVAL search option generally leads to more variation in the structures, although there is substantial overlap of structures. In the second case, the DEVAL option should be used because it is faster for each run.

The directed search capabilities are still under active development, and suggestions for improvement are always welcome.

When the directed search methods are used, the program maintains a search tree in memory. Since each node requires several hundred bytes of memory, it is impractical to keep the entire search tree. Thus, there is a tree pruning mechanism which can be tailored to periodically eliminate nodes that are either unnecessary or unlikely to be sampled in a reasonable amount of time given the search strategy in use. See Global Options for Conformational Search, for a description of the TREE options.