18.3 Matching of the Comparison Data Structures

The matching process begins after the PSF, COOR, and PARM options are handled. Segments are matched first, and the SEGMATCH command specifies how they are to be matched. In the absence of a SEGMATCH command, segments are paired by matching those which have the same identifier. If the command is given, its pairs of segment identifiers dictate the matching of segments.
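The two cases above can be sketched in a few lines of Python. This is only an illustration, not the program's code: the function name `match_segments` and the representation of segments as lists of identifier strings are assumptions made for the example.

```python
def match_segments(main_ids, comp_ids, segmatch_pairs=None):
    """Pair segments of the main and comparison structures.

    main_ids, comp_ids: lists of segment identifiers (hypothetical layout).
    segmatch_pairs: optional list of (main_id, comp_id) tuples standing in
    for the identifier pairs given on a SEGMATCH command.
    """
    if segmatch_pairs is not None:
        # Explicit pairs dictate the matching.
        return list(segmatch_pairs)
    # Default: pair the segments that carry the same identifier.
    comp_set = set(comp_ids)
    return [(seg, seg) for seg in main_ids if seg in comp_set]
```

With no explicit pairs, `match_segments(["A", "B", "C"], ["B", "C", "D"])` pairs only B and C, since those are the identifiers common to both structures.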

Once the segments are matched, we must match the residues in every matched segment. Central to this problem is the finding of homologies between two sequences.[1]

The problem of finding homologies has been attacked by many others. The method used in the analysis facility is a modification of the string to string correction problem,[2] a problem well known in the computer science literature. The string to string correction problem can be stated as follows: Given a cost function for deleting any character from a string, a cost function for inserting a character into a string, and a cost function for changing some character in the string into another character, find the lowest cost sequence of insertions, deletions, and changes that transforms one string into the other. The cost functions in this algorithm are general in that they can specify a different value for every possible character or pair of characters. This problem is solvable in time and computer storage proportional to the product of the lengths of the strings.
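For readers unfamiliar with the string to string correction problem, the following Python sketch shows the standard dynamic programming solution. It illustrates the general technique only; the modification used in the analysis facility is not reproduced here, and the function name `edit_alignment` and the shape of its result are assumptions made for the example.

```python
def edit_alignment(a, b, delete_cost, insert_cost, change_cost):
    """Return (total_cost, ops) for the cheapest transformation of a into b.

    ops lists ("pair", i, j), ("delete", i, None), and ("insert", None, j),
    where i indexes a and j indexes b.
    """
    n, m = len(a), len(b)
    # cost[i][j] = lowest cost of transforming a[:i] into b[:j].
    # The table has (n + 1) * (m + 1) entries, hence the product bound.
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + delete_cost(a[i - 1])
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + insert_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + change_cost(a[i - 1], b[j - 1]),
                cost[i - 1][j] + delete_cost(a[i - 1]),
                cost[i][j - 1] + insert_cost(b[j - 1]),
            )
    # Trace back from the corner to recover one optimal set of operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + change_cost(a[i - 1], b[j - 1])):
            ops.append(("pair", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + delete_cost(a[i - 1]):
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


# Example with unit costs (the cost values are illustrative assumptions).
unit = lambda residue: 1.0
mismatch = lambda x, y: 0.0 if x == y else 1.0
total, ops = edit_alignment("ACDE", "ADE", unit, unit, mismatch)
```

In the example, the cheapest transformation deletes the second character of the first string and pairs the remaining three, for a total cost of 1.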

We can see that this problem is applicable to finding a homology. As the genetic code for a protein evolves, insertions, deletions, and changes occur in its sequence. The best homology is the one which minimizes these mutations. The algorithm calculates exactly what is needed: a matching of residues that gives the lowest cost transformation of one sequence into the other. Further, by establishing an appropriate change function, we can find homologies which permit conservative changes.

With the homology finding process understood, we can turn to the RESMATCH command. Any number of RESMATCH commands may be specified, up to the number of segment pairs. Each command can specify the matching of residues in one segment pair or in all of them. The first part of a RESMATCH command specifies which segment pair the command applies to. A segment identifier pair refers to that pair only; ALLSEG refers to all segment pairs which have not yet been processed. The RESMATCH command processor keeps a record of which segment pairs have been specified in a command; any which are not specified are matched by finding the homology between their sequences. Further, each segment can be matched only once.

The next part of the RESMATCH command specifies how the homology should be found. If NOHOMOLOGY is specified, the homology finding step is bypassed. If neither NOHOMOLOGY nor HOMOLOGY is specified, the homology is taken over the complete sequences in the segment pairs. When HOMOLOGY is specified, one may specify the range of residues over which the homology should be taken. The ranges are specified as pairs of pairs of residue identifiers. The first pair gives the starting and ending residues in the first segment; the second pair gives the starting and ending residues in the second segment. More than one set of ranges may be specified.

Any number of conserved sets of residues may be specified using the CONSERVE command. Each use of CONSERVE specifies that all residues whose names are given may be considered equivalent when the homology is found.
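One way to picture the effect of conserved sets is as a change function that charges nothing for substitutions within a set. The Python sketch below is an illustration under stated assumptions: the name `make_change_cost`, the cost values 0 and 1, and the rule that a residue name belongs to at most one set are all choices made for the example, not details taken from the program.

```python
def make_change_cost(conserved_sets, mismatch=1.0):
    """Build a change-cost function from sets of equivalent residue names.

    conserved_sets: iterable of sets of residue names, one set per
    (hypothetical) CONSERVE specification. If a name appears in several
    sets, the last set wins in this simplified sketch.
    """
    group_of = {}
    for index, names in enumerate(conserved_sets):
        for name in names:
            group_of[name] = index

    def change_cost(res_a, res_b):
        if res_a == res_b:
            return 0.0  # identical residues cost nothing
        group_a = group_of.get(res_a)
        if group_a is not None and group_a == group_of.get(res_b):
            return 0.0  # conservative change within one conserved set
        return mismatch

    return change_cost


# Treat the acidic residues as one set and two basic residues as another.
cost = make_change_cost([{"ASP", "GLU"}, {"LYS", "ARG"}])
```

Here `cost("ASP", "GLU")` is 0, while `cost("ASP", "LYS")` carries the full mismatch penalty.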

It is possible to raise the penalty by a factor of 10 for making insertions and deletions by using the PROTECT option. The PROTECT option specifies pairs of residues in the main sequence which can only be deleted with a high penalty or which can have insertions within the specified range with a high penalty. Multiple ranges can be specified. This option is most useful when one wishes to conserve regions of secondary structure.

The PRINT option, if specified, causes every call to the homology finding subroutine in a RESMATCH command to print the matching of the sequences, an alignment showing the relationship of the sequences, and a set of SPLICE commands (see Splice Command) which will effect the transformation from the main sequence to the comparison sequence.

Finally, in addition to the homology operations, one may specify that certain residues are to be matched. The residue identifier pairs at the end of a RESMATCH command match those residues.

Once the matching of residues within segment pairs has taken place, we can tackle the final step of matching the two PSFs: matching their atoms. If no ATOMATCH commands are given, the default action is to use the IUPAC nomenclature for each atom in the residues and match atoms which have the same IUPAC name. This is usually acceptable, as the IUPAC nomenclature results in many equivalent atoms being properly matched.

With ATOMATCH commands, one can tailor the atom matching process as well. Any number of ATOMATCH commands, up to the number of residue pairs, may be specified. As with RESMATCH, the ATOMATCH command processor keeps a record of what residue pairs have had their atoms matched. Any residue pairs which have not been matched at the end of ATOMATCH processing have their atoms matched by the default algorithm.

The first part of an ATOMATCH command specifies what segment pair the command applies to. If ALLSEG is specified or if no segment identifier pair is given, the command will apply to the residue pairs in all segment pairs. If a segment identifier pair is given, the command will be applied to that pair only.

The next part of the command specifies what residue pairs the command applies to. ALLRES, or no residue specification at all, results in the command applying to all residues within the selected segment pairs. RESID means that the next two words are the residue identifiers of the residues whose atoms are to be matched. RESNAME means that the following two words are residue names and that the atom matching specified in the command applies to all such pairs found in the given segment pairs. For this option, the program will check for the residue name pair in either order; e.g., if we say ATOMATCH RESNAME ALA GLY ..., it does not matter which sequence has GLY and which has ALA. On the other hand, the order in the pair does matter with the RESID option.

The ONLY option specifies that the default operation of matching atoms by IUPAC names should not take place. Normally, the atoms in a residue pair are matched by IUPAC names before the user matching of atoms is executed. Specifying ONLY suppresses this preliminary matching.

Finally, the pairs of words at the end of the command give the matching of atoms specified through their IUPAC names.

Once the atom pairing is complete, it is checked for duplicate references to any atom, as the pairing must be an equivalence relation. Any pair which refers to an atom which is referred to by another pair is deleted. Should this occur, it is important to see what command is causing the duplication and to fix the problem. The most common cause of this error is forgetting to specify ONLY in the ATOMATCH command. Usually, the pair deleted from a duplicate is the pair one is interested in.
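The duplicate check can be sketched as a single pass over the pair list. The text does not say which member of a duplicate is deleted, so this Python example assumes that earlier pairs are kept and later ones dropped; that ordering, the function name `remove_duplicate_pairs`, and the use of bare atom names are all assumptions made for illustration.

```python
def remove_duplicate_pairs(pairs):
    """Split pairs into (kept, dropped) so that no atom is used twice.

    pairs: list of (main_atom, comparison_atom) names within one residue
    pair. Assumption: the earlier pair wins and the later one is dropped.
    """
    seen_main, seen_comp = set(), set()
    kept, dropped = [], []
    for main_atom, comp_atom in pairs:
        if main_atom in seen_main or comp_atom in seen_comp:
            dropped.append((main_atom, comp_atom))
            continue
        seen_main.add(main_atom)
        seen_comp.add(comp_atom)
        kept.append((main_atom, comp_atom))
    return kept, dropped


# Default name matching followed by a user pair given without ONLY.
default_pairs = [("CA", "CA"), ("CB", "CB"), ("CG", "CG")]
user_pairs = [("CB", "CG")]
kept, dropped = remove_duplicate_pairs(default_pairs + user_pairs)
```

Under this assumption the user's pair `("CB", "CG")` is the one dropped, because both of its atoms were already claimed by the default matching. This is the situation the ONLY option avoids.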

It should be noted that the matching process is performed even if just coordinates are being compared. The default action results in an atom pairing that matches every atom to itself.

Once the atom pairing is established, one can specify that the coordinates of paired atoms be matched. There are two operations involved in fitting coordinates: translation and rotation. The translation operation is as follows: The geometric center or center of gravity is calculated for the main atoms in the atom pair list and for the comparison atoms in the atom pair list. The difference between the centers is added to the comparison coordinates to minimize the least squares translational difference between the coordinates. The rotation operation is as follows: The comparison coordinate set is rotated about its center to minimize the least squares difference between the two sets. The rotation can be done with respect to several different sets of atoms.
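The translation step is simple enough to show directly. The Python sketch below covers translation only and leaves the rotational fit aside. The function names, the tuple-based coordinate layout, and the use of one weight per atom pair for the center of gravity are assumptions made for the example.

```python
def center(coords, masses=None):
    """Geometric center, or center of gravity when masses are given."""
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    return tuple(
        sum(w * xyz[k] for w, xyz in zip(masses, coords)) / total
        for k in range(3)
    )


def translate_onto(main_coords, comp_coords, masses=None):
    """Shift the comparison coordinates so that the two centers coincide.

    main_coords, comp_coords: equal-length lists of (x, y, z) tuples for the
    paired atoms, in pair order. Returns (moved_coords, shift).
    """
    main_center = center(main_coords, masses)
    comp_center = center(comp_coords, masses)
    # The difference between the centers is added to the comparison set.
    shift = tuple(main_center[k] - comp_center[k] for k in range(3))
    moved = [tuple(xyz[k] + shift[k] for k in range(3)) for xyz in comp_coords]
    return moved, shift


main = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
comp = [(1.0, 1.0, 1.0), (3.0, 1.0, 1.0)]
moved, shift = translate_onto(main, comp)
```

In this example the comparison set is a rigidly displaced copy of the main set, so the shift of (-1, -1, -1) superimposes the two exactly and no rotation would be needed.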

The COORMATCH option specifies how the coordinate matching takes place. If COORMATCH is not specified, no attempt is made to match the coordinates. If COORMATCH is specified without any modifier, it results in a translation followed by a least squares rotational fit of all paired atoms. If MASS is specified, the centers of gravity are used for the translational fit; otherwise, the geometric centers are used. If SIDE is specified, the least squares rotational fit takes place with all atom pairs where at least one of the atoms in the pair does not have an IUPAC name that corresponds to a protein backbone atom. If BACK is specified, the least squares rotational fit takes place with the complement of the atom pairs that would be matched by SIDE. That is, BACK will do the matching using only backbone atoms, and SIDE will do the matching using side chain atoms. If NOROT is specified, no rotation takes place. Finally, the SAVE option will save the coordinate transformation necessary to fit the comparison coordinates onto the reference coordinates. This can be used later by the coordinate transformation commands to move atoms around according to this transformation, see Coordinate Manipulations.
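The SIDE and BACK selections are complements of each other, which a short Python sketch makes concrete. The set of backbone atom names used here (N, CA, C, O) is an assumption for illustration; the program's own list of backbone IUPAC names may differ, and `select_pairs` is a hypothetical helper.

```python
# Assumed backbone atom names; the program's actual list may differ.
BACKBONE_NAMES = {"N", "CA", "C", "O"}


def select_pairs(pairs, mode="ALL"):
    """Choose the atom pairs used in the rotational fit.

    pairs: list of (main_atom_name, comparison_atom_name) tuples.
    mode: "SIDE", "BACK", or "ALL" (no modifier).
    """
    def is_side(pair):
        # SIDE: at least one atom of the pair is not a backbone atom.
        return any(name not in BACKBONE_NAMES for name in pair)

    if mode == "SIDE":
        return [p for p in pairs if is_side(p)]
    if mode == "BACK":
        # BACK is the complement of SIDE: both atoms are backbone atoms.
        return [p for p in pairs if not is_side(p)]
    return list(pairs)


pairs = [("CA", "CA"), ("CB", "CB"), ("N", "CB")]
```

Note that a mixed pair such as `("N", "CB")` falls under SIDE, since only one of its atoms needs to lie off the backbone.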


Footnotes

[1] You can use the homology program, see Homology, to find the homology between two sequences.

[2] R. A. Wagner and M. J. Fischer, “The String-to-String Correction Problem”, Journal of the Association for Computing Machinery 21, 168-173 (1974).