SIAM J. APPL. Math. 009
Vol. 44. No. 3. June 1984
THE CONTEXT DEPENDENT COMPARISON OF BIOLOGICAL
SEQUENCES*
W. JOHN WILBURt and DAVID J. LIPMANt
Abstract. A general method for comparing two macromolecules is developed. The method differs from more traditional procedures in that matches are evaluated dependent on sequence context. We first define a context dependent similarity score between sequences and give a dynamic programming algorithm for its calculation. Conditions arc then described which allow the conversion of the similarity score to a metric distance. The class of metrics obtained in this manner includes the Sellers metric. An advantage of the method is the ability to make very rapid comparisons and to align long sequences.
1. Introduction. The problem of finding a suitable metric for biological sequences arose in the context of assigning a meaningful evolutionary distance between homologous molecular sequences. For an excellent review of this development and of the more general problem of coding and string corrections we refer the reader to Kruskal [6]. Initial results for biological sequences were obtained by Fitch and Mar- goliash [4], Needleman and Wunsch [7], Sankoff [10], and Beyer et al. [1]. The problem reached a satisfactory solution in the metric defined by Sellers [11] and in its extension by Waterman et al. [13]. Such metrics have in common that individual sequence elements are compared independent of their context just as the mutations (substitutions, deletions, and insertions) assumed to transform one sequence into another are taken to occur independent of sequence context. It is our purpose to make sequence context play a central role. Although it is true that the metric introduced by Waterman et al. [13] allows context dependence of the penalty attached to sequence elements occurring in a gap, such sequence elements are always paired with blanks and in contradistinction to this we shall be concerned with context dependence in evaluating sequence elements paired with sequence elements.
There are two basic reasons for considering context dependence as we shall consider it here. First, this approach can significantly improve the speed of sequence comparisons. Suppose we wish to compare sequences A and B but we only consider a match of two elements significant if it occurs between two additional matches. Then the context (elements before and after) of a given element, a, in A, greatly limits the number of elements in B which we need consider as possible matches for a in any alignment of A and B. When this effect is properly utilized the computation time (and/or for long sequences core space) may be greatly reduced for a comparison. Second, suppose we wish to compare two amino acid sequences to relate the protein structures for which they code. Then an isolated match of amino acids may be considered insignificant and it may be useful to score amino acid matches in relation to the structural similarity of their context. With this approach we may expect alignments to have more structural significance.
Kruskal [6] specifies the three basic modes of sequence comparison as the trace, the alignment, and the listing. We shall deal only with alignments as traces are not convenient and listings are not applicable in our case. A listing is a list of elementary operations (generally mutations for biological sequences) which when performed in order and as specified serve to transform one sequence into another. The distance
* Received by the editors December 22, 1982, and in revised form August 16, 1983.
t Mathematical Research Branch, National Institute of Arthritis, Diabetes and Digestive and Kidney Diseases. National Institutes of Health, Bethesda, Maryland 20205.
557
Do'stlaringiz bilan baham: |