ON THE OPTIMALITY OF THE DAYHOFF MATRIX FOR COMPUTING THE SIMILARITY SCORE BETWEEN FRAGMENTS OF BIOLOGICAL SEQUENCES
A.M. Leontovich
Institute of Physicochemical Problems in Biology, laboratory bldg. A,
Moscow State University, Moscow 119899, Russia
abstract. One of the most important problems of the biological sequence analysis is the search for similar sequences or sequence fragments and measuring the observed similarity. One of the most widespread and theoretically substantiated definitions of the similarity measure is that by Staden. The exact numerical value of this measure depends on the choice of the letter substitution weight matrix that characterizes the similarity of individual letters (nucleotides or amino acid residues) constituting the compared sequences. In the case of amino acid sequences the most often used matrix is that by Dayhoff. Its construction is based on the statistics of amino acid substitutions in a set of a priori related (but sufficiently divergent) proteins. In this paper some theoretical corroboration for the Dayhoff matrix are presented. In particular, a statement is proved showing that in some sense the Dayhoff matrix is an optimal matrix for the search of similar fragments.
In the analysis of biological sequences (polynucleotide or polypeptide) it is often necessary to understand whether there is a similarity between these sequences or their fragments, and, if such a similarity exists, to estimate how large it is. One of the most widespread and theoretically substantiated definitions of a similarity measure is that by Staden [1]. The exact numerical value of this measure depends on the choice of letter substitution weights; these weights characterize the similarity of individual letters (nucleotides or amino acid residues), that constitute the compared sequences.
The choice of a matrix can vary for different situations. The simplest matrix is the unit matrix with weights equal 1 if the letters coincide and 0 if the letters differ. The unit matrix is usually considered as the weight matrix if polynucleotide sequences are compared. But in the case of polypeptide sequences other matrices are mostly used [2, p. 103]. Sometimes the choice of weights depends on certain physical chemistry considerations (for instance, the hydrophobicity of amino acids, the charge, and so on) [3]. Another method of the weight matrix construction stems from the genetic code table [2, p. 103], [4]. Finally, a very often (probably the most often) used matrix is the Dayhoff matrix [4], whose construction is based on the statistics of letter (amino acid) substitutions (it seems that biologists do not use this approach for analysis of polynucleotide sequences). In this matrix relatively large weights correspond to letter pairs with the tendency for mutual substitution (of course, large weights correspond also to coinciding letters).
The last choice (the Dayhoff matrix) is the object of this paper. It will be shown that in some sense the Dayhoff matrix is an optimal matrix for the search of similar fragments.
First, let us recall the definition of the Staden similarity measure for fragments.
Consider two biological sequences, either polynucleotide (DNA or RNA), or polypeptide (proteins). Let a_{1},...,a_{N1} and b_{1},...,b_{N2} be the letters constituting these sequences. If the polynucleotide sequences are being considered, then each of letters a_{r}, b_{s} is one of the four nucleotide bases (A, G, C, T/U). If polypeptide sequences are being considered, then a_{r}, b_{s} belong to the set of 20 amino acids. Denote by p_{a }, q_{b} the frequencies of the letters a and b in the first and the second sequences respectively, that is
(1)
where N^{1}_{a} is the number of occurences of the letter a in the first sequence (i.e. the number of subscripts l for which a_{l} = a,
1
l
n_{1}), N^{2}_{b} is the number of occurences of the letter b in the second sequence, and N_{1} and N_{2} are lengths of sequences.
For each pair of letters a, b we introduce the number r _{a,b} that characterizes the similarity of the letters a and b. This number is called the weight of the substitution of the letter a to the letter b. The weights p_{a,b} form the matrix P = r _{a,b}. For polynucleotide sequences the matrix P is of dimension 4x4, while for polypeptides it is of dimension 20 x 20. Denote by m and D the mean and the variance of substitution weights in considered sequences:
Let (a_{i}, a_{i+1},..., a_{i+L1}) and (b_{j}, b_{j+1},... ,b_{j+L1}) be a pair of the sequence fragments of length L. Consider
This value A was suggested by Staden as a similarity measure between the compared fragments (a_{i},a_{i+1},... ,a_{i+L1}) and (b_{j},b_{j+1},...,b _{j+L1}).
If we assume that the sequences (a_{1},...,a_{N1}), (b_{1},...,b_{N2}) are independent Bernoulli sequences with letter frequencies p_{a}, q_{b} respectively, then the weights r _{a(i+l), b(j+l)} are random variables with the mean m and the variance D, and these random variables are independent for various l, 0 < l < L — 1. Then the similarity score A defined by formula (2) is a random variable. Its mean equals 0 and its variance equals 1. For L —> the distribution of this variable A tends to the standard normal distribution function N(0,1), that is,
For large L this formula allows us to find the probability P of the event that the sum of substitution weights for independent Bernoulli sequences of the length L is not less than the value of the corresponding sum for the fragments under comparison. This probability P is uniquely determined by A, i.e. P = P(A). Hence A is a natural similarity measure for fragments.
Now we describe the method to construct the Dayhoff matrix (as the author understands it).
A set of protein families is considered. Each family consists of presumably related proteins. In each case the presumption of relatedness is based not only on the similarity of the corresponding sequences, but rather on the biological considerations. However, the similarity between the proteins of a given family is certain but not too large, so that the sequences are not nearly identical. On the other hand, the proteins are similar enough in order to allow an unique alignment, in which the homologous fragments of the sequences are situated below each other. Such alignment is based not only on the sequences themselves, but also on the biological and chemical properties of the proteins.
Consider a pair of proteins a ,b belonging to one family. Align the protein sequences. After the alignment lengths of the sequences become equal. Denote by N the number of the alignment positions occupied by two amino acids (and not by an amino acid and a gap). Next, denote by N_{aa } the number of positions that are occupied by the amino acid a in the sequence a and an arbitrary amino acid (but not a deletion) in the sequence b . Similarly, denote by N_{bb } the number of the positions occupied by b in b and any amino acid in a . Finally, denote by N_{aba b } the number of positions occupied by a in a and b in b . Consider
where a, b is an arbitrary pair of amino acids. (If r _{aba b } > 1, than there is an observable tendency of the amino acid b to occupy the same position as the amino acid a). For fixed a, b the values r _{aba b } are averaged for all protein pairs a , b belonging to same family, and then averaged for all families from the set. The obtained averaged values are considered as the weights r _{àb} in the Dayhoff matrix. (Actually, in the real Dayhoff matrix all values are then multiplied by 10 and rounded to integers, but this is not important, since, by (2), the procedure of simultaneous multiplication of all weights by a same number does not change the value A).
Note that despite the fact that r _{aba b } and r _{baa b } are, in general, different, the matrix P obtained after the averaging is symmetric (that is, r _{ab} = r _{ba}), since the averaging is made over all protein pairs and both the pair (a , b ) and the pair (b ,a ) are considered.
The Dayhoff matrix often used by biologists is presented in Table 1 (this matrix is taken from [6, Table 1.2]).
In the Dayhoff matrix, large values of weights r _{ab} correspond to pairs of amino acids a, b that have a tendency for mutual substitution (in particular, diagonal elements r _{aa} have large values). This seems to be perfectly natural. But is there a more formal foundation for the Dayhoff method and formula (3). Here we present a statement that, in our opinion, partially validates this formula.
Table 1. The Dayhoff weight matrix

Gly 
Pro 
Asp 
Glu 
Ala 
Asn 
Gin 
Ser 
Thr 
Lys 
Arg 
His 
Val 
He 
Met 
Cys 
Leu 
Phe 
Tyr 
Trp 
Gly 
29 



















Pro 
12 
14 


















Asp 
11 
10 
23 

















Glu 
11 
11 
19 
20 
















Ala 
13 
14 
12 
12 
14 















`Asn 
11 
10 
14 
12 
12 
17 














Gin 
10 
11 
14 
14 
11 
12 
21 













Ser 
12 
11 
12 
11 
13 
13 
11 
13 












Thr 
10 
10 
10 
10 
12 
12 
11 
13 
18 











Lys 
7 
8 
10 
10 
9 
12 
11 
10 
9 
24 










Arg 
4 
4 
7 
7 
5 
10 
13 
7 
6 
22 
75 









His 
6 
6 
9 
8 
8 
13 
12 
9 
9 
11 
20 
59 








Val 
7 
8 
7 
8 
10 
8 
9 
9 
11 
8 
5 
6 
22 







He 
6 
6 
6 
7 
8 
7 
8 
8 
10 
7 
5 
7 
21 
25 






Met 
5 
6 
6 
6 
8 
7 
8 
8 
9 
8 
10 
6 
16 
16 
26 





Cys 
6 
4 
5 
5 
7 
6 
5 
11 
9 
4 
2 
3 
11 
9 
12 
166 




Leu 
4 
4 
5 
6 
6 
5 
6 
6 
7 
6 
4 
6 
15 
18 
21 
5 
40 



Phe 
2 
2 
2 
3 
4 
3 
3 
4 
5 
3 
4 
7 
7 
11 
11 
2 
12 
70 


Tyr 
1 
1 
1 
1 
2 
2 
2 
3 
2 
3 
3 
7 
3 
6 
6 
1 
6 
66 
137 

Trp 
1 
1 
1 
1 
1 
2 
2 
2 
2 
2 
3 
15 
2 
4 
4 
1 
3 
41 
46 
414 
Let us have a pair of fragments (a_{1},...,a_{L}), (b_{1},...,b_{L}) of the same length L. Denote by n^{1}_{a} the number of positions in the first fragment occupied by the letter a, by n^{2}_{b} the number of positions in the second fragment occupied by the letter b, and by n_{àb} the number of position occupied by a in the first fragment and b in the second fragment (i.e. the number of positions l such that a_{l}= a, b_{l} = b). Following (2), the similarity measure for the two fragments is
where
Each choice of weights r
_{àb} corresponds to a value of the similarity measure computed by formulas (4) and (5), A = À{r
_{àb}}. The larger is the value À{r
_{àb}} (for the given pair of fragments), the more sensitive is the matrix P = r
_{àb} for search of similarity between these fragments. The question is, which matrix P corresponds to the maximal value of À{r
_{àb}} (for a given pair of fragments)? The answer is given by the following theorem.
Theorem. Let
Then for each weight matrix r
_{àb} we have
Proof. Note first that, by (4) and (6), multiplication of all weights r
_{àb} by the a constant and adding a constant to all weights r
_{àb} does not change A{r
_{àb}}. Hence it is sufficient to find a set of weights {r
_{àb}} that maximizes A{r
_{àb}} assuming that the following conditions are satisfied:
Therefore, we have a conditional extremum problem. By (8), this problem is evidently equivalent to the problem of maximization of a more simple variable
under conditions (8). Using the method of Lagrange factors it is easy to discover that the maximum value of S{r
àb} corresponds to
where the factors l , m are found from conditions (8). This expression implies the left inequality in (7). Next, from (5) and (6) we easily see that
and the proof is completed.
Suppose that the letter frequencies in the fragments equal the corresponding frequencies in the entire sequences, that is
(so the fragments have average frequency characteristics). It follows from (6), (7), and (9) that in this case
The evident similarity between formulas (3) and (10) justifies the following interpretation of the above theorem: the choice of weights for the letter (amino acid) substitutions with the Dayhoff matrix is close to optimal if the compared sequences (proteins) have the intermediate similarity that is close to the similarity level in the protein families used as learning sets in the construction of the Dayhoff matrix. If the similarity level is different, some other matrix should be used.
As an example we consider the case of nearly identical sequences. For such sequences the values n_{a}^{1}, n_{a}^{2}, n_{aa} are close to each other, while the values ï_{àb} are small compared to n_{a}^{1}, n_{a}^{2} for à a ¹ b. Formula (10) implies that in this case it is useful to choose as a weight matrix the diagonal matrix with elements inversely proportional to letter frequencies. (However, for such almost coinciding sequences the choice of the weight matrix is of no great importance since similar fragments would be surely found with the unit matrix, Dayhoff matrix, or many other matrices.)
Formulas (6) and (10) suggest possible modifications of the Staden approach to the definition of the similarity measure for fragments. In these modifications for the letter substitutions are not set once and forever, but depend of the sequences being considered. Two approaches are possible.
In the first approach the recomputation of weights is made during process of determination of similarity between fragments. This recomputation of weights depends on the sequences as a whole, but not on the fragments under consideration. First, some standard weight matrix r _{àb} (e.g., the Dayhoff matrix) is considered and similar fragments are found. Then the alignment is performed. Hence, the similarity level of those two sequences is determined. If necessary, weights are redefined using the following procedure. Denote by N the number of the alignment positions occupied by letters in both sequences (and not a letter and a gap). Denote by N_{a}^{1} the number of positions occupied by the letter a in the first sequence and by an arbitrary letter in the second sequence; similarly, denote by n_{ b} ^{2} the number of the positions occupied by by the letter b in the second sequence. Finally, denote by N_{ab} the number of the positions occupied by a in the first sequence and by b in the second sequence. The new weights are then defined by the formula
(cf. (3), (10); note that the matrix r
^{’} _{àb} is symmetric since for that very reason its elements in (12) are defined as half sums of the two terms). Then the more accurate search of similar fragment is performed using these new more adequate weights
r
'_{ab}.
When this procedure is applied to independent Bernoulli sequences, the distribution of the random variable A' characterizing the similarity between fragments differs from the distribution of the corresponding random variable A defined using the preset weight matrix r _{àb}. In particular, it is evident that the distribution of A' is shifted to the right compared to the distribution for A, and that although A' has an asymptotically normal distribution N(0, 1) for L —> , N / L —> , it tends to the limit distribution more slowly than A. These facts should be considered during interpretation of observed results.
The described approach has an evident drawback compared to the standard Staden method. It is too complicated. Indeed, in order to find the weights r ^{'}_{ab}, the usual Staden procedure is to be performed and then the sequences are to be aligned, which is a laborous task.
In the second approach the weights depend directly on the fragments that are compared, and not on the entire sequences from which the fragments were taken. Namely, it is suggested to use as the weights the values r _{àb} maximizing the similarity measure between the fragments (see (6)). Then the similarity measure between the fragments A^{f} is defined by formula (7), so that A^{f} =
A{r ^{f}_{àb}}. In this case letter frequencies (that are parts of (6) and (7)) can be computed using either the entire sequences (see (1)) or the fragments under consideration. (See (9). The weights r ^{f}_{àb} and the similarity measure A^{f} = À{r ^{f} _{àb}} are defined by formulas (10) and (11).) It is more convenient to compute not the similarity measure A^{f}, but its square A^{f} ^{2}. In this approach the recomputation of the weights is not performed, so it is less cumbersome. Moreover, the simplicity of formulas (7), (11) also is attractive.
To interpret the experimental results it is necessary to understand the nature of the distribution of the variable A^{f} (or A^{f} ^{2}) when independent Bernoulli sequences are compared. Evidently, the distribution of A^{f} is fundamentally different from the distributions of A and A' discussed above. For example, the value A^{f} (unlike A and A') is automatically nonnegative (see (7)). Of course, the distribution of A^{f} for L —> does not tend to the standard normal distribution N(0, 1).
Consider in more detail the case when the letter frequencies are computed using the entire sequences (i.e. by formula (1)). We analyse the distribution of A^{f} ^{2}. By (7), A^{f} ^{2} is the sum of 20^{2} = 400 positive terms
where
It is not difficult to compute
It can be shown that the variables x _{ab}, are asymptotically normal for L —> . (Therefore, A^{f} ^{2} looks like the x^{2} with 400 degrees of freedom, but by (15), the variances of x _{ab} differ from 1, and, what is even more important, by (16) these values are mutually dependent.) Formulas (14) and (15) imply that
In the case when all letters have equal frequencies p_{a} = q_{b} = 1/20, it is possible to find also the limit value (for L —> ) for the variance Var(A^{f} ^{2}). Namely, computing the eigenvalues of the covariation matrix (15), (16), one sees that one of the eigenvalues equals 0, while all others equal 1. Using the asymptotic normality of the variables x _{ab}, it is possible to prove that in this case
Var(A^{f} ^{2}) —> 2(20^{2}  1) = 798 for L —> .
In the general case we cannot compute the variance Var(A^{f} ^{2}), since the eigenvalues of the covariation matrix cannot be found. However, the limit value of the variance would be certainly close to 798, even if the letter frequencies are not close to one another.
The variable A^{f} ^{2} is not, of course, asymptotically normal for L —> (as mentioned above, it is more close to the variable x^{2} with 399 degrees of freedom). However, since the number of terms in (13) is very large, the distribution of A^{f} ^{2} is close to normal. Therefore in applications one can assume that A^{f} ^{2} has a normal distribution with the mean 399 and the variance 798.
Since the mean E(A^{f} ^{2}) is very large, the distribution of the variable A^{f} also is close to normal with the mean close to
and the variance close to
A similar situation for the distribution of (A^{f} ^{2} and (A^{f} occures in the case when the letter frequencies are computed from the fragments, i.e. by formula (9). However, in this case the mean Å((A^{f} ^{2}) and the variance Var((A^{f} ^{2}) are slightly different:
E((A^{f} ^{2}) = 20^{2}  2 • 20 + 1 = 361, Var((A^{f} ^{2}) ~ 2 • 19^{3}/21 ~ 653
(in the case of equal letter frequencies the variance coincides with the above value). As we have mentioned earlier, this approach has some advantages compared to the previous one (when the weights depend on the entire sequences and not on the compared fragments). But it has also serious drawbacks. One of them is that the value of (A^{f} ^{2} is large not only if the fragments under comparison are very similar, but also when they are very dissimilar (i.e. when there is a tendency for nonsimilar letters to occupy same positions). Therefore, after computing the value (A^{f} ^{2} it is necessary to find out why this value is large (due to the similarity or the dissimilarity of the fragments). The other drawback is that this approach (contrary to both the standard Staden technique and the first alternative technique) cannot be used by the DotHelix algorithm for the fast search of similar fragments in sequences [5].
Hence, both methods have serious drawbacks (besides the fact that they have not been studied analytically in any detail). On the other hand, the problem of the weight choice seems to be not too important for biologists. Nevertheless, the possibility of such approaches is worth bearing in mind.
References
[1] R. Staden, An interactive graphics program for comparing and aligning nucleic acid and ammo acid sequences, Nucl. Acids Res. 10 (1982), 29512961.
[2] R. F. Doolittle, Of URFs and ORFs. A primer on how to analyze derived amino acid sequences, University Science Books, Mill Valley, 1986.
[3] W. M. Fitch, An improved method of testing for evolutionary homology, J. Mol. Biol. 16 (1966), 916.
[4] M. D. Dayhoff, Atlas of protein sequence and structure, vol. 5, suppl. 3, Natl. Biomed. Res. Found., Washington, D.C., 1978.
[5] A. M. Leontovich, L. I. Brodsky, and A. E. Gorbalenya, Construction of the complete map of local similarity for two polymers, Biopolimery i Kletka 6 (1990), no. 6, 1421. (Russian)
[6] G. E. Shulz and R. H. Schirmer, Principles of protein structure, SpringerVerlag, New York, 1979.