Belozersky Institute


Russian EMBnet Node



Multiple Sequence Alignment is the arrangement of several protein or nucleic acid sequences with postulated gaps so that similar residues (in one-letter code) are juxtaposed.

GeneBee algorithm of MULTIPLE ALIGNMENT proceeds in four steps:

  • construction of pairwise MOTIFS
  • construction of MULTIPLE MOTIFS (of thickness exceeding 2)
  • construction of MULTIPLE ALIGNMENTS from previously obtained MOTIFS and SUPERMOTIFS and consequent selection of the best alignment.



  3. POWER
  4. TOTAL and LOCAL alignments




The multiple alignment of several biopolymer sequences is the arrangement of several protein or nucleic acid sequences with postulated gaps so that similar residues (in one-letter code) are juxtaposed. In this program the alignment is first constructed of the so-called MOTIFS (see below) and later is improved by a dynamic programming procedure.

1.2. MOTIFS.
In some fragments of the set of aligned sequences there exists a similarity between all or some layers (sequence fragments) of the alignment. These regions of similarity without gaps are called MOTIFS. MOTIFS are characterized by length (number of letters in each fragment) and thickness (number of layers in a MOTIF).

1.3. POWER.
The degree of similarity between MOTIF layers is determined in this program by probabilistic formulas. To estimate it we introduce the notion of the POWER of a MOTIF. This notion is fundamental for this program. POWER relates to the possibility of receiving a sum of mismatch weights along an alignment for random sequences of the same length. Analogous reasoning allows to introduce a notion of the POWER of an ALIGNMENT. In various procedures of the algorithm either POWER or the traditional sum of mismatch weights serve as the actual value of an ALIGNMENT.

1.4. TOTAL and LOCAL alignments.
In the fist step of our algorithm for each pair of sequences all similar fragments are found by the DotHelix procedure. They form motifs with thickness 2. Similarities between sequences and motifs of thickness 2 form motifs of thickness 3. Further, motifs of thickness 4 are obtained from motifs of thickness 2 and 3 by linking them with suitable sequence fragments or motifs. So the full set of the best motifs could be constructed.
The draft result of an alignment can be represented as a chain of consistent (with regards to the order of letters in sequences) motifs, interrupted by the necessary number of gaps. If an alignment cannot be extended by addition of any new motifs at either end, it is called TOTAL. If an alignment can be extended at least at one end, it is called LOCAL alignment.

Often close neighboring motifs would extend one another if some gaps are introduced. Such region (called SUPERMOTIF) can have greater similarity between fragments forming it, than the individual motifs. Supermotifs and motifs are instances of LOCAL alignments. Note, that in course of search for functionally important regions, the knowledge of motifs and supermotifs is of greater importance for a biologist than an alignment itself. Thus the procedure for motif and supermotif determination has an independent value.

The algorithm proceeds in five steps:

  1. construction of pairwise MOTIFS;
  2. construction of multiple MOTIFS (with thickness exceeding 2);
  3. forming of SUPERMOTIFS from these MOTIFS;
  4. construction of MULTIPLE ALIGNMENTS from previously obtained MOTIFS and SUPERMOTIFS and selection of the best alignment.
  5. refinement of the DRAFT alignment by iterative dynamic programming procedure.

The 4th step can be performed in 2 ways.

  1. The method is an automated input of MOTIFS and SUPERMOTIFS ordered by their quality - GREEDY ALGORITHM.
  2. The method is a variant of dynamic programming, that employs a similarity measure of pairwise MOTIFS. In this method the quality of a TOTAL alignment is determined by the sum of mismatch weights accounting for gap penalties.


As it has been said above, we define the quality of various local similarities based on probabilistic considerations. The most fundamental is the notion of MOTIF POWER. It is defined by the following formula:

       S S          ( W{b(r1,k), b(r2,k)} - M )
       k 1<r1<r2<K
A = -------------------------------------------    (1)
               sqrt (DL)

S is sum,
A - POWER of the considered MOTIF,
L - length of a MOTIF, i.e. the common length of fragments forming it,
K - thickness of a MOTIF, that is the number of layers in it,
b(r,k) are the k-th letters (residues) forming the r-th layer,
W{B,C} is the weight of substitution of a letter B to letter D, characterizing the similarity of corresponding residues,
M and D are user-defined constants.

Increase of power corresponds to increase of similarity between motif layers.

For random independent Bernoulli sequences M should be mean substitution weight {M(W)}, D - the variance of sum of weights situated in one column of a motif. Then the power A can be considered as a random variable which mean equals 0 and variance equals 1. When this choice of the constants M and D is made, the power A is the normalized sum over all motif positions of weights for pairwise substitutions of letters situated in one column. The probabilistic meaning of A is following:

ln{P(A>A0} is equivalent to -(A0 /2)**2,
(we use sign "**" for "raising to power").

Biological sequences, especially similar ones, cannot be assumed to be independent. Thus in some cases it is useful to set M to values larger than mean of substitution weights. It allows not to mark as similar substantially non-similar fragments.

The weights W{A,C} form a weight matrix defined by user. Its choice can vary. In the case of protein sequences one of the most widely used is the Dayhoff matrix in which weights are set based on substitution frequencies in already known homologous proteins. In other cases weights correspond to physico-chemical properties of amino acids. Simultaneous use of several matrices in course of the alignment construction is also possible. In this case the algorithm links together independently obtained motifs.

The notion of POWER as defined by formula (1) is a base for other definitions: CORRECTED MOTIF POWER, CONDITIONAL POWER, ALIGNMENT POWER. The latter values are the ones directly used in this program.

1.1. Corrected power
The power A characterizes the probability to obtain the given sum of weights for an individual motif taken from Bernoulli sequences. However, as the thickness K increases, the number of possible motifs becomes larger for a fixed length. Thus for Bernoulli sequences it is simpler to encounter powerful motifs among thick ones than among thin ones. So thick motifs would prevail in the list of the most powerful motifs and their importance would be overstated. In order to avoid that, we perform a correction substituting the power A for the CORRECTED POWER (A_cor).

The formula for A_cor is based on the following informal reasoning. Let's assume for simplicity that all sequences have similar length n. The number of possible relative shifts of sequences, when motifs of thickness k are considered has the order n**(k-1), and it is approximately n**(k-1) times greater than that for motifs of thickness 2, for which no correction is performed. Thus the correction should decrease the corresponding probability P(A) at n**(k-2) times. In order to do that it is necessary to decrease half of the squared power by (k-2)*ln(n). Thus we get the following formula for corrected power:

A_cor = sqrt{A**2 - 2*(k-2)*ln(n)}    (2).

1.2. Alignment power
We characterize similarity of sequence fragments participating in an alignment (LOCAL or TOTAL) by its POWER. It is defined analogously to the MOTIF POWER A (formula 1). When weights are summed along the alignment path, the weight of a pair of corresponding residues b(r1,k) and b(r2,k) is taken from the matrix (as in (1)) if this pair occurs in a motif, and equals the mean weight M(W) otherwise. If one of the symbols b(r1,k) and b(r2,k) is a gap, then the weight equals the user-defined gap penalty, which in any case does not exceed M(W). In the formula for the alignment power L is set to the alignment length.

One of the reasons for introducing the gap penalty is the correction similar to one in the corrected motif power A_cor. The problem is that the number of possible alignments sharply increases as the allowed number of gaps increases. Thus for random independent Bernoulli sequences the maximum of the sum of weights along the alignment path increases, and so the power of alignments with many gaps is too large. In order to compensate for this effect, the gap penalty is introduced. Unfortunately, unlike the situation with the corrected motif power, it is difficult to estimate theoretically the gap penalty, since sums of weights along different path are strongly dependent.

As it was stated, the first step of our procedure is the construction of pairwise motifs. This is performed by DOTHELIX procedure (Leontovich et al., 1993).

Thresholds for power and lengths of motifs are set by user. Sets of motifs can be constructed employing different matrices.

When the standard constant M=M(W) is chosen, a large number of the obtained motifs will be "noise", despite their formal statistical significance. This fact, obstacling further alignment construction by increase of the search, is caused by the null hypothesis of independence of the sequence being aligned (indeed, the very desire to align sequences is an evidence of their dependence!). Clearly, for dependent (similar) sequences the mean mismatch weight increases as compared to formula (2). The program allows a user to account for this fact by increasing M (by setting parameter "Minimum homology ratio" in the interval from 0 to 1: 0.01 is recommended). Experiments demonstrate that this procedure allows to filter out a majority of noise motifs.

Thus it is possible to find motifs for various M and different weight matrices. A user can lump together all obtained motifs, and it increases possibilities of alignment.

The described procedure is fairly slow, since it has to process a very large number of relative sequence shifts. To accelerate this process, it's possible to consider only the most "promising" shifts, which are processed further. Of course, this pre-processing can lead to the loss of some motifs. Such sorting of registers is performed by a Lipman-Pearson type procedure (Lipman and Pearson, 1985).

Consider it in more detail. First, the letters (residues) are divided into several classes ("colors"). This coloring is consistent with the weight matrix: substitution weights within a class should be sufficiently large. For each shift of sequences the following procedure is performed: for each position of a window (of length, say, 7) the value t(t-1)/2 is calculated, where 't' is the number of color matches within this window.

Second, the sum of these values for all window positions for a given shift is calculated. To select a given shift the ratio of this sum to maxsum should be more than user-defined value (parameter "Coincidence ratio").

The described characteristic is more preferable as opposed to the more widely used number of perfect matches or number of color matches (that is, the sum of t), since it is sensitive to determination of shifts setting in correspondence locally similar regions. On the other hand, our procedure is sufficiently fast, since the above characteristic can be calculated using pre-formed looking tables of all color pairs (the distance between which does not exceed the window length) for all sequences.

A user-defined number of shifts is a subject for further processing.

Generally speaking, in order to find all possible motifs of different lengths, all comparison registers should be analyzed, and for each of them the set of non-intersecting sufficiently powerful motifs should be found, as it is done in DotHelix procedure. However, this procedure requires a prohibitively large time even for a quite small number of sequences k, since the number of registers is proportional to n**(k-1), where n is the average sequence length.
In order to avoid the full register search, we employ a procedure for motif determination. This procedure is based on the fact that in a sufficiently powerful motif all, or almost all pairs of layers are significally similar, so a multiple motif should be the union of pair motifs.

Consider this algorithm in more details. It is sufficient to describe the procedure of linking of two motifs (or a motif and a sequence fragment). If the motifs have no common layers, the DotHelix procedure is applied (each motif is considered as a unit). Linking of a fragment to a motif is performed in the same way. If motifs have common layers, then (if it's possible) the natural correspondence between layers is established, and then DotHelix is applied to thus defined comparison register.

Three options of the parameter "How to process motifs" relate to three nested possible ways of motif construction:
- to create all pair motifs ("Sequence-sequence");
- to extend pair motif by one additional layer ("Motif-sequence");
- to link all possible pairs of existed motifs ("Motif-motif").

In all three cases, prior to processing by DotHelix, motifs are slightly extended in order to allow a more precise identification of motif boundaries.

The result of this procedure is a set of motifs of various lengths. Each motif is characterized by the corrected power (A_cor), calculated when the motif was constructed. Motifs with the power exceeding certain thresholds are retained.

Incomplete thickness of a motif means that only some sequences possess a good similarity in this region.

Intermediate storage of motifs is based on the stack principle, so that only some best (with regards to the corrected power) motifs are retained. When a new motif is constructed, the stack is reordered and the worst of the stored motifs are deleted. It allows to save memory and calculation time.

Speaking metaphorically, the suggested method first selects areas of potential similarity, based on pairwise comparisons. Then some of these areas become more "visible" due to their similarity with fragments of other sequences, while the others are not supported by additional similarities and disappear from the stack.

The accelerating procedure described in Section 2 can be applied to the construction of multiple motifs as well. It can be further accelerated by considering color matches only in conserved positions of motif (i.e. positions colored almost entirely in one color). The percentage of one-colored letters in a conservative position is a user-defined parameter "Minimum letter frequence".

4. CONSTRUCTION OF ALIGNMENTS (supermotifs and total alignments).
As it was said, each DRAFT alignment (partial or total) is a consistent chain of motifs. Each aligned sequence consists of fragments similar to fragments of other sequences (and thus belonging to motifs), fragments not belonging to any motif, and gaps. Since it is hardly possible that biological sequences can possess "negative" similarity (one with a negative power), i.e. improbable in a random sequence dissimilarity, it is reasonable to assume that the order of letters and gaps in motif-free fragments is unimportant. In this program in such regions first letters are written, and then the necessary number of gaps is added. The only exclusion from this rule is the beginning of the sequence, which is indented to the right.

4.1. Construction of supermotifs
Supermotifs are formed at the first step of construction of the total alignment . Supermotif is defined as a partial alignment, which power exceeds powers of all its subalignments. In particular, its power exceeds powers of all motifs belonging to it. In the program two methods for linking of motifs into supermotifs are implemented: greedy and cluster.

In the greedy method first the most powerful motif is considered, then it is linked with such motif, that the power of the constructed supermotif exceeds powers of both motifs belonging to it. Then the obtained supermotif is linked to the third motif (if it is possible) etc. This procedure is performed while the supermotif power increases.

In the cluster method the following scheme is employed. Initially each motif is considered as a supermotif. At each step two such supermotifs are linked, that the power of the obtained supermotif exceeds powers of its constituents, and it is the maximum one among all possible linkages.

It should be noted, that not all supermotifs can be constructed by these procedures. Construction of algorithms that would allow us to construct all existing supermotifs is a subject of further research.

4.2. Construction of total DRAFT alignments.
Consider now construction of a total alignment. As it has been mentioned in Section 1, two methods are currently implemented in the program: greedy and dynamic programming.

4.2.1. Greedy procedure
Construction of total alignments by the greedy procedure is similar to the greedy construction of supermotifs, but it is not required that the power of obtained partial alignments necessarily increases. The procedure can be performed in several steps. First a total alignment of the most powerful motifs and supermotifs is constructed. Then it is supplemented by a new set of motifs and the alignment is modified without disturbing already aligned fragments. These new motifs can be obtained by either softening requirements to the power, or with the use of some other matrix, or allowing a lower homology threshold, or by combining some of the above-mentioned parameters. Thus we obtain the second alignment, add to it more motifs, etc. Usually at the first step the most popular - Dayhoff - - matrix is used.

This algorithm can be generalized in the following way (the derived procedure would not be a greedy one). First, one of t best (super)motifs is taken, where t is defined by a user. Then each obtained partial alignment (currently consisting of a single supermotif) is linked by a new supermotif. This is done by one of t ways so that the power of the alignment, if possible, increases. The procedure continues until there are supermotifs that can be still linked to an alignment. If t=1, this is the above described greedy procedure, If t= number of motifs in the stack, this procedure reduces to the full search. As t increases, the calculation time also increases, but the quality of alignment improves.

4.2.2. Dynamic programming (Needleman - Wunsch).
A dynamic programming procedure has been implemented for construction of a total alignment. First, we describe this method in the case of two sequences.

We represent an alignment as a chain of motifs. These motifs can belong to an alignment not only completely, but partially as well. It is taken into account in our program, but below we for simplicity assume that motifs enter an alignment entirely. Alignment quality is measured by its "price", that is defined as a sum of centered weights W(al,bl) of mismatches along the alignment path minus gap penalties. For each motif it is possible to point out motifs that can precede or follow it in an alignment. Thus the set of motifs is a partially ordered set. If a partial alignment has been constructed, further aligning depends only on the most recently aligned fragment. Thus the dynamic programming principles, and in particular, a Needleman-Wunsch type procedure can be employed. For each motif we find the maximum value of alignments starting at the beginning of the sequences and ending at this motif. We do not consider all transitions from a current motif to subsequent ones: among motifs having close registers only transition to the nearest one is allowed.

If many sequences are aligned, a complete alignment can also be represented as a chain of pairwise motifs (since a multiple motif can be decomposed into pairwise ones). However, unlike the previous case, alignment depends not only on the last aligned motif, but on some previous motifs. Thus in this case the dynamic programming principle cannot be applied.

In this connection we introduce the following notion. Each partial alignment is set in correspondence with its FRONT, defined as the set of the rightmost in each sequence pairwise motifs belonging to the alignment, and their relative positions in the alignment. Thus in order to define a front, it is necessary, first, to set for each sequence such motif, that one of its layers belongs to this sequence, and, second, to define the position of this motif in the alignment relative to the initial position of the leftmost motif in this front. The initial position of any frontal motif in a sequence, to which it is ascribed, should be situated rightwards relative to beginnings in this sequence of all other frontal motifs. One motif can be ascribed to two sequences (recall that all motifs are pairwise ones).

It is simple to demonstrate that further construction depends only on the front of the already aligned part. Thus a dynamic programming procedure is applicable.

It is clear, that as the number of motifs increases, the number of fronts also increases - at a very fast rate (roughly as [N/t]**t) , where 'N' - number of motifs, 't' - - number of sequences). Thus it seems to be necessary to bound the number of fronts by a reasonable stack size, so that alignment of natural sequences would be possible.

4.2.3. Direct method.

4.3. Construction of the REFINED alignment.

The REFINED alignment is constructed by the iterative improvement of the DRAFT alignment. Each layer (sequence) of the DRAFT alignment is aligned against all other alignment's layers taken as a single batch by the dynamic programming procedure. The process of iteration is completed when the quality of alignment doesn't become any better.

The basic multiple motifs were inserted while constructing the DRAFT alignment. These basic motifs contain consensus (conservative) sub-columns. This is exactly why the DRAFT alignment is a good starting point for the iterative process of the improvement of alignments by the consequtive procedures of dynamic programming.

The euristic reasons are following. Let's call the consensus sub-columns "embryos". As a rule, the largest faction of "embryos" does correspond to the correct alignment. However, some of the "embryos" can be incorrect. The task of the iterative improvement is to expand these correct "embryos"/sub-columns up to the thickness of a complete alignment. During this process incorrect "embryos" would disappear, because they won't find good correspondence in the other lines, trying to expand. The residues, participating in these incorrect "embryos", would be forced to move to the other columns, so that they could provide the strengthening of the correct "embryos".


There are several matrices inplemented in GeneBee. You may choose any of them - Dayhoff, Blosum62,
or Johnson - at the prompt in the full query page. The default matrix is Dayhoff.

(modified 250 PAM matrix from Atlas of Protein sequence and structure, (1978), v.5, suppl. 3, pp.345-358):

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A   12
C    8 22
D   10  5 14
E   10  5 13 14
F    6  6  4  5 19
G   11  7 11 10  5 15
H    9  7 11 11  8  8 16
I    9  8  8  8 11  7  8 15
K    9  5 10 10  5  8 10  8 15
L    8  4  6  7 12  6  8 12  7 16
M    9  5  7  8 10  7  8 12 10 14 16
N   10  6 12 11  6 10 12  8 11  7  8 12
P   11  7  9  9  5  9 10  8  9  7  8  9 16
Q   10  5 12 12  5  9 13  8 11  8  9 11 10 14
R    8  6  9  9  6  7 12  8 13  7 10 10 10 11 16
S   11 10 10 10  7 11  9  9 10  7  8 11 11  9 10 12
T   11  8 10 10  7 10  9 10 10  8  9 10 10  9  9 11 13
V   10  8  8  8  9  9  8 14  8 12 12  8  9  8  8  9 10 14
W    4  2  3  3 10  3  7  5  7  8  6  6  4  5 12  8  5  4 27
Y    7 10  6  6 17  5 10  9  6  9  8  8  5  6  6  7  7  8 10 20

Unique Identifier: 93066354 (MEDLINE)
Authors: Henikoff S. Henikoff J. G.
Institution: Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, WA 98104.
Title: Amino acid substitution matrices from protein blocks.
Source: Proceedings of the National Academy of Sciences of the United States of America., (1992), v.89, pp.10915 -10919).
Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A    8
C    4 13
D    2  1 10
E    3  0  6  9
F    2  2  1  1 10
G    4  1  3  2  1 10
H    2  1  3  4  3  2 12
I    3  3  1  1  4  0  1  8
K    3  1  3  5  1  2  3  1  9
L    3  3  0  1  4  0  1  6  2  8
M    3  3  1  2  4  1  2  5  3  6  9
N    2  1  5  4  1  4  5  1  4  1  2 10
P    3  1  3  3  0  2  2  1  3  1  2  2 11
Q    3  1  4  6  1  2  4  1  5  2  4  4  3  9
R    3  1  2  4  1  2  4  1  6  2  3  4  2  5  9
S    5  3  4  4  2  4  3  2  4  2  3  5  3  4  3  8
T    4  3  3  3  2  2  2  3  3  3  3  4  3  3  3  5  9
V    4  3  1  2  3  1  1  7  2  5  5  1  2  2  1  2  4  8
W    1  2  0  1  5  2  2  1  1  2  3  0  0  2  1  1  2  1 15
Y    2  2  1  2  7  1  6  3  2  3  3  2  1  3  2  2  2  3  6 11
Unique Identifier: 94016587 (MEDLINE)
Authors: Johnson M. S. Overington J. P.
Institution: Department of Crystallography, Birkbeck College, University of London, U.K.
Title: A structural basis for sequence comparisons. An evaluation of scoring methodologies. Source: Journal of Molecular Biology. 233(4):716-38, 1993 Oct 20.
A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences. This matrix is based on the tabulation of 207,795 amino acid replacements observed in 65 homologous sets of structurally aligned three-dimensional structures (235 proteins). The majority of the data is from structural comparisons where there is between 15 and 40% sequence identity. As a result, a scoring matrix such as the one devised here should provide a sensitive basis for the comparison of amino acid sequences and the search for homologous sequences in amino acid databases. In order to assess the value of this matrix we have made a comparative analysis with 12 other published scoring matrices that have been used for the alignment of protein amino acid sequences. We find that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences and the accuracy of alignments.

    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
A  16
C   6 26
D   8  0 18
E   9  3 12 18
F   7  5  3  3 20
G   9  2  8  7  1 18
H   7  2  9  7  8  7 22
I   8  2  5  5 10  4  5 18
K   9  1  8 11  4  6 10  5 17
L   6  1  2  4 12  3  5 12  6 17
M   8  5  4  7  9  5  7 12  8 14 21
N   8  2 12  9  6  8 11  5 10  5  6 18
P   9  1  9  8  5  7  5  4  9  7  0  7 20
Q   9  3  9 12  3  7 11  3 11  5  9  9  6 19
R   8  4  6 10  4  7 10  4 13  6  6  8  6 12 20
S  10  2 10  8  5  8  7  5  8  5  5 11  9  9  9 16
T   9  4  8  9  5  6  7  7 10  5  7 10  8  9  8 12 17
V   9  5  5  6  8  4  6 14  6 12 10  4  5  6  5  5  8 17
W   4  1  4  2 13  3  6  6  4  9  9  4  2  2  6  4  0  5 25
Y   6  2  6  6 13  4  9  7  6  7  8  7  3  5  8  6  7  8 12 20

Unitary matrix for DNA/RNA

   A  C  G  T  U
A 10
C  0 10
G  0  0 10
T  0  0  0 10
U  0  0  0 10 10


Go to FULL QUERY page


Last updated - July 12, 1999.