Services

Medline

About
Multiple Alignment
Stable
Homology search
Multiple alignment
Phylogenetic tree
Experimental
Evolution rate
Pubmed evolution
Ecology JS charts
Obsolete
Nest  homology modeling
Scratch / accpro
Scratch / sspro
Propka / pdb2pqr
Gene annotations
SNP Stats
Utilities
Readseq format conversion
Distance matrix
Translate / Reverse complement
Draw alignment
Keyword search
MAFFT offers various multiple alignment strategies. They are classified into three types, (a) the progressive method, (b) the iterative refinement method with the WSP score, and (c) the iterative refinment method using both the WSP and consistency scores. In general, there is a tradeoff between speed and accuracy. The order of speed is a > b > c, whereas the order of accuracy is a < b < c. The results of benchmarks can be seen here. The following are the detailed procedures for the major options of MAFFT.
(a) FFTNS1, FFTNS2 — Progressive methods
prog.png
These are simple progressive methods like ClustalW. By using the several new techniques described below, these options can align a large number of sequences (up to ∼5,000) on a standard desktop computer. The qualities of the resulting alignments are shown here. The detailed algorithms are described in Katoh et al. (2002).
FFTNS1
mafft retree 1 input_file > output_file
or
fftns retree 1 input_file > output_file
is the simplest progressive option in MAFFT and one of the fastest methods currently available. The procedure is: (1) make a rough distance matrix by counting the number of shared 6tuples (see below) between every sequence pair, (2) build a guide tree and (3) align the sequences according to the branching order.
FFTNS2
mafft retree 2 input_file > output_file
or
fftns input_file > output_file
The distance matrix used in FFTNS1 is very approximate and unreliable. In FFTNS2, (4) the guide tree is recomputed from the FFTNS1 alignment, and (5) the second progressive alignment is carried out.
The following techniques are used to improve the performance.
FFT approximation. (Not yet written) See Katoh et al. (2002).
kmer counting. To accelerate the initial calculation of the distance matrix, which requires a CPU time of O(N2) steps, a rough method similar to the 'quicktree' option of ClustalW is adopted, in which the number of kmers shared by a pair of sequences is counted and regarded as an approximation of the degree of similarity. MAFFT uses the very rapid method proposed by Jones et al. (1992) with a minor modification (Katoh et al. 2002): (1) The 20 amino acids are compressed to 6 alphabets, according to Dayhoff et al. (1978), and (2) MAFFT performs the second progressive alignment (FFTNS2) in order to improve the accuracy.
Modified UPGMA. A modified version of UPGMA is used to construct a guide tree, which works well for handling fragment sequences.
The second progressive alignment. The accuracy of the second progressive alignment (FFTNS2) is slightly higher than that of the first progressive alignment (FFTNS1) according to the BAliBASE test, but the amount CPU time required by FFTNS2 is approximately two times longer than that by FFTNS1.
(b) FFTNSi, NWNSi — Iterative refinement method
The accuracy of progressive alignment can be improved by the iterative refinement method (Berger and Munson 1991, Gotoh 1993). A simplified version of PRRN is implemented as the FFTNSi option of MAFFT. In FFTNSi, an initial alignment by FFTNS2 is subjected to an iterative refienment process.
FFTNSi (max. 1,000 cycles)
mafft maxiterate 1000 input_file > output_file
or
fftnsi maxiterate 1000 input_file > output_file
The iterative refinement is repeated until no more improvement in the WSP score is made or the number of cycles reaches 1,000.
FFTNSi (max. 2 cycles)
mafft maxiterate 1000 input_file > output_file
or
fftnsi input_file > output_file
As most of the quality of improvement is obtained in the early stage of the iteration, this option is also useful (default of the fftnsi script).
Objective function. The weighted sumofpairs (WSP) score proposed by Gotoh is used.
Treedependent partitioning. (Not yet written) See Hirosawa et al.
Effect of FFT. To test the effect of the FFT approximation, we also implemented the NWNSx options, in which the FFT approximation is disabled, but the other procedures are the same as those in the corresponding FFTNSx. There was no significant reduction in the accuracy by introducing the FFT approximation (Katoh et al. 2002).
(c) LINSi, EINSi, GINSi — Iterative refinement methods using WSP and consistency scores
In order to obtain more accurate alignments in extremely difficult cases, three new options, LINSi, GINSi and EINSi, have been added to recent versions (v.≥5) of MAFFT. These options use a new objective function combining the WSP score (Gotoh) explained above and the COFFEElike score (Notredame et al.), which evaluates the consistency between a multiple alignment and pairwise alignments (Katoh et al. 2005).
For pairwise alignment, three different types of algorithms are implemented, global alignment (NeedlemanWunsch), local alignment (SmithWaterman) with affine gap costs (Gotoh) and local alignment with generalized affine gap costs (Altschul). The differences in the accuracy values among these methods are small for the currently available benchmarks, as shown here. However, each of them has different characteristics, according to the algorithm in the pairwise alignment stage:
EINSi
mafft genafpair maxiterate 1000 input_file > output_file
or
einsi input_file > output_file
is suitable for alignments like this:
oooooooooXXXXXXXooooooooooooooooooXXXXXooooooooooooooooooooooooooooooooo
XXXXXXXXXoooooooooooooooooooooooooooooooooXXXXXooooooooooooooooooooooo
ooooooooXXXXXXXXXXXXXXoooooooooooooooooooooo
XXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXX
XXXXXXXXXXX
where 'X's indicate alignable residues, 'o's indicate unalignable residues and ''s indicate gaps. Unalignable residues are left unaligned at the pairwise alignment stage, because of the use of the generalized affine gap cost. Therefore EINSi is applicable to a difficult problem such as RNA polymerase, which has several conserved motifs embedded in long unalignable regions. As EINSi has the minimum assumption of the three methods, this is recommended if the nature of sequences to be aligned is not clear. Note that EINSi assumes that the arrangement of the conserved motifs is shared by all sequences.
Updated! (2015/Jun) Parameters for EINSi have been changed in version 7.243. The new parameters work better for aligning a set of long sequences and short sequences that are closely related to each other. To disable this change, add the oldgenafpair option.
With the new parameters, EINSi may be able to align multiple cDNAs and multiple genomic sequences of a gene from closely related species. However, it consumes large memory space when the sequences are long.
LINSi
mafft localpair maxiterate 1000 input_file > output_file
or
linsi input_file > output_file
is suitable to:
ooooooooooooooooooooooooooooooooXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXooooooooooo
ooooooooooooooXXXXXXXXXXXXXXXXXXXXooooooooooo
ooooooooooooooooooooooooXXXXXXXXXXXXXXXXXXXXXXoooooooooooooooooo
XXXXXXXXXXXXXXXXXXXXXXX
LINSi can align a set of sequences containing sequences flanking around one alignable domain. Flanking sequences are ignored in the pairwise alignment by the SmithWaterman algorithm. Note that the input sequences are assumed to have only one alignable domain. In benchmark tests, the ref4 of BAliBASE corresponds to this. The other categories of BAliBASE also correspond to similar situations, because they have flanking sequences. LINSi also shows higher accuracy values for a part of SABmark and HOMSTRAD than GINSi, but we have not identified the reason for this.
GINSi
mafft globalpair maxiterate 1000 input_file > output_file
or
ginsi input_file > output_file
is suitable to:
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXX
GINSi assumes that entire region can be aligned and tries to align them globally using the NeedlemanWunsch algorithm; that is, a set of sequences of one domain must be extracted by truncating flanking sequences. In benchmark tests, SABmark and HOMSTRAD correspond to this.
Consistency score. The COFFEE objective function was originally proposed by Notredame et al. (1998), and the extended versions are used in TCoffee and ProbCons. MAFFT also adopts a similar objective function, as described in Katoh et al. (2005). However, the consistency among three sequences (called 'library extension' in TCoffee) is currently not calculated in MAFFT, because the improvement in accuracy by library extension was limited to alignments consisting of a small number (<10) of sequences in our preliminary tests. If library extention is needed, then please use TCoffee or ProbCons.
Consistency + WSP. Instead, the WSP score is summed with the consistency score in the objective function of MAFFT. The use of the WSP score has the merit that a pattern of gaps can be incorporated into the objective function. This is probably the reason why MAFFT achieves higher accuracy than ProbCons and TCoffee for alignments consisting of many (∼10  ∼100) sequences. This suggests that the pattern of gaps within a group to be aligned is important information when aligning two groups of proteins (and evaluating homology between distantly related protein families).
Parameters
Scoring matrix for amino acid alignment. The BLOSUM62 matrix is adopted as a default scoring matrix, because this showed slightly higher accuracy values than the BLOSUM80, 45, JTT200PAM, 100PAM and Gonnet matrices in SABmark tests.
Scoring matrix for nucleotide alignment. The default scoring matrix is derived from Kimura's twoparameter model. The ratio of transitions to transversions is set at 2 by default. Other parameters can be used, but have not yet been tested.
Gap penalties for proteins. The default gap penalties for amino acid alignments have been changed in v.4.0. Note that the current version of MAFFT returns an entirely different alignment from v.<4.0. In v.4.0, two major gap penalties (op [gap open penalty] and ep [offset value, which functions like a gap extension penalty, see the mafft3 paper for definition]) were tuned by applying the FFTNS2 option to a part of the SABmark benchmark. We adopted the parameter set (op 1.53 ep 0.123) optimized for SABmark, because this works better for other benchmark (HOMSTRAD, PREFAB and BAliBASE) tests than the previous one (ep 2.4 ep 0.06). Other parameters might work better in other situations. Consistencybased options have more parameters (LINSi has four more parameters and EINSi has six more parameters, as explained here). We determined these additional parameters so that the SmithWaterman alignment function used in LINSi returns a local alignment similar to that generated by FASTA, but we have not closely tuned them yet. In our tests using SABmark, the accuracy values can be improved by 23% by tuning these parameters, but this improvement may result from overfitting.
Gap penalties for RNAs. The default gap penalties for nucleotide alignment have changed in v.5.6. Note that the current version of MAFFT returns an entirely different alignment from v.<5.6. In the former versions (v.<5.6), the default gap penalties for nucleotide alignments were set at the same values as those for amino acid alignments. According to BRAliBASE, these penalties result in very bad alignments for RNAs. The newer versions (v.≥5.6) use a different penalties for nucleotide alignment; the penalty values are set to three times larger than those for amino acids. This is not yet the optimal value for BRAliBASE. The BRAliBASE score can be improved by closely tuning the penalty values, but we have not adopted the optimized penalties, because we are not sure whether they are applicable to a wide range of problems.
Average score or log expectation score. A different parameter set from from that described above is used in MUSCLE, which has an algorithm similar to that of NWNSi. MUSCLE improved in the accuracy of multiple sequence alignment by introducing better parameters than those of the previous version (v3.89) of MAFFT (shown in gray letters in these tables). The latest version of MAFFT uses the readjusted gap penalties (see above) with a conventional average score. As Wallace et al. (2005) reported that the log expectation score outperforms the conventional average score, we are planning to examine the effectiveness of the log expectation score proposed in MUSCLE.
Maffthomologs
The accuracy of an alignment of a few distantly related sequences is considerably improved when they are aligned together with their close homologs. The reason for the improvement is probably the same as that for PSIBLAST. That is, the positions of highly conserved residues, those with many gaps and other additional information are provided by close homologs. According to Katoh et al. (2005), the improvement by adding close homologs is 10% or so, which is comparable to the improvement achieved by incorporating the structural information of a pair of sequences. Maffthomologs in the mafft server works like this:
Collect a number (50 by default) of close homologs (E=1e10 by default) of the input sequences.
Align the input sequences and homologs together using the LINSi strategy.
Remove the homologs.
maffthomologs
A service based on a similar idea is also available in PRALINE. Note that maffthomologs aims only to improve the alignment accuracy, but not to comprehensively collect the fulllength sequences of homologs. Thus, the resulting alignment with homologs by maffthomologs is not appropriate for evolutionary analyses. For such a purpose, services with complete sequences, such as PipeAlign, are more appropriate.
Mapping a nucleotide base to a complex number
When the FFT approximation is used in nucleotide alignment, MAFFT v.5.743 and previous versions convert a nucleotide base to a set of four real numbers. For example, the sequence AACGTCT is converted to a set of four arrays: (1,1,0,0,0,0,0), (0,0,1,0,0,1,0), (0,0,0,1,0,0,0) and (0,0,0,0,1,0,1). In v.5.744, this will be replaced with a more efficient method, according to Rockwood et al. (2005). That is, AACGTCT is converted to (i,i,1,1,i,1,i), where i is an imaginary unit.
Correction formula for the 6mer distance
Some options (FFTNS2, PartTree, etc.) use the number of 6tuples shared between every sequence pair as the basis of the guide tree. Mafft35 computes the distance Dij between sequences i and j as
Dij = 1  Sij / min( Sii, Sjj )
where Sij is the number of shared 6tuples by sequences i and j. The expected value of Dij between two unrelated sequences depends on the lengths of the sequences; when comparing a very short sequence and a very long sequence, Dij is near 0 by chance, even if the sequences are unrelated to each other. The situation is depicted in the figure below, where the red points indicate Dij calculated for randomly generated aa sequences (by ROSE) with various pairs of lengths. The cyan surface is
f(x,y) = y/x * 0.1 + 10000 / ( x + 10000 ) + 0.01
where x and y are the lengths of longer and shorter sequences, respectively. The function f(,) was empirically determined so that the surface is close to the red points. Mafft6 uses Dij' instead of Dij,
Dij' = Dij / f(x,y).
brishur.com