Benchmark dataset

The following datasets, HomFam and OXFam, were used as benchmark datasets in "Application of the MAFFT sequence alignment program to large data - reexamination of the usefulness of chained guide trees".

HomFam
This HomFam dataset is a modified version of the original HomFam dataset constructed by the authors of Clustal Omega. It contains 89 HOMSTRAD families as reference multiple sequence alignments, together with the corresponding Pfam sequences to be aligned. Secondary-structure information from HOMSTRAD (α-helix, β-strand, or 3₁₀-helix) is encoded as capital letters in the reference alignments. Families containing repeated sequences were removed, and the character 'U' in the sequences was replaced with 'X'. The order of sequences was randomized in every sequence set to prevent artifacts caused by the input sequence order in the MSA calculation.
OXFam
The OXFam dataset contains 165 OXBench reference alignments and their corresponding Pfam sequences. The construction procedure was almost the same as that for HomFam. During construction, 53 sequence families that possibly included repeated sequences, were multi-domain families, or were homologous to other families were excluded from the original 218 OXBench sequence families. As with HomFam above, structurally conserved regions (SCRs) are encoded as capital letters in the reference alignments. The order of sequences was randomized in every sequence set to prevent artifacts caused by the input sequence order in the MSA calculation.
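
The two input-preparation steps described for these datasets (replacing 'U' with 'X' and randomizing the sequence order) can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the seed handling, and the assumption that the unaligned sequences are in upper-case FASTA are all hypothetical.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def prepare_sequences(records, seed=0):
    """Replace 'U' with 'X' and return the records in a seeded random order.

    Assumption: only upper-case 'U' needs replacing.
    """
    cleaned = [(h, s.replace("U", "X")) for h, s in records]
    rng = random.Random(seed)  # fixed seed makes the shuffle reproducible
    rng.shuffle(cleaned)
    return cleaned


def write_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


example = ">a\nMKUV\n>b\nACDE\n>c\nUUGG\n"
print(write_fasta(prepare_sequences(read_fasta(example), seed=42)), end="")
```

Using a seeded generator keeps the randomized order reproducible, which matters when the same shuffled input has to be given to several aligners.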

Benchmark results for large-scale datasets

The following tables give benchmark results for popular, high-speed multiple sequence aligners on currently available large-scale benchmark datasets.
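
The HomFam and OXFam results are reported as SP and TC scores, which this page does not define. The sketch below assumes the conventional definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. It also assumes that only capital-letter residues of the reference are scored, in line with the dataset description. The function names are hypothetical, and the scoring program actually used for the tables may differ in detail.

```python
from itertools import combinations


def residue_columns(aligned):
    """Return, for each residue of an aligned sequence, its column index."""
    return [col for col, ch in enumerate(aligned) if ch not in "-."]


def sp_tc(reference, test):
    """Compute (SP, TC) for alignments given as dicts: name -> aligned string.

    Every sequence in `reference` must also be present in `test`.
    Only capital-letter reference residues are scored (assumption).
    """
    test_cols = {name: residue_columns(test[name]) for name in reference}
    length = len(next(iter(reference.values())))
    next_residue = {name: 0 for name in reference}
    pairs_total = pairs_ok = cols_total = cols_ok = 0
    for col in range(length):
        placed = []  # test-alignment column of each scored residue
        for name, seq in reference.items():
            ch = seq[col]
            if ch in "-.":
                continue
            idx = next_residue[name]
            next_residue[name] += 1
            if ch.isupper():
                placed.append(test_cols[name][idx])
        if len(placed) < 2:
            continue  # a column with fewer than two scored residues has no pairs
        n_pairs = len(placed) * (len(placed) - 1) // 2
        same = sum(1 for a, b in combinations(placed, 2) if a == b)
        pairs_total += n_pairs
        pairs_ok += same
        cols_total += 1
        cols_ok += same == n_pairs
    sp = pairs_ok / pairs_total if pairs_total else 0.0
    tc = cols_ok / cols_total if cols_total else 0.0
    return sp, tc


reference = {"s1": "ACd", "s2": "ACd"}
test = {"s1": "ACD-", "s2": "A-CD"}
print(sp_tc(reference, test))  # (0.5, 0.5)
```

In the example, the lowercase third column of the reference is ignored; of the two scored columns, only the first is reproduced by the test alignment.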

HomFam

Mean SP / TC score

Method                                    Small            Medium           Large            All
Size range                                (0, 3000]        (3000, 10000]    (10000, ∞)       [93, 93681]
Files                                     38               32               19               89
MAFFT - FFT-NS-1                          0.8971 / 0.7656  0.8560 / 0.6448  0.7415 / 0.5069  0.8491 / 0.6669
MAFFT - FFT-NS-1 (memsavetree)            0.8980 / 0.7846  0.8573 / 0.6469  0.7023 / 0.4639  0.8416 / 0.6667
MAFFT - FFT-NS-2                          0.9074 / 0.7806  0.8957 / 0.7298  0.7795 / 0.5646  0.8759 / 0.7162
MAFFT - FFT-NS-2 (memsavetree)            0.9085 / 0.8022  0.8926 / 0.7179  0.7134 / 0.4764  0.8611 / 0.7023
MAFFT - Randomchain                       0.8699 / 0.7315  0.8671 / 0.6967  0.7106 / 0.4932  0.8349 / 0.6681
MAFFT - PartTree (partsize=50)            0.8443 / 0.6549  0.8190 / 0.5817  0.6148 / 0.3609  0.7862 / 0.5658
MAFFT - PartTree (partsize=1000)          0.8840 / 0.7465  0.8406 / 0.6307  0.6844 / 0.4321  0.8258 / 0.6377
MAFFT - DPPartTree (partsize=50)          0.8892 / 0.7610  0.8599 / 0.6578  0.7142 / 0.4601  0.8413 / 0.6597
MAFFT - DPPartTree (partsize=1000)        0.8918 / 0.7690  0.8684 / 0.6914  0.7546 / 0.5454  0.8541 / 0.6934
MAFFT - Sparsecore (p=100)                0.9105 / 0.7939  0.9004 / 0.7275  0.7945 / 0.5943  0.8821 / 0.7274
MAFFT - Sparsecore (p=500)                0.9267 / 0.8315  0.9167 / 0.7573  0.8045 / 0.6148  0.8970 / 0.7586
MAFFT - Sparsecore (p=1000)               0.9405 / 0.8628  0.9228 / 0.7746  0.8159 / 0.6283  0.9075 / 0.7810
MAFFT - Sparsecore (p=100, memsavetree)   0.9238 / 0.8380  0.9094 / 0.7428  0.7641 / 0.5469  0.8845 / 0.7416
MAFFT - Sparsecore (p=500, memsavetree)   0.9221 / 0.8233  0.9213 / 0.7782  0.8175 / 0.6204  0.8995 / 0.7638
MAFFT - Sparsecore (p=1000, memsavetree)  0.9392 / 0.8599  0.9327 / 0.8052  0.7907 / 0.5899  0.9052 / 0.7826
Clustal Omega                             0.9148 / 0.8057  0.8693 / 0.7152  0.6871 / 0.4449  0.8498 / 0.6961
Clustal Omega - Full                      0.9088 / 0.8086  0.8806 / 0.7365  0.6692 / 0.4386  0.8475 / 0.7037
Clustal Omega - Randomchain               0.8798 / 0.7580  0.8309 / 0.6918  - / -            - / -
Muscle 1 iteration                        0.8094 / 0.6640  0.7572 / 0.5672  - / -            - / -
Muscle 1 iteration - Randomchain          0.8224 / 0.6720  0.8189 / 0.6771  0.7001 / 0.4471  0.7951 / 0.6258
Muscle 2 iteration                        0.8078 / 0.6606  0.6949 / 0.4645  - / -            - / -
Muscle 2 iteration - Randomchain          0.8437 / 0.7053  0.8251 / 0.6528  0.7425 / 0.5274  0.8154 / 0.6484
UPP -fast                                 0.8616 / 0.7466  0.8407 / 0.7087  0.7700 / 0.5853  0.8345 / 0.6985
UPP -default                              0.8678 / 0.7492  0.8708 / 0.7570  0.7956 / 0.6330  0.8535 / 0.7272
MAFFT - G-INS-1                           0.9358 / 0.8549  0.9520 / 0.8480  0.8844 / 0.7441  0.9306 / 0.8288

Total CPU time (min)

Method                                    Small     Medium    Large     All
MAFFT - FFT-NS-1                          1.2       15        140       160
MAFFT - FFT-NS-1 (memsavetree)            2.0       21        240       260
MAFFT - FFT-NS-2                          2.9       36        420       460
MAFFT - FFT-NS-2 (memsavetree)            5.6       66        910       990
MAFFT - Randomchain                       2.0       15        71        88
MAFFT - PartTree (partsize=50)            1.5       11        35        47
MAFFT - PartTree (partsize=1000)          3.0       21        71        94
MAFFT - DPPartTree (partsize=50)          8.1       47        100       160
MAFFT - DPPartTree (partsize=1000)        55        270       490       820
MAFFT - Sparsecore (p=100)                7.4       61        580       650
MAFFT - Sparsecore (p=500)                160       390       790       1300
MAFFT - Sparsecore (p=1000)               810       2100      1500      4400
MAFFT - Sparsecore (p=100, memsavetree)   9.8       94        1200      1300
MAFFT - Sparsecore (p=500, memsavetree)   150       440       1400      2000
MAFFT - Sparsecore (p=1000, memsavetree)  730       2100      2200      5000
Clustal Omega                             21        160       300       480
Clustal Omega - Full                      44        570       5400      6000
Clustal Omega - Randomchain               130       18000     -         -
Muscle 1 iteration                        3.5       36        -         -
Muscle 1 iteration - Randomchain          1.7       9.4       45        56
Muscle 2 iteration                        9.0       120       -         -
Muscle 2 iteration - Randomchain          3.0       17        69        89
UPP -fast                                 53        190       260       500
UPP -default                              360       1600      2400      4400
MAFFT - G-INS-1                           (370)     (5200)    (44000)   (49000)
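
The Small, Medium, and Large columns group the input files by size, using the ranges (0, 3000], (3000, 10000], and above 10000. A minimal sketch of that binning follows, assuming the ranges refer to the number of sequences in each input file (the page does not state the unit explicitly). Both function names are hypothetical.

```python
def count_sequences(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def size_class(n_sequences):
    """Map a sequence count to the size class used in the result tables."""
    if n_sequences <= 0:
        raise ValueError("sequence count must be positive")
    if n_sequences <= 3000:
        return "Small"   # (0, 3000]
    if n_sequences <= 10000:
        return "Medium"  # (3000, 10000]
    return "Large"       # above 10000


for n in (93, 3000, 3001, 10000, 93681):
    print(n, size_class(n))
```

The upper bound of each range is inclusive, so a file with exactly 3000 sequences is Small and one with exactly 10000 is Medium.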

Versions:

MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0

Execution commands, listed in the same order as the methods in the tables above:

mafft --retree 1 --maxiterate 0 input

mafft --retree 1 --maxiterate 0 --memsavetree input

mafft input

mafft --memsavetree input

mafft --randomchain --randomseed seed input

mafft --parttree --partsize 50 input

mafft --parttree --partsize 1000 input

mafft --dpparttree --partsize 50 input

mafft --dpparttree --partsize 1000 input

mafft-sparsecore.rb -s seed -p 100 -i input

mafft-sparsecore.rb -s seed -p 500 -i input

mafft-sparsecore.rb -s seed -p 1000 -i input

mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input

clustalo -i input

clustalo --full -i input

clustalo --pileup -i input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input

muscle -maxiters 2 -in input

muscle -maxiters 2 -usetree randomchain -in input

run_upp.py -m amino -B 100 -s input

run_upp.py -m amino -s input

mafft --globalpair --thread 10 input
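
For scripted runs, the commands above can be turned into argument lists and executed one input file at a time. The sketch below covers only a subset of the methods, and the helper names, the `{input}` and `{seed}` placeholders, and the choice to capture standard output in a file are assumptions made for illustration, not part of the original benchmark setup.

```python
import shlex
import subprocess

# Command templates copied from the list above (subset only).
COMMANDS = {
    "MAFFT - FFT-NS-1": "mafft --retree 1 --maxiterate 0 {input}",
    "MAFFT - FFT-NS-2": "mafft {input}",
    "MAFFT - Randomchain": "mafft --randomchain --randomseed {seed} {input}",
    "MAFFT - PartTree (partsize=50)": "mafft --parttree --partsize 50 {input}",
    "Clustal Omega": "clustalo -i {input}",
    "MAFFT - G-INS-1": "mafft --globalpair --thread 10 {input}",
}


def build_command(method, input_path, seed=0):
    """Return the argument list for one method and one input file."""
    template = COMMANDS[method]
    # Split first, then substitute, so paths containing spaces stay intact.
    return [part.format(input=input_path, seed=seed)
            for part in shlex.split(template)]


def run_method(method, input_path, output_path, seed=0):
    """Run one aligner and write its standard output to output_path.

    Assumes the aligner is installed and prints the alignment to stdout.
    """
    with open(output_path, "w") as out:
        subprocess.run(build_command(method, input_path, seed),
                       stdout=out, check=True)


print(build_command("MAFFT - Randomchain", "input.fasta", seed=7))
```

`run_method` is not called here because it requires the aligners to be installed; `build_command` can be checked on its own.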

OXFam

Mean SP / TC score

Method                                    Small            Medium           Large            All
Size range                                (0, 3000]        (3000, 10000]    (10000, ∞)       [19, 81503]
Files                                     74               59               32               165
MAFFT - FFT-NS-1                          0.9248 / 0.8936  0.8813 / 0.8142  0.7884 / 0.7018  0.8828 / 0.8280
MAFFT - FFT-NS-1 (memsavetree)            0.9157 / 0.8840  0.8652 / 0.7933  0.7648 / 0.6963  0.8684 / 0.8152
MAFFT - FFT-NS-2                          0.9300 / 0.9016  0.8853 / 0.8253  0.8207 / 0.7404  0.8928 / 0.8430
MAFFT - FFT-NS-2 (memsavetree)            0.9246 / 0.8921  0.8869 / 0.8182  0.7668 / 0.6901  0.8805 / 0.8265
MAFFT - Randomchain                       0.9010 / 0.8622  0.8571 / 0.7989  0.7674 / 0.6827  0.8594 / 0.8048
MAFFT - PartTree (partsize=50)            0.9180 / 0.8664  0.8246 / 0.7327  0.6825 / 0.5857  0.8389 / 0.7641
MAFFT - PartTree (partsize=1000)          0.9046 / 0.8612  0.8299 / 0.7480  0.7108 / 0.6195  0.8403 / 0.7739
MAFFT - DPPartTree (partsize=50)          0.9154 / 0.8769  0.8508 / 0.7779  0.7568 / 0.6686  0.8616 / 0.8011
MAFFT - DPPartTree (partsize=1000)        0.9261 / 0.8843  0.8584 / 0.7906  0.7843 / 0.7099  0.8744 / 0.8169
MAFFT - Sparsecore (p=100)                0.9438 / 0.9143  0.9058 / 0.8546  0.8364 / 0.7613  0.9094 / 0.8633
MAFFT - Sparsecore (p=500)                0.9441 / 0.9155  0.9228 / 0.8807  0.8391 / 0.7645  0.9161 / 0.8738
MAFFT - Sparsecore (p=1000)               0.9533 / 0.9319  0.9257 / 0.8897  0.8427 / 0.7707  0.9220 / 0.8855
MAFFT - Sparsecore (p=100, memsavetree)   0.9483 / 0.9190  0.8845 / 0.8267  0.8480 / 0.7806  0.9060 / 0.8592
MAFFT - Sparsecore (p=500, memsavetree)   0.9328 / 0.9030  0.9167 / 0.8783  0.8383 / 0.7694  0.9087 / 0.8682
MAFFT - Sparsecore (p=1000, memsavetree)  0.9543 / 0.9340  0.9276 / 0.8896  0.8319 / 0.7699  0.9210 / 0.8863
Clustal Omega                             0.9257 / 0.8842  0.8735 / 0.8118  0.7409 / 0.6408  0.8712 / 0.8111
Clustal Omega - Full                      0.9244 / 0.8886  0.8595 / 0.7839  0.7440 / 0.6688  0.8662 / 0.8085
Clustal Omega - Randomchain               0.8905 / 0.8452  0.8477 / 0.7888  - / -            - / -
Muscle 1 iteration                        0.8450 / 0.7782  0.6365 / 0.5220  - / -            - / -
Muscle 1 iteration - Randomchain          0.8797 / 0.8268  0.8464 / 0.7846  0.6937 / 0.6067  0.8317 / 0.7690
Muscle 2 iteration                        0.8555 / 0.8000  0.6896 / 0.5818  - / -            - / -
Muscle 2 iteration - Randomchain          0.8995 / 0.8540  0.8371 / 0.7719  0.7229 / 0.6309  0.8429 / 0.7814
UPP -fast                                 0.9327 / 0.9028  0.8940 / 0.8535  0.7878 / 0.7196  0.8908 / 0.8496
UPP -default                              0.9415 / 0.9138  0.9068 / 0.8676  0.8211 / 0.7601  0.9057 / 0.8675
MAFFT - G-INS-1                           0.9572 / 0.9358  0.9485 / 0.9147  0.8749 / 0.8212  0.9381 / 0.9060

Total CPU time (min)

Method                                    Small     Medium    Large     All
MAFFT - FFT-NS-1                          2.4       31        160       200
MAFFT - FFT-NS-1 (memsavetree)            3.8       50        270       330
MAFFT - FFT-NS-2                          5.5       81        470       560
MAFFT - FFT-NS-2 (memsavetree)            11        170       1000      1200
MAFFT - Randomchain                       3.4       30        96        130
MAFFT - PartTree (partsize=50)            2.7       22        59        83
MAFFT - PartTree (partsize=1000)          6.2       51        140       190
MAFFT - DPPartTree (partsize=50)          16        110       210       340
MAFFT - DPPartTree (partsize=1000)        120       780       1500      2400
MAFFT - Sparsecore (p=100)                11        130       660       800
MAFFT - Sparsecore (p=500)                190       640       1000      1900
MAFFT - Sparsecore (p=1000)               1100      3500      2800      7500
MAFFT - Sparsecore (p=100, memsavetree)   17        210       1300      1500
MAFFT - Sparsecore (p=500, memsavetree)   200       720       1700      2600
MAFFT - Sparsecore (p=1000, memsavetree)  1100      3400      3300      7900
Clustal Omega                             27        220       590       840
Clustal Omega - Full                      82        1300      7100      8400
Clustal Omega - Randomchain               170       4600      -         -
Muscle 1 iteration                        7.1       94        -         -
Muscle 1 iteration - Randomchain          3.1       21        53        77
Muscle 2 iteration                        18        320       -         -
Muscle 2 iteration - Randomchain          5.5       38        85        130
UPP -fast                                 92        380       540       1000
UPP -default                              660       3300      4900      8900
MAFFT - G-INS-1                           (760)     (13000)   (71000)   (86000)

Versions:

MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0

Execution commands, listed in the same order as the methods in the tables above:

mafft --retree 1 --maxiterate 0 input

mafft --retree 1 --maxiterate 0 --memsavetree input

mafft input

mafft --memsavetree input

mafft --randomchain --randomseed seed input

mafft --parttree --partsize 50 input

mafft --parttree --partsize 1000 input

mafft --dpparttree --partsize 50 input

mafft --dpparttree --partsize 1000 input

mafft-sparsecore.rb -s seed -p 100 -i input

mafft-sparsecore.rb -s seed -p 500 -i input

mafft-sparsecore.rb -s seed -p 1000 -i input

mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input

clustalo -i input

clustalo --full -i input

clustalo --pileup -i input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input

muscle -maxiters 2 -in input

muscle -maxiters 2 -usetree randomchain -in input

run_upp.py -m amino -B 100 -s input

run_upp.py -m amino -s input

mafft --globalpair --thread 10 input

ContTest

Mean ContTest score

Method                                    Small            Medium           Large            All
Size range                                (0, 3000]        (3000, 10000]    (10000, ∞)       [1467, 43912]
Files                                     15               70               51               136
MAFFT - FFT-NS-1                          0.3830           0.4803           0.5231           0.4856
MAFFT - FFT-NS-1 (memsavetree)            0.3915           0.4857           0.5076           0.4835
MAFFT - FFT-NS-2                          0.4081           0.4874           0.5439           0.4998
MAFFT - FFT-NS-2 (memsavetree)            0.3980           0.5029           0.5525           0.5099
MAFFT - Randomchain                       0.4406           0.5227           0.5997           0.5425
MAFFT - PartTree (partsize=50)            0.3812           0.4030           0.4288           0.4103
MAFFT - PartTree (partsize=1000)          0.3883           0.4351           0.4523           0.4364
MAFFT - DPPartTree (partsize=50)          0.3723           0.4289           0.4817           0.4424
MAFFT - DPPartTree (partsize=1000)        0.3779           0.4555           0.4988           0.4632
MAFFT - Sparsecore (p=100)                0.3747           0.5005           0.5771           0.5153
MAFFT - Sparsecore (p=500)                0.3883           0.5180           0.6046           0.5361
MAFFT - Sparsecore (p=1000)               0.3808           0.5237           0.6198           0.5440
MAFFT - Sparsecore (p=100, memsavetree)   0.3878           0.5143           0.5927           0.5298
MAFFT - Sparsecore (p=500, memsavetree)   0.3981           0.5264           0.6107           0.5438
MAFFT - Sparsecore (p=1000, memsavetree)  0.3535           0.5324           0.6126           0.5428
Clustal Omega                             0.3039           0.4291           0.4262           0.4142
Clustal Omega - Full                      0.3080           0.4585           0.4640           0.4440
Clustal Omega - Randomchain               0.4328           0.5324           0.5703           0.5357
Muscle 1 iteration                        0.2701           0.3678           0.3414           0.3471
Muscle 1 iteration - Randomchain          0.3817           0.5217           0.5957           0.5340
Muscle 2 iteration                        0.3206           0.3817           0.3254           0.3538
Muscle 2 iteration - Randomchain          0.4442           0.5289           0.6141           0.5515
UPP -fast                                 0.3515           0.5139           0.5744           0.5187
UPP -default                              0.3555           0.5254           0.5936           0.5323
MAFFT - G-INS-1                           0.3853           0.5445           0.6582           0.5696

Total CPU time (min)

Method                                    Small     Medium    Large     All
MAFFT - FFT-NS-1                          0.48      20        150       170
MAFFT - FFT-NS-1 (memsavetree)            0.84      36        240       280
MAFFT - FFT-NS-2                          1.2       54        440       500
MAFFT - FFT-NS-2 (memsavetree)            2.5       120       990       1100
MAFFT - Randomchain                       0.56      16        88        100
MAFFT - PartTree (partsize=50)            0.44      11        50        61
MAFFT - PartTree (partsize=1000)          1.0       24        120       140
MAFFT - DPPartTree (partsize=50)          2.3       53        160       210
MAFFT - DPPartTree (partsize=1000)        14        260       770       1000
MAFFT - Sparsecore (p=100)                2.2       77        650       730
MAFFT - Sparsecore (p=500)                30        250       930       1200
MAFFT - Sparsecore (p=1000)               160       1100      2100      3400
MAFFT - Sparsecore (p=100, memsavetree)   3.4       140       1300      1500
MAFFT - Sparsecore (p=500, memsavetree)   40        310       1600      2000
MAFFT - Sparsecore (p=1000, memsavetree)  180       1200      2800      4200
Clustal Omega                             5.0       130       460       600
Clustal Omega - Full                      16        830       5600      6400
Clustal Omega - Randomchain               28        6400      110000    120000
Muscle 1 iteration                        2.0       81        550       630
Muscle 1 iteration - Randomchain          0.53      13        50        63
Muscle 2 iteration                        3.9       230       2500      2700
Muscle 2 iteration - Randomchain          0.86      22        78        100
UPP -fast                                 17        230       500       750
UPP -default                              130       2000      4600      6700
MAFFT - G-INS-1                           (110)     (7100)    (48000)   (55000)

Versions:

MAFFT 7.294; Clustal Omega 1.2.1; Muscle 3.8.31; UPP 2.0

Execution commands, listed in the same order as the methods in the tables above:

mafft --retree 1 --maxiterate 0 input

mafft --retree 1 --maxiterate 0 --memsavetree input

mafft input

mafft --memsavetree input

mafft --randomchain --randomseed seed input

mafft --parttree --partsize 50 input

mafft --parttree --partsize 1000 input

mafft --dpparttree --partsize 50 input

mafft --dpparttree --partsize 1000 input

mafft-sparsecore.rb -s seed -p 100 -i input

mafft-sparsecore.rb -s seed -p 500 -i input

mafft-sparsecore.rb -s seed -p 1000 -i input

mafft-sparsecore.rb -s seed -p 100 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 500 -A "--memsavetree" -i input

mafft-sparsecore.rb -s seed -p 1000 -A "--memsavetree" -i input

clustalo -i input

clustalo --full -i input

clustalo --pileup -i input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -in input

muscle -maxiters 1 -diags1 -sv -distance kbit20_3 -usetree randomchain -in input

muscle -maxiters 2 -in input

muscle -maxiters 2 -usetree randomchain -in input

run_upp.py -m amino -B 100 -s input

run_upp.py -m amino -s input

mafft --globalpair --thread 10 input

Related website

MAFFT official website

Contact

E-mail: kyamada [AT] ecei.tohoku.ac.jp