Benchmarking Metagenomics Tools for Taxonomic Classification
Cell, [36.216]
2019-08-08 Review
DOI: https://doi.org/10.1016/j.cell.2019.07.010
Full text is open access: https://www.cell.com/cell/fulltext/S0092-8674(19)30775-5
First author: Simon H. Ye1,2,*
Corresponding author: Simon H. Ye1,2,*
Other authors: Katherine J. Siddle, Daniel J. Park, Pardis C. Sabeti
Affiliations:
1 Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
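Daily digest
Many software tools are available for taxonomic classification of metagenomic data, but systematic evaluations have been lacking;
This paper reviews the current mainstream metagenomic analysis methods and systematically evaluates 20 classifiers;
It also describes the key evaluation metrics, providing a framework for benchmarking additional classifiers;
The evaluation of resource consumption during database indexing helps users decide between building their own index and using one already built by others;
The evaluation of memory, thread count, and run time helps users choose software and an analysis plan suited to their own hardware, and estimate how long a project will take.
Editor's comment: Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, but so many tools are now available that choosing among them is difficult. Cell recently published a systematic evaluation of taxonomic classification software. Its results can guide researchers in choosing the analysis approach that best fits their own hardware, so as to obtain better results. It also gives developers of such software a framework for systematically evaluating performance.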
Abstract
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.
Main results
Figure 1 Processing Steps to Go from a Complex Metagenomic Sample to an Abundance Profile of Sample Content
Figure 2 Metrics Used for Evaluating Classifier Performance
AUPR (area under the precision-recall curve) and L2 distance (the straight-line distance between the observed and true abundance vectors) are two complementary metrics that provide insight into the accuracy of a classifier’s precision-recall and abundance estimates, respectively. Considered together, they provide a readily interpretable picture of classifier performance and can be used to compare classifiers.
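The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' benchmarking code: the function names, the dictionary-based profile format, and the step-wise summation of the precision-recall curve are all assumptions made here, and ties in abundance are not treated specially.

```python
import math


def l2_distance(observed, truth):
    """Straight-line (Euclidean) distance between two abundance profiles.

    Both arguments map taxon name -> relative abundance (hypothetical
    format); a taxon missing from one profile counts as abundance 0 there.
    """
    taxa = set(observed) | set(truth)
    return math.sqrt(
        sum((observed.get(t, 0.0) - truth.get(t, 0.0)) ** 2 for t in taxa)
    )


def aupr(observed, true_taxa):
    """Area under the precision-recall curve for one classifier output.

    Taxa are ranked by reported abundance, so each rank acts as an
    abundance threshold. The area is accumulated as
    precision * (increase in recall), one common step-wise approximation.
    """
    ranked = sorted(observed, key=observed.get, reverse=True)
    true_positives = 0
    previous_recall = 0.0
    area = 0.0
    for rank, taxon in enumerate(ranked, start=1):
        if taxon in true_taxa:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / len(true_taxa)
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Toy example: two true species, one low-abundance false positive.
truth = {"A": 0.5, "B": 0.5}
observed = {"A": 0.6, "B": 0.3, "C": 0.1}
print(aupr(observed, set(truth)), l2_distance(observed, truth))
```

In the toy example the false positive ranks below both true species, so it does not lower the AUPR, while the skewed abundances still produce a non-zero L2 distance. That is the sense in which the two metrics are complementary.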
Table 1 A List of Benchmarked Classifiers and Their Various Characteristics
The table mainly covers whether the database can be customized, whether an abundance profile can be produced, memory consumption, run time, and so on.
“Custom databases” refers to the ability for the end user to create a custom database. The time and memory requirements are for a 5.7 million-read dataset with the database and input already cached in memory. Some methods (marked as “varies”) have the ability to flexibly decrease their memory usage (at the cost of a massive increase in run time).
a The latest version of PathSeq now allows the user to create and specify a custom database, but this option was not available when benchmarking studies were performed; thus, it was excluded from those analyses.
Figure 3 Benchmark AUPR Scores
(A) Area under the precision-recall curve (AUPR) scores for each classifier at the species level (a higher value is better). Each plot point represents the score for a (classifier, dataset) combination. Classifiers are grouped and colored by their target class (blue for DNA, orange for protein, red for DNA markers).
(B) AUPR for the uniform RefSeq CG database instead of default databases. Missing entries on the RefSeq CG plot are classifiers that cannot create custom databases. With the same database, the classifiers' results differ little.
For additional information, see Figures S1–S4.
Figure 4 Benchmark L2 Distances
(A) Distance between the species abundance profile for each classifier compared with the true composition (a lower value is better). Each plot point represents the L2 distance for a (classifier, dataset) combination. Classifiers are grouped and colored by their target class.
(B) Abundance distance using the uniform RefSeq CG database. Missing entries are classifiers that cannot create custom databases.
(C) Median pairwise L2 abundance norms between classifiers across simulated datasets, hierarchically clustered. Non-black cluster link colors are groups at a 0.09 similarity threshold. Colored boxes correspond to the method type: DNA, protein, and marker classifiers. The “k” annotation indicates k-mer-based methods.
For additional information, see Figure S6.
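The pairwise comparison behind panel (C) can be sketched as follows. This is only an illustration under assumptions: the nested-dictionary input format and the function names are hypothetical, and the linkage method the authors used for the hierarchical clustering is not stated in this digest, so the clustering step itself is left out.

```python
import math
from itertools import combinations
from statistics import median


def l2(p, q):
    """Euclidean distance between two {taxon: abundance} profiles."""
    taxa = set(p) | set(q)
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in taxa))


def median_pairwise_l2(profiles):
    """Median L2 distance for every pair of classifiers.

    profiles: classifier -> dataset -> {taxon: abundance} (assumed layout).
    Only datasets that both classifiers were run on are compared.
    Returns {(classifier_a, classifier_b): median distance}.
    """
    result = {}
    for a, b in combinations(sorted(profiles), 2):
        shared = sorted(set(profiles[a]) & set(profiles[b]))
        if shared:
            result[(a, b)] = median(
                l2(profiles[a][d], profiles[b][d]) for d in shared
            )
    return result


# Toy input: classifiers "x" and "y" agree everywhere; "z" differs on d1.
toy = {
    "x": {"d1": {"A": 1.0}, "d2": {"A": 0.5, "B": 0.5}},
    "y": {"d1": {"A": 1.0}, "d2": {"A": 0.5, "B": 0.5}},
    "z": {"d1": {"B": 1.0}, "d2": {"A": 0.5, "B": 0.5}},
}
for pair, distance in median_pairwise_l2(toy).items():
    print(pair, round(distance, 4))
```

The resulting pairwise distances could then be passed to a hierarchical clustering routine and cut at a chosen threshold (0.09 in the figure) to recover groups of classifiers that report similar abundance profiles.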
Figure 5 Proportion of Abundance Classified at the Species Rank
(A) Proportion of sample abundance classified at the species rank with default databases.
(B) Using uniform RefSeq CG databases. Only programs allowing custom databases are shown.
For additional information, see Figure S5.
Figure 6 Number of Species Classified versus Minimum Abundance Threshold Detected in ATCC Even Sample Datasets
The truth abundance of 20 species at 0.05 abundance each is depicted as a black dotted line. For additional information, see Figures S7–S9.
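The curve in this figure can be reproduced in principle by counting, for each minimum abundance threshold, how many species a classifier reports at or above it. The sketch below assumes a hypothetical {species: abundance} profile and made-up abundances; it is not the authors' code or data.

```python
def species_detected(profile, thresholds):
    """Count species whose reported abundance is at or above each threshold.

    profile: {species: relative abundance} (assumed format).
    Returns {threshold: number of species passing it}.
    """
    return {
        t: sum(1 for abundance in profile.values() if abundance >= t)
        for t in thresholds
    }


# Made-up classifier output for an even 20-species community:
# 20 real species slightly under-estimated, plus 50 spurious low-abundance hits.
profile = {f"true_{i}": 0.045 for i in range(20)}
profile.update({f"spurious_{i}": 0.002 for i in range(50)})

for threshold, count in species_detected(profile, [0.0001, 0.001, 0.01, 0.1]).items():
    print(threshold, count)
```

With a loose threshold the spurious hits inflate the species count well above the true 20, and raising the threshold removes them. A threshold set too high would eventually discard the real species as well.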
Figure 7 Benchmark of Computational Resources
(A) Time required to process a sample containing 5.7 million reads versus a second run immediately after the first. This second run is faster for many classifiers because sample reads and database files are cached in memory. Bracken is not plotted because it requires negligible time and memory.
(B) The maximum memory utilized by each classifier during execution, the on-disk database size, and the average number of CPUs utilized out of 32 available.
(C) Time taken and memory used to create the RefSeq CG database using various methods. Classifiers are sorted by increasing time taken. MMseqs2 and DIAMOND do not index the genomes during database construction but, rather, index on the fly during sample classification.
Reference
https://www.cell.com/cell/fulltext/S0092-8674(19)30775-5
Ye, S.H., Siddle, K.J., Park, D.J., and Sabeti, P.C. (2019). Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 178, 779-794.