Research

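Our lab develops new computational methods that combine big-data algorithms with machine learning to gain insight into uncharacterized microbial communities, and we build practical programs and tools based on these methods.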

Software

MMseqs2 (Many-against-Many sequence searching) is a software suite for searching and clustering large sets of protein, DNA, and RNA sequences. MMseqs2 runs 10,000 times faster than BLAST, and it can perform profile searches more than 400 times faster than PSI-BLAST while maintaining the same sensitivity.

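ColabFold provides an easy, fast, and convenient environment for predicting protein structures. It combines AlphaFold2 and RoseTTAFold with a module that uses MMseqs2 to generate multiple sequence alignments 16 times faster than the AlphaFold system, so protein structures can be predicted more quickly.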

Foldseek is a software suite for searching and clustering protein structures. It is 600,000 times faster than the fastest state-of-the-art aligners, allowing millions of structures to be queried in seconds.

Linclust is a method that clusters sequences down to 50% pairwise sequence similarity. Its runtime scales linearly with the input set size, not quadratically as in conventional algorithms, and it is more than 1,000 times faster than its competitors.

Plass is software for assembling short-read sequencing data at the protein level. Its main purpose is the assembly of complex metagenomic datasets. It assembles 10 times more protein residues in soil metagenomes than Megahit.

Conterminator is an efficient method for detecting incorrectly labeled sequences across kingdoms by an exhaustive all-against-all sequence comparison.

Key publications

*: First author

†: Corresponding author

Lab members in bold

Kim W.*, Mirdita M., Levy K.E., Gilchrist C.L.M., Schweke H., Söding J., Levy E.D.†, Steinegger M.† (2025) Rapid and sensitive protein complex alignment with Foldseek-Multimer, Nature Methods [preprint] [journal] [software]

Kim J.*, Steinegger M.† (2024) Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA, Nature Methods [preprint] [journal] [software] [pdf]

Barrio-Hernandez I.*, Yeo J.*, Jänes J., Mirdita M., Gilchrist C.L.M., Wein T., Varadi M., Velankar S., Beltrao P.†, Steinegger M.† (2023) Clustering predicted structures at the scale of the known protein universe, Nature [preprint] [journal] [software]

van Kempen M.*, Kim S.S.*, Tumescheit C., Mirdita M., Lee J., Gilchrist C.L.M., Söding J., Steinegger M. (2023) Fast and accurate protein structure search with Foldseek, Nature Biotechnology [preprint] [journal] [software]

Mirdita M.*†, Schütze K., Moriwaki Y., Heo L., Ovchinnikov S.†, Steinegger M.† (2022) ColabFold: making protein folding accessible to all, Nature Methods [preprint] [journal] [software]

Steinegger M.*, Salzberg S.L.† (2020) Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank, Genome Biology [preprint] [journal] [software]

Steinegger M.*, Meier M., Mirdita M., Vöhringer H., Haunsberger S.J., Söding J.† (2019) HH-suite3 for fast remote homology detection and deep protein annotation, BMC Bioinformatics [preprint] [journal] [software]

Steinegger M.*†, Mirdita M., Söding J.† (2019) Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold, Nature Methods [preprint] [journal] [software]

Steinegger M.*†, Söding J.† (2018) Clustering huge protein sequence sets in linear time, Nature Communications [preprint] [journal] [software]

Steinegger M.*, Söding J.† (2017) MMseqs2: Sensitive protein sequence searching for the analysis of massive data sets, Nature Biotechnology [preprint] [journal] [software]