Paper

Peer-reviewed
April 22, 2020

A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance.

Genes & genetic systems
  • Takashi Abe
  • Ryo Ikarashi
  • Masaya Mizoguchi
  • Masashi Otake
  • Toshimichi Ikemura

Volume
95
Issue
1
First page
11
Last page
19
Language
English
Publication type
Research paper (academic journal)
DOI
10.1266/ggs.19-00041

As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This allows function-unknown proteins to cluster with function-known proteins, based solely on similarity with respect to oligopeptide frequency, although the method required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories, using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.
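
The general idea of comparing proteins by oligopeptide frequency rather than by alignment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the oligopeptide length, the distance measure, or the assignment rule, so the dipeptide length (`k=2`), the Euclidean distance, and the nearest-reference rule below are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def oligopeptide_frequencies(sequence, k=2):
    """Return normalized k-mer frequencies in a fixed order.

    k=2 (dipeptides) is an assumption made for this sketch.
    k-mers containing non-standard residues are ignored.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def opd(freq_a, freq_b):
    """Distance between two frequency vectors (Euclidean, by assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))


def predict_category(query, references, k=2):
    """Assign the category of the closest reference protein.

    `references` is a list of (sequence, category) pairs; the
    nearest-reference rule is a simplification for illustration.
    """
    q = oligopeptide_frequencies(query, k)
    best = min(
        references,
        key=lambda ref: opd(q, oligopeptide_frequencies(ref[0], k)),
    )
    return best[1]
```

Given a list of function-known reference sequences with category labels, `predict_category(query, references)` returns the label of the reference whose frequency vector lies closest to the query. No alignment is computed at any point, which is the property the abstract relies on for proteins with low sequence similarity.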

Links
DOI
https://doi.org/10.1266/ggs.19-00041
PubMed
https://www.ncbi.nlm.nih.gov/pubmed/32161228
Identifiers
  • DOI : 10.1266/ggs.19-00041
  • PubMed ID : 32161228
