DOI: 10.1101/2022.08.22.504727

Coding nucleic acid sequences with graph convolutional network

R. Wang, Y. K. Ng, X. Zhang, J. Wang, S. Li
Abstract
Phage mining yields a large number of phage genome sequences, and there is an urgent need for tools that can analyze these sequences reliably. Neural network-based methods are prime candidates for these tasks because they apply to large, diverse datasets about which little prior knowledge is available. However, the highly variable lengths of nucleic acid sequences pose a major obstacle to representing them as neural network input, and genetic variation further complicates sequence comparison. Here we propose a graph representation of a nucleic acid sequence in which all subsequences of the represented sequence can be matched to gapped patterns formed by paths of the graph, and distribution information about these subsequences is encoded as features of the graph elements. These graphs, which we call gapped pattern graphs, can be transformed through a Graph Convolutional Network (GCN) to form lower-dimensional embeddings for downstream tasks. Based on the gapped pattern graph, we introduce Graphage, an implementation of a neural network model, and compare it with equivalent models based on other forms of input (e.g., k-mer distribution, word2vec) on four downstream tasks: phage and ICE discrimination, phage integration site prediction, phage lifestyle prediction, and phage host prediction. We also compare Graphage with the current state-of-the-art tools on these tasks, where such tools are available. Our results show that Graphage consistently outperformed all the other tools and methods under various metrics on all four tasks. With Graphage, we identified distinct gapped pattern signatures associated with phage phenotypes.
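As a rough illustration of the idea in the abstract, the sketch below builds a toy graph from a sequence and applies one graph convolution to it. It is not Graphage's actual construction, which the abstract does not specify. The sketch assumes that nodes are k-mers, that an edge links two k-mers separated by a short gap, and that the only node feature is the k-mer's relative frequency. The function names, the parameters `k` and `max_gap`, and the weight matrix are all hypothetical. The convolution is the standard symmetric-normalised GCN propagation rule, written in pure Python.

```python
from collections import Counter
from math import sqrt


def build_gapped_pattern_graph(seq, k=2, max_gap=1):
    """Build a toy graph of k-mers linked across short gaps.

    Assumption (not from the paper): nodes are the k-mers seen in `seq`,
    and edge (u, v) counts how often k-mer v starts 0..max_gap bases
    after k-mer u ends.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    nodes = sorted(set(kmers))
    total = len(kmers)
    # Node feature: relative frequency of each k-mer in the sequence.
    node_freq = {n: c / total for n, c in Counter(kmers).items()}
    edge_count = Counter()
    for i, u in enumerate(kmers):
        for gap in range(max_gap + 1):
            j = i + k + gap  # start index of the k-mer after the gap
            if j < len(kmers):
                edge_count[(u, kmers[j])] += 1
    return nodes, node_freq, edge_count


def gcn_layer(nodes, edge_count, features, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    index = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    # Weighted adjacency, made symmetric, with self-loops on the diagonal.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (u, v), c in edge_count.items():
        iu, iv = index[u], index[v]
        adj[iu][iv] += c
        if iu != iv:
            adj[iv][iu] += c
    deg = [sum(row) for row in adj]
    out = []
    for i in range(n):
        # Aggregate degree-normalised features from node i and its neighbours.
        agg = [0.0] * len(features[0])
        for j in range(n):
            if adj[i][j]:
                norm = adj[i][j] / sqrt(deg[i] * deg[j])
                for f, x in enumerate(features[j]):
                    agg[f] += norm * x
        # Linear map (weight has shape in_dim x out_dim), then ReLU.
        out.append([max(0.0, sum(a * w for a, w in zip(agg, col)))
                    for col in zip(*weight)])
    return out


# Usage: sequences of any length map to a graph over a fixed k-mer vocabulary.
nodes, freq, edges = build_gapped_pattern_graph("ACGTACGT", k=2, max_gap=1)
features = [[freq[n]] for n in nodes]
embedding = gcn_layer(nodes, edges, features, weight=[[1.0, -1.0]])
```

In a trained model the weight matrix would be learned, and a pooling step over the node embeddings would produce a fixed-size vector for the downstream classifier. Both steps are left out of this sketch.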