DOI: 10.1101/2022.08.22.504727

Coding nucleic acid sequences with graph convolutional network

R. Wang, Y. K. Ng, X. Zhang, J. Wang, S. Li
Phage mining yields a large number of phage genome sequences, and there is an urgent need for tools that can analyze these sequences reliably. Neural network-based methods are prime candidates for such tasks because they apply to large, diverse datasets for which little prior knowledge is available. However, the highly variable lengths of nucleic acid sequences pose a major obstacle to representing them as neural network input, and genetic variation further complicates sequence comparison. Here we propose a graph representation of a nucleic acid sequence in which every subsequence of the represented sequence can be matched to a gapped pattern formed by a path of the graph, and distributional information about these subsequences is encoded as features of the graph elements. These graphs, which we call gapped pattern graphs, can be transformed by a Graph Convolutional Network (GCN) into lower-dimensional embeddings for downstream tasks. Based on the gapped pattern graph, we introduce Graphage, an implementation of a neural network model, and compare it with equivalent models based on other forms of input (e.g., k-mer distribution, word2vec) on four downstream tasks: phage and ICE discrimination, phage integration site prediction, phage lifestyle prediction, and phage host prediction. We also compare Graphage with the current state-of-the-art tools for these tasks, where such tools are available. Our results show that Graphage consistently outperformed all the other tools and methods under various metrics on all four tasks. With Graphage, we also identified distinct gapped pattern signatures associated with phage phenotypes.
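To make the idea concrete, the following is a minimal sketch of how a variable-length sequence might be turned into a fixed-size graph and then into a fixed-length embedding. The abstract does not specify the exact construction, so everything here is an assumption made for illustration: nodes are k-mers, an edge records how often one k-mer is followed by another with a given number of bases in between, and a single standard graph-convolution step with mean pooling stands in for the GCN. The names `kmer_index`, `gapped_pattern_graph`, and `gcn_embed` are hypothetical and are not part of Graphage.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map each of the 4**k possible k-mers to a node id."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def gapped_pattern_graph(seq, k=2, max_gap=2):
    """Build a toy gapped-pattern-style graph for one sequence (assumed scheme).

    Returns (node_feats, edge_counts):
      node_feats[i]       relative frequency of k-mer i in seq
      edge_counts[(i, j)] list whose entry d counts how often k-mer i is
                          followed by k-mer j with exactly d bases in between
    """
    index = kmer_index(k)
    # k-mers containing ambiguous bases (e.g. N) map to None and are skipped.
    kmers = [index.get(seq[p:p + k].upper()) for p in range(len(seq) - k + 1)]
    node_counts = [0] * len(index)
    edge_counts = {}
    for p, i in enumerate(kmers):
        if i is None:
            continue
        node_counts[i] += 1
        for d in range(max_gap + 1):
            q = p + k + d  # start of the k-mer that follows after a gap of d
            if q < len(kmers) and kmers[q] is not None:
                per_gap = edge_counts.setdefault((i, kmers[q]), [0] * (max_gap + 1))
                per_gap[d] += 1
    total = sum(node_counts) or 1
    return [c / total for c in node_counts], edge_counts


def gcn_embed(node_feats, edge_counts, weights):
    """One graph-convolution step plus mean pooling.

    Computes mean_i ReLU((D^-1/2 (A + I) D^-1/2 x)_i * w) for each w in weights,
    where A sums the per-gap edge counts (a simplification made here).
    """
    n = len(node_feats)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), counts in edge_counts.items():
        w = float(sum(counts))
        adj[i][j] += w
        if i != j:
            adj[j][i] += w  # treat edges as undirected for normalization
    deg = [sum(row) for row in adj]
    agg = [
        sum(adj[i][j] * node_feats[j] / math.sqrt(deg[i] * deg[j]) for j in range(n))
        for i in range(n)
    ]
    return [sum(max(0.0, a * w) for a in agg) / n for w in weights]


# Usage: sequences of different lengths yield embeddings of the same size.
weights = [0.5, -1.0, 2.0]  # stand-in for learned parameters (assumption)
short = gcn_embed(*gapped_pattern_graph("ACGTACGTTGCA"), weights)
long = gcn_embed(*gapped_pattern_graph("ACGTTGCAGGCATTACGGA" * 5), weights)
assert len(short) == len(long) == len(weights)
```

Because the node set depends only on k and not on the input, sequences of any length produce graphs of the same size, which is one plausible way to sidestep the variable-length problem described above. A real model would learn the weights, keep the per-gap edge features separate rather than summing them, and stack several layers.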