Evaluation of deep-learning-based lncRNA identification tools [article]

Cheng Yang, Man Zhou, Haoling Xie, Huaiqiu Zhu
2019 bioRxiv   pre-print
Long non-coding RNAs (lncRNAs, length above 200 nt) exert crucial biological roles and have been implicated in cancers. To characterize newly discovered transcripts, one major issue is to distinguish lncRNAs from mRNAs. Since experimental methods are time-consuming and costly, computational methods are preferred for large-scale lncRNA identification. In a recent study, Amin et al. evaluated three deep-learning-based lncRNA identification tools (i.e., lncRNAnet, LncADeep, and lncFinder) and concluded "The LncADeep PR (precision recall) curve is just above the no-skill model and LncADeep showed poor overall performance". This surprising conclusion is based on the authors' use of a non-default setting of LncADeep. LncADeep has two models, one for full-length transcripts and the other for transcripts including partial-length. Given the difficulty of assembling full-length transcripts from RNA-seq datasets, LncADeep's default model is the one for transcripts including partial-length. However, according to the results posted on Amin et al.'s website, the authors used LncADeep with the full-length model, while claiming to use the default setting, to identify lncRNAs from the GENCODE dataset, which is composed of full- and partial-length transcripts. Thus, their evaluation underestimated the performance of LncADeep. In this correspondence, we have tested LncADeep's default setting (i.e., the model for transcripts including partial-length) on the datasets used in Amin et al., and LncADeep achieved overall the best performance compared with the other tools' results reported by Amin et al.
doi:10.1101/683425 fatcat:334as7lp2rc4lmx2gikgxfjbfm
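
The "no-skill model" mentioned in the abstract above is the usual baseline for a precision-recall curve: a classifier with no discriminative power has precision equal to the fraction of positives in the data, whatever the recall. The following is a minimal sketch of that idea only; the labels and scores are made-up toy values, the function names are hypothetical, and nothing here reproduces the evaluation by Amin et al. or the LncADeep software.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def no_skill_precision(labels):
    """Precision of a no-skill classifier: the prevalence of the positive class."""
    return sum(labels) / len(labels)


# Toy data (assumed): 1 = lncRNA, 0 = mRNA; scores are hypothetical model outputs.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for t in (0.75, 0.5, 0.25):
    p, r = precision_recall(labels, scores, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
print(f"no-skill precision={no_skill_precision(labels):.2f}")
```

A curve that stays close to the no-skill line across thresholds indicates that the scores barely separate the two classes on that dataset.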

Aging progression of human gut microbiota

Congmin Xu, Huaiqiu Zhu, Peng Qiu
2019 BMC Microbiology  
Cheng Zhu at Georgia Institute of Technology for their interest to the project and useful discussions. We also thank Dr. Kuang Chen and Mr. Zhongjie Xie at  ... 
doi:10.1186/s12866-019-1616-2 pmid:31660868 pmcid:PMC6819604 fatcat:qvl6kp7w5bc3hpujqzu4th4cki

LightCUD: a program for diagnosing IBD based on human gut microbiome data

Congmin Xu, Man Zhou, Zhongjie Xie, Mo Li, Xi Zhu, Huaiqiu Zhu
2021 BioData Mining  
Background: The diagnosis of inflammatory bowel disease (IBD) and discrimination between the types of IBD are clinically important. IBD is associated with marked changes in the intestinal microbiota. Advances in next-generation sequencing (NGS) technology and the improved bioinformatics analysis ability of hospitals motivated us to develop a diagnostic method based on the gut microbiome. Results: Using a set of whole-genome sequencing (WGS) data from 349 human gut microbiota samples with two types of IBD and healthy controls, we assembled and aligned WGS short reads to obtain feature profiles of strains and genera. The genus and strain profiles were used to construct the 16S-based and WGS-based diagnostic modules, respectively. We designed a novel feature selection procedure to select case-specific features. With these features, we built discrimination models using different machine learning algorithms. LightGBM outperformed the other algorithms in this study and was therefore chosen as the core algorithm. Specifically, we identified two small sets of biomarkers (strains), one for the WGS-based health vs IBD module and one for the ulcerative colitis vs Crohn's disease module, which contributed to the optimization of model performance during pre-training. We released LightCUD as an IBD diagnostic program built with LightGBM. Its high performance has been validated through five-fold cross-validation and on an independent test data set. LightCUD was implemented in Python and packaged for free installation with customized databases. With WGS data or 16S rRNA sequencing data of gut microbiome samples as the input, LightCUD can discriminate IBD from healthy controls with high accuracy and further identify the specific type of IBD. The executable program LightCUD was released as open source with instructions on its webpage. The identified strain biomarkers could be used to study the critical factors for disease development and to recommend treatments regarding changes in the gut microbial community. Conclusions: As the first released human gut microbiome-based IBD diagnostic tool, LightCUD demonstrates high performance for both WGS and 16S sequencing data. The strains that either distinguish healthy controls from IBD patients or distinguish the specific type of IBD are expected to be clinically important as biomarkers.
doi:10.1186/s13040-021-00241-2 pmid:33468221 fatcat:uyilrcdsgzeg7g5zkrcydiu3qu

InteMAP: Integrated metagenomic assembly pipeline for NGS short reads

Binbin Lai, Fumeng Wang, Xiaoqi Wang, Liping Duan, Huaiqiu Zhu
2015 BMC Bioinformatics  
Next-generation sequencing (NGS) has greatly facilitated metagenomic analysis but also raised new challenges for metagenomic DNA sequence assembly, owing to its high-throughput nature and the extremely short reads generated by sequencers such as Illumina. To date, how to generate a high-quality draft assembly for metagenomic sequencing projects has not been fully addressed. Results: We conducted a comprehensive assessment of state-of-the-art de novo assemblers and revealed that the performance of each assembler depends critically on the sequencing depth. To address this problem, we developed a pipeline named InteMAP to integrate three assemblers, ABySS, IDBA-UD and CABOG, which were found to complement each other in assembling metagenomic sequences. By deciding which assembly approach to use according to a sequencing-coverage estimate for each short read, the pipeline provides an automatic platform suitable for assembling real metagenomic NGS data with an uneven distribution of sequencing depth. By comparing the performance of InteMAP with current assemblers on both synthetic and real NGS metagenomic data, we demonstrated that InteMAP achieves better performance, with a longer total contig length and higher contiguity, and its assemblies contain more genes than those of the others. Conclusions: We developed a de novo pipeline, named InteMAP, that integrates existing tools for metagenomic assembly. The pipeline outperforms previous assembly methods on metagenomic assembly by providing a longer total contig length, a higher contiguity and covering more genes. InteMAP, therefore, could potentially be a useful tool for the metagenomics research community.
doi:10.1186/s12859-015-0686-x pmid:26250558 pmcid:PMC4545859 fatcat:uy74jdhnafdjbkk6r6lpdrlduy

Identifying micro-inversions using high-throughput sequencing reads

Feifei He, Yang Li, Yu-Hang Tang, Jian Ma, Huaiqiu Zhu
2016 BMC Genomics  
The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. It is acknowledged that MIs are an important type of genomic variation and may play roles in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. Results: The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. Conclusions: To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore our tool is expected to be useful for improving the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from:
doi:10.1186/s12864-015-2305-7 pmid:26818118 pmcid:PMC4895285 fatcat:2xjgz7rl6bd4zmssvmap6wznay
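
To make the notion of a micro-inversion concrete: on the forward strand, an inverted segment appears as the reverse complement of the original segment, which is why a read spanning it tends not to align end to end. The sketch below only illustrates that definition on a toy sequence; the function names and coordinates are hypothetical, and this is not the MID algorithm, which the abstract describes as a dynamic programming path-finding approach.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(seq, start, end):
    """Invert seq[start:end] (0-based, end-exclusive) in place on the forward strand."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


# Toy example (assumed sequence): invert the 3 bp segment "CCG".
reference = "AAACCGTTT"
sample = apply_inversion(reference, 3, 6)
print(sample)
```

Applying the same inversion twice restores the reference sequence, which is a convenient sanity check for any coordinates used.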

A dynamic Bayesian network approach to protein secondary structure prediction

Xin-Qiu Yao, Huaiqiu Zhu, Zhen-Su She
2008 BMC Bioinformatics  
Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information relevant to the sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM). Results: In this paper, we report a new probabilistic method for protein secondary structure prediction, based on dynamic Bayesian networks (DBN). The new method models the PSI-BLAST profile of a protein sequence using a multivariate Gaussian distribution, and simultaneously takes into account the dependency between the profile and secondary structure and the dependency between profiles of neighboring residues. In addition, a segment length distribution is introduced for each secondary structure state. Tests show that the DBN method significantly improves accuracy compared with other pure HMM-type methods. Further improvement is achieved by combining the DBN with an NN, a method called DBNN, which shows better Q3 accuracy than many popular methods and is competitive with the current state of the art. The most interesting feature of DBN/DBNN is that a significant improvement in prediction accuracy is achieved when it is combined with other methods by a simple consensus. Conclusion: The DBN method, using a Gaussian distribution for the PSI-BLAST profile and a high-order dependency between profiles of neighboring residues, produces significantly better prediction accuracy than other HMM-type probabilistic methods. Owing to their different nature, the DBN and NN combine to form a more accurate method, DBNN. Future improvement may be achieved by combining DBNN with an SVM-type method.
doi:10.1186/1471-2105-9-49 pmid:18218144 pmcid:PMC2266706 fatcat:tgttvadxznbfbmshsqh6wad2ku
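
The Q3 accuracy cited in the abstract above is the standard three-state measure: the fraction of residues whose predicted state (helix H, strand E, or coil C) matches the observed one. A minimal sketch follows; the function name and the two toy strings are assumptions for illustration and are not taken from the paper's benchmark.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state label (H, E, C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy example (assumed labels): one of eight residues is mispredicted.
predicted = "HHHEECCC"
observed = "HHHEEECC"
print(q3_accuracy(predicted, observed))
```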

Computational evaluation of TIS annotation for prokaryotic genomes

Gang-Qing Hu, Xiaobin Zheng, Li-Ning Ju, Huaiqiu Zhu, Zhen-Su She
2008 BMC Bioinformatics  
[3] and Zhu et al.  ...  Taking S. solfataricus as an example, Zhu, et al.  ... 
doi:10.1186/1471-2105-9-160 pmid:18366730 pmcid:PMC2362131 fatcat:v24alvvjtnebpfw4em24y3mudi

NanoReviser: An Error-correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm [article]

Luotong Wang, Li Qu, Longshu Yang, Yiying Wang, Huaiqiu Zhu
2020 bioRxiv   pre-print
Nanopore sequencing is regarded as one of the most promising third-generation sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing to produce very long reads, with an expected impact on genomics. However, nanopore sequencing reads are susceptible to a fairly high error rate owing to the difficulty in identifying the DNA bases from the complex electrical signals. Although several basecalling tools have been developed for nanopore sequencing over the past years, it is still challenging to correct the sequences after applying the basecalling procedure. In this study, we developed an open-source DNA basecalling reviser, NanoReviser, based on a deep learning algorithm to correct the basecalling errors introduced by the current default basecallers. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers. By employing convolutional neural networks (CNNs) and bidirectional long short-term memory (Bi-LSTM) networks, we took advantage of the information from both the raw electrical signals and the basecalled sequences. Our results showed that NanoReviser, as a post-basecalling reviser, significantly improves the basecalling quality. After being trained on standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser reduced the sequencing error rate by over 5% for both the E. coli dataset and the human dataset. The performance of NanoReviser was found to be better than that of all current basecalling tools. Furthermore, we analyzed the modified bases of the E. coli dataset and added the methylation information to train our module. With the methylation annotation, NanoReviser reduced the error rate by 7% for the E. coli dataset and specifically reduced the error rate by over 10% for the regions of the sequence rich in methylated bases. To the best of our knowledge, NanoReviser is the first post-processing tool after basecalling to accurately correct the nanopore sequences without the time-consuming procedure of building the consensus sequence. The NanoReviser package is freely available at
doi:10.1101/2020.07.25.220855 fatcat:gmqz3l3npfbofk3c6ik4hkvkdq

Genome reannotation of Escherichia coli CFT073 with new insights into virulence

Chengwei Luo, Gang-Qing Hu, Huaiqiu Zhu
2009 BMC Genomics  
The genome of uropathogenic Escherichia coli strain CFT073, a human pathogen, was sequenced and published in 2002, a significant step in pathogenic bacterial genomics research. However, the current RefSeq annotation of this pathogen is now outdated to some degree, owing to missing or incorrect annotation of some essential genes associated with its virulence. We carried out a systematic reannotation by combining automated annotation tools with manual efforts to provide a comprehensive understanding of virulence for the CFT073 genome. Results: The reannotation excluded 608 coding sequences from the RefSeq annotation. Meanwhile, a total of 299 coding sequences were newly added; about one third of them are found in genomic island (GI) regions, while more than one fifth of them are located in virulence-related pathogenicity islands (PAIs). Furthermore, the translational initiation sites (TISs) of 341 genes were relocated, which resulted in a high quality of gene start annotation. In addition, 94 pseudogenes annotated in RefSeq were thoroughly inspected and updated. The number of miscellaneous genes (sRNAs) has been updated from 6 in RefSeq to 46 in the reannotation. Based on the adjustments in the reannotation, subsequent analyses were conducted through both general and case studies on new virulence factors or new virulence-associated genes that are crucial during the urinary tract infection (UTI) process, including invasion, colonization, nutrient uptake and population density control. Furthermore, miscellaneous RNAs collected in the reannotation are believed to contribute to the virulence of strain CFT073. The reannotation, including the nucleotide data, the original RefSeq annotation, and all reannotated results, is freely available online. Conclusion: The reannotation presents a more comprehensive picture of the mechanisms of uropathogenicity of UPEC strain CFT073. The new genes change the view of its uropathogenicity in many respects, particularly the new genes in GI regions and the new virulence-associated factors. The reannotation thus serves as an important source of new information about genomic structure and organization, and gene function. Moreover, we expect that the detailed analysis will facilitate the exploration of novel virulence mechanisms and help guide experimental design.
doi:10.1186/1471-2164-10-552 pmid:19930606 pmcid:PMC2785843 fatcat:7nyy5tdarjhw7foaofxm7l62za

Gene prediction in metagenomic fragments based on the SVM algorithm

Yongchu Liu, Jiangtao Guo, Gangqing Hu, Huaiqiu Zhu
2013 BMC Bioinformatics  
Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. Results: In this article, we present a novel gene prediction method named MetaGUN for metagenomic fragments based on the SVM machine learning approach. It implements a three-stage strategy to predict genes. Firstly, it classifies input fragments into phylogenetic groups by a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by employing a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds the universal module and the novel module. The former is based on a set of representative species, while the latter is designed to find potential functional DNA sequences with conserved domains. Conclusions: Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN predicts better results on both 3' and 5' ends of genes with fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes for two samples of human gut microbiome. It identifies thousands of additional genes with significant evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.
doi:10.1186/1471-2105-14-s5-s12 pmid:23735199 pmcid:PMC3622649 fatcat:tp5w47gy25azfagxxfldffgroa
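
The first stage described in the abstract above bins fragments using k-mer content. The abstract does not specify the binning method, so the following is only a generic sketch of the kind of feature such methods start from: a normalized k-mer frequency vector for one fragment. The function name, the choice of k, and the toy sequence are assumptions; this is not MetaGUN's implementation.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Normalized frequencies of all 4**k DNA k-mers in seq, in lexicographic order.

    Windows containing characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Toy fragment (assumed): 2-mer profile has 16 entries that sum to 1.
profile = kmer_profile("ACGTACGT", k=2)
print(len(profile), round(sum(profile), 6))
```

Vectors like this can then be compared between a fragment and reference groups with any distance or classifier; which one a given tool uses is a separate design choice.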

Horizontal gene transfer in an acid mine drainage microbial community

Jiangtao Guo, Qi Wang, Xiaoqi Wang, Fumeng Wang, Jinxian Yao, Huaiqiu Zhu
2015 BMC Genomics  
Horizontal gene transfer (HGT) has been widely identified in complete prokaryotic genomes. However, the roles of HGT among members of a microbial community and in evolution remain largely unknown. With the emergence of metagenomics, it is nontrivial to investigate such horizontal flow of genetic materials among members in a microbial community from the natural environment. Because of the lack of suitable methods for metagenomics gene transfer detection, microorganisms from a low-complexity community, acid mine drainage (AMD), with near-complete genomes were used to detect possible gene transfer events and suggest their biological significance. Results: Using the annotation of coding regions by current tools, a phylogenetic approach, and an approximately unbiased test, we found that HGTs in AMD organisms are not rare, and we predicted 119 putative transferred genes. Among them, 14 HGT events were determined to be transfer events among the AMD members. Further analysis of the 14 transferred genes revealed that the HGT events affected the functional evolution of archaea or bacteria in AMD, and that HGT probably shaped the community structure, such as the dominance of G-plasma among archaea in AMD. Conclusions: Our study provides a novel insight into HGT events among microorganisms in natural communities. The interconnectedness between HGT and community evolution is essential to understanding microbial community formation and development.
doi:10.1186/s12864-015-1720-0 pmid:26141154 pmcid:PMC4490635 fatcat:rawejgsgzng73ngripzp7ovjhq

The landscape of micro-inversions provides clues for population genetic analysis of humans [article]

Li Qu, Luotong Wang, Feifei He, Yilun Han, Longshu Yang, May D. Wang, Huaiqiu Zhu
2020 bioRxiv   pre-print
Variations in the human genome have been studied extensively. However, little is known about the role of micro-inversions (MIs), generally defined as small (<100 bp) inversions, in human evolution, diversity, and health. Depicting the pattern of MIs among diverse populations is critical for interpreting human evolutionary history and obtaining insight into genetic diseases. Results: In this paper, we explored the distribution of MIs in genomes from 26 human populations and 7 nonhuman primate genomes and analyzed the phylogenetic structure of the 26 human populations based on the MIs. We further investigated the functions of the MIs located within genes associated with human health. With hg19 as the reference genome, we detected 6,968 MIs among the 1,937 human samples and 24,476 MIs among the 7 nonhuman primate genomes. The analyses of MIs in human genomes showed that the MIs were rarely located in exonic regions. Nonhuman primates and human populations shared only 82 inverted alleles, and Africans had the most inverted alleles in common with nonhuman primates, which was consistent with the Out of Africa hypothesis. The clustering of MIs among the human populations also coincided with human migration history and ancestral lineages. Conclusions: We propose that MIs are potential evolutionary markers for investigating population dynamics. Our results revealed the diversity of MIs in human populations and showed that they are essential to constructing human population relationships and have a potential effect on human health.
doi:10.1101/2020.07.27.218867 fatcat:qyps7qjqcjb2jck36nzwjnkz4u

A mutation degree model for the identification of transcriptional regulatory elements

Changqing Zhang, Jin Wang, Xu Hua, Jinggui Fang, Huaiqiu Zhu, Xiang Gao
2011 BMC Bioinformatics  
Current approaches for identifying transcriptional regulatory elements mainly combine two properties: evolutionary conservation and the overrepresentation of functional elements in the promoters of co-regulated genes. Despite the development of many motif detection algorithms, the discovery of conserved motifs in a wide range of phylogenetically related promoters is still a challenge, especially for short motifs embedded in distantly related gene promoters or very closely related promoters, or when there are not enough orthologous genes available. Results: A mutation degree model is proposed and a new word counting method is developed for the identification of transcriptional regulatory elements from a set of co-expressed genes. The new method comprises two parts: 1) identifying overrepresented oligo-nucleotides in promoters of co-expressed genes, 2) estimating the conservation of the oligo-nucleotides in promoters of phylogenetically related genes by the mutation degree model. Compared with other algorithms, our method shows the advantages of a low false positive rate and higher specificity, and especially robustness to noisy data. When the method was applied to co-expressed gene sets from Arabidopsis, most known cis-elements were successfully detected. The tool and example are available online. Conclusions: The mutation degree model proposed in this paper is adapted to phylogenetic data of different qualities, and to a wide range of evolutionary distances. The new word-counting method based on this model performs better in detecting short cis-element sequences from co-expressed genes of eukaryotes and is robust to less complete phylogenetic data.
doi:10.1186/1471-2105-12-262 pmid:21708002 pmcid:PMC3228546 fatcat:erwkyuaxgbenxp7fgl3ingdzkm

Identify phage hosts from metaviromic short reads based on deep learning and Markov chain model [article]

Jie Tan, Zhencheng Fang, Shufang Wu, Qian Guo, Xiaoqing Jiang, Huaiqiu Zhu
2021 bioRxiv   pre-print
Phages, viruses that infect bacteria and archaea, are dominant in the virosphere and play an important role in the microbial community. Identifying the host of a given phage fragment from metavirome data is very important for understanding the ecological impact of phages in a microbial community. State-of-the-art tools for host identification only present reliable results on long sequences within a narrow candidate host range, while there are a large number of short fragments in metagenomic data and the taxonomic composition of a microbial community is often complicated. Here, we present a method, named HoPhage, to identify the host of a given phage fragment from metavirome data at the genus level. HoPhage integrates two modules, one using deep learning algorithms and the other using a Markov chain model. Tested on both an artificial benchmark dataset of phage contigs and real virome data, HoPhage demonstrates satisfactory performance on short fragments within a wide candidate host range at every taxonomic level. HoPhage is freely available at
doi:10.1101/2021.03.01.433351 fatcat:zzq4fupmm5h5rfbfbuoocihefe

Dynamic functional connectivity states characterize NREM sleep and wakefulness

Shuqin Zhou, Guangyuan Zou, Jing Xu, Zihui Su, Huaiqiu Zhu, Qihong Zou, Jia‐Hong Gao
2019 Human Brain Mapping  
According to recent neuroimaging studies, temporal fluctuations in functional connectivity patterns can be clustered into dynamic functional connectivity (DFC) states and correspond to fluctuations in vigilance. However, whether DFC states are consistently associated with wakefulness and sleep stages, and what the characteristics and electrophysiological origin of these states are, remain unclear. The aims of the current study were to investigate the properties of DFC in different sleep stages and to explore the relationship between the characteristics of DFC and slow-wave activity. We collected both eyes-closed wakefulness and sleep data from 48 healthy young volunteers with simultaneous electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) recordings. EEG data were employed as the gold standard of sleep stage scoring, and DFC states were estimated based on fMRI data. The results demonstrated that DFC states of the fMRI signals consistently corresponded to wakefulness and non-rapid eye movement sleep stages independent of the number of clusters. Furthermore, the mean dwell time of these states significantly correlated with slow-wave activity. The inclusion or omission of regression of the global signal and the selection of parcellation schemes exerted minimal effects on the current findings. These results provide strong evidence that DFC states underlying fMRI signals match the fluctuations of vigilance and suggest a possible electrophysiological source of DFC states corresponding to vigilance states.
doi:10.1002/hbm.24770 pmid:31444893 fatcat:krm27s7emzecxd3pofpfpirg7y
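
The mean dwell time mentioned in the abstract above is commonly computed from the sequence of state labels assigned to successive time windows: for each state, average the lengths of its uninterrupted runs. The sketch below assumes that common definition and uses a made-up label sequence; the function name is hypothetical, dwell times are in units of windows, and the paper's exact procedure may differ.

```python
from itertools import groupby


def mean_dwell_time(states):
    """Mean length of consecutive runs for each state label, in number of windows."""
    runs = {}
    for state, group in groupby(states):
        # Each group is one uninterrupted run of the same state.
        runs.setdefault(state, []).append(sum(1 for _ in group))
    return {state: sum(lengths) / len(lengths) for state, lengths in runs.items()}


# Toy sequence (assumed): cluster labels for ten successive windows.
labels = [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
print(mean_dwell_time(labels))
```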