
Infusing Linguistic Knowledge of SMILES into Chemical Language Models [article]

Ingoo Lee, Hojung Nam
2022 arXiv   pre-print
The simplified molecular-input line-entry system (SMILES) is the most popular representation of chemical compounds. Therefore, many SMILES-based molecular property prediction models have been developed. In particular, transformer-based models show promising performance because the model utilizes a massive chemical dataset for self-supervised learning. However, there is no transformer-based model to overcome the inherent limitations of SMILES, which result from the generation process of SMILES.
In this study, we grammatically parsed SMILES to obtain connectivity between substructures and their type, which is called the grammatical knowledge of SMILES. First, we pretrained the transformers with substructural tokens, which were parsed from SMILES. Then, we used the training strategy 'same compound model' to better understand SMILES grammar. In addition, we injected knowledge of connectivity and type into the transformer with knowledge adapters. As a result, our representation model outperformed previous compound representations for the prediction of molecular properties. Finally, we analyzed the attention of the transformer model and adapters, demonstrating that the proposed model understands the grammar of SMILES.
arXiv:2205.00084v1 fatcat:isksqse4gjbcvpcca42k3klxze
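
For orientation, SMILES tokenization can be sketched as follows. This is not the grammatical substructure parser described in the abstract above, whose details are not given here. It is a plain atom-level tokenizer, swapped in as a simpler stand-in, and the pattern `SMILES_TOKEN` and the function `tokenize` are hypothetical names covering only common SMILES symbols.

```python
import re

# Assumption: a simplified token pattern, not a full SMILES grammar.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter atoms, tried before single letters
    r"|[BCNOPSFI]"           # one-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[()=#\-+\\/:.~@*$]"   # branches, bonds, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into atom-level tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens
```

A substructure-aware model would group such tokens into larger units (rings, branches) and record how they connect; this sketch stops at the flat token sequence.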

Prediction models for drug-induced hepatotoxicity by using weighted molecular fingerprints

Eunyoung Kim, Hojung Nam
2017 BMC Bioinformatics  
Drug-induced liver injury (DILI) is a critical issue in drug development because DILI causes failures in clinical trials and the withdrawal of approved drugs from the market. There have been many attempts to predict the risk of DILI based on in vivo and in silico identification of hepatotoxic compounds. In the current study, we propose the in silico prediction model predicting DILI using weighted molecular fingerprints. Results: In this study, we used 881 bits of molecular fingerprint and used
... s features describing presence or absence of each substructure of compounds. Then, the Bayesian probability of each substructure was calculated and labeled (positive or negative for DILI), and a weighted fingerprint was determined from the ratio of DILI-positive to DILI-negative probability values. Using weighted fingerprint features, the prediction models were trained and evaluated with the Random Forest (RF) and Support Vector Machine (SVM) algorithms. The constructed models yielded accuracies of 73.8% and 72.6%, AUCs of 0.791 and 0.768 in cross-validation. In independent tests, models achieved accuracies of 60.1% and 61.1% for RF and SVM, respectively. The results validated that weighted features helped increase overall performance of prediction models. The constructed models were further applied to the prediction of natural compounds in herbs to identify DILI potential, and 13,996 unique herbal compounds were predicted as DILI-positive with the SVM model. Conclusions: The prediction models with weighted features increased the performance compared to non-weighted models. Moreover, we predicted the DILI potential of herbs with the best performed model, and the prediction results suggest that many herbal compounds could have potential to be DILI. We can thus infer that taking natural products without detailed references about the relevant pathways may be dangerous. Considering the frequency of use of compounds in natural herbs and their increased application in drug development, DILI labeling would be very important.
doi:10.1186/s12859-017-1638-4 pmid:28617228 pmcid:PMC5471939 fatcat:vitoedcl6nclfgp6cbg3quxjda
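
The weighting step described in the abstract above can be sketched roughly as follows. The abstract does not give the exact formula, so the Laplace smoothing (`alpha`) and the function names `bit_weights` and `weight_fingerprint` are assumptions made for illustration, not the authors' implementation.

```python
def bit_weights(fingerprints, labels, alpha=1.0):
    """Per-bit ratio P(bit on | positive) / P(bit on | negative).

    fingerprints: list of equal-length 0/1 lists; labels: 1 = positive, 0 = negative.
    alpha is an assumed Laplace smoothing term that avoids division by zero.
    """
    n_bits = len(fingerprints[0])
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    weights = []
    for j in range(n_bits):
        pos_on = sum(fp[j] for fp, y in zip(fingerprints, labels) if y == 1)
        neg_on = sum(fp[j] for fp, y in zip(fingerprints, labels) if y == 0)
        p_pos = (pos_on + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_on + alpha) / (n_neg + 2 * alpha)
        weights.append(p_pos / p_neg)
    return weights


def weight_fingerprint(fp, weights):
    """Replace each set bit with its weight; unset bits stay at zero."""
    return [bit * w for bit, w in zip(fp, weights)]
```

The weighted vectors would then feed a classifier such as a random forest or SVM in place of the raw 0/1 bits.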

Drug repositioning of herbal compounds via a machine-learning approach

Eunyoung Kim, A-sol Choi, Hojung Nam
2019 BMC Bioinformatics  
Drug repositioning, also known as drug repurposing, defines new indications for existing drugs and can be used as an alternative to drug development. In recent years, the accumulation of large volumes of information related to drugs and diseases has led to the development of various computational approaches for drug repositioning. Although herbal medicines have had a great impact on current drug discovery, there are still a large number of herbal compounds that have no definite indications.
Results: In the present study, we constructed a computational model to predict the unknown pharmacological effects of herbal compounds using machine learning techniques. Based on the assumption that similar diseases can be treated with similar drugs, we used four categories of drug-drug similarity (e.g., chemical structure, side-effects, gene ontology, and targets) and three categories of disease-disease similarity (e.g., phenotypes, human phenotype ontology, and gene ontology). Then, associations between drug and disease were predicted using the employed similarity features. The prediction models were constructed using classification algorithms, including logistic regression, random forest and support vector machine algorithms. Upon cross-validation, the random forest approach showed the best performance (AUC = 0.948) and also performed well in an external validation assessment using an unseen independent dataset (AUC = 0.828). Finally, the constructed model was applied to predict potential indications for existing drugs and herbal compounds. As a result, new indications for 20 existing drugs and 31 herbal compounds were predicted and validated using clinical trial data. Conclusions: The predicted results were validated manually, confirming the performance and underlying mechanisms, for example, irinotecan as a treatment for neuroblastoma. From the prediction, herbal compounds were considered to be drug candidates for related diseases, which are important to develop further. The proposed prediction model can contribute to drug discovery by suggesting drug candidates from herbal compounds that have potential but have been little studied.
doi:10.1186/s12859-019-2811-8 fatcat:s5mfpdlwxvhsbfv4spipffju4m

Phenotype-oriented network analysis for discovering pharmacological effects of natural compounds

Sunyong Yoo, Hojung Nam, Doheon Lee
2018 Scientific Reports  
Although natural compounds have provided a wealth of leads and clues in drug development, the process of identifying their pharmacological effects is still a challenging task. Over the last decade, many in vitro screening methods have been developed to identify the pharmacological effects of natural compounds, but they are still costly processes with low productivity. Therefore, in silico methods, primarily based on molecular information, have been proposed. However, large-scale analysis is
... ly considered, since many natural compounds do not have molecular structure and target protein information. Empirical knowledge of medicinal plants can be used as a key resource to solve the problem, but this information is not fully exploited and is used only as a preliminary tool for selecting plants for specific diseases. Here, we introduce a novel method to identify pharmacological effects of natural compounds from herbal medicine based on phenotype-oriented network analysis. In this study, medicinal plants with similar efficacy were clustered by investigating hierarchical relationships between the known efficacy of plants and 5,021 phenotypes in the phenotypic network. We then discovered significantly enriched natural compounds in each plant cluster and mapped the averaged pharmacological effects of the plant cluster to the natural compounds. This approach allows us to predict unexpected effects of natural compounds that have not been found by molecular analysis. When applied to verified medicinal compounds, our method successfully identified their pharmacological effects with high specificity and sensitivity. Natural compounds and their derivatives have been used as a valuable source of medicinal agents. To date, an impressive number of modern drugs have been derived from natural sources, many based on their use in herbal medicine [1-3]. Herbal medicine has accumulated considerable knowledge about the medicinal use of plants over the last thousand years. Additionally, herbal medicine is presumed to be safe, harmless and without side effects because of its natural origins [4,5]. Recent surveys showed that approximately 70-80% of the world's population depends on herbal medicine for their primary health care [6,7]. However, only a small number of plant species have been investigated by scientists and approved for commercial purposes while more than 35,000 plant species are used for medicinal purposes worldwide [8,9].
Therefore, a better understanding of herbal medicine through scientific analysis will provide new insights for drug development. Most previous studies on finding medicinal agents from herbal medicine were performed by in vitro assessment. The plant associated with the disease of interest was selected from herbal medicine. Then, the natural compound or plant itself was extracted, and its biological activities were confirmed by in vitro screening methods [10-13]. However, large-scale experiments are required to analyze a large number of constituent natural compounds, and the problem increases exponentially as the number of plants under consideration increases. Therefore, in silico approaches, such as similarity-based, network-based or mechanism-based methods, have been proposed to filter potential medicinal agents from numerous natural compounds [14-17]. Most of these studies have used herbal medicine information only as a preliminary tool to select plants or natural compounds for a certain disease.
doi:10.1038/s41598-018-30138-w pmid:30076354 pmcid:PMC6076245 fatcat:uzda5wgwpzajbmkimb47pxnwqe

SELF-BLM: Prediction of drug-target interactions via self-training SVM

Jongsoo Keum, Hojung Nam, Alexey Porollo
2017 PLoS ONE  
Predicting drug-target interactions is important for the development of novel drugs and the repositioning of drugs. To predict such interactions, there are a number of methods based on drug and target protein similarity. Although these methods, such as the bipartite local model (BLM), show promise, they often categorize unknown interactions as negative interaction. Therefore, these methods are not ideal for finding potential drug-target interactions that have not yet been validated as positive
interactions. Thus, here we propose a method that integrates machine learning techniques, such as self-training support vector machine (SVM) and BLM, to develop a self-training bipartite local model (SELF-BLM) that facilitates the identification of potential interactions. The method first categorizes unlabeled interactions and negative interactions among unknown interactions using a clustering method. Then, using the BLM method and self-training SVM, the unlabeled interactions are self-trained and final local classification models are constructed.
doi:10.1371/journal.pone.0171839 pmid:28192537 pmcid:PMC5305209 fatcat:irfzswf3u5cjrbenv2moptqi6y

DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences [article]

Ingoo Lee, Jongsoo Keum, Hojung Nam
2018 arXiv   pre-print
Identification of drug-target interactions (DTIs) plays a key role in drug discovery. The high cost and labor-intensive nature of in vitro and in vivo experiments have highlighted the importance of in silico-based DTI prediction approaches. In several computational models, conventional protein descriptors are shown to be not informative enough to predict accurate DTIs. Thus, in this study, we employ a convolutional neural network (CNN) on raw protein sequences to capture local residue patterns
participating in DTIs. With CNN on protein sequences, our model performs better than previous protein descriptor-based models. In addition, our model performs better than the previous deep learning model for massive prediction of DTIs. By examining the pooled convolution results, we found that our model can detect binding sites of proteins for DTIs. In conclusion, our prediction model for detecting local residue patterns of target proteins successfully enriches the protein features of a raw protein sequence, yielding better prediction results than previous approaches.
arXiv:1811.02114v1 fatcat:4df5zlprbnfvhh27f3b45tgwvy

Sequence-based prediction of protein binding regions and drug–target interactions

Ingoo Lee, Hojung Nam
2022 Journal of Cheminformatics  
Identifying drug–target interactions (DTIs) is important for drug discovery. However, searching all drug–target spaces poses a major bottleneck. Therefore, recently many deep learning models have been proposed to address this problem. However, the developers of these deep learning models have neglected interpretability in model construction, which is closely related to a model's performance. We hypothesized that training a model to predict important regions on a protein sequence would
increase DTI prediction performance and provide a more interpretable model. Consequently, we constructed a deep learning model, named Highlights on Target Sequences (HoTS), which predicts binding regions (BRs) between a protein sequence and a drug ligand, as well as DTIs between them. To train the model, we collected complexes of protein–ligand interactions and protein sequences of binding sites and pretrained the model to predict BRs for a given protein sequence–ligand pair via object detection employing transformers. After pretraining the BR prediction, we trained the model to predict DTIs from a compound token designed to assign attention to BRs. We confirmed that training the BRs prediction model indeed improved the DTI prediction performance. The proposed HoTS model showed good performance in BR prediction on independent test datasets even though it does not use 3D structure information in its prediction. Furthermore, the HoTS model achieved the best performance in DTI prediction on test datasets. Additional analysis confirmed the appropriate attention for BRs and the importance of transformers in BR and DTI prediction. The source code is available on GitHub (
doi:10.1186/s13321-022-00584-w pmid:35135622 pmcid:PMC8822694 fatcat:anqgprcf3faotoaywggxnfuosu

Identification of temporal association rules from time-series microarray data sets

Hojung Nam, KiYoung Lee, Doheon Lee
2009 BMC Bioinformatics  
One of the most challenging problems in mining gene expression data is to identify how the expression of any particular gene affects the expression of other genes. To elucidate the relationships between genes, an association rule mining (ARM) method has been applied to microarray gene expression data. However, a conventional ARM method has a limit on extracting temporal dependencies between gene expressions, though the temporal information is indispensable to discover underlying regulation
mechanisms in biological pathways. In this paper, we propose a novel method, referred to as temporal association rule mining (TARM), which can extract temporal dependencies among related genes. A temporal association rule has the form [gene A↑, gene B↓] → (7 min) [gene C↑], which represents that a high expression level of gene A and significant repression of gene B are followed by significant expression of gene C after 7 minutes. The proposed TARM method is tested with a Saccharomyces cerevisiae cell cycle time-series microarray gene expression data set. Results: In the parameter fitting phase of TARM, the fitted parameter set [threshold = ± 0.8, support ≥ 3 transactions, confidence ≥ 90%] with the best precision score for the KEGG cell cycle pathway has been chosen for the rule mining phase. With the fitted parameter set, numbers of temporal association rules with five transcriptional time delays (0, 7, 14, 21, 28 minutes) are extracted from gene expression data of 799 genes, which are pre-identified cell cycle relevant genes. From the extracted temporal association rules, associated genes that play the same role in biological processes within a short transcriptional time delay, and some temporal dependencies between genes with specific biological processes, are identified. Conclusion: In this work, we proposed TARM, which is an applied form of conventional ARM. TARM showed a higher precision score than Dynamic Bayesian network and Bayesian network. Advantages of TARM are that it tells us the size of transcriptional time delay between associated genes, activation and inhibition relationship between genes, and sets of co-regulators.
doi:10.1186/1471-2105-10-s3-s6 pmid:19344482 pmcid:PMC2665054 fatcat:szjavispcfa47ltlso2eg4wlhi
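
To make the rule form [gene A↑, gene B↓] → (delay) [gene C↑] concrete, a minimal sketch of counting support and confidence for one such rule follows. It assumes each time point is one transaction and that the delay is expressed in sampling steps; the abstract does not define transactions, so this is an assumption, and `discretize` and `rule_stats` are hypothetical names.

```python
def discretize(series, threshold=0.8):
    """Map expression values to 'up', 'down', or None using a symmetric threshold."""
    return [
        "up" if v >= threshold else "down" if v <= -threshold else None
        for v in series
    ]


def rule_stats(states, antecedent, consequent, lag):
    """Support and confidence of antecedent(t) -> consequent(t + lag).

    states: dict mapping gene name to a list of discretized states per time index.
    antecedent: list of (gene, state) pairs that must all hold at time t.
    consequent: one (gene, state) pair checked at time t + lag.
    """
    n = len(next(iter(states.values())))
    gene_c, state_c = consequent
    antecedent_hits = 0
    support = 0
    for t in range(n - lag):
        if all(states[g][t] == s for g, s in antecedent):
            antecedent_hits += 1
            if states[gene_c][t + lag] == state_c:
                support += 1
    confidence = support / antecedent_hits if antecedent_hits else 0.0
    return support, confidence
```

A rule would be kept when it clears both cut-offs, for instance support ≥ 3 and confidence ≥ 0.9 as in the fitted parameter set quoted above.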

Diagnostic imaging features of calyceal diverticulum in a cat

Yunjeong Nam, Youngwon Lee, Hojung Choi
2021 Korean Journal of Veterinary Research  
doi:10.14405/kjvr.2021.61.e17 fatcat:o3qelhq3tjf6zcbyweyenaaopa

Computational identification of altered metabolism using gene expression and metabolic pathways

Hojung Nam, Jinwon Lee, Doheon Lee
2009 Biotechnology and Bioengineering  
Understanding altered metabolism is an important issue because altered metabolism is often revealed as a cause or an effect in pathogenesis. It has also been shown to be an important factor in the manipulation of an organism's metabolism in metabolic engineering. Unfortunately, it is not yet possible to measure the concentration levels of all metabolites in the genome-wide scale of a metabolic network; consequently, a method that infers the alteration of metabolism is beneficial. The present
study proposes a computational method that identifies genome-wide altered metabolism by analyzing functional units of KEGG pathways. As control of a metabolic pathway is accomplished by altering the activity of at least one rate-determining step enzyme, not all gene expressions of enzymes in the pathway demonstrate significant changes even if the pathway is altered. Therefore, we measure the alteration levels of a metabolic pathway by selectively observing expression levels of significantly changed genes in a pathway. The proposed method was applied to two strains of Saccharomyces cerevisiae gene expression profiles measured in very high-gravity (VHG) fermentation. The method identified altered metabolic pathways whose properties are related to ethanol and osmotic stress responses which had been known to be observed in VHG fermentation because of the high sugar concentration in growth media and high ethanol concentration in fermentation products. With the identified altered pathways, the proposed method achieved best accuracy and sensitivity rates for the Red Star (RS) strain compared to other three related studies (gene-set enrichment analysis (GSEA), significance analysis of microarray to gene set (SAM-GS), reporter metabolite), and for the CEN.PK 113-7D (CEN) strain, the proposed method and the GSEA method showed comparably similar performances.
doi:10.1002/bit.22320 pmid:19378263 fatcat:e42ztwxjavbhtpeql6srdjmjjm

hERG-Att: Self-Attention-Based Deep Neural Network for Predicting hERG Blockers

Kim Hyunho, Nam Hojung
2020 Computational biology and chemistry  
A voltage-gated potassium channel encoded by the human ether-à-go-go-related gene (hERG) regulates cardiac action potential, and it is involved in cardiotoxicity with compounds that inhibit its activity. Therefore, the screening of hERG channel blockers is a mandatory step in the drug discovery process. The screening of hERG blockers by using conventional methods is inefficient in terms of cost and efforts. This has led to the development of many in silico hERG blocker prediction models.
However, constructing a high-performance predictive model with interpretability on hERG blockage by certain compounds is a major obstacle. In this study, we developed the first, attention-based, interpretable model that predicts hERG blockers and captures important hERG-related compound substructures. To do that, we first collected various datasets, ranging from public databases to publicly available private datasets, to train and test the model. Then, we developed a precise and interpretable hERG blocker prediction model by using deep learning with a self-attention approach that has an appropriate molecular descriptor, Morgan fingerprint. The proposed prediction model was validated, and the validation result showed that the model was well-optimized and had high performance. The test set performance of the proposed model was significantly higher than that of previous fingerprint-based conventional machine learning models. In particular, the proposed model generally had high accuracy and F1 score, thereby representing the model's predictive reliability. Furthermore, we interpreted the calculated attention score vectors obtained from the proposed prediction model and demonstrated the important structural patterns that are represented in hERG blockers. In summary, we have proposed a powerful and interpretable hERG blocker prediction model that can reduce the overall cost of drug discovery by accurately screening for hERG blockers and suggesting hERG-related substructures.
doi:10.1016/j.compbiolchem.2020.107286 pmid:32531518 fatcat:2epje6ftercwjpkd4bgsu4ub7y

Prognostic factor analysis for breast cancer using gene expression profiles

Soobok Joe, Hojung Nam
2016 BMC Medical Informatics and Decision Making  
The survival of patients with breast cancer is highly sporadic, from a few months to more than 15 years. In recent studies, the gene expression profiling of tumors has been used as a promising means of predicting prognosis factors. Methods: In this study, we used gene expression datasets of tumors to identify prognostic factors in breast cancer. We conducted log-rank tests and used unsupervised clustering methods to find reciprocally expressed gene sets associated with worse survival rates.
Prognosis prediction scores were determined as the ratio of gene expressions. Results: As a result, four prognosis prediction gene set modules were constructed. The four prognostic gene sets predicted worse survival rates in three independent gene expression data sets. In addition, we found that cancer patients with poor prognosis, i.e., triple-negative cancer, HER2-enriched, TP53 mutated and high-graded patients, had higher prognosis prediction scores than those with other types of breast cancer. Conclusions: In conclusion, based on a gene expression analysis, we suggest that our well-defined scoring method of the prediction of survival outcome may be useful for developing prognostic factors in breast cancer.
doi:10.1186/s12911-016-0292-5 pmid:27454576 pmcid:PMC4959370 fatcat:4qpzzevaojbvfm7zdyhfu72zlm

Systems assessment of transcriptional regulation on central carbon metabolism by Cra and CRP [article]

Donghyuk Kim, Sang Woo Seo, Hojung Nam, Gabriela I. Guzman, Ye Gao, Bernhard O. Palsson
2016 bioRxiv   pre-print
Two major transcriptional regulators of carbon metabolism in bacteria are Cra and CRP. CRP is considered to be the main mediator of catabolite repression. Unlike for CRP, available in vivo DNA binding information of Cra is scarce. Here we generate and integrate ChIP-exo and RNA-seq data to identify 39 binding sites for Cra and 97 regulon genes that are regulated by Cra in Escherichia coli. An integrated metabolic-regulatory network was formed by including experimentally-derived regulatory
... ation and a genome-scale metabolic network reconstruction. Applying analysis methods of systems biology to this integrated network showed that Cra enables the optimal bacterial growth on poor carbon sources by redirecting and repressing the glycolysis flux, by activating the glyoxylate shunt pathway, and by activating the respiratory pathway. In these regulatory mechanisms, the overriding regulatory activity of Cra over CRP is fundamental. Thus, elucidation of interacting transcriptional regulation of core carbon metabolism in bacteria by two key transcription factors was possible by combining genome-wide experimental measurement and simulation with a genome-scale metabolic model.
doi:10.1101/080929 fatcat:uqxknefmy5cpxogjhkyb6aegri

Prediction model construction of mouse stem cell pluripotency using CpG and non-CpG DNA methylation markers

Soobok Joe, Hojung Nam
2020 BMC Bioinformatics  
Genome-wide studies of DNA methylation across the epigenetic landscape provide insights into the heterogeneity of pluripotent embryonic stem cells (ESCs). Differentiating into embryonic somatic and germ cells, ESCs exhibit varying degrees of pluripotency, and epigenetic changes occurring in this process have emerged as important factors explaining stem cell pluripotency. Here, using paired scBS-seq and scRNA-seq data of mice, we constructed a machine learning model that predicts degrees of
pluripotency for mouse ESCs. Since the biological activities of non-CpG markers have yet to be clarified, we tested the predictive power of CpG and non-CpG markers, as well as a combination thereof, in the model. Through rigorous performance evaluation with both internal and external validation, we discovered that a model using both CpG and non-CpG markers predicted the pluripotency of ESCs with the highest prediction performance (0.956 AUC, external test). The prediction model consisted of 16 CpG and 33 non-CpG markers. The CpG and most of the non-CpG markers targeted depletions of methylation and were indicative of cell pluripotency, whereas only a few non-CpG markers reflected accumulations of methylation. Additionally, we confirmed that pluripotency differs between individual developmental stages, such as E3.5 and E6.5, as well as between induced mouse pluripotent stem cells (iPSCs) and somatic cells. In this study, we investigated CpG and non-CpG methylation in relation to mouse stem cell pluripotency and developed a model thereon that successfully predicts the pluripotency of mouse ESCs.
doi:10.1186/s12859-020-3448-3 pmid:32366211 fatcat:hfnbzvadejhk5lam4jguqvklpu

Identification of drug-target interaction by a random walk with restart method on an interactome network

Ingoo Lee, Hojung Nam
2018 BMC Bioinformatics  
Identification of drug-target interactions plays a key role in drug discovery. However, identifying drug-target interactions via in vitro and in vivo experiments is very laborious and time-consuming. Thus, predicting drug-target interactions by using computational approaches is a good alternative. In recent studies, many feature-based and similarity-based machine learning approaches have shown promising results in drug-target interaction predictions. A previous study showed that accounting
... vity information of drug-drug and protein-protein interactions increases prediction performance through the concept of 'guilt-by-association'. However, an approach that only considers directly connected nodes often misses the information that could be derived from distant nodes. Therefore, in this study, we obtain global network topology information by using a random walk with restart algorithm and apply the global topology information to the prediction model. Results: As a result, our prediction model demonstrates increased prediction performance compared to the 'guilt-by-association' approach (AUC 0.89 and 0.67 in the training and independent test, respectively). In addition, we show how features weighted by a random walk with restart yield better performance than the original features. Also, we confirmed that drugs and proteins that have a high degree of connectivity on the interactome network yield better performance in our model. Conclusions: The prediction models with features weighted by considering global network topology increased the prediction performance in both training and testing compared to non-weighted models and the previous 'guilt-by-association' method. In conclusion, global network topology information on protein-protein interaction and drug-drug interaction affects the prediction performance of drug-target interactions.
doi:10.1186/s12859-018-2199-x pmid:29897326 pmcid:PMC5998759 fatcat:h5vgmz5cbzgqxdtzveb7o7374y
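
The random walk with restart named in the abstract above is commonly written as the iteration p ← (1 − r)·W·p + r·p₀, where W is the column-normalized adjacency matrix and p₀ places the probability mass on the seed nodes. A minimal pure-Python sketch under that standard formulation follows. The restart probability of 0.3, the convergence tolerance, and the function name are assumptions, since the abstract does not state the paper's settings.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return visiting probabilities for every node, starting from the seed nodes.

    adj: square adjacency matrix as a list of lists (weights >= 0).
    seeds: list of node indices that receive the restart probability.
    """
    n = len(adj)
    # Column-normalize so that each column of W sums to 1 (or 0 for isolated nodes).
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p
```

Unlike a direct-neighbor score, the resulting vector assigns nonzero probability to nodes several hops from the seeds, which is the global topology information the abstract refers to.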