Mining the diseasome

Davnah Urbach, Jason H Moore
2011 BioData Mining  
suggested that biological and clinical information may serve as valuable expert knowledge for genetic association studies and that disease networks may provide useful guidance prior to and during data mining  ... 
doi:10.1186/1756-0381-4-25 pmid:21906309 pmcid:PMC3179740 fatcat:fyvvbpyq6rbg3bo6e6oq27z2vi

Mining beyond the exome

Davnah Urbach, Jason H Moore
2011 BioData Mining  
We then highlight the importance and the necessity of designing efficient methods to mine beyond the exome.  ... 
doi:10.1186/1756-0381-4-14 pmid:21668977 pmcid:PMC3144005 fatcat:leoctean5bbt5ob2zt35hezkue

Synthetic learning machines

Hemant Ishwaran, James D Malley
2014 BioData Mining  
Using a collection of different terminal nodesize constructed random forests, each generating a synthetic feature, a synthetic random forest is defined as a kind of hyperforest, calculated using the new input synthetic features, along with the original features. Results: Using a large collection of regression and multiclass datasets we show that synthetic random forests outperform both conventional random forests and the optimized forest from the regression portfolio. Conclusions: Synthetic
... rests removes the need for tuning random forests with no additional effort on the part of the researcher. Importantly, the synthetic forest does this with evidently no loss in prediction compared to a well-optimized single random forest.
doi:10.1186/s13040-014-0028-y pmid:25614764 pmcid:PMC4279689 fatcat:5rhcoe5hj5drbkxcvexnoplzuy

Motif mining based on network space compression

Qiang Zhang, Yuan Xu
2014 BioData Mining  
In this paper, we provide a new approach for motif mining based on compressing the searching space.  ...  Searching for sub-graphs in a network is the most important part of the motif mining process.  ...  Store graphs The storage of graphs is the first step in the process of solving the motif-mining problem.  ... 
doi:10.1186/s13040-014-0029-x pmid:25525470 pmcid:PMC4269098 fatcat:vchnkimhyvfj5oat6sylhlj6iy

Microarray enriched gene rank

Eugene Demidenko
2015 BioData Mining  
We develop a new concept that reflects how genes are connected based on microarray data using the coefficient of determination (the squared Pearson correlation coefficient). Our gene rank combines a priori knowledge about gene connectivity, say, from the Gene Ontology (GO) database, and the microarray expression data at hand, called the microarray enriched gene rank, or simply gene rank (GR). GR, similarly to Google PageRank, is defined in a recursive fashion and is computed as the left maximum
... eigenvector of a stochastic matrix derived from microarray expression data. An efficient algorithm is devised that allows computation of GR for 50 thousand genes with 500 samples within minutes on a personal computer using the public domain statistical package R. Results: Computation of GR is illustrated with several microarray data sets. In particular, we apply GR (1) to answer whether bad genes are more connected than good genes in relation to cancer patient survival, (2) to associate gene connectivity with cluster/subtypes in ovarian cancer tumors, and to determine whether gene connectivity changes (3) from organ to organ within the same organism and (4) between organisms. Conclusions: We have shown by examples that findings based on GR confirm biological expectations. GR may be used for hypothesis generation on gene pathways. It may be used for a homogeneous sample or for comparison of gene connectivity among cases and controls, or in a longitudinal setting.
doi:10.1186/s13040-014-0033-1 pmid:25649242 pmcid:PMC4305247 fatcat:mzn7gugdcvcjdbaatvdeem6qcu
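
The abstract above describes GR as the left leading eigenvector of a stochastic matrix built from squared Pearson correlations. The following is a minimal sketch of that idea only, not the author's algorithm: it assumes the stochastic matrix is obtained by row-normalising the off-diagonal r² values, it ignores the a priori (GO) connectivity, and the names `pearson_r2`, `gene_rank`, and the toy matrix `expr` are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation (coefficient of determination) of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)  # assumes neither profile is constant


def gene_rank(expr, iters=1000, tol=1e-12):
    """Left leading eigenvector of a row-stochastic matrix, by power iteration."""
    g = len(expr)
    # Connection weights: r^2 between distinct genes, zero on the diagonal.
    w = [[0.0 if i == j else pearson_r2(expr[i], expr[j]) for j in range(g)]
         for i in range(g)]
    # Row-normalise so that each row sums to one (assumed construction).
    p = [[v / sum(row) for v in row] for row in w]
    rank = [1.0 / g] * g
    for _ in range(iters):
        # Left multiplication: new_j = sum_i rank_i * p[i][j].
        new = [sum(rank[i] * p[i][j] for i in range(g)) for j in range(g)]
        converged = max(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank


# Toy expression matrix: 4 genes (rows) by 5 samples (columns), made up.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.9, 5.1],
    [5.0, 3.0, 4.0, 1.0, 2.0],
    [2.0, 2.5, 1.0, 3.0, 2.2],
]
rank = gene_rank(expr)
```

Under this assumed construction the weights are symmetric, so a gene's rank is proportional to the sum of its r² values with all other genes; the weakly correlated fourth gene therefore receives the smallest rank.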

The spatial dimension in biological data mining

Davnah Urbach, Jason H Moore
2011 BioData Mining  
The goal of this editorial is to highlight the spatial dimension of biological data mining. Among its numerous applications, data mining plays an increasingly important role in epidemiology.  ...  Since their first application, data mining procedures have progressively been tweaked to accommodate various types of information, including social science and biological data.  ... 
doi:10.1186/1756-0381-4-6 pmid:21477341 pmcid:PMC3084166 fatcat:4hewjj5uv5bytlipwdm7l37yti

Clustering-based approaches to SAGE data mining

Haiying Wang, Huiru Zheng, Francisco Azuaje
2008 BioData Mining  
It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation.  ...  However, due to the unique characteristics of SAGE data, mining this type of data poses a great challenge to the bio-data mining community.  ...  This paper places an emphasis on clustering-based approaches to SAGE data mining.  ... 
doi:10.1186/1756-0381-1-5 pmid:18822151 pmcid:PMC2553774 fatcat:4ahsd5qvorhnxfqha4mh46kshy

Conservation machine learning

Moshe Sipper, Jason H. Moore
2020 BioData Mining  
Editorial Ensemble techniques, wherein a model is composed of multiple (possibly) weaker models, are prevalent nowadays within the field of machine learning (ML). Well-known methods such as bagging [1], boosting [2], and stacking [3] are ML mainstays, widely (and fruitfully) deployed on a daily basis. Generally speaking, there are two types of ensemble methods, the first generating models in sequence (e.g., AdaBoost [2]), the second in a parallel manner (e.g., random forests [4] and evolutionary
... ithms [5]). AdaBoost (Adaptive Boosting) is an ML meta-algorithm that is used in conjunction with other types of learning algorithms to improve performance. The output of so-called "weak learners" is combined into a weighted sum that represents the final output of the boosted classifier. Adaptivity is obtained by tweaking subsequent weak learners in favor of those instances misclassified by previous classifiers. The maximum number of estimators at which boosting is terminated is a free parameter that has to be carefully set by the user. The popular Scikit-learn Python package, used extensively within the ML community, sets this default value to 50 [6]. A random forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and then outputting the majority class (for classification problems) or mean prediction (for regression problems) of the individual trees. The number of trees is a free parameter set by the user; the default Scikit-learn value is 100 (up from 10 in past versions) [6]. An evolutionary algorithm is a population-based approach that inherently produces a cornucopia of models over generations of evolution. Most often one seeks a single, final model (or a Pareto set of models, when multiple objectives are sought). Yet, as eloquently suggested by [7] in their paper's title, might we not obtain "Ensemble learning for free with evolutionary algorithms?" They proposed evolutionary ensemble learning, which extracts an ensemble either from the final population only or incrementally during evolution. Recently, [8] focused on genetic programming, wherein the individuals evolved are computational trees, introducing an ensemble coevolutionary algorithm that maintains two subpopulations, trees and forests, with the output model being a forest built as an ensemble of trees.
doi:10.1186/s13040-020-00220-z pmid:32774460 pmcid:PMC7405443 fatcat:ofnxi4tlnjbctbhwxwkrky5jae
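
The AdaBoost description above (a weighted sum of weak learners, with misclassified instances up-weighted each round) can be illustrated with a small from-scratch sketch. This is not Scikit-learn's implementation: the weak learner is assumed to be a one-dimensional threshold stump, labels are assumed to be +1/-1, and `fit_stump`, `adaboost`, `predict`, and the toy data are hypothetical. Only the default of 50 estimators is taken from the text.

```python
import math


def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best 1-D threshold stump."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            # The stump predicts `sign` at or above the threshold, `-sign` below.
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, n_estimators=50):
    """Train a list of (alpha, threshold, sign) weak learners."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_estimators):
        err, thr, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote weight
        ensemble.append((alpha, thr, sign))
        # Up-weight misclassified instances, down-weight correct ones.
        weights = [w * math.exp(-alpha * y * (sign if x >= thr else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of weak-learner outputs."""
    score = sum(alpha * (sign if x >= thr else -sign)
                for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, n_estimators=3)
predictions = [predict(ensemble, x) for x in xs]
```

The `n_estimators` argument plays the role of the free parameter discussed above; the toy call uses a small value purely to keep the example short.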

Data mining and the evolution of biological complexity

Davnah Urbach, Jason H Moore
2011 BioData Mining  
Regardless, it is a useful exercise to think about where biological complexity comes from as a way to facilitate the selection of data mining methods.  ...  The assumptions we make about this complexity greatly influence the analytical methods we choose for data mining and, in turn, our results and inferences.  ... 
doi:10.1186/1756-0381-4-7 pmid:21477342 pmcid:PMC3083376 fatcat:2vb7g77ccjcvpi65ria6c5hv2m

Empowering the data science scientist

Jason H. Moore
2021 BioData Mining  
Competing interests Author is Editor-in-Chief of BioData Mining.  ... 
doi:10.1186/s13040-021-00246-x pmid:33485343 fatcat:6wxk54bc25gwdi3ysiujyali4q

The limits of p-values for biological data mining

James D Malley, Abhijit Dasgupta, Jason H Moore
2013 BioData Mining  
BioData Mining 2013, 6:10  ...  This can be resolved by bringing the focus back to the scientific, data mining questions: What are the hypotheses of interest (are there different ways to frame the analysis)?  ... 
doi:10.1186/1756-0381-6-10 pmid:23663551 pmcid:PMC3668262 fatcat:ty6g7cxjcfc4fmzzkf2vm5o5ha

Big Data analysis on autopilot?

Scott M Williams, Jason H Moore
2013 BioData Mining  
BioData Mining can serve as a vortex for this kind of research and we hope to engage diverse cohorts of researchers to do so.  ... 
doi:10.1186/1756-0381-6-22 pmid:24314297 pmcid:PMC3878969 fatcat:v7py2mfgongyfme7ihd5pqzmoy

The optimal crowd learning machine

Bilguunzaya Battogtokh, Majid Mojirsheibani, James Malley
2017 BioData Mining  
Any family of learning machines can be combined into a single learning machine using various methods with myriad degrees of usefulness. Results: For making predictions on an outcome, it is provably at least as good as the best machine in the family, given sufficient data. And if one machine in the family minimizes the probability of misclassification, in the limit of large data, then Optimal Crowd does also. That is, the Optimal Crowd is asymptotically Bayes optimal if any machine in the crowd
... s such. Conclusions: The only assumption needed for proving optimality is that the outcome variable is bounded. The scheme is illustrated using real-world data from the UCI machine learning site, and possible extensions are proposed. Background: The universe of statistical learning machines is still rapidly expanding, and new methods are being introduced almost daily. Despite these advances, choosing one machine over many other plausible machines, or one particular version from within a family, can be arduous and resource intensive. Equally important, understanding how the schemes work, and the results they produce, remains a separate and ongoing challenge. Unfortunately, for most researchers, many learning machines are "black boxes." For a general, self-contained, and relatively nontechnical introduction to learning machines, see [1]. The scheme described here encourages the implementation of multiple and diverse machines. It begins with a family of machines, each making separate predictions or classifications, given the training data. These individual predictions are then used as inputs to a single machine. This final machine is itself functionally transparent and does not require any user-supplied tuning parameters or parameter estimation. It is virtually assumption free, as it only requires that the outcome variable is bounded. An earlier version of this approach has been studied in the literature under the topic of stacking. More recently deep learning has been introduced, versions of which use individual machine predictions as layers, themselves used as inputs to another machine; for details of both, see [2]. The scheme discussed here is distinct from these approaches. Most notably, the optimal crowd uses predictions for a test point, generated by the separate machines, to direct the researcher to a specific subset of the training data. Then the known outcomes in the training data that are closest to the test point are simply averaged. As discussed below, when the scheme is used for pure classification, over zero/one outcomes, the measure of closeness is immediate, requiring no tuning or new parameters.
doi:10.1186/s13040-017-0135-7 pmid:28533819 pmcid:PMC5437584 fatcat:nxeoz7na3jebvjii7bfx6swp3a
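
The combining step described above for zero/one outcomes can be sketched as follows. This is one reading of the abstract, not the authors' implementation: "closest" is assumed to mean that every machine gives a training point the same zero/one prediction it gives the test point, and the fallback to the overall training mean when nothing matches is an added assumption. The function `crowd_predict`, the two threshold machines, and the toy data are hypothetical.

```python
def crowd_predict(machines, train_x, train_y, x):
    """Average the known outcomes of training points the machines treat like x."""
    # Vector of zero/one predictions the machines make for the test point.
    signature = tuple(m(x) for m in machines)
    # Training outcomes whose prediction vector is identical to the test point's.
    matched = [y for xi, y in zip(train_x, train_y)
               if tuple(m(xi) for m in machines) == signature]
    if not matched:
        matched = train_y  # assumed fallback: no matching training points
    return sum(matched) / len(matched)


# Two deliberately weak machines: each thresholds a single coordinate.
machines = [
    lambda p: int(p[0] > 0.5),
    lambda p: int(p[1] > 0.5),
]

# Toy XOR-like training data; neither machine alone predicts the outcome well.
train_x = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9), (0.8, 0.1), (0.3, 0.3), (0.7, 0.9)]
train_y = [0, 0, 1, 1, 0, 0]

estimate_a = crowd_predict(machines, train_x, train_y, (0.1, 0.7))
estimate_b = crowd_predict(machines, train_x, train_y, (0.9, 0.9))
```

The returned value is an estimated probability of outcome 1; thresholding it at 0.5 gives a class label. Note that the matching step needs no tuning parameter, in line with the description above.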

A call for biological data mining approaches in epidemiology

Shannon M. Lynch, Jason H. Moore
2016 BioData Mining  
Analyzing big data requires knowledge and execution of data mining techniques.  ...  Thus, an Epidemiology-Big Data collaboration is of mutual benefit to both groups, and it is the goal of BioData Mining to foster these types of collaborations.  ...  A partnership with epidemiology would expand the application and reach of data mining methods beyond just genomic or proteomic investigations.  ... 
doi:10.1186/s13040-015-0079-8 pmid:26734074 pmcid:PMC4700596 fatcat:2vpqrboy55eq7jum4rfift5rrm

The Dark Proteome Database

Nelson Perdigão, Agostinho C. Rosa, Seán I. O'Donoghue
2017 BioData Mining  
Recently we surveyed the dark-proteome, i.e., regions of proteins never observed by experimental structure determination and inaccessible to homology modelling. Surprisingly, we found that most of the dark proteome could not be accounted for by conventional explanations (e.g., intrinsic disorder, transmembrane domains, and compositional bias), and that nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. In this paper
... will present the Dark Proteome Database (DPD) and associated web services that provide access to updated information about the dark proteome. Results: We assembled DPD from several external web resources (primarily Aquaria and Swiss-Prot) and stored it in a relational database currently containing ~10 million entries and occupying ~2 GBytes of disk space. This database comprises two key tables: one giving information on the 'darkness' of each protein, and a second table that breaks each protein into dark and non-dark regions. In addition, a second version of the database is created that also uses information from the Protein Model Portal (PMP) to determine darkness. To provide access to DPD, a web server has been implemented giving access to all underlying data, as well as providing access to functional analyses derived from these data. Conclusions: Availability of this database and its web service will help focus future structural and computational biology efforts to study the dark proteome, thus providing a basis for understanding a wide variety of biological functions that currently remain unknown. Availability and implementation: DPD is available at The complete database is also available upon request. Data use is permitted via the Creative Commons Attribution-NonCommercial International license (http://creativecommons.org/licenses/by-nc/4.0/).
doi:10.1186/s13040-017-0144-6 pmid:28736578 pmcid:PMC5520327 fatcat:xgsif7tnmfhd5dq3zc5o6yfyye
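
The two-table layout mentioned in the abstract (one table for per-protein darkness, one for dark and non-dark regions) might look roughly like the sketch below. The real DPD schema is not given in this listing, so every table name, column name, and value here is hypothetical, and storing darkness as the fraction of residues in dark regions is an assumption made for illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables mirroring the description.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    accession TEXT PRIMARY KEY,
    length    INTEGER NOT NULL,
    darkness  REAL NOT NULL          -- assumed: fraction of dark residues
);
CREATE TABLE region (
    accession TEXT NOT NULL REFERENCES protein(accession),
    start_pos INTEGER NOT NULL,      -- 1-based, inclusive
    end_pos   INTEGER NOT NULL,      -- 1-based, inclusive
    is_dark   INTEGER NOT NULL       -- 1 for a dark region, 0 otherwise
);
""")

# Made-up example protein: residues 151-200 of 200 are dark.
conn.execute("INSERT INTO protein VALUES ('P_EXAMPLE', 200, 0.25)")
conn.executemany(
    "INSERT INTO region VALUES (?, ?, ?, ?)",
    [("P_EXAMPLE", 1, 150, 0), ("P_EXAMPLE", 151, 200, 1)],
)

# Recompute the dark fraction from the region table and compare it
# with the stored per-protein value.
row = conn.execute("""
SELECT p.accession,
       p.darkness,
       SUM(CASE WHEN r.is_dark THEN r.end_pos - r.start_pos + 1 ELSE 0 END)
           * 1.0 / p.length AS dark_fraction
FROM protein AS p
JOIN region AS r ON r.accession = p.accession
GROUP BY p.accession
""").fetchone()
```

Keeping regions in their own table lets the per-protein summary be rechecked against the region-level data, which is one plausible reason for such a split.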