and J Thierry-Mieg unpublished observations). ... In contrast, the specificity of ECgene drops to 28%, because of their ... Genome Biology 2006, Volume 7, Supplement 1, Article S12 (http://genomebiology.com/2006/7/S1/S12). doi:10.1186/gb-2006-7-s1-s12 pmid:16925834 pmcid:PMC1810549 fatcat:lvskvottynadho2fqaz2e7vikm
Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. To address these issues, we introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. It uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. We show that Magic-BLAST is the best at intron discovery over a wide range of conditions. It is versatile and robust to high levels of mismatches or extreme base composition and works well with very long reads. It is reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI. doi:10.1101/390013 fatcat:x4trrqjerbb3ndkebd7fm4fcte
Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. Results: Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. Conclusions: We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI. doi:10.1186/s12859-019-2996-x fatcat:d6klskyz3vhvvaum3jtquwamju
Acknowledgments: We would like to thank Mehmet Kayaalp at the NLM and Yann Thierry-Mieg at LIP6 for insightful suggestions, our colleagues at the NCBI for numerous discussions on neural networks, and especially ... arXiv:1812.07538v1 fatcat:gryoulvqqrc5nojjrghja5pfoe
per sample area) were measured as the dependent variable and treatment means, sample sizes and variance estimates were reported. Included were two experiments on macroalgae (this study), five experiments with periphyton in freshwater, brackish and marine ecosystems (refs 27, 28), two experiments with salt marsh plants (ref. 29), and one with lake phytoplankton (ref. 30), including subtropical and temperate climates in North America and Europe. We analysed data from sampling dates when species richness reached the seasonal peak, which was usually in late spring or summer. Data were standardized using the common meta-analysis metric of standardized effect size, Hedges's d (ref. 21). This is a measure of the difference between experimental and control means, divided by a pooled standard deviation and multiplied by a correction factor to account for small sample sizes. Homogeneity of effect sizes was tested using the Q-statistic (ref. 21). As we detected significant heterogeneity among effect sizes, we split the data set into low-productivity (oligotrophic and mesotrophic) and high-productivity (eutrophic) sites, based on information provided in the publications. doi:10.1038/nature00831 pmid:12075352 fatcat:hk2rk6qiwfcrzhpjngvyyssd7u
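The effect-size calculation described in this abstract can be sketched in a few lines of Python. The abstract only describes Hedges's d and the Q-statistic in words, so the explicit formulas below (pooled standard deviation, the small-sample correction 1 - 3/(4(Ne + Nc - 2) - 1), the large-sample variance of d, and inverse-variance weighting) are the standard textbook forms and are an assumption here, not something taken from the paper. All function names are hypothetical.

```python
import math


def hedges_d(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Standardized effect size with small-sample correction (assumed textbook form)."""
    df = n_e + n_c - 2
    # Pooled standard deviation of the experimental and control groups.
    pooled_sd = math.sqrt(((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / df)
    # Correction factor that shrinks d slightly when samples are small.
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return (mean_e - mean_c) / pooled_sd * correction


def variance_of_d(d, n_e, n_c):
    """Approximate sampling variance of d (assumed large-sample formula)."""
    return (n_e + n_c) / (n_e * n_c) + d ** 2 / (2.0 * (n_e + n_c))


def q_statistic(effects, variances):
    """Homogeneity statistic: weighted squared deviations from the weighted mean."""
    weights = [1.0 / v for v in variances]
    mean_d = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - mean_d) ** 2 for w, d in zip(weights, effects))
```

Under these assumptions, a large Q relative to a chi-square distribution with (number of studies - 1) degrees of freedom would indicate heterogeneity, which is the situation that led the authors to split the data set by productivity.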
In vertebrates, Fibroblast Growth Factors (FGFs) and their receptors are involved in various developmental and pathological processes, including neoplasia. The number of FGFs and their large range of activities have made the understanding of their precise functions difficult. Investigating their biology in other species might be enlightening. A sequence encoding a putative protein presenting 30–40% identity with the conserved core of vertebrate FGFs has been identified by the C. elegans ... consortium. We show here that this gene is transcribed and encodes a putative protein of 425 amino acids (aa). The gene is expressed at all stages of development beyond late embryogenesis, peaking at the larval stages. Loss-of-function mutants of the let-756 gene are rescued by the wild-type fgf gene in germline transformation experiments. Two partial loss-of-function alleles, s2613 and s2809, have a mutation that replaces aa 317 by a stop. The truncated protein retains the FGF core but lacks a C-terminal portion. These worms are small and develop slowly into clear and scrawny, yet viable and fertile adults. A third allele, s2887, is inactivated by an inversion that disrupts the first exon. It causes a developmental arrest early in the larval stages. Thus, in contrast to the other nematode fgf gene egl-17, let-756/fgf is essential for worm development. doi:10.1038/sj.onc.1203074 pmid:10597282 fatcat:tkmyxogccfhf5ekbwswq6suxje
Whole-transcriptome sequencing ('RNA-Seq') has been drastically changing the scale and scope of genomic research. In order to fully understand the power and limitations of this technology, the US Food and Drug Administration (FDA) launched the third phase of the MicroArray Quality Control (MAQC-III) project, also known as the SEquencing Quality Control (SEQC) project. Using two well-established human reference RNA samples from the first phase of the MAQC project, three sequencing platforms were tested across more than ten sites with built-in truths including spike-in of external RNA controls (ERCC), titration data and qPCR verification. The SEQC project generated over 30 billion sequence reads representing the largest RNA-Seq data ever generated by a single project on individual RNA samples. This extraordinarily ultradeep transcriptomic data set and the known truths built into the study design provide many opportunities for further research and development to advance the improvement and application of RNA-Seq.
doi:10.1038/sdata.2014.20 pmid:25977777 pmcid:PMC4322577 fatcat:enyeye5qkreptpw2xytxm5hkqi
The recent advancement of next-generation sequencing (NGS) has generated tremendous opportunities and challenges in the communities of biomedical research, public health genomics and personalized medicine. Among the versatile applications of NGS, whole-transcriptome sequencing ('RNA-Seq' or WTS) has enabled quantitative profiling with a large dynamic range (ref. 1). As demonstrated in many publications, RNA-Seq enables the discovery of new structural elements of genes such as exons, junctions, untranslated regions, and rare isoforms and thus has expanded our understanding of the transcriptome (refs 2-5). It provides increased sensitivity compared to the more mature microarray technology (ref. 6) and has opened new avenues of research in transcriptome work, such as the study of gene fusions and allele-specific expression, or the discovery of novel alternative transcripts, whereas the measurement noise of RNA-Seq was shown to be a direct consequence of the random sampling process (refs 7-9). While new platforms and protocols for RNA-Seq have emerged in recent years, the comparability of results across platforms and laboratories has not been extensively examined. With the widespread adoption of RNA-Seq in biomedical and clinical research, a comprehensive, cross-site and cross-platform analysis of the performance of RNA-Seq is essential.
Reproducibility across laboratories, in particular, is a crucial requirement for any new experimental method to be relevant for research and clinical applications, and this can only be tested in an extensive multi-site and multi-platform comparison. Just as in the first phase of the MicroArray Quality Control (MAQC-I) project (ref. 10), which tested multi-site and multi-platform agreement in gene-expression microarrays, the FDA again coordinated the third phase of MAQC (MAQC-III) as a large-scale community effort to assess the performance of RNA-Seq, testing different sequencing platforms and analysis pipelines. This project is also known as the SEquencing Quality Control (SEQC) project. A complementary effort that utilized the same samples, but different platforms (e.g., Life Technologies' Ion PGM and Ion Proton, and Pacific Biosciences' PacBio RS) and library protocols (polyA selection, ribosome depletion, size-selection, and RNA degradation) was coordinated with the Association of Biomolecular Resource Facilities (ABRF) Next Generation Sequencing Study (ABRF-NGS) (ref. 11). The objective assessment of technical performance such as accuracy and sensitivity is a great challenge since there is no independent 'gold standard'. In the SEQC study, such assessments were achieved in a controlled test setting, where truths built into the study design could be directly validated, and then related to the performance of other transcriptome profiling technologies. Specifically, we utilized the two well-characterized human reference RNA samples A (Universal Human Reference RNA) and B (Human Brain Reference RNA) from the MAQC consortium, which had been studied extensively with microarrays in MAQC-I (ref. 10). With spike-ins of synthetic RNA from the External RNA Control Consortium (ERCC) (ref. 12), samples A and B were then mixed to construct samples C and D in known mixing ratios, 3:1 and 1:3, respectively (Figures 1 and 2).
All samples were distributed to independent sites for RNA-Seq library construction and profiling by Illumina's HiSeq 2000 platform (7 sites) and Life Technologies' SOLiD 5500 platform (4 sites). In addition, vendors created their own cDNA libraries that were then distributed to each test site, in order to examine the degree of a 'site effect' that was independent of the library preparation process (Figure 1). As depicted in Figure 1, numbers 1-4 denote the 4 libraries prepared by the test sites themselves, while number 5 indicates the library created by the vendors. To support an assessment of gene models, samples A and B were also sequenced at three independent sites by the Roche 454 GS FLX platform, providing longer reads. For comparison to other technologies, data were also compared to the GeneChip Human Genome U133 Plus 2.0 microarrays used in MAQC-I, several current microarray platforms, and also assessed by 20,801 PrimePCR reactions (ref. 13) and a set of TaqMan assays from MAQC-I (ref. 14). These data create an overlapping framework of orthogonal validation for any expression measure, splice form, or gene structure question. Thus different sequencing platforms were tested using four well-characterized reference RNA sample mixtures with built-in truths to test accuracy, precision, reproducibility, sensitivity and specificity in a detailed analysis of over 30 billion reads on these reference samples (Table 1). The data presented here provide the deepest molecular characterization of any RNA samples published to date. Leveraging this ultradeep transcriptomic data set and the known truths built into the study design, in our related work (ref. 13), we provided an in-depth analysis of these data and found that RNA-Seq was highly reproducible across sites and platforms, particularly in differential gene-expression analysis. However, performance was clearly dependent on data treatment and analysis, and transcript-level profiling showed larger variation.
This indicates ample opportunities offered by this unique data set: algorithms and pipelines with better and more consistent performance may be developed for transcript assembly and quantification, gene expression quantification, and gene fusion detection. The presented data set can thus serve as a key resource in the development and validation of novel RNA-Seq data analysis algorithms to advance the maturity and performance of applications of RNA-Seq. In this Data Descriptor, we provide additional information aimed at helping others reuse these data within their own research, including more detailed methods descriptions.
Methods
RNA sample preparation
This description of RNA sample preparation is expanded from descriptions in the related research manuscript (ref. 13). The SEQC (MAQC-III) study design is based on the well-characterized MAQC-I RNA ...
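The "built-in truth" provided by samples C and D can be illustrated with a short Python sketch. The excerpt states only that A and B were mixed 3:1 (sample C) and 1:3 (sample D); the assumption made here is that expression mixes linearly on a common scale, which ignores any difference in mRNA content between the two samples. Gene names, values, and the function name are hypothetical.

```python
def expected_mixture(expr_a, expr_b, parts_a, parts_b):
    """Expected expression in a mixture of samples A and B.

    Assumes simple linear mixing on a common scale (an idealization, not
    the project's actual model).
    """
    frac_a = parts_a / (parts_a + parts_b)
    return {
        gene: frac_a * expr_a[gene] + (1.0 - frac_a) * expr_b[gene]
        for gene in expr_a
    }


# Hypothetical expression values for two genes in samples A and B.
sample_a = {"gene1": 100.0, "gene2": 0.0}
sample_b = {"gene1": 20.0, "gene2": 40.0}

sample_c = expected_mixture(sample_a, sample_b, 3, 1)  # C = 3 parts A : 1 part B
sample_d = expected_mixture(sample_a, sample_b, 1, 3)  # D = 1 part A : 3 parts B
```

A measurement pipeline can then be checked by comparing its observed C and D values with these expectations computed from its own A and B measurements.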
Computational Methods in Genome Research
RNA-Seq provides the capability to characterize the entire transcriptome at multiple levels, including gene expression, allele-specific expression, alternative splicing, fusion gene detection, etc. The US FDA-led SEQC (i.e., MAQC-III) project conducted a comprehensive study focused on the transcriptome profiling of rat liver samples treated with 27 chemicals to evaluate the utility of RNA-Seq in safety assessment and toxicity mechanism elucidation. The chemicals represented multiple ...mic modes of action (MOA) and exhibited varying degrees of transcriptional response. The paired-end 100 bp sequencing data were generated using Illumina HiScanSQ and/or HiSeq 2000. In addition to the core study, six animals (i.e., three aflatoxin B1 treated rats and three vehicle control rats) were sequenced three times, with two separate library preparations on two sequencing machines. This large toxicogenomics dataset can serve as a resource to characterize various aspects of transcriptomic changes (e.g., alternative splicing) that are a byproduct of chemical perturbation.
Design Type(s): replicate design • compound treatment design • transcription profiling design • parallel group design
Measurement Type(s): transcription profiling assay
Technology Type(s): RNA sequencing
Factor Type(s): technology type • technical replicate • compound • biological replicate
Sample Characteristic(s): Rattus norvegicus • liver
doi:10.1038/sdata.2014.21 pmid:25977778 pmcid:PMC4322565 fatcat:kv332uxmizhzzbitldoclxathe
Transcriptome sequencing using next-generation sequencing platforms will soon be competing with DNA microarray technologies for global gene expression analysis. As a preliminary evaluation of these promising technologies, we performed deep sequencing of cDNA synthesized from the Microarray Quality Control (MAQC) reference RNA samples using Roche's 454 Genome Sequencer FLX. Results: We generated more than 3.6 million sequence reads of average length 250 bp for the MAQC A and B samples and ...ced a data analysis pipeline for translating cDNA read counts into gene expression levels. Using BLAST, 90% of the reads mapped to the human genome and 64% of the reads mapped to the RefSeq database of well annotated genes with e-values ≤ 10^-20. We measured gene expression levels in the A and B samples by counting the numbers of reads that mapped to individual RefSeq genes in multiple sequencing runs to evaluate the MAQC quality metrics for reproducibility, sensitivity, specificity, and accuracy and compared the results with DNA microarrays and Quantitative RT-PCR (QRTPCR) from the MAQC studies. In addition, 88% of the reads were successfully aligned directly to the human genome using the AceView alignment programs with an average 90% sequence similarity to identify 137,899 unique exon junctions, including 22,193 new exon junctions not yet contained in the RefSeq database. Conclusion: Using the MAQC metrics for evaluating the performance of gene expression platforms, the ExpressSeq results for gene expression levels showed excellent reproducibility, sensitivity, and specificity that improved systematically with increasing shotgun sequencing depth, and quantitative accuracy that was comparable to DNA microarrays and QRTPCR. In addition, a careful mapping of the reads to the genome using the AceView alignment programs shed new light on the complexity of the human transcriptome including the discovery of thousands of new splice variants. doi:10.1186/1471-2164-10-264 pmid:19523228 pmcid:PMC2707382 fatcat:jogtftdbvfdxfachaql5tgeh3a
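The idea of turning read counts into expression levels, as described in this abstract, can be sketched as follows. The abstract says only that reads mapped to individual RefSeq genes were counted; scaling the counts to reads per million mapped reads is an assumption added here so that runs of different depth can be compared, and it is not claimed to be the authors' pipeline. The function names and gene labels are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(read_assignments):
    """Count reads per gene from a list holding one gene label per mapped read."""
    return Counter(read_assignments)


def counts_per_million(counts):
    """Scale raw counts to reads per million mapped reads (assumed normalization)."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical assignments: each entry is the gene that one read mapped to.
assignments = ["geneA", "geneA", "geneB", "geneA"]
raw_counts = count_reads_per_gene(assignments)
expression = counts_per_million(raw_counts)
```

With a normalization like this, adding sequencing depth changes the precision of each estimate rather than its scale, which is consistent with the abstract's remark that the metrics improved with depth.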
Thierry-Mieg, unpubl.), FAKII (Larson et al. 1996; Myers 1996), and so forth, can derive all of the information it needs from the CAF file without reading any other data, except for trace information ... doi:10.1101/gr.8.3.260 pmid:9521929 pmcid:PMC310697 fatcat:73epqpusc5dj5ffti7gp23oape
Lecture Notes in Computer Science
Symmetry-based approaches are known to attack the state space explosion problem encountered during the analysis of distributed systems. Separately, BDD-like encodings enable the management of huge data sets. In this paper, we show how to benefit from both approaches automatically. A quotient set is built from a coloured Petri net description modeling the system. The reachability set is managed under some explicit symbolic operations. Data representations are managed based on a recently introduced data structure, called Data Decision Diagrams, that allows flexible definition of application-specific operators. The performance of our prototype is reported in the paper. doi:10.1007/978-3-540-30232-2_18 fatcat:qcl4wyzyjbbmfggkl7dm64trx4
Lecture Notes in Computer Science
Symbolic model-checking using binary decision diagrams (BDD) makes it possible to represent very large state spaces. BDD give good results for synchronous systems, particularly for circuits that are well adapted to a binary encoding of a state. However, both the operation definition mechanism (using more BDD) and the state representation (purely linear traversal from root to leaves) show their limits when trying to tackle globally asynchronous and typed specifications. Data Decision Diagrams (DDD) are a directed acyclic graph structure that manipulates (a priori unbounded) integer domain variables, and which offers a flexible and compositional definition of operations through inductive homomorphisms. We first introduce a new transitive closure unary operator for homomorphisms, which heavily reduces the intermediate peak size effect common to symbolic approaches. We then extend the DDD definition to introduce hierarchy in the data structure. We define Set Decision Diagrams (SDD), in which a variable's domain is a set of values. Concretely, this means the arcs of an SDD may be labeled with an SDD (or a DDD), introducing the possibility of arbitrary depth nesting in the data structure. We show how this data structure and operation framework is particularly adapted to the computation and representation of structured state-spaces, and thus shows good potential for symbolic model-checking of software systems, a problem that is difficult for plain BDD representations. doi:10.1007/11562436_32 fatcat:ze6tucdczvff3g3p7rrfchkzna
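To make the data structure in this abstract more concrete, here is a toy Python sketch of a decision diagram whose arcs carry integer values, with node sharing through a unique table and a set-union operation. It is an illustration only: it does not implement inductive homomorphisms, the transitive closure operator, or the hierarchical SDD arcs, and every name in it is hypothetical rather than taken from the authors' library. It also assumes that all states list the same variables in the same order.

```python
ONE = "ONE"    # terminal: accepts the empty remainder of a state
ZERO = "ZERO"  # terminal: the empty set of states
_unique = {}   # unique table so that equal nodes are shared


def node(var, arcs):
    """Return the canonical node for variable `var` with arcs {value: child}."""
    arcs = {val: child for val, child in arcs.items() if child != ZERO}
    if not arcs:
        return ZERO
    key = (var, frozenset(arcs.items()))
    return _unique.setdefault(key, key)


def from_state(assignment):
    """Build the diagram holding one state, given as [(var, value), ...]."""
    diagram = ONE
    for var, val in reversed(assignment):
        diagram = node(var, {val: diagram})
    return diagram


def union(a, b):
    """Set union of two diagrams over the same variable order."""
    if a == ZERO:
        return b
    if b == ZERO or a == b:
        return a
    if a == ONE or b == ONE or a[0] != b[0]:
        raise ValueError("diagrams do not share the same variable order")
    merged = dict(a[1])
    for val, child in b[1]:
        merged[val] = union(merged.get(val, ZERO), child)
    return node(a[0], merged)


def count(diagram):
    """Number of states encoded by the diagram."""
    if diagram == ZERO:
        return 0
    if diagram == ONE:
        return 1
    return sum(count(child) for _, child in diagram[1])


states = union(from_state([("x", 1), ("y", 2)]), from_state([("x", 1), ("y", 3)]))
states = union(states, from_state([("x", 2), ("y", 2)]))
```

In this sketch, the two states that share the prefix x = 1 are stored under a single arc, and identical sub-diagrams are represented by one shared object, which is the kind of compaction that makes decision diagrams attractive for large state spaces.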
, Thierry-Mieg, Mattes, Ning and Shi. ... Nucleic Acids Res. 45, D158–D169. doi: 10.1093/nar/gkw1099. Thierry-Mieg, D., and Thierry-Mieg, J. (2006). ... doi:10.3389/fcell.2019.00299 pmid:31824949 pmcid:PMC6881247 fatcat:6olqdexac5anpfmonmkyrcdts4