A 1 cycle-per-byte XML parsing accelerator

Zefu Dai, Nick Ni, Jianwen Zhu
2010 Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays - FPGA '10  
We demonstrate our design on a Xilinx Virtex-5 board, which successfully saturates a 1 Gbps Ethernet link.  ...  However, the task of XML parsing is often the bottleneck, and as a result, the target of acceleration using custom hardware or multicore CPUs.  ...  developed an open-source non-validating XML parser Parabix (parallel bit streams for XML) which exploits the SIMD capabilities of modern-day commodity processors to process multiple characters at the same  ... 
doi:10.1145/1723112.1723148 dblp:conf/fpga/DaiNZ10 fatcat:f4kgjzc4bvd3vjspjbhmbb67bi

Real-time 3D computed tomographic reconstruction using commodity graphics hardware

Fang Xu, Klaus Mueller
2007 Physics in Medicine and Biology  
We present a solution based on commodity graphics hardware (GPUs) to provide these capabilities.  ...  Many of these applications require interactive 3D image generation, which cannot be satisfied with inexpensive PC-based solutions using the CPU.  ...  Our discussions so far were focused on updating one volume slice with one projection. This represents one rendering cycle (pass) on the GPU.  ... 
doi:10.1088/0031-9155/52/12/006 pmid:17664551 fatcat:yg6kyhmvivcm3o42g2tbfelqjy

Heterogeneous computing architecture for fast detection of SNP-SNP interactions

Davor Sluga, Tomaz Curk, Blaz Zupan, Uros Lotric
2014 BMC Bioinformatics  
Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation.  ...  We report on differences between these two modern massively parallel architectures and their software environments.  ...  Nvidia K20 clearly outperforms every other configuration in terms of speed and is the perfect choice when one wants to cut on the execution times as much as possible.  ... 
doi:10.1186/1471-2105-15-216 pmid:24964802 pmcid:PMC4230497 fatcat:ebtiaiqzgfgchkbighft7eld5i

IP routing processing with graphic processors

Shuai Mu, Xinya Zhang, Nairen Zhang, Jiaxin Lu, Yangdong Steve Deng, Shu Zhang
2010 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)  
Modern GPUs are offering significant computing power, and its data-parallel computing model well matches the typical patterns of packet processing on routers.  ...  For the deep packet inspection application, we implemented both a Bloom-filter based string matching algorithm and a finite automata based regular expression matching algorithm.  ...  Every SM is equipped with a 16KB shared memory, which could provide up to 16 4-byte words of data in one clock cycle.  ... 
doi:10.1109/date.2010.5457229 dblp:conf/date/MuZZLDZ10 fatcat:lptjskmrzbddxirs4ighke3oqe

Development of efficient computational kernels and linear algebra routines for out-of-order superscalar processors

O. Bessonov, D. Fougère, B. Roux
2005 Future generations computer systems  
Approaches for implementing matrix multiplication algorithms are suggested for hierarchical memory computers. Block versions of matrix multiplication and LU decomposition algorithms are considered.  ...  Performance of the new algorithm will be even higher on the new AMD64 CPUs (Opteron and Athlon64).  ...  In the previous paper [2] we proposed a new approach based on multiplication of a block-vector by matrix, as opposed to vector-matrix (BLAS 2) and matrix-matrix (BLAS 3) ones.  ... 
doi:10.1016/j.future.2004.05.016 fatcat:rtfuljawdvh3pkow5a4xl5xnn4

Designing fast architecture-sensitive tree search on modern multicore/many-core processors

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey
2011 ACM Transactions on Database Systems  
Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units.  ...  FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second for large trees  ...  Figure 8 shows the normalized search time, measured in cycles per query on CPUs and GPUs by applying optimization techniques one by one.  ... 
doi:10.1145/2043652.2043655 fatcat:aznq3gvf45g75goaxjno2bnj5u

Real-time energy/mass transfer mapping for online 4D dose reconstruction

Peter Ziegenhein, Cornelis Ph. Kamerling, Martin F. Fast, Uwe Oelfke
2018 Scientific Reports  
unit (CPU).  ...  Adding parallelisation decreased the runtime to about 50 ms while adding vectorisation satisfied our real-time constraint by further reducing the dose accumulation time to 15 ms without compromising on  ...  Especially for low latency and real-time applications using CPUs can have many advantages compared to GPUs. The performance of the EMT algorithm as presented in this work is memory-bound.  ... 
doi:10.1038/s41598-018-21966-x pmid:29483618 pmcid:PMC5827544 fatcat:vmre2hyigbcvvcgabeyvogym5u

Many-core CPUs can deliver scalable performance to stochastic simulations of large-scale biochemical reaction networks

Elias Kouskoumvekakis, Dimitrios Soudris, Elias S. Manolakos
2015 2015 International Conference on High Performance Computing & Simulation (HPCS)  
Method exact stochastic simulation algorithm.  ...  It is evaluated using Intel's experimental many-cores Single-chip Cloud Computer (SCC) CPU and the latest generation consumer grade Core i7 multi-core Intel CPU, when running Gillespie's First Reaction  ...  In this algorithm, a putative next reaction time $\tau_j$ is calculated for every reaction channel $R_j$.  ... 
doi:10.1109/hpcsim.2015.7237084 dblp:conf/ieeehpcs/Kouskoumvekakis15 fatcat:4kr474ygizfuhps6qk3cyvacx4

Fast computation of database operations using graphics processors

Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, Dinesh Manocha
2005 ACM SIGGRAPH 2005 Courses on - SIGGRAPH '05  
We present new algorithms for performing fast computation of several common database operations on commodity graphics processors.  ...  We have compared their performance with an optimized implementation of CPU-based algorithms.  ...  on the modern CPUs.  ... 
doi:10.1145/1198555.1198787 dblp:conf/siggraph/GovindarajuL0LM05 fatcat:qq7tkv54enezxilqm7ndj6amni

FEAST-realization of hardware-oriented numerics for HPC simulations with finite elements

Stefan Turek, Dominik Göddeke, Christian Becker, Sven H. M. Buijssen, Hilmar Wobker
2010 Concurrency and Computation  
In this paper, we describe this concept and the modular design which enables applications built on top of FEAST to execute efficiently, without any code modifications, on commodity based clusters, the  ...  numerics', a holistic approach aiming at optimal performance for modern numerics.  ...  ACKNOWLEDGEMENTS Parts of this work are based on a joint collaboration with Robert Strzodka (Max Planck Center), and Patrick McCormick and Jamaludin Mohd-Yusof (Los Alamos National Laboratory).  ... 
doi:10.1002/cpe.1584 fatcat:6omy4woanrbn7dqigs3q4ql5hq

Fast computation of database operations using graphics processors

Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, Dinesh Manocha
2004 Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04  
We present new algorithms for performing fast computation of several common database operations on commodity graphics processors.  ...  We have compared their performance with an optimized implementation of CPU-based algorithms.  ...  Branch mispredictions can be extremely expensive on the modern CPUs. Modern CPUs use specialized schemes for predicting the outcome of the branch instruction.  ... 
doi:10.1145/1007568.1007594 dblp:conf/sigmod/GovindarajuLWLM04 fatcat:r4m4oe7f6vf6rdgiac7fxncj4e

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks [article]

Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
2018 arXiv   pre-print
Neural Cache improves inference throughput by 12.4x over CPU (2.2x over GPU), while reducing power consumption by 50% over CPU (53% over GPU).  ...  Our experimental results show that the proposed architecture can improve inference latency by 18.3x over state-of-art multi-core CPU (Xeon E5), 7.7x over server class GPU (Titan Xp), for Inception v3 model  ...  Including the initialization steps, it takes $n^2 + 5n - 2$ cycles to finish an $n$-bit multiplication. Division can be supported using a similar algorithm and takes $1.5n^2 + 5.5n$ cycles. D.  ... 
arXiv:1805.03718v1 fatcat:d72fse5przg43h5ojhqydsl64i

DSP.Ear: Leveraging Co-Processor Support for Continuous Audio Sensing on Smartphones [article]

Petko Georgiev, Nicholas D. Lane, Kiran K. Rachuri, Cecilia Mascolo
2014 arXiv   pre-print
This is achieved through a series of pipeline optimizations that allow the computation to remain largely on the DSP.  ...  Through detailed evaluation of our prototype implementation we show that, by exploiting a smartphone's co-processor, DSP.Ear achieves a 3 to 7 times increase in the battery lifetime compared to a solution  ...  The measurements account for the worst case when the CPU needs to be woken up on every occasion any of the services needs an update.  ... 
arXiv:1409.3206v1 fatcat:l5o4jm7gxbbdtidyqiqdh373jm

Parallelism via Multithreaded and Multicore CPUs

A.C. Sodan, J. Machina, A. Deshmeh, K. Macnaughton, B. Esbaugh
2010 Computer  
Moore's Law, which projects that the density of circuits on chip will double every eighteen months, still applies and is providing hardware designers with the ability to add more complexity to a chip.  ...  The additional capacity was used in the past for development of superscalar CPUs with replicated execution units and deep pipelines to exploit instruction-level parallelism.  ...  Acknowledgments We thank (alphabetically) Tracy Carver of AMD, Jaime Moreno of IBM, Denis Sheahan of Sun, and Xinmin Tian of Intel for their helpful feedback and for validation of our CPU/GPU data.  ... 
doi:10.1109/mc.2010.75 fatcat:z34ptnd3rbgdvf7md5dmmqinfm

Multiple Pattern Matching for Network Security Applications: Acceleration through Vectorization

Charalampos Stylianopoulos, Magnus Almgren, Olaf Landsiedel, Marina Papatriantafilou
2017 2017 46th International Conference on Parallel Processing (ICPP)  
We first identify properties of pattern matching that make it fit for vectorization and show how to use them in the algorithmic design.  ...  Pattern matching is a key building block of Intrusion Detection Systems and firewalls, which are deployed nowadays on commodity systems from laptops to massive web servers in the cloud.  ...  In this work we utilize vector pipelines that are already part of modern commodity architectures.  ... 
doi:10.1109/icpp.2017.56 dblp:conf/icpp/StylianopoulosA17 fatcat:4o7ggldnanb7rovvxlmehmzbny