GPU-enabled efficient executions of radiation calculations in climate modeling
2013
20th Annual International Conference on High Performance Computing
By exploiting the parallelism offered by the GPU and using asynchronous executions on the CPU and GPU, we obtain a speed-up of about 26× for the routine radabs and about 5.6× for routine radcswmx. ...
This work attempts to extract the performance of GPUs to enable faster execution of CESM and obtain better model throughput. ...
CONCLUSIONS AND FUTURE WORK The use of GPUs to accelerate applications that are embarrassingly parallel leads to significant performance improvements. ...
doi:10.1109/hipc.2013.6799141
dblp:conf/hipc/KorwarVN13
fatcat:jrnyadx3wraazl6rzqnhtfue7m
A Parallel Hybrid Intelligence Algorithm for Solving Conditional Nonlinear Optimal Perturbation to Identify Optimal Precursors of North Atlantic Oscillation
2019
Nonlinear Processes in Geophysics Discussions
It has a profound influence on the strength of westerly winds as well as the storm tracks in North Atlantic, thus affecting winter climate in Northern Hemisphere. ...
In this paper, conditional nonlinear optimal perturbation (CNOP), which has been widely used in research on the optimal precursor (OPR) of climatic event, is adopted to investigate which kind of initial ...
Since GPU is suitable for parallel computing on a large scale, it can significantly improve the execution performance of climate models. ...
doi:10.5194/npg-2019-25
fatcat:odynt2v5ujbxfgl3e74lqdsdme
Optimizing high-resolution Community Earth System Model on a heterogeneous many-core supercomputing platform
2020
Geoscientific Model Development
The refactoring and optimizing efforts have improved the simulation speed of CESM-HR from 1 SYPD (simulation years per day) to 3.4 SYPD (with output disabled) and supported several hundred years of pre-industrial ...
(CPUs) and six graphics processing units (GPUs) inside one node. ...
All authors contributed to the improvement of ideas, software testing, experimental evaluation, and paper writing and proofreading. Acknowledgements. ...
doi:10.5194/gmd-13-4809-2020
fatcat:i7pxvfrgk5dhljfg5azi3epj6e
Deep and Shallow convections in Atmosphere Models on Intel Xeon Phi Coprocessor Systems
[article]
2017
arXiv
pre-print
We also employ proportional partitioning of independent column computations across both the CPU and coprocessor cores based on the performance ratio of the computations on the heterogeneous resources. ...
These calculations also present significant load imbalances due to varying cloud covers over different regions of the grid. ...
ACKNOWLEDGMENTS This project is supported by the Intel® Parallel Computing Centre for Modelling Monsoons and Tropical Climate (IPCC-MMTC), India sponsored by the Intel® Corporation. ...
arXiv:1711.00289v1
fatcat:bzxyru5cmzc6vpnlupxxxvwtee
NICEST2 - D4.5: First report on the identified bottlenecks for an efficient usage of Nordic ESMs on EuroHPC
2021
Zenodo
Report on the bottlenecks that would hinder the efficient usage of Nordic ESMs on EuroHPC and possible remediation actions (I/Os, adding GPU support, etc.) with clear information on costs in terms of manpower ...
For instance, being able to organize meetings/hackathons (online or face to face) with both experts from NorESM and EC-EARTH has been highlighted as an important requirement by those involved in the GPU ...
the performance of the code on GPUs. ...
doi:10.5281/zenodo.4749515
fatcat:hsvwkv2ayfhztaz5qistbdolm4
A Feasibility Study on Porting the Community Land Model onto Accelerators Using Openacc
2014
International Journal of Advanced Computer Science and Applications
Even though it is a non-intensive kernel, on a single 16-core computing node, the performance (based on the actual computation time using one GPU) of the OpenACC implementation is 2.3 times faster than that of OpenMP ...
of data parallelization and the benefit of data movement provided by current implementation of OpenACC. ...
break all the loop into parallel computation on GPU cores, and copy back these datastreams. ...
doi:10.14569/ijacsa.2014.051203
fatcat:as4dqw3pmbg4zjhekpxfbixchm
Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs
[article]
2021
arXiv
pre-print
architectures. (3) We carefully optimize each of cuSZ+ kernels by leveraging state-of-the-art CUDA parallel primitives. (4) We evaluate cuSZ+ using seven real-world HPC application datasets on V100 and ...
Experiments show cuSZ+ improves the compression throughputs and ratios by up to 18.4X and 5.3X, respectively, over cuSZ on the tested datasets. ...
The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. ...
arXiv:2105.12912v3
fatcat:a5ayaxi4brdpvjzjqofqkmdfpu
High-Performance Distributed Multi-Model / Multi-Kernel Simulations: A Case-Study in Jungle Computing
[article]
2012
arXiv
pre-print
We make use of the software developed in the Ibis project, which addresses many of the problems faced when running applications on Jungle Computing Systems. ...
One striking example of applications that can benefit greatly of Jungle Computing Systems are Multi-Model / Multi-Kernel simulations. ...
The authors wish to kindly thank Vianney Govers and Kees Verstoep for providing support for the LGM and DAS-4, and setting up the LGM/DAS network connection. ...
arXiv:1203.0321v1
fatcat:bxi74ww3mvhtjkepflywwm7mhu
CEAZ: Accelerating Parallel I/O Via Hardware-Algorithm Co-Designed Adaptive Lossy Compression
[article]
2021
arXiv
pre-print
As parallel computers continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. ...
To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve the I/O performance. ...
We evaluate the impact of update frequency on the final compression ratio. We perform the experiments on both CESM and HACC. ...
arXiv:2106.13306v2
fatcat:42fvquu3trcgxncxwdl5izksra
cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data
[article]
2020
arXiv
pre-print
It also improves the compression ratio by up to 3.48x on the tested data compared with another state-of-the-art GPU supported lossy compressor. ...
. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth ...
ACKNOWLEDGMENTS This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations -the Office of Science and the National ...
arXiv:2007.09625v1
fatcat:f4sq3abcmvehvm4b6wishmudie
SIMD Lossy Compression for Scientific Data
[article]
2022
arXiv
pre-print
However, some HPC systems and applications do not use GPUs, and could still benefit from the fine-grained parallelism of this method. ...
Recent work proposes a parallel dual prediction/quantization algorithm for GPUs which removes these dependencies. ...
In addition, SZ takes advantage of on-node parallelization such as OpenMP, GPUs [12].
C. SIMD Many operations in HPC applications such as linear algebra exhibit large amounts of data parallelism. ...
arXiv:2201.04614v1
fatcat:sbv7n4lnjbhj3kzq7fg5fr5mby
Runtime Tracing of the Community Earth System Model: Feasibility Study and Benefits
2012
Procedia Computer Science
The Community Earth System Model (CESM) is one of US's leading earth system modeling frameworks, which has decades of development history and was embraced by a large, active user community. ...
Then we present an offline global community land model simulation within the CESM framework to demonstrate the procedure of runtime tracing of CESM using the Vampir toolset. ...
Acknowledgments The authors thank the Vampir Team of the Center for Information Services and High Performance Computing ...
doi:10.1016/j.procs.2012.04.213
fatcat:tdbgofrg5vgovade5evwfghwzy
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs
[article]
2022
arXiv
pre-print
Specifically, we take full advantage of CUDA GPU architectures by using shared memory on decoding/writing phases, online tuning the amount of shared memory to use, improving memory access patterns, and ...
In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. ...
The material was supported by the U.S. Department of Energy, Office of Science and Office of Advanced Scientific Computing Research (ASCR), under contract DE-AC02-06CH11357. ...
arXiv:2201.09118v2
fatcat:4jzndm2n4ngirjvogrrzsu4exu
An automatic performance model-based scheduling tool for coupled climate system models
2018
Journal of Parallel and Distributed Computing
As a petascale supercomputer, Tianhe-1A features a massively parallel processor (MPP) architecture of hybrid CPU-GPU computing. A proprietary ... nodes for a total of 192 cores. ...
We test the performance improvement of B1850 f19 g16 on Tianhe-1A and HP cluster. We profile the case B1850 f19 g16 at two parallelisms (12 and 96 processes) to determine the model parameters. ...
doi:10.1016/j.jpdc.2018.01.002
fatcat:lasxgh5a5jbnvak3gmmcnnbube
GPU-RRTMG_SW: Accelerating a Shortwave Radiative Transfer Scheme on GPU
2021
IEEE Access
Parallel computing technology based on graphics processing units (GPUs) has the characteristics of high parallelism, multi-threaded and multi-core processors, and high memory bandwidth. ...
With this technology, transferring data between a CPU and GPU is sped up by approximately 86%. When the total performance of CC-RRTMG_SW on one K20 GPU is analyzed, there is a 44% improvement. ...
doi:10.1109/access.2021.3087507
fatcat:kwrg4nmbxvhc3lqfhynrjw7wlm