Computer Science Research Institute 2001 Annual Report of Activities

David E. Womble, Barbara J. Delap, Deanna R. Ceballos
2002
2.1.1 Design and Optimization: As the ability to run "forward" simulations increases, the ability to solve "inverse" problems must be developed as well, e.g., parameter identification and system design, in addition to the traditional inverse problems of applied mathematics. Optimization tends to be very application-specific, although some toolkits have been developed that can be applied generally. Current research efforts include work on large-scale optimization, global optimization, and discrete optimization.
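To make the forward/inverse distinction concrete, here is a minimal sketch (not taken from the report; the exponential model and the SciPy routine are illustrative assumptions): the forward model maps parameters to predicted observations, and the inverse problem recovers the parameters from noisy data by nonlinear least squares.

    import numpy as np
    from scipy.optimize import least_squares

    def forward(params, t):
        # hypothetical forward model: exponential decay with amplitude a and rate b
        a, b = params
        return a * np.exp(-b * t)

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 5.0, 50)
    data = forward([2.0, 0.7], t) + 0.02 * rng.standard_normal(t.size)  # synthetic observations

    # Inverse problem: choose the parameters so the forward simulation matches the data.
    fit = least_squares(lambda p: forward(p, t) - data, x0=[1.0, 1.0])
    print("recovered parameters:", fit.x)   # close to the values used to generate the data

Large-scale design, control, and parameter-identification problems have the same structure, except that each residual evaluation requires a full engineering simulation.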
2.1.2 Linear Solvers: Linear solvers are at the heart of many engineering simulations. Many algorithms are available; however, significant challenges remain, including the development of scalable preconditioners and of preconditioners designed for the specific needs of various applications. Much attention is currently focused on "multiscale" methods and preconditioners as the best hope for truly scalable solvers, but a great deal of work remains to be done.

2.1.3 Nonlinear Solvers: Nonlinear solvers often depend on repeated linear solves, but they raise additional research questions. For example, 3-D high-fidelity simulations will require the solution of systems with hundreds of millions of variables; present technology is expected to reach tens of millions of variables within the next year, falling far short of the ultimate requirement. Newton methods, and their use in conjunction with preconditioned Krylov methods for specific problems, are of particular interest.

2.1.4 Eigensolvers: Many scientific and engineering problems require the eigenvalues and eigenvectors of extremely large matrices. Examples of particular interest include modal analysis for structural dynamics, minimum-energy eigenfunction calculations in quantum chemistry models, and detecting the onset of turbulence in fluid flow. A common feature of these eigenvalue problems is that the number of eigenvalues required is small relative to the size of the matrices, the matrix systems are often very sparse, and only the action of the matrix on a vector (or a few vectors) is available. Standard techniques that factor the matrix directly (including sparse direct methods) are often impractical for these problems because of excessive memory and computational requirements. Algorithmic work is needed on scalable eigensolvers, reduced-accuracy algorithms, parallel implementations, and application-focused algorithms (an illustrative sketch of the matrix-free setting appears after Section 2.1.7 below).

2.1.5 Algorithms for Differential and Integral Equations: Differential or integral equations lie at the heart of most engineering simulations. A mathematical analysis of these equations can often reduce the amount of computing needed by simplifying or improving models, choosing better algorithms, or designing better computational experiments. Research topics of interest include coupling or de-coupling of scales, subgrid modeling, asymptotics, bifurcation, and stability analysis.

2.1.6 Complex Phenomena: This is a very large area, but general goals include identifying and quantifying the effects of uncertainty, developing a predictive capability for complex systems and processes based on computational "experiments," and developing algorithms that reduce fundamental computational complexity. Topics of interest include stochastic finite elements, sensitivity analysis, experimental design, stability analysis, summability methods, and general methods for handling multiscale (time and space) phenomena.

2.1.7 Adaptivity: The purpose of the adaptivity area is to develop methodologies and algorithms for finite element error estimation and adaptive computing, with the general goal of reducing the cost of computing by increasing mesh resolution only where it is needed. Finite element error estimation addresses the discretization error of the finite element solution for some (local) quantity of interest. The goal is to obtain tight bounds or estimates of the error in a way that is relatively cheap to compute (compared to the cost of solving the original problem).
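As a concrete illustration of the eigensolver setting described in Section 2.1.4 (a sketch only; the synthetic diagonal operator and SciPy's Lanczos-based eigsh routine are illustrative assumptions, not the solvers under development): a few extreme eigenpairs of a large symmetric operator are computed using nothing but the action of the operator on a vector.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, eigsh

    n = 500_000
    d = np.linspace(1.0, 2.0, n)                       # bulk of the (synthetic) spectrum
    d[:6] = [100.0, 90.0, 80.0, 70.0, 60.0, 50.0]      # a few dominant eigenvalues

    def matvec(x):
        # action of the operator on a vector; no matrix is ever formed or factored
        return d * np.ravel(x)

    A = LinearOperator((n, n), matvec=matvec, dtype=np.float64)
    vals, vecs = eigsh(A, k=6, which="LA")   # Lanczos iteration driven by matrix-vector products only
    print(np.sort(vals)[::-1])               # approximately [100, 90, 80, 70, 60, 50]

The same matrix-free pattern carries over to the sparse application matrices described above, where each matrix-vector product is an application-supplied kernel.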
2.2 Enabling Technologies Focus Area

2.2.1 Meshing: Meshing is a time-consuming and difficult part of any engineering simulation, yet the quality of the simulation is highly dependent on the quality of the mesh. Of particular interest are hexahedral meshes and high-quality hex-tet meshes. Research questions here include mesh connectivity, mesh optimization, and mesh refinement. Fully automatic methods and the ability to mesh large, complex geometries are of particular interest. The general issue of a robust parallel meshing toolkit remains a high-priority goal of the high-performance computing (HPC) programs at the laboratories.

2.3 System Software Focus Area

2.3.1 Operating Systems: The operating system is a critical component in the effective and efficient use of massively parallel processing (MPP) computers. Current research topics include the use of commodity operating systems (primarily Linux), with modifications and extensions, for MPP computers and for distributed, cluster-based, virtual MPP computers. As in other areas, a key focus is scalability. Projects include adding simple memory and process management to Linux to improve performance while preserving Linux's portability and expandability, improving communication and connectivity, and improving fault tolerance. The efficient use of SMP nodes within the MPP computing environment is also being considered; this includes the development and implementation of efficient thread and virtual node capabilities and the efficient utilization of resources that cannot be partitioned, such as the network interface.

2.3.2 Environments: An effective environment must address several issues. First, it must provide a fast and "user friendly" environment that allows designers to easily access all of the modeling tools, the data comprehension tools, the problem setup tools, and the resources required. Second, it must provide a robust and efficient environment for developers to prototype new methods, algorithms, and physics without redoing major portions of the existing codes. Examples exist of application problem-solving environments aimed at designers, but these are all "one-of-a-kind" products developed for a specific physics code. Examples also exist of component interfaces that allow specific methods to be rapidly prototyped, but again these are not general-purpose, nor are they in common use. Finally, new software tools are needed to model and predict the performance of codes and algorithms on MPP computers. The development of tools that combine object-based, Web-centric, client-server technology with high-performance parallel server technology, made available on demand, will also be pursued.

2.3.3 I/O: Large-scale, simulation-based analysis requires efficient transfer of data among simulation, visualization, and data management applications. Current efforts seek to improve the I/O performance of parallel codes by facilitating I/O operations from multiple nodes in parallel through highly portable user-level programming interfaces. This work will involve the design, implementation, and testing of a portable parallel file system. Ideally, the parallel file system should include a server side, which may require a particular hardware configuration, and a client side appropriate for use on any ASCI platform. This is not a replacement for MPI-IO: just as the MPI data movement standard relies on an underlying message-passing or remote memory access protocol, the MPI-IO standard relies on an underlying file system.
The goal is to produce at least a prototype of such a system and, if possible, a product that is appropriate for any future (or current) machine.

2.3.4 Heterogeneous and Distributed Systems: Parallel computers based on heterogeneous clusters of commodity workstations are starting to appear and will become common. Yet the effective use of these machines presents many research problems. For example, resources such as processors must be scheduled and managed, systems must be fault-tolerant, operating systems must be compatible, communication protocols must be established, environments must be developed, and the integrated system must be latency-tolerant. The distinguishing feature of work in this area will be scalability to terascale and larger distributed systems.

2.3.5 Architecture: Our basic architecture is influenced by the highly successful ASCI Red; Cplant™ follows this architecture in spirit if not in detail. This project will consider new architectures that will scale to 100 TF, petaflops, and beyond. Among other needs is research into interconnect technologies (hardware and software). In addition, for many current and future supercomputing applications, the enormity of the data handled during processing or post-processing for visualization is a major consideration; this project will consider how such requirements should affect the architecture of future machines.

3. Research opportunities at the CSRI

The CSRI presents many opportunities for collaboration between university researchers and laboratory scientists in the areas of computer science, computational science, and mathematics. These include the following.

3.1 Collaborative research projects. The CSRI accepts proposals for collaborative research projects lasting from one to three years. Projects must have a principal investigator and a Sandia collaborator. Projects should address one of the technical areas listed above, and the work must be performed on-site at Sandia. Proposals may be submitted to the CSRI director at any time and must be approved by the CSRI executive board.

3.2 Postdoctoral appointments. The CSRI offers several postdoctoral positions each year. Postdoctoral positions are for one year and are renewable for one additional year. Applications should include a statement of research interests, a resume, and a list of references.

3.3 Summer faculty positions and long-term research visits. Faculty are invited to consider the CSRI for summer employment or for extended visits. Salaries are generally commensurate with academic-year salaries. Proposals to hire research groups, including both faculty and graduate students, for the summer are also encouraged.

3.4 Faculty sabbaticals. Faculty may spend all or part of a sabbatical year at the CSRI. Proposals for sabbatical visits are accepted at any time; the salary offered depends on the faculty member's normal academic-year salary and sabbatical salary.

Project Summary: This project is concerned with applying our Automated Multi-Level Substructuring (AMLS) technique to the solution of very large eigenvalue problems and, in some cases, directly to applications for which partial eigensolutions are typically sought, such as frequency response analysis of structural finite element models having tens of millions of degrees of freedom or more. In AMLS, a finite element model is divided into thousands of substructures in a tree topology using nested dissection. The finite element representation is transformed to one in terms of substructure eigenvectors whose eigenvalues are below a given "cutoff" value.
The number of unknowns is reduced by orders of magnitude by truncating at the substructure level. Then either the frequency response or a partial eigensolution of the original model is approximated using the substructure eigenvector subspace. AMLS has been tested on dozens of production automobile body models having millions of degrees of freedom. It delivers ample engineering accuracy for the frequency response over broad frequency ranges, or for the partial eigensolution typically used to approximate the frequency response. This is accomplished in less time on a single workstation processor than the conventional block Lanczos eigensolution requires on a multiprocessor Cray T90. Moreover, AMLS is an inherently parallel approach. Evidently, the AMLS approach can be of great benefit in the next generation of structural analysis problems of interest at Sandia. The two accepted paradigms for solving very large eigenvalue problems are the "shift and invert" block Lanczos algorithm, which uses a sparse direct solver for the iterations and is implemented in commercial code at Boeing, and the Lanczos implementation used in Salinas, in which the domain decomposition iterative solver FETI (Finite Element Tearing and Interconnecting) is used in place of the direct solver. Because of its reliance on a sparse direct solver, the first approach cannot be expected to perform well on problems having tens of millions of unknowns. In industry, the practical upper limit on problem size for this approach is generally accepted to be about 2.5 to 3 million unknowns. Solutions are obtained using an "out-of-core" algorithm, which requires the memory bandwidth and I/O capabilities of vector supercomputers to make even reasonably efficient use of CPU resources.

Project Summary: Numerous studies of the I/O characteristics of parallel applications have shown that in most cases multiple processors access shared data objects. However, the partitioning and layout of a shared data object in memory can differ from its physical layout on disk, in which case I/O performance can degrade significantly. To solve this problem, collective I/O was proposed: each participating processor performs I/O on behalf of other processors, and all processors then use the available interconnection network to exchange data so that each processor obtains the data it needs. This technique has been adopted by MPI-IO, the I/O part of the MPI-2 standard, whose goal is to define a high-performance I/O library for parallel and distributed I/O. In collective I/O operations, multiple processors may issue concurrent read/write requests to overlapping regions of a shared file. The result of writing to the overlapping regions can be defined as the data written by one of the processors, an aggregation from all processors, or left undefined. The mechanism for resolving this, called atomicity, is implemented differently across file systems and may involve locking shared files to guarantee the desired results. However, file locking reduces the parallelism of concurrent I/O and becomes the bottleneck of collective operations. We propose to develop techniques to address this problem. We plan to design a mechanism that automatically detects overlapping region accesses in collective I/O operations in order to reduce the number of file locks, pass appropriate parameters to the file-locking mechanism, or even remove the locking entirely.
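A minimal sketch of the kind of overlap detection proposed above (hypothetical byte-range requests and pure Python; the real mechanism would operate on MPI-IO file accesses inside the I/O library): given each process's write requests as (offset, length) pairs in a shared file, decide whether any two processes touch the same bytes; if not, locking for atomicity can be skipped.

    def conflicting_writes(requests_by_rank):
        # Flatten to (start, end, rank) intervals and sweep in order of start offset.
        intervals = sorted(
            (offset, offset + length, rank)
            for rank, requests in requests_by_rank.items()
            for offset, length in requests
        )
        far_end, far_rank = -1, None        # furthest extent seen so far, and its owner
        for start, end, rank in intervals:
            if start < far_end and rank != far_rank:
                return True                 # two different ranks write the same bytes
            if end > far_end:
                far_end, far_rank = end, rank
        return False

    # Disjoint block writes from two ranks: no lock needed.
    print(conflicting_writes({0: [(0, 4096)], 1: [(4096, 4096)]}))   # False
    # Rank 1's request overlaps rank 0's: atomicity handling (e.g., locking) is required.
    print(conflicting_writes({0: [(0, 4096)], 1: [(2048, 4096)]}))   # True

The sketch only flags cross-process overlaps; overlapping requests from the same process are left to that process to order.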
Project Summary: Our current projects involve core research in the design and analysis of algorithms for discrete optimization problems, with applications to computer and infrastructure surety and logistics. One of the projects I am involved with concerns network security and scheduling. The question we are interested in is the following. We have a network of users and, for each user pair, a permissible communication level: some users are not allowed to communicate, while others can do so with limited bandwidth. The communication levels are maintained by routers located at nodes of the network. When communication permissions change, the routers must be reprogrammed to enforce the new permission levels. The question we seek to address is how to reprogram the routers efficiently so that the new permission levels are enforced as quickly as possible. Even special cases of this problem are hard to solve exactly on large networks, so we investigate fast algorithms that find solutions close to optimal. This problem has connections to basic scheduling problems with precedence constraints, and we also plan to examine these connections.

Project Summary: PDE simulation-constrained optimization is a frontier problem in computational science and engineering. Often, the ultimate goal of simulation is an optimal design, optimal control, or parameter estimation problem, in which the PDE simulation is just an inner loop within the optimization iteration. Thus, the optimization (or "inverse") problem is significantly more difficult to solve than the simulation (or "forward") problem. When the simulation problem requires multi-gigaflop computing, as is often the case with complex 3D PDE systems, the optimization problem is of teraflop scale.

Simulation of Flow Through Porous Media
Abstract: Porous media problems are typified by flow equations, which relate velocities and pressure, and transport equations, which account for the conservation of certain chemical species. These mathematical models include coefficients that can vary by several orders of magnitude, point sources and sinks, and chemical reactions with widely disparate time scales. Steep gradients can occur in both fluid pressures and transported quantities. Simulating these features accurately and efficiently, while honoring the underlying conservation principles, is a desirable goal when applying numerical solution techniques to these equations. In this talk, we will review some of the common numerical methods in use today for these problems and discuss current research and future directions. CSRI POC: Monica Martinez-Canales, 08950, (925) 294-3157

Algorithms and Systems for High-Throughput Structural Biology
Abstract: In the post-genomic era, key problems in molecular biology center on the determination and exploitation of three-dimensional protein structure and function.
For example, modern drug design techniques use protein structure to understand how a drug can bind to an enzyme and inhibit its function. Large-scale structural and functional genomics will require high-throughput experimental techniques, coupled with sophisticated computer algorithms for data analysis and experiment planning. This talk will introduce techniques my lab is developing in two key areas: (1) data-directed computational protocols for high-throughput protein structure determination with nuclear magnetic resonance spectroscopy, and (2) experiment planning and data interpretation algorithms for reducing mass degeneracy in structural mass spectrometry for protein complex binding mode identification. These techniques promise to lead to fast, automated tools for pursuing the structural and functional understanding of biopolymer interactions in systems of significant biochemical and pharmacological interest. Applications to the challenge of structural proteomics will be discussed.

Abstract: We discretize the shallow water equations with an Adams-Bashforth scheme combined with the Crank-Nicolson scheme for the time derivatives and spectral elements for the discretization in space. The resulting coupled system of equations is reduced to a Schur complement system with a special structure of the Schur complement. This system can be solved with a preconditioned conjugate gradient method, where the matrix-vector product is available only implicitly. We derive an overlapping block preconditioner based on additive Schwarz methods for preconditioning the reduced system. This is joint work with Gundolf Haase (Johannes Kepler University) and Mohamed Iskandarani (University of Miami).

Abstract: ASCI Red and the Cray T3Es, for better or for worse, marked the culmination of the massively parallel architecture. They combined top-of-the-line commodity processors with interconnects that were (and still are) far ahead of the commodity world. Their programming environments were relatively bare bones, but top programmers could get excellent performance. Now we are entering the era of commodity clusters. The IBM SP, Linux clusters of various stripes, and the new Compaq Sierra class machines all scale to thousands of processors using interconnects best described as "commodity plus." The Colony switch, Myrinet, and Quadrics all aim to provide better bandwidth and latency than Ethernet, but will they be good enough? The performance of SnRad, a radiation solver developed at Sandia, and Metis, a graph partitioner from the University of Minnesota, will be used as an indicator of what we should expect from new machines and what we need to push for in their evolution. Performance results will be presented for ports to the Cray T3E and to the Sierra class machines. We will discuss the use of advanced network features provided through the T3E's E-registers and the Quadrics Elan co-processor. Advantages and disadvantages of using MPI and of using UPC and shmem-like calls will be discussed. We can debate whether the use of MPI with efficient sends and receives should drive our machine architecture desires or, if not, what features we will need and why. Inevitably, the generality of the machines that vendors sell will be bounded above by the requirements we give them. The good news is that better technology exists and is relatively affordable. The bad news is that vendors won't give it to us unless we insist on it.
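Relating to the shallow-water abstract above: a minimal sketch (small, made-up symmetric blocks, not the spectral-element system) of solving a Schur-complement system with conjugate gradients when the Schur complement S = C - B^T A^{-1} B is never formed, only applied to vectors through an inner solve with A.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(0)
    n, m = 60, 20
    A = np.diag(4.0 + rng.random(n))        # "interior" block, kept simple here (diagonal, SPD)
    B = 0.1 * rng.random((n, m))            # coupling block
    C = np.diag(10.0 + rng.random(m))       # "interface" block

    def apply_schur(y):
        # S y = C y - B^T A^{-1} (B y); the inner solve with A is the implicit part
        return C @ y - B.T @ np.linalg.solve(A, B @ y)

    S = LinearOperator((m, m), matvec=apply_schur, dtype=np.float64)
    g = rng.random(m)                       # reduced right-hand side
    y, info = cg(S, g, atol=1e-12)          # CG sees S only through apply_schur
    print("CG info:", info, "residual norm:", np.linalg.norm(apply_schur(y) - g))

In the spectral-element setting, the inner solve and the preconditioner would of course be far more elaborate, but the solver interface is the same: only the action of the reduced operator on a vector is needed.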
Abstract: We discuss the solution of systems of linear equations Ax = b, where the coefficient matrix A is large, sparse, real, and symmetric indefinite with KKT structure. These matrices arise from interior point methods in a wide variety of applications, including convective heat flow, putting, flight optimization between two cities, linear tangent steering, and space shuttle optimization. The blocks of the KKT matrix consist of structured bands. We consider solution by preconditioned Krylov space methods. The Krylov space methods include GMRES and CG, while the preconditioner is constructed from a congruence transformation made up of incomplete LDL^T factorizations and judiciously chosen permutations designed to drastically reduce the bandwidth.

Abstract: Additive Runge-Kutta (ARK) methods are considered for application to the spatially discretized compressible Navier-Stokes equations. First, accuracy and stability are considered for the general case in which N different Runge-Kutta methods are grouped into a single composite method. Then, implicit-explicit (N = 2) additive Runge-Kutta (ARK2) methods of third- to fifth-order are presented that allow integration of the stiff terms by an L-stable, stiffly accurate, explicit, singly diagonally implicit Runge-Kutta (ESDIRK) method while the nonstiff terms are integrated with a traditional explicit Runge-Kutta (ERK) method. Coupling error terms are of equal order to those of the elemental methods. The derived ARK2 methods have vanishing stability functions for very large values of the stiff scaled eigenvalue, z^[I] -> -infinity. All constructed methods retain high stability efficiency in the absence of stiffness, z^[I] -> 0. Extrapolation-type stage-value predictors are provided based on dense-output formulae. The methods have been optimized to minimize the leading-order ARK2 error terms, minimize the size of the Butcher coefficients, and maximize the conservation properties. Numerical tests of the new schemes on a chemical reaction inducing a propagating shock wave and on a two-equation example of a singularly perturbed initial-value problem confirm the predicted stability and accuracy of each of the methods.

Abstract: The Albuquerque High Performance Computing Center (AHPCC) has designed several production Linux superclusters for scientists and researchers to run a variety of parallel applications. The goal of these clusters is to provide easy-to-use, high-performance computing systems at reasonable prices. Superclusters are large-scale clusters built from commodity parts with high-performance interconnects, large storage, and remote management tools. Details of the design, implementation, and management of these systems will be discussed using examples from AHPCC's largest supercluster, LosLobos. This cluster has 512 733-MHz processors in 256 dual-processor nodes. Setting up the user application environment and tuning the systems for optimal performance will also be reviewed.

Abstract: Genetic algorithms (GAs) combined with local search have been named "memetic algorithms" (MAs). These methods are inspired by models of adaptation in natural systems that combine the evolutionary adaptation of populations of individuals with individual learning within a lifetime. Additionally, MAs are inspired by Richard Dawkins' concept of a meme, which represents a unit of cultural evolution that can exhibit local refinement.
In this talk we will review some work on the application of MAs to well-known combinatorial optimization problems, and the architecture of these algorithms will be studied. A syntactic model will be defined, and a classification scheme based on a computable index D will be given for MAs. The existence of both a model and a taxonomy for MAs is of theoretical and practical relevance: they allow for more sensible and "fair" comparisons of approaches and experiment designs, while providing a suitable tool for developing novel MAs for new problems. In essence, we introduce here a research program on memetic algorithms. Time permitting, a particular insight into the new kinds of MAs that can be extrapolated from our model will be given; specifically, we will describe the "Simple Inheritance Multimeme Algorithm" for TSP and protein structure prediction.

Abstract: There are many difficulties in modeling the growth of brittle fracture in elastic materials. The main model for fracture growth, based on Griffith's criterion, addresses only the rate of crack growth, not the direction, branching, or "brutal" crack formation. I will describe a model based on total energy minimization that requires neither a crack path to be specified a priori nor the presence of an initial crack. I will also address the question of existence for this model.

Abstract: The development of an unstructured agglomeration multigrid algorithm will be discussed. The algorithm was developed initially as a solver for steady and unsteady Reynolds-averaged Navier-Stokes problems, but it has recently also been applied to radiation-diffusion problems on unstructured meshes. Agglomeration multigrid constructs coarse grid levels using a graph algorithm that identifies groups of cells or control volumes to be merged, or agglomerated, into fewer but larger coarse-level control volumes. Coarse-level equations are obtained by Galerkin projection, using piecewise-constant restriction and prolongation operators. The algorithm is closely related to algebraic multigrid, but the coarsening phase depends only on the grid characteristics rather than on the stencil coefficients, resulting in static coarse levels for nonlinear problems. The algorithm can be formulated as a nonlinear solver (FAS multigrid), a linear solver (correction scheme), or a preconditioner to a Krylov method, and efficiency comparisons using these three variants will be discussed. For anisotropic problems, directional (weighted-graph) coarsening as well as directional (line-based) smoothers have been developed, and convergence rates insensitive to the degree of grid stretching are demonstrated. The algorithm is parallelized through domain decomposition, using MPI and/or OpenMP for communication between the partitions assigned to individual processors. Scalability benchmarks on all three ASCI machines using up to 2048 processors are given for a large-scale aerodynamic simulation. Comparisons of MPI versus OpenMP, and of combinations of MPI and OpenMP in a dual-level hybrid parallel mode, are also given.

Abstract: This talk presents the development and application of an adaptive heterodyne filter for use in adaptive line enhancers. The adaptive heterodyne filter provides a means of translating a high-order IIR filter with adaptation to enhance or attenuate sinusoids. The talk further demonstrates properties of similar adaptive line enhancer filters and the existence of unique solutions for second-order constrained IIR filters.
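Relating to the adaptive line enhancer abstract above: a minimal sketch of the general idea using a conventional FIR filter with a normalized LMS update (an illustrative assumption; the talk concerns an adaptive heterodyne IIR structure). The filter predicts the current sample from a delayed copy of the input, so the predictable sinusoid appears at the filter output while broadband noise is suppressed.

    import numpy as np

    rng = np.random.default_rng(0)
    fs, f0 = 1000.0, 60.0                        # sample rate and tone frequency (Hz)
    t = np.arange(4000) / fs
    x = np.sin(2 * np.pi * f0 * t) + 0.5 * rng.standard_normal(t.size)  # tone buried in noise

    n_taps, delay, mu = 32, 1, 0.05              # filter length, decorrelation delay, step size
    w = np.zeros(n_taps)
    enhanced = np.zeros_like(x)
    for k in range(n_taps + delay, x.size):
        u = x[k - delay - n_taps:k - delay][::-1]    # delayed input tap vector
        y = w @ u                                    # prediction = enhanced (narrowband) output
        e = x[k] - y                                 # prediction error = broadband residual
        w += mu * e * u / (u @ u + 1e-8)             # normalized LMS weight update
        enhanced[k] = y

    clean = np.sin(2 * np.pi * f0 * t)
    print("correlation with clean tone:", np.corrcoef(enhanced[2000:], clean[2000:])[0, 1])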
Abstract: We report on current activities in the development of the PYRAMID parallel unstructured AMR library. This Fortran 90-based library supports adaptive refinement, mesh migration, load balancing, partitioning, and automatic mesh quality control, all in parallel. Current and future design issues are described, including performance metrics associated with our transition to the 2.0 Gbit/s Myricom 2000 network on our 800 MHz dual-processor Pentium III Beowulf cluster.

Abstract: With a distributed shared memory (DSM) multiprocessor, programmers are provided with a globally shared memory address space on top of physically distributed memory modules. This view of the shared memory space is accomplished by sending request and data messages over the interconnection network that connects the processing elements. Thus, the cost of inter-processor communication has a significant effect on the performance of DSM multiprocessors. In this talk, the following aspects of DSM multiprocessors that affect inter-processor communication are discussed: (1) the interconnection network, (2) the cache coherence protocol, and (3) on-chip integration. The effects of these design choices on performance are studied with analytical models and simulations.

Abstract: Many scientific calculations require redistribution of their data and work during computation, either because the computation changes or because different components of the algorithm favor different partitions for better performance. Although effective algorithms and tools have been developed to find a new distribution of data, several problems remain open. First, data redistribution raises communication problems, such as actually moving data to implement a redistribution subject to memory constraints and determining the new communication pattern after the new subdomains are assigned. Second, most partitioners rely on a priori information on the runtime of tasks, which is not always available, as in the case of overlapping Schwarz domain decomposition, ILU preconditioners, or complete factorizations. Finally, for a given data distribution, some tasks can be assigned to any one of several processors without altering the communication costs, and a clever assignment of these flexibly assignable tasks will improve performance. In this talk, I will briefly describe these problems and our solutions.
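Relating to the last abstract: a minimal sketch (hypothetical task costs and candidate processors, not the partitioner described) of one natural heuristic for the "flexibly assignable tasks" observation: place each such task, heaviest first, on the least-loaded processor among those that leave the communication cost unchanged.

    def assign_flexible_tasks(fixed_load, flexible_tasks):
        # fixed_load: processor -> load already fixed by the partition
        # flexible_tasks: (task, cost, admissible processors) triples
        load = dict(fixed_load)
        assignment = {}
        for task, cost, candidates in sorted(flexible_tasks, key=lambda item: -item[1]):
            target = min(candidates, key=lambda p: load[p])   # least-loaded admissible processor
            assignment[task] = target
            load[target] += cost
        return assignment, load

    fixed = {0: 5.0, 1: 3.0, 2: 4.0}
    flexible = [("t1", 2.0, [0, 1]), ("t2", 1.5, [1, 2]), ("t3", 1.0, [0, 2])]
    assignment, load = assign_flexible_tasks(fixed, flexible)
    print(assignment)   # {'t1': 1, 't2': 2, 't3': 0}
    print(load)         # resulting per-processor loads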
doi:10.2172/800954