Today's high performance computing environment differs in three fundamental ways from the environment of 1991 when the HPCC Program began:
This contrasts with the 1991 environment in which users commonly ran their applications on single scalar or vector computers accessed via comparatively low speed local area networks and interacted little with the execution, or they ran their applications on their own workstations.
Several software developments have made the new environment possible. These include the following:
Systems Software
New scalable microkernel operating systems are the product of R&D during earlier years of the HPCC Program and have become more robust, more stable, and more complex. These operating systems manage dozens to thousands of parallel processors and associated memory (as well as vector processors in some cases); control I/O to peripherals such as disk drives, tape drives, printers, and advanced mass storage systems that incorporate robotics; and facilitate communications with more complicated and higher speed networks.
The microkernel basis for HPCC prototype operating systems has now been accepted by the broader computing community. In the past year, DEC, Hewlett-Packard, IBM, and Microsoft committed to use microkernel operating systems as the basis of their own offerings. The overhead involved in transitioning into and out of the microkernel has received considerable attention in the past two years. Technology developments include software fault isolation, co-location, and extensible microkernels.
Begun in FY 1994, the Scalable I/O Grand Challenge addresses the need to complement fast computing systems with fast input/output to memory. Teraflops systems that access terabytes of data will require Gb/s I/O. Instrumenting I/O-intensive Grand Challenge applications is yielding insights into I/O behavior that open opportunities for I/O performance optimization. Adding features to operating systems, languages, compilers, memory management, interfaces to mass storage systems, and runtime support could increase I/O performance 10-fold to 100-fold. The initiative is funded by ARPA, DOE, NASA, and NSF, and involves vendors and researchers from more than 30 institutions. Testbeds include a 512-node Intel Paragon at Caltech's Concurrent Supercomputing Consortium and a 128-node IBM Scalable POWERparallel (SP) at DOE's Argonne National Laboratory (ANL). Products will include a scalable I/O benchmark suite and an integrated set of tools and software systems that make scalable I/O capabilities available to the U.S. high performance computing community. One example is the Pablo software used for performance instrumentation, analysis, and visualization. Pablo development was funded by ARPA and NSF; it has been commercialized by Intel.
http://www.ccsf.caltech.edu/bluebook_96/nsfgcpiom.html
http://www-pablo.cs.uiuc.edu/HPCC.html
Snapshot of the dynamic patterns of read behavior in a parallel version of software to calculate electron-molecule cross-sections using a 128-processor Intel Paragon at Caltech's Concurrent Supercomputing Consortium. The axes are file open duration, file seek duration, and file read duration. The locations of the octahedra are the current values of each processor's performance metric. History ribbons show the last N positions for three select octahedra (red is most recent). The Pablo software was used to produce this image.
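The three metrics named in the caption above (file open, seek, and read duration) can be collected with very little code. The following Python fragment is only an illustrative sketch and is not the Pablo interface: the class name TimedFile, its fields, and the demo function are all hypothetical. It wraps an ordinary file and accumulates the time spent in each kind of call.

```python
import os
import tempfile
import time
from collections import defaultdict


class TimedFile:
    """Hypothetical wrapper that records time spent in open, seek, and read."""

    def __init__(self, path):
        self.durations = defaultdict(float)  # total seconds per operation
        self.counts = defaultdict(int)       # number of calls per operation
        start = time.perf_counter()
        self._f = open(path, "rb")
        self._record("open", start)

    def _record(self, op, start):
        self.durations[op] += time.perf_counter() - start
        self.counts[op] += 1

    def seek(self, offset):
        start = time.perf_counter()
        self._f.seek(offset)
        self._record("seek", start)

    def read(self, nbytes):
        start = time.perf_counter()
        data = self._f.read(nbytes)
        self._record("read", start)
        return data

    def close(self):
        self._f.close()


def demo():
    """Write a small scratch file, read part of it back, return the wrapper."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    tf = TimedFile(tmp.name)
    tf.seek(4)
    data = tf.read(3)
    tf.close()
    os.remove(tmp.name)
    return tf, data
```

In a parallel application each processor would keep its own set of counters, which is the per-processor data a display such as the one above would plot.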
Programming Languages and Compilers
High Performance Fortran (HPF)
The HPF language definition was completed in 1994. It allows straightforward expression of data parallel constructs. HPF has become a de facto standard language for parallel computing systems, providing Fortran programmers with a familiar portable language. Both computing systems vendors and independent software vendors have written compilers that translate HPF commands into machine instructions that distribute the computation across the processors, memory, and networks. The HPF Forum is a coalition of government, high performance computing systems vendors, and academic groups that is coordinated by the Center for Research on Parallel Computation (CRPC). The Forum is studying extensions that support task parallelism and scalable I/O. ARPA and NSF directly support this effort.
http://www.erc.msstate.edu/hpff/home.html
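To make the idea of distributing a data parallel computation concrete, the following Python sketch mimics a block distribution of an array across processors. It is not HPF and not the output of any HPF compiler; the function names block_bounds and owner_computes_scale are hypothetical, and the "processors" are simulated by an ordinary loop.

```python
def block_bounds(n, nprocs, rank):
    """Return the half-open index range [lo, hi) owned by `rank` when
    n elements are split into contiguous blocks of size ceil(n / nprocs)."""
    block = -(-n // nprocs)          # ceiling division
    lo = min(rank * block, n)
    hi = min(lo + block, n)
    return lo, hi


def owner_computes_scale(a, factor, nprocs):
    """Simulate each processor updating only the elements it owns."""
    result = list(a)
    for rank in range(nprocs):       # on a real machine these run concurrently
        lo, hi = block_bounds(len(a), nprocs, rank)
        for i in range(lo, hi):
            result[i] = factor * a[i]
    return result
```

With 10 elements and 4 processors, this scheme gives the first three processors three elements each and the last processor one.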
High Performance C++ (HPC++)
ARPA supports a multi-institution collaborative effort that is defining a minimal set of extensions to C++ that support both task and data parallel constructs, for use by C and C++ programmers. ARPA and NSF are also developing common runtime support environments for both HPF and HPC++ in order to increase efficiency and reduce barriers in compiler development.
High-Efficiency Languages and Compilers
NSA has long supported research in high performance languages and compilers for specific high performance computing systems with the goal of achieving near-peak performance in many applications and helping to achieve advertised price/performance characteristics of existing systems. These joint NSA/industry efforts include the AC extended C compiler for the Thinking Machines CM-5, which shows dramatic speedups over the vendor's data parallel compiler; the DBC bit serial data parallel C compiler for the joint Cray Computer/NSA Cray-3/PIM system; and AC for the Cray Research T3D, which has mechanisms allowing efficient emulation of shared-memory programming on distributed memory multiprocessors. Consistency and clarity of the mapping between language and architecture underlie these successes. Continued research has the goal of extending these models to other architectures and using them as a basis for higher level programming models.
Parallel Virtual Machine (PVM)
PVM software permits a heterogeneous collection of networked computing systems, all of which use the Unix operating system, to be used as a single large computer. Developed early in the HPCC Program with multi-agency support, PVM is used at hundreds of sites worldwide both for problem solving and as a tool to teach parallel programming.
http://www.netlib.org/pvm3/index.html
Message Passing Interface (MPI)
MPI is a specification for a standard portable library of subprograms for message passing that can be called from programs written in Fortran or C. Begun at Supercomputing '92 in November 1992, the first phase of MPI was completed in May 1994 with ARPA, DOE, and NSF support. Message passing gives control of parallelism to the application developer rather than to the hardware or the compiler. MPI can be implemented on any of today's parallel computing systems or networked heterogeneous workstations. DOE funded the public-domain portable version that was implemented by Argonne National Laboratory and Mississippi State University. Future DOE products will include an instrumented version that produces communications statistics and conversion of linear algebra and partial differential equations libraries to use MPI.
http://www.mcs.anl.gov/Projects/mpi/index.html
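The statement that message passing gives control of parallelism to the application developer can be illustrated with a small sketch. The Python fragment below is not MPI and uses no MPI calls; as an assumption made purely for illustration, threads stand in for processes and queues stand in for the message channels, and the names worker and parallel_sum are hypothetical. The point is that the program itself decides how the data are split, sent, and gathered.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive one chunk of numbers and send back its partial sum."""
    chunk = inbox.get()              # explicit receive
    outbox.put(sum(chunk))           # explicit send


def parallel_sum(values, nworkers):
    """Split the data, send one piece to each worker, gather the results."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(nworkers)]
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for inbox in inboxes]
    for t in threads:
        t.start()
    for rank, inbox in enumerate(inboxes):
        inbox.put(values[rank::nworkers])   # each worker gets a strided slice
    total = sum(outbox.get() for _ in range(nworkers))  # gather partial sums
    for t in threads:
        t.join()
    return total
```

A call such as parallel_sum(list(range(101)), 4) adds the integers 0 through 100 using four workers.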
Parallel Tools Consortium (Ptools)
Ptools is an open community of vendors (both systems and software), Federal organizations, and university researchers developing public-domain portable reference implementations of tools for parallel environments. The Consortium seeks to identify and build tools that users want and will use, and to involve users in tool development and refinement. Ptools was established in 1993 and is headquartered at Oregon State University. Current projects include distributed array visualization, a lightweight corefile browser, a message queue manager, parallel Unix commands, and portable timing routines.
http://www.llnl.gov/ptools/ptools.html
Computational Techniques
Algorithms for numerical computations and for finding and moving data are widely used in Grand Challenge applications. They are developed by experts and included in general purpose libraries of reusable software. The HPCC Program has funded such software development since its inception, and much of that software is becoming available through the National HPCC Software Exchange (NHSE). A catalog of available software can be found at:
http://www.netlib.org/nse/
Multipole Method for Solving Differential Equations
Performance improvements have been the result of algorithm development as much as hardware speedup. An example of this trend is the development of the multipole method, which also illustrates the contributions of basic research to HPCC. The algorithm provides the solution of a differential equation approximated at N points in order N operations. Thus doubling the number of points in order to obtain a more accurate solution requires only twice as much work; other algorithms might require many times the work to obtain the same solution. The original research for the method was supported by NSF at Yale University. The algorithm is now a cornerstone for major simulations including Grand Challenges at Caltech and the University of Illinois.
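The scaling argument above can be checked with simple arithmetic. The sketch below compares an order N operation count with that of a direct method that evaluates every pair of points; the pairwise method is an assumed point of comparison chosen for illustration, and the function names are hypothetical.

```python
def linear_operations(n, per_point=1):
    """Operation count for a method whose work grows in proportion to N."""
    return per_point * n


def pairwise_interactions(n):
    """Operation count for a direct method that evaluates every pair of points."""
    return n * (n - 1) // 2


def growth(cost, n):
    """Factor by which the work grows when the number of points is doubled."""
    return cost(2 * n) / cost(n)
```

For N = 1000, growth(linear_operations, 1000) is exactly 2, while growth(pairwise_interactions, 1000) is slightly more than 4, so each doubling of resolution costs the direct method about twice as much again.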
Unstructured Mesh Computation with PUMAA3D
Both computations and parallel data management are issues in unstructured mesh computations, in which a complex surface or solid object is partitioned into a collection of covering triangles or trapezoids. An application package then solves a system of equations at the nodes of this mesh. A practical application is finite element analysis of the effects of pressure and temperature on disk brakes. The PUMAA3D software developed at DOE's Argonne National Laboratory implements efficient parallel algorithms for unstructured mesh generation, adaptive mesh refinement (for example, where the pressure or temperature is greatest), mesh partitioning, and the solution of the sparse linear systems that commonly occur in these applications. The software is in the public domain, flexible, and portable to a wide variety of distributed memory systems. PUMAA3D development has been done on the Intel Delta and the IBM SP system.
http://www.mcs.anl.gov/Projects/meshacc94.html
New techniques adaptively refine, de-refine, and partition meshes to accurately model rapidly changing solutions such as those that arise in simulating layered high temperature superconductors.
ScaLAPACK
Funded by ARPA and NSF, ScaLAPACK is a scalable software library for common linear algebra computations. These include linear and eigen solvers for both dense and sparse matrices. Many sparse problems have internal data representations that make it infeasible to use standard libraries. A book of templates for building custom sparse linear solvers has been developed. The use of reverse communication that allows users to customize solvers to their data structures is being investigated.
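Reverse communication, mentioned above, means that the solver never touches the matrix: whenever it needs a matrix-vector product it returns control to the caller, who computes the product with whatever data structure the application uses. The following Python sketch shows the pattern with a textbook conjugate gradient iteration for a symmetric positive definite system. It is not ScaLAPACK code or its interface; the names cg_reverse and solve_with are hypothetical, and a Python generator stands in for the return-and-resume mechanism.

```python
def cg_reverse(b, tol=1e-10, max_iter=100):
    """Conjugate gradient for A x = b, written so the solver never sees A.
    It yields a vector p and expects the product A p to be sent back."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the start x = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = yield p                  # hand control back to the caller
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def solve_with(matvec, b):
    """Driver: the caller supplies A p using its own sparse data structure."""
    solver = cg_reverse(b)
    try:
        request = next(solver)
        while True:
            request = solver.send(matvec(request))
    except StopIteration as done:
        return done.value
```

The caller's matvec can walk a dictionary of nonzero entries, a linked structure, or a distributed array; the solver is unchanged in each case, which is the appeal of the approach for sparse problems.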
Handling Irregular Data using CHAOS
A number of scientific computing applications have unstructured, sparse, adaptive, or block-structured data. Two such applications are computational fluid dynamics and molecular dynamics. With ARPA funding the portable CHAOS runtime support library has been developed to help parallelize these applications. Library features include (1) coordinated interprocessor data movement, (2) off-processor data management, (3) support for a shared name space, and (4) runtime data and workload partitioners coupled to compilers. The library can be called from the Fortran D and HPF programming languages and, in the future, from C++. CHAOS has been used to parallelize the CHARMM molecular dynamics application to run on several systems. Current efforts address handling hierarchical data structures and optimizing disk access. This work is being conducted at the University of Maryland, where it is being used for the land cover dynamics Grand Challenge, and at LANL, Caltech, Rice University, and Syracuse University. It is being used by the ARPA-funded Parallel Compiler Runtime Consortium and the Scalable I/O initiative.
http://www.cs.umd.edu/projects/hpsl/projects/blue_book.html
Input molecule for the CHARMM molecular dynamics software that has been parallelized on multiple systems using the CHAOS runtime library. Key portions of CHARMM have been automatically parallelized using an enhanced version of the Fortran D compiler.
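Off-processor data management for irregular applications can be pictured as follows. When array subscripts are known only at run time, a preprocessing pass can scan them, work out which elements live on other processors, and build a list of what must be fetched from each. The Python sketch below illustrates that idea under simplifying assumptions (a block-partitioned array and a single processor's view); it is not the CHAOS interface, and the names owner_and_offset and build_schedule are hypothetical.

```python
def owner_and_offset(global_index, block):
    """For a block-partitioned array, return which processor owns an
    element and the element's position within that processor's block."""
    return global_index // block, global_index % block


def build_schedule(indirection, my_rank, block):
    """Scan the indices this processor will reference and group the
    off-processor ones by owner, with duplicates removed, so that each
    remote element is fetched only once."""
    schedule = {}
    for g in indirection:
        owner, offset = owner_and_offset(g, block)
        if owner != my_rank:
            schedule.setdefault(owner, set()).add(offset)
    return {owner: sorted(offsets) for owner, offsets in schedule.items()}
```

For example, with blocks of four elements, processor 0 referencing global indices 0, 5, 9, 5, 2, and 10 needs one element from processor 1 and two from processor 2; the repeated reference to index 5 costs nothing extra.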
Performance Measurement
Automated Instrumentation and Monitoring System (AIMS)
NASA has developed AIMS tools to help detect performance bottlenecks and suggest ways to eliminate them. The following features have been included or are under development:
AIMS is used at NASA and DOE facilities and in universities for both teaching and evaluation. Convex Computer Corporation produced CXTRACE, which is based on AIMS and runs on its SPP-1 and HP cluster. AIMS will be implemented on the IBM SP-2 and is under evaluation by other vendors.
http://cesdis.gsfc.nasa.gov/hpccm/accomp/94accomp/cas94.accomps/cas3.html
Performance of a Toroidal Architecture
The NSA-funded Supercomputing Research Center, part of the Institute for Defense Analyses, has developed a performance monitoring system that displays the dynamic performance of system components such as memory references, memory bank stalls, network traffic, and processor stalls (shown at the right). Features include discrete group actions in hyperbolic geometry, scrolling the workstation screen with hyperbolic transformations, and displaying function graphs over network nodes and links.
This hyperbolic scrollable display that maps the three-dimensional toroidal Cray Research T3D network onto the two-dimensional workstation screen without false crossings was developed at the Supercomputing Research Center.
Performance Instrumentation
NIST, with ARPA support, has developed the MultiKron chip that unobtrusively monitors memory busy traffic and can gather statistics or traces at user request. This chip has been installed on Intel Paragons. The NIST S-Check project, also supported by ARPA, and the Paradyn project at the University of Wisconsin, which receives ARPA and DOE support, automatically instrument a program and gather runtime statistics useful in understanding performance.
HINT (Hierarchical INTegration)
HINT evaluates computer performance by measuring the quality of an answer as a function of time, instead of counting operations as conventional benchmarks do. The unit of measurement is QUIPS (QUality Improvement Per Second). HINT reveals the characteristics of a computing system, demonstrating, for example, the impact of cache, memory size, and operating system overhead. The software is easy to transport to any type of modern computing system, making the benchmarking of parallel systems as easy as that of conventional systems. Because HINT was designed to predict application performance, it can be used as a computer design tool. HINT was developed by DOE's Ames Laboratory.
http://www.scl.ameslab.gov/scl/Projects/hint1.html
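The notion of answer quality per unit time can be sketched with a toy calculation. The Python fragment below is not the HINT benchmark; the integrand, the uniform subdivision, the definition of quality as the reciprocal of the gap between an upper and a lower bound, and the function names are all assumptions made for illustration. It bounds an integral from above and below and reports how much quality was obtained per second.

```python
import time


def bounds(n):
    """Lower and upper bounds on the integral of (1-x)/(1+x) over [0, 1]
    using n equal subintervals. The integrand is decreasing, so right
    endpoints give a lower bound and left endpoints an upper bound."""
    def f(x):
        return (1.0 - x) / (1.0 + x)
    h = 1.0 / n
    upper = sum(f(i * h) for i in range(n)) * h
    lower = sum(f((i + 1) * h) for i in range(n)) * h
    return lower, upper


def quality_per_second(n):
    """Return (quality, quality / elapsed seconds) for n subintervals."""
    start = time.perf_counter()
    lower, upper = bounds(n)
    elapsed = time.perf_counter() - start
    quality = 1.0 / (upper - lower)
    return quality, quality / max(elapsed, 1e-9)
```

Plotting the second value against elapsed time for increasing n gives a curve whose shape depends on the machine, for instance dropping once the working set no longer fits in cache.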
Benchmarking
NASA Parallel Benchmarks
These benchmarks reflect the diverse computational demands of NASA's mission-oriented science and engineering programs. There are two sets of benchmarks: the Parallel Benchmarks for Numerical Aerodynamic Simulation (NAS) and the Parallel Benchmarks for Earth and Space Sciences (ESS). The NAS Parallel Benchmarks were developed in 1991 to evaluate the performance of parallel computing systems for workloads that typify those used by the aerospace engineering community. The ESS Parallel Benchmarks are a new set of test programs typifying those used by the Earth and space sciences community. Both are designed to enable portability across disparate classes of parallel architectures and system configurations while retaining the basic nature of the computations to be performed. They increasingly serve as procurement criteria in industry and academia, and aid internal development of current and future systems by parallel computing systems vendors.
http://cesdis.gsfc.nasa.gov/hpccm/accomp/94accomp/bench.html
Joint NSF-NASA Initiative on Evaluation (JNNIE)
The JNNIE study, a collaboration between NSF and NASA high performance computing centers, is evaluating the effectiveness of various parallel architectures for large-scale scientific computing, determining the level of effort required to transport realistic-sized codes to parallel environments, determining the extent to which portability is achievable across diverse architectures (and at what price in performance), and disseminating this information to academia and industry. To this end, participants have been conducting scalability studies and cross-architecture comparisons of a number of academic and third-party application codes in the multi-vendor scientific computing environment.
Software Sharing
National HPCC Software Exchange (NHSE)
In September 1994 the HPCC Program funded NHSE to collect software (or software descriptions) for high performance computing systems and make it available on the Web. The top level organizational structure for the software is: application level, library software, parallel programming environments, parallel programming languages, parallel programming tools, and performance visualizations. NHSE also contains a hardware and software vendor catalog and information about reports, journals, and professional associations.
http://www.cs.rice.edu/CRPC/bluebook/bluebook.html
Rapid, three-dimensional, high-resolution color display of simulation results, often overlaid with other data such as experimental observations and open to user interaction, is critical to understanding the results of high performance computations efficiently. These displays can be single images or video. Several projects are developing improvements in this field:
Cave Automatic Virtual Environment (CAVE)
The CAVE was developed at the Electronic Visualization Laboratory at the University of Illinois at Chicago. Two additional CAVEs have been built at the National Center for Supercomputing Applications (NCSA) and Argonne National Laboratory. The CAVE is a multi-person, room-sized, surround-screen, surround-sound, projection-based virtual reality environment. Graphics are rear projected in three dimensions onto the walls and floor and viewed with stereo glasses. Several people can explore an environment together in the room. As the principal viewer wearing a location sensor moves within the display boundaries or controls the image using a "wand" (the CAVE equivalent of a "mouse"), the perspective and stereo projections of the environment are updated. Each institution pursues a wide range of interactive applications including molecular modeling (drug design), product design, medical imaging, manufacturing, cosmology, and education. In addition, research continues to develop and extend CAVE capabilities such as coupling these virtual environments to supercomputing resources for large-scale background computations. A key goal is demonstrating distributed collaboration in which the three CAVEs are linked by high-bandwidth network connections for visualizing the same application in the three locations.
http://www.ncsa.uiuc.edu/EVL/docs/html/CAVE.html
http://www.mcs.anl.gov/anlcave.html
http://www.ncsa.uiuc.edu/VR/VR/VRHomePage.html
The Space Science and Engineering Center (SSEC) at the University of Wisconsin at Madison in cooperation with EPA has developed an interactive three-dimensional visualization tool (shown below) for use at desk-top workstations that is based on SSEC's Vis5D for visualizing output from atmospheric and ocean models. Vis5D has also received non-HPCC NASA funding.
http://ssec.wisc.edu/~billh/epa.html
Application of Vis5D to EPA's Regional Acid Deposition Model shows transparent volume rendering of sulfur dioxide (the red fog) and a horizontal slice with iso-lines of nitric acid over a topographic map of the Eastern U.S. The icons on the left give the user interactive control over the three-dimensional images as they are animated. Vis5D makes this interactive exploration possible by compressing data sets to fit in workstation memories. Vis5D has been used for experiments over the Blanca Gigabit Testbed and has been adapted to run in the virtual reality CAVE (described above); it is freely available over the Internet.
VolVis
VolVis is a multi-featured visualization package developed at the State University of New York at Stony Brook through DOE's Partnership in Computational Sciences (PICS) consortium centered at ORNL. It accommodates both regular and irregular grids and includes (1) manipulation and rendering tools, (2) flexible I/O, (3) volumetric navigation, (4) a key-frame animation generator, (5) quantitative analysis tools, and (6) a protocol for communicating with three-dimensional input devices. It is available for systems with the X Window System and Motif. Parallel versions of key algorithms including output visualization algorithms for several platforms have been developed.
Other Research Activities
ARPA is developing technology and manufacturing capability for high definition displays and scalable image processing systems. Components include projection, head-mounted and direct-view displays based on multiple technologies, display architectures and processors, compression algorithms, and high speed data transmission. In FY 1996 ARPA plans to establish a testbed for interoperability standards for display interfaces and to demonstrate a prototype high-resolution progressive scan digital camera.
ARPA has also funded the development of multi-resolution modeling algorithms to simplify surface models so that they can be displayed faster. The algorithms display nearby objects using the detailed input data but display distant objects more coarsely. Related projects include (1) automating shape acquisition, (2) modeling of enclosing environments such as rooms, factories, and ports for architecture, simulation, and training applications, and (3) radiosity algorithms that simulate indirect lighting.
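The rule of showing nearby objects in detail and distant objects more coarsely amounts to choosing among several precomputed versions of a model according to viewing distance. The sketch below shows only that selection step, not the simplification algorithms themselves; the function name choose_level and the threshold values in the example are hypothetical.

```python
import bisect


def choose_level(distance, thresholds):
    """Return the index of the model version to draw; 0 is the most detailed.
    `thresholds` is an increasing list of distances at which the display
    steps down to the next coarser version."""
    return bisect.bisect_right(thresholds, distance)
```

With thresholds of 10 and 100 distance units, an object 5 units away is drawn from the full-detail model, one 50 units away from the first simplified version, and one 500 units away from the coarsest.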
A virtual reality Grand Challenge is described in Section II.6.
Synthetic cutaway view of Crater Lake, OR. The upper figure is the original digital elevation model that contains 307,000 triangles and requires approximately 10 seconds to display. A new algorithm was used to produce the lower figure, a simplified model with 800 triangles that can be displayed in real time.