High Performance Systems

DARPA, NSF, NASA, DOE, NSA, NIST

Over the past year, advances in high performance systems include system prototypes that demonstrate an order of magnitude reduction in cost per megaflop; workstation clusters using novel operating system techniques that achieve an order of magnitude reduction in application level latency; unprecedented performance levels of graphics rendering (50 million polygons/sec); and dramatic improvements in low latency communication rates for local area distributed computing applications (2 Gb/s). Advances also include formal theoretical methods for the verification of complex chip designs; technologies required to design, fabricate, and test computer system microarchitectures; and the infrastructure required to allow heterogeneous architectures to function together in an integrated system.

Heterogeneous computing environments combine computational engines with different architectures, file systems, and high speed interconnects. These heterogeneous systems provide support for solving computational problems that require different capabilities of their constituent subsystems. HPCC-supported researchers used a collection of heterogeneous supercomputers on the Internet to simulate and render the collision of two galaxies. This simulation is too large to run on any single system; therefore, researchers linked several geographically separated high end systems over high speed networks, with each system simulating the appropriate parts of the single model. The successful simulation yielded graphical display capabilities not possible before, and resulted in the first computer-generated scientific IMAX movie, to be shown at the Smithsonian National Air and Space Museum.

Networks of workstations (NOWs) -- systems that use a newly developed communication fabric to connect collections of heterogeneous, commercial workstations -- demonstrate the potential for supercomputer performance with workstations already in use. Research focuses on communications microarchitectures and high performance memory interfaces that improve bandwidth and latency for distributed applications. NOWs require research in distributed operating systems, fault detection and recovery, resource discovery and allocation across multiple administrative domains, real-time response guarantees, and multi-level security. This research demonstrated a promising first generation distributed resource management system that uses techniques for resource discovery and allocation, and advanced fault management. Researchers using a high performance interconnect on clusters of up to 50 workstations have demonstrated applications that see memory latencies of only a few microseconds. Driving applications for NOWs include the fastest known World Wide Web indexer and the world's fastest connected components kernel for solid-state physics.

Researchers using scalable, high performance computing systems for embedded applications focus on techniques to ensure that the highest possible performance scalable computing technology is openly available from a commercial technology base. R&D investments include research into novel architectures, real-time operating systems, standard library interfaces, program development environments, and demonstrations of the early insertion of new technologies into agency mission driven applications. The two level multicomputer architectural concept cleanly separates computation from communication to facilitate the development of heterogeneous systems critical to embedded applications. Researchers demonstrated a multicomputer capable of reducing costs to $10/Mflops, and this design is being transferred to several Defense systems houses for future insertion into embedded applications.

University researchers developed the PixelFlow system prototype that produced unprecedented performance levels for graphics rendering. The PixelFlow system uses a massively parallel processor-per-pixel approach capable of achieving a record 3-D graphics rendering performance of 50 million polygons per second. A major vendor will acquire the license to this technology and produce a series of new products that could set new standards in scalability, programmability, and performance.

In addition to the well known computing paradigms of shared memory and distributed memory, there is a third paradigm -- distributed shared memory (DSM) -- in which memory is physically distributed but logically shared. The DSM technique enables the user to view distributed memory resources as a single address space and provides transparent access to computational resources in scalable systems. DSM research efforts investigate both software and hardware approaches to implement a logically shared view. For example, the FLASH project designed a very high performance protocol engine to implement DSM on a VLSI chip. This programmable chip supports experimentation with various shared and distributed memory protocols emerging from the research community.
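The idea of memory that is physically distributed but logically shared can be illustrated with a minimal software sketch. It is not the FLASH design or any specific DSM protocol; the class names, the page size, and the round-robin page placement are all assumptions chosen for illustration. The sketch shows only the address translation step: a program reads and writes a flat global address, and the layer underneath decides which node actually holds the data.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes per page; an illustrative value, not from the report


@dataclass
class Node:
    """One machine holding a slice of the physically distributed memory."""
    node_id: int
    pages: dict = field(default_factory=dict)  # local page number -> bytearray


class SharedAddressSpace:
    """Presents the memory of several nodes as one flat address space."""

    def __init__(self, num_nodes: int):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def locate(self, address: int):
        """Translate a global address to (home node, local page, offset)."""
        page, offset = divmod(address, PAGE_SIZE)
        # Pages are interleaved round-robin across nodes (an assumed policy).
        node = self.nodes[page % len(self.nodes)]
        return node, page // len(self.nodes), offset

    def write(self, address: int, value: int) -> None:
        """Store one byte (0-255) at a global address."""
        node, local_page, offset = self.locate(address)
        page = node.pages.setdefault(local_page, bytearray(PAGE_SIZE))
        page[offset] = value

    def read(self, address: int) -> int:
        """Load one byte from a global address; untouched memory reads as 0."""
        node, local_page, offset = self.locate(address)
        page = node.pages.get(local_page)
        return page[offset] if page is not None else 0
```

A caller uses only `read` and `write` with global addresses and never names a node. A real DSM system must also keep cached copies coherent and move data over the interconnect, which this sketch leaves out entirely.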

An exciting new area of research is Biomolecular Computing (BC). Researchers have applied innovative molecular biology methods to a Hamiltonian path-finding problem to determine a feasible path through a complex search space using molecules of DNA in a test tube. BC has the potential to solve problems much faster than traditional silicon based supercomputers and to provide technologies capable of storing information in a tiny fraction of the spatial volume required for today's storage media. This is only one of several high risk technologies being explored under the HPCC Program over the next few years. Other novel approaches include computing systems based on quantum, molecular, and amorphous techniques. These systems explore entirely new algorithms to address problems in cryptography, very large distributed information systems, and symbolic computing.
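For readers unfamiliar with the Hamiltonian path problem, the following is a conventional brute-force sketch of it in software: generate every candidate ordering of the vertices, then keep one that follows only existing edges and visits each vertex exactly once. This is a sequential illustration of the problem itself, not a model of the laboratory procedure; the function name and the small example graph are hypothetical.

```python
from itertools import permutations


def hamiltonian_path(edges, start, end):
    """Return one directed path from start to end that visits every vertex
    exactly once, or None if no such path exists."""
    vertices = {v for edge in edges for v in edge}
    edge_set = set(edges)
    middle = sorted(vertices - {start, end})
    # Generate every ordering of the intermediate vertices (the candidate pool),
    # then keep the first candidate whose consecutive pairs are all edges.
    for order in permutations(middle):
        path = (start, *order, end)
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            return list(path)
    return None


# Hypothetical 5-vertex directed graph, made up for illustration.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 1), (1, 4), (3, 4), (2, 1)]
```

Calling `hamiltonian_path(EDGES, 0, 4)` searches the example graph for a path from vertex 0 to vertex 4. The number of candidate orderings grows factorially with the number of vertices, which is why approaches that test very many candidates at once are of interest.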

The MARQUISE program will repackage the J90 processor into a system with four parallel vector processors and 1024 Mbytes of DRAM, providing a peak performance of 800 Mflops. The prototype will employ four advanced packaging techniques to transform a conventional computer room-sized machine into a pizza box-sized package. The techniques are (1) multi-chip modules for weight and volume reduction; (2) diamond substrate to support high power and current density with robust reliability; (3) innovative phase change spray cooling to remove heat with very little added weight, volume, and power overhead; and (4) high density flex interconnect to support the signal-rich network interconnect.

Links to more detailed information: http://www.nitrd.gov/blue97/hps/