  • Big Data Interagency Working Group (BD IWG)

    The Big Data Interagency Working Group (BD IWG) works to facilitate and further the goals of the White House Big Data R&D Initiative.

  • Cyber Physical Systems Interagency Working Group (CPS IWG)

    The CPS IWG coordinates programs, budgets, and policy recommendations for Cyber Physical Systems (CPS) research and development (R&D).

  • Cyber Security and Information Assurance Interagency Working Group (CSIA IWG)

    The Cyber Security and Information Assurance (CSIA) Interagency Working Group coordinates the activities of the CSIA Program Component Area.

  • Health IT R&D Interagency Working Group

    The Health Information Technology Research and Development Interagency Working Group coordinates programs, budgets and policy recommendations for Health IT R&D.

  • Human Computer Interaction & Information Management Interagency Working Group (HCI&IM IWG)

    HCI&IM focuses on information interaction, integration, and management research to develop and measure the performance of new technologies.

  • High Confidence Software & Systems Interagency Working Group (HCSS IWG)

    HCSS R&D supports development of scientific foundations and enabling software and hardware technologies for the engineering, verification and validation, assurance, and certification of complex, networked, distributed computing systems and cyber-physical systems (CPS).

  • High End Computing Interagency Working Group (HEC IWG)

    The HEC IWG coordinates the activities of the High End Computing (HEC) Infrastructure and Applications (I&A) and HEC Research and Development (R&D) Program Component Areas (PCAs).

  • Large Scale Networking Interagency Working Group (LSN IWG)

    LSN members coordinate Federal agency networking R&D in leading-edge networking technologies, services, and enhanced performance.

  • Software Productivity, Sustainability, and Quality Interagency Working Group (SPSQ IWG)

    The purpose of the SPSQ IWG is to coordinate the R&D efforts across agencies that transform the frontiers of software science and engineering and to identify R&D areas in need of development that span the science and the technology of software creation and sustainment.

  • Video and Image Analytics Interagency Working Group (VIA IWG)

    The VIA IWG was formed to ensure and maximize successful coordination and collaboration across the Federal government in the important and growing area of video and image analytics.

  • Wireless Spectrum Research and Development Interagency Working Group (WSRD IWG)

    The Wireless Spectrum R&D (WSRD) Interagency Working Group (IWG) has been formed to coordinate spectrum-related research and development activities across the Federal government.


File:VALUATION OF ULTRA-SCALE COMPUTING SYSTEMS.pdf

VALUATION_OF_ULTRA-SCALE_COMPUTING_SYSTEMS.pdf (file size: 689 KB, MIME type: application/pdf)


EXECUTIVE SUMMARY:

The Ultra-Scale Computing Valuation Project was undertaken to gain insight on utilization issues for both users and managers of the largest scientific computing systems and to begin developing appropriate metrics and models for such systems. The objective of this Project was to define a consensus-based approach within the community for assessing the value of ultra-scale computing systems. The ultimate goal is acceptance of a proposed approach and of the resulting recommendations by the appropriate stakeholders.

Ultra-scale computers are general-purpose computers in actual use whose computing power (the combination of aggregate processor speed, memory size, and I/O speeds) is within a factor of ten of the highest-performance machine available. These ultra-scale computers normally lie outside the focus of commercial market forces. Today's ultra-scale machines incorporate thousands of powerful processors, are considered accelerated development machines, and are purchased to enable much larger and more complex computations than can presently be performed on more conventionally available platforms.

Participants in this effort included experts from universities, the federal government, national laboratories, and the computing industry. Several meetings were held, operational data were analyzed, and many discussions took place to arrive at the conclusions and recommendations outlined in this report.

An initial brainstorming session characterizing approaches to assessing the value of advanced computer systems kicked off the Project. This early session revealed a range of current practices among participants and resulted in a decision to collect and analyze job trace data from several supercomputing facilities. Some participants defined utilization as the fraction of node hours used out of the total time the advanced computing platform is available for use, while others defined utilization as the fraction of time the platform is in use regardless of its availability. Participants agreed that the distinction between these two definitions is relevant mostly for new machines, where the machine is unavailable for significant periods of time; as a machine matures, there is less down time and the two definitions converge.

According to either definition, a computer would be considered fully utilized if adding more jobs to the queue of jobs awaiting execution serves only to increase the average delay for jobs. Participants noted that neither of the above definitions refers to the utilization rate of any of a computer's sub-systems, such as memory, disks, or processors. Rather, it assumes that when a job is assigned to a particular node (a set of tightly coupled processors, memory, disks, etc.) within a parallel computer, all of those resources are unavailable for use by other jobs and therefore are considered utilized. Theoretically, peak utilization would be achieved when jobs were assigned to all nodes all of the time.

As the Project progressed, it became increasingly evident that utilization is not a sufficient measure of ultra-scale computer valuation. The first problem is that different organizations using different platforms use different approaches to address and measure utilization. Some progress has already been made on this front and more is expected as the recommendations of this report are implemented. A second, more serious problem with utilization as a metric is that driving utilization to too high a level almost always results in an overall slowdown in system performance. When the slowdown is significant, the effect of achieving very high utilization is a counterproductive decrease in the ability of the system to support the applications for which its acquisition was justified. A third and more subtle weakness of utilization is that it does not measure the capability quality of the machine. In fact, the replacement of many capacity jobs by any capability job requiring the same total amount of resource can only decrease the utilization. Utilization as a measure penalizes exactly the capability jobs that are the driving rationale for the creation of large, integrated, ultra-scale machines.
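
To make the node-hour accounting and the two utilization definitions above concrete, the following is a minimal sketch in Python. The job-trace fields (start hour, end hour, node count), the machine size, and the availability window are hypothetical illustrations, not data or formats from the Project's traces.

    from dataclasses import dataclass

    @dataclass
    class Job:
        start_hour: float  # wall-clock hour the job began running
        end_hour: float    # wall-clock hour the job finished
        nodes: int         # nodes held for the job's entire run

    def node_hours_used(jobs):
        # Each job is charged for all of its nodes over its full run time, mirroring
        # the view that an assigned node counts as utilized even if its processors,
        # memory, or disks are partly idle.
        return sum((j.end_hour - j.start_hour) * j.nodes for j in jobs)

    def utilization_vs_available(jobs, total_nodes, available_hours):
        # Definition 1: node-hours used out of the node-hours the platform was available.
        return node_hours_used(jobs) / (total_nodes * available_hours)

    def utilization_vs_wallclock(jobs, total_nodes, wallclock_hours):
        # Definition 2: node-hours used out of total elapsed node-hours, ignoring downtime.
        return node_hours_used(jobs) / (total_nodes * wallclock_hours)

    # Hypothetical month on a 1,000-node machine that was available for 600 of 720 hours.
    trace = [Job(0.0, 48.0, 512), Job(10.0, 110.0, 256), Job(50.0, 350.0, 128)]
    print(utilization_vs_available(trace, total_nodes=1000, available_hours=600))  # ~0.148
    print(utilization_vs_wallclock(trace, total_nodes=1000, wallclock_hours=720))  # ~0.123

On a new machine with substantial downtime the two figures diverge; as availability approaches the full wall-clock period they converge, as the participants noted.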

The bottom line for the participants is a consensus that the value provided by an ultra-scale machine can, in the long run, only truly be measured by the scientific output produced by using it. One must ask, "Is the system doing what it was designed and funded to do?" Ultimately this is not measured by node-hours used, but by capabilities converted into discoveries or other valuable scientific or technical outcomes. In the end, the facilities that operate ultra-scale computing systems should be judged in the same way other national facilities, such as accelerators, are judged. Typically, periodic peer review is used to assess whether the missions and goals of the facilities are being met. Such peer reviews have worked very well to ensure the effectiveness and efficiency of facilities that serve the targeted scientific community. The value of ultra-scale computing facilities and the scientific output of the systems should be evaluated in a similar manner.

As a result of this Project, participants were able to identify operational similarities at the sites they represented while recognizing that there are few general practices for measuring use and assessing value that will hold across all sites. Rather, what is needed is a sufficiently flexible and graded approach that each site can use to measure the contributions of advanced computing systems to scientific thinking and to meeting programmatic objectives. This approach must recognize that the first-of-their-kind status of ultra-scale platforms directly impacts initial utilization. Other factors that affect system performance and its overall value, such as allocation policies, utilization tradeoffs, and the absence of sufficient tools for measuring performance, were also identified.

Acceptable ways to evaluate "ultra-scale" computing systems are being defined and a degree of consensus on these approaches is emerging within the ultra-scale computing community. Analysis of trace data provided by the Project Team revealed desired operation ranges of response time and throughput (number of jobs) for a given workload. It was important to consider different classes of jobs and differing workloads in the analysis. It was also learned that attempting to obtain greater throughput than is obtained by running the machine within the Desired Operation Range (DOR) of each system results in a rapid deterioration of system response time.

Because of these considerations, the large-scale computing research and applications programs in government and academia agree that developing understandable and defensible measures for assessing the value and utilization of these platforms is essential. The community must make every effort to measure how effectively our national computing resources are being used, so that continued improvements can be made. The ultimate impact of this effort on individual sites and the agencies that manage them is tied to a willingness to define metrics for arriving at a desired operation range, on a system-by-system basis, and a subsequent agreement to modify practices and policies to move toward the optimum. Further research is needed in key areas such as designing more efficient scheduling algorithms. Ramifications of this Project for the high-end computing industry include probable changes in future procurement arrangements and recognition that managers of ultra-scale computing platforms require new tools to address utilization considerations.
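
The rapid deterioration of response time beyond the Desired Operation Range can be illustrated with a textbook queueing relation; the M/M/1 formula and the numbers below are an illustrative stand-in, not the Project's model, which was derived from measured trace data.

    def mean_response_time(arrival_rate, service_rate):
        # M/M/1 mean response time, T = 1 / (mu - lambda); it grows without bound
        # as utilization (lambda / mu) approaches 1.
        utilization = arrival_rate / service_rate
        if utilization >= 1.0:
            raise ValueError("saturated: response time is unbounded")
        return 1.0 / (service_rate - arrival_rate)

    SERVICE_RATE = 10.0  # jobs per hour the system can complete (hypothetical)
    for target_utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
        arrival_rate = target_utilization * SERVICE_RATE
        t = mean_response_time(arrival_rate, SERVICE_RATE)
        print(f"utilization {target_utilization:.0%}: mean response time {t:.2f} h")

Even in this simplified model, raising utilization from 90% to 99% multiplies the mean response time tenfold, which is the qualitative behavior that motivates operating each system within its DOR rather than at peak utilization.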

File history


Current version: 10:21, 28 December 2012
File size: 689 KB
User: Webmaster (talk | contribs)
Comment: Category:HEC
