Reprints from my posting to SAN-Tech Mailing List and ...


[san-tech][03515] Oak Ridge HPC Operational Assessment Report, CY 2011, February 2012

Date: Mon, 09 Apr 2012 13:58:42 +0900
This is the CY 2011 (2011/01/01-12/31) Operational Assessment Report (OAR)
of the Oak Ridge Leadership Computing Facility:

"High Performance Computing Facility Operational Assessment,
 CY 2011 Oak Ridge Leadership Computing Facility"
 February 2012, 89 pages
 U.S. Department of Energy, Office of Science

Oak Ridge National Laboratory's Leadership Computing Facility (OLCF)
Argonne Leadership Computing Facility (ALCF)

Innovative & Novel Computational Impact on Theory and Experiment (INCITE)
"INCITE in Review", March, 2012

A CHARGE QUESTION is set for each of User Results/Business Results/
Strategic Results/Innovation/Risk Management/Summary of the Proposed
Metric Values; each chapter opens by answering it and then follows with
supporting material.

The CHARGE QUESTIONs, plus terms, graphs, and other items that caught my eye while skimming:

"Oak Ridge National Laboratory's Leadership Computing Facility (OLCF)
 continues to deliver the most powerful resources in the U.S. for
 open science. At 2.33 petaflops peak performance, the Cray XT Jaguar
 delivered more than 1.4 billion core hours in calendar year (CY) 2011
 to researchers around the world ..."
"Effective operations of the OLCF play a key role in the scientific
 missions and accomplishments of its users. This Operational Assessment
 Report (OAR) will delineate the policies, procedures, and innovations
 implemented by the OLCF to continue delivering a petaflop-scale resource
 for cutting-edge research. This report covers CY 2011 that unless
 otherwise specified, denotes January 1, 2011 through December 31, 2011."
  Communications with Key Stakeholders
    Communication with the Program Office
    Communication with the User Community
    Communication with the Vendors
    Communication with Advisory Groups
  Summary of 2011 Metrics
  Responses to Recommendations from the Previous 2011 Operational Assessment Review

User Results
    Are the processes for supporting the customers, resolving problems,
    and outreach effective?
  1.1 User Results Summary
  1.2 User Support Metrics
  1.2.1 Overall Satisfaction Rating for the Facility
  1.2.2 Average Rating across All User Support Questions
  1.4 Problem Resolution Metrics
  1.4.1 Problem Resolution Metric Summary
    Figure 1.1. Number of Helpdesk Tickets Issued per Month
    Figure 1.2. Categorization of Helpdesk Tickets
  1.5 User Support and Outreach
  1.5.3 Scientific Computing Liaisons
    Responding to Time-Critical Needs
      the OLCF's rapid response to the Fukushima nuclear accident
    Table 1.9. Training Event Summary
  1.6 User Support Conclusion (Page 26)
    user satisfaction (4.2/5.0)
    user services (4.1/5.0)
    problem resolution (4.2/5.0)
    to address user problems within 3 business days
    OLCF training effort and rated it a 4.2/5.0

Business Results
    Is the facility maximizing the use of its HPC systems and other
    resources consistent with its mission?
  2.2 Cray XT Compute Partition Summary
    Table 2.3. OLCF Business Results Summary for HPC Systems
      Cray XT/HPSS: Scheduled Availability/MTTI/MTTF/Total Usage etc.
  2.3 Resource Availability
  2.3.1 Scheduled Availability
  2.3.2 Overall Availability
    Increasing System Availability
    Figure 2.1. Eliminating VRM failures increases system stability
  2.3.3 Mean Time to Interrupt
  2.3.4 Mean Time to Failure
  2.4 Resource Utilization
  2.4.1 Total System Utilization
    Table 2.10. 2011 OLCF System Utilization
    Figure 2.2. 2011 XT5 Resource Utilization - Core Hours by Program
  2.5 Capability Utilization
    Figure 2.3. Effective Scheduling Policy Enables Leadership-class Usage
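
A note on the availability metrics listed above (Scheduled Availability,
Overall Availability, MTTI, MTTF): this outline does not reproduce the
report's definitions, so the sketch below assumes the formulas commonly
used in DOE operational assessments. The Outage record, the summarize()
helper, and the sample outage log are all hypothetical and are not taken
from the report; only the metric names come from the outline.

```python
from dataclasses import dataclass

HOURS_IN_2011 = 365 * 24  # CY 2011 is not a leap year: 8760 hours


@dataclass
class Outage:
    hours: float      # duration of the outage in hours
    scheduled: bool   # True for planned maintenance, False for a failure


def summarize(outages, period_hours=HOURS_IN_2011):
    """Compute the four metrics under the ASSUMED definitions."""
    sched = [o.hours for o in outages if o.scheduled]
    unsched = [o.hours for o in outages if not o.scheduled]
    sched_h, unsched_h = sum(sched), sum(unsched)
    up_h = period_hours - sched_h - unsched_h  # hours the system was up
    return {
        # Uptime as a share of the time the system was planned to be up.
        "scheduled_availability": up_h / (period_hours - sched_h),
        # Uptime as a share of the whole period.
        "overall_availability": up_h / period_hours,
        # Mean time to interrupt: every outage counts as an interrupt.
        "mtti_hours": up_h / (len(sched) + len(unsched) + 1),
        # Mean time to failure: only unscheduled outages count.
        "mttf_hours": (period_hours - unsched_h) / (len(unsched) + 1),
    }


# Made-up outage log for illustration only.
sample = [Outage(60, True), Outage(40, True), Outage(100, False)]
metrics = summarize(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed definitions, scheduled maintenance lowers Overall
Availability and MTTI but leaves Scheduled Availability and MTTF
largely untouched, which is why a report would list all four.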

Strategic Results
    Is OLCF enabling scientific achievements consistent with
    the Department of Energy Strategic Goal 2, which is to
    "maintain a vibrant U. S. effort in science and engineering as
    a cornerstone of our economic prosperity and clear leadership
    in strategic areas?"
  3.1 Science Output
    Table 3.1. List of OLCF Publications
  3.2 Scientific Accomplishments
  3.3 Accomplishments in Energy Systems Research
  3.4 Allocation of Facility Director's Reserve
  3.4.1 Director's Discretionary Program
    Table 3.2. Director's Discretionary Program: Domain Allocation Distribution
    Table 3.3. Director's Discretionary Program: Awards and User Demographics
  3.4.2 Industrial HPC Partnerships Program
    Table 3.4. Industry Projects at the OLCF

Innovation
    What innovations have been implemented that have improved
    the facility's operations?
  4.1 Application Readiness
  4.2 Application Support
  4.3 Outreach
  4.4 Systems
    Breaking Bottlenecks in Parallel I/O - Innovative Systems
    "I/O Congestion Avoidance via Routing and Object Placement"
     2011 Cray User Group meeting
    Intuitive Data Portal for Collaborative Climate Science
    - Innovative Systems
    Real-time Monitoring of Simulations through an Integrated Dashboard
    - Innovative Systems
  4.5 Leadership
    Empowering a Sustainable Lustre Ecosystem through OpenSFS
    - Innovative Leadership
  4.6 Energy Management
    Effects of CRU Top Hats on Air Flow - Innovative Energy Management

Risk Management
    Is the Facility effectively managing risk?
  5.1 Risk Management
  5.2 Major Risks Tracked in the Current Year

Summary of the Proposed Metric Values
    Are the performance metrics used for the review year and proposed
    for future years sufficient and reasonable for assessing Operational
    performance?
  The OLCF provides (below) a summary table of the metrics and actuals
  for 2011, and proposed metrics and targets for 2012 and 2013.
