[san-tech][03110] Presentation materials: Resilience Summit 2010 (2010/10/13)

Date: Tue, 24 May 2011 18:22:16 +0900
The meeting itself was held in October 2010, but the presentation materials
for Resilience Summit 2010 are now public.

Resilience Summit 2010

"Hard Data on Soft Errors: A Global-Scale Assessment of GPGPU Memory Soft Error Rates"
"Soft Errors, Silent Data Corruption, and Exascale Computing"
"Scalable HPC System Monitoring"
"Mining event log patterns in HPC systems"
"Integrating Fault Tolerance into the Monte Carlo Application Toolkit"
"An Uncoordinated Checkpoint Protocol for Send-deterministic HPC Application"
"VolpexMPI: Robust Execution of MPI Applications through Process Replication"

Related site (a project that was introduced in one of the talks)
MemtestG80 and MemtestCL

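For some time, in MPP-class HPC, storing checkpoints has dragged down the
execution of the computation ... checkpointing by such methods, and the
technologies that complement it (such as the development of highly reliable
MPI communication), have been under development.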

I have introduced it before, but here is a resilience information site (not limited to HPC)
HPC Resilience Consortium Wiki!
Checkpoint/Failure and Anomaly Prediction/Failure (Related Papers)/
Large Scale Application

For Open MPI:
Open Resilient Cluster Manager (ORCM)
"The Open Resilient Cluster Manager (ORCM, or OpenRCM) is an open-source
 project focused on development of an "always on" resource manager for
 high-performance computing systems of any size."

A new MPI-related paper (the core members are from Sandia Labs)
"rMPI : increasing fault resiliency in a message-passing environment."
 Kurt Ferreira, Rolf Riesen (IBM), Ron Oldfield, Jon Stearley,
 James Laros, Kevin Pedretti, Ron Brightwell, Sandia National Laboratories
 SANDIA REPORT, SAND2011-2488, Unlimited Release, Printed April, 2011

  "... Current techniques to ensure progress across faults, like
   checkpoint-restart, are unsuitable at these scale due to excessive
   overheads predicted to more than double an applications time to
   solution. Redundant computation, long used in distributed and mission
   critical systems, has been suggested as an alternative to
   checkpoint-restart on its own. In this paper we describe the rMPI
   library which enables portable and transparent redundant computation
   for MPI applications. We detail the design of the library as well as
   two replica consistency protocols, outline the overheads of this
   library at scale on a number of real-world applications, and finally
   outline the significant increase in an applications time to solution
   at extreme scale as well as show the scenarios in which redundant
   computation makes sense."

These members have been researching fault resiliency in MPI for several years.
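
To give a rough feel for why checkpoint-restart overhead grows with system
size, here is a minimal Python sketch. It is not taken from the rMPI report.
It assumes Young's first-order approximation for the checkpoint interval and
assumes that system MTBF is node MTBF divided by node count. The function
names, the 5-year node MTBF and the 600-second checkpoint cost are all
hypothetical values chosen for illustration. The first-order model also loses
accuracy once the checkpoint cost approaches the system MTBF, so the large-scale
numbers should be read as a trend only.

```python
import math


def system_mtbf(node_mtbf_s, nodes):
    """Assumed model: independent node failures, so MTBF shrinks as 1/nodes."""
    return node_mtbf_s / nodes


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal compute time between checkpoints (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(checkpoint_cost_s, mtbf_s):
    """First-order fraction of wall time lost to checkpoint writes and rework."""
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)
    return min(waste, 1.0)  # the approximation can exceed 1, so clamp it


def slowdown(checkpoint_cost_s, mtbf_s):
    """Time-to-solution multiplier implied by the wasted fraction."""
    waste = wasted_fraction(checkpoint_cost_s, mtbf_s)
    return math.inf if waste >= 1.0 else 1.0 / (1.0 - waste)


if __name__ == "__main__":
    NODE_MTBF_S = 5 * 365 * 24 * 3600.0   # hypothetical: 5 years per node
    CHECKPOINT_COST_S = 600.0             # hypothetical: 10 minutes per checkpoint
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf(NODE_MTBF_S, nodes)
        print(f"nodes={nodes:>7}  system MTBF={mtbf / 3600.0:8.2f} h  "
              f"waste={wasted_fraction(CHECKPOINT_COST_S, mtbf):.2f}  "
              f"slowdown={slowdown(CHECKPOINT_COST_S, mtbf):.2f}x")
```

Under these assumptions a wasted fraction of 0.5 corresponds to a doubled time
to solution, which is the kind of overhead the abstract above is concerned
about. Redundant computation attacks the same problem from the other side, by
raising the effective MTBF seen by the application.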

