Reprints from my posting to SAN-Tech Mailing List and ...

2011/06/11

[san-tech][03110] Presentation materials: Resilience Summit 2010 (2010/10/13)

Date: Tue, 24 May 2011 18:22:16 +0900
--------------------------------------------------
The meeting itself took place in October 2010, but the presentation
materials from Resilience Summit 2010 are now available:

Resilience Summit 2010
  http://www.csm.ornl.gov/srt/conferences/ResilienceSummit/2010/

Titles of the published presentations
"Hard Data on Soft Errors: A Global-Scale Assessment of GPGPU Memory Soft Error Rates"
"Soft Errors, Silent Data Corruption, and Exascale Computing"
"Scalable HPC System Monitoring"
"Mining event log patterns in HPC systems"
"Integrating Fault Tolerance into the Monte Carlo Application Toolkit"
"An Uncoordinated Checkpoint Protocol for Send-deterministic HPC Application"
"VolpexMPI: Robust Execution of MPI Applications through Process Replication"


Related site (a project that was introduced in one of the talks)
MemtestG80 and MemtestCL
  https://simtk.org/home/memtest


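I recall writing earlier that MPP-class HPC had not taken checkpointing
very seriously, because storing checkpoints is likely to slow down the
computation itself. However, as systems (including their networks) have
grown in scale, the failure rate of individual parts now affects the
whole system to a degree that can no longer be ignored. As a result,
work is progressing on various checkpointing methods and on technologies
that complement them (such as the development of more reliable MPI
communication).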
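
The scaling effect described above can be made concrete with a rough
calculation. The sketch below is not taken from any of the talks. It
assumes independent components with exponentially distributed failures
(so the system MTBF is the component MTBF divided by the component
count) and uses Young's first-order approximation for the checkpoint
interval. The 5-year component MTBF and the 15-minute checkpoint cost
are made-up illustrative numbers, and all function names are
hypothetical.

```python
import math


def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that stops when any one of n components fails.

    Assumption: failures are independent and exponentially distributed.
    """
    return component_mtbf_hours / n_components


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation: sqrt(2 * C * MTBF).

    Only a reasonable estimate while the checkpoint cost C is much
    smaller than the MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_write_fraction(checkpoint_cost_hours: float,
                              interval_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints.

    Ignores the work that is lost and redone after each failure.
    """
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)


if __name__ == "__main__":
    component_mtbf = 5 * 365 * 24.0   # assumed: 5 years per node, in hours
    checkpoint_cost = 0.25            # assumed: 15 minutes per checkpoint

    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf(component_mtbf, nodes)
        tau = young_interval(checkpoint_cost, mtbf)
        frac = checkpoint_write_fraction(checkpoint_cost, tau)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:7.2f} h, "
              f"interval {tau:5.2f} h, write fraction {frac:5.1%}")
```

Under these assumptions the system MTBF shrinks linearly with the node
count, so the checkpoint interval has to shrink as well, and a growing
share of the run goes to writing checkpoints.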

I have mentioned it before, but here is an information site on resilience (not limited to HPC)
HPC Resilience Consortium Wiki!
  http://resilience.latech.edu/mediawiki/index.php/Main_Page
Resources
  http://resilience.latech.edu/mediawiki/index.php/Resources
Papers
Checkpoint/Failure and Anomaly Prediction/Failure (Related Papers)/
Large Scale Application


On the Open MPI side:
Open Resilient Cluster Manager (ORCM)
  http://www.open-mpi.org/projects/orcm/
"The Open Resilient Cluster Manager (ORCM, or OpenRCM) is an open-source
 project focused on development of an "always on" resource manager for
 high-performance computing systems of any size."


A new MPI-related paper (the core members are from Sandia Labs)
"rMPI : increasing fault resiliency in a message-passing environment."
 Kurt Ferreira, Rolf Riesen (IBM), Ron Oldfield, Jon Stearley,
 James Laros, Kevin Pedretti, Ron Brightwell, Sandia National Laboratories
 SANDIA REPORT, SAND2011-2488, Unlimited Release, Printed April, 2011
  http://prod.sandia.gov/techlib/access-control.cgi/2011/112488.pdf

Abstract
  "... Current techniques to ensure progress across faults, like
   checkpoint-restart, are unsuitable at these scale due to excessive
   overheads predicted to more than double an applications time to
   solution. Redundant computation, long used in distributed and mission
   critical systems, has been suggested as an alternative to
   checkpoint-restart on its own. In this paper we describe the rMPI
   library which enables portable and transparent redundant computation
   for MPI applications. We detail the design of the library as well as
   two replica consistency protocols, outline the overheads of this
   library at scale on a number of real-world applications, and finally
   outline the significant increase in an applications time to solution
   at extreme scale as well as show the scenarios in which redundant
   computation makes sense."

* Unfortunately, there does not seem to be a public site for it yet.
These members have been researching fault resiliency in MPI for several years.
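
The trade-off behind redundant computation can be illustrated with
simple arithmetic. The sketch below is not rMPI's protocol and uses no
numbers from the report. It assumes each node fails independently with
probability p during the job, with no repair, and compares a job that
dies when any rank is lost against one where a rank is lost only when
all of its replicas fail. The function names and the value of p are
hypothetical.

```python
def survival_without_replication(p: float, n_ranks: int) -> float:
    """Probability that all n ranks survive when each rank has one node."""
    return (1.0 - p) ** n_ranks


def survival_with_replication(p: float, n_ranks: int, degree: int = 2) -> float:
    """Probability that every rank keeps at least one live replica.

    A rank is lost only if all `degree` replicas fail, which happens
    with probability p ** degree under the independence assumption.
    """
    return (1.0 - p ** degree) ** n_ranks


if __name__ == "__main__":
    p = 1e-4  # assumed per-node failure probability during one job

    for ranks in (1_000, 10_000, 100_000):
        plain = survival_without_replication(p, ranks)
        dual = survival_with_replication(p, ranks)
        print(f"{ranks:>7} ranks: no replication {plain:.4f}, "
              f"dual replication {dual:.4f} (uses {2 * ranks} nodes)")
```

The price of the higher survival probability is that dual replication
occupies twice as many nodes for the same number of ranks, which is why
it only pays off in some scenarios.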
