Date: Tue, 24 May 2011 18:22:16 +0900
--------------------------------------------------
The meeting itself was held in October 2010, but the presentation
materials from Resilience Summit 2010 have now been published:
Resilience Summit 2010
http://www.csm.ornl.gov/srt/conferences/ResilienceSummit/2010/
Titles of the published presentations
"Hard Data on Soft Errors: A Global-Scale Assessment of GPGPU Memory Soft Error Rates"
"Soft Errors, Silent Data Corruption, and Exascale Computing"
"Scalable HPC System Monitoring"
"Mining event log patterns in HPC systems"
"Integrating Fault Tolerance into the Monte Carlo Application Toolkit"
"An Uncoordinated Checkpoint Protocol for Send-deterministic HPC Application"
"VolpexMPI: Robust Execution of MPI Applications through Process Replication"
Related site (a project that was introduced in one of the talks)
MemtestG80 and MemtestCL
https://simtk.org/home/memtest
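I recall writing earlier that checkpointing has not been pursued very
seriously on MPP-class HPC systems, because storing the checkpoints is
likely to hold back the computation itself. However, as systems
(including their networks) have grown larger, the effect of each
component's failure rate on the system as a whole can no longer be
ignored, and work is now under way on various checkpointing methods and
on technologies that complement them (such as the development of more
reliable MPI communication).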
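
To make the scaling argument above concrete, here is a minimal sketch
of how the system-wide MTBF and the checkpoint overhead change with
node count. It assumes independent, exponentially distributed node
failures and uses Young's first-order approximation for the checkpoint
interval. The 5-year node MTBF and the 15-minute checkpoint cost are
hypothetical numbers chosen for illustration, not figures from any of
the talks.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent exponential node failures."""
    return node_mtbf_hours / node_count


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24   # hypothetical: 5 years per node, in hours
    checkpoint_cost = 0.25     # hypothetical: 15 minutes to store a checkpoint

    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        interval = young_interval_hours(checkpoint_cost, mtbf)
        # Fraction of wall-clock time spent writing checkpoints,
        # ignoring the additional cost of restart and lost work.
        overhead = checkpoint_cost / (interval + checkpoint_cost)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.2f} h, "
              f"interval {interval:6.2f} h, checkpoint overhead {overhead:6.1%}")
```

Under these assumptions the system MTBF falls in proportion to the node
count, so the checkpoint interval shrinks and the share of time spent
writing checkpoints grows, which is the pressure described above.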
I have introduced it before, but here is an information site on
resilience (not limited to HPC)
HPC Resilience Consortium Wiki!
http://resilience.latech.edu/mediawiki/index.php/Main_Page
Resources
http://resilience.latech.edu/mediawiki/index.php/Resources
Papers
Checkpoint/Failure and Anomaly Prediction/Failure (Related Papers)/
Large Scale Application
On the Open MPI side:
Open Resilient Cluster Manager (ORCM)
http://www.open-mpi.org/projects/orcm/
"The Open Resilient Cluster Manager (ORCM, or OpenRCM) is an open-source
project focused on development of an "always on" resource manager for
high-performance computing systems of any size."
A new MPI-related paper (the core members are from Sandia Labs)
"rMPI : increasing fault resiliency in a message-passing environment."
Kurt Ferreira, Rolf Riesen (IBM), Ron Oldfield, Jon Stearley,
James Laros, Kevin Pedretti, Ron Brightwell, Sandia National Laboratories
SANDIA REPORT, SAND2011-2488, Unlimited Release, Printed April, 2011
http://prod.sandia.gov/techlib/access-control.cgi/2011/112488.pdf
Abstract
"... Current techniques to ensure progress across faults, like
checkpoint-restart, are unsuitable at these scale due to excessive
overheads predicted to more than double an applications time to
solution. Redundant computation, long used in distributed and mission
critical systems, has been suggested as an alternative to
checkpoint-restart on its own. In this paper we describe the rMPI
library which enables portable and transparent redundant computation
for MPI applications. We detail the design of the library as well as
two replica consistency protocols, outline the overheads of this
library at scale on a number of real-world applications, and finally
outline the significant increase in an applications time to solution
at extreme scale as well as show the scenarios in which redundant
computation makes sense."
* Unfortunately, there does not appear to be a public project site yet.
These members have been researching fault resiliency in MPI for
several years.
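
As a rough illustration of why redundant computation becomes attractive
at scale, the sketch below estimates the probability that a job finishes
without losing any logical MPI rank, with and without replicas. This is
a simple probability model, not rMPI's API or its consistency protocols.
It assumes independent exponential node failures and no repair of a
failed replica during the run; the function name and all numbers are
hypothetical.

```python
import math


def survival_probability(ranks: int, node_mtbf_hours: float,
                         run_hours: float, replicas: int = 1) -> float:
    """Probability that no logical rank is lost during the run.

    A logical rank is lost only when all of its replicas fail.
    """
    # Probability that a single node fails during the run: 1 - exp(-t / M).
    p_node_fails = -math.expm1(-run_hours / node_mtbf_hours)
    # Probability that every replica of one logical rank fails.
    p_rank_lost = p_node_fails ** replicas
    # (1 - p_rank_lost) ** ranks, computed via log1p for numerical stability.
    return math.exp(ranks * math.log1p(-p_rank_lost))


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24   # hypothetical: 5 years per node, in hours
    run_hours = 24.0           # hypothetical: one-day job

    for ranks in (1_000, 10_000, 100_000):
        single = survival_probability(ranks, node_mtbf, run_hours, replicas=1)
        dual = survival_probability(ranks, node_mtbf, run_hours, replicas=2)
        print(f"{ranks:>7} ranks: no replication {single:.4f}, "
              f"dual redundancy {dual:.4f}")
```

The price of the higher survival probability is that dual redundancy
uses twice as many nodes for the same logical job, which is the
trade-off the abstract weighs against checkpoint-restart overhead.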