Reprints from my posting to SAN-Tech Mailing List and ...
Showing posts with the label Resilience. Show all posts

2012/06/04

[san-tech][03562] Slide: HEPiX Spring 2012 Workshop (23-27 April 2012)

Date: Mon, 04 Jun 2012 20:47:38 +0900
--------------------------------------------------
This is a somewhat specialized workshop, but the slides from the following
event, held April 23-27, have been published:

HEPiX Spring 2012 Workshop, 23-27 April 2012
  http://indico.cern.ch/conferenceDisplay.py?confId=160737

"HEPiX meetings bring together IT system support engineers from
 the High Energy Physics (HEP) laboratories, institutes, and universities,
 such as BNL, CERN, DESY, FNAL, IN2P3, INFN, JLAB, NIKHEF, RAL, SLAC,
 TRIUMF and others."

It includes reports from sites such as CERN that manage truly large-scale data in practice.

2012/04/20

[san-tech][03534] Violin Memory vRAID Technology

Date: Fri, 20 Apr 2012 02:29:45 +0900
--------------------------------------------------
2012/07/09
"Violin Memory Enterprise Flash Serviceability",
 2012/06/12, StorageMojo
  http://www.youtube.com/watch?v=FW3Gjvkr6vg
  "Violin Memory's V6000 flash storage array has exceptionally low
   and consistent latency as proven on audited TPC-C benchmarks.
   One reason: they don't use standard SSDs. Instead they use
   Violin Intelligent Memory Modules - VIMMs - that, unlike DIMMs,
   are hot swappable."

--------------------------------------------------
2012/06/27
"Violin Memory Enterprise Flash Serviceability"
 2012/06/12, StorageMojo
  http://www.youtube.com/watch?v=FW3Gjvkr6vg
   "Violin Memory's V6000 flash storage array has exceptionally low
   and consistent latency as proven on audited TPC-C benchmarks.
   One reason: they don't use standard SSDs. Instead they use
   Violin Intelligent Memory Modules - VIMMs - that, unlike DIMMs,
   are hot swappable."

--------------------------------------------------
A video from StorageMojo introducing Violin Memory's (Hardware-Based vRAID) technology:

"Violin Memory: a clean-sheet flash architecture"
 2012/04/11, StorageMojo, 4:41, 720p
  http://www.youtube.com/watch?&v=L2VibZhNFbE

"Violin's clean-sheet architecture"
 11 April, 2012, StorageMojo
  http://storagemojo.com/2012/04/11/violins-clean-sheet-architecture/

2011/09/19

[san-tech][03372] IBM 100 petaOPS Supercomputer patent application published, September 8, 2011

Date: Sat, 17 Sep 2011 14:07:59 +0900
--------------------------------------------------
A patent application that IBM filed on January 10, 2011 was published on
September 8, 2011 (the application was published; the patent has not been granted):

Title: MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER
United States Patent Application: 20110219208
Kind Code: A1
Inventors: Asaad; Sameh ; et al.
Assignee: International Business Machines Corporation
Publication Date: September 8, 2011
Filed Date: January 10, 2011
  http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PG01&s1=20110219208&OS=20110219208&RS=20110219208
※ The link above points to the United States Patent and Trademark Office (USPTO).

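Clicking "Images" should display the scanned application pages, but for me it
does not work, either because of a compatibility problem with my browser (Chrome)
or because of the size of the document.
※ At A4 (Letter?) size it runs to 649 pages, of which about 460 are figures and tables.
※ A PDF file is also available from the private site mentioned later, but it is
an image-only PDF and cannot be searched by keyword. The URL above is flat HTML
(though hard to read).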

2011/08/01

[san-tech][03317] SSD reliability ("Investigation: Is Your SSD More Reliable Than A Hard Drive?", July 29, 2011, Tom's Hardware)

Date: Sun, 31 Jul 2011 23:43:46 +0900
--------------------------------------------------
Evaluating HDD reliability is difficult, and the same is true of SSDs:

"Investigation: Is Your SSD More Reliable Than A Hard Drive?"
 July 29, 2011, Tom's Hardware

  "Does a lack of moving parts translate to higher reliability?
   That's the assumption many enthusiasts and IT professionals make
   about SSDs. We go straight to the data centers using these devices,
   dig into failure rate statistics, and suggest otherwise."

2011/06/13

[san-tech][01877] Re: Report on DRAM reliability

Date: Sun, 11 Oct 2009 12:42:56 +0100
------------------------------------------------
[san-tech][01874] Report on DRAM reliability

Robin Harris of StorageMojo covers this on ZDNet rather than on his own site:
"DRAM error rates: Nightmare on DIMM street", October 4th, 2009
  http://blogs.zdnet.com/storage/?p=638&tag=col1;post-638

"A two-and-a-half year study of DRAM on 10s of thousands Google servers
 found DIMM error rates are hundreds to thousands of times higher
 than thought - a mean of 3,751 correctable errors per DIMM per year."
Table 1: Memory errors per year:
  http://i.zdnet.com/blogs/picture-27.png
This is a table excerpted from the original paper.
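
To get a feel for the quoted mean of 3,751 correctable errors per DIMM per year, the arithmetic can be sketched as below. This is only a back-of-the-envelope conversion: the function names are made up for illustration, the 8-DIMM server is a hypothetical example rather than a figure from the article, and a mean can be dominated by a small number of DIMMs with many errors, so it does not describe a typical DIMM.

```python
# Mean correctable errors per DIMM per year, as quoted from the ZDNet article.
CORRECTABLE_ERRORS_PER_DIMM_PER_YEAR = 3751

DAYS_PER_YEAR = 365                    # leap years ignored
HOURS_PER_YEAR = DAYS_PER_YEAR * 24    # 8760


def errors_per_day(per_year: float) -> float:
    """Convert a yearly per-DIMM error count into a daily one."""
    return per_year / DAYS_PER_YEAR


def mean_hours_between_errors(per_year: float) -> float:
    """Average spacing between errors if they were spread evenly over a year."""
    return HOURS_PER_YEAR / per_year


def fleet_errors_per_day(per_year: float, dimms: int) -> float:
    """Expected daily correctable errors across `dimms` DIMMs at the mean rate."""
    return errors_per_day(per_year) * dimms


if __name__ == "__main__":
    rate = CORRECTABLE_ERRORS_PER_DIMM_PER_YEAR
    print(f"per DIMM per day: {errors_per_day(rate):.2f}")
    print(f"mean hours between errors: {mean_hours_between_errors(rate):.2f}")
    # Hypothetical server with 8 DIMMs (an assumption, not from the article).
    print(f"8-DIMM server per day: {fleet_errors_per_day(rate, 8):.1f}")
```

At the mean rate this works out to roughly 10 correctable errors per DIMM per day, or one every 2.3 hours or so.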

[san-tech][01874] Report on DRAM reliability

Date: Fri, 09 Oct 2009 22:20:48 +0100
------------------------------------------------
2011/06/13
[san-tech][01877] Re: Report on DRAM reliability
------------------------------------------------
Dr. Bianca Schroeder, who previously studied disk reliability in Professor
Gibson's group at CMU, has published a report on DRAM reliability:

"DRAM errors in the wild: A Large-Scale Field Study."
 B. Schroeder, E. Pinheiro, W.-D. Weber. Sigmetrics/Performance 2009
  http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
ABSTRACT
"The goal of this paper is to answer questions such as the following:
 How common are memory errors in practice? What are their
 statistical properties? How are they affected by external factors,
 such as temperature and utilization, and by chip-specific factors,
 such as chip density, memory technology and DIMM age?"
And as for where the analyzed data came from,

2011/06/12

[san-tech][02140] "Disaster Recovery by Google", 2010/03/04, Official Google Enterprise Blog

Date: Tue, 09 Mar 2010 21:42:33 +0900
------------------------------------------------
The following was posted on the Official Google Enterprise Blog on March 4, 2010:

"Disaster Recovery by Google", March 04, 2010
 Posted by Rajen Sheth, Senior Product Manager, Google Apps
  http://googleenterprise.blogspot.com/2010/03/disaster-recovery-by-google.html

It includes this passage:

  "For Google Apps customers, our RPO design target is zero, and our
   RTO design target is instant failover. We do this through live or
   synchronous replication: every action you take in Gmail is
   simultaneously replicated in two data centers at once, so that if
   one data center fails, we nearly instantly transfer your data over
   to the other one that's also been reflecting your actions."
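
The idea in the quote (acknowledge an action only after both data centers hold it, so that a failure loses nothing that was acknowledged) can be illustrated with a toy in-memory sketch. This is not Google's implementation; `DataCenter`, `SyncReplicatedStore`, and every other name here are hypothetical, and real systems also have to deal with networks, ordering, and recovery of the failed site.

```python
from dataclasses import dataclass, field


class ReplicaDown(Exception):
    """Raised when a data center cannot accept or serve requests."""


@dataclass
class DataCenter:
    """Hypothetical stand-in for one data center holding a log of actions."""
    name: str
    up: bool = True
    log: list = field(default_factory=list)

    def apply(self, action: str) -> None:
        if not self.up:
            raise ReplicaDown(self.name)
        self.log.append(action)


class SyncReplicatedStore:
    """Acknowledges a write only after every active replica has applied it."""

    def __init__(self, *replicas: DataCenter) -> None:
        self.active = list(replicas)
        self.acked = []  # actions that were acknowledged to the user

    def write(self, action: str) -> None:
        survivors = []
        for dc in self.active:
            try:
                dc.apply(action)
                survivors.append(dc)
            except ReplicaDown:
                pass  # failover: stop using the failed data center
        if not survivors:
            raise ReplicaDown("no data center available")
        self.active = survivors
        self.acked.append(action)  # acknowledged only now

    def read_all(self) -> list:
        """Serve reads from the first data center that is still up."""
        for dc in self.active:
            if dc.up:
                return list(dc.log)
        raise ReplicaDown("no data center available")


if __name__ == "__main__":
    a, b = DataCenter("dc-a"), DataCenter("dc-b")
    store = SyncReplicatedStore(a, b)
    store.write("send mail 1")
    store.write("archive mail 2")
    a.up = False                      # one data center fails
    store.write("label mail 3")       # still succeeds on the survivor
    lost = [x for x in store.acked if x not in store.read_all()]
    print("acknowledged actions lost:", len(lost))
```

Because an action is acknowledged only after all active replicas have applied it, the surviving data center already holds every acknowledged action when the other fails, which is what an RPO target of zero asks for.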

[san-tech][02137] Debugger development for large-scale HPC

Date: Tue, 09 Mar 2010 16:21:49 +0900
------------------------------------------------
The debugger environment for large-scale HPC being developed by Professor
Abramson of Monash University, Australia, has received US DOE funding:

"International recognition for Australian supercomputer debuggers"
 18 February 2010 (the HPCwire article is dated March 08, 2010)
  http://www.monash.edu.au/news/newsline/story/1578

  "The research team, led by the Lab's Director, Professor David
   Abramson, recently received funding support from the United
   States Department of Energy, an agency leading an international
   supercomputer R&D consortium that includes IBM, and has a
   commercialisation agreement with supercomputer manufacturing
   giant Cray."

[san-tech][02121] TRAMS: European next-generation high-reliability memory project

Date: Sat, 27 Feb 2010 21:07:01 +0900
------------------------------------------------
A new European memory development project:

TRAMS : Terascale reliable adaptive memory systems
  http://cordis.europa.eu/fetch?CALLER=PROJ_ICT&ACTION=D&CAT=PROJ&RCN=93073
  Total cost: 3.43 million euro
  Execution: From 2010-01-01 to 2012-12-31 (36 months)

It appears to target 16nm CMOS (late CMOS) and 10nm CMOS (beyond CMOS),
though the budget covers only three years.

 "The TRAMS project is the bridge for reliable, energy efficient and
  cost effective computing in the era of nanoscale challenges and
  teraflop opportunities."

2011/06/11

[san-tech][02463] Presentation materials: two HPC resilience workshops (Resilience 2010, 2010/05/17 & FTXS 2010, 2010/06/28)

Date: Thu, 15 Jul 2010 17:26:59 +0900
--------------------------------------------------
These are HPC-oriented, but here are the presentation materials from two workshops on resilience:

3rd Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids, May 17, 2010
  http://xcr.cenit.latech.edu/resilience2010/

1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2010)
 June 28th, 2010
  http://institute.lanl.gov/resilience/workshops/ftxs2010/

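For both workshops, the presentation materials can be downloaded from the pages linked above.
With exaflops expected in 2018, the community will presumably now start on
things like defining terminology (some of the content overlaps between the two).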

[san-tech][03110] Presentation materials: Resilience Summit 2010 (2010/10/13)

Date: Tue, 24 May 2011 18:22:16 +0900
--------------------------------------------------
The meeting took place in October 2010, but the presentation materials from
Resilience Summit 2010 are now available:

Resilience Summit 2010
  http://www.csm.ornl.gov/srt/conferences/ResilienceSummit/2010/

Titles of the published presentations:
"Hard Data on Soft Errors: A Global-Scale Assessment of GPGPU Memory Soft Error Rates"
"Soft Errors, Silent Data Corruption, and Exascale Computing"
"Scalable HPC System Monitoring"
"Mining event log patterns in HPC systems"
"Integrating Fault Tolerance into the Monte Carlo Application Toolkit"
"An Uncoordinated Checkpoint Protocol for Send-deterministic HPC Application"
"VolpexMPI: Robust Execution of MPI Applications through Process Replication"

[san-tech][02097] Re: US HEC/HPC Resilience report

Date: Tue, 16 Feb 2010 16:12:03 +0900
--------------------------------------------------
[san-tech][02096] US HEC/HPC Resilience report

Speaking of resilience, Open MPI has also launched the
Open Resilient Cluster Manager (ORCM) project:
  http://www.open-mpi.org/projects/orcm/

On the MPI side, there is the Fault Tolerance Backplane (FTB) project at the
lab led by Professor Panda of The Ohio State University:

Network Based Computing Lab, The Ohio State University.
  http://nowlab.cse.ohio-state.edu/

Fault Tolerance Backplane (FTB)
  http://nowlab.cse.ohio-state.edu/projects/ftb-ib/index.html

[san-tech][02096] US HEC/HPC Resilience report

Date: Tue, 16 Feb 2010 13:49:30 +0900
--------------------------------------------------
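New reports on resilience (the capacity to recover and restore) in HEC
(High-End Computing) have been published in the US (three reports in total):
※ Resilience will likely become a keyword for large-scale systems in general, not only HPC.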

"High‐End Computing Resilience: Analysis of Issues Facing the HEC
 Community and Path‐Forward for Research and Development"
  DOE NNSA: Nathan DeBardeleben, LANL, James Laros, SNL
  DOD ACS Research Program: John Daly, CEC
  DOE Office of Science: Stephen Scott, ORNL, Christian Engelmann, ORNL
  DOD DARPA: Bill Harrod, IPTO
  *)This document was cleared by DARPA on [1/20/10].
  http://institute.lanl.gov/resilience/docs/HECResilience_WhitePaper_Jan2010_final.pdf
  ※ This summarizes the workshop introduced in the following posts:
[san-tech][01803] National HPC Workshop on Resilience 2009 materials published
[san-tech][01931] Re: National HPC Workshop on Resilience 2009 materials published
[san-tech][01997] Re: National HPC Workshop on Resilience 2009 materials published

2011/06/10

[san-tech][01433] Risk Management Techniques and Practice Workshop for HPC Centers

Date: Wed, 14 Jan 2009 12:09:05 +0900
--------------------------------------------------
Introducing a rather unusual workshop (in a sense, very American):
(I finally found the web page with the public materials.)

Risk Management Techniques and Practice Workshop for
High-Performance Computing Centers, September 17 - 18, 2008
https://rmtap.llnl.gov/

Abstract and Goals
https://rmtap.llnl.gov/abstract.php
"PURPOSE: To assess current and emerging techniques, practices,
 and lessons learned for effectively identifying, understanding,
 managing, and mitigating risks associated with acquiring
 leading-edge computing systems at high-performance computing
 centers (HPCCs)."

"AUDIENCE: HPCC managers and key staff who are planning for
 leading-edge systems."
※ It is aimed at operations staff rather than researchers.

[san-tech][02835] Presentation materials: 5th Petascale Data Storage Workshop, SC10 (2010/11/15)

Date: Sat, 27 Nov 2010 21:13:47 +0900
--------------------------------------------------
Papers and slides have been published from the Petascale Data Storage Workshop,
held on November 15, 2010 as a workshop co-located with SC10 (Supercomputing '10)
and organized by PDSI (Petascale Data Storage Institute):

5th Petascale Data Storage Workshop, Supercomputing '10
 November 15, 2010
  http://www.pdsi-scidac.org/events/PDSW10/

[san-tech][02627] IDT VRMs (Voltage Regulator Modules) powering the SGI Altix UV

Date: Wed, 15 Sep 2010 13:04:07 +0900
--------------------------------------------------
A press release from IDT:

"IDT Voltage Regulator Modules Power The World's Fastest Supercomputer"
 September 14, 2010
  http://www.idt.com/?id=5751

  "Altix UV utilizes the IDT Power VRMs for microprocessor, ASIC and
   memory power requirements."

  "The VRMs selected by SGI utilize the IDT-patented coupled inductor
   technology to improve performance and reduce power consumption in
   computing applications."

  "Each Altix UV server blade uses nine high-density IDT VRMs to
   deliver fast transient response, tight regulation, high efficiency,
   and reliability."

2011/06/09

[san-tech][02438] Presentation materials: FAST-OS Workshop (2010/06/22)

Date: Fri, 02 Jul 2010 12:46:54 +0900
--------------------------------------------------
The presentation materials from the FAST-OS Workshop held on June 22, 2010 have been published:

FAST-OS Workshop, June 22, 2010
  http://www.usenix.org/events/fastos10/
※ It was held as part of the 2010 USENIX Federated Conferences Week,
June 22-25, 2010, which I will cover in a separate post.

USENIX 2010 Workshop
  http://www.fastos2.org/usenix-2010-workshop
※ Clicking "Slides" opens a new tab (window).
The materials can be downloaded using the button at the top left of the new screen.

I will skip an explanation of the FAST-OS project itself, but below is a brief
list of the published talks and related links (presenters omitted):

[san-tech][03151] "Survey of Error and Fault Detection Mechanisms", Technical report, April 2011

Date: Sun, 05 Jun 2011 07:35:15 +0900
--------------------------------------------------
A survey report that should help with a basic understanding of trends in
system fault tolerance and resiliency:

"Survey of Error and Fault Detection Mechanisms"
 Ikhwan Lee, ..... Mattan Erez, The University of Texas at Austin
 Technical report TR-LPH-2011-002, April 2011 (24 pages)
  http://lph.ece.utexas.edu/merez/uploads/MattanErez/detection_mechanisms_TR_LPH_2011_002.pdf

Abstract
"This report describes diverse error detection mechanisms that can be
 utilized within a resilient system to protect applications against
 various types of errors and faults, both hard and soft. These
 detection mechanisms have different overhead costs in terms of energy,
 performance, and area, and also differ in their error coverage,
 complexity, and programmer effort.

[san-tech][02307] "Silent Corruptions", CERN, 2007

Date: Wed, 19 May 2010 12:17:16 +0900
--------------------------------------------------
These are somewhat old materials (I thought I had already introduced them here):

"Silent Corruptions",
 KELEMEN Peter, CERN, June 1st, 2007
  http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

"Data integrity"
 Bernd Panzer-Steindel, CERN/IT
 Draft 1.3 8. April 2007
  http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
※ This is a PDF file.

Japanese-language blog:
"あなたのデータは既に壊れているかもしれない(Silent Data Corruption)" ("Your data may already be corrupted")
 July 27, 2009, 私家版 ITプロフェッショナルの仕事術
  http://raven.air-nifty.com/night/2009/07/silent-data-cor.html

[san-tech][02207] Re: RAID reliability model (doctoral dissertation: Jon Elerath, NetApp)

Date: Mon, 12 Apr 2010 00:55:57 +0900
--------------------------------------------------
[san-tech][01482] RAID reliability model (doctoral dissertation: Jon Elerath, NetApp)
[san-tech][01693] Re: RAID reliability model (doctoral dissertation: Jon Elerath, NetApp)
[san-tech][02154] Re: RAID reliability model (doctoral dissertation: Jon Elerath, NetApp)

> "An Analysis of Latent Sector Errors in Disk Drives"
>  Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, Jiri Schindler.
>  Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'07)
>  San Diego, California. June 2007.
>   http://www.cs.wisc.edu/adsl/Publications/trust-storagess07.html
>   http://www.cs.wisc.edu/adsl/Publications/trust-storagess07.pdf

An analysis based on the paper above:
"Evaluating the Impact of Irrecoverable Read Errors on Disk Array Reliability"
 Jehan-Francois Paris, Ahmed Amer, Darrell D. E. Long and Thomas Schwarz
 Proceedings of the IEEE 15th Pacific Rim International Symposium on
 Dependable Computing (PRDC09)
  http://www.ssrc.ucsc.edu/pub/paris09-prdc.html
  http://www2.cs.uh.edu/~paris/MYPAPERS/Prdc09.pdf