Reprints from my posting to SAN-Tech Mailing List and ...

2011/06/09

[san-tech][03151] "Survey of Error and Fault Detection Mechanisms", Technical report, April 2011

Date: Sun, 05 Jun 2011 07:35:15 +0900
--------------------------------------------------
This is a survey report that should be useful for a basic understanding of
trends in system fault tolerance and resiliency:

"Survey of Error and Fault Detection Mechanisms"
 Ikhwan Lee, ..... Mattan Erez, The University of Texas at Austin
 Technical report TR-LPH-2011-002, April 2011 (24 pages)
  http://lph.ece.utexas.edu/merez/uploads/MattanErez/detection_mechanisms_TR_LPH_2011_002.pdf

Abstract
"This report describes diverse error detection mechanisms that can be
 utilized within a resilient system to protect applications against
 various types of errors and faults, both hard and soft. These
 detection mechanisms have different overhead costs in terms of energy,
 performance, and area, and also differ in their error coverage,
 complexity, and programmer effort.
 In order to achieve the highest efficiency in designing and running
 a resilient computer system, one must understand the trade-offs among
 the aforementioned metrics for each detection mechanism and choose
 the most efficient option for a given running environment. To
 accomplish such a goal, we first enumerate many error detection
 techniques previously suggested in the literature."

1 Introduction
2 Failure Mechanisms
3 Detection Mechanisms for Memory
3.1 Information Redundancy
3.2 Cache Memory Error Protection
3.3 Main Memory Error Protection
4 Detection Mechanisms for Compute
4.1 Circuit-level Techniques
4.2 Architecture-level Techniques
4.2.1 Code-based Techniques
4.2.2 Execution Redundancy
4.3 Software Systems
4.4 Application-level Techniques
4.4.1 Algorithmic Based Fault Tolerance (ABFT)
4.4.2 Assertion and Sanity-Based Fault Tolerance
4.5 Hybrid Techniques
5 System-Level Detection Mechanisms
5.1 Detection at the Core Level
5.2 Detection at the System Level
5.2.1 Detecting Network Failures
5.2.2 Detecting Node Failures
6 Conclusion
Acknowledgements
References [1] - [81]


Mattan Erez, Assistant Professor
 Electrical and Computer Engineering Department,
 The University of Texas at Austin
  http://lph.ece.utexas.edu/merez/MattanErez/Home
Dr. Erez researches system resiliency, reliability, dependability, and
related topics.
Research
  http://lph.ece.utexas.edu/merez/MattanErez/Research

Even before the technical report above was published, he had been
producing interesting-looking research, such as:
"Virtualized and Flexible ECC for Main Memory"
 Doe Hyun Yoon, and Mattan Erez
 Fifteenth International Conference on Architectural Support for
 Programming Languages and Operating Systems (ASPLOS'10)
  http://lph.ece.utexas.edu/merez/uploads/MattanErez/vecc_asplos_2010.pdf
  http://lph.ece.utexas.edu/merez/uploads/MattanErez/vecc_asplos_2010.pptx
==================================================
[san-tech][01874] Report on DRAM reliability
[san-tech][01877] Re: Report on DRAM reliability
