[san-tech][02075] OS Jitter Report on a Large-Scale System (16K Nodes) (HPC Colony Project)

Date: Wed, 10 Feb 2010 17:41:27 +0900
"Practical experiences with OS Jitter"
 Feb 09, 2010, IBM developerWorks Wikis

"OS Jitter Mitigation Techniques"
 Feb 09, 2010, IBM developerWorks Wikis
This is a report on OS jitter on a large-scale system (16,000 nodes).
(It is in fact an HPC Colony Project report.)

"Linux OS Jitter Measurements at Large Node Counts using a BlueGene/L"
 Jones, Terry R [ORNL] ;
 Tauferner, Mr. Andrew [IBM T.J. Watson Research Center] ;
 Inglett, Mr. Todd [IBM T.J. Watson Research Center]
 Publication Date: 2010 Jan 01 (On Paper: November 30, 2009)

 "We present experimental results for a coordinated scheduling
  implementation of the Linux operating system. Results were collected
  on an IBM Blue Gene/L machine at scales up to 16K nodes. Our results
  indicate coordinated scheduling was able to provide a dramatic
  improvement in scaling performance for two applications characterized
  as bulk synchronous parallel programs."
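
To see why coordination matters at this scale, here is a minimal sketch of a
toy model. It is not the method used in the paper. It assumes each node is
interrupted independently with probability p during a compute step, that an
interruption costs a fixed delay, and that a barrier makes every step as slow
as its slowest node. The function name and all numbers are hypothetical.

```python
def expected_step_time(n_nodes, work, p, delay, coordinated):
    """Expected time of one compute-then-barrier step in a toy jitter model.

    work        : compute time per step with no interruption (ms)
    p           : per-node probability of an OS interruption during a step
    delay       : extra time one interruption adds (ms)
    coordinated : True if all nodes take their interruptions at the same time
    """
    if coordinated:
        # Interruptions are aligned, so the whole machine pays the delay
        # with the same probability as a single node.
        p_step_delayed = p
    else:
        # Interruptions are independent, and the barrier waits for the
        # slowest node, so one hit anywhere delays the whole step.
        p_step_delayed = 1.0 - (1.0 - p) ** n_nodes
    return work + p_step_delayed * delay


if __name__ == "__main__":
    # Hypothetical parameters: 1 ms of work, 0.1% hit rate, 0.5 ms delay.
    for n in (1, 64, 1024, 16384):
        unco = expected_step_time(n, 1.0, 0.001, 0.5, coordinated=False)
        coord = expected_step_time(n, 1.0, 0.001, 0.5, coordinated=True)
        print(f"{n:6d} nodes  uncoordinated {unco:.4f} ms  coordinated {coord:.4f} ms")
```

In this model the uncoordinated step time climbs toward work + delay as the
node count grows, while the coordinated step time does not depend on the node
count at all.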

Operating systems (kernels) compared
  Kernel 1: Blue Gene/L Compute Node Kernel (CNK)
    "One of CNK's principal design points was to avoid OS noise.
     It runs one process at a time; therefore it does not need to
     perform time-slicing or preemptive multitasking."
    "This static memory map completely avoids TLB misses ..."
  Kernel 2: Colony Linux Kernel with unmodified Scheduler
     Linux version 2.6.16
    "A console driver and RAS driver were added in addition to various
     changes to support the BlueGene/L platform. The default 4KB pages
     were replaced with 64KB pages."
  Kernel 3: Colony Linux Kernel with Coordinated Scheduler
    "Two /proc interfaces were created and the scheduler was modified
     to give priority to the HPC applications in a coordinated fashion."
  Application 1: Allreduce
  Application 2: glob

Through a good deal of trial and error, they are building an OS suited to
large-scale systems.
(As described below, the HPC-Colony project was selected for INCITE 2010.)

HPC-Colony Project

"Colony Update", Terry Jones, Principal Investigator
  ↑ PPT file
FastOS 2, Birds-of-a-Feather at Supercomputing 2009

Terry Jones, Application Performance Tools group, CSM, ORNL
Terry Jones, Stanford University

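HPC Colony was newly selected for INCITE 2010. The machine is an XT5, but
more than half of the co-investigators are from IBM. The allocation is
4,000,000 core-hours (about 457 years on a single core).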
    "HPC Colony: Removing Scalability, Fault, and Performance
     Barriers in Leadership Class Systems through Adaptive System
Principal Investigator: Terry Jones (Oak Ridge National Laboratory)
    Laxmikant Kale (University of Illinois-Urbana-Champaign)
    Jose Moreira (International Business Machines)
    Celso Mendes, Esteban Meneses (UIUC),
    Yoav Tock, Eliezer Dekel, Roie Melamed, Eli Luboshitz,
    Menachem Shtalhaim, Benjamin Mandler (IBM)
Scientific Discipline: Computer Science
INCITE Allocation: 4,000,000 processor hours
Site: Oak Ridge National Laboratory
Machine (Allocation): Cray XT (4,000,000 processor hours)
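
For a sense of scale, the allocation can be converted into single-core years
and into wall-clock time for a job of a given size. This is a minimal sketch:
the helper names are made up for illustration, a year is taken as 365 days,
and the 16,384-core job size is a hypothetical example, not a figure from the
fact sheet.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def core_hours_to_years(core_hours):
    """Years one core would need to consume the allocation."""
    return core_hours / HOURS_PER_YEAR


def wall_clock_hours(core_hours, cores):
    """Wall-clock hours if `cores` cores run continuously in parallel."""
    return core_hours / cores


if __name__ == "__main__":
    allocation = 4_000_000  # processor hours, from the INCITE allocation
    print(f"single core: {core_hours_to_years(allocation):.1f} years")
    # Hypothetical job size, chosen only to illustrate the conversion.
    print(f"16,384 cores: {wall_clock_hours(allocation, 16_384):.1f} hours")
```

With 8,760 hours per year, 4,000,000 processor hours comes to about 456.6
single-core years.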

[san-tech][02043] Re: US DOE INCITE 2010 AWARDS announcement (10/01/26), 28 Jan 2010
2010 Awards Fact Sheet

