Comprehensive, open‐source resource usage measurement and analysis for HPC systems. (6th March 2014)
- Record Type:
- Journal Article
- Title:
- Comprehensive, open‐source resource usage measurement and analysis for HPC systems. (6th March 2014)
- Main Title:
- Comprehensive, open‐source resource usage measurement and analysis for HPC systems
- Authors:
- Browne, James C.
DeLeon, Robert L.
Patra, Abani K.
Barth, William L.
Hammond, John
Jones, Matthew D.
Furlani, Thomas R.
Schneider, Barry I.
Gallo, Steven M.
Ghadersohi, Amin
Gentner, Ryan J.
Palmer, Jeffrey T.
Simakov, Nikolay
Innus, Martins
Bruno, Andrew E.
White, Joseph P.
Cornelius, Cynthia D.
Yearke, Thomas
Marcus, Kyle
von Laszewski, Gregor
Wang, Fugang
Wilkins‐Diehr, Nancy
Majumdar, Amit
- Abstract:
- The important role that high‐performance computing (HPC) resources play in science and engineering research, coupled with their high cost (capital, power and manpower), short life and oversubscription, requires us to optimize their usage, an outcome that is only possible if adequate analytical data are collected and used to drive systems management at different granularities: job, application, user and system. This paper presents a method for comprehensive job, application and system‐level resource use measurement and analysis, and its implementation. The steps in the method are system‐wide collection of comprehensive resource use and performance statistics at the job and node levels in a uniform format across all resources, followed by mapping and storage of the resultant job‐wise data to a relational database, which enables further implementation and transformation of the data to the formats required by specific statistical and analytical algorithms. Analyses can be carried out at different levels of granularity: job, user, application or system‐wide. Measurements are based on a new lightweight job‐centric measurement tool, 'TACC_Stats', which gathers a comprehensive set of resource use metrics on all compute nodes, and on data logged by the system scheduler. The data mapping and analysis tools are an extension of the XDMoD project. The method is illustrated with analyses of resource use for the Texas Advanced Computing Center's Lonestar4, Ranger and Stampede supercomputers and the HPC cluster at the Center for Computational Research. The illustrations focus on resource use at the system, job and application levels and reveal many interesting insights into system usage patterns, as well as anomalous behavior due to failure or misuse. The method can be applied to any system that runs the TACC_Stats measurement tool and a tool to extract job execution environment data from the system scheduler. Copyright © 2014 John Wiley & Sons, Ltd.
- Is Part Of:
- Concurrency and computation. Volume 26:Number 13(2014:Sep.)
- Journal:
- Concurrency and computation
- Issue:
- Volume 26:Number 13(2014:Sep.)
- Issue Display:
- Volume 26, Issue 13 (2014)
- Year:
- 2014
- Volume:
- 26
- Issue:
- 13
- Issue Sort Value:
- 2014-0026-0013-0000
- Page Start:
- 2191
- Page End:
- 2209
- Publication Date:
- 2014-03-06
- Subjects:
- Parallel processing (Electronic computers) -- Periodicals
Parallel computers -- Periodicals
004.35
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/cpe.3245
- Languages:
- English
- ISSNs:
- 1532-0626
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3405.622000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 3072.xml