Integration of PanDA workload management system with Titan supercomputer at OLCF. Issue 9 (December 2015)
- Record Type:
- Journal Article
- Title:
- Integration of PanDA workload management system with Titan supercomputer at OLCF. Issue 9 (December 2015)
- Main Title:
- Integration of PanDA workload management system with Titan supercomputer at OLCF
- Authors:
- De, K.
Klimentov, A.
Oleynik, D.
Panitkin, S.
Petrosyan, A.
Schovancova, J.
Vaniachine, A.
Wenaus, T.
- Other Names:
- collab.
- Abstract:
- The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. To address these challenges, ATLAS is engaged in an ambitious program to expand its current computing model to include additional resources, such as the opportunistic use of supercomputers. We describe a project aimed at integrating PanDA WMS with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-threaded workloads in parallel on Titan's multicore worker nodes. It also gives PanDA a new capability to collect, in real time, information about unused worker nodes on Titan, which allows the size and duration of jobs submitted to Titan to be defined precisely according to the available free resources. This capability significantly reduces PanDA job wait time while improving Titan's utilization efficiency. The implementation was tested with a variety of Monte Carlo workloads on Titan and is being tested on several other supercomputing platforms. Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
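The abstract describes sizing each Titan submission to match whatever worker nodes are idle at that moment. As a minimal sketch only, assuming a hypothetical snapshot of idle capacity expressed as (free nodes, free minutes) pairs and a hypothetical `size_job` helper, the selection step could look like the following. It is not the PanDA pilot's actual code; the safety margin and the "largest node-minutes" rule are illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackfillSlot:
    """Hypothetical snapshot entry: `nodes` idle nodes that stay free for `minutes`."""
    nodes: int
    minutes: int


@dataclass(frozen=True)
class JobShape:
    """Size and duration chosen for one batch submission."""
    nodes: int
    walltime_minutes: int


def size_job(slots: List[BackfillSlot],
             max_nodes: int,
             min_walltime: int,
             margin_minutes: int = 5) -> Optional[JobShape]:
    """Pick the job shape that uses the most node-minutes of idle capacity.

    A slot is usable only if, after subtracting a safety margin, it still
    leaves at least `min_walltime` minutes. The node count is capped at
    `max_nodes`. Returns None when no slot fits.
    """
    best: Optional[JobShape] = None
    best_area = 0
    for slot in slots:
        walltime = slot.minutes - margin_minutes  # finish before the slot closes
        if slot.nodes <= 0 or walltime < min_walltime:
            continue
        nodes = min(slot.nodes, max_nodes)
        area = nodes * walltime  # node-minutes this shape would consume
        if area > best_area:
            best_area = area
            best = JobShape(nodes=nodes, walltime_minutes=walltime)
    return best


if __name__ == "__main__":
    snapshot = [BackfillSlot(300, 60), BackfillSlot(1200, 20), BackfillSlot(50, 600)]
    print(size_job(snapshot, max_nodes=1000, min_walltime=30))
```

With this snapshot, the 1200-node slot is rejected because only 15 minutes remain after the margin, and the 50-node, 595-minute shape wins on node-minutes. A real deployment would also weigh queue policies and payload granularity.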
- Is Part Of:
- Journal of Physics: Conference Series. Volume 664: Issue 9 (2015)
- Journal:
- Journal of Physics: Conference Series
- Issue:
- Volume 664: Issue 9 (2015)
- Issue Display:
- Volume 664, Issue 9 (2015)
- Year:
- 2015
- Volume:
- 664
- Issue:
- 9
- Issue Sort Value:
- 2015-0664-0009-0000
- Page Start:
- Page End:
- Publication Date:
- 2015-12
- Subjects:
- Physics -- Congresses
530.5
- Journal URLs:
- http://www.iop.org/EJ/journal/1742-6596
http://ioppublishing.org/
- DOI:
- 10.1088/1742-6596/664/9/092020
- Languages:
- English
- ISSNs:
- 1742-6588
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 5036.223000
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 90.xml