RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms. (30th December 2021)

Record Type:: Journal Article
Title:: RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms. (30th December 2021)
Main Title:: RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms
Authors:: Chen, Zhaoyun
Abstract:: GPU platforms have been widely adopted in both academia and industry to support deep learning (DL) research and development (R&D). Compared with giant companies that favor custom‐designed AI platforms, most small‐and‐medium‐sized enterprises, institutes and universities (EIUs) prefer to build or rent a cost‐effective GPU cluster, usually of limited scale, to process diverse DL R&D workloads. DL scheduling has therefore attracted growing attention as a way to improve system efficiency and task performance. However, prior prediction‐based schedulers are limited in their prediction accuracy and by their profiling overhead. Accordingly, in this article, we propose a reinforcement learning (RL)‐based online GPU scheduler, RIFLING, which models the scheduling problem as an online decision‐making process. Scheduling decisions are made according to Q‐learning, a typical RL method. RIFLING achieves high scheduling efficiency by exploring and exploiting diverse scheduling strategies online for various DL workloads, without the need for expensive offline profiling or a sophisticated prediction model. We implement RIFLING as a TensorFlow plugin and deploy it on a distributed GPU cluster. Experiments demonstrate that, without any manual intervention, RIFLING achieves up to a 47.8% reduction in makespan and a 19.6% improvement in average normalized processing rate compared with the best available baseline.
Is Part Of:: Software, practice & experience. Volume 52:Number 6(2022)
Journal:: Software, practice & experience
Issue:: Volume 52:Number 6(2022)
Issue Display:: Volume 52, Issue 6 (2022)
Year:: 2022
Volume:: 52
Issue:: 6
Issue Sort Value:: 2022-0052-0006-0000
Page Start:: 1319
Page End:: 1336
Publication Date:: 2021-12-30
Subjects:: cost‐effective -- deep learning platform -- Q‐learning -- reinforcement learning -- scheduling
Computer software -- Periodicals
Computer programming -- Periodicals
Computer programs -- Periodicals
005.3
Journal URLs:: http://onlinelibrary.wiley.com/
DOI:: 10.1002/spe.3066
Languages:: English
ISSNs:: 0038-0644
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 8321.453000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
Ingest File:: 21347.xml