JPAS: Job-progress-aware flow scheduling for deep learning clusters. (15th May 2020)
- Record Type:
- Journal Article
- Title:
- JPAS: Job-progress-aware flow scheduling for deep learning clusters. (15th May 2020)
- Main Title:
- JPAS: Job-progress-aware flow scheduling for deep learning clusters
- Authors:
- Zhou, Pan
He, Xinshu
Luo, Shouxi
Yu, Hongfang
Sun, Gang
- Abstract:
- Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are also common in today's production clusters due to the increasing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training to leverage multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters. As a result, the network is a well-known bottleneck for distributed training. Efficient network scheduling is essential for maximizing the performance of DL training. DL training is feedback-driven exploration (e.g., hyper-parameter tuning, model structure optimization), which requires multiple retrainings of deep learning models that differ in terms of their configuration. The information at the early stage of each retraining can facilitate the direct search for high-quality models. Thus, reducing the early-stage time can accelerate the exploration of DL training. In this paper, we propose JPAS, which is a flow scheduling system for DDL training jobs that aims at reducing the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows using the corresponding job order and offloads the flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS over a real testbed that is composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS can reduce the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can remarkably reduce the early-stage time and thus accelerate the search for high-quality models.
- Is Part Of:
- Journal of network and computer applications. Volume 158(2020)
- Journal:
- Journal of network and computer applications
- Issue:
- Volume 158(2020)
- Issue Display:
- Volume 158, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 158
- Issue:
- 2020
- Issue Sort Value:
- 2020-0158-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-05-15
- Subjects:
- Machine learning -- Deep learning -- Flow scheduling -- Job progress aware
Microcomputers -- Periodicals
Computer networks -- Periodicals
Application software -- Periodicals
Micro-ordinateurs -- Périodiques
Réseaux d'ordinateurs -- Périodiques
Logiciels d'application -- Périodiques
Application software
Computer networks
Microcomputers
Periodicals
004.05
004
- Journal URLs:
- http://www.sciencedirect.com/science/journal/10848045
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.jnca.2020.102590
- Languages:
- English
- ISSNs:
- 1084-8045
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 5021.410600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13480.xml
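
Illustrative note on the abstract: the two steps it describes (periodically ordering all DDL jobs with a greedy rule, then having each host derive flow priorities from that order) might be sketched roughly as below. This is only a sketch under stated assumptions, not the authors' algorithm. The `Job` class, the `progress` metric, the "least-progressed job first" rule, and the limit of 8 priority classes are all hypothetical; the record does not state the ordering criterion or the number of priority levels JPAS uses.

```python
from dataclasses import dataclass

# Assumption: the priority-enabled network exposes 8 priority classes.
# The catalogue record does not say how many levels JPAS relies on.
NUM_PRIORITIES = 8


@dataclass
class Job:
    job_id: str
    # Hypothetical progress metric in [0, 1], e.g. the fraction of the
    # converged accuracy reached so far. Not taken from the source.
    progress: float


def order_jobs(jobs):
    """Greedy ordering: least-progressed jobs first (an assumed criterion).

    Ties are broken by job_id so the order is deterministic.
    """
    return sorted(jobs, key=lambda j: (j.progress, j.job_id))


def assign_priorities(jobs, num_priorities=NUM_PRIORITIES):
    """Map each job's rank to a priority class; 0 is the highest priority.

    Jobs ranked beyond the available classes share the lowest class.
    """
    ordered = order_jobs(jobs)
    return {
        job.job_id: min(rank, num_priorities - 1)
        for rank, job in enumerate(ordered)
    }


if __name__ == "__main__":
    jobs = [Job("a", 0.9), Job("b", 0.1), Job("c", 0.5)]
    # A host would tag each flow with the class of the job it belongs to
    # and leave scheduling and rate allocation to the network.
    print(assign_priorities(jobs))
```

In a periodic setting, `assign_priorities` would be re-run at each scheduling interval with updated progress values, so a job's flows lose priority as the job leaves its early stage.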