Scheduling restartable jobs with short test runs

Written with Ozzy Thebe and Vitus J. Leung.
Proceedings of the 14th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), volume 5798 of LNCS, pages 116-137, 2009.


© 2009 Springer-Verlag.


Abstract:

In this paper, we examine the concept of giving every job a trial run before committing it to run until completion. Trial runs allow jobs that fail immediately to be detected shortly after submission, and they benefit short jobs by letting them run and finish early. This occurs without a significant penalty to longer jobs, whose average and maximum waiting times are actually improved in some cases. The strategy does not require preemption. Instead, it relies on the ability to kill a job and restart it from the beginning, which it does at most once per job. While others have proposed similar strategies, our algorithm is distinguished by its commitment to giving each job a fixed-length trial run as soon as possible. Our study is also more focused: it includes a detailed description of the algorithm and examines the effect of varying the length of a trial run.
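To make the kill-and-restart idea concrete, here is a minimal Python sketch. It is our own illustration, not the algorithm from the paper. It assumes a machine with a fixed number of processors and a hypothetical `Job` record (arrival time, processor count, true runtime). It also assumes FCFS queues and a simple greedy start rule that always tries pending trial runs before full runs. The paper's actual handling of reservations and backfilling is not reproduced here. Each job gets one trial of length `trial_len` (a hypothetical parameter name). If it has not finished by then, it is killed and queued for a single full run from the beginning.

```python
import heapq
from dataclasses import dataclass


@dataclass(eq=False)
class Job:
    name: str
    arrival: float
    procs: int      # processors the job needs
    runtime: float  # true running time (unknown to a real scheduler)


def simulate(jobs, total_procs, trial_len):
    """Return {job name: completion time} under a hypothetical trial-run-first policy."""
    if any(j.procs > total_procs for j in jobs):
        raise ValueError("a job needs more processors than the machine has")
    pending = sorted(jobs, key=lambda j: j.arrival)  # jobs that have not arrived yet
    trial_q = []  # arrived jobs still waiting for their trial run (FCFS)
    full_q = []   # jobs killed after their trial, waiting for a full restart (FCFS)
    running = []  # heap of (end_time, seq, job, is_trial)
    free, now, seq, i = total_procs, 0.0, 0, 0
    done = {}

    while i < len(pending) or trial_q or full_q or running:
        # Admit every job that has arrived by the current time.
        while i < len(pending) and pending[i].arrival <= now:
            trial_q.append(pending[i])
            i += 1

        # Greedily start jobs that fit: trial runs first, then full runs.
        for queue, is_trial in ((trial_q, True), (full_q, False)):
            waiting = []
            for job in queue:
                if job.procs <= free:
                    free -= job.procs
                    length = min(job.runtime, trial_len) if is_trial else job.runtime
                    heapq.heappush(running, (now + length, seq, job, is_trial))
                    seq += 1
                else:
                    waiting.append(job)
            queue[:] = waiting

        # Jump to the next event: an arrival or the end of a running job.
        next_arrival = pending[i].arrival if i < len(pending) else float("inf")
        next_end = running[0][0] if running else float("inf")
        now = min(next_arrival, next_end)

        # Handle every run that ends at the new time.
        while running and running[0][0] <= now:
            end, _, job, is_trial = heapq.heappop(running)
            free += job.procs
            if is_trial and job.runtime > trial_len:
                full_q.append(job)  # killed at the end of its trial; restarts from scratch once
            else:
                done[job.name] = end
    return done
```

For example, with two processors and a one-unit trial, a long job A (arrives at 0, needs both processors, runs 5 units) gets its trial at time 0 and is killed at time 1. A short job B (arrives at 0.5, needs both processors, runs 0.5 units) then finishes within its own trial at 1.5, and A restarts and completes at 6.5. Under plain FCFS, B would have waited behind A until 5.5.

```python
jobs = [Job("A", 0.0, 2, 5.0), Job("B", 0.5, 2, 0.5)]
print(simulate(jobs, total_procs=2, trial_len=1.0))
```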