Skip to main content

GPU availability and job interruption can be an issue for me, so this paper caught my attention. The paper is titled “Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning” and it proposes a method for reducing service interruptions on GPU clusters for deep learning jobs.

The tool is called Mirage. Mirage is a Slurm-compatible foundational model that uses statistical and reinforcement learning (RL) techniques to proactively provision resources to mitigate interruptions. To accomplish this, researchers alter the decoder transformer architecture to create a Mixture of Experts (MoE) of neural networks. The neural networks share the same network architecture to support scaling up of model capacity and determine ‘best-fit experts’ that optimize job trace analysis outputs for resource allocations. 

What’s really cool about this paper is how it investigates the use of Deep Q-Learning RL techniques to reenforce rewards to improve GPU queue wait times for a finite set of GPU resources to improve service continuity.

Mirage looks like a promising method to enable 23 to 76% more jobs on finite resources with zero interruptions!

Be the first to reply!

Reply