Toward Efficient Cloud Resource Management with Deep Reinforcement Learning
Project Description

Cloud datacenters serve as a shared infrastructure running mixed workloads with diverse resource demands across CPU, memory, disk, and network. In addition, depending on the underlying applications, datacenter jobs have divergent service-level objectives (SLOs). Notably, business-critical jobs have stringent requirements on latency and reliability and are usually executed in long-running containers, whereas batch-processing jobs consist of many short-running tasks and are more sensitive to task throughput. A critical problem facing cloud operators is how to judiciously schedule mixed workloads on machines so as to meet their diverse SLOs and resource requirements while attaining high cluster utilization. This problem manifests as a challenging online decision-making task without a precise model, and it is usually solved by heuristics whose performance depends critically on the workload and environment. Operators therefore need to painstakingly test and tune these heuristics for good performance in practice.

In this proposed project, we will investigate whether machine learning can provide a viable alternative to human-generated heuristics for resource management. More precisely, we propose to take full advantage of state-of-the-art deep reinforcement learning algorithms so that systems can learn to manage resources on their own. Deep reinforcement learning (DRL) has enjoyed remarkable success in recent years, for example in AlphaGo and video game playing. It deals with agents that learn to make better decisions directly from experience interacting with the environment, without any a priori knowledge about the task at hand. We believe RL approaches are especially well suited to cloud resource management systems, as the decisions made by these systems are often repetitive, generating abundant training data for RL algorithms. Our new algorithm will model complex scheduling policies as deep neural networks, analogous to the models used for game-playing agents. By continuing to learn, our algorithms can optimize for a specific workload and gracefully adapt to changing conditions. We will implement our algorithms as a new scheduler in popular resource management frameworks such as YARN and Kubernetes, and evaluate their performance using production traces and standard benchmarks for data analytics and machine learning.
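To make the learning loop concrete, the following is a minimal sketch, not the project's actual algorithm, of a policy-gradient (REINFORCE) scheduler: a softmax policy over machines is trained purely from sampled experience, using a toy reward that favors placing each job on the least-loaded machine. The state, reward, and all dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MACHINES = 4            # candidate machines for each placement decision
LR = 0.1                  # policy-gradient learning rate

# Linear softmax policy: pi(a | s) = softmax(W @ s),
# where the state s is the free capacity of each machine.
W = np.zeros((N_MACHINES, N_MACHINES))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def choose_machine(state):
    probs = softmax(W @ state)
    return rng.choice(N_MACHINES, p=probs), probs

def reinforce_update(state, action, reward, probs):
    """One REINFORCE step: W += LR * reward * grad log pi(action | state)."""
    global W
    grad = -np.outer(probs, state)   # gradient of log-softmax, all rows
    grad[action] += state            # plus the chosen action's extra term
    W += LR * reward * grad

# Toy training loop: reward +1 for picking the least-loaded machine
# (largest free capacity), a small penalty otherwise.
for _ in range(2000):
    state = rng.random(N_MACHINES)
    action, probs = choose_machine(state)
    reward = 1.0 if action == state.argmax() else -0.1
    reinforce_update(state, action, reward, probs)

# Greedy evaluation: how often the learned policy picks the best machine.
hits = 0
for _ in range(200):
    state = rng.random(N_MACHINES)
    hits += int((W @ state).argmax() == state.argmax())
accuracy = hits / 200
print(accuracy)
```

A real scheduler would replace the linear policy with a deep network and the toy reward with SLO- and utilization-aware signals, but the same sample-update loop applies.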

Supervisor
WANG Wei
Quota
1
Course type
UROP1100
Applicant's Roles

The applicant will work with the UROP advisor and his PhD students on the following tasks:

1. Literature survey on cloud resource management and scheduling and deep reinforcement learning techniques;
2. Study of production traces, including those from Google and Alibaba;
3. Simulator implementation;
4. Algorithm design;
5. Real-world deployment in Amazon EC2 cloud;
6. Paper writing.
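As a starting point for task 3, a simulator can replay a job trace against a scheduling policy and report aggregate metrics. The sketch below is hypothetical code, not tied to any particular trace format: an event-driven loop places jobs with a best-fit baseline heuristic, and a learned policy could later be swapped in through the same `place` interface.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Machine:
    cpu: float          # total CPU capacity
    used: float = 0.0   # CPU currently allocated

@dataclass(order=True)
class Completion:
    time: float
    machine: int = field(compare=False)
    cpu: float = field(compare=False)

def simulate(jobs, machines, place):
    """Event-driven replay of a job trace.

    jobs: list of (arrival, duration, cpu), sorted by arrival time.
    place: policy mapping (job, machines) -> machine index, or None to reject.
    Returns the makespan (finish time of the last placed job).
    """
    completions = []    # min-heap of pending job completions
    makespan = 0.0
    for arrival, duration, cpu in jobs:
        # Release resources of jobs that finished before this arrival.
        while completions and completions[0].time <= arrival:
            done = heapq.heappop(completions)
            machines[done.machine].used -= done.cpu
        m = place((arrival, duration, cpu), machines)
        if m is None:
            continue    # job rejected: no machine has enough free capacity
        machines[m].used += cpu
        end = arrival + duration
        heapq.heappush(completions, Completion(end, m, cpu))
        makespan = max(makespan, end)
    return makespan

def best_fit(job, machines):
    """Baseline heuristic: the tightest machine that still fits the job."""
    _, _, cpu = job
    fits = [i for i, m in enumerate(machines) if m.cpu - m.used >= cpu]
    return min(fits, key=lambda i: machines[i].cpu - machines[i].used) if fits else None

jobs = [(0.0, 5.0, 2.0), (1.0, 3.0, 3.0), (2.0, 4.0, 2.0)]
machines = [Machine(cpu=4.0), Machine(cpu=4.0)]
makespan = simulate(jobs, machines, best_fit)
print(makespan)   # 6.0 for this toy trace
```

The same loop can replay real Google or Alibaba traces once their records are parsed into (arrival, duration, demand) tuples, and metrics such as utilization or rejection rate are easy to accumulate alongside the makespan.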

Applicant's Learning Objectives

1. Having a deep understanding of cloud resource management in production clusters;
2. Mastering the reinforcement learning algorithms and applying them in cloud scheduling;
3. Learning how to conduct literature survey efficiently and effectively;
4. Enhancing hands-on programming skills in large open-source projects such as Apache YARN and Kubernetes;
5. Improving research and communication skills.

Complexity of the project
Challenging