Today's data-parallel clusters critically rely on in-memory solutions for high-performance big data analytics. By caching data objects in memory, I/O-intensive applications can gain order-of-magnitude performance improvements over traditional on-disk solutions.
However, one key challenge faced by in-memory solutions is severe load imbalance across cache servers. In production clusters, data objects typically have heavily skewed popularity: a small number of hot files account for a large fraction of data accesses. The cache servers containing hot files hence turn into hot spots. This problem is further aggravated by network load imbalance. In a Facebook cluster, the most heavily loaded links are reported to have over 4.5x higher utilization than the average for more than 50% of the time. The routinely observed hot spots, along with the network load imbalance, result in a significant degradation of I/O performance that could even eliminate the performance advantage of in-memory solutions.
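To make the effect of skewed popularity concrete, the following is a minimal sketch, not a model of any real cluster. It assumes file popularity follows a Zipf distribution and that files are spread round-robin over cache servers. The Zipf exponent, the file and server counts, and the placement policy are all illustrative assumptions that do not come from the project description.

```python
# Illustrative sketch: how skewed file popularity turns into server hot spots.
# All parameters below are assumptions chosen for illustration only.

def zipf_weights(num_files, exponent=1.0):
    """Return normalized access probabilities; rank 1 is the hottest file."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_files + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def server_loads(weights, num_servers):
    """Place file i on server (i mod num_servers) and sum the access share per server."""
    loads = [0.0] * num_servers
    for file_index, weight in enumerate(weights):
        loads[file_index % num_servers] += weight
    return loads

def imbalance(loads):
    """Ratio of the most loaded server's share to the average share (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

weights = zipf_weights(num_files=1000, exponent=1.0)
loads = server_loads(weights, num_servers=20)
hot_share = sum(weights[:10])  # access share of the 10 hottest files

print(f"share of accesses to the 10 hottest files: {hot_share:.2f}")
print(f"max/mean server load: {imbalance(loads):.2f}")
```

Even though every server holds the same number of files in this sketch, the server holding the hottest file receives several times the average load. This is the hot-spot behavior described above, before any network imbalance is taken into account.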
In this project, we aim to study an effective approach to load balancing in cluster caching systems. The expected deliverables are a load balancing algorithm and its prototype implementation in Alluxio, a popular in-memory distributed storage system for data-intensive clusters.
Applicants are expected to survey existing load balancing techniques, analyze their inefficiencies in cluster caching systems, propose new solutions, implement them atop Alluxio, and evaluate their performance in real cloud environments.
1. To learn state-of-the-art in-memory caching systems such as Alluxio and their interactions with data analytics frameworks such as Hadoop and Spark.
2. To study the memory management and load balancing problems in cluster caching systems.
3. To learn to program in real cloud environments and to evaluate big data systems against popular benchmark workload suites.
4. To gain the basic technical skills needed in systems research.