Evaluation and Optimization of Turnaround Time and Cost of HPC Applications on the Cloud
AuthorMarathe, Aniruddha Prakash
High Performance Computing
AdvisorLowenthal, David K.
MetadataShow full item record
PublisherThe University of Arizona.
RightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
AbstractThe popularity of Amazon's EC2 cloud platform has increased in commercial and scientific high-performance computing (HPC) applications domain in recent years. However, many HPC users consider dedicated high-performance clusters, typically found in large compute centers such as those in national laboratories, to be far superior to EC2 because of significant communication overhead of the latter. We find this view to be quite narrow and the proper metrics for comparing high-performance clusters to EC2 is turnaround time and cost. In this work, we first compare the HPC-grade EC2 cluster to top-of-the-line HPC clusters based on turnaround time and total cost of execution. When measuring turnaround time, we include expected queue wait time on HPC clusters. Our results show that although as expected, standard HPC clusters are superior in raw performance, they suffer from potentially significant queue wait times. We show that EC2 clusters may produce better turnaround times due to typically lower wait queue times. To estimate cost, we developed a pricing model---relative to EC2's node-hour prices---to set node-hour prices for (currently free) HPC clusters. We observe that the cost-effectiveness of running an application on a cluster depends on raw performance and application scalability. However, despite the potentially lower queue wait and turnaround times, the primary barrier to using clouds for many HPC users is the cost. Amazon EC2 provides a fixed-cost option (called on-demand) and a variable-cost, auction-based option (called the spot market). The spot market trades lower cost for potential interruptions that necessitate checkpointing; if the market price exceeds the bid price, a node is taken away from the user without warning. We explore techniques to maximize performance per dollar given a time constraint within which an application must complete. Specifically, we design and implement multiple techniques to reduce expected cost by exploiting redundancy in the EC2 spot market. We then design an adaptive algorithm that selects a scheduling algorithm and determines the bid price. We show that our adaptive algorithm executes programs up to 7x cheaper than using the on-demand market and up to 44% cheaper than the best non-redundant, spot-market algorithm. Finally, we extend our adaptive algorithm to exploit several opportunities for cost-savings on the EC2 spot market. First, we incorporate application scalability characteristics into our adaptive policy. We show that the adaptive algorithm informed with scalability characteristics of applications achieves up to 56% cost-savings compared to the expected cost for the base adaptive algorithm run at a fixed, user-defined scale. Second, we demonstrate potential for obtaining considerable free computation time on the spot market enabled by its hour-boundary pricing model.
Degree ProgramGraduate College