On the Power of In-Network Caching in the Hadoop Distributed File System
Publisher
ASSOC COMPUTING MACHINERY
Citation
Eric Newberry and Beichuan Zhang. 2019. On the Power of In-Network Caching in the Hadoop Distributed File System. In 6th ACM Conference on Information-Centric Networking (ICN ’19), September 24–26, 2019, Macao, China. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3357150.3357392
Rights
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
The Hadoop Distributed File System (HDFS) is a network file system used to support multiple widely-used big data frameworks that can scale to run on large clusters. In this paper, we evaluate the effectiveness of using in-network caching on switches in HDFS-supported clusters in order to reduce per-link bandwidth usage in the network. We discovered that some applications featured large amounts of data requested by multiple clients and that, by caching read data in the network, the average per-link bandwidth usage of read operations in these applications could be reduced by more than half. We also found that the choice of cache replacement policy could have a significant impact on caching effectiveness in this environment, with LIRS and ARC generally performing the best for larger and smaller cache sizes, respectively. Moreover, given the structure of HDFS write operations, we developed a mechanism to reduce the total per-link bandwidth usage of HDFS write operations by replacing write pipelining with multicast. In order to evaluate in-network caching potential, we developed a simulator to replay real traces through a fat tree network simulating the caching architecture used in the Named Data Networking (NDN) information-centric networking (ICN) architecture. Our results suggest that ICN-style in-network caching can provide significant benefits to HDFS-supported big data clusters, justifying future work to apply ICN architectures to cluster environments.
Version
Final accepted manuscript
Sponsors
National Science Foundation
10.1145/3357150.3357392