Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract: The Hadoop Distributed File System (HDFS) is a distributed file system used to support multiple widely-used big data frameworks, including Apache Hadoop and Apache Spark. Since these frameworks are often run across many compute nodes, it is possible that multiple nodes will read the same data. In addition, since data is replicated across multiple nodes for storage, the same data will be written multiple times across the network. In this paper, we conduct an evaluation of the caching potential present in HDFS in order to determine if in-network caching, particularly of the type seen in Named Data Networking (NDN), would reduce the amount of traffic seen in a Spark cluster network, as well as the average load on each data storage node. Our results show that for most benchmarks running on Apache Spark, a majority of the large read operations were done to transfer the Spark and application dependency libraries to each compute node. In addition, there was not a significant amount of read traffic in the network for most of the applications we evaluated, making the benefits of in-network caching for HDFS questionable.
Degree Program: Honors College