Affiliation: Department of Computer Science, University of Arizona
Keywords: remote data access; on-demand remote data access; WAN file system; cluster-to-cluster data transfer
Citation: Illyoung Choi and John H. Hartman, "Stargate: Remote Data Access between Hadoop Clusters," In Proceedings of the 36th ACM/SIGAPP Symposium On Applied Computing, 2021.
Rights: © 2021 Association for Computing Machinery.
Collection Information: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at email@example.com.
Abstract: The transfer of large-scale datasets between geographically separated systems is a challenge in scientific computing, made even more complicated when the systems are clusters of computers. In this paper we present Stargate, a file system that enables efficient on-demand remote data access for Hadoop-based scientific computations. Stargate uses a content-addressable protocol, on-demand access, and multi-tier caching to address the challenges of large data transfers over a WAN. Stargate also uses a novel approach that co-locates computations and transfers to achieve efficient data access in cluster computing. Unlike other approaches, Stargate is implemented as an independent file system service that works with any computation framework. In our experiments Stargate’s performance on heavy I/O workloads was 7% faster than WebHDFS and only 8% slower than HDFS. In addition, Stargate’s caches effectively trade high-cost WAN traffic for low-cost LAN traffic. Stargate’s performance, on-demand data access, and reduction in WAN traffic make it a good platform for providing remote dataset access to scientific computations on clusters.
Version: Final accepted manuscript
Sponsors: This research was funded in part by NSF grants OAR-1640775 and OAR-1541318.