• Stargate: Remote Data Access between Hadoop Clusters

      Choi, Illyoung; Hartman, John; Department of Computer Science, University of Arizona (Association for Computing Machinery (ACM), 2021-03)
      The transfer of large-scale datasets between geographically separated systems is a challenge in scientific computing, made even more complicated when the systems are clusters of computers. In this paper we present Stargate, a file system that enables efficient on-demand remote data access for Hadoop-based scientific computations. Stargate uses a content-addressable protocol, on-demand access, and multi-tier caching to address the challenges of large data transfers over a WAN. Stargate also uses a novel approach that co-locates computations and transfers to achieve efficient data access in cluster computing. Unlike other approaches, Stargate is implemented as an independent file system service that works with any computation framework. In our experiments Stargate’s performance on heavy I/O workloads was 7% faster than WebHDFS and only 8% slower than HDFS. In addition, Stargate’s caches effectively trade high-cost WAN traffic for low-cost LAN traffic. Stargate’s performance, on-demand data access, and reduction in WAN traffic make it a good platform for providing remote dataset access to scientific computations on clusters.
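
The abstract names content addressing and multi-tier caching but does not give protocol details, so the following is only a minimal sketch of the general idea, not Stargate's implementation. It assumes blocks are named by their SHA-256 digest and that there are two cache tiers (a per-node cache and a cache shared over the LAN) in front of a remote store reached over the WAN. The names `content_key`, `TieredCache`, `fetch_wan`, and the in-memory dictionaries are hypothetical and chosen for illustration.

```python
import hashlib
from typing import Callable, Dict


def content_key(data: bytes) -> str:
    """Name a block by the SHA-256 digest of its bytes (assumed hash choice)."""
    return hashlib.sha256(data).hexdigest()


class TieredCache:
    """Hypothetical two-tier cache: per-node memory first, then a shared LAN cache."""

    def __init__(self, lan_cache: Dict[str, bytes],
                 fetch_wan: Callable[[str], bytes]) -> None:
        self.local: Dict[str, bytes] = {}   # tier 1: this node only
        self.lan = lan_cache                # tier 2: shared by nodes in the cluster
        self.fetch_wan = fetch_wan          # last resort: remote cluster over the WAN
        self.stats = {"local": 0, "lan": 0, "wan": 0}

    def read(self, key: str) -> bytes:
        """Fetch a block on demand, trying the cheapest tier first."""
        if key in self.local:
            self.stats["local"] += 1
            return self.local[key]
        if key in self.lan:
            self.stats["lan"] += 1
            data = self.lan[key]
        else:
            self.stats["wan"] += 1
            data = self.fetch_wan(key)
            # Content addressing lets the reader verify what it received.
            if content_key(data) != key:
                raise ValueError("block does not match its content hash")
            self.lan[key] = data
        self.local[key] = data
        return data


# Stand-in for the remote cluster's data, keyed by content hash.
remote_store = {content_key(b): b for b in (b"block-A", b"block-B")}
shared_lan: Dict[str, bytes] = {}

node1 = TieredCache(shared_lan, remote_store.__getitem__)
node2 = TieredCache(shared_lan, remote_store.__getitem__)

key_a = content_key(b"block-A")
node1.read(key_a)  # first access in the cluster: crosses the WAN
node1.read(key_a)  # repeat on the same node: served locally
node2.read(key_a)  # another node: served from the shared LAN cache
```

In this toy run only the first read crosses the WAN; the later reads are served from the node or from the LAN cache, which illustrates the kind of WAN-for-LAN trade the abstract describes.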