Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
Data Flow Lifecycles for Optimizing Workflow Coordination
Hyungro Lee, Luanzheng Guo, Meng Tang, Jesun Firoz, Nathan Tallent, and 2 more authors
In SC'23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2023
A critical performance challenge in distributed scientific workflows is coordinating tasks and data flows on distributed resources. To guide these decisions, this paper introduces data flow lifecycle analysis. Workflows are commonly represented using directed acyclic graphs (DAGs). Data flow lifecycles (DFL) enrich task DAGs with data objects and properties that describe data flow and how tasks interact with that flow. Lifecycles enable analysis from several important perspectives: task, data, and data flow. We describe representation, measurement, analysis, visualization, and opportunity identification for DFLs. Our measurement is both distributed and scalable, using space that is constant per data file. We use lifecycles and opportunity analysis to reason about improved task placement and reduced data movement for five scientific workflows with different characteristics. Case studies show improvements of 15×, 1.9×, and 10–30×. Our work is implemented in the DataLife tool.
IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems
Izzet Yildirim, Hariharan Devarajan, Anthony Kougkas, Xian-He Sun, and Kathryn Mohror
In Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Nov 2023
I/O analysis is an essential task for improving the performance of scientific applications on high-performance computing (HPC) systems. However, current analysis tools, which often use data drilling techniques (iterative exploration for deeper insights), treat every query independently and do not optimize column data for data-slicing (extracting specific data subsets), resulting in subpar querying performance. In this paper, we designed IOMax, a tool for efficient data drilling analysis on large-scale I/O traces. IOMax utilizes a novel query optimization technique to improve the query performance by 8.6x while reducing the memory footprint required for analysis by 11x. Additionally, it employs data transformation techniques to improve data-slicing performance by up to 11.4x. In conclusion, IOMax optimizes I/O analysis for scientific workflows on the Lassen supercomputer, resulting in up to 7x improvement.
2022
Stimulus: Accelerate Data Management for Scientific AI applications in HPC
Hariharan Devarajan, Anthony Kougkas, Huihuo Zheng, Venkatram Vishwanath, and Xian-He Sun
In CCGrid'22: Proceedings of the 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, May 2022
Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data to accelerate time-to-science and reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate at scale complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with the StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus accelerates several large-scale applications with different use cases, such as Cosmic Tagger (consuming an HDF5 dataset in PyTorch), Distributed FFN (consuming an HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow), by 5.3x, 2.9x, and 1.9x respectively, with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
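To make the data-ingestion problem concrete, the sketch below shows one common way an HDF5 dataset can be exposed to a PyTorch DataLoader. It only illustrates the pattern Stimulus targets; the file name, dataset name, and class are hypothetical, and this is not the StimOps/StimPack API.

```python
# Illustrative sketch only: exposing an HDF5 dataset to a PyTorch DataLoader.
# This is NOT the Stimulus API; file name, dataset name, and class are hypothetical.
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class HDF5SampleDataset(Dataset):
    """Lazily reads one sample per __getitem__ from an HDF5 dataset."""
    def __init__(self, path, dataset_name):
        self.path, self.dataset_name = path, dataset_name
        with h5py.File(path, "r") as f:
            self.length = f[dataset_name].shape[0]
        self._file = None  # opened lazily, once per DataLoader worker

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return torch.as_tensor(self._file[self.dataset_name][idx])

if __name__ == "__main__":
    # Create a tiny example file so the sketch is self-contained.
    with h5py.File("simulation_output.h5", "w") as f:
        f.create_dataset("images", data=np.random.rand(128, 16, 16).astype("float32"))
    loader = DataLoader(HDF5SampleDataset("simulation_output.h5", "images"),
                        batch_size=32, num_workers=2)
    for batch in loader:
        pass  # each batch would feed the training loop
```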
LabStor: A modular and extensible platform for developing high-performance, customized I/O stacks in userspace
Luke Logan, Jaime Cernuda Garcia, Jay Lofstead, Xian-He Sun, and Anthony Kougkas
In SC'22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2022
Traditionally, I/O systems have been developed within the confines of a centralized OS kernel. This has led to monolithic and rigid storage systems that suffer from low development speed, limited expressiveness, and suboptimal performance. Various assumptions are imposed, including reliance on the UNIX file abstraction, the POSIX standard, and a narrow set of I/O policies. However, this monolithic design philosophy makes it difficult to develop and deploy new I/O approaches to satisfy the rapidly evolving I/O requirements of modern scientific applications. To this end, we propose LabStor: a modular and extensible platform for developing high-performance, customized I/O stacks. Single-purpose I/O modules (e.g., I/O schedulers) can be developed in the comfort of userspace and released as plug-ins, while end-users can compose these modules to form workload- and hardware-specific I/O stacks. Evaluations show that by switching to a fully modular design, tailored I/O stacks can yield performance improvements of up to 60% in various applications.
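As a rough illustration of the composability idea (not LabStor code), the sketch below chains single-purpose, userspace I/O "modules" into a custom stack; the module names and request format are invented.

```python
# Conceptual sketch (not LabStor code): composing single-purpose userspace I/O
# "modules" into a custom stack. Module names and the request format are invented.
from abc import ABC, abstractmethod

class IOModule(ABC):
    @abstractmethod
    def submit(self, request: dict) -> dict: ...

class RamBackend(IOModule):
    """Terminal module: stores blocks in memory."""
    def __init__(self): self.blocks = {}
    def submit(self, request):
        self.blocks[request["offset"]] = request["data"]
        return {"status": "ok", "offset": request["offset"]}

class FifoScheduler(IOModule):
    """Pass-through scheduler: forwards requests in arrival order."""
    def __init__(self, backend): self.backend = backend
    def submit(self, request): return self.backend.submit(request)

class DedupModule(IOModule):
    """Skips writes whose payload was already stored."""
    def __init__(self, backend): self.backend, self.seen = backend, {}
    def submit(self, request):
        key = hash(request["data"])
        if key in self.seen:
            return {"status": "duplicate", "offset": self.seen[key]}
        self.seen[key] = request["offset"]
        return self.backend.submit(request)

# End users compose the stack to match the workload and hardware:
stack = DedupModule(FifoScheduler(RamBackend()))
print(stack.submit({"offset": 0, "data": b"hello"}))
print(stack.submit({"offset": 4096, "data": b"hello"}))  # absorbed by the dedup module
```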
LuxIO: Intelligent Resource Provisioning and Auto-Configuration for Storage Services
Keith Bateman, Neeraj Rajesh, Jaime Cernuda Garcia, Luke Logan, Jie Ye, and 3 more authors
In HiPC'22: Proceedings of the 29th International Conference on High Performance Computing, Data, and Analytics, Dec 2022
Storage in HPC is typically a single Remote and Static Storage (RSS) resource. However, applications demonstrate diverse I/O requirements that can be better served by a multi-storage approach. Current practice employs ephemeral storage systems running on either node-local or shared storage resources. Yet, the burden of provisioning and configuring intermediate storage falls solely on the users, while global job schedulers offer little to no support for custom deployments. This lack of support often leads to over- or under-provisioning of resources and poorly configured storage systems. To mitigate this, we present LuxIO, an intelligent storage resource provisioning and auto-configuration service. LuxIO constructs storage deployments configured to best match I/O requirements. LuxIO-tuned storage services show performance improvements up to 2× across common applications and benchmarks, while introducing minimal overhead of 93.40 ms on top of existing job scheduling pipelines. LuxIO improves resource utilization by up to 25% in select workflows.
2021
DLIO: A data-centric benchmark for scientific deep learning applications
Hariharan Devarajan, Huihuo Zheng, Anthony Kougkas, Xian-He Sun, and Venkatram Vishwanath
In CCGrid'21: Proceedings of the 21st International Symposium on Cluster, Cloud and Internet Computing ║ Best Paper Award ║, May 2021
Deep learning has been shown to be a successful method for various tasks, and its popularity has resulted in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize the computational performance of deep learning frameworks. However, the same cannot be said for I/O performance. Because deep learning algorithms rely on big-data volume and variety to train neural networks accurately, I/O is a significant bottleneck for large-scale distributed deep learning training. This study aims to provide a detailed investigation of the I/O behavior of various scientific deep learning workloads running on the Theta supercomputer at the Argonne Leadership Computing Facility. In this paper, we present DLIO, a novel representative benchmark suite built based on the I/O profiling of the selected workloads. DLIO can be utilized to accurately emulate the I/O behavior of modern scientific deep learning applications. Using DLIO, application developers and system software solution architects can identify potential I/O bottlenecks in their applications and guide optimizations to boost I/O performance, leading to training times that are lower by up to 6.7x.
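The sketch below illustrates, in miniature, what a data-centric I/O benchmark of this kind does: replay a training job's access pattern (read a batch of samples, then emulate a compute step) without the model itself. All parameters are made up; this is not DLIO.

```python
# Minimal sketch of the idea behind an I/O benchmark like DLIO: replay the access
# pattern of a training job without the real model. Parameters are invented.
import os, random, tempfile, time

def generate_dataset(directory, num_files=8, sample_bytes=4096, samples_per_file=16):
    for i in range(num_files):
        with open(os.path.join(directory, f"file_{i}.bin"), "wb") as f:
            f.write(os.urandom(sample_bytes * samples_per_file))

def emulate_epoch(directory, batch_size=4, sample_bytes=4096, compute_time_s=0.01):
    files = sorted(os.listdir(directory))
    random.shuffle(files)                      # emulate per-epoch shuffling
    t0, read_bytes = time.perf_counter(), 0
    for i in range(0, len(files), batch_size):
        for name in files[i:i + batch_size]:
            with open(os.path.join(directory, name), "rb") as f:
                read_bytes += len(f.read(sample_bytes))  # read one sample per file
        time.sleep(compute_time_s)             # stand-in for the training step
    return read_bytes, time.perf_counter() - t0

with tempfile.TemporaryDirectory() as d:
    generate_dataset(d)
    nbytes, secs = emulate_epoch(d)
    print(f"read {nbytes} bytes in {secs:.3f}s")
```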
Apollo: An ML-assisted real-time storage resource observer
Neeraj Rajesh, Hariharan Devarajan, Jaime Cernuda Garcia, Keith Bateman, Luke Logan, and 3 more authors
In HPDC'21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, Jun 2021
Applications and middleware services, such as data placement engines, I/O scheduling, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database to allow applications to query them, resulting in significant latency penalties. This work presents Apollo: a low-latency monitoring service that aims to provide applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down the raw performance of the resources for the job. However, having a current view of the system can aid middleware services in making more optimal decisions, which can ultimately improve overall performance. Apollo has been designed from the ground up to provide low latency, using publish-subscribe (pub-sub) semantics, and low overhead, using adaptive intervals to change the time between resource polls and machine learning to predict changes to the telemetry data between actual polls. This work also provides high-level abstractions called I/O curators, which can further aid middleware libraries and applications in making optimal decisions. Evaluations showcase that Apollo can achieve sub-millisecond latency for acquiring complex insights with a memory overhead of 57 MB and a CPU overhead only 7% higher than existing state-of-the-art systems.
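A minimal sketch of the adaptive-interval idea described above, assuming a hypothetical telemetry probe: poll more often when the metric is changing and back off when it is stable. This is an illustration, not Apollo's implementation.

```python
# Conceptual sketch of adaptive polling intervals; thresholds and the metric
# source (read_disk_util) are hypothetical, not Apollo's actual probes.
import random, time

def read_disk_util():                 # stand-in for a real telemetry probe
    return random.uniform(0.0, 1.0)

def monitor(duration_s=2.0, min_interval=0.05, max_interval=0.8, threshold=0.1):
    interval, last = min_interval, read_disk_util()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read_disk_util()
        if abs(current - last) > threshold:
            interval = max(min_interval, interval / 2)   # volatile: poll more often
        else:
            interval = min(max_interval, interval * 2)   # stable: back off
        last = current
        print(f"util={current:.2f} next poll in {interval:.2f}s")

monitor()
```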
pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory
Luke Logan, Jay Lofstead, Scott Levy, Patrick Widener, Xian-He Sun, and 1 more author
In Cluster'21: Proceedings of the International Conference on Cluster Computing, Sep 2021
Persistent memory (PMEM) devices can achieve comparable performance to DRAM while providing significantly more capacity. This has made the technology compelling as an expansion to main memory. Rethinking PMEM as a storage device can offer a high-performance buffering layer where HPC applications can temporarily, but safely, store data. However, modern parallel I/O libraries, such as HDF5 and pNetCDF, are complicated and introduce significant software and metadata overheads when persisting data to these storage devices, wasting much of their potential. In this work, we explore the potential of PMEM as storage through pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. We demonstrate that our approach is up to 2x faster than other popular parallel I/O libraries under real workloads.
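The spirit of the approach, load/store-style persistence with minimal library machinery, can be sketched as below using an ordinary memory-mapped file as a stand-in for PMEM; a real deployment would target a DAX-mounted device (e.g., via libpmem). The pool layout and put/get helpers are hypothetical, not the pMEMCPY API.

```python
# Rough sketch of "just memcpy into persistent memory," emulated with a memory-
# mapped file. POOL and the put/get layout are invented for illustration.
import mmap, os, struct

POOL, POOL_SIZE = "pool.bin", 1 << 20

with open(POOL, "wb") as f:
    f.truncate(POOL_SIZE)                     # pre-size the "pool"

with open(POOL, "r+b") as f:
    pmem = mmap.mmap(f.fileno(), POOL_SIZE)

    def put(offset, payload: bytes):
        pmem[offset:offset + 8] = struct.pack("<Q", len(payload))  # length header
        pmem[offset + 8:offset + 8 + len(payload)] = payload
        pmem.flush()                          # persist (real PMEM code flushes just the dirty range)

    def get(offset) -> bytes:
        (length,) = struct.unpack("<Q", pmem[offset:offset + 8])
        return bytes(pmem[offset + 8:offset + 8 + length])

    put(0, b"checkpoint step 42")
    print(get(0))
    pmem.close()

os.remove(POOL)
```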
HFlow: A dynamic and elastic multi-layered I/O forwarder
Jaime Cernuda, Hariharan Devarajan, Luke Logan, Keith Bateman, Neeraj Rajesh, and 3 more authors
In Cluster'21: Proceedings of the International Conference on Cluster Computing, Sep 2021
Modern applications are highly data-intensive, leading to the well-known I/O bottleneck problem. Scientists have proposed the placement of fast intermediate storage resources which aim to mask the I/O penalties. To manage these resources, three core software abstractions are being used in leadership-class computing facilities: I/O forwarders, burst buffers, and data stagers. Yet, with the rise of multi-tenant deployments in HPC systems, these software abstractions are managed and maintained in isolation, leading to inefficient interactions; allocated statically, leading to load imbalance; and exclusively bifurcated between the intermediate storage, leading to under-utilization of resources; in many cases, they also do not support in-situ operations. To this end, we present HFlow, a new class of data forwarding system that leverages a real-time data movement paradigm. HFlow introduces a unified data movement abstraction (the ByteFlow) providing data-independent tasks that can be executed anywhere, thus enabling dynamic resource provisioning. Moreover, the processing elements executing the ByteFlows are designed to be ephemeral and, hence, enable elastic management of intermediate storage resources. Our results show that applications running under HFlow display a 3x increase in performance when compared with state-of-the-art software solutions.
2020
I/O Acceleration via Multi-Tiered Data Buffering and Prefetching
Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun
In JCST'20: Journal of Computer Science and Technology, Jan 2020
Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of the DMSH is an independent heterogeneous system, and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. Further, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium (e.g., disk) to a low-latency one (e.g., main memory). However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach resolves challenges such as cache pollution and redundancy. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers, and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Additionally, we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms state-of-the-art buffering platforms by more than 2x. Lastly, results show 10% to 35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
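As a toy illustration of placement over a DMSH (not Hermes code, and not one of its three published policies), the sketch below places each buffer in the fastest tier that still has capacity and spills the rest downward; the tier characteristics are invented.

```python
# Illustrative sketch of a greedy multi-tier placement policy; tier numbers are invented.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    bandwidth_gbps: float
    capacity_mb: float
    used_mb: float = 0.0

def place(buffer_sizes_mb, tiers):
    tiers = sorted(tiers, key=lambda t: t.bandwidth_gbps, reverse=True)
    placement = []
    for size in buffer_sizes_mb:
        for tier in tiers:
            if tier.used_mb + size <= tier.capacity_mb:
                tier.used_mb += size
                placement.append((size, tier.name))
                break
        else:
            placement.append((size, "parallel file system"))  # everything full: go remote
    return placement

tiers = [Tier("DRAM", 90, 64), Tier("NVMe", 6, 512), Tier("burst buffer", 2, 4096)]
print(place([32, 48, 400, 300], tiers))
```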
HFetch: Hierarchical data prefetching for scientific workflows in multi-tiered storage environments
Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun
In IPDPS'20: Proceedings of the International Parallel and Distributed Processing Symposium, Jul 2020
In the era of data-intensive computing, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency. However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application's I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach can resolve challenges such as cache pollution and redundancy. In this study, we present HFetch, a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. We demonstrate the benefits of such an approach. Results show 10-35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
HCompress: Hierarchical data compression for multi-tiered storage environments
Hariharan Devarajan, Anthony Kougkas, Luke Logan, and Xian-He Sun
In IPDPS'20: Proceedings of the International Parallel and Distributed Processing Symposium, Jul 2020
Modern scientific applications read and write massive amounts of data through simulations, observations, and analysis. These applications spend the majority of their runtime performing I/O. HPC storage solutions include fast node-local and shared storage resources to relieve applications of this bottleneck. Moreover, several middleware libraries (e.g., Hermes) have been proposed to move data between these tiers transparently. Data reduction is another technique that reduces the amount of data produced and, hence, improves I/O performance. These two technologies, if used together, can benefit from each other. The effectiveness of data compression can be enhanced by selecting different compression algorithms according to the characteristics of the different tiers, and the multi-tiered hierarchy can benefit from the extra capacity. In this paper, we design and implement HCompress, a hierarchical data compression library that can improve the application's performance by harmoniously leveraging both multi-tiered storage and data compression. We have developed a novel compression selection algorithm that facilitates the optimal matching of compression libraries to the tiered storage. Our evaluation shows that HCompress can improve a scientific application's performance by 7x when compared to other state-of-the-art tiered storage solutions.
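The sketch below conveys the core matching idea in a highly simplified form, choosing heavier compression for slower tiers, using Python's standard codecs; the thresholds and mapping are invented and do not reflect HCompress's actual selection algorithm.

```python
# Conceptual sketch of matching compression effort to storage tier speed.
# Bandwidth thresholds and codec choices are invented for illustration.
import lzma, zlib

def choose_codec(tier_bandwidth_mbps):
    if tier_bandwidth_mbps > 10_000:      # e.g., a DRAM-like tier: skip compression
        return "none", lambda d: d
    if tier_bandwidth_mbps > 1_000:       # e.g., an NVMe tier: cheap compression
        return "zlib-1", lambda d: zlib.compress(d, 1)
    return "lzma", lzma.compress          # slow shared storage: spend CPU to save bytes

data = b"scientific output block " * 4096
for bw in (24_000, 3_000, 200):
    name, codec = choose_codec(bw)
    out = codec(data)
    print(f"{bw:>6} MB/s tier -> {name:7s} ({len(data)} -> {len(out)} bytes)")
```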
HCL: Distributing parallel data structures in extreme scales
Hariharan Devarajan, Anthony Kougkas, Keith Bateman, and Xian-He Sun
In Cluster'20: Proceedings of the International Conference on Cluster Computing, Sep 2020
Most parallel programs use irregular control flow and data structures, which are well suited to one-sided communication paradigms such as MPI or PGAS programming languages. However, these environments lack efficient function-based application libraries that can utilize popular communication fabrics such as TCP, InfiniBand (IB), and RDMA over Converged Ethernet (RoCE). Additionally, there is a lack of high-performance data structure interfaces. We present the Hermes Container Library (HCL), a high-performance distributed data structures library that offers high-level abstractions including hash-maps, sets, and queues. HCL uses an RPC-over-RDMA technology that implements a novel procedural programming paradigm. In this paper, we argue that RPC over RDMA can serve as a high-performance, flexible, and coordination-free backend for implementing complex data structures. Evaluation results from testing real workloads show that HCL programs are 2x to 12x faster than BCL, a state-of-the-art distributed data structure library.
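HCL itself is a C++ library built on RPC over RDMA; the sketch below only illustrates the underlying idea of a key-space-partitioned map where each operation is routed to the node that owns the key, with local dictionaries standing in for remote containers.

```python
# Language-agnostic sketch of a partitioned distributed map; not the HCL API.
class DistributedMapSketch:
    def __init__(self, num_nodes):
        # One dict per "node"; in HCL these would live in separate processes
        # and be reached via RPC/RDMA rather than a local list.
        self.shards = [dict() for _ in range(num_nodes)]

    def _owner(self, key):
        return hash(key) % len(self.shards)   # key space partitioning

    def put(self, key, value):
        self.shards[self._owner(key)][key] = value   # would be a remote call

    def get(self, key):
        return self.shards[self._owner(key)].get(key)

m = DistributedMapSketch(num_nodes=4)
m.put("particle/42", (1.0, 2.0, 3.0))
print(m.get("particle/42"))
```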
Bridging Storage Semantics Using Data Labels and Asynchronous I/O
Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun
In the era of data-intensive computing, large-scale applications, in both the scientific and the BigData communities, demonstrate unique I/O requirements leading to a proliferation of different storage devices and software stacks, many of which have conflicting requirements. Further, new hardware technologies and system designs create a hierarchical composition that may be ideal for computational storage operations. In this article, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a Label, a new data representation, and we present LABIOS: a new, distributed, Label-based I/O system. LABIOS boosts I/O performance by up to 17× via asynchronous I/O, supports heterogeneous storage resources, offers storage elasticity, and promotes in situ analytics and software-defined storage support via data provisioning. LABIOS demonstrates the effectiveness of storage bridging to support the convergence of HPC and BigData workloads on a single platform.
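A minimal sketch of the label concept as described here: an I/O request packaged as (operation, destination, data) and handed to a worker pool asynchronously. The classes and worker model are illustrative, not the LABIOS implementation.

```python
# Sketch of label-based asynchronous I/O; Label fields and the worker pool are invented.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import os, tempfile

@dataclass
class Label:
    operation: str      # e.g., "write"
    destination: str    # target path/device chosen by the system
    data: bytes

def execute(label: Label):
    if label.operation == "write":
        with open(label.destination, "ab") as f:
            f.write(label.data)
        return len(label.data)
    raise ValueError(f"unsupported operation {label.operation}")

tmpdir = tempfile.mkdtemp()
labels = [Label("write", os.path.join(tmpdir, f"shard_{i % 2}.bin"), b"x" * 1024)
          for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:              # the "worker" pool
    futures = [pool.submit(execute, lb) for lb in labels]    # caller continues immediately
    print(sum(f.result() for f in futures), "bytes written asynchronously")
```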
ChronoLog: A Distributed Shared Tiered Log Store with Time-based Data Ordering
Anthony Kougkas, Hariharan Devarajan, Keith Bateman, Jaime Cernuda, Neeraj Rajesh, and 1 more author
In MSST'20: 36th International Conference on Massive Storage Systems and Technology, Oct 2020
Modern applications produce and process massive amounts of activity (or log) data. Traditional storage systems were not designed for an append-only data model, and a new storage abstraction aims to fill this gap: the distributed shared log store. However, existing solutions struggle to provide a scalable, parallel, and high-performance solution that can support a diverse set of conflicting log workload requirements. Finding the tail of a distributed log is a centralized point of contention. In this paper, we show how using physical time can help alleviate the need for centralized synchronization points. We present ChronoLog, a new, distributed, shared, and multi-tiered log store that can handle more than a million tail operations per second. Evaluation results show ChronoLog's potential, outperforming existing solutions by an order of magnitude.
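A tiny sketch of the physical-time idea from the abstract, assuming a sufficiently synchronized clock: each node appends locally with a timestamp, and a reader derives the total order by merging per-node streams rather than contending on a shared tail pointer. This is an illustration, not ChronoLog code.

```python
# Sketch: timestamp-ordered appends instead of a centralized tail.
# time.time_ns() stands in for a synchronized physical clock.
import heapq, time

def append(local_log, payload):
    local_log.append((time.time_ns(), payload))   # node-local, no coordination

node_a, node_b = [], []
append(node_a, "event from node A")
append(node_b, "event from node B")
append(node_a, "another event from node A")

# A reader reconstructs the global order by merging timestamp-sorted streams.
for ts, payload in heapq.merge(node_a, node_b):
    print(ts, payload)
```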
HReplica: a dynamic data replication engine with adaptive compression for multi-tiered storage
Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun
In Big Data'20: Proceedings of the International Conference on Big Data, Dec 2020
As the diversity of big data applications increases, their requirements diverge and often conflict with one another. Managing this diversity in any supercomputer or data center is a major challenge for system designers. Data replication is a popular approach to meet several of these requirements, such as low latency, read availability, and durability. This approach can be enhanced using modern heterogeneous hardware and software techniques such as data compression. However, these two enhancements typically work in isolation, to the detriment of both. In this work, we present HReplica: a dynamic data replication engine which harmoniously leverages data compression and hierarchical storage to increase the effectiveness of data replication. We have developed a novel dynamic selection algorithm that facilitates the optimal matching of replication schemes, compression libraries, and tiered storage. Our evaluation shows that HReplica can improve scientific and cloud application performance by 5.2x when compared to other state-of-the-art replication schemes.
2019
An intelligent, adaptive, and flexible data compression framework
Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun
In CCGRID'19: Proceedings of the 19th International Symposium on Cluster, Cloud and Grid Computing, May 2019
The data explosion phenomenon in modern applications causes tremendous stress on storage systems. Developers use data compression, a size-reduction technique, to address this issue. However, each compression library exhibits different strengths and weaknesses depending on the input data type and format. We present Ares, an intelligent, adaptive, and flexible compression framework which can dynamically choose a compression library for a given input based on the type of the workload, and which provides an appropriate infrastructure for users to fine-tune the chosen library. Ares is a modular framework which unifies several compression libraries while allowing the user to add more. Ares is a unified compression engine that abstracts the complexity of using different compression libraries for each workload. Evaluation results show that under real-world applications, from both the scientific and Cloud domains, Ares performed 2-6x faster than competitive solutions with a low cost of additional data analysis (i.e., overheads of around 10%) and up to 10x faster than a baseline with no compression at all.
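The sketch below shows the shape of type-driven codec selection using Python's standard compression modules; the mapping from data type to library is invented for illustration and is not the policy Ares uses.

```python
# Toy illustration of per-type compression library selection; the mapping is invented.
import bz2, json, lzma, zlib

CODECS = {
    "text":   ("bz2",  bz2.compress),
    "binary": ("zlib", lambda d: zlib.compress(d, 6)),
    "json":   ("lzma", lzma.compress),
}

def compress(payload: bytes, data_type: str):
    name, fn = CODECS.get(data_type, ("zlib", zlib.compress))
    return name, fn(payload)

samples = {
    "text":   b"The quick brown fox jumps over the lazy dog. " * 200,
    "binary": bytes(range(256)) * 40,
    "json":   json.dumps({"step": 1, "values": list(range(500))}).encode(),
}
for data_type, payload in samples.items():
    name, out = compress(payload, data_type)
    print(f"{data_type:6s} -> {name:4s}: {len(payload)} -> {len(out)} bytes")
```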
LABIOS: A distributed label-based I/O system
Anthony Kougkas, Hariharan Devarajan, Jay Lofstead, and Xian-He Sun
In HPDC'19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing ║ Best Paper Award ║, Jun 2019
In the era of data-intensive computing, large-scale applications, in both the scientific and the BigData communities, demonstrate unique I/O requirements leading to a proliferation of different storage devices and software stacks, many of which have conflicting requirements. In this paper, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a Label, a new data representation, and we present LABIOS: a new, distributed, Label-based I/O system. LABIOS boosts I/O performance by up to 17x via asynchronous I/O, supports heterogeneous storage resources, offers storage elasticity, and promotes in-situ analytics via data provisioning. LABIOS demonstrates the effectiveness of storage bridging to support the convergence of HPC and BigData workloads on a single platform.
NIOBE: An intelligent i/o bridging engine for complex and distributed workflows
Kun Feng, Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun
In BigData'19: Proceedings of the International Conference on Big Data, Dec 2019
In the age of data-driven computing, integrating High Performance Computing (HPC) and Big Data (BD) environments may be the key to increasing productivity and to driving scientific discovery forward. Scientific workflows consist of diverse applications (i.e., HPC simulations and BD analysis), each with distinct representations of data that introduce a semantic barrier between the two environments. To solve scientific problems at scale, accessing semantically different data from different storage resources is the biggest unsolved challenge. In this work, we aim to address a critical question: "How can we exploit the existing resources and efficiently provide transparent access to data from/to both environments?" We propose the iNtelligent I/O Bridging Engine (NIOBE), a new data integration framework that enables integrated data access for scientific workflows with asynchronous I/O and data aggregation. NIOBE performs the data integration using available I/O resources, in contrast to existing optimizations that ignore the I/O nodes present on the data path. In NIOBE, data access is optimized to consider both the ongoing production of data and its future consumption. Experimental results show that with NIOBE, an integrated scientific workflow can be accelerated by up to 10x when compared to a no-integration baseline and by up to 133% compared to other state-of-the-art integration solutions.
2018
Hermes: a heterogeneous-aware multi-tiered distributed I/O buffering system
Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun
In HPDC'18: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, Jun 2018
Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of the DMSH is an independent heterogeneous system, and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers, and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms state-of-the-art buffering platforms by more than 2x.
IRIS: I/O redirection via integrated storage
Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun
In ICS'18: Proceedings of the 2018 International Conference on Supercomputing, Jun 2018
There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions consist of Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and of Object Stores for emerging cloud environments. More often than not, these storage solutions are tied to specific APIs and data models and thus bind developers, applications, and entire computing facilities to using certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex, consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system, called IRIS (i.e., I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems from both worlds. Experimental results show that IRIS can deliver more than a 7x improvement in performance over existing solutions.
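The unified-access idea can be sketched as a thin facade over interchangeable backends, as below; the backends and key mapping are simplified illustrations rather than IRIS's actual translation layer between POSIX files and objects.

```python
# Minimal sketch of one interface over two storage models; not IRIS code.
import os, tempfile

class FileBackend:
    """Stores each object as a file under a root directory."""
    def __init__(self, root): self.root = root
    def put(self, key, value):
        with open(os.path.join(self.root, key.replace("/", "_")), "wb") as f:
            f.write(value)
    def get(self, key):
        with open(os.path.join(self.root, key.replace("/", "_")), "rb") as f:
            return f.read()

class ObjectBackend:
    """Stores objects in a flat key-value namespace (in memory here)."""
    def __init__(self): self.objects = {}
    def put(self, key, value): self.objects[key] = value
    def get(self, key): return self.objects[key]

class UnifiedStore:
    def __init__(self, backend): self.backend = backend
    def write(self, name, data): self.backend.put(name, data)   # same application code,
    def read(self, name): return self.backend.get(name)         # either backend

for backend in (FileBackend(tempfile.mkdtemp()), ObjectBackend()):
    store = UnifiedStore(backend)
    store.write("run1/output.dat", b"results")
    print(type(backend).__name__, store.read("run1/output.dat"))
```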
Harmonia: An interference-aware dynamic I/O scheduler for shared non-volatile burst buffers
Anthony Kougkas, Hariharan Devarajan, Xian-He Sun, and Jay Lofstead
In CLUSTER'18: Proceedings of the International Conference on Cluster Computing, Sep 2018
Modern HPC systems employ burst buffer installations to reduce the peak I/O requirements for external storage and deal with the burstiness of I/O in modern scientific applications. These I/O buffering resources are shared between multiple applications that run concurrently. This leads to severe performance degradation due to contention, a phenomenon called cross-application I/O interference. In this paper, we first explore the negative effects of interference at the burst buffer layer and we present two new metrics that can quantitatively describe the slowdown applications experience due to interference. We introduce Harmonia, a new dynamic I/O scheduler that is aware of interference, adapts to the underlying system, implements a new 2-way decision-making process, and employs several scheduling policies to maximize system efficiency and applications' performance. Our evaluation shows that Harmonia, through better I/O scheduling, can outperform existing state-of-the-art buffering management solutions by 3x and can lead to better resource utilization.
Vidya: Performing code-block I/O characterization for data access optimization
Hariharan Devarajan, Anthony Kougkas, Prajwal Challa, and Xian-He Sun
In HiPC'18: Proceedings of the 25th International Conference on High Performance Computing, Data, and Analytics, Dec 2018
Understanding, characterizing, and tuning scientific applications' I/O behavior is an increasingly complicated process in HPC systems. Existing tools use either offline profiling or online analysis to gain insights into an application's I/O patterns. However, there is a lack of a clear formula to characterize an application's I/O. Moreover, these tools are application-specific and do not account for multi-tenant systems. This paper presents Vidya, an I/O profiling framework which can predict an application's I/O intensity using a new formula called Code-Block I/O Characterization (CIOC). Using CIOC, developers and system architects can tune an application's I/O behavior and better match the underlying storage system to maximize performance. Evaluation results show that Vidya can predict an application's I/O intensity with a variance of 0.05%. Vidya can profile applications with a high accuracy of 98% while reducing profiling time by 9x. We further show how Vidya can optimize an application's I/O time by 3.7x.
2017
Rethinking key–value store for parallel I/O optimization
Anthony Kougkas, Hassan Eslami, Xian-He Sun, Rajeev Thakur, and William Gropp
In IJHPCA'17: The International Journal of High Performance Computing Applications, 2017
Key–value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key–value stores.
2016
Leveraging burst buffer coordination to prevent I/O interference
Anthony Kougkas, Matthieu Dorier, Rob Latham, Rob Ross, and Xian-He Sun
In eScience'16: Proceedings of the 12th International Conference on e-Science, Jun 2016
Concurrent accesses to the shared storage resources in current HPC machines lead to severe performance degradation caused by I/O contention. In this study, we identify some key challenges to efficiently handling interleaved data accesses, and we propose a system-wide solution to optimize global performance. We implemented and tested several I/O scheduling policies, including prioritizing specific applications by leveraging burst buffers to defer the conflicting accesses from another application and/or directing the requests to different storage servers inside the parallel file system infrastructure. The results show that we can mitigate the negative effects of interference and improve performance by up to 2x, depending on the selected I/O policy.
Towards energy efficient data management in HPC: the open ethernet drive approach
Anthony Kougkas, Anthony Fleck, and Xian-He Sun
In PDSW-DISCS'16: Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, Nov 2016
An Open Ethernet Drive (OED) is a new technology that encloses into a hard drive (HDD or SSD) a low-power processor, a fixed-size memory, and an Ethernet card. In this study, we thoroughly evaluate the performance of such a device and the energy required to operate it. The results show that, first, it is a viable solution for offloading data-intensive computations onto the OED while maintaining reasonable performance, and second, the energy savings from utilizing such technology are significant, as it consumes only 10% of the power needed by a normal server node. We propose that by using OED devices as storage servers in HPC, we can run a reliable, scalable, cost- and energy-efficient storage solution.
2015
A Heterogeneity-Aware Region-Level Data Layout for Hybrid Parallel File Systems
Shuibing He, Xian-He Sun, Yang Wang, Anthony Kougkas, and Adnan Haider
In ICPP'15: Proceedings of the 44th International Conference on Parallel Processing, Dec 2015
Parallel file systems (PFS) are commonly used in high-end computing systems. With the emergence of solid state drives (SSD), hybrid PFSs, which consist of both HDD and SSD servers, provide a practical I/O system solution for data-intensive applications. However, most existing PFS layout schemes are inefficient for hybrid PFSs due to their lack of awareness of the performance differences between heterogeneous servers and of the workload changes between different parts of a file. This lack of awareness can result in severe I/O performance degradation. In this study, we propose a heterogeneity-aware region-level (HARL) data layout scheme to improve the data distribution of a hybrid PFS. HARL first divides a file into fine-grained, variable-sized regions according to the changes in an application's I/O workload, and then chooses appropriate file stripe sizes on the heterogeneous servers based on server performance for each file region. Experimental results of representative benchmarks show that HARL can greatly improve I/O system performance.
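A back-of-the-envelope sketch of region-level, heterogeneity-aware striping: split each region across servers in proportion to their measured throughput, rounded to a stripe granularity. The bandwidth numbers are invented, and this is not the HARL algorithm itself.

```python
# Toy illustration: size stripes proportionally to per-server bandwidth.
def stripe_sizes(region_bytes, server_bandwidth_mbps, granularity=4096):
    total_bw = sum(server_bandwidth_mbps)
    sizes = [int(region_bytes * bw / total_bw) // granularity * granularity
             for bw in server_bandwidth_mbps]
    sizes[0] += region_bytes - sum(sizes)      # hand the rounding remainder to server 0
    return sizes

# Two HDD servers (~150 MB/s) and two SSD servers (~1500 MB/s), one 64 MiB region:
print(stripe_sizes(64 * 1024 * 1024, [150, 150, 1500, 1500]))
```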
Efficient disk-to-disk sorting: a case study in the decoupled execution paradigm
Hassan Eslami, Anthony Kougkas, Maria Kotsifakou, Theodoros Kasampalis, Kun Feng, and 5 more authors
In DISCS'15: Proceedings of the International Workshop on Data-Intensive Scalable Computing Systems, Nov 2015
Many applications foreseen for the exascale era will need to process huge amounts of data. However, the I/O infrastructure of current supercomputing architectures cannot be generalized to deal with this amount of data, as the need for excessive data movement from the storage layers to the compute nodes leads to limited scalability. There have been extensive studies addressing this challenge. The Decoupled Execution Paradigm (DEP) is an attractive solution due to its unique features, such as fast storage devices close to the computational units and programmable units close to the file system. In this paper, we study the effectiveness of DEP for a well-known data-intensive kernel: disk-to-disk (a.k.a. out-of-core) sorting. We propose an optimized algorithm that uses almost all features of DEP, pushing the performance of sorting in HPC even further compared to other existing solutions. The advantages of our algorithm are gained by exploiting programmable units close to the parallel file system to achieve higher I/O throughput, compressing data before sending it over the network or to disk, storing intermediate results of the computation close to the compute nodes, and fully overlapping I/O with computation. We also provide an analytical model for our proposed algorithm. Our algorithm achieves 30% better performance than the theoretically optimal sorting algorithm running on the same testbed but not designed to exploit the DEP architecture.
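For reference, the sketch below shows a plain out-of-core (disk-to-disk) merge sort, i.e., the kernel being optimized; the DEP-specific techniques from the paper (near-storage compute, compression, overlapping I/O with computation) are deliberately not modeled here.

```python
# Baseline external sort: sorted runs on disk, then a k-way merge. Not the DEP algorithm.
import heapq, os, random, tempfile

def external_sort(values, run_size, workdir):
    runs = []
    for i in range(0, len(values), run_size):             # phase 1: write sorted runs
        run = sorted(values[i:i + run_size])
        path = os.path.join(workdir, f"run_{len(runs)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        runs.append(path)

    def stream(path):                                      # phase 2: k-way merge
        with open(path) as f:
            for line in f:
                yield int(line)

    out = os.path.join(workdir, "sorted.txt")
    with open(out, "w") as f:
        f.writelines(f"{v}\n" for v in heapq.merge(*map(stream, runs)))
    return out

with tempfile.TemporaryDirectory() as d:
    data = [random.randrange(1_000_000) for _ in range(10_000)]
    print("sorted output at", external_sort(data, run_size=2_000, workdir=d))
```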