publications
publications by category in reverse chronological order
2024
- DaYu: Optimizing Distributed Scientific Workflows by Decoding Dataflow Semantics and Dynamics. Meng Tang, Jaime Cernuda, Jie Ye, Luanzheng Guo, Nathan R. Tallent, Anthony Kougkas, and Xian-He Sun. In IEEE International Conference on Cluster Computing, Sep 2024.
The combination of ever-growing scientific datasets and distributed workflow complexity creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the increasing use of descriptive data formats (e.g., HDF5, netCDF) helps organize these datasets, it also introduces obscure bottlenecks due to the need to translate high-level operations into file addresses and then into low-level I/O operations. To address this challenge, we introduce DaYu, a method and toolset for analyzing (a) semantic relationships between logical datasets and file addresses, (b) how dataset operations translate into I/O, and (c) the combination across entire workflows. DaYu’s analysis and visualization enable the identification of critical bottlenecks and the reasoning about remediation. We describe our methodology and propose optimization guidelines. Evaluation on scientific workflows demonstrates up to a 3.7x performance improvement in I/O time for obscure bottlenecks. The time and storage overhead for DaYu’s time-ordered data are typically under 0.2% of runtime and 0.25% of data volume, respectively.
- An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads. Luke Logan, Jay Lofstead, Xian-He Sun, and Anthony Kougkas. In ACM SIGOPS Operating Systems Review, Aug 2024.
Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
- Viper: A High-Performance I/O Framework for Transparently Updating, Storing, and Transferring Deep Neural Network Models. Jie Ye, Jaime Cernuda, Neeraj Rajesh, Keith Bateman, Orcun Yildiz, Tom Peterka, Arnur Nigmetov, Dmitriy Morozov, Xian-He Sun, Anthony Kougkas, and Bogdan Nicolae. In 53rd International Conference on Parallel Processing, Aug 2024.
Scientific workflows increasingly need to train a DNN model in real-time during an experiment (e.g. using ground truth from a simulation), while using it at the same time for inferences. Instead of sharing the same model instance, the training (producer) and inference server (consumer) often use different model replicas that are kept synchronized. In addition to efficient I/O techniques to keep the model replica of the producer and consumer synchronized, there is another important trade-off: frequent model updates enhance inference quality but may slow down training; infrequent updates may lead to less precise inference results. To address these challenges, we introduce Viper: a new I/O framework designed to determine a near-optimal checkpoint schedule and accelerate the delivery of the latest model updates. Viper builds an inference performance predictor to identify the optimal checkpoint schedule to balance the trade-off between training slowdown and inference quality improvement. It also creates a memory-first model transfer engine to accelerate model delivery through direct memory-to-memory communication. Our experiments show that Viper can reduce the model update latency by ≈ 9x using the GPU-to-GPU data transfer engine and ≈ 3x using the DRAM-to-DRAM host data transfer. The checkpoint schedule obtained from Viper’s predictor also demonstrates improved cumulative inference accuracy compared to the baseline of epoch-based solutions.
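To make the checkpoint-schedule trade-off concrete, here is a small illustrative sketch, not Viper's actual predictor; the cost model and parameter names are assumptions. It picks the update interval that balances the fraction of training time spent checkpointing against how stale the serving replica is allowed to become.

```python
# Toy illustration of the trade-off (not Viper's predictor): pick the checkpoint interval k
# that balances training slowdown against the staleness of the serving replica.
def choose_checkpoint_interval(ckpt_cost_s, step_time_s, staleness_weight, max_interval=256):
    """All arguments are assumed, user-estimated quantities.

    ckpt_cost_s      -- seconds one model checkpoint/transfer adds to training
    step_time_s      -- seconds per training step
    staleness_weight -- penalty per training step the inference replica lags behind
    """
    best_k, best_cost = 1, float("inf")
    for k in range(1, max_interval + 1):
        slowdown = ckpt_cost_s / (k * step_time_s)   # fraction of training time spent on updates
        staleness = staleness_weight * (k - 1) / 2   # average number of steps the replica lags
        cost = slowdown + staleness
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

print(choose_checkpoint_interval(ckpt_cost_s=2.0, step_time_s=0.5, staleness_weight=0.001))
```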
- HStream: A hierarchical data streaming engine for high-throughput scientific applications. Jaime Cernuda, Jie Ye, Anthony Kougkas, and Xian-He Sun. In 53rd International Conference on Parallel Processing, Aug 2024.
Data streaming is gaining traction in high-performance computing (HPC) as a mechanism for continuous data transfer, but remains underutilized as a processing paradigm due to the inadequacy of existing technologies, which are primarily designed for cloud architectures and ill-equipped to tackle HPC-specific challenges. This work introduces HStream, a novel data management design for out-of-core data streaming engines. Central to the HStream design is the separation of data and computing planes at the task level. By managing them independently, issues such as memory thrashing and back-pressure, caused by the high volume, velocity, and burstiness of I/O in HPC environments, can be effectively addressed at runtime. Specifically, HStream utilizes adaptive parallelism and hierarchical memory management, enabled by this design paradigm, to alleviate memory pressure and enhance system performance. These improvements enable HStream to match the performance of state-of-the-art HPC streaming engines and achieve up to a 1.5x reduction in latency under high data loads.
- Hades: A Context-Aware Active Storage Framework for Accelerating Large-Scale Data Analysis. Jaime Cernuda, Luke Logan, Ana Gainaru, Scott Klasky, Jay Lofstead, Anthony Kougkas, and Xian-He Sun. In 24th International Symposium on Cluster, Cloud and Internet Computing, May 2024.
Modern simulation workflows generate and analyze massive amounts of data using I/O libraries like Adios2 and NetCDF. Although extensive work has optimized the I/O processes during the simulation phase, executing analytical queries, which often require iterative traversals of large files for insights, is cumbersome and usually constrained by low I/O performance. Instead of waiting for the analysis phase to process queries, quantities can be derived asynchronously during data production and cached, speeding up future queries. In this work, we introduce a context-aware I/O layer named Hades, designed to efficiently derive insights from selected quantities without compromising overall workflow performance. Hades actively and asynchronously computes and stores these quantities while the data is in transit. Hades leverages a hierarchical buffering system with data access-aware prefetching to ensure quick and timely access to relevant data. It offers a flexible query interface empowering users to easily define derived quantities and provide control over data placement decisions. Hades is implemented using an Adios2 plugin engine and the Hermes buffering platform, enabling transparent use by any Adios-powered application or workflow. Experimental results demonstrate performance improvements of up to 3-4x for tested real-world scientific producer-consumer workflows.
- TunIO: An AI-powered Framework for Optimizing HPC I/O. Neeraj Rajesh, Keith Bateman, Jean Luca Bez, Suren Byna, Anthony Kougkas, and Xian-He Sun. In International Parallel and Distributed Processing Symposium, May 2024.
I/O operations are a known performance bottleneck of HPC applications. To achieve good performance, users often employ an iterative multistage tuning process to find an optimal I/O stack configuration. However, an I/O stack contains multiple layers, such as high-level I/O libraries, I/O middleware, and parallel file systems, and each layer has many parameters. These parameters and layers are entangled and influence each other. The tuning process is time-consuming and complex. In this work, we present TunIO, an AI-powered I/O tuning framework that implements several techniques to balance the tuning cost and performance gain, including tuning the high-impact parameters first. Furthermore, TunIO analyzes the application source code to extract its I/O kernel while retaining all statements necessary to perform I/O. It utilizes a smart selection of high-impact configuration parameters for the given tuning objective. Finally, it uses a novel Reinforcement Learning (RL)-driven early stopping mechanism to balance the cost and performance gain. Experimental results show that TunIO leads to a reduction of up to ≈73% in tuning time while achieving the same performance gain when compared to H5Tuner. It achieves a significant performance gain/cost of 208.4 MBps/min (I/O bandwidth for each minute spent in tuning) over existing approaches under our testing.
2023
- Data Flow Lifecycles for Optimizing Workflow Coordination. Hyungro Lee, Luanzheng Guo, Meng Tang, Jesun Firoz, Nathan Tallent, Anthony Kougkas, and Xian-He Sun. In International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2023.
A critical performance challenge in distributed scientific workflows is coordinating tasks and data flows on distributed resources. To guide these decisions, this paper introduces data flow lifecycle analysis. Workflows are commonly represented using directed acyclic graphs (DAGs). Data flow lifecycles (DFL) enrich task DAGs with data objects and properties that describe data flow and how tasks interact with that flow. Lifecycles enable analysis from several important perspectives: task, data, and data flow. We describe representation, measurement, analysis, visualization, and opportunity identification for DFLs. Our measurement is both distributed and scalable, using space that is constant per data file. We use lifecycles and opportunity analysis to reason about improved task placement and reduced data movement for five scientific workflows with different characteristics. Case studies show improvements of 15×, 1.9×, and 10–30×. Our work is implemented in the DataLife tool.
- IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems. Izzet Yildirim, Hariharan Devarajan, Anthony Kougkas, Xian-He Sun, and Kathryn Mohror. In SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Nov 2023.
I/O analysis is an essential task for improving the performance of scientific applications on high-performance computing (HPC) systems. However, current analysis tools, which often use data drilling techniques (iterative exploration for deeper insights), treat every query independently and do not optimize column data for data-slicing (extracting specific data subsets), resulting in subpar querying performance. In this paper, we designed IOMax, a tool for efficient data drilling analysis on large-scale I/O traces. IOMax utilizes a novel query optimization technique to improve the query performance by 8.6x while reducing the memory footprint required for analysis by 11x. Additionally, it employs data transformation techniques to improve data-slicing performance by up to 11.4x. In conclusion, IOMax optimizes I/O analysis for scientific workflows on the Lassen supercomputer, resulting in up to 7x improvement.
- An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads. Luke Logan, Jay Lofstead, Xian-He Sun, and Anthony Kougkas. In 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, May 2023.
Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
2022
- LuxIO: Intelligent Resource Provisioning and Auto-Configuration for Storage Services. Keith Bateman, Neeraj Rajesh, Jaime Cernuda Garcia, Luke Logan, Jie Ye, Stephen Herbein, Anthony Kougkas, and Xian-He Sun. In 29th International Conference on High Performance Computing, Data, and Analytics, Dec 2022.
Storage in HPC is typically a single Remote and Static Storage (RSS) resource. However, applications demonstrate diverse I/O requirements that can be better served by a multi-storage approach. Current practice employs ephemeral storage systems running on either node-local or shared storage resources. Yet, the burden of provisioning and configuring intermediate storage falls solely on the users, while global job schedulers offer little to no support for custom deployments. This lack of support often leads to over- or under-provisioning of resources and poorly configured storage systems. To mitigate this, we present LuxIO, an intelligent storage resource provisioning and auto-configuration service. LuxIO constructs storage deployments configured to best match I/O requirements. LuxIO-tuned storage services show performance improvements up to 2× across common applications and benchmarks, while introducing minimal overhead of 93.40 ms on top of existing job scheduling pipelines. LuxIO improves resource utilization by up to 25% in select workflows.
- LabStor: A modular and extensible platform for developing high-performance, customized I/O stacks in userspace. Luke Logan, Jaime Cernuda Garcia, Jay Lofstead, Xian-He Sun, and Anthony Kougkas. In International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2022.
Traditionally, I/O systems have been developed within the confines of a centralized OS kernel. This led to monolithic and rigid storage systems that are limited by low development speed, expressiveness, and performance. Various assumptions are imposed, including reliance on the UNIX-file abstraction, the POSIX standard, and a narrow set of I/O policies. However, this monolithic design philosophy makes it difficult to develop and deploy new I/O approaches to satisfy the rapidly-evolving I/O requirements of modern scientific applications. To this end, we propose LabStor: a modular and extensible platform for developing high-performance, customized I/O stacks. Single-purpose I/O modules (e.g., I/O schedulers) can be developed in the comfort of userspace and released as plug-ins, while end-users can compose these modules to form workload- and hardware-specific I/O stacks. Evaluations show that by switching to a fully modular design, tailored I/O stacks can yield performance improvements of up to 60% in various applications.
- Stimulus: Accelerate Data Management for Scientific AI applications in HPC. Hariharan Devarajan, Anthony Kougkas, Huihuo Zheng, Venkatram Vishwanath, and Xian-He Sun. In 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, May 2022.
Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data to accelerate time-to-science and to reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate at scale complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with the StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus accelerates several large-scale applications with different use-cases, such as Cosmic Tagger (consuming an HDF5 dataset in PyTorch), Distributed FFN (consuming an HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow), by 5.3x, 2.9x, and 1.9x respectively, with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
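For context, the snippet below shows the kind of hand-written HDF5-to-PyTorch bridging that a library like Stimulus is meant to replace. It uses h5py directly; the file name and dataset key are illustrative assumptions, and this is not the StimOps/StimPack API.

```python
# Hypothetical sketch (not Stimulus's API): expose an HDF5 dataset to PyTorch by hand.
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):
    """Wrap one N-sample HDF5 dataset (key "images" is an assumed name) as PyTorch tensors."""
    def __init__(self, path, key="images"):
        self.path, self.key = path, key
        self._file = None  # opened lazily so each DataLoader worker gets its own handle

    def _data(self):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return self._file[self.key]

    def __len__(self):
        return self._data().shape[0]

    def __getitem__(self, idx):
        sample = self._data()[idx]        # one slice of the HDF5 dataset (a numpy array)
        return torch.from_numpy(sample)   # hand off to the AI framework

# loader = DataLoader(HDF5Dataset("cosmic_tagger.h5"), batch_size=32, num_workers=4)
```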
2021
- pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. Luke Logan, Jay Lofstead, Scott Levy, Patrick Widener, Xian-He Sun, and Anthony Kougkas. In International Conference on Cluster Computing, Sep 2021.
Persistent memory (PMEM) devices can achieve comparable performance to DRAM while providing significantly more capacity. This has made the technology compelling as an expansion to main memory. Rethinking PMEM as storage devices can offer a high performance buffering layer for HPC applications to temporarily, but safely store data. However, modern parallel I/O libraries, such as HDF5 and pNetCDF, are complicated and introduce significant software and metadata overheads when persisting data to these storage devices, wasting much of their potential. In this work, we explore the potential of PMEM as storage through pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. We demonstrate that our approach is up to 2x faster than other popular parallel I/O libraries under real workloads.
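As a rough analogue of the idea, the sketch below persists data through a memory mapping with plain Python, skipping any heavyweight I/O library. A real PMEM deployment would map a file from a DAX filesystem; the path and data layout here are made up for illustration.

```python
# Minimal sketch: treat a mapped file as memory and persist data with a plain memory copy.
import mmap
import struct

def persist_array(path, values):
    """Write a length-prefixed array of doubles straight through a memory mapping."""
    payload = struct.pack(f"<Q{len(values)}d", len(values), *values)
    with open(path, "wb") as f:
        f.truncate(len(payload))                  # size the backing file first
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(payload)) as m:
            m[:] = payload                        # the "memcpy" into the mapping
            m.flush()                             # msync: make the data durable

persist_array("/tmp/pmemcpy_demo.bin", [1.0, 2.5, 3.25])  # assumed demo path
```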
- HFlow: A dynamic and elastic multi-layered I/O forwarder. Jaime Cernuda, Hariharan Devarajan, Luke Logan, Keith Bateman, Neeraj Rajesh, Jie Ye, Anthony Kougkas, and Xian-He Sun. In International Conference on Cluster Computing, Sep 2021.
Modern applications are highly data-intensive, leading to the well-known I/O bottleneck problem. Scientists have proposed the placement of fast intermediate storage resources which aim to mask the I/O penalties. To manage these resources, three core software abstractions are being used in leadership-class computing facilities: I/O Forwarders, Burst Buffers, and Data Stagers. Yet, with the rise of multi-tenant deployment in HPC systems, these software abstractions are managed and maintained in isolation, leading to inefficient interactions; allocated statically, leading to load imbalance; exclusively bifurcated between the intermediate storage, leading to under-utilization of resources; and, in many cases, they do not support in-situ operations. To this end, we present HFlow, a new class of data forwarding system that leverages a real-time data movement paradigm. HFlow introduces a unified data movement abstraction (the ByteFlow), providing data-independent tasks that can be executed anywhere and thus enabling dynamic resource provisioning. Moreover, the processing elements executing the ByteFlows are designed to be ephemeral and hence enable elastic management of intermediate storage resources. Our results show that applications running under HFlow display an increase in performance of 3x when compared with state-of-the-art software solutions.
- Apollo: An ML-assisted real-time storage resource observer. Neeraj Rajesh, Hariharan Devarajan, Jaime Cernuda Garcia, Keith Bateman, Luke Logan, Jie Ye, Anthony Kougkas, and Xian-He Sun. In 30th International Symposium on High-Performance Parallel and Distributed Computing, Jun 2021.
Applications and middleware services, such as data placement engines, I/O scheduling, and prefetching engines, require low-latency access to telemetry data in order to make optimal decisions. However, typical monitoring services store their telemetry data in a database in order to allow applications to query them, resulting in significant latency penalties. This work presents Apollo: a low-latency monitoring service that aims to provide applications and middleware libraries with direct access to relational telemetry data. Monitoring the system can create interference and overhead, slowing down the raw performance of the resources for the job. However, having a current view of the system can aid middleware services in making more optimal decisions, which can ultimately improve the overall performance. Apollo has been designed from the ground up to provide low latency, using publish-subscribe (Pub-Sub) semantics, and low overhead, using adaptive intervals to adjust how long to wait between polls of a resource for telemetry data and machine learning to predict changes to the telemetry data between actual polls. This work also provides some high-level abstractions called I/O curators, which can further aid middleware libraries and applications in making optimal decisions. Evaluations showcase that Apollo can achieve sub-millisecond latency for acquiring complex insights with a memory overhead of 57 MB and a CPU overhead only 7% higher than existing state-of-the-art systems.
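The adaptive-interval idea can be illustrated with a minimal sketch, not Apollo's implementation; the probe function, thresholds, and bounds are assumptions. Polling backs off while the telemetry is stable and tightens when it changes.

```python
# Minimal sketch of adaptive-interval polling: poll a resource less often when its telemetry
# is stable and more often when it is volatile.
import time

def adaptive_poll(probe, min_interval=0.1, max_interval=5.0, threshold=0.05, rounds=20):
    """probe() returns a float metric (e.g., device utilization); yields (value, next interval)."""
    interval, last = min_interval, probe()
    for _ in range(rounds):
        time.sleep(interval)
        current = probe()
        change = abs(current - last) / (abs(last) + 1e-9)
        if change > threshold:
            interval = max(min_interval, interval / 2)    # volatile: poll more often
        else:
            interval = min(max_interval, interval * 1.5)  # stable: back off
        last = current
        yield current, interval

# Example with a fake probe; Apollo additionally predicts values between polls with ML.
for value, ivl in adaptive_poll(lambda: 0.4, rounds=3):
    print(value, ivl)
```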
- DLIO: A data-centric benchmark for scientific deep learning applications. Hariharan Devarajan, Huihuo Zheng, Anthony Kougkas, Xian-He Sun, and Venkatram Vishwanath. In 21st International Symposium on Cluster, Cloud and Internet Computing ║ Best Paper Award ║, May 2021.
Deep learning has been shown to be a successful method for various tasks, and its popularity has resulted in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize the computational performance of deep learning frameworks. However, the same cannot be said for I/O performance. As deep learning algorithms rely on big-data volume and variety to train neural networks accurately, I/O is a significant bottleneck for large-scale distributed deep learning training. This study aims to provide a detailed investigation of the I/O behavior of various scientific deep learning workloads running on the Theta supercomputer at Argonne Leadership Computing Facility. In this paper, we present DLIO, a novel representative benchmark suite built based on the I/O profiling of the selected workloads. DLIO can be utilized to accurately emulate the I/O behavior of modern scientific deep learning applications. Using DLIO, application developers and system software solution architects can identify potential I/O bottlenecks in their applications and guide optimizations to boost the I/O performance, leading to lower training times by up to 6.7x.
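The following toy script hints at what a benchmark like DLIO automates at far larger scale: replaying a deep-learning read pattern, batches of fixed-size samples per step, against a file and reporting the achieved read bandwidth. The file name and sizes are assumptions, and this is not DLIO's code.

```python
# Toy emulation of a deep-learning read pattern: read `samples_per_step` samples of
# `sample_bytes` each per training step and report the achieved bandwidth.
import os
import time

def emulate_training_reads(path, sample_bytes=1 << 20, samples_per_step=32, steps=10):
    size = os.path.getsize(path)
    with open(path, "rb", buffering=0) as f:
        for step in range(steps):
            t0 = time.perf_counter()
            for i in range(samples_per_step):
                # strided offsets, wrapped to stay inside the file
                offset = ((step * samples_per_step + i) * sample_bytes) % max(size - sample_bytes, 1)
                f.seek(offset)
                f.read(sample_bytes)
            dt = time.perf_counter() - t0
            print(f"step {step}: {samples_per_step * sample_bytes / dt / 2**20:.1f} MiB/s")

# emulate_training_reads("train_shard.bin")  # assumed dataset file
```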
2020
- HReplica: A dynamic data replication engine with adaptive compression for multi-tiered storage. Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun. In International Conference on Big Data, Dec 2020.
As the diversity of big data applications increases, their requirements diverge and often conflict with one another. Managing this diversity in any supercomputer or data center is a major challenge for system designers. Data replication is a popular approach to meet several of these requirements, such as low latency, read availability, durability, etc. This approach can be enhanced using modern heterogeneous hardware and software techniques such as data compression. However, both these enhancements work in isolation, to the detriment of both. In this work, we present HReplica: a dynamic data replication engine which harmoniously leverages data compression and hierarchical storage to increase the effectiveness of data replication. We have developed a novel dynamic selection algorithm that facilitates the optimal matching of replication schemes, compression libraries, and tiered storage. Our evaluation shows that HReplica can improve scientific and cloud application performance by 5.2x when compared to other state-of-the-art replication schemes.
- Bridging Storage Semantics Using Data Labels and Asynchronous I/O. Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. In ACM Transactions on Storage, Oct 2020.
In the era of data-intensive computing, large-scale applications, in both the scientific and BigData communities, demonstrate unique I/O requirements, leading to a proliferation of different storage devices and software stacks, many of which have conflicting requirements. Further, new hardware technologies and system designs create a hierarchical composition that may be ideal for computational storage operations. In this article, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a Label, a new data representation, and we present LABIOS: a new, distributed, Label-based I/O system. LABIOS boosts I/O performance by up to 17× via asynchronous I/O, supports heterogeneous storage resources, offers storage elasticity, and promotes in situ analytics and software-defined storage support via data provisioning. LABIOS demonstrates the effectiveness of storage bridging to support the convergence of HPC and BigData workloads on a single platform.
- ChronoLog: A Distributed Shared Tiered Log Store with Time-based Data Ordering. Anthony Kougkas, Hariharan Devarajan, Keith Bateman, Jaime Cernuda, Neeraj Rajesh, and Xian-He Sun. In 36th International Conference on Massive Storage Systems and Technology, Oct 2020.
Modern applications produce and process massive amounts of activity (or log) data. Traditional storage systems were not designed with an append-only data model, and a new storage abstraction aims to fill this gap: the distributed shared log store. However, existing solutions struggle to provide a scalable, parallel, and high-performance solution that can support a diverse set of conflicting log workload requirements. Finding the tail of a distributed log is a centralized point of contention. In this paper, we show how using physical time can help alleviate the need for centralized synchronization points. We present ChronoLog, a new, distributed, shared, and multi-tiered log store that can handle more than a million tail operations per second. Evaluation results show ChronoLog’s potential, outperforming existing solutions by an order of magnitude.
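A toy sketch of the ordering idea, not ChronoLog's implementation: if every record is tagged with physical time plus a client id, a total order emerges without asking a central server for the tail before each append. The class and field names below are invented for illustration.

```python
# Toy, single-process illustration: physical time replaces tail coordination as the ordering key.
import heapq
import time

class TinyTimeLog:
    def __init__(self):
        self._records = []                  # min-heap ordered by (timestamp, client_id)

    def append(self, client_id, payload):
        ts = time.time_ns()                 # each client stamps its own record
        heapq.heappush(self._records, (ts, client_id, payload))

    def replay(self):
        return [heapq.heappop(self._records) for _ in range(len(self._records))]

log = TinyTimeLog()
log.append("clientA", b"event-1")
log.append("clientB", b"event-2")
for ts, cid, data in log.replay():
    print(ts, cid, data)
```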
- HCL: Distributing parallel data structures in extreme scales. Hariharan Devarajan, Anthony Kougkas, Keith Bateman, and Xian-He Sun. In International Conference on Cluster Computing, Sep 2020.
Most parallel programs use irregular control flow and data structures, which are perfect for one-sided communication paradigms such as MPI or PGAS programming languages. However, these environments lack efficient function-based application libraries that can utilize popular communication fabrics such as TCP, InfiniBand (IB), and RDMA over Converged Ethernet (RoCE). Additionally, there is a lack of high-performance data structure interfaces. We present the Hermes Container Library (HCL), a high-performance distributed data structures library that offers high-level abstractions including hash-maps, sets, and queues. HCL uses an RPC over RDMA technology that implements a novel procedural programming paradigm. In this paper, we argue that an RPC over RDMA technology can serve as a high-performance, flexible, and coordination-free backend for implementing complex data structures. Evaluation results from testing real workloads show that HCL programs are 2x to 12x faster compared to BCL, a state-of-the-art distributed data structure library.
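HCL itself is a C++ library; as a loose, single-node analogue of the same usage model, the Python sketch below exposes a map through a server process that worker processes call into over local IPC rather than RPC over RDMA. It is only meant to show the "remote structure that feels like a local one" idea.

```python
# Loose analogue only (not HCL): a map served by a separate process, used like a local dict.
from multiprocessing import Manager, Process

def worker(shared_map, rank):
    shared_map[f"rank-{rank}"] = rank * rank   # looks local, actually lives in the manager process

if __name__ == "__main__":
    with Manager() as mgr:
        shared_map = mgr.dict()
        procs = [Process(target=worker, args=(shared_map, r)) for r in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(shared_map))                # {'rank-0': 0, 'rank-1': 1, 'rank-2': 4, 'rank-3': 9}
```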
- HFetch: Hierarchical data prefetching for scientific workflows in multi-tiered storage environments. Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun. In International Parallel and Distributed Processing Symposium, Jul 2020.
In the era of data-intensive computing, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency. However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application’s I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach can resolve challenges such as cache pollution and redundancy. In this study, we present HFetch, a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. We demonstrate the benefits of such an approach. Results show 10-35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
- HCompress: Hierarchical data compression for multi-tiered storage environments. Hariharan Devarajan, Anthony Kougkas, Luke Logan, and Xian-He Sun. In International Parallel and Distributed Processing Symposium, Jul 2020.
Modern scientific applications read and write massive amounts of data through simulations, observations, and analysis. These applications spend the majority of their runtime performing I/O. HPC storage solutions include fast node-local and shared storage resources to relieve applications from this bottleneck. Moreover, several middleware libraries (e.g., Hermes) have been proposed to move data between these tiers transparently. Data reduction is another technique that reduces the amount of data produced and, hence, improves I/O performance. These two technologies, if used together, can benefit from each other. The effectiveness of data compression can be enhanced by selecting different compression algorithms according to the characteristics of the different tiers, and the multi-tiered hierarchy can benefit from the extra capacity. In this paper, we design and implement HCompress, a hierarchical data compression library that can improve the application’s performance by harmoniously leveraging both multi-tiered storage and data compression. We have developed a novel compression selection algorithm that facilitates the optimal matching of compression libraries to the tiered storage. Our evaluation shows that HCompress can improve scientific applications’ performance by 7x when compared to other state-of-the-art tiered storage solutions.
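A hedged sketch of the tier-matching principle follows; the codecs, bandwidth figures, and thresholds are illustrative, not HCompress's selection algorithm. Fast tiers get cheap codecs so compression never dominates the write, while slow tiers spend more CPU for a better ratio.

```python
# Illustrative tier-aware codec selection: cheaper compression for faster tiers.
import lzma
import zlib

TIERS = {"nvme": 3000, "ssd": 500, "parallel_fs": 100}   # assumed tier bandwidths in MB/s

def codec_for_tier(bandwidth_mbps):
    if bandwidth_mbps >= 1000:
        return lambda buf: zlib.compress(buf, 1)          # very fast, modest ratio
    if bandwidth_mbps >= 300:
        return lambda buf: zlib.compress(buf, 6)          # balanced
    return lambda buf: lzma.compress(buf, preset=3)       # slow tier: spend CPU for ratio

def write_to_tier(tier, buf):
    compressed = codec_for_tier(TIERS[tier])(buf)
    print(f"{tier}: {len(buf)} -> {len(compressed)} bytes")
    return compressed

data = b"temperature,pressure,velocity\n" * 10000
for tier in TIERS:
    write_to_tier(tier, data)
```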
- I/O Acceleration via Multi-Tiered Data Buffering and Prefetching. Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. In International Journal of Computer Science and Technology, Jan 2020.
Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. Further, accessing data with high throughput and low latency is more imperative than ever. Data prefetching is a well-known technique for hiding read latency by requesting data before it is needed to move it from a high-latency medium (e.g., disk) to a low-latency one (e.g., main memory). However, existing solutions do not consider the new deep memory and storage hierarchy and also suffer from under-utilization of prefetching resources and unnecessary evictions. Additionally, existing approaches implement a client-pull model where understanding the application’s I/O behavior drives prefetching decisions. Moving towards exascale, where machines run multiple applications concurrently by accessing files in a workflow, a more data-centric approach resolves challenges such as cache pollution and redundancy. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Additionally, we demonstrate the benefits of a truly hierarchical data prefetcher that adopts a server-push approach to data prefetching. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms. Lastly, results show 10% to 35% performance gains over existing prefetchers and over 50% when compared to systems with no prefetching.
2019
- NIOBE: An intelligent I/O bridging engine for complex and distributed workflows. Kun Feng, Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun. In International Conference on Big Data, Dec 2019.
In the age of data-driven computing, integrating High Performance Computing (HPC) and Big Data (BD) environments may be the key to increasing productivity and to driving scientific discovery forward. Scientific workflows consist of diverse applications (i.e., HPC simulations and BD analysis), each with distinct representations of data that introduce a semantic barrier between the two environments. To solve scientific problems at scale, accessing semantically different data from different storage resources is the biggest unsolved challenge. In this work, we aim to address a critical question: “How can we exploit the existing resources and efficiently provide transparent access to data from/to both environments?” We propose the iNtelligent I/O Bridging Engine (NIOBE), a new data integration framework that enables integrated data access for scientific workflows with asynchronous I/O and data aggregation. NIOBE performs the data integration using available I/O resources, in contrast to existing optimizations that ignore the I/O nodes present on the data path. In NIOBE, data access is optimized to consider both the ongoing production and the consumption of the data in the future. Experimental results show that with NIOBE, an integrated scientific workflow can be accelerated by up to 10x when compared to a no-integration baseline and by up to 133% compared to other state-of-the-art integration solutions.
- LABIOS: A distributed label-based I/O system. Anthony Kougkas, Hariharan Devarajan, Jay Lofstead, and Xian-He Sun. In 28th International Symposium on High-Performance Parallel and Distributed Computing ║ Best Paper Award ║, Jun 2019.
Karsten Schwan Best Paper Award at HPDC 2019.
In the era of data-intensive computing, large-scale applications, in both the scientific and BigData communities, demonstrate unique I/O requirements, leading to a proliferation of different storage devices and software stacks, many of which have conflicting requirements. In this paper, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a Label, a new data representation, and we present LABIOS: a new, distributed, Label-based I/O system. LABIOS boosts I/O performance by up to 17x via asynchronous I/O, supports heterogeneous storage resources, offers storage elasticity, and promotes in-situ analytics via data provisioning. LABIOS demonstrates the effectiveness of storage bridging to support the convergence of HPC and BigData workloads on a single platform.
- An intelligent, adaptive, and flexible data compression framework. Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun. In 19th International Symposium on Cluster, Cloud and Grid Computing, May 2019.
The data explosion phenomenon in modern applications causes tremendous stress on storage systems. Developers use data compression, a size-reduction technique, to address this issue. However, each compression library exhibits different strengths and weaknesses when considering the input data type and format. We present Ares, an intelligent, adaptive, and flexible compression framework which can dynamically choose a compression library for a given input data based on the type of the workload and provides an appropriate infrastructure for users to fine-tune the chosen library. Ares is a modular framework which unifies several compression libraries while allowing the addition of more compression libraries by the user. Ares is a unified compression engine that abstracts the complexity of using different compression libraries for each workload. Evaluation results show that under real-world applications, from both scientific and Cloud domains, Ares performed 2-6x faster than competitive solutions with a low cost of additional data analysis (i.e., overheads around 10%) and up to 10x faster against a baseline of no compression at all.
2018
- Vidya: Performing code-block I/O characterization for data access optimization. Hariharan Devarajan, Anthony Kougkas, Prajwal Challa, and Xian-He Sun. In 25th International Conference on High Performance Computing, Dec 2018.
Understanding, characterizing, and tuning scientific applications’ I/O behavior is an increasingly complicated process in HPC systems. Existing tools use either offline profiling or online analysis to get insights into the applications’ I/O patterns. However, there is a lack of a clear formula to characterize applications’ I/O. Moreover, these tools are application-specific and do not account for multi-tenant systems. This paper presents Vidya, an I/O profiling framework which can predict an application’s I/O intensity using a new formula called Code-Block I/O Characterization (CIOC). Using CIOC, developers and system architects can tune an application’s I/O behavior and better match the underlying storage system to maximize performance. Evaluation results show that Vidya can predict an application’s I/O intensity with a variance of 0.05%. Vidya can profile applications with a high accuracy of 98% while reducing profiling time by 9x. We further show how Vidya can optimize an application’s I/O time by 3.7x.
- Harmonia: An interference-aware dynamic I/O scheduler for shared non-volatile burst buffers. Anthony Kougkas, Hariharan Devarajan, Xian-He Sun, and Jay Lofstead. In International Conference on Cluster Computing, Sep 2018.
Modern HPC systems employ burst buffer installations to reduce the peak I/O requirements for external storage and deal with the burstiness of I/O in modern scientific applications. These I/O buffering resources are shared between multiple applications that run concurrently. This leads to severe performance degradation due to contention, a phenomenon called cross-application I/O interference. In this paper, we first explore the negative effects of interference at the burst buffer layer and we present two new metrics that can quantitatively describe the slowdown applications experience due to interference. We introduce Harmonia, a new dynamic I/O scheduler that is aware of interference, adapts to the underlying system, implements a new 2-way decision-making process and employs several scheduling policies to maximize the system efficiency and applications’ performance. Our evaluation shows that Harmonia, through better I/O scheduling, can outperform by 3x existing state-of-the-art buffering management solutions and can lead to better resource utilization.
- Hermes: a heterogeneous-aware multi-tiered distributed I/O buffering system. Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. In 27th International Symposium on High-Performance Parallel and Distributed Computing, Jun 2018.
Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. The DMSH has demonstrated its strength and potential in practice. However, each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms.
- IRIS: I/O redirection via integrated storage. Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. In 2018 International Conference on Supercomputing, Jun 2018.
There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions consist of Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and of Object Stores for emerging cloud environments. More often than not, these storage solutions are tied to specific APIs and data models and thus bind developers, applications, and entire computing facilities to using certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex, consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system, called IRIS (i.e., I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software has capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems from both worlds. Experimental results show that IRIS can grant more than a 7x improvement in performance over existing solutions.
2017
- Rethinking key–value store for parallel I/O optimization. Anthony Kougkas, Hassan Eslami, Xian-He Sun, Rajeev Thakur, and William Gropp. In International Journal of High Performance Computing Applications, 2017.
Key–value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key–value stores.
2016
- Towards energy efficient data management in HPC: the open ethernet drive approach. Anthony Kougkas, Anthony Fleck, and Xian-He Sun. In 1st Joint International Workshop On Parallel Data Storage & Data Intensive Scalable Computing Systems, Nov 2016.
An Open Ethernet Drive (OED) is a new technology that encloses in a hard drive (HDD or SSD) a low-power processor, a fixed-size memory, and an Ethernet card. In this study, we thoroughly evaluate the performance of such a device and the energy requirements to operate it. The results show that, first, it is a viable solution to offload data-intensive computations to the OED while maintaining reasonable performance, and second, the energy savings from utilizing such a technology are significant, as it only consumes 10% of the power needed by a normal server node. We propose that by using OED devices as storage servers in HPC, we can run a reliable, scalable, cost- and energy-efficient storage solution.
- Leveraging burst buffer coordination to prevent I/O interference. Anthony Kougkas, Matthieu Dorier, Rob Latham, Rob Ross, and Xian-He Sun. In 12th International Conference on e-Science, Jun 2016.
Concurrent accesses to the shared storage resources in current HPC machines lead to severe performance degradation caused by I/O contention. In this study, we identify some key challenges to efficiently handling interleaved data accesses, and we propose a system-wide solution to optimize global performance. We implemented and tested several I/O scheduling policies, including prioritizing specific applications by leveraging burst buffers to defer the conflicting accesses from another application and/or directing the requests to different storage servers inside the parallel file system infrastructure. The results show that we mitigate the negative effects of interference and optimize the performance up to 2x depending on the selected I/O policy.
2015
- A Heterogeneity-Aware Region-Level Data Layout for Hybrid Parallel File Systems. Shuibing He, Xian-He Sun, Yang Wang, Anthony Kougkas, and Adnan Haider. In 44th International Conference on Parallel Processing, Dec 2015.
Parallel file systems (PFS) are commonly used in high-end computing systems. With the emergence of solid state drives (SSD), hybrid PFSs, which consist of both HDD and SSD servers, provide a practical I/O system solution for data-intensive applications. However, most existing PFS layout schemes are inefficient for hybrid PFSs due to their lack of awareness of the performance differences between heterogeneous servers and the workload changes between different parts of a file. This lack of recognition can result in severe I/O performance degradation. In this study, we propose a heterogeneity-aware region-level (HARL) data layout scheme to improve the data distribution of a hybrid PFS. HARL first divides a file into fine-grained, variably sized regions according to the changes of an application’s I/O workload, then chooses appropriate file stripe sizes on heterogeneous servers based on the server performance for each file region. Experimental results of representative benchmarks show that HARL can greatly improve the I/O system performance.
- Efficient disk-to-disk sorting: a case study in the decoupled execution paradigm. Hassan Eslami, Anthony Kougkas, Maria Kotsifakou, Theodoros Kasampalis, Kun Feng, Yin Lu, William Gropp, Xian-He Sun, Yong Chen, and Rajeev Thakur. In International Workshop on Data-Intensive Scalable Computing, Nov 2015.
Many applications foreseen for the exascale era will need to process huge amounts of data. However, the I/O infrastructure of current supercomputing architectures cannot be generalized to deal with this amount of data, due to the need for excessive data movement from storage layers to compute nodes, leading to limited scalability. There have been extensive studies addressing this challenge. The Decoupled Execution Paradigm (DEP) is an attractive solution due to its unique features, such as fast storage devices close to computational units and programmable units close to the file system. In this paper we study the effectiveness of DEP for a well-known data-intensive kernel, disk-to-disk (aka out-of-core) sorting. We propose an optimized algorithm that uses almost all features of DEP, pushing the performance of sorting in HPC even further compared to other existing solutions. Advantages in our algorithm are gained by exploiting programmable units close to the parallel file system to achieve higher I/O throughput, compressing data before sending it over the network or to disk, storing intermediate results of computation close to compute nodes, and fully overlapping I/O with computation. We also provide an analytical model for our proposed algorithm. Our algorithm achieves 30% better performance compared to the theoretically optimal sorting algorithm running on the same testbed but not designed to exploit the DEP architecture.