As the creator and co-principal investigator of this $5M NSF-funded project, I am working to improve scientific data management. IOWarp aims to enhance how we handle data in modern scientific workflows, especially those involving AI. The project builds on my previous work in storage systems and I/O optimization, focusing on practical solutions to current challenges in scientific computing.
🎧 Audio overview (15:54)
Research Vision & Innovation
IOWarp emerged from a critical observation: modern scientific workflows, particularly those integrating AI, are severely constrained by traditional data management approaches. Leading a team of researchers across Illinois Institute of Technology, The HDF Group, and the University of Utah, we are developing a comprehensive platform that:
Bridges multiple worlds: seamlessly integrates HPC, big data, and AI workflows
Enables intelligence: incorporates LLM-driven data exploration with WarpGPT
Optimizes performance: leverages advanced hardware such as CXL and GPUDirect
Ensures adaptability: provides a flexible, plugin-based architecture
Technical Breakthroughs
Building on the success of the Hermes I/O buffering system, I designed several key innovations in IOWarp:
Content Assimilation Engine: a novel approach for unifying diverse data formats
Advanced storage integration: direct support for emerging storage technologies
ML-guided data placement: intelligent data movement across storage tiers
Content Exploration Interface: natural language-driven data analytics
System Architecture
The IOWarp architecture, which I conceptualized and developed with my team, consists of four major components:
Content Assimilation Engine (CAE): transforms diverse data formats into a unified representation
Content Transfer Engine (CTE): manages efficient data movement across storage tiers
Content Exploration Interface (CEI): provides LLM-powered data discovery capabilities
Platform Plugins Interface (PPI): enables seamless integration with external services
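To make the division of labor concrete, here is a minimal sketch of how the four components might compose into a pipeline. All class and method names are illustrative stand-ins, not the real IOWarp API:

```python
# Hypothetical sketch: the four IOWarp components wired into one pipeline.
# Every name here is invented for illustration.

class ContentAssimilationEngine:
    """CAE: normalize diverse input formats into one representation."""
    def assimilate(self, record):
        return {"format": record.get("format", "unknown"), "data": record["data"]}

class ContentTransferEngine:
    """CTE: place content onto storage tiers and move it between them."""
    def __init__(self):
        self.tiers = {"dram": [], "ssd": []}
    def place(self, blob, tier="dram"):
        self.tiers[tier].append(blob)

class ContentExplorationInterface:
    """CEI: answer queries over assimilated content (LLM-backed in IOWarp)."""
    def query(self, cte, text):
        return [b for tier in cte.tiers.values() for b in tier
                if text in str(b["data"])]

class PlatformPluginsInterface:
    """PPI: registry through which external services hook in."""
    def __init__(self):
        self.plugins = {}
    def register(self, name, fn):
        self.plugins[name] = fn

cae, cte, cei = (ContentAssimilationEngine(), ContentTransferEngine(),
                 ContentExplorationInterface())
blob = cae.assimilate({"format": "hdf5", "data": "temperature grid"})
cte.place(blob, tier="dram")
hits = cei.query(cte, "temperature")
```

The point of the sketch is the separation of concerns: ingestion (CAE), placement (CTE), discovery (CEI), and extension (PPI) evolve independently behind narrow interfaces.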
Impact on Scientific Computing
Under my leadership, IOWarp is already demonstrating significant impact across scientific domains:
Materials science: accelerating X-ray tomography analysis workflows by 7x
Climate modeling: enabling real-time data analysis for climate simulations
AI/ML research: supporting efficient model training and inference operations
Bioinformatics: streamlining large-scale genomic data processing
Team Development & Mentorship
A core aspect of my leadership on this project involves mentoring the next generation of researchers:
2 PhD students specializing in storage systems and AI
1 postdoctoral researcher in advanced data management
3 master's students serving as research assistants in system development
📖 Documentation: comprehensive guides and API references
📚 Educational materials: training modules and tutorials
🛠️ Development tools: CI/CD pipeline and testing infrastructure
Acknowledgements 🙏
This material is based upon work supported by the National Science Foundation. I thank my collaborators at The HDF Group and the University of Utah for their invaluable contributions.
For inquiries about collaboration opportunities or to learn more about the project, please feel free to reach out!
Large-scale data analytics, scientific simulation, and deep learning codes in HPC perform massive computations on data greatly exceeding the bounds of main memory. These out-of-core algorithms suffer from severe data movement penalties, programming complexity, and limited code reuse. To solve this, HPC sites have steadily increased DRAM capacity. However, this is not sustainable due to financial and environmental costs. A more elegant, low-cost, and portable solution is to expand memory to distributed multitiered storage. In this work, we propose MegaMmap: a software distributed shared memory (DSM) that enlarges effective memory capacity through intelligent tiered DRAM and storage management. MegaMmap provides workload-aware data organization, eviction, and prefetching policies to reduce DRAM consumption while ensuring speedy access to critical data. A variety of memory coherence optimizations are provided through an intuitive hinting system. Evaluations show that various workloads can be executed with a fraction of the DRAM while offering competitive performance.
@inproceedings{logan2024megammap,entry_type={conference},author={Logan, Luke and Kougkas, Anthony and Sun, Xian-He},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis SC},title={MegaMmap: Blurring the Boundary Between Memory and Storage for Data-Intensive Workloads},year={2024},month=nov,publisher={IEEE Computer Society},volume={},number={},pages={1725-1742},keywords={HPC, Systems Software, Memory Tiering, Storage Tiering},doi={10.1109/SC41406.2024.00114},url={https://dl.acm.org/doi/abs/10.1109/SC41406.2024.00114},}
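The core idea behind MegaMmap, keeping a bounded working set in DRAM and transparently spilling the rest to storage, can be illustrated with a toy LRU page buffer. This is not MegaMmap's implementation, just a minimal model of tiered eviction and reload:

```python
# Toy model of tiered DRAM+storage management (illustrative only): a bounded
# set of "pages" lives in DRAM; the least-recently-used page is evicted to a
# spill file and transparently reloaded on access.
from collections import OrderedDict
import tempfile, os

class TieredBuffer:
    def __init__(self, dram_pages=2):
        self.dram_pages = dram_pages
        self.dram = OrderedDict()        # page_id -> bytes, in LRU order
        self.backing = {}                # page_id -> (offset, size) in spill file
        self.spill = tempfile.TemporaryFile()

    def write(self, page_id, data: bytes):
        self.dram[page_id] = data
        self.dram.move_to_end(page_id)
        self._evict_if_needed()

    def read(self, page_id) -> bytes:
        if page_id in self.dram:         # DRAM hit
            self.dram.move_to_end(page_id)
            return self.dram[page_id]
        off, size = self.backing[page_id]  # "page fault": reload from storage
        self.spill.seek(off)
        data = self.spill.read(size)
        self.write(page_id, data)        # promote back into DRAM
        return data

    def _evict_if_needed(self):
        while len(self.dram) > self.dram_pages:
            victim, data = self.dram.popitem(last=False)   # evict LRU page
            self.spill.seek(0, os.SEEK_END)
            self.backing[victim] = (self.spill.tell(), len(data))
            self.spill.write(data)

buf = TieredBuffer(dram_pages=2)
for i in range(4):                       # four pages, only two fit in "DRAM"
    buf.write(i, f"page-{i}".encode())
page0 = buf.read(0)                      # evicted earlier; reloaded from spill
```

MegaMmap layers workload-aware eviction, prefetching, and coherence hints on top of this basic mechanism, distributed across nodes and storage tiers.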
Modern HPC workflows involve intricate coupling of simulation, data analytics, and artificial intelligence (AI) applications to improve time to scientific insight. These workflows require a cohesive set of performance analysis tools to provide a comprehensive understanding of data exchange patterns in HPC systems. However, current tools are not designed to work with an AI-based I/O software stack that requires tracing at multiple levels of the application. To this end, we developed a data flow tracer called DFTracer to capture data-centric events from workflows and the I/O stack to build a detailed understanding of the data exchange within AI-driven workflows. DFTracer has three novel features: a unified interface to capture trace data from different layers in the software stack, an analysis-friendly trace format optimized to load multi-million-event traces in a few seconds, and the capability to tag events with workflow-specific context to perform domain-centric data flow analysis for workflows. Additionally, we demonstrate that DFTracer has a 1.44x smaller runtime overhead and 1.3-7.1x smaller trace size than state-of-the-art tracing tools such as Score-P, Recorder, and Darshan. Moreover, with AI-driven workflows, Score-P, Recorder, and Darshan cannot find I/O accesses from dynamically spawned processes, and their load performance on 100M events is three orders of magnitude slower than DFTracer. In conclusion, we demonstrate that DFTracer can capture multi-level performance data, including contextual event tagging, with a low overhead of 1-5% from AI-driven workflows such as MuMMI and Microsoft's Megatron-DeepSpeed running on large-scale HPC systems.
@inproceedings{devarajan2024dftracer,entry_type={conference},author={Devarajan, Hariharan and Pottier, Loïc and Velusamy, Kaushik and Zheng, Huihuo and Yildirim, Izzet and Kogiou, Olga and Yu, Weikuan and Kougkas, Anthony and Sun, Xian-He and Yeom, Jae Seung and Mohror, Kathryn},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis},title={DFTracer: An Analysis-Friendly Data Flow Tracer for AI-Driven Workflows},year={2024},month=nov,publisher={IEEE Press},volume={},number={},pages={17:1-17:24},keywords={I/O, Application APIs, Deep Learning, Interception, Multilevel, System Calls, Tracer, Transparent, Workflows},doi={10.1109/SC41406.2024.00023},url={https://dl.acm.org/doi/abs/10.1109/SC41406.2024.00023},}
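The shape of an analysis-friendly, context-tagged event record is easy to illustrate. The field names below are invented for the sketch, not DFTracer's actual schema: one newline-delimited JSON object per event, carrying both the low-level I/O call and workflow context so analysis can slice by domain step rather than by software layer:

```python
# Illustrative sketch of a tagged, analysis-friendly trace format
# (invented field names, not DFTracer's real schema).
import json, io

def emit_event(out, name, cat, ts_us, dur_us, context):
    event = {"name": name, "cat": cat, "ts": ts_us, "dur": dur_us,
             "args": context}              # workflow-specific tags ride along
    out.write(json.dumps(event) + "\n")    # newline-delimited JSON: cheap to
                                           # append from many processes, easy
                                           # to load lazily in chunks
trace = io.StringIO()
emit_event(trace, "read", "posix", 100, 40, {"epoch": 3, "phase": "train"})
emit_event(trace, "write", "stdio", 150, 10, {"epoch": 3, "phase": "ckpt"})

# Domain-centric analysis: group events by workflow phase, not by layer.
events = [json.loads(line) for line in trace.getvalue().splitlines()]
train_io = [e for e in events if e["args"]["phase"] == "train"]
```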
Data streaming is gaining traction in high-performance computing (HPC) as a mechanism for continuous data transfer, but remains underutilized as a processing paradigm due to the inadequacy of existing technologies, which are primarily designed for cloud architectures and ill-equipped to tackle HPC-specific challenges. This work introduces a novel approach where I/O libraries take charge of computing derived quantities. By managing the computation of these quantities within the I/O stack, issues such as redundant computations and data movement can be effectively addressed at runtime. The proposed solution demonstrates significant performance improvements and reduced resource utilization in HPC environments.
@inproceedings{gainaru2024derive,entry_type={conference},author={Gainaru, Ana and Podhorszki, Norbert and Dulac, Liz and Gong, Qian and Klasky, Scott and Eisenhauer, Greg and Kougkas, Antonios and Sun, Xian-He and Lofstead, Jay},booktitle={2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)},title={To Derive or Not to Derive: I/O Libraries Take Charge of Derived Quantities Computation},year={2024},month=nov,publisher={IEEE Press},volume={},number={},pages={105-115},keywords={Large-scale I/O, Derived Variables, HPC Analysis, Queries for Scientific Data, Quantities of Interest},doi={10.1109/SBAC-PAD63648.2024.00030},url={https://ieeexplore.ieee.org/document/10763877},}
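The paper's central move, having the I/O layer compute derived quantities once at write time rather than letting every reader re-derive them, can be sketched with a hypothetical store (the API below is invented for illustration):

```python
# Minimal sketch: an "I/O layer" that materializes a registered derived
# quantity as soon as its inputs are written, so readers never recompute it.
# Class and method names are hypothetical.

class DerivedStore:
    def __init__(self):
        self.vars, self.derived = {}, {}
    def define_derived(self, name, inputs, fn):
        self.derived[name] = (inputs, fn)
    def put(self, name, values):
        self.vars[name] = values
        # Eagerly materialize any derived quantity whose inputs are ready.
        for dname, (inputs, fn) in self.derived.items():
            if all(i in self.vars for i in inputs):
                self.vars[dname] = fn(*(self.vars[i] for i in inputs))
    def get(self, name):
        return self.vars[name]

store = DerivedStore()
store.define_derived("speed", ["vx", "vy"],
                     lambda vx, vy: [(x**2 + y**2) ** 0.5
                                     for x, y in zip(vx, vy)])
store.put("vx", [3.0, 0.0])
store.put("vy", [4.0, 1.0])    # both inputs present: "speed" computed here
speed = store.get("speed")
```

Because the derivation happens inside the write path, redundant computation and the extra round trip of re-reading raw variables are both avoided, which is the runtime saving the paper quantifies.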
The combination of ever-growing scientific datasets and distributed workflow complexity creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the increasing use of descriptive data formats (e.g., HDF5, netCDF) helps organize these datasets, it also introduces obscure bottlenecks due to the need to translate high-level operations into file addresses and then into low-level I/O operations. To address this challenge, we introduce DaYu, a method and toolset for analyzing (a) semantic relationships between logical datasets and file addresses, (b) how dataset operations translate into I/O, and (c) the combination across entire workflows. DaYu’s analysis and visualization enable the identification of critical bottlenecks and the reasoning about remediation. We describe our methodology and propose optimization guidelines. Evaluation on scientific workflows demonstrates up to a 3.7x performance improvement in I/O time for obscure bottlenecks. The time and storage overhead for DaYu’s time-ordered data are typically under 0.2% of runtime and 0.25% of data volume, respectively.
@inproceedings{tang2024dayu,entry_type={conference},author={Tang, Meng and Cernuda, Jaime and Ye, Jie and Guo, Luanzheng and Tallent, Nathan R. and Kougkas, Anthony and Sun, Xian-He},booktitle={Proceedings of the IEEE International Conference on Cluster Computing},title={DaYu: Optimizing Distributed Scientific Workflows by Decoding Dataflow Semantics and Dynamics},year={2024},month=sep,publisher={IEEE},volume={},number={},pages={357-369},keywords={Workflow Optimization, Data Layout Optimization, In-Situ Analytics, Data-Intensive Applications},doi={10.1109/CLUSTER59578.2024.00038},url={https://ieeexplore.ieee.org/document/10740817},}
Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well-studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
@inproceedings{logan2024daos,entry_type={journal},author={Logan, Luke and Lofstead, Jay and Sun, Xian-He and Kougkas, Anthony},booktitle={Proceedings of the ACM SIGOPS Operating Systems Review},title={An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads},year={2024},month=aug,publisher={ACM},volume={58},number={1},pages={37-44},keywords={Storage Architectures, I/O Benchmarking, High-Performance Computing, Data Movement Optimization},doi={10.1145/3689051.3689058},url={https://doi.org/10.1145/3689051.3689058},}
Jie Ye, Jaime Cernuda, Neeraj Rajesh, Keith Bateman, Orcun Yildiz, Tom Peterka, Arnur Nigmetov, Dmitriy Morozov, Xian-He Sun, Anthony Kougkas, and Bogdan Nicolae
In Proceedings of the 53rd International Conference on Parallel Processing, Aug 2024
Scientific workflows increasingly need to train a DNN model in real-time during an experiment (e.g. using ground truth from a simulation), while using it at the same time for inferences. Instead of sharing the same model instance, the training (producer) and inference server (consumer) often use different model replicas that are kept synchronized. In addition to efficient I/O techniques to keep the model replica of the producer and consumer synchronized, there is another important trade-off: frequent model updates enhance inference quality but may slow down training; infrequent updates may lead to less precise inference results. To address these challenges, we introduce Viper: a new I/O framework designed to determine a near-optimal checkpoint schedule and accelerate the delivery of the latest model updates. Viper builds an inference performance predictor to identify the optimal checkpoint schedule to balance the trade-off between training slowdown and inference quality improvement. It also creates a memory-first model transfer engine to accelerate model delivery through direct memory-to-memory communication. Our experiments show that Viper can reduce the model update latency by ≈ 9x using the GPU-to-GPU data transfer engine and ≈ 3x using the DRAM-to-DRAM host data transfer. The checkpoint schedule obtained from Viper’s predictor also demonstrates improved cumulative inference accuracy compared to the baseline of epoch-based solutions.
@inproceedings{ye2024viper,entry_type={conference},author={Ye, Jie and Cernuda, Jaime and Rajesh, Neeraj and Bateman, Keith and Yildiz, Orcun and Peterka, Tom and Nigmetov, Arnur and Morozov, Dmitriy and Sun, Xian-He and Kougkas, Anthony and Nicolae, Bogdan},booktitle={Proceedings of the 53rd International Conference on Parallel Processing},title={Viper: A High-Performance I/O Framework for Transparently Updating, Storing, and Transferring Deep Neural Network Models},year={2024},month=aug,publisher={ACM},volume={},number={},pages={812-821},keywords={Deep Learning I/O, Data Movement Optimization, Workflow Optimization, Storage Bridging},doi={10.1145/3673038.3673070},url={https://doi.org/10.1145/3673038.3673070},}
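The trade-off Viper's predictor navigates can be captured in a toy cost model (the numbers and cost function below are purely illustrative, not Viper's actual predictor): frequent checkpoints slow training, infrequent ones leave the inference replica stale, and the scheduler picks the interval minimizing the combined cost:

```python
# Toy checkpoint-interval model (illustrative cost function and constants).
# Frequent checkpoints amortize a fixed cost over few steps (high overhead);
# infrequent ones raise the average staleness of the inference replica.

def combined_cost(interval_steps, ckpt_cost_s=2.0, staleness_weight=0.5):
    ckpt_overhead = ckpt_cost_s / interval_steps       # per-step slowdown
    avg_staleness = staleness_weight * interval_steps  # stale-model penalty
    return ckpt_overhead + avg_staleness

def best_interval(candidates):
    return min(candidates, key=combined_cost)

interval = best_interval([1, 2, 4, 8, 16])
```

Viper replaces this toy cost function with a learned inference-performance predictor, and pairs the resulting schedule with direct memory-to-memory model transfer to make each update cheap.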
Data streaming is gaining traction in high-performance computing (HPC) as a mechanism for continuous data transfer, but remains underutilized as a processing paradigm due to the inadequacy of existing technologies, which are primarily designed for cloud architectures and ill-equipped to tackle HPC-specific challenges. This work introduces HStream, a novel data management design for out-of-core data streaming engines. Central to the HStream design is the separation of data and computing planes at the task level. By managing them independently, issues such as memory thrashing and back-pressure, caused by the high volume, velocity, and burstiness of I/O in HPC environments, can be effectively addressed at runtime. Specifically, HStream utilizes adaptive parallelism and hierarchical memory management, enabled by this design paradigm, to alleviate memory pressure and enhance system performance. These improvements enable HStream to match the performance of state-of-the-art HPC streaming engines and achieve up to a 1.5x reduction in latency under high data loads.
@inproceedings{cernuda2024hstream,entry_type={conference},author={Cernuda, Jaime and Ye, Jie and Kougkas, Anthony and Sun, Xian-He},booktitle={Proceedings of the 53rd International Conference on Parallel Processing},title={HStream: A hierarchical data streaming engine for high-throughput scientific applications},year={2024},month=aug,publisher={ACM},volume={},number={},pages={231-240},keywords={Data Movement Optimization, Elastic Storage, Data Integration Frameworks, Hierarchical Buffering},doi={10.1145/3673038.3673150},url={https://dl.acm.org/doi/abs/10.1145/3673038.3673150},}
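HStream's separation of the data and compute planes can be sketched with a bounded in-memory queue that spills bursts to a slower tier, so ingestion never blocks or thrashes the processing side. The structure below is illustrative, not HStream's implementation:

```python
# Sketch of a data plane decoupled from the compute plane (illustrative):
# a bounded in-memory queue absorbs steady traffic, and overflow from
# bursts is diverted to a slower tier instead of applying back-pressure.
from collections import deque

class DataPlane:
    def __init__(self, mem_capacity=3):
        self.mem = deque()
        self.spill = deque()                 # stands in for a storage tier
        self.mem_capacity = mem_capacity
    def ingest(self, item):
        if len(self.mem) < self.mem_capacity:
            self.mem.append(item)
        else:
            self.spill.append(item)          # absorb the burst
    def next_item(self):
        if self.mem:
            item = self.mem.popleft()
            if self.spill:                   # refill memory from slower tier
                self.mem.append(self.spill.popleft())
            return item
        return self.spill.popleft() if self.spill else None

plane = DataPlane(mem_capacity=3)
for i in range(5):                           # burst of 5; memory holds 3
    plane.ingest(i)
processed = [plane.next_item() for _ in range(5)]
```

In HStream this hierarchy is managed per task, and the compute plane's parallelism adapts to the queue depths, which is what keeps latency low under bursty HPC I/O.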
I/O operations are a known performance bottleneck of HPC applications. To achieve good performance, users often employ an iterative multistage tuning process to find an optimal I/O stack configuration. However, an I/O stack contains multiple layers, such as high-level I/O libraries, I/O middleware, and parallel file systems, and each layer has many parameters. These parameters and layers are entangled and influenced by each other. The tuning process is time-consuming and complex. In this work, we present TunIO, an AI-powered I/O tuning framework that implements several techniques to balance the tuning cost and performance gain, including tuning the high-impact parameters first. Furthermore, TunIO analyzes the application source code to extract its I/O kernel while retaining all statements necessary to perform I/O. It utilizes a smart selection of high-impact configuration parameters of the given tuning objective. Finally, it uses a novel Reinforcement Learning (RL)-driven early stopping mechanism to balance the cost and performance gain. Experimental results show that TunIO leads to a reduction of up to ≈73% in tuning time while achieving the same performance gain when compared to H5Tuner. It achieves a significant performance gain/cost of 208.4 MBps/min (I/O bandwidth for each minute spent in tuning) over existing approaches under our testing.
@inproceedings{rajesh2024tunio,entry_type={conference},author={Rajesh, Neeraj and Bateman, Keith and Bez, Jean Luca and Byna, Suren and Kougkas, Anthony and Sun, Xian-He},booktitle={Proceedings of the International Parallel and Distributed Processing Symposium},title={TunIO: An AI-powered Framework for Optimizing HPC I/O},year={2024},month=may,publisher={IEEE},volume={},number={},pages={494-505},keywords={I/O Performance Optimization, Storage Resource Provisioning, Task-Based I/O, High-Performance Computing},doi={10.1109/IPDPS57955.2024.00050},url={https://ieeexplore.ieee.org/document/10579249},}
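TunIO's "high-impact parameters first" strategy with early stopping can be sketched as a greedy one-parameter-at-a-time search. The parameters, impact ordering, and bandwidth model below are invented for illustration (TunIO's actual mechanism uses extracted I/O kernels and an RL-driven stop rule):

```python
# Greedy tuning sketch: parameters are visited in decreasing impact order,
# and the search stops once a whole parameter yields a marginal gain below
# min_gain. All parameter names and the objective are illustrative.

def tune(params_by_impact, evaluate, min_gain=1.0):
    config = {name: choices[0] for name, choices in params_by_impact}
    best = evaluate(config)
    for name, choices in params_by_impact:   # highest-impact first
        before = best
        for value in choices[1:]:
            trial = dict(config, **{name: value})
            score = evaluate(trial)
            if score > best:
                config, best = trial, score
        if best - before < min_gain:         # early stop: diminishing returns
            break
    return config, best

def bandwidth(cfg):                          # toy objective, in MB/s
    return ({1: 100, 4: 180, 16: 200}[cfg["stripe_count"]]
            + {64: 0, 1024: 3}[cfg["buffer_kb"]])

params = [("stripe_count", [1, 4, 16]),      # assumed high impact: tuned first
          ("buffer_kb", [64, 1024])]
cfg, bw = tune(params, bandwidth)
```

The payoff of this ordering is exactly what TunIO measures: most of the gain arrives in the first few (high-impact) trials, so stopping early sacrifices little bandwidth while cutting tuning time.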
A critical performance challenge in distributed scientific workflows is coordinating tasks and data flows on distributed resources. To guide these decisions, this paper introduces data flow lifecycle analysis. Workflows are commonly represented using directed acyclic graphs (DAGs). Data flow lifecycles (DFL) enrich task DAGs with data objects and properties that describe data flow and how tasks interact with that flow. Lifecycles enable analysis from several important perspectives: task, data, and data flow. We describe representation, measurement, analysis, visualization, and opportunity identification for DFLs. Our measurement is both distributed and scalable, using space that is constant per data file. We use lifecycles and opportunity analysis to reason about improved task placement and reduced data movement for five scientific workflows with different characteristics. Case studies show improvements of 15×, 1.9×, and 10–30×. Our work is implemented in the DataLife tool.
@inproceedings{lee2023data,entry_type={conference},author={Lee, Hyungro and Guo, Luanzheng and Tang, Meng and Firoz, Jesun and Tallent, Nathan and Kougkas, Anthony and Sun, Xian-He},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},title={Data Flow Lifecycles for Optimizing Workflow Coordination},year={2023},month=nov,publisher={ACM},volume={},number={},pages={1--15},keywords={Workflow Optimization, Data Movement Optimization, In-Situ Analytics, Data-Intensive Applications},doi={10.1145/3581784.3607104},url={https://dl.acm.org/doi/abs/10.1145/3581784.3607104},}
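A data flow lifecycle is a task DAG enriched with the data objects flowing between tasks, which lets the same structure be queried from the task, data, or data-flow perspective. The tiny workflow below is invented to show the idea, including the placement reasoning the paper's case studies perform:

```python
# Illustrative DFL: task-DAG edges annotated with the data object and its
# size, queried from two perspectives. The workflow and sizes are made up.

edges = [  # (producer_task, data_object, size_MB, consumer_task)
    ("sim",      "field.h5",  800, "analysis"),
    ("sim",      "ckpt.h5",   200, "sim"),
    ("analysis", "stats.csv",   1, "viz"),
]

# Task perspective: what each task produces and consumes.
writes, reads = {}, {}
for prod, obj, size, cons in edges:
    writes.setdefault(prod, []).append(obj)
    reads.setdefault(cons, []).append(obj)

# Data-flow perspective: bytes crossing node boundaries under a placement.
def transfer_MB(placement):
    return sum(size for prod, obj, size, cons in edges
               if placement[prod] != placement[cons])

naive     = transfer_MB({"sim": "n0", "analysis": "n1", "viz": "n2"})
colocated = transfer_MB({"sim": "n0", "analysis": "n0", "viz": "n0"})
```

Comparing placements on the annotated DAG is the kind of opportunity analysis DataLife automates, at scale and with constant measurement space per file.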
I/O analysis is an essential task for improving the performance of scientific applications on high-performance computing (HPC) systems. However, current analysis tools, which often use data drilling techniques (iterative exploration for deeper insights), treat every query independently and do not optimize column data for data-slicing (extracting specific data subsets), resulting in subpar querying performance. In this paper, we designed IOMax, a tool for efficient data drilling analysis on large-scale I/O traces. IOMax utilizes a novel query optimization technique to improve the query performance by 8.6x while reducing the memory footprint required for analysis by 11x. Additionally, it employs data transformation techniques to improve data-slicing performance by up to 11.4x. In conclusion, IOMax optimizes I/O analysis for scientific workflows on the Lassen supercomputer, resulting in up to 7x improvement.
@inproceedings{yildirim2023iomax,entry_type={workshop},author={Yildirim, Izzet and Devarajan, Hariharan and Kougkas, Anthony and Sun, Xian-He and Mohror, Kathryn},booktitle={Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis},title={IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems},year={2023},month=nov,publisher={ACM},volume={},number={},pages={1209--1215},keywords={I/O Profiling, Data Management in HPC, I/O Benchmarking, Workflow Optimization},doi={10.1145/3624062.3624191},url={https://dl.acm.org/doi/abs/10.1145/3624062.3624191},}
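The two ideas behind IOMax, transforming row-oriented trace records into columns for cheap slicing and letting a drill-down query reuse the previous query's result, can be shown on a synthetic trace (this example is invented, not IOMax's implementation):

```python
# Illustration of column-oriented slicing with drill-down reuse.
# The trace records below are synthetic.

rows = [{"rank": r, "op": "read" if r % 2 else "write", "bytes": 4096 + r}
        for r in range(1000)]

# Transform once: row records -> column vectors (touch only needed columns).
columns = {key: [row[key] for row in rows] for key in rows[0]}

def slice_ids(col, predicate, candidates=None):
    """Return matching row ids; drill into a previous result if given."""
    ids = range(len(columns[col])) if candidates is None else candidates
    return [i for i in ids if predicate(columns[col][i])]

read_ids  = slice_ids("op", lambda v: v == "read")            # first query
large_ids = slice_ids("bytes", lambda v: v > 5000, read_ids)  # drill-down
```

The drill-down scans only the 500 surviving ids instead of all 1000 rows; at the multi-million-event scale of real I/O traces, that reuse is where the reported query speedups come from.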
Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well-studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
@inproceedings{logan2023evaluation,entry_type={workshop},author={Logan, Luke and Lofstead, Jay and Sun, Xian-He and Kougkas, Anthony},booktitle={Proceedings of the 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems},title={An Evaluation of DAOS for Simulation and Deep Learning HPC Workloads},year={2023},month=may,publisher={ACM},volume={},number={},pages={9--16},keywords={Storage Architectures, I/O Benchmarking, High-Performance Computing, Data Movement Optimization},doi={10.1145/3578353.3589542},url={https://dl.acm.org/doi/abs/10.1145/3578353.3589542},}
Traditionally, I/O systems have been developed within the confines of a centralized OS kernel. This led to monolithic and rigid storage systems that are limited by low development speed, expressiveness, and performance. Various assumptions are imposed including reliance on the UNIX-file abstraction, the POSIX standard, and a narrow set of I/O policies. However, this monolithic design philosophy makes it difficult to develop and deploy new I/O approaches to satisfy the rapidly-evolving I/O requirements of modern scientific applications. To this end, we propose LabStor: a modular and extensible platform for developing high-performance, customized I/O stacks. Single-purpose I/O modules (e.g, I/O schedulers) can be developed in the comfort of userspace and released as plug-ins, while end-users can compose these modules to form workload- and hardware-specific I/O stacks. Evaluations show that by switching to a fully modular design, tailored I/O stacks can yield performance improvements of up to 60% in various applications.
@inproceedings{logan2022labstor,entry_type={conference},author={Logan, Luke and Garcia, Jaime Cernuda and Lofstead, Jay and Sun, Xian-He and Kougkas, Anthony},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},title={LabStor: A modular and extensible platform for developing high-performance, customized I/O stacks in userspace},year={2022},month=nov,publisher={ACM},volume={},number={},pages={1--15},keywords={Storage Bridging, Elastic Storage, I/O Acceleration, Task-Based I/O},doi={10.1109/SC41404.2022.00028},url={https://ieeexplore.ieee.org/abstract/document/10046077},}
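LabStor's composition model, single-purpose userspace modules chained into a workload-specific I/O stack, can be sketched with two toy modules. The interfaces and module names below are invented for illustration:

```python
# Sketch of composing single-purpose I/O modules into a stack
# (hypothetical interfaces, not LabStor's real plug-in API).

class Dedup:
    """Drop requests whose payload was already submitted."""
    def __init__(self):
        self.seen = set()
    def submit(self, reqs):
        out = []
        for r in reqs:
            if r["data"] not in self.seen:
                self.seen.add(r["data"])
                out.append(r)
        return out

class Batch:
    """Merge consecutive requests into fixed-size batches."""
    def __init__(self, size=2):
        self.size = size
    def submit(self, reqs):
        return [{"data": tuple(r["data"] for r in reqs[i:i + self.size])}
                for i in range(0, len(reqs), self.size)]

def run_stack(stack, reqs):
    for module in stack:                 # each module transforms the stream
        reqs = module.submit(reqs)
    return reqs

reqs = [{"data": d} for d in ("a", "b", "a", "c")]
out = run_stack([Dedup(), Batch(size=2)], reqs)
```

Because each module only sees and returns a request stream, end-users can reorder or swap modules (schedulers, caches, compressors) per workload, which is the flexibility the paper's 60% improvements come from.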
Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data to accelerate time-to-science and to reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate at scale complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with the StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus accelerates several large-scale applications with different use cases, such as Cosmic Tagger (consuming an HDF5 dataset in PyTorch), Distributed FFN (consuming an HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow), by 5.3x, 2.9x, and 1.9x respectively, with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
@inproceedings{devarajan2022stimulus,entry_type={conference},author={Devarajan, Hariharan and Kougkas, Anthony and Zheng, Huihuo and Vishwanath, Venkatram and Sun, Xian-He},booktitle={Proceedings of the 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing},title={Stimulus: Accelerate Data Management for Scientific AI applications in HPC},year={2022},month=may,publisher={IEEE},volume={},number={},pages={109--118},keywords={Deep Learning I/O, Data Integration Frameworks, Workflow Optimization, In-Situ Analytics},doi={10.1109/CCGrid54584.2022.00020},url={https://ieeexplore.ieee.org/abstract/document/9826104},}
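The adapter pattern Stimulus embodies, wrapping a scientific-format reader behind the map-style dataset interface AI frameworks expect, can be sketched without the real libraries. The classes below are illustrative stand-ins (Stimulus's actual StimOps/StimPack API differs), with a fake in-memory reader in place of an HDF5 file:

```python
# Sketch: adapting a scientific-format reader to the map-style dataset
# protocol that frameworks like PyTorch consume (__len__ / __getitem__).
# FakeH5Dataset stands in for an HDF5 file; all names are hypothetical.

class FakeH5Dataset:
    """Stand-in for an HDF5 file: rows addressable by index."""
    def __init__(self, n):
        self._rows = [[float(i), float(i * i)] for i in range(n)]
    def __len__(self):
        return len(self._rows)
    def read_row(self, i):
        return self._rows[i]

class SciFormatDataset:
    """Map-style dataset wrapper over a scientific-format source."""
    def __init__(self, source):
        self.source = source
    def __len__(self):
        return len(self.source)
    def __getitem__(self, i):
        row = self.source.read_row(i)
        return row[:-1], row[-1]       # split into (features, label)

ds = SciFormatDataset(FakeH5Dataset(4))
batch = [ds[i] for i in range(len(ds))]
```

With the adapter in place, a training loop (or a `DataLoader` in PyTorch) consumes the scientific data directly, with no conversion step such as the HDF5-to-TFRecord pass the CosmoFlow baseline required.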