Revolutionizing data management for AI-driven scientific discovery.
DeepIO is a new direction in my research that tackles one of the most pressing challenges in modern high-performance computing: optimizing data management for AI-driven scientific workflows. Through this project, we are rethinking how scientific computing systems handle the complex interplay between AI training and inference operations.
Research Vision 💡
The convergence of traditional HPC with AI has created challenges that existing storage systems were not designed to handle. Our vision is to develop a comprehensive framework that:
Optimizes model exchange: streamlining how DNN models move between training and inference tasks
Maximizes performance: reducing training time by up to 6.7x through intelligent I/O optimization
Enables intelligence: incorporating adaptive scheduling and smart caching strategies
Leading this effort, we have developed several key technologies:
1. DLIO Benchmark
A novel I/O benchmark for scientific deep learning applications
Emulates the complex data access patterns of AI workflows (a minimal sketch of this pattern follows the list)
Enables systematic identification of I/O bottlenecks
Guides optimizations that reduce training time by up to 6.7x
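The heart of DLIO is emulating how a training job actually reads its data: many small samples, shuffled every epoch and consumed in batches. The snippet below is a minimal, illustrative sketch of that access pattern in plain Python; the file layout and names are hypothetical and this is not the DLIO API itself, but it shows why shuffled small reads stress a parallel file system.

```python
# Illustrative sketch (not the actual DLIO API): emulate a deep-learning read
# pattern -- shuffled, batched sample reads from per-sample .npy files -- and
# time the I/O, which is the kind of access pattern DLIO reproduces.
import time
import numpy as np
from pathlib import Path

def emulate_epoch(data_dir: str, batch_size: int = 32, seed: int = 0) -> float:
    """Read every sample once in a shuffled order and return elapsed seconds."""
    files = sorted(Path(data_dir).glob("sample_*.npy"))   # assumes same-shaped samples
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(files))                   # shuffle like a DataLoader would
    start = time.perf_counter()
    for i in range(0, len(order), batch_size):
        batch = [np.load(files[j]) for j in order[i:i + batch_size]]
        _ = np.stack(batch)                               # stand-in for handing data to training
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"epoch I/O time: {emulate_epoch('./data'):.2f}s")
```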
2. Stimulus Framework
StimPack: a unified representation for scientific data formats
StimOps: optimized data ingestion routines (an illustrative integration sketch follows the list)
1.9x-5.3x application speedups on the Summit supercomputer
Seamless integration with popular AI frameworks
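To make the integration problem concrete, the sketch below wraps an HDF5 dataset as a PyTorch Dataset using h5py. This only illustrates the kind of glue code Stimulus removes; it does not use the StimPack/StimOps API, and the file and dataset names are placeholders.

```python
# Minimal sketch of the integration problem Stimulus addresses: exposing an
# HDF5 dataset to PyTorch without an intermediate format conversion.
# Written with h5py for illustration; not the StimPack/StimOps API itself.
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class HDF5Dataset(Dataset):
    def __init__(self, path: str, dataset_name: str):
        self.path, self.name = path, dataset_name
        with h5py.File(path, "r") as f:          # open once to learn the length
            self.length = f[self.name].shape[0]
        self._file = None                        # lazily re-opened in each worker process

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int) -> torch.Tensor:
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return torch.from_numpy(self._file[self.name][idx])

if __name__ == "__main__":
    # "samples.h5" and "images" are placeholder names for this sketch.
    loader = DataLoader(HDF5Dataset("samples.h5", "images"),
                        batch_size=64, num_workers=4)
```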
3. Viper I/O Framework
Adaptive checkpoint scheduling for timely model updates (the scheduling trade-off is sketched after this list)
Memory-first model transfer engine
Publish-subscribe notification system
Reduces model update latency by up to ~9x via direct memory-to-memory transfer
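The core scheduling question Viper answers is when a model update is worth the training stall it causes. The toy policy below illustrates that trade-off: publish a new replica only if the predicted accuracy gain justifies the estimated checkpoint cost. The cost and gain models here are placeholders, not Viper's actual inference performance predictor.

```python
# Illustrative sketch of the trade-off Viper's predictor resolves: publish a
# new model snapshot to the inference server only when the expected accuracy
# gain outweighs the estimated training slowdown from checkpointing.
from dataclasses import dataclass

@dataclass
class CheckpointPolicy:
    checkpoint_cost_s: float       # estimated training stall per model publish
    min_gain_per_second: float     # required accuracy gain per second of stall

    def should_publish(self, current_acc: float, predicted_acc: float) -> bool:
        gain = predicted_acc - current_acc
        return gain >= self.min_gain_per_second * self.checkpoint_cost_s

policy = CheckpointPolicy(checkpoint_cost_s=2.0, min_gain_per_second=0.001)
served_acc = 0.80
for step, train_acc in enumerate([0.80, 0.801, 0.81, 0.83, 0.84]):
    if policy.should_publish(served_acc, train_acc):
        served_acc = train_acc                 # push the replica to the inference server
        print(f"step {step}: published model at accuracy {served_acc:.3f}")
```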
4. UnboxKV Analysis Tool
Fine-grained analysis of KV caching in transformer models (a back-of-the-envelope cache-size sketch follows the list)
Performance optimization for large language model (LLM) inference
Batching strategy optimization
Memory access pattern analysis
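Much of this analysis starts from simple accounting: in a decoder-only transformer, keys and values are cached per layer, per head, per token. The sketch below computes that footprint for an illustrative model shape (the numbers are not tied to any specific released model) and shows why the cache, rather than the weights, often limits batch size and context length.

```python
# Back-of-the-envelope sketch of the quantity a KV-cache analysis reasons
# about: the cache footprint of a decoder-only transformer.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, stored per layer, per head, per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 32-layer model, 32 KV heads of dim 128, 4k context, batch 8, fp16.
gib = kv_cache_bytes(32, 32, 128, 4096, 8) / 2**30
print(f"KV cache: {gib:.1f} GiB")   # 16 GiB, which dominates batching decisions
```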
Technical Architecture
The DeepIO ecosystem consists of several integrated components (a composition sketch follows the list):
I/O profiling layer: tooling for understanding AI workload characteristics
Optimization engine: ML-driven decision making for data placement and movement
Storage interface: high-performance data access and caching system
Monitoring system: real-time performance analysis and adaptation
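A rough sketch of how these layers could compose is shown below. The interface and tier names are assumptions for illustration rather than the actual DeepIO code base: the profiling layer supplies workload features, the optimization engine maps them to a placement decision, and the storage interface carries it out (the monitoring loop is omitted for brevity).

```python
# Illustrative composition of the DeepIO layers; interface and tier names are
# assumptions, not the actual code base.
from typing import Protocol

class Profiler(Protocol):
    def workload_features(self) -> dict: ...

class Optimizer(Protocol):
    def placement(self, features: dict) -> str: ...   # e.g. "dram", "nvme", "pfs"

class StorageInterface(Protocol):
    def write(self, key: str, data: bytes, tier: str) -> None: ...

def place_sample(profiler: Profiler, optimizer: Optimizer,
                 storage: StorageInterface, key: str, data: bytes) -> str:
    """Route one sample to the tier chosen by the ML-driven optimizer."""
    tier = optimizer.placement(profiler.workload_features())
    storage.write(key, data, tier)
    return tier
```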
Impact on Scientific AI 🌍
Our innovations are already showing significant impact:
Performance: up to 6.7x reduction in training time
Efficiency: 1.9x-5.3x improvement in data ingestion speed
Scalability: demonstrated on leadership computing facilities such as Summit and Theta
Accessibility: enabling more complex AI workflows in scientific computing
Research Directions 🎯
We are actively exploring several frontiers:
Advanced caching strategies for transformer models
ML-driven I/O optimization techniques
Novel data representation formats for AI workloads
Distributed model synchronization protocols
Project Resources 🛠️
Framework: coming soon
Documentation: in development
Benchmarks: DLIO suite available upon request
Analysis tools: UnboxKV toolset in testing phase
Team & Collaboration 👥
This project brings together experts in:
High-performance computing
Deep learning systems
Storage architecture
Scientific computing
Future Roadmap
Our ongoing development focuses on:
Expanding DLIO benchmark capabilities
Enhancing Stimulus framework features
Optimizing Viper for new AI architectures
Developing advanced KV caching strategies
Acknowledgements 🙏
This research is made possible through support from our research partners and the dedication of our talented team of graduate students and postdoctoral researchers.
Interested in collaborating or learning more about our AI-driven storage solutions? Feel free to reach out!
Modern HPC workflows involve intricate coupling of simulation, data analytics, and artificial intelligence (AI) applications to improve time to scientific insight. These workflows require a cohesive set of performance analysis tools to provide a comprehensive understanding of data exchange patterns in HPC systems. However, current tools are not designed to work with an AI-based I/O software stack that requires tracing at multiple levels of the application. To this end, we developed a data flow tracer called DFTracer to capture data-centric events from workflows and the I/O stack and build a detailed understanding of the data exchange within AI-driven workflows. DFTracer has three novel features: a unified interface to capture trace data from different layers of the software stack, an analysis-friendly trace format optimized to load millions of events in a few seconds, and the capability to tag events with workflow-specific context for domain-centric data flow analysis. Additionally, we demonstrate that DFTracer has 1.44x smaller runtime overhead and 1.3-7.1x smaller trace size than state-of-the-art tracing tools such as Score-P, Recorder, and Darshan. Moreover, for AI-driven workflows, Score-P, Recorder, and Darshan cannot capture I/O accesses from dynamically spawned processes, and their load performance for 100M events is three orders of magnitude slower than DFTracer's. In conclusion, we demonstrate that DFTracer can capture multi-level performance data, including contextual event tagging, with a low overhead of 1-5% for AI-driven workflows such as MuMMI and Microsoft's Megatron-DeepSpeed running on large-scale HPC systems.
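One reason the trace format matters is that post-hoc analysis has to load tens of millions of events quickly. The sketch below shows the style of analysis such a format enables, assuming a gzipped, newline-delimited JSON trace with per-event durations and workflow context tags; the file name and field names are illustrative, not DFTracer's exact schema.

```python
# Sketch of loading an analysis-friendly trace: one JSON event per line,
# gzipped. Field names ("dur", "args.stage") and the file name are assumptions
# for illustration, not DFTracer's exact schema.
import gzip
import json
import pandas as pd

def load_trace(path: str) -> pd.DataFrame:
    events = []
    with gzip.open(path, "rt") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line.startswith("{"):
                events.append(json.loads(line))
    return pd.json_normalize(events)

df = load_trace("app-trace.pfw.gz")
# Aggregate I/O time per workflow stage using the context tags on each event.
print(df.groupby("args.stage")["dur"].sum().sort_values(ascending=False))
```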
Traditionally, distributed storage systems have relied upon the interfaces provided by OS kernels to interact with storage hardware. However, much research has shown that OSes impose serious overheads on every I/O operation, especially on high-performance storage and networking hardware (e.g., PMEM and 200GbE). Thus, distributed storage stacks are being re-designed to take advantage of this modern hardware by utilizing new hardware interfaces which bypass the kernel entirely. However, the impact of these optimizations has not been well studied for real HPC workloads on real hardware. In this work, we provide a comprehensive evaluation of DAOS: a state-of-the-art distributed storage system which re-architects the storage stack from scratch for modern hardware. We compare DAOS against traditional storage stacks and demonstrate that by utilizing optimal interfaces to hardware, performance improvements of up to 6x can be observed in real scientific applications.
Jie Ye, Jaime Cernuda, Neeraj Rajesh, Keith Bateman, Orcun Yildiz, Tom Peterka, Arnur Nigmetov, Dmitriy Morozov, Xian-He Sun, Anthony Kougkas, and Bogdan Nicolae
In Proceedings of the 53rd International Conference on Parallel Processing, Aug 2024
Scientific workflows increasingly need to train a DNN model in real-time during an experiment (e.g. using ground truth from a simulation), while using it at the same time for inferences. Instead of sharing the same model instance, the training (producer) and inference server (consumer) often use different model replicas that are kept synchronized. In addition to efficient I/O techniques to keep the model replica of the producer and consumer synchronized, there is another important trade-off: frequent model updates enhance inference quality but may slow down training; infrequent updates may lead to less precise inference results. To address these challenges, we introduce Viper: a new I/O framework designed to determine a near-optimal checkpoint schedule and accelerate the delivery of the latest model updates. Viper builds an inference performance predictor to identify the optimal checkpoint schedule to balance the trade-off between training slowdown and inference quality improvement. It also creates a memory-first model transfer engine to accelerate model delivery through direct memory-to-memory communication. Our experiments show that Viper can reduce the model update latency by ≈ 9x using the GPU-to-GPU data transfer engine and ≈ 3x using the DRAM-to-DRAM host data transfer. The checkpoint schedule obtained from Viper’s predictor also demonstrates improved cumulative inference accuracy compared to the baseline of epoch-based solutions.
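The "memory-first" part of the design is easiest to see next to the alternative of writing and re-reading a checkpoint file: the consumer's replica is refreshed directly from the producer's tensors. The sketch below illustrates that idea with plain PyTorch on a toy module; it is not Viper's transfer engine, which additionally handles GPU-to-GPU and cross-node delivery.

```python
# Illustrative memory-to-memory model refresh: copy a producer replica's
# weights straight into the consumer's parameters instead of round-tripping
# through a checkpoint file. Plain PyTorch sketch, not Viper's engine.
import torch

@torch.no_grad()
def push_weights(producer: torch.nn.Module, consumer: torch.nn.Module) -> None:
    """Copy parameters tensor-to-tensor; with CUDA peers this is GPU-to-GPU."""
    src = producer.state_dict()
    for name, dst_param in consumer.state_dict().items():
        dst_param.copy_(src[name], non_blocking=True)

if __name__ == "__main__":
    prod = torch.nn.Linear(1024, 1024)
    cons = torch.nn.Linear(1024, 1024)
    push_weights(prod, cons)
    assert torch.equal(prod.weight, cons.weight)
```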
I/O operations are a known performance bottleneck of HPC applications. To achieve good performance, users often employ an iterative multistage tuning process to find an optimal I/O stack configuration. However, an I/O stack contains multiple layers, such as high-level I/O libraries, I/O middleware, and parallel file systems, and each layer has many parameters. These parameters and layers are entangled and influenced by each other. The tuning process is time-consuming and complex. In this work, we present TunIO, an AI-powered I/O tuning framework that implements several techniques to balance the tuning cost and performance gain, including tuning the high-impact parameters first. Furthermore, TunIO analyzes the application source code to extract its I/O kernel while retaining all statements necessary to perform I/O. It utilizes a smart selection of high-impact configuration parameters of the given tuning objective. Finally, it uses a novel Reinforcement Learning (RL)-driven early stopping mechanism to balance the cost and performance gain. Experimental results show that TunIO leads to a reduction of up to ≈73% in tuning time while achieving the same performance gain when compared to H5Tuner. It achieves a significant performance gain/cost of 208.4 MBps/min (I/O bandwidth for each minute spent in tuning) over existing approaches under our testing.
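The cost/benefit idea behind the early stopping can be seen in a toy search loop: try the highest-impact parameter values first and stop as soon as the marginal bandwidth gain of another trial is too small to justify its tuning time. The code below is a hand-rolled illustration with a stand-in I/O kernel and made-up numbers, not TunIO's RL-driven mechanism.

```python
# Toy illustration of tuning high-impact parameters first with early stopping
# once the marginal gain is too small; not TunIO's RL-driven mechanism.
def tune(run_io_kernel, candidates, min_gain_per_trial=5.0):
    """run_io_kernel(config) -> bandwidth in MB/s; candidates ordered by impact."""
    best_cfg, best_bw = None, 0.0
    for cfg in candidates:
        bw = run_io_kernel(cfg)
        if best_cfg is not None and bw - best_bw < min_gain_per_trial:
            break                       # marginal gain too small: stop early
        if bw > best_bw:
            best_cfg, best_bw = cfg, bw
    return best_cfg, best_bw

# Example with a stand-in kernel: stripe counts ordered by expected impact.
bandwidth = {1: 300.0, 4: 800.0, 8: 950.0, 16: 953.0}
cfg, bw = tune(lambda c: bandwidth[c["stripe_count"]],
               [{"stripe_count": n} for n in (4, 8, 16, 1)])
print(cfg, bw)   # -> {'stripe_count': 8} 950.0; stopped once the gain fell below 5 MB/s
```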
Modern scientific workflows couple simulations with AI-powered analytics by frequently exchanging data to accelerate time-to-science to reduce the complexity of the simulation planes. However, this data exchange is limited in performance and portability due to a lack of support for scientific data formats in AI frameworks. We need a cohesive mechanism to effectively integrate at scale complex scientific data formats such as HDF5, PnetCDF, ADIOS2, GNCF, and Silo into popular AI frameworks such as TensorFlow, PyTorch, and Caffe. To this end, we designed Stimulus, a data management library for ingesting scientific data effectively into the popular AI frameworks. We utilize the StimOps functions along with StimPack abstraction to enable the integration of scientific data formats with any AI framework. The evaluations show that Stimulus outperforms several large-scale applications with different use cases such as Cosmic Tagger (consuming HDF5 dataset in PyTorch), Distributed FFN (consuming HDF5 dataset in TensorFlow), and CosmoFlow (converting HDF5 into TFRecord and then consuming that in TensorFlow) by 5.3x, 2.9x, and 1.9x respectively with ideal I/O scalability up to 768 GPUs on the Summit supercomputer. Through Stimulus, we can portably extend existing popular AI frameworks to cohesively support any complex scientific data format and efficiently scale the applications on large-scale supercomputers.
Deep learning has been shown as a successful method for various tasks, and its popularity results in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize the computational performance of deep learning frameworks. However, the same cannot be said for I/O performance. As deep learning algorithms rely on big-data volume and variety to effectively train neural networks accurately, I/O is a significant bottleneck on large-scale distributed deep learning training. This study aims to provide a detailed investigation of the I/O behavior of various scientific deep learning workloads running on the Theta supercomputer at Argonne Leadership Computing Facility. In this paper, we present DLIO, a novel representative benchmark suite built based on the I/O profiling of the selected workloads. DLIO can be utilized to accurately emulate the I/O behavior of modern scientific deep learning applications. Using DLIO, application developers and system software solution architects can identify potential I/O bottlenecks in their applications and guide optimizations to boost the I/O performance leading to lower training times by up to 6.7x.