DeepIO

Revolutionizing data management for AI-driven scientific discovery.

DeepIO represents an exciting new frontier in my research, tackling one of the most pressing challenges in modern high-performance computing: optimizing data management for AI-driven scientific workflows. In leading this project, my team and I are reimagining how scientific computing systems handle the complex interplay between AI training and inference operations.

Research Vision 💡

The convergence of traditional HPC with AI has created unique challenges that existing storage systems weren't designed to handle. Our vision is to develop a comprehensive framework that:

  • Optimizes model exchange: rethinking how DNN models move between training and inference tasks
  • Maximizes performance: achieving up to a 6.7x reduction in training time through intelligent I/O optimization
  • Enables intelligence: incorporating adaptive scheduling and smart caching strategies
  • Ensures scalability: efficiently supporting distributed multi-producer, multi-consumer patterns

Core Innovations 🔧

Under my leadership, we've developed several groundbreaking technologies:

1. DLIO Benchmark

  • Novel I/O benchmark for scientific deep learning applications
  • Emulates complex data access patterns in AI workflows (a minimal sketch of such a pattern follows this list)
  • Enables systematic identification of I/O bottlenecks
  • Demonstrates up to a 6.7x improvement in training performance
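
DLIO itself is configuration-driven, so the snippet below is not its API; it is only a self-contained sketch of the kind of shuffled, batched read pattern a deep-learning training loop generates, with the file layout, sample count, and batch size as illustrative assumptions.

```python
import os
import time
import numpy as np

# Hypothetical workload parameters (not DLIO defaults).
DATA_DIR = "samples"
NUM_SAMPLES, BATCH_SIZE, EPOCHS = 128, 16, 3

def generate_samples():
    """Write synthetic .npy sample files so the read pattern can be replayed."""
    os.makedirs(DATA_DIR, exist_ok=True)
    for i in range(NUM_SAMPLES):
        np.save(os.path.join(DATA_DIR, f"sample_{i}.npy"),
                np.random.rand(224, 224, 3).astype(np.float32))

def emulate_training_reads():
    """Replay a shuffled, batched read pattern and report raw read throughput."""
    rng = np.random.default_rng(0)
    for epoch in range(EPOCHS):
        order = rng.permutation(NUM_SAMPLES)          # reshuffle each epoch
        start, nbytes = time.perf_counter(), 0
        for b in range(0, NUM_SAMPLES, BATCH_SIZE):
            for idx in order[b:b + BATCH_SIZE]:
                arr = np.load(os.path.join(DATA_DIR, f"sample_{idx}.npy"))
                nbytes += arr.nbytes
        duration = time.perf_counter() - start
        print(f"epoch {epoch}: {nbytes / duration / 1e6:.1f} MB/s read")

if __name__ == "__main__":
    generate_samples()
    emulate_training_reads()
```

Measuring throughput around exactly this kind of loop is what lets an I/O benchmark attribute stalls to storage rather than to compute.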

2. Stimulus Framework

  • StimPack: unified representation for scientific data formats (see the sketch after this list)
  • StimOps: optimized data ingestion routines
  • 2x-5.3x performance improvement on the Summit supercomputer
  • Seamless integration with popular AI frameworks
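
StimPack's actual interfaces are not yet public; the following is only a hedged sketch of what a unified in-memory representation over heterogeneous scientific formats could look like. The `UnifiedSample` type, the `load_any` helper, and the default dataset name are all illustrative assumptions, not Stimulus code.

```python
from dataclasses import dataclass
import numpy as np
import h5py  # HDF5 is one common scientific format; assumed installed

@dataclass
class UnifiedSample:
    """Format-agnostic record handed to the training pipeline."""
    data: np.ndarray
    metadata: dict

def load_any(path: str, dataset: str = "data") -> UnifiedSample:
    """Map a few common scientific formats onto one in-memory representation."""
    if path.endswith((".h5", ".hdf5")):
        with h5py.File(path, "r") as f:
            return UnifiedSample(np.asarray(f[dataset]), dict(f.attrs))
    if path.endswith(".npz"):
        arrays = np.load(path)
        return UnifiedSample(arrays[dataset], {"keys": list(arrays.keys())})
    if path.endswith(".npy"):
        return UnifiedSample(np.load(path), {})
    raise ValueError(f"unsupported format: {path}")
```

The point of such a layer is that downstream ingestion routines only ever see one representation, so format-specific read optimizations stay behind a single interface.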

3. Viper I/O Framework

  • Adaptive checkpoint scheduling for timely model updates
  • Memory-first model transfer engine
  • Publish-subscribe notification system for model updates (illustrated in the sketch after this list)
  • Significant reduction in model update latency
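
Viper's transfer engine is not shown here; the snippet below is only a minimal in-process sketch of the publish-subscribe notification idea, where a trainer publishes a new model version and inference workers are told where to pull it from. All class, topic, and path names are illustrative assumptions.

```python
import threading
from collections import defaultdict

class ModelUpdateBroker:
    """Tiny in-process pub/sub hub: trainers publish versions, inference subscribes."""
    def __init__(self):
        self._subscribers = defaultdict(list)
        self._lock = threading.Lock()

    def subscribe(self, topic, callback):
        with self._lock:
            self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        with self._lock:
            callbacks = list(self._subscribers[topic])
        for cb in callbacks:           # notify outside the lock
            cb(payload)

# Illustrative usage: an inference worker reloads weights when notified.
broker = ModelUpdateBroker()
broker.subscribe("model/resnet50",
                 lambda update: print(f"inference: pulling version {update['version']}"))
broker.publish("model/resnet50", {"version": 42, "location": "/dev/shm/ckpt_42.pt"})
```

Keeping the notification path separate from the bulk transfer path is what allows the transfer engine to stay memory-first while subscribers decide when to pull.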

4. UnboxKV Analysis Tool

  • Fine-grained analysis of KV caching in transformer models (a footprint calculation sketch follows this list)
  • Performance optimization for large language model inference
  • Advanced batching strategy optimization
  • Memory access pattern analysis
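
UnboxKV itself is still in testing, so as a back-of-the-envelope illustration of why KV caching dominates inference memory, the sketch below computes the cache footprint using the standard formula: 2 (K and V) x layers x heads x head dimension x sequence length x batch size x bytes per element. The 7B-class model configuration is an assumption for the example.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Standard KV-cache footprint: one K and one V tensor per layer, fp16 by default."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class configuration: 32 layers, 32 heads of dim 128, 4k context, batch 8.
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096, batch=8)
print(f"KV cache: {size / 2**30:.1f} GiB")   # 16.0 GiB for this configuration
```

At these sizes the cache, not the weights, often becomes the limiting factor for batch size, which is why fine-grained analysis of its access patterns matters.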

Technical Architecture

The DeepIO ecosystem consists of several integrated components:

  • I/O profiling layer: tooling for understanding AI workload characteristics
  • Optimization engine: ML-driven decision making for data placement and movement (a toy composition sketch follows this list)
  • Storage interface: high-performance data access and caching system
  • Monitoring system: real-time performance analysis and adaptation
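
None of these components are released yet; the sketch below only illustrates how the layers might hand information to one another (profile in, placement decision out), with the class, field, threshold, and tier names all as assumptions. The real optimization engine is ML-driven; a fixed rule stands in for it here purely to show the interface.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    """Output of the (hypothetical) I/O profiling layer."""
    read_mb_per_s: float
    reuse_ratio: float        # fraction of samples re-read within an epoch window

def choose_tier(profile: WorkloadProfile) -> str:
    """Toy stand-in for the optimization engine's placement decision."""
    if profile.reuse_ratio > 0.5 and profile.read_mb_per_s > 500:
        return "node-local NVMe cache"
    if profile.reuse_ratio > 0.2:
        return "burst buffer"
    return "parallel file system"

profile = WorkloadProfile(read_mb_per_s=800.0, reuse_ratio=0.7)
print(f"placement decision: {choose_tier(profile)}")
```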

Impact on Scientific AI 🌍

Our innovations are already showing significant impact:

  • Performance: up to a 6.7x reduction in training time
  • Efficiency: 2x-5.3x improvement in data processing speed
  • Scalability: demonstrated on leadership computing facilities
  • Accessibility: enabling more complex AI workflows in scientific computing

Research Directions 🎯

We're actively exploring several exciting frontiers:

  • Advanced caching strategies for transformer models
  • ML-driven I/O optimization techniques
  • Novel data representation formats for AI workloads
  • Distributed model synchronization protocols

Project Resources 🛠️

  • Framework: coming soon
  • Documentation: in development
  • Benchmarks: DLIO suite available upon request
  • Analysis tools: UnboxKV toolset in testing phase

Team & Collaboration 👥

This ambitious project brings together experts in:

  • High-performance computing
  • Deep learning systems
  • Storage architecture
  • Scientific computing

Future Roadmap

Our ongoing development focuses on:

  • Expanding DLIO benchmark capabilities
  • Enhancing Stimulus framework features
  • Optimizing Viper for new AI architectures
  • Developing advanced KV caching strategies

Acknowledgements 🙏

This cutting-edge research is made possible through support from our research partners and the dedication of our talented team of graduate students and postdoctoral researchers.


Interested in collaborating or learning more about our AI-driven storage solutions? Feel free to reach out!

Related Publications

2024

  1. Hariharan Devarajan, Loïc Pottier, Kaushik Velusamy, Huihuo Zheng, Izzet Yildirim, Olga Kogiou, Weikuan Yu, Anthony Kougkas, Xian-He Sun, Jae Seung Yeom, and Kathryn Mohror
    In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2024
  2. Luke Logan, Jay Lofstead, Xian-He Sun, and Anthony Kougkas
    In Proceedings of the ACM SIGOPS Operating Systems Review, Aug 2024
  3. Jie Ye, Jaime Cernuda, Neeraj Rajesh, Keith Bateman, Orcun Yildiz, Tom Peterka, Arnur Nigmetov, Dmitriy Morozov, Xian-He Sun, Anthony Kougkas, and Bogdan Nicolae
    In Proceedings of the 53rd International Conference on Parallel Processing, Aug 2024
  4. Neeraj Rajesh, Keith Bateman, Jean Luca Bez, Suren Byna, Anthony Kougkas, and Xian-He Sun
    In Proceedings of the International Parallel and Distributed Processing Symposium, May 2024

2023

  1. Luke Logan, Jay Lofstead, Xian-He Sun, and Anthony Kougkas
    In Proceedings of the 3rd Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, May 2023

2022

  1. Hariharan Devarajan, Anthony Kougkas, Huihuo Zheng, Venkatram Vishwanath, and Xian-He Sun
    In Proceedings of the 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, May 2022

2021

  1. Hariharan Devarajan, Huihuo Zheng, Anthony Kougkas, Xian-He Sun, and Venkatram Vishwanath
    In Proceedings of the 21st International Symposium on Cluster, Cloud and Internet Computing (Best Paper Award), May 2021