Accelerating scientific insights using enriched metadata.
Coeus is a partnership with Sandia National Laboratories (SNL) and Oak Ridge National Laboratory (ORNL), in which our research team investigates advanced metadata management techniques to accelerate complex queries on scientific data while optimizing data placement across storage hierarchies. As a co-PI on this DOE ASCR-funded project, I lead the storage and data placement research thrust, where we develop novel approaches for intelligent data movement and enhanced metadata management.
Research Vision & Leadership
Working closely with Dr. Jay Lofstead (SNL) and Dr. Scott Klasky (ORNL), we identified critical challenges in scientific data management that led to the Coeus project. Our team at IIT focuses on developing solutions for:
Storage-driven data movement: novel techniques for intelligent data placement
ML-guided optimization: advanced prediction models for data access patterns
Metadata enhancement: new approaches for derived quantity management
Hierarchical storage management: efficient use of modern storage tiers
Technical Innovations
Under our team’s research direction, several technologies have emerged:
Context-aware active storage: a novel framework that adapts storage behavior based on application context, developed by PhD candidate Jaime Cernuda
Global file heatmaps: a system for tracking cross-process data access patterns, implemented by PhD student Luke Logan
ML-based prediction engine: models achieving over 90% accuracy in predicting data access patterns
Adaptive data movement: dynamic policies that respond to changing workload characteristics
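To make the heatmap idea concrete, the sketch below merges accesses from several processes into per-block hit counts for each file and uses a simple threshold to suggest a storage tier. It is an illustration only, not the Coeus implementation: the class name `FileHeatmap`, the block size, the tier labels, and the threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict

BLOCK_SIZE = 1 << 20  # hypothetical 1 MiB heatmap granularity


class FileHeatmap:
    """Toy cross-process heatmap: hit counts per (file, block). Hypothetical API."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.counts = defaultdict(Counter)  # path -> Counter{block index: hits}
        self.ranks = defaultdict(set)       # (path, block) -> ranks that touched it

    def record(self, rank, path, offset, length):
        """Record one access by process `rank`; hits are merged across ranks."""
        if length <= 0:
            return
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for block in range(first, last + 1):
            self.counts[path][block] += 1
            self.ranks[(path, block)].add(rank)

    def hottest(self, path, n=1):
        """Return the n most frequently accessed block indices of a file."""
        return [block for block, _ in self.counts[path].most_common(n)]

    def tier_for(self, path, block, hot_threshold=3):
        """Assumed placement rule: frequently hit blocks go to the fast tier."""
        return "fast" if self.counts[path][block] >= hot_threshold else "capacity"
```

A placement policy could call `tier_for` periodically and move blocks whose suggested tier has changed, which is one simple way to let data movement respond to shifting access patterns.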
Mentorship & Team Development
The success of Coeus relies heavily on the dedication and innovation of our research team:
PhD students: leading core research thrusts in machine learning and storage optimization
Post-doctoral researchers: bridging theoretical foundations with practical implementations
Visiting researchers: contributing diverse perspectives from partner institutions
Undergraduate researchers: gaining valuable exposure to cutting-edge research
Impact on Scientific Applications
Through collaborative efforts with domain scientists at DOE facilities, our research has enhanced several critical applications:
Fusion research: supporting XGC with efficient data query capabilities
Particle physics: optimizing I/O performance for WarpX
Climate modeling: enabling complex queries on large-scale climate data
Scientific visualization: accelerating data access for visualization tools
Knowledge Dissemination
Our team actively shares research findings through:
Guest lectures at partner institutions
Technical workshops at major conferences
Open-source software releases
Peer-reviewed publications
Project Resources
The team maintains and continues to develop the core framework:
This research is supported by the U.S. Department of Energy, Office of Science, under award number DE-SC0023386. We are grateful to our collaborators at Sandia National Laboratories and Oak Ridge National Laboratory for this partnership in advancing the state of scientific data management. Special thanks to our talented students and post-doctoral researchers, whose dedication and innovation drive this project forward.
Interested in research opportunities or potential collaborations? Feel free to reach out!
Data streaming is gaining traction in high-performance computing (HPC) as a mechanism for continuous data transfer, but remains underutilized as a processing paradigm due to the inadequacy of existing technologies, which are primarily designed for cloud architectures and ill-equipped to tackle HPC-specific challenges. This work introduces HStream, a novel data management design for out-of-core data streaming engines. Central to the HStream design is the separation of data and computing planes at the task level. By managing them independently, issues such as memory thrashing and back-pressure, caused by the high volume, velocity, and burstiness of I/O in HPC environments, can be effectively addressed at runtime. Specifically, HStream utilizes adaptive parallelism and hierarchical memory management, enabled by this design paradigm, to alleviate memory pressure and enhance system performance. These improvements enable HStream to match the performance of state-of-the-art HPC streaming engines and achieve up to a 1.5x reduction in latency under high data loads.
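The separation of data and compute planes described above can be illustrated with a small sketch. It is not HStream's code: `TieredBuffer` and `choose_parallelism` are hypothetical names, and both the spill rule and the scaling rule are assumptions chosen for illustration. The buffer spills bursts to a slower tier instead of blocking the producer, and the compute side sizes its task parallelism from the observed backlog.

```python
from collections import deque


class TieredBuffer:
    """Toy data plane: a bounded fast tier that spills overflow to a slower tier."""

    def __init__(self, fast_capacity):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast_capacity = fast_capacity
        self.fast = deque()  # stands in for DRAM
        self.slow = deque()  # stands in for a slower tier such as NVMe

    def put(self, item):
        # Spill rather than block the producer, so bursts do not cause
        # back-pressure. Once anything has spilled, keep appending to the
        # slow tier so FIFO order is preserved.
        if self.slow or len(self.fast) >= self.fast_capacity:
            self.slow.append(item)
        else:
            self.fast.append(item)

    def get(self):
        item = self.fast.popleft()
        if self.slow:
            # Promote the oldest spilled item into the freed fast slot.
            self.fast.append(self.slow.popleft())
        return item

    def backlog(self):
        return len(self.fast) + len(self.slow)


def choose_parallelism(backlog, per_worker=4, min_workers=1, max_workers=8):
    """Toy compute-plane policy: scale task parallelism with queued items."""
    wanted = -(-backlog // per_worker)  # ceiling division
    return max(min_workers, min(max_workers, wanted))
```

Because the two pieces share only the backlog count, either policy can be changed at runtime without touching the other, which is the property the task-level separation is meant to provide.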
Modern simulation workflows generate and analyze massive amounts of data using I/O libraries like ADIOS2 and NetCDF. Although extensive work has optimized I/O during the simulation phase, executing analytical queries, which often require iterative traversals of large files, is cumbersome and usually constrained by low I/O performance. Instead of waiting for the analysis phase to process queries, quantities can be derived asynchronously during data production and cached, speeding up future queries. In this work, we introduce a context-aware I/O layer named Hades. It is designed to efficiently derive insights from selected quantities without compromising overall workflow performance. Hades actively and asynchronously computes and stores these quantities while the data is in transit. It leverages a hierarchical buffering system with data access-aware prefetching to ensure timely access to relevant data, and it offers a flexible query interface that lets users easily define derived quantities and control data placement decisions. Hades is implemented using an ADIOS2 plugin engine and the Hermes buffering platform, enabling transparent use by any ADIOS-powered application or workflow. Experimental results demonstrate performance improvements of up to 3-4x for the tested real-world scientific producer-consumer workflows.
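The core idea of deriving quantities asynchronously while data is produced, then serving later queries from a cache, can be sketched in a few lines. This is an illustration under stated assumptions and does not use the ADIOS2 or Hermes APIs: `DerivedQuantityCache`, `define`, `on_write`, and `query` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


class DerivedQuantityCache:
    """Toy in-transit derivation: compute user-defined quantities per written step."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._derivations = {}  # name -> function(values) -> result
        self._results = {}      # (name, step) -> Future holding the result

    def define(self, name, func):
        """Register a derived quantity (hypothetical query-interface stand-in)."""
        self._derivations[name] = func

    def on_write(self, step, values):
        """Called as the producer writes a step; derivations run in the background."""
        snapshot = list(values)  # copy so the producer may reuse its buffer
        for name, func in self._derivations.items():
            self._results[(name, step)] = self._pool.submit(func, snapshot)

    def query(self, name, step):
        """Return the cached quantity, waiting only if it is still being computed."""
        return self._results[(name, step)].result()

    def close(self):
        self._pool.shutdown(wait=True)


cache = DerivedQuantityCache()
cache.define("max", max)
cache.define("mean", lambda v: sum(v) / len(v))
cache.on_write(0, [1.0, 2.0, 3.0])  # producer continues without waiting
step0_max = cache.query("max", 0)
step0_mean = cache.query("mean", 0)
cache.close()
```

The consumer never rescans the raw step data; it reads the precomputed value, which is where the query speedup in a producer-consumer workflow would come from.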