A unified data access framework for HPC and Big Data storage
IRIS (I/O Redirection via Integrated Storage) is a framework that bridges the gap between high-performance computing (HPC) and Big Data storage systems. As scientific applications become increasingly data-intensive and high-performance data analytics (HPDA) requires more computing power, IRIS provides a unified solution that seamlessly integrates compute-centric and data-centric storage environments.
What makes IRIS special? 💡
IRIS acts as an intelligent mediator between different storage worlds. Just as a translator helps people who speak different languages communicate, IRIS lets applications access data across different storage systems seamlessly. It unifies parallel file systems (PFS) and object stores under one cohesive framework, eliminating the traditional barriers between HPC and Big Data environments. A minimal sketch of this idea follows.
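To make the mediator idea concrete, here is a small C++ sketch of a single storage facade over two interchangeable back ends. All names (Backend, FileBackend, ObjectBackend) are hypothetical illustrations, not the actual IRIS API, and the in-memory maps stand in for a real PFS and object store.

```cpp
// Minimal sketch of the "mediator" idea: applications code against one
// interface while the concrete storage back end stays a deployment choice.
// All names here are hypothetical illustrations, not the real IRIS API.
#include <iostream>
#include <map>
#include <memory>
#include <string>

// The single interface an application sees, regardless of back end.
struct Backend {
    virtual ~Backend() = default;
    virtual void put(const std::string& key, const std::string& data) = 0;
    virtual std::string get(const std::string& key) = 0;
};

// Stand-in for a parallel file system (an in-memory map for the sketch).
struct FileBackend : Backend {
    std::map<std::string, std::string> files;
    void put(const std::string& k, const std::string& d) override { files[k] = d; }
    std::string get(const std::string& k) override { return files.at(k); }
};

// Stand-in for an object store (real code would talk to S3, Ceph, etc.).
struct ObjectBackend : Backend {
    std::map<std::string, std::string> objects;
    void put(const std::string& k, const std::string& d) override { objects[k] = d; }
    std::string get(const std::string& k) override { return objects.at(k); }
};

int main() {
    // The application holds one handle; swapping FileBackend for
    // ObjectBackend changes nothing in the calling code.
    std::unique_ptr<Backend> store = std::make_unique<ObjectBackend>();
    store->put("/sim/output/step42.dat", "temperature field bytes");
    std::cout << store->get("/sim/output/step42.dat") << "\n";
}
```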
Behind the innovation
IRIS emerged from the recognition that the tools and cultures of HPC and HPDA have diverged, to the detriment of both. Our research shows that unifying them is essential to serve a wide spectrum of major research domains. The project, funded by the National Science Foundation, aims to create a unified storage interface that bridges two very different camps: compute-centric and data-centric data storage.
Key innovations
Cross-system data access: enables MPI applications to directly access object stores, and Big Data applications to access parallel file systems
Virtual files and objects: novel abstractions that overcome the semantic gaps between different storage systems
Unified storage interface: seamless integration of compute-centric and data-centric storage systems
High performance: achieves up to 12x speedup on real scientific applications
Transparency: existing applications run with no modification (one common technique for this, system-call interposition, is sketched below)
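The page does not say how IRIS achieves transparency internally; one standard way to serve unmodified applications is to interpose on POSIX calls with an LD_PRELOAD shim. The following is a generic sketch of that technique, not IRIS source; the /iris/ path prefix and the redirect rule are invented for illustration.

```cpp
// Generic LD_PRELOAD interposition shim: intercepts open(2) so a framework
// can route selected paths to a different back end. Hypothetical example,
// not IRIS code. Build: g++ -shared -fPIC shim.cpp -o libshim.so -ldl
#include <dlfcn.h>
#include <fcntl.h>
#include <cstdarg>
#include <cstdio>
#include <cstring>

// Pointer to the real open(2), resolved lazily via the dynamic linker.
static int (*real_open)(const char*, int, ...) = nullptr;

extern "C" int open(const char* path, int flags, ...) {
    if (!real_open)
        real_open = reinterpret_cast<int (*)(const char*, int, ...)>(
            dlsym(RTLD_NEXT, "open"));

    // Invented rule: paths under /iris/ would be routed to the object store.
    // Here we only log, to show where the redirection logic would live.
    if (std::strncmp(path, "/iris/", 6) == 0)
        std::fprintf(stderr, "[shim] would redirect %s\n", path);

    // open() carries a mode argument only when O_CREAT is set.
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    return real_open(path, flags, mode);
}
```

Running `LD_PRELOAD=./libshim.so ./app` then routes every open() in the unmodified application through the shim first.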
Real-world impact 🌍
IRIS is making significant contributions across various scientific domains:
Climate modeling: supporting applications like CM1 with efficient data analysis integration
Scientific computing: enabling seamless data sharing between simulation and analysis phases
High-performance data analytics: bridging the gap between computing and data processing
Big Data applications: providing efficient access to both file-based and object-based storage
Technical architecture
IRIS consists of several key components working together:
Mappers: bridge the semantic gaps between different storage interfaces
Storage modules: handle interactions with the underlying storage systems
Metadata manager: maintains consistency and handles metadata operations
Performance optimizer: includes prefetching, caching, and request aggregation (aggregation is sketched after this list)
Unified storage server: provides deep integration at the disk level
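As one concrete example of the optimizer's job, here is a minimal sketch of request aggregation: sorting queued byte-range requests and coalescing contiguous or overlapping ones before they reach a storage module. The data structure and merge policy are illustrative assumptions, not IRIS internals.

```cpp
// Request aggregation sketch: many small, scattered requests become fewer,
// larger ones, cutting round trips to the storage system.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Request { std::size_t offset, size; };

// Sort by offset, then merge requests that touch or overlap.
std::vector<Request> aggregate(std::vector<Request> reqs) {
    std::sort(reqs.begin(), reqs.end(),
              [](const Request& a, const Request& b) { return a.offset < b.offset; });
    std::vector<Request> merged;
    for (const Request& r : reqs) {
        if (!merged.empty() &&
            r.offset <= merged.back().offset + merged.back().size) {
            std::size_t end = std::max(merged.back().offset + merged.back().size,
                                       r.offset + r.size);
            merged.back().size = end - merged.back().offset;
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

int main() {
    // Three 1 MiB reads collapse into a single 3 MiB request.
    auto out = aggregate({{0, 1 << 20}, {2 << 20, 1 << 20}, {1 << 20, 1 << 20}});
    for (const auto& r : out)
        std::cout << "offset=" << r.offset << " size=" << r.size << "\n";
}
```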
Looking forward
IRIS continues to evolve with exciting developments in:
Extended support for various high-level I/O libraries
Enhanced performance optimization techniques
Deeper integration of storage systems at the disk level
Expanded application support across different domains
Join the IRIS community 🤝
IRIS is an open-source project welcoming contributions from both academic and industrial researchers:
Documentation: comprehensive guides and technical details
Research papers: latest findings and technical innovations
Key publications 📚
IRIS: I/O Redirection via Integrated Storage (ICS 2018)
Foundational paper introducing the IRIS framework and its core concepts
Demonstrates significant performance improvements in real-world applications
Syndesis: Mapping Objects to Files for a Unified Data Access System (MTAGS 2017)
Explores novel mapping strategies between file-based and object-based storage (one such mapping is sketched after this list)
Enosis: Bridging the Semantic Gap Between File-Based and Object-Based Data Models (DataCloud 2017)
Addresses fundamental challenges in unifying different data models
Rethinking Key-Value Store for Parallel I/O Optimization (IJHPCA 2017)
Investigates optimization strategies for key-value stores in parallel environments
NIOBE: An Intelligent I/O Bridging Engine for Complex and Distributed Workflows (IEEE Big Data 2019)
Extends IRIS concepts to support complex workflow scenarios
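The Syndesis and Enosis papers study several mapping strategies; the sketch below shows only the most basic one, a file byte range striped across fixed-size objects keyed by path and chunk index. The 4 MiB object size and the "path#index" key scheme are assumptions for illustration, not the papers' actual design.

```cpp
// Naive file-to-object mapping sketch: translate (file, offset, size) into
// the object slices that cover it. Illustrative only.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

constexpr std::size_t kObjectSize = 4 * 1024 * 1024;  // assumed 4 MiB objects

struct ObjectSlice {
    std::string key;     // object key in the store, e.g. "/data/out.bin#1"
    std::size_t offset;  // offset inside that object
    std::size_t length;  // bytes to read/write in that object
};

std::vector<ObjectSlice> map_range(const std::string& path,
                                   std::size_t offset, std::size_t size) {
    std::vector<ObjectSlice> slices;
    while (size > 0) {
        std::size_t chunk = offset / kObjectSize;
        std::size_t within = offset % kObjectSize;
        std::size_t len = std::min(size, kObjectSize - within);
        slices.push_back({path + "#" + std::to_string(chunk), within, len});
        offset += len;
        size -= len;
    }
    return slices;
}

int main() {
    // An 8 MiB write at offset 2 MiB spans objects #0, #1, and #2.
    for (const auto& s : map_range("/data/out.bin", 2u << 20, 8u << 20))
        std::cout << s.key << " +" << s.offset << " len=" << s.length << "\n";
}
```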
Acknowledgements 🙏
The development of IRIS has been made possible through the support of the National Science Foundation (NSF). We're grateful to our collaborators at the Illinois Institute of Technology and the various research institutions whose expertise has been instrumental in advancing this project.
Interested in learning more about IRIS or discussing potential collaborations? Feel free to reach out!
Paper abstracts
NIOBE: An Intelligent I/O Bridging Engine for Complex and Distributed Workflows (IEEE Big Data 2019)
In the age of data-driven computing, integrating High-Performance Computing (HPC) and Big Data (BD) environments may be the key to increasing productivity and to driving scientific discovery forward. Scientific workflows consist of diverse applications (i.e., HPC simulations and BD analysis), each with distinct representations of data that introduce a semantic barrier between the two environments. To solve scientific problems at scale, accessing semantically different data from different storage resources is the biggest unsolved challenge. In this work, we aim to address a critical question: "How can we exploit the existing resources and efficiently provide transparent access to data from/to both environments?" We propose the iNtelligent I/O Bridging Engine (NIOBE), a new data integration framework that enables integrated data access for scientific workflows with asynchronous I/O and data aggregation. NIOBE performs the data integration using available I/O resources, in contrast to existing optimizations that ignore the I/O nodes present on the data path. In NIOBE, data access is optimized to consider both the ongoing production of the data and its future consumption. Experimental results show that with NIOBE, an integrated scientific workflow can be accelerated by up to 10x compared to a no-integration baseline and by up to 133% compared to other state-of-the-art integration solutions.
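The abstract names asynchronous I/O and data aggregation as NIOBE's two levers. Below is a minimal, self-contained sketch of that combined pattern under assumptions of my own (one background thread, an in-memory queue, stdout standing in for the consumer-side store); it is not the NIOBE implementation.

```cpp
// Asynchronous, aggregating writes: producers enqueue small buffers and
// return immediately; a background "bridging" thread batches queued buffers
// and flushes them in one operation. Illustrative sketch only.
#include <condition_variable>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class AsyncAggregator {
public:
    // Called on the producer's critical path; never touches storage.
    void write_async(std::string data) {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(std::move(data));
        cv_.notify_one();
    }
    void stop() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
private:
    void run() {
        for (;;) {
            std::vector<std::string> batch;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !pending_.empty(); });
                if (pending_.empty() && done_) return;
                batch.swap(pending_);  // take everything queued so far
            }
            // One aggregated flush instead of many small ones.
            std::size_t bytes = 0;
            for (const auto& d : batch) bytes += d.size();
            std::cout << "flushed " << batch.size() << " buffers, "
                      << bytes << " bytes in one operation\n";
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::string> pending_;
    bool done_ = false;
    std::thread worker_{&AsyncAggregator::run, this};  // started last
};

int main() {
    AsyncAggregator agg;
    for (int step = 0; step < 8; ++step)
        agg.write_async("simulation step " + std::to_string(step));
    agg.stop();
}
```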
IRIS: I/O Redirection via Integrated Storage (ICS 2018)
There is an ocean of available storage solutions in modern high-performance and distributed systems. These solutions consist of Parallel File Systems (PFS) for the more traditional high-performance computing (HPC) systems and of Object Stores for emerging cloud environments. More often than not, these storage solutions are tied to specific APIs and data models and thus bind developers, applications, and entire computing facilities to certain interfaces. Each storage system is designed and optimized for certain applications but does not perform well for others. Furthermore, modern applications have become more and more complex, consisting of a collection of phases with different computation and I/O requirements. In this paper, we propose a unified storage access system called IRIS (i.e., I/O Redirection via Integrated Storage). IRIS enables unified data access and seamlessly bridges the semantic gap between file systems and object stores. With IRIS, emerging High-Performance Data Analytics software gains capable and diverse I/O support. IRIS can bring us closer to the convergence of HPC and Cloud environments by combining the best storage subsystems from both worlds. Experimental results show that IRIS can deliver more than 7x better performance than existing solutions.
Rethinking Key-Value Store for Parallel I/O Optimization (IJHPCA 2017)
Key-value stores are widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architectural differences and performance characteristics of parallel file systems and key-value stores. We propose using key-value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as those with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of the two systems using data collected from our experiments, and we provide a predictive method to identify which system offers better I/O performance for a given workload. The results show that I/O performance in HPC systems can be optimized by utilizing key-value stores.
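The abstract describes the predictive method only at a high level. The sketch below shows just the decision structure such a method implies, a per-system cost model evaluated against a workload profile. All parameter values are invented placeholders; the paper fits its models to measured data.

```cpp
// Toy version of "pick the better storage system for a workload": model each
// system's cost as bulk-transfer time plus per-operation latency, then take
// the minimum. Parameters are made up for illustration.
#include <iostream>
#include <string>

struct Workload {
    double bytes;         // total data moved
    double metadata_ops;  // opens, stats, syncs, ...
};

struct SystemModel {
    std::string name;
    double bandwidth;   // bytes/second for bulk transfer
    double op_latency;  // seconds per metadata operation
    double cost(const Workload& w) const {
        return w.bytes / bandwidth + w.metadata_ops * op_latency;
    }
};

int main() {
    // Placeholder models: a PFS with high bandwidth but expensive metadata,
    // and a KV store with lower bandwidth but cheap small operations.
    SystemModel pfs{"parallel file system", 5e9, 1e-3};
    SystemModel kv{"key-value store", 1e9, 1e-5};

    Workload metadata_heavy{1e8, 1e6};  // 100 MB moved, a million small ops
    Workload bulk{1e12, 1e3};           // 1 TB moved, few ops

    for (const Workload& w : {metadata_heavy, bulk}) {
        const SystemModel& best = pfs.cost(w) < kv.cost(w) ? pfs : kv;
        std::cout << "bytes=" << w.bytes << " ops=" << w.metadata_ops
                  << " -> " << best.name << "\n";
    }
}
```

With these placeholder numbers, the metadata-heavy workload is routed to the key-value store and the bulk workload to the parallel file system, matching the paper's qualitative finding.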