Metadata-Rich File Systems
This project was an LLNL/UCSC collaboration whose goal was to design a scalable metadata-rich file system (MRFS) with database-like data management services. With such a file system, scientists are able to perform time-critical analysis over continually evolving, very large data sets.
File system metadata management has become a bottleneck for many data-intensive applications that rely on high-performance file systems. Part of the bottleneck is due to the limitations of an almost 50-year-old interface standard, whose metadata abstractions were designed at a time when high-end file systems managed less than 100MB. Today’s high-performance file systems store 7-9 orders of magnitude more data, resulting in numbers of data items for which these metadata abstractions are inadequate. Users of file systems have attempted to work around these inadequacies by moving application-specific metadata management to relational databases to make metadata searchable. Splitting file system metadata management into two separate systems introduces inefficiencies and systems-management problems. To address this problem, we explore searchable metadata management services that integrate all file system metadata and use a graph data model with attributes on nodes and links. Our research focuses on the following areas: (1) query language design, (2) data structures for metadata, (3) query planning, processing, and optimization, (4) workload selection and query experiment preparation, and (5) trace analysis.
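To make the idea of a graph data model with attributes on nodes and links concrete, here is a minimal Python sketch. It is not the MRFS implementation: the names `Node`, `Link`, `MetadataGraph`, `find_nodes`, and `follow` are hypothetical and chosen only for illustration, and the in-memory lists stand in for whatever physical layout a real system would use.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Attrs = Dict[str, Any]


@dataclass
class Node:
    """A file or directory carrying arbitrary searchable attributes (hypothetical)."""
    node_id: int
    attrs: Attrs = field(default_factory=dict)


@dataclass
class Link:
    """A directed relationship between two nodes, with its own attributes (hypothetical)."""
    src: int
    dst: int
    attrs: Attrs = field(default_factory=dict)


class MetadataGraph:
    """Toy in-memory store; an assumption for illustration, not the MRFS design."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.links: List[Link] = []

    def add_node(self, node_id: int, **attrs: Any) -> Node:
        node = Node(node_id, dict(attrs))
        self.nodes[node_id] = node
        return node

    def add_link(self, src: int, dst: int, **attrs: Any) -> Link:
        # Reject dangling links so every link connects two known nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        link = Link(src, dst, dict(attrs))
        self.links.append(link)
        return link

    def find_nodes(self, predicate: Callable[[Attrs], bool]) -> List[Node]:
        """Return nodes whose attributes satisfy the predicate."""
        return [n for n in self.nodes.values() if predicate(n.attrs)]

    def follow(self, src: int, predicate: Callable[[Attrs], bool]) -> List[Node]:
        """Return targets of links leaving `src` whose link attributes match."""
        return [self.nodes[l.dst] for l in self.links
                if l.src == src and predicate(l.attrs)]
```

With this structure, file metadata and application-specific relationships live in one store: a query can filter on node attributes (for example, size) and then traverse links filtered by link attributes (for example, a made-up `type="generated_from"`), without consulting a separate database.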
We designed and implemented QUASAR, a path-based query language using the POSIX IO data model extended by relational links. We conducted a couple of data mining case studies in which we compared a baseline architecture, consisting of a database and a file system, with our MRFS prototype. Through its query language, the QUASAR interface provides much easier access to large data sets than POSIX IO. MRFS’ querying performance is significantly better than that of the baseline system due to QUASAR’s hierarchical scoping. We worked on a scalable physical data model for QUASAR’s logical data model and designed an MRFS client cache to address small update overheads and metadata coherence.
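The general intuition behind hierarchical scoping can be sketched in a few lines of Python. This is neither QUASAR syntax nor the MRFS query engine; `DirNode`, `resolve`, and `scoped_query` are hypothetical names, and the sketch only assumes that a query names a directory path as its scope and that only the subtree under that path is examined.

```python
from typing import Any, Callable, Dict, List, Tuple


class DirNode:
    """A node in a toy namespace tree with attributes (hypothetical)."""

    def __init__(self, name: str, **attrs: Any) -> None:
        self.name = name
        self.attrs: Dict[str, Any] = attrs
        self.children: Dict[str, "DirNode"] = {}

    def add(self, child: "DirNode") -> "DirNode":
        self.children[child.name] = child
        return child


def resolve(root: DirNode, path: str) -> DirNode:
    """Walk from the root to the node named by an absolute path."""
    node = root
    for part in path.strip("/").split("/"):
        if part:
            node = node.children[part]
    return node


def scoped_query(root: DirNode, scope: str,
                 predicate: Callable[[Dict[str, Any]], bool]
                 ) -> Tuple[List[str], int]:
    """Return matching paths under `scope` and the number of nodes visited."""
    results: List[str] = []
    visited = 0
    stack = [(scope.rstrip("/"), resolve(root, scope))]
    while stack:
        path, node = stack.pop()
        visited += 1
        if predicate(node.attrs):
            results.append(path or "/")
        # Only descendants of the scope are ever pushed, so the rest of
        # the namespace is never touched.
        for name, child in node.children.items():
            stack.append((f"{path}/{name}", child))
    return sorted(results), visited
```

Narrowing the scope from `/` to a subdirectory shrinks the set of nodes the query has to visit, whereas a flat metadata table without a usable path index would have to be scanned in full. How MRFS actually exploits scoping in its physical data model is not shown here.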
Participants
Mentors
- Maya Gokhale
Sponsors
- LLNL
- Institute for Scalable Scientific Data Management (ISSDM)
- Petascale Data Storage Institute (PDSI)
- IBM Almaden