-
Notifications
You must be signed in to change notification settings - Fork 6.3k
[To Be Deprecated] Persistent Read Cache
For a very long time, disks were the means of persistent for datastores. With the introduction of SSD, we now have a persistent medium that is significantly faster than the traditional disks but with limited write endurance and capacity, enabling us to explore the opportunities of tiered storage architecture. Open source implementations like flash cache used SSD and disk as tiered storage outperforming disk for server applications. RocksDB Persistent read cache is an effort to take advantage of the tiered storage architecture in a device agnostic and operating system independent manner for the RocksDB ecosystem.
RocksDB users can take advantage of the tiered storage architecture either by adopting tiered storage deployment approach or by adopting tiered cache deployment approach. With the tiered storage approach, you can distribute the contents of the LSM on multiple persistent storage tiers. With the tiered cache approach, users can use the faster persistent medium as a read cache serving frequently accessed parts of the LSM and enhance the overall performance of RocksDB.
Tiered cache has a few advantage in terms of data mobility since the cache is an add-on for performance. The store can continue to function without the cache.
The persistent read cache is a generic implementation and is not specifically designed for any kind of device in particular. Instead of designing for specific types of hardware, we have taken the approach of designing the cache to provide the user with a mechanism to describe the best way to access the device, and the IO paths will be configured to work as per the description.
Write code path can be described using the formula
{ Block Size, Queue depth, Access/Caching Technique }
Read code path can be described using the formula
{ Access/Caching Technique }
Block Size describes the size to read/write. In the case of SSDs, this would typically be erasure block size.
Queue depth is the parallelism at which the device exhibits the best performance.
Access/Caching Technique is used to describes the best way to access the device. Using direct IO access for example is suitable for certain devices/applications and buffered access is preferred for others.
Persistent read cache is build using RocksDB abstraction and is supported on all platforms where RocksDB is supported.
Since this is a cache implementation, the cache may or may not be supplied on a restart.
The implementation of Persistent Read Cache has three fundamental components.
This is a scalable in-memory hash index that maps a given LSM block address to a cache record locator. The cache record locator helps locate the block data in the cache. The cache record can be described as { file-id, offset, size }.
The is a scalable in-memory hash index which allows for eviction based on LRU. This index maps a given file identifier to its reference object abstraction. The object abstraction can be used for reading data from the cache. When we run out of space on the persistent cache, we evict the least recently used file from this index.
The cache is stored in the file system as a sequence of files. Each file contains a sequence of records which contain data corresponding to a block on RocksDB LSM.
Please follow the link below for the public API.
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/persistent_cache.h
Contents
- RocksDB Wiki
- Overview
- RocksDB FAQ
- Terminology
- Requirements
- Contributors' Guide
- Release Methodology
- RocksDB Users and Use Cases
- RocksDB Public Communication and Information Channels
-
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator (Experimental)
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
- Options
- MemTable
- Journal
- Cache
- Write Buffer Manager
- Compaction
- SST File Formats
- IO
- Compression
- Full File Checksum and Checksum Handoff
- Background Error Handling
- Huge Page TLB Support
- Tiered Storage (Experimental)
- Logging and Monitoring
- Known Issues
- Troubleshooting Guide
- Tests
- Tools / Utilities
-
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
- Extending RocksDB
- RocksJava
- Lua
- Performance
- Projects Being Developed
- Misc