Proposal: SLRU to Buffer Cache - Mailing list pgsql-hackers
From: Shawn Debnath
Subject: Proposal: SLRU to Buffer Cache
Msg-id: 20180814213500.GA74618@60f81dc409fc.ant.amazon.com
Responses: Re: Proposal: SLRU to Buffer Cache (two replies)
List: pgsql-hackers
Hello hackers,

At the Unconference in Ottawa this year, I pitched the idea of moving components off of SLRU and on to the buffer cache. The motivation behind the idea was threefold:

* Improve performance by eliminating fixed-size caches and their simplistic scan and eviction algorithms.
* Ensure durability and consistency by tracking LSNs and checksums per block.
* Consolidate caching strategies in the engine to simplify the codebase and benefit from future buffer cache optimizations.

As the changes are quite invasive, I wanted to vet the approach with the community before digging into implementation. The changes are strictly on the storage side and do not change the runtime behavior or protocols. Here's the current approach I am considering:

1. Implement a generic block storage manager that parameterizes several options like segment sizes, and fork and segment naming and path schemes, concepts entrenched in md.c that are strongly tied to relations. To mitigate risk, I am planning on not modifying md.c for the time being. (A rough sketch of this and the next item follows the list.)

2. Introduce a new smgr_truncate_extended() API to allow truncation of a range of blocks starting at a specific offset, with an option to delete the file instead of simply truncating it.

3. Continue to use the RelFileNode/SMgrRelation constructs through the SMgr API. I will reserve OIDs within the engine that we can use as the DB ID in RelFileNode to determine which storage manager to associate with a specific SMgrRelation. To increase the visibility of the OID mappings to the user, I would expose a new catalog where the OIDs can be reserved and mapped to existing components for template db generation. Internally, SMgr wouldn't rely on catalogs, but would instead have the mappings defined in code so as not to block bootstrap. This scheme should be compatible with the undo log storage work by Thomas Munro, et al. [0]. (See the second sketch after the list.)

4. For each component transitioned over to the generic block storage, introduce a page header at the beginning of the block and rework the associated offset calculations, along with moving from the SLRU machinery to the buffer cache framework. (See the third sketch after the list.)

5. Due to the on-disk format changes, simply copying the segments during upgrade wouldn't work anymore. Given the nature of the data stored within SLRU segments today, we can extend pg_upgrade to translate the segment files by scanning from relfrozenxid and relminmxid and recording the corresponding values at the new offsets in the target segments.

6. For now, implement an fsync queue handler specific to the generic block storage manager. In the future, once Andres' fsync queue work [1] gets merged in, we can move to a common handler instead of duplicating the work.

7. Update impacted extensions such as pageinspect and pg_buffercache.

8. We may need to introduce new shared buffer access strategies to keep these components from thrashing the buffer cache. (The second sketch below touches on this as well.)
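To make items 1 and 2 above concrete, here's a minimal sketch of what the parameterization and the truncate API could look like. Only the name smgr_truncate_extended() comes from the proposal itself; the struct, its fields, and the exact signature are placeholders for illustration, not settled design.

#include "postgres.h"
#include "storage/block.h"
#include "storage/smgr.h"

/* Knobs that md.c currently hard-codes for relations */
typedef struct GenericBlockStoreParams
{
    int         blocks_per_segment; /* md.c pins this to RELSEG_SIZE */
    const char *path_prefix;        /* e.g. "pg_commit_ts" vs. "base/<oid>" */
    bool        has_forks;          /* relations have forks; SLRUs don't */
} GenericBlockStoreParams;

/*
 * Truncate nblocks blocks starting at 'first'.  With unlink_segments set,
 * segment files that become empty are removed rather than left truncated,
 * matching how SLRUs drop whole segment files today.
 */
extern void smgr_truncate_extended(SMgrRelation reln, ForkNumber forknum,
                                   BlockNumber first, BlockNumber nblocks,
                                   bool unlink_segments);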
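For items 3 and 8, the sketch below shows how a component could go through the existing buffer manager once its RelFileNode carries a reserved DB OID. The OID value, the helper name, and the use of BAS_BULKREAD as a stand-in strategy are all my placeholders; a dedicated strategy type may well be needed.

#include "postgres.h"
#include "catalog/pg_tablespace.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"

/*
 * Placeholder: a reserved, in-code OID used as the DB ID in RelFileNode
 * so SMgr can route requests for this component to the generic block
 * storage manager.  The real value would come from the reservation
 * catalog described in item 3.
 */
#define COMMIT_TS_PSEUDO_DB_OID ((Oid) 8000)

static BufferAccessStrategy commit_ts_strategy = NULL;

/*
 * Read a commit timestamp block through shared buffers.  A small ring
 * buffer keeps bulk lookups from evicting the rest of the cache;
 * BAS_BULKREAD is borrowed here for illustration.
 */
static Buffer
commit_ts_read_block(BlockNumber blkno)
{
    RelFileNode rnode;

    rnode.spcNode = GLOBALTABLESPACE_OID;
    rnode.dbNode = COMMIT_TS_PSEUDO_DB_OID;
    rnode.relNode = InvalidOid;     /* or a per-component id; TBD */

    if (commit_ts_strategy == NULL)
        commit_ts_strategy = GetAccessStrategy(BAS_BULKREAD);

    return ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
                                     RBM_NORMAL, commit_ts_strategy);
}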
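Item 4 is mostly a matter of accounting for the page header in the offset math. Taking commit timestamps as the example, the before/after macros below illustrate the change. The entry layout mirrors what commit_ts.c uses today; the *_PAGED names are mine.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "utils/timestamp.h"

/* Per-transaction payload, as laid out in commit_ts.c today */
typedef struct CommitTimestampEntry
{
    TimestampTz time;
    RepOriginId nodeid;
} CommitTimestampEntry;

#define SizeOfCommitTimestampEntry \
    (offsetof(CommitTimestampEntry, nodeid) + sizeof(RepOriginId))

/* Today: SLRU pages have no header, so the whole block holds entries */
#define XACTS_PER_PAGE_SLRU \
    (BLCKSZ / SizeOfCommitTimestampEntry)

/* Proposed: subtract the (MAXALIGN'd) page header before dividing */
#define XACTS_PER_PAGE_PAGED \
    ((BLCKSZ - MAXALIGN(SizeOfPageHeaderData)) / SizeOfCommitTimestampEntry)

/* The page number an xid maps to shifts accordingly */
static inline int
TransactionIdToPagedPage(TransactionId xid)
{
    return xid / XACTS_PER_PAGE_PAGED;
}

/* Locate an xid's entry within a page that now carries a header */
static inline char *
TransactionIdToPagedEntry(Page page, TransactionId xid)
{
    int         slot = xid % XACTS_PER_PAGE_PAGED;

    /*
     * PageGetContents() skips the MAXALIGN'd header.  Entries keep their
     * packed 10-byte stride, so callers memcpy rather than dereference,
     * as commit_ts.c does today.
     */
    return PageGetContents(page) + slot * SizeOfCommitTimestampEntry;
}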
The work would be broken up into several smaller pieces so that we can get patches out for review and course-correct if needed:

1. Generic block storage manager with changes to the SMgr APIs, plus code to initialize the new storage manager based on the DB ID in RelFileNode. This patch will also introduce the new catalog showing the OIDs that map to the new storage manager.

2. Adapt commit timestamp: a simple and easy component to transition over as a first step, enabling us to test the whole framework. Changes will also include patching pg_upgrade to translate commit timestamp segments to the new format, and the associated updates to extensions. This step will also include functional test coverage, especially for edge cases around data on page boundaries, and benchmark results comparing performance per component on SLRU vs. buffer cache to identify regressions.

3. Iterate over each remaining SLRU component, using the work done for commit timestamp as an example: multixact, clog, subtrans, async notifications, and predicate locking.

4. If required, implement shared access strategies, i.e., non-backend-private ring buffers, to limit buffer cache usage by these components.

Would love to hear feedback and comments on the approach above.

Thanks,

Shawn Debnath
Amazon Web Services (AWS)

[0] https://github.com/enterprisedb/zheap/tree/undo-log-storage
[1] https://www.postgresql.org/message-id/flat/20180424180054.inih6bxfspgowjuc%40alap3.anarazel.de