Proposal: SLRU to Buffer Cache - Mailing list pgsql-hackers

From: Shawn Debnath
Subject: Proposal: SLRU to Buffer Cache
Msg-id: 20180814213500.GA74618@60f81dc409fc.ant.amazon.com
List: pgsql-hackers

Hello hackers,

At the Unconference in Ottawa this year, I pitched the idea of moving
components off of SLRU and onto the buffer cache. The motivation
behind the idea was threefold:

  * Improve performance by eliminating fixed-size caches and the
    simplistic scan and eviction algorithms.
  * Ensure durability and consistency by tracking LSNs and checksums
    per block.
  * Consolidate caching strategies in the engine to simplify the
    codebase and benefit from future buffer cache optimizations.

As the changes are quite invasive, I wanted to vet the approach with the
community before digging into the implementation. The changes are strictly
on the storage side and do not change runtime behavior or protocols.
Here's the current approach I am considering:

  1. Implement a generic block storage manager that parameterizes
     several options, such as segment size, fork and segment naming,
     and path schemes, all of which are concepts entrenched in md.c and
     strongly tied to relations. To mitigate risk, I am planning on not
     modifying md.c for the time being. A rough sketch of this
     parameterization, together with the truncate API from the next
     point, follows this list.

  2. Introduce a new smgr_truncate_extended() API to allow truncation of
     a range of blocks starting at a specific offset, along with an
     option to delete the file instead of simply truncating it.

  3. I will continue to use the RelFileNode/SMgrRelation constructs
     through the SMgr API. I will reserve OIDs within the engine that we
     can use as the DB ID in RelFileNode to determine which storage
     manager to associate with a specific SMgrRelation. To increase the
     visibility of the OID mappings to the user, I would expose a new
     catalog where the OIDs can be reserved and mapped to existing
     components for template db generation. Internally, SMgr won't rely
     on the catalogs; the mappings will instead be defined in code so as
     not to block bootstrap. This scheme should be compatible with the
     undo log storage work by Thomas Munro, et al. [0]. A sketch of the
     OID-based routing also follows this list.

  4. For each component that is transitioned over to the generic block
     storage, I will introduce a page header at the beginning of each
     block and rework the associated offset calculations, along with
     transitioning the component from the SLRU to the buffer cache
     framework. The offset rework is sketched after this list.

  5. Due to the on-disk format changes, simply copying the segments
     during upgrade wouldn't work anymore. Given the nature of data
     stored within SLRU segments today, we can extend pg_upgrade to
     translate the segment files by scanning from relfrozenxid and
     relminmxid and recording the corresponding values at the new
     offsets in the target segments.

  6. For now, I will implement an fsync queue handler specific to the
     generic block storage manager. In the future, once Andres' fsync
     queue work [1] gets merged in, we can move to a common handler
     instead of duplicating the work.

  7. Will update impacted extensions such as pageinspect and
     pg_buffercache.

  8. We may need to introduce new shared buffer access strategies to
     keep these components from thrashing the buffer cache.
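
To make the first two points concrete, here is a rough sketch of how the
per-component parameterization and the truncate API could look. None of
these names exist in the tree today; they are placeholders meant to show
the shape of the interface, not a final design:

    /*
     * Hypothetical sketch only. Each component registers a small parameter
     * block instead of inheriting the relation-centric conventions baked
     * into md.c.
     */
    #include "postgres.h"
    #include "storage/smgr.h"

    typedef struct GenericBlockStorageOptions
    {
        const char *directory;      /* e.g. "pg_xact" or "pg_commit_ts" */
        int         blocks_per_seg; /* segment size in blocks */
        bool        use_forks;      /* do fork numbers appear in paths? */

        /* render the on-disk path for a given segment number */
        void        (*segment_path) (char *buf, size_t buflen, Oid dbid,
                                     ForkNumber forknum, int segno);
    } GenericBlockStorageOptions;

    /*
     * Proposed truncate API (point 2): drop nblocks blocks starting at
     * 'first', optionally unlinking segment files that become empty.
     */
    extern void smgr_truncate_extended(SMgrRelation reln, ForkNumber forknum,
                                       BlockNumber first, BlockNumber nblocks,
                                       bool unlink_segments);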
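
For point 3, the routing decision could be as simple as a switch on the DB
ID in the RelFileNode. The OID values below are placeholders; in a real
patch they would come from the reserved range exposed through the new
catalog:

    /*
     * Hypothetical sketch of selecting a storage manager by reserved DB
     * OID. Today smgropen() always binds a relation to md.c.
     */
    #include "postgres.h"
    #include "storage/smgr.h"

    #define COMMIT_TS_DB_OID   ((Oid) 8001)   /* placeholder reserved OID */
    #define CLOG_DB_OID        ((Oid) 8002)   /* placeholder reserved OID */

    static int
    select_smgr(const RelFileNode *rnode)
    {
        switch (rnode->dbNode)
        {
            case COMMIT_TS_DB_OID:
            case CLOG_DB_OID:
                return 1;       /* generic block storage manager */
            default:
                return 0;       /* md.c, unchanged */
        }
    }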
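
And for point 4, the offset rework for, say, commit timestamps comes down
to accounting for the page header in the per-block math. The 10-byte entry
size matches today's commit_ts.c; everything else is illustrative:

    /*
     * Illustrative only: per-xid offset math for commit timestamps once a
     * standard page header occupies the start of each block.
     */
    #include "postgres.h"
    #include "storage/bufpage.h"

    #define CTS_ENTRY_SIZE       10   /* TimestampTz + RepOriginId, as today */
    #define CTS_ENTRIES_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) / CTS_ENTRY_SIZE)

    static inline BlockNumber
    cts_xid_to_block(TransactionId xid)
    {
        return xid / CTS_ENTRIES_PER_PAGE;
    }

    static inline Size
    cts_xid_to_offset(TransactionId xid)
    {
        return SizeOfPageHeaderData +
               (Size) (xid % CTS_ENTRIES_PER_PAGE) * CTS_ENTRY_SIZE;
    }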

The work would be broken up into several smaller pieces so that we can
get patches out for review and course-correct if needed.

  1. Generic block storage manager with changes to SMgr APIs and code to
     initialize the new storage manager based on DB ID in RelFileNode.
     This patch will also introduce the new catalog to show the OIDs
     which map to this new storage manager.

  2. Adapt commit timestamp: a simple component to transition over as a
     first step, enabling us to test the whole framework. Changes will
     also include patching pg_upgrade to translate commit timestamp
     segments to the new format (sketched after this list) and
     associated updates to extensions.

     This step will also include functional test coverage, especially
     for edge cases around data at page boundaries, and benchmark
     results comparing per-component performance on SLRU vs. the buffer
     cache to identify regressions.

  3. Iterate for each component in SLRU using the work done for commit
     timestamp as an example: multixact, clog, subtrans, async
     notifications, and predicate locking.

  4. If required, implement shared access strategies, i.e., ring buffers
     that are not backend-private, to limit buffer cache usage by these
     components; see the sketch after this list.
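
As a rough illustration of the pg_upgrade translation in work item 2 (and
point 5 above), remapping a single commit timestamp entry boils down to
recomputing its byte position under the page-header-aware layout. Segment
boundaries and all file handling are glossed over here by treating each
layout as one flat byte range; the constants are placeholders matching the
earlier sketch:

    /*
     * Simplified sketch; real code must walk segment files and handle
     * partial pages.
     */
    #include "postgres.h"
    #include <string.h>
    #include "storage/bufpage.h"

    #define CTS_ENTRY_SIZE        10
    #define OLD_ENTRIES_PER_PAGE  (BLCKSZ / CTS_ENTRY_SIZE)
    #define NEW_ENTRIES_PER_PAGE  ((BLCKSZ - SizeOfPageHeaderData) / CTS_ENTRY_SIZE)

    static void
    translate_commit_ts_entry(TransactionId xid,
                              const char *old_data, char *new_data)
    {
        Size old_pos = (Size) (xid / OLD_ENTRIES_PER_PAGE) * BLCKSZ +
                       (Size) (xid % OLD_ENTRIES_PER_PAGE) * CTS_ENTRY_SIZE;
        Size new_pos = (Size) (xid / NEW_ENTRIES_PER_PAGE) * BLCKSZ +
                       SizeOfPageHeaderData +
                       (Size) (xid % NEW_ENTRIES_PER_PAGE) * CTS_ENTRY_SIZE;

        memcpy(new_data + new_pos, old_data + old_pos, CTS_ENTRY_SIZE);
    }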
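
And for work items 3 and 4, a transitioned component would fetch its blocks
through the shared buffer cache. ReadBufferWithoutRelcache() and the
ring-buffer access strategies already exist; the reserved OID is a
placeholder, and a dedicated strategy type, if we need one, would be new:

    /*
     * Sketch of a transitioned component reading one of its blocks through
     * the shared buffer cache, tagged with a reserved DB OID.
     */
    #include "postgres.h"
    #include "catalog/pg_tablespace.h"
    #include "storage/bufmgr.h"

    #define COMMIT_TS_DB_OID ((Oid) 8001)   /* placeholder reserved OID */

    static Buffer
    commit_ts_read_block(BlockNumber blkno, BufferAccessStrategy strategy)
    {
        RelFileNode rnode;

        rnode.spcNode = GLOBALTABLESPACE_OID;   /* cluster-wide data */
        rnode.dbNode = COMMIT_TS_DB_OID;        /* selects the new storage manager */
        rnode.relNode = 0;                      /* component-defined */

        return ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, strategy);
    }

A caller could pass GetAccessStrategy(BAS_BULKREAD) today, or a new
strategy type if one is added, so that these reads cycle through a small
ring instead of evicting regular relation data.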

Would love to hear feedback and comments on the approach above.


Thanks,

Shawn Debnath
Amazon Web Services (AWS)


[0] https://github.com/enterprisedb/zheap/tree/undo-log-storage
[1] https://www.postgresql.org/message-id/flat/20180424180054.inih6bxfspgowjuc%40alap3.anarazel.de

