Thread: Proposal: SLRU to Buffer Cache

Proposal: SLRU to Buffer Cache

From: Shawn Debnath
Hello hackers,

At the Unconference in Ottawa this year, I pitched the idea of moving
components off of SLRU and on to the buffer cache. The motivation
behind the idea was threefold:

  * Improve performance by eliminating fixed-size caches and their
    simplistic scan and eviction algorithms.
  * Ensure durability and consistency by tracking LSNs and checksums
    per block.
  * Consolidate caching strategies in the engine to simplify the
    codebase and benefit from future buffer cache optimizations.

As the changes are quite invasive, I wanted to vet the approach with the
community before digging into the implementation. The changes are strictly
on the storage side and do not change the runtime behavior or protocols.
Here's the current approach I am considering:

  1. Implement a generic block storage manager that parameterizes
     several options such as segment sizes, fork and segment naming,
     and path schemes, all concepts entrenched in md.c that are
     strongly tied to relations. To mitigate risk, I am planning on
     not modifying md.c for the time being.

  2. Introduce a new smgr_truncate_extended() API to allow truncation of
     a range of blocks starting at a specific offset, with an option to
     delete the file instead of simply truncating it.

  3. I will continue to use the RelFileNode/SMgrRelation constructs
     through the SMgr API. I will reserve OIDs within the engine that we
     can use as the DB ID in RelFileNode to determine which storage
     manager to associate with a specific SMgrRelation (a rough sketch
     of this routing, together with the truncate API from point 2,
     follows this list). To increase the visibility of the OID mappings
     to the user, I would expose a new catalog where the OIDs can be
     reserved and mapped to existing components for template db
     generation. Internally, SMgr wouldn't rely on catalogs, but would
     instead have the OIDs defined in code so as not to block bootstrap.
     This scheme should be compatible with the undo log storage work by
     Thomas Munro, et al. [0].

  4. For each component that is transitioned over to the generic block
     storage, I will introduce a page header at the beginning of the
     block and rework the associated offset calculations, along with
     transitioning from the SLRU to the buffer cache framework.

  5. Due to the on-disk format changes, simply copying the segments
     during upgrade wouldn't work anymore. Given the nature of data
     stored within SLRU segments today, we can extend pg_upgrade to
     translate the segment files by scanning from relfrozenxid and
     relminmxid and recording the corresponding values at the new
     offsets in the target segments.

  6. For now, I will implement an fsync queue handler specific to the
     generic block storage manager. In the future, once Andres' fsync
     queue work [1] gets merged in, we can move towards a common handler
     instead of duplicating the work.

  7. Will update impacted extensions such as pageinspect and
     pg_buffercache.

  8. We may need to introduce new shared buffer access strategies to
     keep these components from thrashing the buffer cache.
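
To make points 2 and 3 a bit more concrete, here is a minimal sketch.
None of the names below exist today; the reserved OID range, the SMGR_*
constants, and smgr_truncate_extended() itself are all assumptions of
this proposal:

    #include "postgres.h"
    #include "storage/relfilenode.h"   /* RelFileNode */
    #include "storage/smgr.h"          /* SMgrRelation, ForkNumber, BlockNumber */

    /* Hypothetical reserved pseudo-database OID range for these components. */
    #define GENERIC_BLOCKSTORE_DB_OID_MIN  8000
    #define GENERIC_BLOCKSTORE_DB_OID_MAX  8100

    /* Hypothetical smgrsw[] indexes. */
    #define SMGR_MD                   0    /* existing md.c implementation */
    #define SMGR_GENERIC_BLOCKSTORE   1    /* new generic block storage */

    /* Route an SMgrRelation to a storage manager based on its DB OID. */
    static inline int
    smgr_which_for(const RelFileNode *rnode)
    {
        if (rnode->dbNode >= GENERIC_BLOCKSTORE_DB_OID_MIN &&
            rnode->dbNode <= GENERIC_BLOCKSTORE_DB_OID_MAX)
            return SMGR_GENERIC_BLOCKSTORE;
        return SMGR_MD;
    }

    /*
     * Proposed in point 2: truncate nblocks starting at 'start' in the
     * given fork, optionally unlinking segment files that become empty.
     */
    extern void smgr_truncate_extended(SMgrRelation reln, ForkNumber forknum,
                                       BlockNumber start, BlockNumber nblocks,
                                       bool unlink_segments);

The intent is that nothing above the smgr layer needs to know which
implementation it ends up talking to.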

The work would be broken up into several smaller pieces so that we can
get patches out for review and course-correct if needed.

  1. Generic block storage manager with changes to the SMgr APIs and
     code to initialize the new storage manager based on the DB ID in
     RelFileNode. This patch will also introduce the new catalog to show
     the OIDs that map to this new storage manager.

  2. Adapt commit timestamp: a simple component to transition over as a
     first step, enabling us to test the whole framework. Changes will
     also include patching pg_upgrade to translate commit timestamp
     segments to the new format (see the sketch after this list for how
     the per-page entry layout changes) and the associated updates to
     extensions.

     This will also include functional test coverage, especially for
     edge cases around data on page boundaries, and benchmark results
     comparing per-component performance on SLRU vs. the buffer cache
     to identify regressions.

  3. Iterate over each remaining SLRU component, using the commit
     timestamp work as a template: multixact, clog, subtrans, async
     notifications, and predicate locking.

  4. If required, implement shared access strategies, i.e.,
     non-backend-private ring buffers, to limit buffer cache usage by
     these components.
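
As a rough illustration of the offset rework and the pg_upgrade
translation for commit timestamps: the 10-byte entry size below matches
today's entry (a TimestampTz plus a RepOriginId), but the macro names
are made up and the numbers assume the default 8 kB block size.

    #include "postgres.h"
    #include "storage/bufpage.h"    /* SizeOfPageHeaderData */

    #define ENTRY_SIZE          10  /* sizeof(TimestampTz) + sizeof(RepOriginId) */

    /* Old SLRU layout: entries start at byte 0 of the page. */
    #define OLD_XACTS_PER_PAGE  (BLCKSZ / ENTRY_SIZE)                       /* 819 */

    /* New layout: entries start after a standard page header. */
    #define NEW_XACTS_PER_PAGE  ((BLCKSZ - SizeOfPageHeaderData) / ENTRY_SIZE)  /* 816 */

    /*
     * pg_upgrade would walk xids from the old cluster's frozen horizon and
     * recompute where each entry lands in the new segment files.
     */
    static inline void
    translate_xid(TransactionId xid, int64 *new_pageno, int *new_entryno)
    {
        *new_pageno = (int64) (xid / NEW_XACTS_PER_PAGE);
        *new_entryno = (int) (xid % NEW_XACTS_PER_PAGE);
    }

A few entries per page shift onto the next page, which is exactly the
kind of page-boundary edge case the tests in step 2 should cover.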

Would love to hear feedback and comments on the approach above.


Thanks,

Shawn Debnath
Amazon Web Services (AWS)


[0] https://github.com/enterprisedb/zheap/tree/undo-log-storage
[1] https://www.postgresql.org/message-id/flat/20180424180054.inih6bxfspgowjuc%40alap3.anarazel.de


Re: Proposal: SLRU to Buffer Cache

From: Thomas Munro
Hi Shawn,

On Wed, Aug 15, 2018 at 9:35 AM, Shawn Debnath <sdn@amazon.com> wrote:
> At the Unconference in Ottawa this year, I pitched the idea of moving
> components off of SLRU and on to the buffer cache. The motivation
> behind the idea was three fold:
>
>   * Improve performance by eliminating fixed sized caches, simplistic
>     scan and eviction algorithms.
>   * Ensuring durability and consistency by tracking LSNs and checksums
>     per block.
>   * Consolidating caching strategies in the engine to simplify the
>     codebase, and would benefit from future buffer cache optimizations.

Thanks for working on this.  These are good goals, and I've wondered
about doing exactly this myself for exactly those reasons.  I'm sure
we're not the only ones, and I heard only positive reactions to your
unconference pitch.  As you know, my undo log storage design interacts
with the buffer manager in the same way, so I'm interested in this
subject and will be keen to review and test what you come up with.
That said, I'm fairly new here myself and there are people on this
list with a decade or two more experience hacking on the buffer
manager and transam machinery.

> As the changes are quite invasive, I wanted to vet the approach with the
> community before digging in to implementation. The changes are strictly
> on the storage side and do not change the runtime behavior or protocols.
> Here's the current approach I am considering:
>
>   1. Implement a generic block storage manager that parameterizes
>      several options like segment sizes, fork and segment naming and
>      path schemes, concepts entrenched in md.c that are strongly tied to
>      relations. To mitigate risk, I am planning on not modifying md.c
>      for the time being.

+1 for doing it separately at first.

I've also vacillated between extending md.c and doing my own
undo_file.c thing.  It seems plausible that between SLRU and undo we
could at least share a common smgr implementation, and eventually
maybe md.c.  There are a few differences though, and the question is
whether we'd want to do yet another abstraction layer with
callbacks/vtable/configuration points to handle that parameterisation,
or just use the existing indirection in smgr and call it good.
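
To sketch what I mean by that parameterisation (nothing like this
exists, the struct and field names are made up):

    #include "postgres.h"
    #include "common/relpath.h"     /* ForkNumber */

    /*
     * One possible per-component configuration, consumed by a shared
     * smgr implementation instead of md.c's hard-coded relation rules.
     */
    typedef struct BlockStorageOps
    {
        const char *name;               /* e.g. "pg_xact", "pg_commit_ts" */
        int         blocks_per_segment; /* segment size in blocks */
        /* build the on-disk path for a given segment number */
        void      (*segment_path) (char *path, size_t len,
                                   Oid dbid, ForkNumber forknum, int64 segno);
        bool        supports_unlink;    /* may whole segments be removed? */
    } BlockStorageOps;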

I'm keen to see what you come up with.  After we have a patch to
refactor and generalise the fsync stuff from md.c (about which more
below), let's see what is left and whether we can usefully combine
some code.

>   2. Introduce a new smgr_truncate_extended() API to allow truncation of
>      a range of blocks starting at a specific offset, and option to
>      delete the file instead of simply truncating.

Hmm.  In my undo proposal I'm currently implementing only the minimum
smgr interface required to make bufmgr.c happy (basically read and
write blocks), but I'm managing segment files (creating, deleting,
recycling) directly via a separate interface UndoLogAllocate(),
UndoLogDiscard() defined in undolog.c.  That seemed necessary for me
because that's where I had machinery to track the meta-data (mostly
head and tail pointers) for each undo log explicitly, but I suppose I
could use a wider smgr interface as you are proposing to move the
filesystem operations over there.  Perhaps I should reconsider that
split.  I look forward to seeing your code.

>   3. I will continue to use the RelFileNode/SMgrRelation constructs
>      through the SMgr API. I will reserve OIDs within the engine that we
>      can use as DB ID in RelFileNode to determine which storage manager
>      to associate for a specific SMgrRelation. To increase the
>      visibility of the OID mappings to the user, I would expose a new
>      catalog where the OIDs can be reserved and mapped to existing
>      components for template db generation. Internally, SMgr wouldn't
>      rely on catalogs, but instead will have them defined in code to not
>      block bootstrap. This scheme should be compatible with the undo log
>      storage work by Thomas Munro, et al. [0].

+1 for the pseudo-DB OID scheme, for now.  I think we can reconsider
how we want to structure buffer tags in the longer term as part of
future projects that overhaul buffer mapping.  We shouldn't get hung
up on that now.

I was wondering what the point of exposing the OIDs to users in a
catalog would be though.  It's not necessary to do that to reserve
them (and even if it were, pg_database would be the place): the OIDs
we choose for undo, clog, ... just have to be in the system reserved
range to be safe from collisions.  I suppose one benefit would be the
ability to join e.g. pg_buffercache against it to get a human-readable
name like "clog", but that'd be slightly odd because the DB OID field
would refer to entries in pg_database or pg_storage_manager depending
on the number range.

>   4. For each component that will be transitioned over to the generic
>      block storage, I will introduce a page header at the beginning of
>      the block and re-work the associated offset calculations along with
>      transitioning from SLRU to buffer cache framework.

+1

As mentioned over in the SLRU checksums thread[1], I think that also
means that dirtied pages need to be registered with xlog so they get
full page writes when appropriate to deal with torn pages.  I think
SLRUs and undo will all be able to use REGBUF_WILL_INIT and
RBM_ZERO_XXX flags almost all the time because they're append-mostly.
You'll presumably generate one or two FPWs in each SLRU after each
checkpoint; one in the currently active page where the running xids
live, and occasionally an older page if you recently switched clog
page or have some very long running transactions that eventually get
stamped as committed.  In other words, there will be very few actual
full page writes generated by this, but it's something we need to get
right for correctness on some kinds of storage.  It might be possible
to skip that if checksums are not enabled (based on the theory that
torn pages can't hurt any current SLRU user due to their
write-without-read access pattern, it's just the checksum failures
that we need to worry about).
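
Roughly the usual pattern for pages that are fully initialised by their
WAL record, with the rmgr id and the info flag left as placeholders
(sketch only, not tested):

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"

    #define XLOG_COMPONENT_ZEROPAGE 0x00    /* placeholder info flag */

    static void
    zero_component_page(RelFileNode rnode, BlockNumber pageno)
    {
        Buffer      buf;
        XLogRecPtr  recptr;

        /* No relcache entry for these pseudo-databases, so go around it. */
        buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, pageno,
                                        RBM_ZERO_AND_LOCK, NULL);

        START_CRIT_SECTION();
        MarkBufferDirty(buf);

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);   /* redo rebuilds the page */
        recptr = XLogInsert(RM_CLOG_ID, XLOG_COMPONENT_ZEROPAGE);
        PageSetLSN(BufferGetPage(buf), recptr);

        END_CRIT_SECTION();
        UnlockReleaseBuffer(buf);
    }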

>   5. Due to the on-disk format changes, simply copying the segments
>      during upgrade wouldn't work anymore. Given the nature of data
>      stored within SLRU segments today, we can extend pg_upgrade to
>      translate the segment files by scanning from relfrozenxid and
>      relminmxid and recording the corresponding values at the new
>      offsets in the target segments.

+1

(Hmm, if we're going to change all this stuff, I wonder if there would
be any benefit to switching to 64 bit xids for the xid-based SLRUs
while we're here...)

>   6. For now, I will implement a fsync queue handler specific to generic
>      block store manager. In the future, once Andres' fsync queue work
>      [1] gets merged in, we can move towards a common handler instead of
>      duplicating the work.

I'm looking at that now: more soon.

>   7. Will update impacted extensions such as pageinspect and
>      pg_buffercache.

+1

>   8. We may need to introduce new shared buffer access strategies to
>      limit the components from thrashing buffer cache.

That's going to be an interesting area.  It will be good to get some
real experience.  For undo, so far it usually seems to work out OK
because we aggressively try to discard pages (that is, drop buffers
and put them on the freelist) at the same rate we dirty them.  I
speculate that for the SLRUs it might work out OK because, even though
the "discard" horizon moves very infrequently, pages are dirtied at a
relatively slow rate.  Let's see... you can fit just under 32k
transactions into each clog page, so a 50K TPS nonstop workload would
take about a day to trash 1GB of cache with clog.  That said, if it
turns out to be a problem we have a range of different hammers to hit
it with (and a number of hackers interested in that problem space).
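
If it did become a problem, the existing ring-buffer machinery gives us
an obvious first hammer (sketch only; BAS_BULKREAD is just a stand-in,
a dedicated strategy type for these components would be new):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    read_component_page(RelFileNode rnode, BlockNumber pageno)
    {
        /* In practice the strategy would be created once and reused. */
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        Buffer      buf;

        buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, pageno,
                                        RBM_NORMAL, strategy);
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        /* ... copy the needed status/timestamp entry out of the page ... */
        UnlockReleaseBuffer(buf);
        FreeAccessStrategy(strategy);
    }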

This clog.c comment is interesting:

 * This module replaces the old "pg_log" access code, which treated pg_log
 * essentially like a relation, in that it went through the regular buffer
 * manager.  The problem with that was that there wasn't any good way to
 * recycle storage space for transactions so old that they'll never be
 * looked up again.  Now we use specialized access code so that the commit
 * log can be broken into relatively small, independent segments.

So it actually did use the regular buffer pool for a decade or so.  It
doesn't look like the buffer pool was the problem (not that that would
tell us much if it had been, given how much has changed since commit
2589735da08c): it was just the lack of a way to truncate the front of
the growing relation file, wasting precious turn-of-the-century disk
space.

> The work would be broken up into several smaller pieces so that we can
> get patches out for review and course-correct if needed.
>
>   1. Generic block storage manager with changes to SMgr APIs and code to
>      initialize the new storage manager based on DB ID in RelFileNode.
>      This patch will also introduce the new catalog to show the OIDs
>      which map to this new storage manager.

Personally I wouldn't worry too much about that catalog stuff in v0
since it's just window dressing and doesn't actually help us get our
hands on the core feature prototype to test...

>   2. Adapt commit timestamp: simple and easy component to transition
>      over as a first step, enabling us to test the whole framework.
>      Changes will also include patching pg_upgrade to
>      translate commit timestamp segments to the new format and
>      associated updates to extensions.

+1, seems like as good a place as any to start.

>      Will also include functional test coverage, especially, edge
>      cases around data on page boundaries, and benchmark results
>      comparing performance per component on SLRU vs buffer cache
>      to identify regressions.

+1

>   3. Iterate for each component in SLRU using the work done for commit
>      timestamp as an example: multixact, clog, subtrans, async
>      notifications, and predicate locking.

Without looking, I wonder if clog.c is going to be the trickiest,
since its slots are also involved in some group LSN stuff IIRC.

>   4. If required, implement shared access strategies, i.e., non-backend
>      private ring buffers to limit buffer cache usage by these
>      components.

I have a suspicion this won't turn out to be necessary for SLRUs as
mentioned, so I'm not too worried about it.

> Would love to hear feedback and comments on the approach above.

I like it.  I'm looking forward to some prototype code.  Oh, I think I
already said that a couple of times :-)

[1] https://www.postgresql.org/message-id/flat/fec4857dbb1ddaccaafc6f6c7f71f0a7%40postgrespro.ru

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Proposal: SLRU to Buffer Cache

From: Shawn Debnath
Sorry for the delay!

On Wed, Aug 15, 2018 at 05:56:19PM +1200, Thomas Munro wrote:
> +1 for doing it separately at first.
> 
> I've also vacillated between extending md.c and doing my own
> undo_file.c thing.  It seems plausible that between SLRU and undo we
> could at least share a common smgr implementation, and eventually
> maybe md.c.  There are a few differences though, and the question is
> whether we'd want to do yet another abstraction layer with
> callbacks/vtable/configuration points to handle that parameterisation,
> or just use the existing indirection in smgr and call it good.
> 
> I'm keen to see what you come up with.  After we have a patch to
> refactor and generalise the fsync stuff from md.c (about which more
> below), let's see what is left and whether we can usefully combine
> some code.

There are a few different approaches we can take here. Let me ponder on 
it before implementing. We can iterate on the patch once it’s out.

> >   3. I will continue to use the RelFileNode/SMgrRelation constructs
> >      through the SMgr API. I will reserve OIDs within the engine that we
> >      can use as DB ID in RelFileNode to determine which storage manager
> >      to associate for a specific SMgrRelation. To increase the
> >      visibility of the OID mappings to the user, I would expose a new
> >      catalog where the OIDs can be reserved and mapped to existing
> >      components for template db generation. Internally, SMgr wouldn't
> >      rely on catalogs, but instead will have them defined in code to not
> >      block bootstrap. This scheme should be compatible with the undo log
> >      storage work by Thomas Munro, et al. [0].
> 
> +1 for the pseudo-DB OID scheme, for now.  I think we can reconsider
> how we want to structure buffer tags in the longer term as part of
> future projects that overhaul buffer mapping.  We shouldn't get hung
> up on that now.

+1. We should postpone the discussion of revamping buffer tags until a
later date. This set of patches will be quite a handful already.

> I was wondering what the point of exposing the OIDs to users in a
> catalog would be though.  It's not necessary to do that to reserve
> them (and even if it were, pg_database would be the place): the OIDs
> we choose for undo, clog, ... just have to be in the system reserved
> range to be safe from collisions.  I suppose one benefit would be the
> ability to join eg pg_buffer_cache against it to get a human readable
> name like "clog", but that'd be slightly odd because the DB OID field
> would refer to entries in pg_database or pg_storage_manager depending
> on the number range.

Good points. However, there are very few cases where our internal
representation using DB OIDs will be exposed, one such being
pg_buffercache. I am wondering whether updating the documentation there
would be sufficient, as pg_buffercache is an extension used by developers
and DBEs rather than by consumers. We can circle back to this after the
initial set of patches is out.

> >   4. For each component that will be transitioned over to the generic
> >      block storage, I will introduce a page header at the beginning of
> >      the block and re-work the associated offset calculations along with
> >      transitioning from SLRU to buffer cache framework.
> 
> +1
> 
> As mentioned over in the SLRU checksums thread[1], I think that also
> means that dirtied pages need to be registered with xlog so they get
> full page writes when appropriate to deal with torn pages.  I think
> SLRUs and undo will all be able to use REGBUF_WILL_INIT and
> RBM_ZERO_XXX flags almost all the time because they're append-mostly.
> You'll presumably generate one or two FPWs in each SLRU after each
> checkpoint; one in the currently active page where the running xids
> live, and occasionally an older page if you recently switched clog
> page or have some very long running transactions that eventually get
> stamped as committed.  In other words, there will be very few actual
> full page writes generated by this, but it's something we need to get
> right for correctness on some kinds of storage.  It might be possible
> to skip that if checksums are not enabled (based on the theory that
> torn pages can't hurt any current SLRU user due to their
> write-without-read access pattern, it's just the checksum failures
> that we need to worry about).

Yep, agreed on FPWs, and good point on potentially skipping them if
checksums are disabled. For most of the components we are always setting
values at the advancing offset, so I believe we should be okay here.

> >   5. Due to the on-disk format changes, simply copying the segments
> >      during upgrade wouldn't work anymore. Given the nature of data
> >      stored within SLRU segments today, we can extend pg_upgrade to
> >      translate the segment files by scanning from relfrozenxid and
> >      relminmxid and recording the corresponding values at the new
> >      offsets in the target segments.
> 
> +1
> 
> (Hmm, if we're going to change all this stuff, I wonder if there would
> be any benefit to switching to 64 bit xids for the xid-based SLRUs
> while we're here...)

Do you mean switching or reserving space for it on the block? The latter 
I hope :-)

> >   8. We may need to introduce new shared buffer access strategies to
> >      limit the components from thrashing buffer cache.
> 
> That's going to be an interesting area.  It will be good to get some
> real experience.  For undo, so far it usually seems to work out OK
> because we aggressively try to discard pages (that is, drop buffers
> and put them on the freelist) at the same rate we dirty them.  I
> speculate that for the SLRUs it might work out OK because, even though
> the "discard" horizon moves very infrequently, pages are dirtied at a
> relatively slow rate.  Let's see... you can fit just under 32k
> transactions into each clog page, so a 50K TPS nonstop workload would
> take about a day to trash 1GB of cache with clog.  That said, if it
> turns out to be a problem we have a range of different hammers to hit
> it with (and a number of hackers interested in that problem space).

Agreed,  my plan is to test it without special ring buffers and evaluate 
the performance. I just wanted to raise the issue in case we run into 
abnormal behavior.

> >   1. Generic block storage manager with changes to SMgr APIs and code to
> >      initialize the new storage manager based on DB ID in RelFileNode.
> >      This patch will also introduce the new catalog to show the OIDs
> >      which map to this new storage manager.
> 
> Personally I wouldn't worry too much about that catalog stuff in v0
> since it's just window dressing and doesn't actually help us get our
> hands on the core feature prototype to test...

Yep, agreed. Like I said above, we can circle back on this. The OID 
exposure can be settled on once the functionality has gained acceptance.

> > Would love to hear feedback and comments on the approach above.
> 
> I like it.  I'm looking forward to some prototype code.  Oh, I think I
> already said that a couple of times :-)

More than a couple of times :-) It’s in the works!

-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Proposal: SLRU to Buffer Cache

From: Andres Freund
Hi,

On 2018-08-21 09:53:21 -0400, Shawn Debnath wrote:
> > I was wondering what the point of exposing the OIDs to users in a
> > catalog would be though.  It's not necessary to do that to reserve
> > them (and even if it were, pg_database would be the place): the OIDs
> > we choose for undo, clog, ... just have to be in the system reserved
> > range to be safe from collisions.

Maybe I'm missing something, but how are conflicts prevented just by
being in the system range?  It's very common for multiple patches to try
to use the same OID, and that is only discovered by the 'duplicate_oids'
script. But if there's no catalog representation, I don't see how such
conflicts would be discovered.



> > I suppose one benefit would be the
> > ability to join eg pg_buffer_cache against it to get a human readable
> > name like "clog", but that'd be slightly odd because the DB OID field
> > would refer to entries in pg_database or pg_storage_manager depending
> > on the number range.

> Good points. However, there are very few cases where our internal 
> representation using DB OIDs will be exposed, one such being 
> pg_buffercache. Wondering if updating the documentation here would be 
> sufficient as pg_buffercache is an extension used by developers and DBEs 
> rather than by consumers. We can circle back to this after the initial 
> set of patches are out.

Showing the oids in pg_database or such seems like it'd make it a bit
harder to change later because people rely on things like joining
against it.  I don't think I like that.  I'm kinda inclined to something
somewhat crazy like instead having a reserved & shared pg_class entry or
such.  Don't like that that much either. Hm.


> > >   5. Due to the on-disk format changes, simply copying the segments
> > >      during upgrade wouldn't work anymore. Given the nature of data
> > >      stored within SLRU segments today, we can extend pg_upgrade to
> > >      translate the segment files by scanning from relfrozenxid and
> > >      relminmxid and recording the corresponding values at the new
> > >      offsets in the target segments.
> > 
> > +1
> > 
> > (Hmm, if we're going to change all this stuff, I wonder if there would
> > be any benefit to switching to 64 bit xids for the xid-based SLRUs
> > while we're here...)
> 
> Do you mean switching or reserving space for it on the block? The latter 
> I hope :-)

I'd make the addressing work in a way that never requires wraparound,
but instead allows trimming at the beginning. That shouldn't require any
additional space, while allowing a full switch to 64-bit xids.
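
Something like (hypothetical, just to illustrate the shape of it):

    #include "postgres.h"

    /*
     * With 64-bit xids the page number grows monotonically, so there is
     * no wraparound arithmetic; reclaiming space is just unlinking the
     * segments wholly below the oldest xid anyone can still look up.
     */
    static inline int64
    xid_to_pageno(uint64 xid64, int xacts_per_page)
    {
        return (int64) (xid64 / xacts_per_page);
    }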

Greetings,

Andres Freund


Re: Proposal: SLRU to Buffer Cache

From: Shawn Debnath
On Tue, Aug 21, 2018 at 07:15:28AM -0700, Andres Freund wrote:

> On 2018-08-21 09:53:21 -0400, Shawn Debnath wrote:
> > > I was wondering what the point of exposing the OIDs to users in a
> > > catalog would be though.  It's not necessary to do that to reserve
> > > them (and even if it were, pg_database would be the place): the OIDs
> > > we choose for undo, clog, ... just have to be in the system reserved
> > > range to be safe from collisions.
> 
> Maybe I'm missing something, but how are conflicts prevented just by
> being in the system range?  There's very commonly multiple patches
> trying to use the same oid, and that is just discovered by the
> 'duplicate_oids' script. But if there's no catalog representation, I
> don't see how that'd discover them?

+1. That's the reason why I suggested introducing a new catalog to
reserve the OIDs and, at the same time, expose them to the customer. To
Thomas' point, we can worry about the catalog after the storage manager
patch is ready for review, as the values will need to be hard-coded
anyway to avoid bootstrap issues. More on the catalog below.

> > > I suppose one benefit would be the
> > > ability to join eg pg_buffer_cache against it to get a human readable
> > > name like "clog", but that'd be slightly odd because the DB OID field
> > > would refer to entries in pg_database or pg_storage_manager depending
> > > on the number range.
> 
> > Good points. However, there are very few cases where our internal 
> > representation using DB OIDs will be exposed, one such being 
> > pg_buffercache. Wondering if updating the documentation here would be 
> > sufficient as pg_buffercache is an extension used by developers and DBEs 
> > rather than by consumers. We can circle back to this after the initial 
> > set of patches are out.
> 
> Showing the oids in pg_database or such seems like it'd make it a bit
> harder to change later because people rely on things like joining
> against it.  I don't think I like that.  I'm kinda inclined to something
> somewhat crazy like instead having a reserved & shared pg_class entry or
> such.  Don't like that that much either. Hm.

For some scenarios, like SLRU, we could follow the scheme that is used
by pg_database and have it be shared across all databases. Entries could
be inserted into pg_class, and we could introduce a new ‘reserved’ or
‘system’ relkind for them. The files would reside in the global
tablespace.

Unfortunately, it wouldn’t be a good fit for undo logs, as that work uses
the relation OID for the dynamic set of undo logs under one specific DB
OID. Because of this, I was taking the DB OID approach, but instead of
adding it to pg_database, which is inflexible with its member types, I
suggested that we introduce a new catalog to track the OIDs. Is there any
reason why we shouldn’t or can’t introduce a new catalog to track these?

If a new catalog is not possible, I would prefer the pg_class approach,
given that we can define a new relkind to track these special relations
and have the rest of the fields set to an invalid state if needed (which
is supported today). This would, however, require significant rework of
the undo log implementation.

> > > >   5. Due to the on-disk format changes, simply copying the segments
> > > >      during upgrade wouldn't work anymore. Given the nature of data
> > > >      stored within SLRU segments today, we can extend pg_upgrade to
> > > >      translate the segment files by scanning from relfrozenxid and
> > > >      relminmxid and recording the corresponding values at the new
> > > >      offsets in the target segments.
> > > 
> > > +1
> > > 
> > > (Hmm, if we're going to change all this stuff, I wonder if there would
> > > be any benefit to switching to 64 bit xids for the xid-based SLRUs
> > > while we're here...)
> > 
> > Do you mean switching or reserving space for it on the block? The latter 
> > I hope :-)
> 
> I'd make the addressing work in a way that never requires wraparounds,
> but instead allows trimming at the beginning. That shouldn't result in
> any additional space, while allowing to fully switch to 64bit xids.

+1, will keep this in mind during implementation.

-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Proposal: SLRU to Buffer Cache

From: Andrey Borodin
Hi!

> On 15 Aug 2018, at 2:35, Shawn Debnath <sdn@amazon.com> wrote:
>
> At the Unconference in Ottawa this year, I pitched the idea of moving
> components off of SLRU and on to the buffer cache. The motivation
> behind the idea was three fold:
>
>  * Improve performance by eliminating fixed sized caches, simplistic
>    scan and eviction algorithms.
>  * Ensuring durability and consistency by tracking LSNs and checksums
>    per block.
+1, I like this idea more than the current patch on the CF that adds checksums for SLRU pages.

>  1. Implement a generic block storage manager that parameterizes
>     several options like segment sizes, fork and segment naming and
>     path schemes, concepts entrenched in md.c that are strongly tied to
>     relations. To mitigate risk, I am planning on not modifying md.c
>     for the time being.
Probably I'm missing something, but why should this not be in access methods? You can extend an AM to control its segment
size and its ability to truncate unneeded pages. This may be useful, for example, in an LSM tree implementation or something
similar.

Best regards, Andrey Borodin.

Re: Proposal: SLRU to Buffer Cache

From: Andres Freund
Hi,

On 2018-08-22 13:35:47 +0500, Andrey Borodin wrote:
> > On 15 Aug 2018, at 2:35, Shawn Debnath <sdn@amazon.com> wrote:
> > 
> > At the Unconference in Ottawa this year, I pitched the idea of moving
> > components off of SLRU and on to the buffer cache. The motivation
> > behind the idea was three fold:
> > 
> >  * Improve performance by eliminating fixed sized caches, simplistic
> >    scan and eviction algorithms.
> >  * Ensuring durability and consistency by tracking LSNs and checksums
> >    per block.
> +1, I like this idea more than current patch on CF with checksums for SLRU pages.

Yea, I don't think it really makes sense to reimplement this logic for
SLRUs (and then UNDO) separately.


> >  1. Implement a generic block storage manager that parameterizes
> >     several options like segment sizes, fork and segment naming and
> >     path schemes, concepts entrenched in md.c that are strongly tied to
> >     relations. To mitigate risk, I am planning on not modifying md.c
> >     for the time being.
> Probably I'm missing something, but why this should not be in access
> methods?

I think it's not an absurd idea to put the reserved oid into pg_am
(under a separate amtype), although the fact that shared entries would
be in database-local tables is a bit weird. But I'm fairly certain that
we'd not put any actual data into it, not least because we need to be
able to access clog etc. from connections that cannot attach to a
database (say the startup process, which will never ever start reading
from a catalog table).  So I don't really see what you mean with:

> You can extend AM to control it's segment size and ability to
> truncate unneeded pages. This may to be useful, for example, in LSM
> tree implementation or something similar.

that doesn't really seem like it could work. Nor am I even clear what
the above points really have to do with the AM layer.

Greetings,

Andres Freund