Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id bcc57fce3a7623fb42f6baaf53b1d3b1df532542.camel@postgrespro.ru
Whole thread Raw
In response to Re: BufferAlloc: don't take two simultaneous locks  (Simon Riggs <simon.riggs@enterprisedb.com>)
Responses Re: BufferAlloc: don't take two simultaneous locks
List pgsql-hackers
Hello, Simon.

В Пт, 25/02/2022 в 04:35 +0000, Simon Riggs пишет:
> On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> > Good day, Kyotaro Horiguchi and hackers.
> > 
> > В Чт, 17/02/2022 в 14:16 +0900, Kyotaro Horiguchi пишет:
> > > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> > > > Hello, all.
> > > > 
> > > > I thought about patch simplification, and tested version
> > > > without BufTable and dynahash api change at all.
> > > > 
> > > > It performs suprisingly well. It is just a bit worse
> > > > than v1 since there is more contention around dynahash's
> > > > freelist, but most of improvement remains.
> > > > 
> > > > I'll finish benchmarking and will attach graphs with
> > > > next message. Patch is attached here.
> > > 
> > > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > > some comments needs to be revised.
> > 
> > Thank you for review and remarks.
> 
> v3 gets the buffer partition locking right, well done, great results!
> 
> In v3, the comment at line 1279 still implies we take both locks
> together, which is not now the case.
> 
> Dynahash actions are still possible. You now have the BufTableDelete
> before the BufTableInsert, which opens up the possibility I discussed
> here:
> http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
> (Apologies for raising a similar topic, I hadn't noticed this thread
> before; thanks to Horiguchi-san for pointing this out).
> 
> v1 had a horrible API (sorry!) where you returned the entry and then
> explicitly re-used it. I think we *should* make changes to dynahash,
> but not with the API you proposed.
> 
> Proposal for new BufTable API
> BufTableReuse() - similar to BufTableDelete() but does NOT put entry
> back on freelist, we remember it in a private single item cache in
> dynahash
> BufTableAssign() - similar to BufTableInsert() but can only be
> executed directly after BufTableReuse(), fails with ERROR otherwise.
> Takes the entry from single item cache and re-assigns it to new tag
> 
> In dynahash we have two new modes that match the above
> HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
> places entry on the single item cache, avoiding freelist
> HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
> uses the entry from the single item cache, rather than asking freelist
> This last call can fail if someone else already inserted the tag, in
> which case it adds the single item cache entry back onto freelist
> 
> Notice that single item cache is not in shared memory, so on abort we
> should give it back, so we probably need an extra API call for that
> also to avoid leaking an entry.

Why there is need for this? Which way backend could be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
something.

> 
> Doing it this way allows us to
> * avoid touching freelists altogether in the common path - we know we
> are about to reassign the entry, so we do remember it - no contention
> from other backends, no borrowing etc..
> * avoid sharing the private details outside of the dynahash module
> * allows us to use the same technique elsewhere that we have
> partitioned hash tables
> 
> This approach is cleaner than v1, but should also perform better
> because there will be a 1:1 relationship between a buffer and its
> dynahash entry, most of the time.

Thank you for suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it to v4 patch? Perhaps, it could be a separate
commit in new version.


> 
> With these changes, I think we will be able to *reduce* the number of
> freelists for partitioned dynahash from 32 to maybe 8, as originally
> speculated by Robert in 2016:
>    https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
> since the freelists will be much less contended with the above approach
> 
> It would be useful to see performance with a higher number of connections, >400.
> 
> --
> Simon Riggs                http://www.EnterpriseDB.com/

------

regards,
Yura Sokolov




pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: BufferAlloc: don't take two simultaneous locks
Next
From: "Drouvot, Bertrand"
Date:
Subject: Re: Minimal logical decoding on standbys