Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id 3222ea2f83d59fdb1c765cc8fc22271163e0a3dc.camel@postgrespro.ru
Whole thread Raw
In response to Re: BufferAlloc: don't take two simultaneous locks  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Re: BufferAlloc: don't take two simultaneous locks  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
В Чт, 21/04/2022 в 16:24 -0400, Robert Haas пишет:
> On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> >   $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 16512
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20439
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20541
> > 
> > It stabilizes at 20541
> 
> Hmm. So is the existing comment incorrect?

It is correct and incorrect at the same time. Logically it is correct.
And it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which is not currently). But setting HASH_FIXED_SIZE hurts performance
with low number of spare items.

> Remember, I was complaining
> about this change:
> 
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
> @@ -481,10 +481,10 @@ StrategyInitialize(bool init)
>   *
>   * Since we can't tolerate running out of lookup table entries, we must be
>   * sure to specify an adequate table size here.  The maximum steady-state
> - * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> - * a new entry before deleting the old.  In principle this could be
> - * happening in each partition concurrently, so we could need as many as
> - * NBuffers + NUM_BUFFER_PARTITIONS entries.
> + * usage is of course NBuffers entries. But due to concurrent
> + * access to numerous free lists in dynahash we can miss free entry that
> + * moved between free lists. So it is better to have some spare free entries
> + * to reduce probability of entry allocations after server start.
>   */
>   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> Pre-patch, the comment claims that the maximum number of buffer
> entries that can be simultaneously used is limited to NBuffers +
> NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
> size. The idea is that we normally need more than 1 entry per buffer,
> but sometimes we might have 2 entries for the same buffer if we're in
> the process of changing the buffer tag, because we make the new entry
> before removing the old one. To change the buffer tag, we need the
> buffer mapping lock for the old partition and the new one, but if both
> are the same, we need only one buffer mapping lock. That means that in
> the worst case, you could have a number of processes equal to
> NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
> between values that both fall into the same partition, and thus each
> using 2 entries. Then you could have every other buffer in use and
> thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
> entries. Now I think you're saying we go far beyond that number, and
> what I wonder is how that's possible. If the system doesn't work the
> way the comment says it does, maybe we ought to start by talking about
> what to do about that.

At the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too fast (just if it doesn't
  found free entry in current freelist partition).
- get_hash_entry has races.
- if there are small number of spare items (and NUM_BUFFER_PARTITIONS is
  small number) and HASH_FIXED_SIZE is set, it becomes contended and
  therefore slow.

HASH_REUSE solves (for shared buffers) most of this issues. Free list
became rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
to performance hit. And with fair number of spare items, get_hash_entry
will find free entry despite its races.

> I am a bit confused by your description of having done "p
> SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
> HTAB and HTAB's hctl member is of type HASHHDR, which has no field
> called "allocated".

Previous letter contains links to small patches that I used for
experiments. Link that adds "allocated" is https://pastebin.com/c5z0d5mz

>  I thought maybe my analysis here was somehow
> mistaken, so I tried the debugger, which took the same view of it that
> I did:
> 
> (lldb) p SharedBufHash->hctl->allocated.value
> error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
> SharedBufHash->hctl->allocated.value
> ~~~~~~~~~~~~~~~~~~~  ^


-----

regards

Yura Sokolov




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Assert failure in CTE inlining with view and correlated subquery
Next
From: Peter Geoghegan
Date:
Subject: Re: More problems with VacuumPageHit style global variables