Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Yura Sokolov
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id 42b7deffdb20bdad53417b615c08cbf4b5bf85bd.camel@postgrespro.ru
In response to Re: BufferAlloc: don't take two simultaneous locks  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Good day, hackers.

Let me address a few of the sentences from your message.

Sentence one:
=============

> With the existing system, there is a hard cap on the number of hash
> table entries that we can ever need: one per buffer, plus one per
> partition to cover the "extra" entries that are needed while changing
> buffer tags.

As I understand this claim: the current shared buffers implementation
shouldn't allocate hash entries after initialization.
(I experimented on master at 6c0f9f60f1.)

OK, then it should be safe to elog(FATAL) if SharedBufHash ever needs to
allocate a new entry?
https://pastebin.com/x8arkEdX
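
The idea of that change is roughly the following (a sketch, not the exact
diff from the pastebin; comparing hashp->tabname is just one possible way
to single out SharedBufHash):

  /* dynahash.c, in the shared-memory allocation path (element_alloc):
   * fail hard if SharedBufHash ever tries to grow after initialization */
  if (strcmp(hashp->tabname, "Shared Buffer Lookup Table") == 0)
      elog(FATAL, "extend SharedBufHash");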

(all tests were done on a database initialized with `pgbench -i -s 100`)

  $ pgbench -c 1 -T 10 -P 1 -S -M prepared postgres
  ....
  pgbench: error: client 0 script 0 aborted in command 1 query 0: FATAL:  extend SharedBufHash

oops...

How many entries are allocated after startup?
https://pastebin.com/c5z0d5mz
(shared_buffers = 128MB;
 40/80 ht cores on an EPYC 7702 (VM on a 64/128 ht core host))

  $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'

  $1 = 16512

  $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
  ...
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
  
  $1 = 20439
  
  $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
  ...
  $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
  
  $1 = 20541
  
It stabilizes at 20541.

To be honest, if we add HASH_FIXED_SIZE to the ShmemInitHash() call that
creates SharedBufHash, then it works, but with a noticeable performance
regression.

Moreover, I didn't see "out of shared memory" even when starting with 23
spare items instead of 128 (NUM_BUFFER_PARTITIONS).
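
For reference, those two experiments amount to roughly the following
changes (a sketch; the actual call sites are InitBufTable() in buf_table.c
and StrategyInitialize() in freelist.c, and 23 is just the spare count I
tested):

  /* InitBufTable(): make the table fixed-size so it never allocates
   * beyond its initial nelem */
  SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                size, size,
                                &info,
                                HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
                                HASH_FIXED_SIZE);

  /* StrategyInitialize(): fewer spare items than NUM_BUFFER_PARTITIONS */
  InitBufTable(NBuffers + 23);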


Sentence two:
=============

> With the patch, the number of concurrent buffer tag
> changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
> release the lock on the old buffer partition before acquiring the lock
> on the new partition, and therefore there can be any number of
> backends trying to change buffer tags at the same time.

Let's check.
I took the v9 branch:
- no "thundering herd" prevention yet
- "get_hash_entry" is not modified
- SharedBufHash is HASH_FIXED_SIZE      (!!!)
- no spare items at all, just NBuffers. (!!!)

https://www.postgresql.org/message-id/6e6cfb8eea5ccac8e4bc2249fe0614d9f97055ee.camel%40postgrespro.ru

I noticed some "out of shared memory" errors under a high connection count
(> 350) with this version. But I claimed it is because of race
conditions in "get_hash_entry": concurrent backends may take free
entries from one freelist and put them back to another (see the simplified
loop after the example below).
Example:
- backend A checks freeList[30] - it is empty
- backend B takes an entry from freeList[31]
- backend C puts an entry into freeList[30]
- backend A checks freeList[31] - it is empty
- backend A fails with "out of shared memory"
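
For context, the borrowing scan in get_hash_entry() looks roughly like this
(simplified from dynahash.c, declarations omitted): when its own freelist is
empty and the table cannot grow, each other freelist is probed exactly once,
so entries migrating between freelists during the scan can make every probe
see an empty list even though the table never truly runs out of entries.

  newElement = NULL;
  borrow_from_idx = freelist_idx;
  for (;;)
  {
      borrow_from_idx = (borrow_from_idx + 1) % NUM_FREELISTS;
      if (borrow_from_idx == freelist_idx)
          break;              /* every freelist probed once -> give up */

      SpinLockAcquire(&hctl->freeList[borrow_from_idx].mutex);
      newElement = hctl->freeList[borrow_from_idx].freeList;
      if (newElement != NULL)
      {
          hctl->freeList[borrow_from_idx].freeList = newElement->link;
          SpinLockRelease(&hctl->freeList[borrow_from_idx].mutex);
          break;              /* got one */
      }
      SpinLockRelease(&hctl->freeList[borrow_from_idx].mutex);
  }
  /* if newElement is still NULL here, the caller ends up with
   * "out of shared memory" */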

Let's check my claim: set NUM_FREELISTS to 1, so that there is no
possible race condition in "get_hash_entry".
....
Not a single "out of shared memory" error with 800 clients for 30 seconds.

(Well, in fact, on this single-socket 80 ht-core EPYC I didn't get
"out of shared memory" even with NUM_FREELISTS = 32. I saw them
on a 2-socket 56 ht-core Xeon Gold.)

At the same time, the master branch needs at least 15 spare items
with NUM_FREELISTS = 1 to run without "out of shared memory" with
800 clients for 30 seconds.

Therefore the suggested approach reduces the real need for hash entries
(when there are no races in "get_hash_entry").

If one looks into the code, they will see there is no need for spare items
in the suggested code:
- when a backend calls BufTableInsert it already has a victim buffer.
  The victim buffer either:
  - was uninitialized
  -- therefore it wasn't in the hash table
  --- therefore there is a free entry for it in the freeList
  - was just cleaned
  -- then its free entry is stored in DynaHashReuse
  --- so there is no need for a free entry from the freeList.

And, not surprisingly, there is no huge regression from setting
NUM_FREELISTS to 1, because we usually reuse the entry saved in
DynaHashReuse and rarely touch the freelist at all.


Sentence three:
===============

(not an exact citation)
- It is not atomic anymore, and therefore it is fragile.

Well, going from "theoretical concerns" to practice, here is the new part
of the control flow (a code-shaped sketch follows the list):
- we clear the buffer (but keep it pinned)
- delete the buffer from the hash table if it was there, and store the entry for reuse
- release the old partition lock
- acquire the new partition lock
- try to insert into the new partition
- on conflict
-- return the hash entry to some freelist
-- pin the found buffer
-- unpin the victim buffer
-- return the victim to the buffer free list
- without conflict
-- reuse the saved entry, if there was one
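
In code-shaped form it is roughly this (just a sketch of the list above,
not the actual v9 diff; variable names are illustrative and the
DynaHashReuse bookkeeping is omitted):

  /* the victim buffer is already pinned and its tag is cleared */
  if (oldFlags & BM_TAG_VALID)
      BufTableDelete(&oldTag, oldHash);   /* entry is saved for reuse */
  LWLockRelease(oldPartitionLock);

  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
  existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
  if (existing_id >= 0)
  {
      /* somebody else already mapped the page: give the saved entry back
       * to a freelist, pin the found buffer, unpin the victim and put it
       * back on the buffer free list */
  }
  else
  {
      /* no conflict: the insert reused the saved entry (if there was one) */
  }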

To get into trouble, one of these actions has to fail without taking down
the whole cluster, i.e. it has to elog(ERROR) or elog(FATAL).
In any other case the whole cluster stops anyway.

Could BufTableDelete elog(ERROR|FATAL)?
No.
(There is one elog(ERROR), but with the comment "shouldn't happen";
it really could be changed to PANIC.)

Could LWLockRelease elog(ERROR|FATAL)?
No.
(elog(ERROR, "lock is not held") cannot be triggered since we
certainly hold the lock.)

Could LWLockAcquire elog(ERROR|FATAL)?
Well, there is `elog(ERROR, "too many LWLocks taken");`.
It is not possible because we just did LWLockRelease.
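
For reference, that check in LWLockAcquire() is essentially this (lwlock.c),
and num_held_lwlocks was just decremented by the preceding LWLockRelease:

  if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
      elog(ERROR, "too many LWLocks taken");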

Could BufTableInsert elog(ERROR|FATAL)?
There is "out of shared memory", which is avoidable with get_hash_entry
modifications or with HASH_FIXED_SIZE + some spare items.

Could CHECK_FOR_INTERRUPTS raise something?
No: there is a single line between LWLockRelease and LWLockAcquire, and
it doesn't contain CHECK_FOR_INTERRUPTS.

Therefore there is a single problematic case, "out of shared memory", and
it is fixable (by HASH_FIXED_SIZE or by improvements to "get_hash_entry").


Maybe I'm not quite right on some points. I'd be glad to learn.

---------

regards

Yura Sokolov



