Re: Changing shared_buffers without restart - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Changing shared_buffers without restart |
Date | |
Msg-id | CAExHW5soDB=08jeyqn-W-=+Y86v-a5f_NJRjAdV9_yFrKpJcCw@mail.gmail.com |
In response to | Re: Changing shared_buffers without restart (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
Responses |
Re: Changing shared_buffers without restart
|
List | pgsql-hackers |
On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> I think we should add a way to monitor the progress of resizing; at
> least whether resizing is complete and whether the new GUC value is in
> effect.
>

I further tested this approach by tracing the barrier synchronization using the attached patch, which adds a bunch of elog() calls. I ran a pgbench load and simultaneously executed the following commands on a psql connection:

#alter system set shared_buffers to '200MB';
ALTER SYSTEM
#select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

#show shared_buffers;
 shared_buffers
----------------
 200MB
(1 row)

#select count(*) from pg_stat_activity;
 count
-------
     6
(1 row)

#select pg_backend_pid(); - the backend where all these commands were executed
 pg_backend_pid
----------------
         878405
(1 row)

I see the following in the PostgreSQL error log.

2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading configuration files
2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
-- not all backends have reloaded configuration.
2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation finished at phase 4
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;

To tell the story in short: pid 173 (for the sake of brevity I am just mentioning the last three digits of the PID) attached to the barrier first and immediately reached phase 1. 171, 172, 340, 338, 339, 341 and 337 all attached to the barrier in phase 1. All of these backends completed the phases in synchronous fashion. But 168, 169 and 405 were yet to attach to the barrier since they hadn't loaded their configurations yet. Each of these backends then finished all phases independently of the others. For your reference:

#select pid, application_name, backend_type from pg_stat_activity where pid in (878169, 878168);
  pid   | application_name |   backend_type
--------+------------------+-------------------
 878168 |                  | checkpointer
 878169 |                  | background writer
(2 rows)

This is because BarrierArriveAndWait() only waits for the backends that are already attached. It doesn't wait for backends which are yet to attach. I think what we want is for *all* the backends to execute all the phases synchronously and wait for the others to finish. If we don't do that, there's a possibility that some of them would see inconsistent buffer states or, even worse, may not have the necessary memory mapped and resized, thus causing segfaults. Am I correct?
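
To make the attach-time behaviour described above concrete, here is a toy, single-threaded Python model of a phase barrier with dynamic membership. It is only a sketch under stated assumptions: the class `DynamicBarrier`, its methods and the `replay()` helper are hypothetical names, it does not reproduce PostgreSQL's Barrier implementation or the patch, and the reset to phase 0 once nobody is attached is an assumption made so that the replay resembles the log.

```python
class DynamicBarrier:
    """Toy single-threaded model of a phase barrier with dynamic membership.

    Hypothetical illustration only; this is not PostgreSQL's Barrier code.
    """

    def __init__(self):
        self.phase = 0
        self.attached = set()   # pids currently attached
        self.arrived = set()    # attached pids waiting for the phase to end

    def attach(self, pid):
        """Join the barrier and return the phase observed at attach time."""
        self.attached.add(pid)
        return self.phase

    def arrive(self, pid):
        """Arrive at the barrier; return True if this arrival ended the phase."""
        self.arrived.add(pid)
        return self._maybe_advance()

    def detach(self, pid):
        """Leave the barrier; remaining waiters may be released as a result."""
        self.attached.discard(pid)
        self.arrived.discard(pid)
        if not self.attached:
            self.phase = 0      # assumption: start over once nobody is attached
            return False
        return self._maybe_advance()

    def _maybe_advance(self):
        # The phase ends as soon as every *attached* pid has arrived.
        # Pids that have not attached yet are simply not counted.
        if self.attached and self.arrived == self.attached:
            self.phase += 1
            self.arrived.clear()
            return True
        return False


def replay():
    """Replay a shortened version of the attach order seen in the log."""
    b = DynamicBarrier()
    events = []

    # 173 attaches first and is the only participant, so it passes phase 0 alone.
    events.append(("173 attached at", b.attach(173)))
    b.arrive(173)
    events.append(("173 reached", b.phase))

    # 171 and 172 attach during phase 1; now all three must arrive together.
    events.append(("171 attached at", b.attach(171)))
    events.append(("172 attached at", b.attach(172)))
    for _ in range(2):
        for pid in (171, 172, 173):
            b.arrive(pid)
        events.append(("group reached", b.phase))
    for pid in (171, 172, 173):
        b.detach(pid)

    # 168 attaches after everyone else has left and runs every phase alone.
    events.append(("168 attached at", b.attach(168)))
    for _ in range(3):
        b.arrive(168)
    events.append(("168 reached", b.phase))
    b.detach(168)
    return events


if __name__ == "__main__":
    for label, phase in replay():
        print(label, phase)
```

In this model nothing ever makes an early participant wait for a pid that has not attached, which is the gap discussed above.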

I think what needs to be done is that every backend should wait for the other backends to attach themselves to the barrier before moving to the first phase. One way I can think of is to use two signal barriers: one to ensure that all the backends have attached themselves, and a second for the actual resizing. But then the postmaster needs to wait for all the processes to process the first signal barrier, and the postmaster cannot wait on anything. Maybe there's a way to poll, but I didn't find it. Does that mean that we have to make some other backend a coordinator?

--
Best Wishes,
Ashutosh Bapat
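
The idea of waiting for every backend to attach before the first phase can be sketched with a small toy model in Python. This is an assumption-heavy illustration, not a proposal for the patch: `GatedBarrier`, `expected` and `demo()` are hypothetical names, and the model simply assumes the number of expected participants is known up front, which is exactly the part left open above (who counts the backends and who waits for them).

```python
class GatedBarrier:
    """Toy model: no phase may complete until `expected` pids have attached.

    Hypothetical sketch; `expected` is assumed to be known in advance.
    """

    def __init__(self, expected):
        self.expected = expected
        self.phase = 0
        self.attached = set()   # pids that have attached so far
        self.arrived = set()    # attached pids waiting for the phase to end

    def attach(self, pid):
        """Join the barrier and return the phase observed at attach time."""
        self.attached.add(pid)
        return self.phase

    def arrive(self, pid):
        """Return True only if this arrival completed the current phase."""
        self.arrived.add(pid)
        everyone_attached = len(self.attached) >= self.expected
        if everyone_attached and self.arrived == self.attached:
            self.phase += 1
            self.arrived.clear()
            return True
        return False


def demo():
    b = GatedBarrier(expected=3)
    b.attach(173)
    early = b.arrive(173)        # 173 is alone, so it must keep waiting
    phase_while_waiting = b.phase
    b.attach(171)
    b.arrive(171)
    b.attach(168)                # the slow backend finally attaches
    released = b.arrive(168)     # the last arrival releases all three together
    return early, phase_while_waiting, released, b.phase


if __name__ == "__main__":
    print(demo())
```

Unlike a barrier that counts only attached participants, the early backend here cannot move past phase 0 until the slow one has both attached and arrived.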