Re: Changing shared_buffers without restart - Mailing list pgsql-hackers
From: Ashutosh Bapat
Subject: Re: Changing shared_buffers without restart
Msg-id: CAExHW5sOu8+9h6t7jsA5jVcQ--N-LCtjkPnCw+rpoN0ovT6PHg@mail.gmail.com
In response to: Re: Changing shared_buffers without restart (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

I started studying the interaction of the checkpointer process with buffer pool resizing. I soon noticed that the checkpointer does not reload the config as frequently as other backends; while executing a checkpoint, for example, it does not reload the config for the entire duration of the checkpoint. With the synchronization as implemented in the patch set so far, the checkpointer therefore does not see the new value of shared_buffers and does not acknowledge the proc signal barrier, and thus never enters the synchronized buffer resizing. Other backends, however, notice that the checkpointer has received the proc signal barrier and enter the synchronization process. Once the barrier has been received by all backends, the backends which entered the synchronization process move forward with resizing the buffer pool, leaving behind those which received but did not acknowledge the barrier. The result is two sets of backends: one which entered synchronization and sees the buffer pool with the new size, and another which did not and still sees the old size. This leads to SIGBUS or SIGSEGV (signal 11) in the latter set. I saw this mostly with the checkpointer process, but we also saw it with other types of backends. Every aspect of buffer resizing that I started looking at was blocked by this behaviour.

Since there were already other suggestions and comments about the current UI as well as the synchronization mechanism, I started implementing a different UI and synchronization, described below. The WIP implementation is available in the attached set of patches. Patches 0001 to 0016 are the same as in the previous patchset; I haven't touched them in case someone would like to see an incremental change. However, it's getting unwieldy at this point, so I will squash relevant patches together and provide a patchset with fewer patches next.
0017 reverts 0003 and gets rid of the "pending" GUC flag, which is not required by the new UI. These will vanish from the next patchset. 0018 implements the new UI described below.

New UI and synchronization
======================

0018 changes the way "shared_buffers" is handled.

a. A new global variable NBuffersPending holds the value of this GUC. When the server starts, the shared memory required by the buffer manager is calculated using NBuffersPending instead of NBuffers. Once the shared memory is allocated, NBuffers is set to NBuffersPending. NBuffers thus shows the number of buffers in the buffer pool rather than the value of the GUC.

b. "shared_buffers" is PGC_SIGHUP now, so it can be changed using ALTER SYSTEM ... SET shared_buffers = ...; followed by SELECT pg_reload_conf(). But this does not resize the buffer pool; it merely sets NBuffersPending to the new value. A new function pg_resize_buffer_pool() (described later) can be used to resize the buffer pool to the pending value.

c. SHOW shared_buffers shows the value of NBuffers, and NBuffersPending if it differs from NBuffers. I think we need some adjustment here when resizing is in progress, since the value of NBuffers would be changed to the size of the active buffer pool (explained later in the email), but I haven't worked out those details yet.

A new GUC max_shared_buffers sets the upper limit on "shared_buffers". It is PGC_POSTMASTER, i.e. it requires a restart to change the value. This GUC is used a. to reserve the address space for future expansion of the buffer pool and b. to allocate memory for a maximally sized buffer lookup table at server start. We may decide to use the GUC to maximally allocate data structures other than buffer blocks, as suggested by Andres, but these patches don't do that. The default for this GUC is 0, which means it will be the same as shared_buffers.
This maintains backward compatibility and also allows systems which do not want to resize the shared buffer pool to allocate minimum memory. When it is set to a value other than 0, it should be set higher than shared_buffers at the start.

We need to support ALTER SYSTEM ... SET shared_buffers = ... for backward compatibility. Users will still be able to perform ALTER SYSTEM and restart the server with a new size of the buffer pool. This also allows the new buffer pool size to be written to postgresql.auto.conf and persisted. With this we can simply use pg_reload_conf() to load the new value along with other GUC changes. pg_resize_buffer_pool() merely picks up the new value from the backend where it is executed and resizes the buffer pool; it does not need the new value to be loaded in all the backends. We may want to use a new PGC_ context for this GUC, but PGC_SIGHUP suffices for the time being and might be acceptable with clear documentation.

pg_resize_buffer_pool() implements a phase-wise buffer pool resizing operation, but it does not block all the backends till the buffer pool resizing is finished. It works as follows (pasting from the prologue in patch 0018). When resizing, the buffer pool is divided into two portions:

- the active buffer pool, the part of the buffer pool which remains active even during resizing. Its size is given by activeNBuffers. Newly allocated buffers will have their buffer ids less than activeNBuffers.

- the in-transit buffer pool, the part of the buffer pool which may be accessible to some backends but not others, depending upon the time when a given backend processes a shrink/expand barrier. When shrinking the buffer pool, this is the part which will be evicted; when expanding, this is the expanded portion. Its size is given by transitNBuffers. Backends may see buffer ids up to transitNBuffers till the resizing finishes.
Before resizing starts, activeNBuffers = transitNBuffers = NBuffers, where NBuffers is the size of the buffer pool before resizing; NewNBuffers is the new size of the shared buffer pool. After resizing finishes, activeNBuffers = transitNBuffers = NBuffers = NewNBuffers. In order to synchronize with other running backends, the coordinator sends the following ProcSignalBarriers in the order given below:

1. When shrinking the shared buffer pool, the coordinator sends the SHBUF_SHRINK ProcSignalBarrier. Every backend sets activeNBuffers = NewNBuffers to restrict its buffer pool allocations to the new size of the buffer pool and acknowledges the ProcSignalBarrier. Once every backend has acknowledged, the coordinator evicts the buffers in the area being shrunk. Note that transitNBuffers is still NBuffers, so backends may see buffer ids up to NBuffers from earlier allocations till eviction completes.

2. In both cases, expanding or shrinking the buffer pool, the coordinator sends the SHBUF_RESIZE_MAP_AND_MEM ProcSignalBarrier after resizing the shared memory segments and initializing the required data structures, if any. Every backend is expected to adjust its shared memory segment address maps (by calling AnonymousShmemResize()) and validate that its pointers to the shared buffers structure are valid and have the right size. When shrinking, transitNBuffers is set to NewNBuffers and backends should no longer see buffer ids beyond NewNBuffers; the buffer resizing operation is finished at this stage. When expanding, backends set transitNBuffers to NewNBuffers to accommodate backends which may accept the next barrier earlier than others. Once every backend acknowledges this barrier, the coordinator sends the next barrier when expanding the buffer pool.

3. When expanding the buffer pool, the coordinator sends the SHBUF_EXPAND ProcSignalBarrier.
The backends are expected to set activeNBuffers = NewNBuffers and start allocating buffers from the expanded range. The coordinator uses this barrier to know when all the backends have settled using the new size of the buffer pool.

For either operation, at most two barriers are sent. All this together in action looks like (see tests in the patch for more examples):

SHOW shared_buffers; -- default
 shared_buffers
----------------
 128MB
(1 row)

ALTER SYSTEM SET shared_buffers = '64MB';
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

SHOW shared_buffers;
    shared_buffers
-----------------------
 128MB (pending: 64MB)
(1 row)

SELECT pg_resize_shared_buffers();
 pg_resize_shared_buffers
--------------------------
 t
(1 row)

SHOW shared_buffers;
 shared_buffers
----------------
 64MB
(1 row)

ALTER SYSTEM SET shared_buffers = '256MB';
SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

SHOW shared_buffers;
    shared_buffers
-----------------------
 64MB (pending: 256MB)
(1 row)

SELECT pg_resize_shared_buffers();
 pg_resize_shared_buffers
--------------------------
 t
(1 row)

SHOW shared_buffers;
 shared_buffers
----------------
 256MB
(1 row)

On Thu, Sep 18, 2025 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
>
> > From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6@gmail.com>
> > Date: Sun, 6 Apr 2025 16:40:32 +0200
> > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
> >
> > Currently an assing hook can perform some preprocessing of a new value,
> > but it cannot change the behavior, which dictates that the new value
> > will be applied immediately after the hook. Certain GUC options (like
> > shared_buffers, coming in subsequent patches) may need coordinating work
> > between backends to change, meaning we cannot apply it right away.
> >
> > Add a new flag "pending" for an assign hook to allow the hook indicate
> > exactly that.
> > If the pending flag is set after the hook, the new value
> > will not be applied and it's handling becomes the hook's implementation
> > responsibility.
>
> I doubt it makes sense to add this to the GUC system. I think it'd be better
> to just use the GUC value as the desired "target" configuration and have a
> function or a show-only GUC for reporting the current size.

This has been taken care of in the new implementation, with a slightly different approach to the SHOW command as described above.

> I don't think you can't just block application of the GUC until the resize is
> complete. E.g. what if the value was too big and the new configuration needs
> to fixed to be lower?

With the above approach, the application of the GUC won't be blocked, but if the resize being applied is taking too long, the operation will need to be cancelled before a new resize can happen. That's a part that needs some work. Chasing a moving target requires a very complex implementation, which would be good to avoid, in the first version at least. However, we should leave room for that future enhancement; the current implementation gives that flexibility, I think.

> > From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6@gmail.com>
> > Date: Fri, 4 Apr 2025 21:46:14 +0200
> > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
> >
> > Currently WaitForProcSignalBarrier allows to make sure the message sent
> > via EmitProcSignalBarrier was processed by all ProcSignal mechanism
> > participants.
> >
> > Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration,
> > which will be updated when a process has received the message, but not
> > processed it yet. This makes it possible to support a new mode of
> > waiting, when ProcSignal participants want to synchronize message
> > processing.
> > To do that, a participant can wait via
> > WaitForProcSignalBarrierReceived when processing a message, effectively
> > making sure that all processes are going to start processing
> > ProcSignalBarrier simultaneously.
>
> I doubt "online resizing" that requires synchronously processing the same
> event, can really be called "online". There can be significant delays in
> processing a barrier, stalling the entire server until that is reached seems
> like a complete no-go for production systems?
>
> > From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6@gmail.com>
> > Date: Tue, 17 Jun 2025 14:16:55 +0200
> > Subject: [PATCH 11/16] Allow to resize shared memory without restart
> >
> > Add assing hook for shared_buffers to resize shared memory using space,
> > introduced in the previous commits without requiring PostgreSQL restart.
> > Essentially the implementation is based on two mechanisms: a
> > ProcSignalBarrier is used to make sure all processes are starting the
> > resize procedure simultaneously, and a global Barrier is used to
> > coordinate after that and make sure all finished processes are waiting
> > for others that are in progress.
> >
> > The resize process looks like this:
> >
> > * The GUC assign hook sets a flag to let the Postmaster know that resize
> > was requested.
> >
> > * Postmaster verifies the flag in the event loop, and starts the resize
> > by emitting a ProcSignal barrier.
> >
> > * All processes, that participate in ProcSignal mechanism, begin to
> > process ProcSignal barrier. First a process waits until all processes
> > have confirmed they received the message and can start simultaneously.
>
> As mentioned above, this basically makes the entire feature not really
> online.
> Besides the latency of some processes not getting to the barrier
> immediately, there's also the issue that actually reserving large amounts of
> memory can take a long time - during which all processes would be unavailable.
>
> I really don't see that being viable. It'd be one thing if that were a
> "temporary" restriction, but the whole design seems to be fairly centered
> around that.

In the new implementation regular backends are not stalled while the resizing is going on. They continue their work, with possible temporary performance degradation (this needs to be measured).

> > From experiment it turns out that shared mappings have to be extended
> > separately for each process that uses them. Another rough edge is that a
> > backend blocked on ReadCommand will not apply shared_buffers change
> > until it receives something.
>
> That's not a rough edge, that basically makes the feature unusable, no?

The new synchronization doesn't have this problem, since it doesn't require every backend to load the new value. It is enough for the value to be loaded only in the backend where pg_resize_buffer_pool() is being run.

> > From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
> > From: Dmitrii Dolgov <9erthalion6@gmail.com>
> > Date: Tue, 17 Jun 2025 11:22:02 +0200
> > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
> >
> > Add more shmem segments to split shared buffers into following chunks:
> > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
>
> Why do all these need to be separate segments? Afaict we'll have to maximally
> size everything other than BUFFERS_SHMEM_SEGMENT at start?

I am leaning towards that. I will implement that soon.
On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> I see you folks are inclined to keep some small segments static and
> allocate maximum allowed memory for it. It's an option, at the end of
> the day we need to experiment and measure both approaches.

I did measure performance with a maximally sized buffer lookup table (shared_buffers = 128MB, max_shared_buffers = 10GB) on my laptop. There was no noticeable difference in performance. I will post formal numbers with the next patchset.

> > * Every process recalculates shared memory size based on the new
> > NBuffers, adjusts its size using ftruncate and adjust reservation
> > permissions with mprotect. One elected process signals the postmaster
> > to do the same.
>
> If we just used a single memory mapping with all unused parts marked
> MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other
> work in this patchset)..

On Sat, Sep 27, 2025 at 12:06 AM Andres Freund <andres@anarazel.de> wrote:
>
> > How do we return memory to the OS in that case? Currently it's done
> > explicitly via truncating the anonymous file.
>
> madvise with MADV_DONTNEED or MADV_REMOVE.

The patchset still uses ftruncate + mprotect. Apart from portability concerns, I have questions about your proposal. The MADV_DONTNEED documentation says:

    After a successful MADV_DONTNEED operation, the semantics of memory
    access in the specified region are changed: subsequent accesses of
    pages in the range will succeed, but will result in either
    repopulating the memory contents from the up-to-date contents of the
    underlying mapped file (for shared file mappings, shared anonymous
    mappings, and shmem-based techniques such as System V shared memory
    segments) or zero-fill-on-demand pages for anonymous private
    mappings.

    Note that, when applied to shared mappings, MADV_DONTNEED might not
    lead to immediate freeing of the pages in the range.
    The kernel is free to delay freeing the pages until an appropriate
    moment. The resident set size (RSS) of the calling process will be
    immediately reduced however.

    MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages. (Pages marked with the kernel-internal VM_PFNMAP
    flag are special memory areas that are not managed by the virtual
    memory subsystem. Such pages are typically created by device drivers
    that map the pages into user space.)

and

    MADV_REMOVE (since Linux 2.6.16)
        Free up a given range of pages and its associated backing store.
        This is equivalent to punching a hole in the corresponding byte
        range of the backing store (see fallocate(2)). Subsequent
        accesses in the specified address range will see bytes containing
        zero.

        The specified address range must be mapped shared and writable.
        This flag cannot be applied to locked pages, Huge TLB pages, or
        VM_PFNMAP pages.

Combining these two:

1. Access to the freed memory doesn't give any error but returns 0. Won't that lead to silent corruption?

2. They are not supported with Huge TLB pages, so they cannot be used when huge_pages = on?

With the current approach, we get SIGBUS or SIGSEGV (signal 11) when a process tries to access the freed memory. That protection won't be there with madvise().

The synchronization mechanism in this patch is inspired by Thomas's implementation posted in [1]. I still need to go through Tomas's detailed comments and address those which still apply. The patches are still WIP, with many TODOs, but I wanted to get some feedback on the proposed UI and synchronization as described above. I will be looking into the cases below one by one:

1. New backends joining while the synchronization is going on; an existing backend exiting.
2. Failure or crash in the backend which is executing pg_resize_buffer_pool().
3. Fixing crashes in the tests.

[1] postgr.es/m/CA+hUKGL5hW3i_pk5y_gcbF_C5kP-pWFjCuM8bAyCeHo3xUaH8g@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Attachment
- 0003-Introduce-pending-flag-for-GUC-assign-hooks-20251013.patch
- 0002-Process-config-reload-in-AIO-workers-20251013.patch
- 0001-Add-system-view-for-shared-buffer-lookup-ta-20251013.patch
- 0004-Introduce-pss_barrierReceivedGeneration-20251013.patch
- 0005-Allow-to-use-multiple-shared-memory-mapping-20251013.patch
- 0008-Fix-compilation-failures-from-previous-comm-20251013.patch
- 0006-Address-space-reservation-for-shared-memory-20251013.patch
- 0007-Introduce-multiple-shmem-segments-for-share-20251013.patch
- 0010-WIP-Monitoring-views-20251013.patch
- 0009-Refactor-CalculateShmemSize-20251013.patch
- 0012-Initial-value-of-shared_buffers-or-NBuffers-20251013.patch
- 0013-Update-sizes-and-addresses-of-shared-memory-20251013.patch
- 0011-Allow-to-resize-shared-memory-without-resta-20251013.patch
- 0014-Support-shrinking-shared-buffers-20251013.patch
- 0015-Reinitialize-StrategyControl-after-resizing-20251013.patch
- 0016-Tests-for-dynamic-shared_buffers-resizing-20251013.patch
- 0017-Revert-Introduce-pending-flag-for-GUC-assig-20251013.patch
- 0018-Re-implement-UI-and-synchronization-for-res-20251013.patch