On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are
> questionable from portability point of view. This leaves memfd_create, and I'm
> still not completely clear on it's portability -- it seems to be specific to
> Linux, but others provide compatible implementation as well.
Something like this should work, roughly based on DSM code except here
we don't really need the name so we unlink it immediately, at the
slight risk of leaking it if the postmaster is killed between those
lines (maybe someone should go and tell POSIX to support the special
name SHM_ANON or some other way to avoid that; I can't see any
portable workaround). Not tested/compiled, just a sketch:
#ifdef HAVE_MEMFD_CREATE
	/* Anonymous shared memory region. */
	fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
	/* Standard POSIX insists on a name, which we unlink immediately. */
	do
	{
		char		tmp[80];

		snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
				 pg_prng_uint32(&pg_global_prng_state));
		fd = shm_open(tmp, O_RDWR | O_CREAT | O_EXCL, PG_FILE_MODE_OWNER);
		if (fd >= 0)
			shm_unlink(tmp);
	} while (fd < 0 && errno == EEXIST);
#endif
> Let me experiment with this idea a bit, I would like to make sure there are no
> other limitations we might face.
One thing I'm still wondering about is whether you really need all
this multi-phase barrier stuff, or even need to stop other backends
from running at all while doing the resize. I guess that's related to
your remapping scheme, but supposing you find the simple
ftruncate()-only approach to be good, my next question is: why isn't
it enough to wait for all backends to agree to stop allocating new
buffers in the range to be truncated, and then let them continue to
run as normal? As far as they are concerned, the in-progress
downsize has already happened, though it could be reverted later if
the eviction phase fails. Then the coordinator could start evicting
buffers and truncating the shared memory object, which are
phases/steps, sure, but it's not clear to me why they need other
backends' help.
AFAIU, we required the phased approach because mremap needed to happen in every backend after buffer eviction but before any further modification of shared memory. If we don't need to call mremap in every backend, and ftruncate plus initializing the new memory (when expanding buffers) is enough, I think the phased approach isn't needed. But I haven't tried it myself.
Here's patchset rebased on 3feff3916ee106c084eca848527dc2d2c3ef4e89.
0001 - 0008 are same as the previous patchset
0009 adds support to shrink shared buffers. It has two changes: a. evict the buffers beyond the new buffer size, and b. remove buffers with buffer ids beyond the new buffer size from the free list. If a buffer being evicted is pinned, the operation is aborted and a FATAL error is raised. I think we need to change this behaviour to something less severe, like rolling back the operation or waiting for the pinned buffer to be unpinned; better still if we could let users control the behaviour. But we need better infrastructure to do such things. That's one TODO left in the patch.
0010 is about reinitializing the StrategyControl area. Once we expand the buffers, the new buffers need to be added to the free list, and some StrategyControl members (not all) need to be adjusted. That's what this patch does. But deeper adjustments in BgBufferSync() and ClockSweepTick() are required. Further, we need to do something about the buffer lookup table. More on that later in this email.
0011-0012 fix compilation issues in these patches, but those fixes are not correct. They are there only so that the binaries can be built and someone can experiment with buffer resizing. The good thing is that the compilation fixes are confined to the SQL-callable functions pg_get_shmem_pagesize() and pg_get_shmem_numa(), so these patches have no ill effect as long as those two functions are not called.
Buffer lookup table resizing
------------------------------------
The size of the buffer lookup table depends upon (number of shared buffers + number of partitions in the shared buffer lookup table). If we shrink the buffer pool, the buffer lookup table becomes sparse but remains usable. If we expand the buffer pool, we need to expand the buffer lookup table too. That's not implemented in the current patchset. There are two solutions here:
1. We map a lot of extra address space (not memory) initially to accommodate future expansion of the shared buffer pool. Let's say the total address space is sufficient to accommodate Nx buffers. The simple solution is to allocate a buffer lookup table with Nx entries up front so that we never have to resize it. It will waste memory, but we might be ok with that as a version 1 solution. According to my offline discussion with David Rowley, lookups in sparse hash tables are inefficient because of more cacheline misses. Whether that translates to any noticeable degradation in TPS needs to be measured.
2. The alternative is to resize the buffer lookup table as well. This means rehashing all the entries, which may take a long time, and the partitions will remain locked for that duration. Not to mention this will require a non-trivial change to the dynahash implementation.
Next I will look at BgBufferSync() and ClockSweepTick() adjustments and then buffer lookup table fix with approach 1.