Thread: Re: Changing shared_buffers without restart
On Fri, Oct 18, 2024 at 3:21 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via changing shared memory mapping layout. Any feedback is appreciated.

A lot of people would like to have this feature, so I hope this proposal works out. Thanks for working on it.

I think the idea of having multiple shared memory segments is interesting and makes sense, but I would prefer to see them called "segments" rather than "slots", just as we do for DSMs. The name "slot" is somewhat overused, and invites confusion with replication slots, inter alia.

I think it's possible that having multiple fixed shared memory segments will spell trouble on Windows, where we already need to use a retry loop to try to get the main shared memory segment mapped at the correct address. If there are multiple segments and we need whatever ASLR stuff happens on Windows to not place anything else overlapping with any of them, that means there are more chances for stuff to fail than if we just need one address range to be free. Granted, the individual ranges are smaller, so maybe it's fine? But I don't know.

The big thing that worries me is synchronization, and while I've only looked at the patch set briefly, it doesn't look to me as though there's enough machinery here to make that work correctly. Suppose that shared_buffers=8GB (a million buffers) and I change it to shared_buffers=16GB (2 million buffers). As soon as any one backend has seen that change and expanded shared_buffers, there's a possibility that some other backend which has not yet seen the change might see a buffer number greater than a million. If it tries to use that buffer number before it absorbs the change, something bad will happen. The most obvious way for it to see such a buffer number - and possibly the only one - is to do a lookup in the buffer mapping table and find a buffer ID there that was inserted by some other backend that has already seen the change.

Fixing this seems tricky. My understanding is that BufferGetBlock() is extremely performance-critical, so having to do a bounds check there to make sure that a given buffer number is in range would probably be bad for performance. Also, even if the overhead weren't prohibitive, I don't think we can safely stick code that unmaps and remaps shared memory segments into a function that currently just does math, because we've probably got places where we assume this operation can't fail -- as well as places where we assume that if we call BufferGetBlock(i) and then BufferGetBlock(j), the second call won't change the answer to the first.

It seems to me that it's probably only safe to swap out a backend's notion of where shared_buffers is located when the backend holds no buffer pins, and maybe not even at every such place, because it would be a problem if a backend looks up the address of a buffer before actually pinning it, on the assumption that the answer can't change. I don't know if that ever happens, but it would be a legal coding pattern today. Doing it between statements seems safe as long as there are no cursors holding pins. Doing it in the middle of a statement is probably possible if we can verify that we're at a "safe" point in the code, but I'm not sure exactly which points are safe. If we have no code anywhere that assumes the address of an unpinned buffer can't change before we pin it, then I guess the check for pins is the only thing we need, but I don't know that to be the case.
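For reference, the shared-buffer path of BufferGetBlock() is just pointer arithmetic today -- roughly the following, simplified from bufmgr.h, with a comment marking where a hypothetical bounds check would have to go (NBuffersMapped and AbsorbBufferPoolResize are invented names used only to illustrate the point, not anything in the tree or in the patch):

static inline Block
BufferGetBlock(Buffer buffer)
{
	Assert(BufferIsValid(buffer));

	if (BufferIsLocal(buffer))
		return LocalBufferBlockPointers[-buffer - 1];

	/*
	 * A resizable buffer pool would seemingly need something like
	 *
	 *     if (unlikely(buffer > NBuffersMapped))
	 *         AbsorbBufferPoolResize();
	 *
	 * right here -- i.e. exactly the kind of can-fail, can-remap work
	 * that callers currently assume this function never does.
	 */

	/* shared buffer: plain pointer arithmetic into one big mapping */
	return (Block) (BufferBlocks + ((Size) (buffer - 1)) * BLCKSZ);
}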
I guess I would have imagined that a change like this would have to be done in phases. In phase 1, we'd tell all of the backends that shared_buffers had expanded to some new, larger value; but the new buffers wouldn't be usable for anything yet. Then, once we confirmed that everyone had the memo, we'd tell all the backends that those buffers are now available for use. If shared_buffers were contracted, phase 1 would tell all of the backends that shared_buffers had contracted to some new, smaller value. Once a particular backend learns about that, they will refuse to put any new pages into those high-numbered buffers, but the existing contents would still be valid. Once everyone has been told about this, we can go through and evict all of those buffers, and then let everyone know that's done. Then they shrink their mappings.

It looks to me like the patch doesn't expand the buffer mapping table, which seems essential. But maybe I missed that.

--
Robert Haas
EDB: http://www.enterprisedb.com
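One hypothetical way to sequence those two phases is to lean on the existing ProcSignalBarrier machinery. The sketch below only illustrates the ordering, not patch code: PROCSIGNAL_BARRIER_SHMEM_RESIZE, ShmemResizeCtl and EvictBuffersAbove are invented names.

/*
 * Hypothetical two-phase resize of shared_buffers.  Each barrier waits until
 * every backend has run its absorb callback for the announced change.
 */
static void
ResizeSharedBuffers(int newNBuffers)
{
	uint64		gen;

	/* Phase 1: announce the new size; backends remap, but may not place
	 * new pages into buffers beyond the smaller of the old/new size yet. */
	pg_atomic_write_u32(&ShmemResizeCtl->pending_nbuffers, newNBuffers);
	gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
	WaitForProcSignalBarrier(gen);		/* everyone has the memo */

	/* When shrinking, evict everything above the new size before going on. */
	if (newNBuffers < NBuffers)
		EvictBuffersAbove(newNBuffers);

	/* Phase 2: the new size is fully in effect; backends may use the new
	 * buffers, or unmap the evicted range when shrinking. */
	pg_atomic_write_u32(&ShmemResizeCtl->active_nbuffers, newNBuffers);
	gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
	WaitForProcSignalBarrier(gen);
}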
> On Mon, Nov 25, 2024 at 02:33:48PM GMT, Robert Haas wrote:
>
> I think the idea of having multiple shared memory segments is interesting and makes sense, but I would prefer to see them called "segments" rather than "slots", just as we do for DSMs. The name "slot" is somewhat overused, and invites confusion with replication slots, inter alia. I think it's possible that having multiple fixed shared memory segments will spell trouble on Windows, where we already need to use a retry loop to try to get the main shared memory segment mapped at the correct address. If there are multiple segments and we need whatever ASLR stuff happens on Windows to not place anything else overlapping with any of them, that means there are more chances for stuff to fail than if we just need one address range to be free. Granted, the individual ranges are smaller, so maybe it's fine? But I don't know.

I haven't had a chance to experiment with that on Windows, but I'm hoping that in the worst case falling back to a single mapping via the proposed infrastructure (with the consequent limitations) would be acceptable.

> The big thing that worries me is synchronization, and while I've only looked at the patch set briefly, it doesn't look to me as though there's enough machinery here to make that work correctly. Suppose that shared_buffers=8GB (a million buffers) and I change it to shared_buffers=16GB (2 million buffers). As soon as any one backend has seen that change and expanded shared_buffers, there's a possibility that some other backend which has not yet seen the change might see a buffer number greater than a million. If it tries to use that buffer number before it absorbs the change, something bad will happen. The most obvious way for it to see such a buffer number - and possibly the only one - is to do a lookup in the buffer mapping table and find a buffer ID there that was inserted by some other backend that has already seen the change.

Right, I haven't put much effort into synchronization yet. It's on my list for the next iteration of the patch.

> code, but I'm not sure exactly which points are safe. If we have no code anywhere that assumes the address of an unpinned buffer can't change before we pin it, then I guess the check for pins is the only thing we need, but I don't know that to be the case.

Probably I'm missing something here. What scenario do you have in mind, when the address of a buffer is changing?

> I guess I would have imagined that a change like this would have to be done in phases. In phase 1, we'd tell all of the backends that shared_buffers had expanded to some new, larger value; but the new buffers wouldn't be usable for anything yet. Then, once we confirmed that everyone had the memo, we'd tell all the backends that those buffers are now available for use. If shared_buffers were contracted, phase 1 would tell all of the backends that shared_buffers had contracted to some new, smaller value. Once a particular backend learns about that, they will refuse to put any new pages into those high-numbered buffers, but the existing contents would still be valid. Once everyone has been told about this, we can go through and evict all of those buffers, and then let everyone know that's done. Then they shrink their mappings.

Yep, sounds good. I was pondering a cruder approach, but doing this in phases seems to be the way to go.
> It looks to me like the patch doesn't expand the buffer mapping table, which seems essential. But maybe I missed that.

Do you mean the "Shared Buffer Lookup Table"? It does expand it, but under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look at the code, I see a few issues around that -- so I would have to improve it anyway, thanks for pointing that out.
On Tue, Nov 26, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I haven't had a chance to experiment with that on Windows, but I'm hoping that in the worst case falling back to a single mapping via the proposed infrastructure (with the consequent limitations) would be acceptable.

Yeah, if you can still fall back to a single mapping, I think that's OK. It would be nicer if it could work on every platform in the same way, but half a loaf is better than none.

> > code, but I'm not sure exactly which points are safe. If we have no code anywhere that assumes the address of an unpinned buffer can't change before we pin it, then I guess the check for pins is the only thing we need, but I don't know that to be the case.
>
> Probably I'm missing something here. What scenario do you have in mind, when the address of a buffer is changing?

I was assuming that if you expand the mapping for shared_buffers, you can't count on the new mapping being at the same address as the old mapping. If you can, that makes things simpler, but what if the OS has mapped something else just afterward, in the address space that you're hoping to use when you expand the mapping?

> > It looks to me like the patch doesn't expand the buffer mapping table, which seems essential. But maybe I missed that.
>
> Do you mean the "Shared Buffer Lookup Table"? It does expand it, but under the somewhat unfitting name STRATEGY_SHMEM_SLOT. But now that I look at the code, I see a few issues around that -- so I would have to improve it anyway, thanks for pointing that out.

Yeah, we -- or at least I -- usually call that the buffer mapping table. There are identifiers like BufMappingPartitionLock, for example. I'm slightly surprised that the ShmemInitHash() call uses something else as the identifier, but I guess that's how it is.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On Wed, Nov 27, 2024 at 10:20:27AM GMT, Robert Haas wrote:
>
> > > code, but I'm not sure exactly which points are safe. If we have no code anywhere that assumes the address of an unpinned buffer can't change before we pin it, then I guess the check for pins is the only thing we need, but I don't know that to be the case.
> >
> > Probably I'm missing something here. What scenario do you have in mind, when the address of a buffer is changing?
>
> I was assuming that if you expand the mapping for shared_buffers, you can't count on the new mapping being at the same address as the old mapping. If you can, that makes things simpler, but what if the OS has mapped something else just afterward, in the address space that you're hoping to use when you expand the mapping?

Yes, that's the whole point of the exercise with remap -- to keep addresses unchanged, making buffer management simpler and allowing mappings to be resized more quickly. The trade-off is that we would need to take care of shared mapping placement.

My understanding is that clashing of mappings (either at creation time or when resizing) could happen only within the process address space, and the assumption is that by the time we prepare the mapping layout, all the rest of the mappings for the process are already done. But I agree, it's an interesting question -- I'm going to investigate whether those assumptions could be wrong under certain conditions.

Currently, if something else is mapped at the address where we want to expand the mapping, we will get an error and can decide how to proceed (e.g. if it happens at creation time, proceed with a single mapping, otherwise ignore the mapping resize).
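For illustration, the in-place resize boils down to something like the sketch below (simplified, not the actual patch code; the backing memfd would also have to be extended with ftruncate, which is omitted here). Without MREMAP_MAYMOVE the kernel either resizes the mapping in place, keeping the start address unchanged, or fails with ENOMEM -- for example when something else is already mapped right after it -- and that failure is exactly the point where we can decide how to proceed.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <errno.h>

/* Grow or shrink an existing mapping without letting it move. */
static int
resize_mapping_in_place(void *addr, size_t old_size, size_t new_size)
{
	/* flags = 0: no MREMAP_MAYMOVE, so the address stays stable or we fail */
	if (mremap(addr, old_size, new_size, 0) == MAP_FAILED)
		return (errno == ENOMEM) ? -1 : -2;	/* caller decides what to do */
	return 0;
}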
On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> My understanding is that clashing of mappings (either at creation time or when resizing) could happen only within the process address space, and the assumption is that by the time we prepare the mapping layout, all the rest of the mappings for the process are already done.

I don't think that's correct at all. First, the user could type LOAD 'whatever' at any time. But second, even if they don't or you prohibit them from doing so, the process could allocate memory for any of a million different things, and that could require mapping a new region of memory, and the OS could choose to place that just after an existing mapping, or at least close enough that we can't expand the object size as much as desired.

If we had an upper bound on the size of shared_buffers and could reserve that amount of address space at startup time but only actually map a portion of it, then we could later remap and expand into the reserved space. Without that, I think there's absolutely no guarantee that the amount of address space that we need is available when we want to extend a mapping.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
> If we had an upper bound on the size of shared_buffers

I think a fairly reliable upper bound is the amount of physical memory on the system at time of postmaster start. We could make it a GUC to set the upper bound for the rare cases where people do stuff like adding swap space later or doing online VM growth. We could even have the default be something like 4x the physical memory to accommodate those people by default.

> reserve that amount of address space at startup time but only actually map a portion of it

Or is this the difficult part?
Hi,

On 2024-11-27 16:05:47 -0500, Robert Haas wrote:
> On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > My understanding is that clashing of mappings (either at creation time or when resizing) could happen only within the process address space, and the assumption is that by the time we prepare the mapping layout, all the rest of the mappings for the process are already done.
>
> I don't think that's correct at all. First, the user could type LOAD 'whatever' at any time. But second, even if they don't or you prohibit them from doing so, the process could allocate memory for any of a million different things, and that could require mapping a new region of memory, and the OS could choose to place that just after an existing mapping, or at least close enough that we can't expand the object size as much as desired.
>
> If we had an upper bound on the size of shared_buffers and could reserve that amount of address space at startup time but only actually map a portion of it, then we could later remap and expand into the reserved space. Without that, I think there's absolutely no guarantee that the amount of address space that we need is available when we want to extend a mapping.

Strictly speaking we don't actually need to map shared buffers to the same location in each process... We do need that for most other uses of shared memory, including the buffer mapping table, but not for the buffer data itself.

Whether it's worth the complexity of dealing with differing locations is another matter.

Greetings,

Andres Freund
On Wed, Nov 27, 2024 at 4:28 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> On Wed, 27 Nov 2024 at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
> > If we had an upper bound on the size of shared_buffers
>
> I think a fairly reliable upper bound is the amount of physical memory on the system at time of postmaster start. We could make it a GUC to set the upper bound for the rare cases where people do stuff like adding swap space later or doing online VM growth. We could even have the default be something like 4x the physical memory to accommodate those people by default.

Yes, Peter mentioned similar ideas on this thread last week.

> > reserve that amount of address space at startup time but only actually map a portion of it
>
> Or is this the difficult part?

I'm not sure how difficult this is, although I'm pretty sure that it's more difficult than adding a GUC. My point wasn't so much whether this is easy or hard but rather that it's essential if you want to avoid having addresses change when the resizing happens.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 27, 2024 at 4:41 PM Andres Freund <andres@anarazel.de> wrote:
> Strictly speaking we don't actually need to map shared buffers to the same location in each process... We do need that for most other uses of shared memory, including the buffer mapping table, but not for the buffer data itself.

Well, if it can move, then you have to make sure it doesn't move while someone's holding onto a pointer into it. I'm not exactly sure how hard it is to guarantee that, but we certainly do construct pointers into shared_buffers and use them at least for short periods of time, so it's not a purely academic concern.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On Wed, Nov 27, 2024 at 04:05:47PM GMT, Robert Haas wrote:
> On Wed, Nov 27, 2024 at 3:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > My understanding is that clashing of mappings (either at creation time or when resizing) could happen only within the process address space, and the assumption is that by the time we prepare the mapping layout, all the rest of the mappings for the process are already done.
>
> I don't think that's correct at all. First, the user could type LOAD 'whatever' at any time. But second, even if they don't or you prohibit them from doing so, the process could allocate memory for any of a million different things, and that could require mapping a new region of memory, and the OS could choose to place that just after an existing mapping, or at least close enough that we can't expand the object size as much as desired.
>
> If we had an upper bound on the size of shared_buffers and could reserve that amount of address space at startup time but only actually map a portion of it, then we could later remap and expand into the reserved space. Without that, I think there's absolutely no guarantee that the amount of address space that we need is available when we want to extend a mapping.

I've just done a couple of experiments, and I think this could be addressed by careful placing of mappings as well, based on two assumptions: for a new mapping the kernel always picks the lowest address that allows enough space, and the maximum amount of allocable memory for other mappings can be derived from the total available memory. With that in mind, the shared mapping layout will have to have a large gap at the start, between the lowest address and the shared mappings used for buffers and the rest -- the gap where all the other mappings (allocations, libraries, madvise, etc.) will land. It's similar to the address space reservation you mentioned above, will reduce the possibility of clashing significantly, and looks something like this:

01339000-0139e000                  [heap]
0139e000-014aa000                  [heap]
7f2dd72f6000-7f2dfbc9c000          /memfd:strategy (deleted)
7f2e0209c000-7f2e269b0000          /memfd:checkpoint (deleted)
7f2e2cdb0000-7f2e516b4000          /memfd:iocv (deleted)
7f2e57ab4000-7f2e7c478000          /memfd:descriptors (deleted)
7f2ebc478000-7f2ee8d3c000          /memfd:buffers (deleted)
    ^ note the distance between two mappings, which is intended for resize
7f3168d3c000-7f318d600000          /memfd:main (deleted)
    ^ here is where the gap starts
7f4194c00000-7f4194e7d000
    ^ this one is an anonymous mapping created by a large memory allocation
      after the shared mappings were created
7f4195000000-7f419527d000
7f41952dc000-7f4195416000
7f4195416000-7f4195600000          /dev/shm/PostgreSQL.2529797530
7f4195600000-7f41a311d000          /usr/lib/locale/locale-archive
7f41a317f000-7f41a3200000
7f41a3200000-7f41a3201000          /usr/lib64/libicudata.so.74.2

The assumption about picking the lowest address is just how it works right now on Linux, and this fact is already used in the patch. The idea that we could put an upper boundary on the size of the other mappings, based on total available memory, comes from the fact that anonymous mappings much larger than memory will fail without overcommit. With overcommit it becomes different, but if allocations are hitting that limit, I can imagine there are bigger problems than shared buffer resize.

This approach follows the same ideas already used in the patch, and has the same trade-offs: no address changes, but questions about portability.
On Thu, Nov 28, 2024 at 11:30 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> The assumption about picking the lowest address is just how it works right now on Linux, and this fact is already used in the patch. The idea that we could put an upper boundary on the size of the other mappings, based on total available memory, comes from the fact that anonymous mappings much larger than memory will fail without overcommit. With overcommit it becomes different, but if allocations are hitting that limit, I can imagine there are bigger problems than shared buffer resize.
>
> This approach follows the same ideas already used in the patch, and has the same trade-offs: no address changes, but questions about portability.

I definitely welcome the fact that you have some platform-specific knowledge of the Linux behavior, because that's expertise that is obviously quite useful here and which I lack. I'm personally not overly concerned about whether it works on every other platform -- I would prefer an implementation that works everywhere, but I'd rather have one that works on Linux than have nothing. It's unclear to me why operating systems don't offer better primitives for this sort of thing -- in theory there could be a system call that sets aside a pool of address space and then other system calls that let you allocate shared/unshared memory within that space or even at specific addresses, but actually such things don't exist.

All that having been said, what does concern me a bit is our ability to predict what Linux will do well enough to keep what we're doing safe; and also whether the Linux behavior might abruptly change in the future. Users would be sad if we released this feature and then a future kernel upgrade caused PostgreSQL to completely stop working. I don't know how the Linux kernel developers actually feel about this sort of thing, but if I imagine myself as a kernel developer, I can totally see myself saying "well, we never promised that this would work in any particular way, so we're free to change it whenever we like." We've certainly used that argument here countless times.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
>
> [...] It's unclear to me why operating systems don't offer better primitives for this sort of thing -- in theory there could be a system call that sets aside a pool of address space and then other system calls that let you allocate shared/unshared memory within that space or even at specific addresses, but actually such things don't exist.

Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall allows you to request memory from the OS at arbitrary addresses - it's just that stdlib's malloc doesn't expose the 'alloc at this address' part of that API. Windows seems to have an equivalent API in VirtualAlloc*.

Both the Windows API and Linux's mmap have an optional address argument, which (when not NULL) is where the allocation will be placed (some conditions apply, based on flags and the specific API used), so, assuming we have some control over where to allocate memory, we should be able to reserve enough memory by using these APIs.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
> On Thu, Nov 28, 2024 at 12:18:54PM GMT, Robert Haas wrote:
>
> All that having been said, what does concern me a bit is our ability to predict what Linux will do well enough to keep what we're doing safe; and also whether the Linux behavior might abruptly change in the future. Users would be sad if we released this feature and then a future kernel upgrade caused PostgreSQL to completely stop working. I don't know how the Linux kernel developers actually feel about this sort of thing, but if I imagine myself as a kernel developer, I can totally see myself saying "well, we never promised that this would work in any particular way, so we're free to change it whenever we like." We've certainly used that argument here countless times.

Agree, at the moment I can't say for sure how reliable this behavior is in the long term. I'll try to see if there are ways to get more confidence about that.
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...] It's unclear to me why operating systems don't offer better primitives for this sort of thing -- in theory there could be a system call that sets aside a pool of address space and then other system calls that let you allocate shared/unshared memory within that space or even at specific addresses, but actually such things don't exist.

> Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall allows you to request memory from the OS at arbitrary addresses - it's just that stdlib's malloc doesn't expose the 'alloc at this address' part of that API.

I think what Robert is concerned about is that there is exactly 0 guarantee that that will succeed, because you have no control over system-driven allocations of address space (for example, loading of extensions or JIT code). In fact, given things like ASLR, there is pressure on the kernel crew to make that *less* predictable not more so. So even if we devise a method that seems to work reliably today, we could have little faith that it would work with next year's kernels.

			regards, tom lane
On Thu, 28 Nov 2024 at 19:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> > On Thu, 28 Nov 2024 at 18:19, Robert Haas <robertmhaas@gmail.com> wrote:
> >> [...] It's unclear to me why operating systems don't offer better primitives for this sort of thing -- in theory there could be a system call that sets aside a pool of address space and then other system calls that let you allocate shared/unshared memory within that space or even at specific addresses, but actually such things don't exist.
>
> > Isn't that more a stdlib/malloc issue? AFAIK, Linux's mmap(2) syscall allows you to request memory from the OS at arbitrary addresses - it's just that stdlib's malloc doesn't expose the 'alloc at this address' part of that API.
>
> I think what Robert is concerned about is that there is exactly 0 guarantee that that will succeed, because you have no control over system-driven allocations of address space (for example, loading of extensions or JIT code). In fact, given things like ASLR, there is pressure on the kernel crew to make that *less* predictable not more so.

I see what you mean, but I think that shouldn't be much of an issue. I'm not a kernel hacker, but I've never heard of anyone arguing to remove mmap's mapping-overwriting behavior for user-controlled mappings - it seems too useful as a way to guarantee relative memory addresses (agreed, there is now mseal(2), but that is the user asking for security on their own mapping; it isn't applied to arbitrary mappings).

I mean, we can do the following to get a nice contiguous empty address space no other mmap(NULL)s will get put into:

/* reserve size bytes of memory */
base = mmap(NULL, size, PROT_NONE, ...flags, ...);

/* use the first small_size bytes of that reservation */
allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE, MAP_FIXED, ...);

With the PROT_NONE protection option the OS doesn't actually allocate any backing memory, but it guarantees no other mmap(NULL, ...) will get placed in that area such that it overlaps with that allocation until the area is munmap-ed, thus allowing us to reserve a chunk of address space without actually using (much) memory. Deallocations have to go through mmap(... PROT_NONE, ...) instead of munmap if we want to keep the full area reserved, but I think that's not much of an issue.

I also highly doubt Linux will remove or otherwise limit the PROT_NONE option to such a degree that we won't be able to "balloon" the memory address space for (e.g.) dynamic shared buffer resizing. See also: FreeBSD's MAP_GUARD mmap flag, and Windows' MEM_RESERVE and MEM_RESERVE_PLACEHOLDER flags for VirtualAlloc[2][Ex]. See also [0], where PROT_NONE is explicitly called out as a tool for reserving memory address space.

> So even if we devise a method that seems to work reliably today, we could have little faith that it would work with next year's kernels.

I really don't think that userspace memory address space reservations through e.g. PROT_NONE or MEM_RESERVE[_PLACEHOLDER] will be retired anytime soon, at least not without the relevant kernels also providing effective alternatives.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.gnu.org/software/libc/manual/html_node/Memory-Protection.html
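To make the ballooning sketch above concrete, a self-contained version could look roughly like this (Linux-flavored sketch for a 64-bit system; the 64GB/1GB sizes and the shared anonymous backing are illustrative assumptions, not something any patch prescribes):

#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t	reserve_size = (size_t) 64 * 1024 * 1024 * 1024;	/* 64GB of address space */
	size_t	commit_size = (size_t) 1 * 1024 * 1024 * 1024;		/* 1GB actually usable */

	/* Reserve address space only: PROT_NONE, nothing is backed yet. */
	void   *base = mmap(NULL, reserve_size, PROT_NONE,
						MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (base == MAP_FAILED)
	{
		perror("reserve");
		return 1;
	}

	/*
	 * Commit the first part of the reservation.  MAP_FIXED overwrites the
	 * PROT_NONE pages at exactly that address, so the start address of the
	 * usable region stays stable for the lifetime of the reservation.
	 */
	void   *buffers = mmap(base, commit_size, PROT_READ | PROT_WRITE,
						   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (buffers == MAP_FAILED)
	{
		perror("commit");
		return 1;
	}

	printf("reserved %zu bytes at %p, first %zu bytes usable\n",
		   reserve_size, base, commit_size);
	return 0;
}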
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I mean, we can do the following to get a nice contiguous empty address space no other mmap(NULL)s will get put into:
>
> /* reserve size bytes of memory */
> base = mmap(NULL, size, PROT_NONE, ...flags, ...);
> /* use the first small_size bytes of that reservation */
> allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE, MAP_FIXED, ...);
>
> With the PROT_NONE protection option the OS doesn't actually allocate any backing memory, but it guarantees no other mmap(NULL, ...) will get placed in that area such that it overlaps with that allocation until the area is munmap-ed, thus allowing us to reserve a chunk of address space without actually using (much) memory.

Well, that's all great if it works portably. But I don't see one word in either POSIX or the Linux mmap(2) man page that promises those semantics for PROT_NONE. I also wonder how well a giant chunk of "unbacked" address space will interoperate with the OOM killer, top(1)'s display of used memory, and other things that have caused us headaches with large shared-memory arenas. Maybe those issues are all in the past and this'll work great. I'm not holding my breath though.

			regards, tom lane
> On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
>
> I mean, we can do the following to get a nice contiguous empty address space no other mmap(NULL)s will get put into:
>
> /* reserve size bytes of memory */
> base = mmap(NULL, size, PROT_NONE, ...flags, ...);
> /* use the first small_size bytes of that reservation */
> allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE, MAP_FIXED, ...);
>
> With the PROT_NONE protection option the OS doesn't actually allocate any backing memory, but it guarantees no other mmap(NULL, ...) will get placed in that area such that it overlaps with that allocation until the area is munmap-ed, thus allowing us to reserve a chunk of address space without actually using (much) memory.

From what I understand, it's not much different from the scenario where we just map as much as we want in advance. The actual memory will not be allocated in either case due to CoW, and the oom_score seems to be the same. I agree it sounds attractive, but after some experimenting it looks like it won't work with huge pages inside a cgroup v2 (= container).

The reason is that Linux has recently learned to apply memory reservation limits for hugetlb inside a cgroup, and those limits are applied at mmap time. Nowadays this feature is often configured out of the box in various container orchestrators, meaning that a scenario like "set hugetlb=1GB on a container, reserve 32GB with PROT_NONE" will fail. I've also tried to mix and match, reserving some address space via a non-hugetlb mapping and allocating a hugetlb mapping out of it, but that doesn't work either (the smaller mmap complains about MAP_HUGETLB with EINVAL).
Hi,

On 2024-11-28 17:30:32 +0100, Dmitry Dolgov wrote:
> The assumption about picking the lowest address is just how it works right now on Linux, and this fact is already used in the patch. The idea that we could put an upper boundary on the size of the other mappings, based on total available memory, comes from the fact that anonymous mappings much larger than memory will fail without overcommit.

The overcommit issue shouldn't be a big hurdle - by mmap()ing with MAP_NORESERVE the space isn't reserved. Then madvise with MADV_POPULATE_WRITE can be used to actually populate the used range of the mapping, and MADV_REMOVE can be used to shrink the mapping again.

> With overcommit it becomes different, but if allocations are hitting that limit, I can imagine there are bigger problems than shared buffer resize.

I'm fairly sure it'll not work to just disregard issues around overcommit. An overly large memory allocation, without MAP_NORESERVE, will actually reduce the amount of memory that can be used for other allocations. That's obviously problematic, because you'll now have smaller shared buffers, but can't use the memory for work_mem type allocations...

Greetings,

Andres Freund
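For what it's worth, that pattern might look roughly like the sketch below (Linux-specific; MADV_POPULATE_WRITE requires kernel 5.14+ and recent headers, and MADV_REMOVE works here because a shared anonymous mapping is shmem-backed -- this is an illustration of the idea, not patch code):

#include <sys/mman.h>

/* Map the maximum possible buffer pool size, but reserve address space only. */
static void *
map_buffer_pool(size_t max_size)
{
	/* MAP_NORESERVE keeps the unused part out of overcommit accounting */
	return mmap(NULL, max_size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Populate (and charge) only the newly used range when growing. */
static int
grow_buffer_pool(void *base, size_t old_size, size_t new_size)
{
	return madvise((char *) base + old_size, new_size - old_size,
				   MADV_POPULATE_WRITE);
}

/* Punch a hole: give the pages back while keeping the mapping itself. */
static int
shrink_buffer_pool(void *base, size_t old_size, size_t new_size)
{
	return madvise((char *) base + new_size, old_size - new_size,
				   MADV_REMOVE);
}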
> On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote:
> > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
> >
> > I mean, we can do the following to get a nice contiguous empty address space no other mmap(NULL)s will get put into:
> >
> > /* reserve size bytes of memory */
> > base = mmap(NULL, size, PROT_NONE, ...flags, ...);
> > /* use the first small_size bytes of that reservation */
> > allocated_in_reserved = mmap(base, small_size, PROT_READ | PROT_WRITE, MAP_FIXED, ...);
> >
> > With the PROT_NONE protection option the OS doesn't actually allocate any backing memory, but it guarantees no other mmap(NULL, ...) will get placed in that area such that it overlaps with that allocation until the area is munmap-ed, thus allowing us to reserve a chunk of address space without actually using (much) memory.
>
> From what I understand, it's not much different from the scenario where we just map as much as we want in advance. The actual memory will not be allocated in either case due to CoW, and the oom_score seems to be the same. I agree it sounds attractive, but after some experimenting it looks like it won't work with huge pages inside a cgroup v2 (= container).
>
> The reason is that Linux has recently learned to apply memory reservation limits for hugetlb inside a cgroup, and those limits are applied at mmap time. Nowadays this feature is often configured out of the box in various container orchestrators, meaning that a scenario like "set hugetlb=1GB on a container, reserve 32GB with PROT_NONE" will fail. I've also tried to mix and match, reserving some address space via a non-hugetlb mapping and allocating a hugetlb mapping out of it, but that doesn't work either (the smaller mmap complains about MAP_HUGETLB with EINVAL).

I've asked about that in linux-mm [1]. To my surprise, the recommendation was to stick to creating a large mapping in advance and slicing smaller mappings out of it, which can then be resized later. The OOM score should not be affected, and the hugetlb limitation can be avoided by using the MAP_NORESERVE flag for the initial mapping (I've experimented with that, and it seems to work just fine, even if the slices are not using MAP_NORESERVE).

I guess that means I'll try to experiment with this approach as well. But what do others think? How much research do we need to do to gain some confidence about large shared mappings and make them realistically acceptable?

[1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i@nrmtbhemy3is/t/
On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> I've asked about that in linux-mm [1]. To my surprise, the recommendation was to stick to creating a large mapping in advance and slicing smaller mappings out of it, which can then be resized later. The OOM score should not be affected, and the hugetlb limitation can be avoided by using the MAP_NORESERVE flag for the initial mapping (I've experimented with that, and it seems to work just fine, even if the slices are not using MAP_NORESERVE).
>
> I guess that means I'll try to experiment with this approach as well. But what do others think? How much research do we need to do to gain some confidence about large shared mappings and make them realistically acceptable?

Personally, I like this approach. It seems to me that this opens up the possibility of a system where the virtual addresses of data structures in shared memory never change, which I think will avoid an absolutely massive amount of implementation complexity. It's obviously not ideal that we have to specify in advance an upper limit on the potential size of shared_buffers, but we can live with it. It's better than what we have today; and certainly cloud providers will have no issue with pre-setting that to a reasonable value. I don't know if we can port it to other operating systems, but it seems at least possible that they offer similar primitives, or will in the future; if not, we can disable the feature on those platforms.

I still think the synchronization is going to be tricky. For example, when you go to shrink a mapping, you need to make sure that it's free of buffers that anyone might touch; and when you grow a mapping, you need to make sure that nobody tries to touch that address space before they grow the mapping, which goes back to my earlier point about someone doing a lookup into the buffer mapping table and finding a buffer number that is beyond the end of what they've already mapped. But I think it may be doable with sufficient cleverness.

--
Robert Haas
EDB: http://www.enterprisedb.com