Thread: PGC_SIGHUP shared_buffers?
Hi, I remember Magnus making a comment many years ago to the effect that every setting that is PGC_POSTMASTER is a bug, but some of those bugs are very difficult to fix. Perhaps the use of the word bug is arguable, but I think the sentiment is apt, especially with regard to shared_buffers. Changing it without a server restart would be really nice, but it's hard to figure out how to do it. I can think of a few basic approaches, and I'd like to know (a) which ones people think are good and which ones people think suck (maybe they all suck) and (b) if anybody's got any other ideas not mentioned here.

1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock() is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means that we're expecting to find all of the buffers in a single giant array. Years ago, somebody proposed changing the implementation to essentially WhereIsTheBuffer[buffer], which was heavily criticized on performance grounds, because it requires an extra memory access. A gentler version of this might be something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; i.e. instead of allowing every single buffer to be at some random address, manage chunks of the buffer pool. This makes the lookup array potentially quite a lot smaller, which might mitigate performance concerns. For example, if you had one chunk per GB of shared_buffers, your mapping array would need only a handful of cache lines, or a few handfuls on really big systems. (I am here ignoring the difficulties of how to orchestrate addition of or removal of chunks as a SMOP[1]. Feel free to criticize that hand-waving, but as of this writing, I feel like moderate determination would suffice.)

2. Make a Buffer just a disguised pointer. Imagine something like typedef struct { Page bp; } *buffer. With this approach, BufferGetBlock() becomes trivial. The tricky part with this approach is that you still need a cheap way of finding the buffer header. What I imagine might work here is to again have some kind of chunked representation of shared_buffers, where each chunk contains a bunch of buffer headers at, say, the beginning, followed by a bunch of buffers. Theoretically, if the chunks are sufficiently strongly aligned, you can figure out what offset you're at within the chunk without any additional information and the whole process of locating the buffer header is just math, with no memory access. But in practice, getting the chunks to be sufficiently strongly aligned sounds hard, and this also makes a Buffer 64 bits rather than the current 32. A variant on this concept might be to make the Buffer even wider and include two pointers in it, i.e. typedef struct { Page bp; BufferDesc *bd; } Buffer.

3. Reserve lots of address space and then only use some of it. I hear rumors that some forks of PG have implemented something like this. The idea is that you convince the OS to give you a whole bunch of address space, but you try to avoid having all of it be backed by physical memory. If you later want to increase shared_buffers, you then get the OS to back more of it by physical memory, and if you later want to decrease shared_buffers, you hopefully have some way of giving the OS the memory back. As compared with the previous two approaches, this seems less likely to be noticeable to most PG code.
Problems include (1) you have to somehow figure out how much address space to reserve, and that forms an upper bound on how big shared_buffers can grow at runtime, and (2) you have to figure out ways to reserve address space and back more or less of it with physical memory that will work on all of the platforms that we currently support or might want to support in the future.

4. Give up on actually changing the size of shared_buffers per se, but stick some kind of resizable secondary cache in front of it. Data that is going to be manipulated gets brought into a (perhaps small?) "real" shared_buffers that behaves just like today, but you have some larger data structure which is designed to be easier to resize and maybe simpler in some other ways that sits between shared_buffers and the OS cache. This doesn't seem super-appealing because it requires a lot of data copying, but maybe it's worth considering as a last resort.

Thoughts?

-- Robert Haas EDB: http://www.enterprisedb.com

[1] https://en.wikipedia.org/wiki/Small_matter_of_programming
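For concreteness, a minimal sketch of the chunked translation in approach 1. The names (WhereIsTheChunkOfBuffers, CHUNK_SIZE, BufferGetBlockChunked) come from the email or are made up here, not from existing PostgreSQL code, and the fixed-size top-level array stands in for whatever structure would manage chunk addition and removal:

```c
/*
 * Hypothetical chunked Buffer -> pointer translation (approach 1).
 * With one chunk per GB of shared_buffers, CHUNK_SIZE is 131072 buffers
 * and the lookup array stays small enough to be cache resident.
 */
#include <stddef.h>

#define BLCKSZ      8192
#define CHUNK_SIZE  ((1024 * 1024 * 1024) / BLCKSZ)   /* buffers per 1GB chunk */

typedef int Buffer;                 /* 1-based, as in PostgreSQL today */
typedef char *Block;

/* One base address per chunk; entries could be added/removed at runtime. */
static Block WhereIsTheChunkOfBuffers[1024];          /* enough for 1TB here */

static inline Block
BufferGetBlockChunked(Buffer buffer)
{
    int idx = buffer - 1;           /* Buffer numbers start at 1 */

    return WhereIsTheChunkOfBuffers[idx / CHUNK_SIZE] +
        (size_t) (idx % CHUNK_SIZE) * BLCKSZ;
}
```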
Hi, On 2024-02-16 09:58:43 +0530, Robert Haas wrote: > I remember Magnus making a comment many years ago to the effect that > every setting that is PGC_POSTMASTER is a bug, but some of those bugs > are very difficult to fix. Perhaps the use of the word bug is > arguable, but I think the sentiment is apt, especially with regard to > shared_buffers. Changing it without a server restart would be really > nice, but it's hard to figure out how to do it. I can think of a few > basic approaches, and I'd like to know (a) which ones people think are > good and which ones people think suck (maybe they all suck) and (b) if > anybody's got any other ideas not mentioned here. IMO the ability to *shrink* shared_buffers dynamically and cheaply is more important than growing it in a way, except that they are related of course. Idling hardware is expensive, thus overcommitting hardware is very attractive (I count "serverless" as part of that). To be able to overcommit effectively, unused long-lived memory has to be released. I.e. shared buffers needs to be shrinkable. Perhaps worth noting that there are two things limiting the size of shared buffers: 1) the available buffer space 2) the available buffer *mapping* space. I think making the buffer mapping resizable is considerably harder than the buffers themselves. Of course pre-reserving memory for a buffer mapping suitable for a huge shared_buffers is more feasible than pre-allocating all that memory for the buffers themselves. But it'd still mean you'd have a maximum set at server start. > 1. Complicate the Buffer->pointer mapping. Right now, BufferGetBlock() > is basically just BufferBlocks + (buffer - 1) * BLCKSZ, which means > that we're expecting to find all of the buffers in a single giant > array. Years ago, somebody proposed changing the implementation to > essentially WhereIsTheBuffer[buffer], which was heavily criticized on > performance grounds, because it requires an extra memory access. A > gentler version of this might be something like > WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; > i.e. instead of allowing every single buffer to be at some random > address, manage chunks of the buffer pool. This makes the lookup array > potentially quite a lot smaller, which might mitigate performance > concerns. For example, if you had one chunk per GB of shared_buffers, > your mapping array would need only a handful of cache lines, or a few > handfuls on really big systems. Such a scheme still leaves you with a dependent memory read for a quite frequent operation. It could turn out to not matter hugely if the mapping array is cache resident, but I don't know if we can realistically bank on that. I'm also somewhat concerned about the coarse granularity being problematic. It seems like it'd lead to a desire to make the granule small, causing slowness. One big advantage of a scheme like this is that it'd be a step towards a NUMA aware buffer mapping and replacement. Practically everything beyond the size of a small consumer device these days has NUMA characteristics, even if not "officially visible". We could make clock sweeps (or a better victim buffer selection algorithm) happen within each "chunk", with some additional infrastructure to choose which of the chunks to search a buffer in. Using a chunk on the current NUMA node, except when there is a lot of imbalance between buffer usage or replacement rate between chunks. > 2. Make a Buffer just a disguised pointer. Imagine something like > typedef struct { Page bp; } *buffer.
With this approach, > BufferGetBlock() becomes trivial. You additionally need something that allows for efficient iteration over all shared buffers. Making buffer replacement and checkpointing more expensive isn't great. > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Another advantage is that you can shrink shared buffers fairly granularly and cheaply with that approach, compared to having to move buffers entirely out of a larger mapping to be able to unmap it. > Problems include (1) you have to somehow figure out how much address space > to reserve, and that forms an upper bound on how big shared_buffers can grow > at runtime, and Presumably you'd normally not want to reserve more than the physical amount of memory on the system. Sure, memory can be hot added, but IME that's quite rare. > (2) you have to figure out ways to reserve address space and > back more or less of it with physical memory that will work on all of the > platforms that we currently support or might want to support in the future. We also could decide to only implement 2) on platforms with suitable APIs. A third issue is that it can confuse administrators inspecting the system with OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT being huge etc. > 4. Give up on actually changing the size of shared_buffers per se, but > stick some kind of resizable secondary cache in front of it. Data that > is going to be manipulated gets brought into a (perhaps small?) "real" > shared_buffers that behaves just like today, but you have some larger > data structure which is designed to be easier to resize and maybe > simpler in some other ways that sits between shared_buffers and the OS > cache. This doesn't seem super-appealing because it requires a lot of > data copying, but maybe it's worth considering as a last resort. Yea, that seems quite unappealing. Needing buffer replacement to be able to pin a buffer would be ... unattractive. Greetings, Andres Freund
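To make approach 2 concrete, here is one way the "just math" lookup of the buffer header could work, assuming each chunk sits at a power-of-two-aligned address and is laid out as descriptors followed by pages. All names and sizes are invented for illustration; this is not existing PostgreSQL code, and the alignment requirement is exactly the part flagged above as hard:

```c
/*
 * Hypothetical layout for approach 2: each 16MB-aligned chunk holds
 * BUFFERS_PER_CHUNK buffer descriptors followed by the same number of 8kB
 * pages, so the descriptor can be recovered from the page pointer by
 * arithmetic alone, with no extra memory access.
 */
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ            8192
#define BUFFERS_PER_CHUNK 1024
#define DESC_SIZE         64                   /* padded per-buffer descriptor */
#define CHUNK_BYTES       (16u * 1024 * 1024)  /* power of two, holds descs + pages */

typedef struct { uint32_t state; } FakeBufferDesc;
typedef struct { char *bp; } FakeBuffer;       /* the "disguised pointer" */

static inline FakeBufferDesc *
GetBufferDescriptorFromPage(FakeBuffer buf)
{
    uintptr_t page  = (uintptr_t) buf.bp;
    uintptr_t chunk = page & ~((uintptr_t) CHUNK_BYTES - 1);  /* chunk start */
    uintptr_t pages = chunk + BUFFERS_PER_CHUNK * DESC_SIZE;  /* first page */
    size_t    n     = (page - pages) / BLCKSZ;                /* index within chunk */

    return (FakeBufferDesc *) (chunk + n * DESC_SIZE);
}
```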
On 16/02/2024 06:28, Robert Haas wrote: > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Problems include > (1) you have to somehow figure out how much address space to reserve, > and that forms an upper bound on how big shared_buffers can grow at > runtime and (2) you have to figure out ways to reserve address space > and back more or less of it with physical memory that will work on all > of the platforms that we currently support or might want to support in > the future. A variant of this approach: 5. Re-map the shared_buffers when needed. Between transactions, a backend should not hold any buffer pins. When there are no pins, you can munmap() the shared_buffers and mmap() it at a different address. -- Heikki Linnakangas Neon (https://neon.tech)
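As a rough illustration of the backend-side remap this implies (and anticipating Andres's point below that an anonymous mapping cannot be re-obtained by an already-running process, so a named POSIX shm segment is assumed): the segment name and helper are made up, error handling is minimal, and growing or shrinking the underlying segment itself is left out.

```c
/*
 * Hypothetical backend-side remap of a file-backed shared buffers segment.
 * Must only run at a point where this backend holds no buffer pins, since
 * every pointer into the old mapping becomes invalid.
 */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void  *buffers = NULL;      /* this backend's current mapping */
static size_t mapped_size = 0;

static int
remap_shared_buffers(size_t new_size)
{
    int fd = shm_open("/pg_shared_buffers", O_RDWR, 0);  /* hypothetical name */

    if (fd < 0)
        return -1;

    if (buffers != NULL)
        munmap(buffers, mapped_size);

    /* Let the kernel choose a new address; nothing relies on the old one. */
    buffers = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    if (buffers == MAP_FAILED)
    {
        buffers = NULL;            /* fatal for this backend in practice */
        return -1;
    }
    mapped_size = new_size;
    return 0;
}
```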
On Fri, Feb 16, 2024 at 5:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > 3. Reserve lots of address space and then only use some of it. I hear > rumors that some forks of PG have implemented something like this. The > idea is that you convince the OS to give you a whole bunch of address > space, but you try to avoid having all of it be backed by physical > memory. If you later want to increase shared_buffers, you then get the > OS to back more of it by physical memory, and if you later want to > decrease shared_buffers, you hopefully have some way of giving the OS > the memory back. As compared with the previous two approaches, this > seems less likely to be noticeable to most PG code. Problems include > (1) you have to somehow figure out how much address space to reserve, > and that forms an upper bound on how big shared_buffers can grow at > runtime and (2) you have to figure out ways to reserve address space > and back more or less of it with physical memory that will work on all > of the platforms that we currently support or might want to support in > the future. FTR I'm aware of a working experimental prototype along these lines, that will be presented in Vancouver: https://www.pgevents.ca/events/pgconfdev2024/sessions/session/31-enhancing-postgresql-plasticity-new-frontiers-in-memory-management/
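In case it helps to visualize approach 3, a Linux-specific sketch with made-up helpers (page-aligned sizes assumed): reserve the maximum range up front with PROT_NONE and MAP_NORESERVE so nothing is committed, then back or un-back pieces of it by mapping a shared segment over the range with MAP_FIXED. Releasing the backing store of the segment itself, and the portability questions raised above, are out of scope.

```c
/*
 * Hypothetical reserve-then-back scheme for approach 3 (Linux semantics).
 */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

static char  *region = NULL;       /* start of the reserved address range */
static size_t reserved = 0;

/* Reserve address space only; no physical memory or swap is committed. */
static void *
reserve_buffer_space(size_t max_size)
{
    region = mmap(NULL, max_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED)
        return NULL;
    reserved = max_size;
    return region;
}

/* Grow or shrink the backed part of the reservation to new_size bytes. */
static int
resize_buffer_space(int shm_fd, size_t old_size, size_t new_size)
{
    if (new_size > old_size)
    {
        /* Back [old_size, new_size) with the shared segment. */
        if (mmap(region + old_size, new_size - old_size,
                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                 shm_fd, (off_t) old_size) == MAP_FAILED)
            return -1;
    }
    else if (new_size < old_size)
    {
        /* Re-install the inaccessible reservation over the unused tail. */
        if (mmap(region + new_size, old_size - new_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                 -1, 0) == MAP_FAILED)
            return -1;
    }
    return 0;
}
```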
On Fri, 16 Feb 2024 at 21:24, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 16/02/2024 06:28, Robert Haas wrote: > > 3. Reserve lots of address space and then only use some of it. I hear > > rumors that some forks of PG have implemented something like this. The > > idea is that you convince the OS to give you a whole bunch of address > > space, but you try to avoid having all of it be backed by physical > > memory. If you later want to increase shared_buffers, you then get the > > OS to back more of it by physical memory, and if you later want to > > decrease shared_buffers, you hopefully have some way of giving the OS > > the memory back. As compared with the previous two approaches, this > > seems less likely to be noticeable to most PG code. Problems include > > (1) you have to somehow figure out how much address space to reserve, > > and that forms an upper bound on how big shared_buffers can grow at > > runtime and (2) you have to figure out ways to reserve address space > > and back more or less of it with physical memory that will work on all > > of the platforms that we currently support or might want to support in > > the future. > > A variant of this approach: > > 5. Re-map the shared_buffers when needed. > > Between transactions, a backend should not hold any buffer pins. When > there are no pins, you can munmap() the shared_buffers and mmap() it at > a different address. This can quite realistically fail to find an unused memory region of sufficient size when the heap is sufficiently fragmented, e.g. through ASLR, which would make it difficult to use this dynamic single-allocation shared_buffers in security-hardened environments. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Hi, On 2024-02-17 23:40:51 +0100, Matthias van de Meent wrote: > > 5. Re-map the shared_buffers when needed. > > > > Between transactions, a backend should not hold any buffer pins. When > > there are no pins, you can munmap() the shared_buffers and mmap() it at > > a different address. I hadn't quite realized that we don't seem to rely on shared_buffers having a specific address across processes. That does seem to make it more viable to remap mappings in backends. However, I don't think this works with mmap(MAP_ANONYMOUS) - as long as we are using the process model. To my knowledge there is no way to get the same mapping in multiple already existing processes. Even mmap()ing /dev/zero after sharing file descriptors across processes doesn't work, if I recall correctly. We would have to use sysv/posix shared memory or such (or mmap() of files in tmpfs) for the shared buffers allocation. > This can quite realistically fail to find an unused memory region of > sufficient size when the heap is sufficiently fragmented, e.g. through > ASLR, which would make it difficult to use this dynamic > single-allocation shared_buffers in security-hardened environments. I haven't seen anywhere close to this bad fragmentation on 64bit machines so far - have you? Most implementations of ASLR randomize mmap locations across multiple runs of the same binary, not within the same binary. There are out-of-tree Linux patches that make mmap() randomize every single allocation, but I am not sure that we ought to care about such things. Even if we were to care, on 64bit platforms it doesn't seem likely that we'd run out of space that quickly. AMD64 had 48bits of virtual address space from the start, and on recent CPUs that has grown to 57bits [1], that's a lot of space. And if you do run out of VM space, wouldn't that also affect lots of other things, like mmap() for malloc? Greetings, Andres Freund [1] https://en.wikipedia.org/wiki/Intel_5-level_paging
On Sat, Feb 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote: > IMO the ability to *shrink* shared_buffers dynamically and cheaply is more > important than growing it in a way, except that they are related of > course. Idling hardware is expensive, thus overcommitting hardware is very > attractive (I count "serverless" as part of that). To be able to overcommit > effectively, unused long-lived memory has to be released. I.e. shared buffers > needs to be shrinkable. I see your point, but people want to scale up, too. Of course, those people will have to live with what we can practically implement. > Perhaps worth noting that there are two things limiting the size of shared > buffers: 1) the available buffer space 2) the available buffer *mapping* > space. I think making the buffer mapping resizable is considerably harder than > the buffers themselves. Of course pre-reserving memory for a buffer mapping > suitable for a huge shared_buffers is more feasible than pre-allocating all > that memory for the buffers themselves. But it'd still mean you'd have a maximum > set at server start. We size the fsync queue based on shared_buffers too. That's a lot less important, though, and could be worked around in other ways. > Such a scheme still leaves you with a dependent memory read for a quite > frequent operation. It could turn out to not matter hugely if the mapping > array is cache resident, but I don't know if we can realistically bank on > that. I don't know, either. I was hoping you did. :-) But we can rig up a test pretty easily, I think. We can just create a fake mapping that gives the same answers as the current calculation and then beat on it. Of course, if testing shows no difference, there is the small problem of knowing whether the test scenario was right; and it's also possible that an initial impact could be mitigated by removing some gratuitously repeated buffer # -> buffer address mappings. Still, I think it could provide us with a useful baseline. I'll throw something together when I have time, unless someone beats me to it. > I'm also somewhat concerned about the coarse granularity being problematic. It > seems like it'd lead to a desire to make the granule small, causing slowness. How many people set shared_buffers to something that's not a whole number of GB these days? I mean I bet it happens, but in practice if you rounded to the nearest GB, or even the nearest 2GB, I bet almost nobody would really care. I think it's fine to be opinionated here and hold the line at a relatively large granule, even though in theory people could want something else. Alternatively, maybe there could be a provision for the last granule to be partial, and if you extend further, you throw away the partial granule and replace it with a whole one. But I'm not even sure that's worth doing. > One big advantage of a scheme like this is that it'd be a step towards a NUMA > aware buffer mapping and replacement. Practically everything beyond the size > of a small consumer device these days has NUMA characteristics, even if not > "officially visible". We could make clock sweeps (or a better victim buffer > selection algorithm) happen within each "chunk", with some additional > infrastructure to choose which of the chunks to search a buffer in. Using a > chunk on the current NUMA node, except when there is a lot of imbalance > between buffer usage or replacement rate between chunks.
I also wondered whether this might be a useful step toward allowing different-sized buffers in the same buffer pool (ducks, runs away quickly). I don't have any particular use for that myself, but it's a thing some people probably want for some reason or other. > > 2. Make a Buffer just a disguised pointer. Imagine something like > > typedef struct { Page bp; } *buffer. With this approach, > > BufferGetBlock() becomes trivial. > > You additionally need something that allows for efficient iteration over > all shared buffers. Making buffer replacement and checkpointing more expensive > isn't great. True, but I don't really see what the problem with this would be in this approach. > > 3. Reserve lots of address space and then only use some of it. I hear > > rumors that some forks of PG have implemented something like this. The > > idea is that you convince the OS to give you a whole bunch of address > > space, but you try to avoid having all of it be backed by physical > > memory. If you later want to increase shared_buffers, you then get the > > OS to back more of it by physical memory, and if you later want to > > decrease shared_buffers, you hopefully have some way of giving the OS > > the memory back. As compared with the previous two approaches, this > > seems less likely to be noticeable to most PG code. > > Another advantage is that you can shrink shared buffers fairly granularly and > cheaply with that approach, compared to having to move buffers entirely out of > a larger mapping to be able to unmap it. Don't you have to still move buffers entirely out of the region you want to unmap? > > Problems include (1) you have to somehow figure out how much address space > > to reserve, and that forms an upper bound on how big shared_buffers can grow > > at runtime, and > > Presumably you'd normally not want to reserve more than the physical amount of > memory on the system. Sure, memory can be hot added, but IME that's quite > rare. I would think that might not be so rare in a virtualized environment, which would seem to be one of the most important use cases for this kind of thing. Plus, this would mean we'd need to auto-detect system RAM. I'd rather not go there, and just fix the upper limit via a GUC. > > (2) you have to figure out ways to reserve address space and > > back more or less of it with physical memory that will work on all of the > > platforms that we currently support or might want to support in the future. > > We also could decide to only implement 2) on platforms with suitable APIs. Yep, fair. > A third issue is that it can confuse administrators inspecting the system with > OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT > being huge etc. Mmph. That's disagreeable but probably not a reason to entirely abandon any particular approach. -- Robert Haas EDB: http://www.enterprisedb.com
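For what it's worth, the "fake mapping" test Robert describes above could start as something as small as this standalone sketch: the same pseudo-random buffer numbers are translated either with today's flat arithmetic or through a chunk lookup table that yields identical (fake) addresses. The constants and names are invented, and a tiny chunk table like this is of course cache resident, which is exactly the favourable case Andres mentions; a realistic test would live inside PostgreSQL itself.

```c
/* Build with: cc -O2 bench.c            (flat arithmetic)
 *         or: cc -O2 -DCHUNKED bench.c  (chunked lookup)             */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BLCKSZ      8192
#define NBUFFERS    (1 << 20)      /* 8GB worth of buffers */
#define CHUNK_SIZE  (1 << 17)      /* one chunk per GB */

static uintptr_t base = (uintptr_t) 1 << 32;   /* fake shared_buffers base (64-bit) */
static uintptr_t chunk_map[NBUFFERS / CHUNK_SIZE];

int
main(void)
{
    uintptr_t    sum = 0;
    unsigned int x = 12345;
    clock_t      t0;

    for (int i = 0; i < NBUFFERS / CHUNK_SIZE; i++)
        chunk_map[i] = base + (uintptr_t) i * CHUNK_SIZE * BLCKSZ;

    t0 = clock();
    for (long i = 0; i < 100000000L; i++)
    {
        unsigned int buf;

        x = x * 1103515245u + 12345u;          /* cheap PRNG for buffer numbers */
        buf = x % NBUFFERS;
#ifdef CHUNKED
        sum += chunk_map[buf / CHUNK_SIZE] + (uintptr_t) (buf % CHUNK_SIZE) * BLCKSZ;
#else
        sum += base + (uintptr_t) buf * BLCKSZ;
#endif
    }
    printf("checksum %lu, %.2f s\n", (unsigned long) sum,
           (double) (clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}
```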
On Sat, Feb 17, 2024 at 1:54 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > A variant of this approach: > > 5. Re-map the shared_buffers when needed. > > Between transactions, a backend should not hold any buffer pins. When > there are no pins, you can munmap() the shared_buffers and mmap() it at > a different address. I really like this idea, but I think Andres has latched onto the key issue, which is that it supposes that the underlying shared memory object upon which shared_buffers is based can be made bigger and smaller, and that doesn't work for anonymous mappings AFAIK. Maybe that's not really a problem any more, though. If we don't depend on the address of shared_buffers anywhere, we could move it into a DSM. Now that the stats collector uses DSM, it's surely already a requirement that DSM works on every machine that runs PostgreSQL. We'd still need to do something about the buffer mapping table, though, and I bet dshash is not a reasonable answer on performance grounds. Also, it would be nice if the granularity of resizing could be something less than a whole transaction, because transactions can run for a long time. We don't really need to wait for a transaction boundary, probably -- a time when we hold zero buffer pins will probably happen a lot sooner, and at least some of those should be safe points at which to remap. Then again, somebody can open a cursor, read from it until it holds a pin, and then either idle the connection or make it do arbitrary amounts of unrelated work, forcing the remapping to be postponed for an arbitrarily long time. But some version of this problem will exist in any approach to this problem, and long-running pins are a nuisance for other reasons, too. We probably just have to accept this sort of issue as a limitation of our implementation. -- Robert Haas EDB: http://www.enterprisedb.com
If you are interested, this is my attempt to implement resizable shared buffers based on ballooning:
https://github.com/knizhnik/postgres/pull/2
Unused memory is returned to the OS using `madvise` (so it is not a very portable solution).
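A minimal illustration of that ballooning step, not taken from the actual patch (the helper and its arguments are invented): the buffer pool stays mapped at its maximum size, and the pages backing the currently unused tail are handed back via madvise. Note that for a shared mapping MADV_DONTNEED mainly drops this process's pages; on Linux MADV_REMOVE is what actually frees the backing store, which is part of why this is not very portable.

```c
/*
 * Hypothetical helper: release buffers [available, nbuffers) of a buffer
 * pool mapped at 'blocks' back to the kernel (Linux-specific behaviour).
 */
#include <stddef.h>
#include <sys/mman.h>

#define BLCKSZ 8192

static int
release_unused_buffers(char *blocks, size_t available, size_t nbuffers)
{
    if (available >= nbuffers)
        return 0;                   /* nothing ballooned out */

    return madvise(blocks + available * BLCKSZ,
                   (nbuffers - available) * BLCKSZ,
                   MADV_DONTNEED);  /* or MADV_REMOVE to free shared backing */
}
```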
Unfortunately there are many data structures in Postgres whose size depends on the number of buffers.
In my PR I am using the `GetAvailableBuffers()` function instead of `NBuffers`. But it doesn't always help, because many of these data structures can not be reallocated.
Other important limitations of this approach are:
1. It is necessary to specify a maximal number of shared buffers.
2. Only the `BufferBlocks` space is shrunk, not the buffer descriptors and buffer hash. The estimated memory footprint per page is 132 bytes, so if we want to be able to scale shared buffers from 100MB to 100GB, about 1.6GB of memory stays in use. And that is quite large.
3. Our CLOCK algorithm becomes very inefficient for a large number of shared buffers.
Below are the first results I get (pgbench database with scale 100, pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S):
| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128MB             | 130k |
| 2GB            | 1GB               | 311k |
| 32GB           | 128MB             | 13k  |
| 32GB           | 1GB               | 140k |
| 32GB           | 2GB               | 348k |
`shared_buffers` specifies the maximal shared buffers size, and `available_buffers` the current limit.
So when shared_buffers >> available_buffers and the dataset doesn't fit in them, we get an awful performance degradation (> 20 times), thanks to the CLOCK algorithm.
My first thought was to replace the clock with an LRU based on a doubly linked list. As there is no lockless doubly-linked-list implementation, it needs some global lock, and this lock can become a bottleneck. The standard solution is partitioning: use N LRU lists instead of 1, just like the partitioned hash table used by the buffer manager to look up buffers. Actually, we can use the same partition locks to protect the LRU lists. But it is not clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to optimize the clock to skip reserved buffers more efficiently.
I just added a skip_count field to the buffer descriptor. And it helps! Now the worst case, shared_buffers/available_buffers = 32GB/128MB, shows the same performance, 280k TPS, as shared_buffers=128MB without ballooning.
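Purely to illustrate the general idea (this is not the code from the linked patch, and the descriptor layout is invented): when the clock hand lands on a ballooned-out buffer, a per-descriptor skip_count lets it jump over the whole unavailable run in one step instead of visiting every reserved buffer.

```c
/* Simplified clock sweep that jumps over ballooned-out buffers. */
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    bool     available;     /* false once the buffer is ballooned out */
    uint32_t skip_count;    /* >= 1: buffers to jump over when unavailable */
    uint32_t usage_count;
} FakeBufferDesc;

static FakeBufferDesc descs[1024];
static uint32_t nbuffers = 1024;
static uint32_t clock_hand = 0;

/* Assumes at least one buffer is still available. */
static uint32_t
clock_sweep_next_victim(void)
{
    for (;;)
    {
        FakeBufferDesc *d = &descs[clock_hand];

        if (!d->available)
        {
            /* Skip the entire run of reserved buffers in one step. */
            clock_hand = (clock_hand + d->skip_count) % nbuffers;
        }
        else if (d->usage_count > 0)
        {
            d->usage_count--;
            clock_hand = (clock_hand + 1) % nbuffers;
        }
        else
            return clock_hand;      /* found a victim */
    }
}
```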
On Sun, 18 Feb 2024 at 02:03, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2024-02-17 23:40:51 +0100, Matthias van de Meent wrote: > > > 5. Re-map the shared_buffers when needed. > > > > > > Between transactions, a backend should not hold any buffer pins. When > > > there are no pins, you can munmap() the shared_buffers and mmap() it at > > > a different address. > > I hadn't quite realized that we don't seem to rely on shared_buffers having a > specific address across processes. That does seem to make it more viable to > remap mappings in backends. > > > However, I don't think this works with mmap(MAP_ANONYMOUS) - as long as we are > using the process model. To my knowledge there is no way to get the same > mapping in multiple already existing processes. Even mmap()ing /dev/zero after > sharing file descriptors across processes doesn't work, if I recall correctly. > > We would have to use sysv/posix shared memory or such (or mmap() of files in > tmpfs) for the shared buffers allocation. > > > > > This can quite realistically fail to find an unused memory region of > > sufficient size when the heap is sufficiently fragmented, e.g. through > > ASLR, which would make it difficult to use this dynamic > > single-allocation shared_buffers in security-hardened environments. > > I haven't seen anywhere close to this bad fragmentation on 64bit machines so > far - have you? No. > Most implementations of ASLR randomize mmap locations across multiple runs of > the same binary, not within the same binary. There are out-of-tree Linux > patches that make mmap() randomize every single allocation, but I am not sure > that we ought to care about such things. After looking into ASLR a bit more, I realise I was under the mistaken impression that ASLR would imply randomized mmap()s, too. Apparently, that's wrong; ASLR only does some randomization for the initialization of the process memory layout, and not the process' allocations. > Even if we were to care, on 64bit platforms it doesn't seem likely that we'd > run out of space that quickly. AMD64 had 48bits of virtual address space from > the start, and on recent CPUs that has grown to 57bits [1], that's a lot of > space. Yeah, that's a lot of space, but it seems to me it's also easily consumed; one only needs to place one allocation in every 4GB of address space to make allocations of 8GB impossible; a utilization of ~1 byte/MiB. Applying this to 48 bits of virtual address space, a process only needs to use ~256MB of memory across the address space to block out any 8GB allocations; for 57 bits that's still "only" 128GB. But after looking at ASLR a bit more, it is unrealistic that a normal OS and process stack would get to allocating memory in such a pattern. > And if you do run out of VM space, wouldn't that also affect lots of other > things, like mmap() for malloc? Yes. But I would usually expect that the main shared memory allocation would be the single largest uninterrupted allocation, so I'd also expect it to see more such issues than any current user of memory if we were to start moving (reallocating) that allocation. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Hi, On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > On Sat, Feb 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote: > > IMO the ability to *shrink* shared_buffers dynamically and cheaply is more > > important than growing it in a way, except that they are related of > > course. Idling hardware is expensive, thus overcommitting hardware is very > > attractive (I count "serverless" as part of that). To be able to overcommit > > effectively, unused long-lived memory has to be released. I.e. shared buffers > > needs to be shrinkable. > > I see your point, but people want to scale up, too. Of course, those > people will have to live with what we can practically implement. Sure, I didn't intend to say that scaling up isn't useful. > > Perhaps worth noting that there are two things limiting the size of shared > > buffers: 1) the available buffer space 2) the available buffer *mapping* > > space. I think making the buffer mapping resizable is considerably harder than > > the buffers themselves. Of course pre-reserving memory for a buffer mapping > > suitable for a huge shared_buffers is more feasible than pre-allocating all > > that memory for the buffers themselves. But it'd still mean you'd have a maximum > > set at server start. > > We size the fsync queue based on shared_buffers too. That's a lot less > important, though, and could be worked around in other ways. We probably should address that independently of making shared_buffers PGC_SIGHUP. The queue gets absurdly large once s_b hits a few GB. It's not that much memory compared to the buffer blocks themselves, but a sync queue of many millions of entries just doesn't make sense. And a few hundred MB for that isn't nothing either, even if it's just a fraction of the space for the buffers. It makes checkpointer more susceptible to OOM as well, because AbsorbSyncRequests() allocates an array to copy all requests into local memory. > > Such a scheme still leaves you with a dependent memory read for a quite > > frequent operation. It could turn out to not matter hugely if the mapping > > array is cache resident, but I don't know if we can realistically bank on > > that. > > I don't know, either. I was hoping you did. :-) > > But we can rig up a test pretty easily, I think. We can just create a > fake mapping that gives the same answers as the current calculation > and then beat on it. Of course, if testing shows no difference, there > is the small problem of knowing whether the test scenario was right; > and it's also possible that an initial impact could be mitigated by > removing some gratuitously repeated buffer # -> buffer address > mappings. Still, I think it could provide us with a useful baseline. > I'll throw something together when I have time, unless someone beats > me to it. I think such a test would be useful, although I also don't know how confident we would be if we saw positive results. Probably depends a bit on the generated code and how plausible it is to not see regressions. > > I'm also somewhat concerned about the coarse granularity being problematic. It > > seems like it'd lead to a desire to make the granule small, causing slowness. > > How many people set shared_buffers to something that's not a whole > number of GB these days? I'd say the vast majority of postgres instances in production run with less than 1GB of s_b. Just because, numbers-wise, the majority of instances are running on small VMs and/or many PG instances are running on one larger machine.
There are a lot of instances where the total available memory is less than 2GB. > I mean I bet it happens, but in practice if you rounded to the nearest GB, > or even the nearest 2GB, I bet almost nobody would really care. I think it's > fine to be opinionated here and hold the line at a relatively large granule, > even though in theory people could want something else. I don't believe that at all unfortunately. > > One big advantage of a scheme like this is that it'd be a step towards a NUMA > > aware buffer mapping and replacement. Practically everything beyond the size > > of a small consumer device these days has NUMA characteristics, even if not > > "officially visible". We could make clock sweeps (or a better victim buffer > > selection algorithm) happen within each "chunk", with some additional > > infrastructure to choose which of the chunks to search a buffer in. Using a > > chunk on the current NUMA node, except when there is a lot of imbalance > > between buffer usage or replacement rate between chunks. > > I also wondered whether this might be a useful step toward allowing > different-sized buffers in the same buffer pool (ducks, runs away > quickly). I don't have any particular use for that myself, but it's a > thing some people probably want for some reason or other. I still think that that's something that will just cause a significant cost in complexity, and secondarily also runtime overhead, at a comparatively marginal gain. > > > 2. Make a Buffer just a disguised pointer. Imagine something like > > > typedef struct { Page bp; } *buffer. With this approach, > > > BufferGetBlock() becomes trivial. > > > > You additionally need something that allows for efficient iteration over > > all shared buffers. Making buffer replacement and checkpointing more expensive > > isn't great. > > True, but I don't really see what the problem with this would be in > this approach. It's a bit hard to tell at this level of detail :). At the extreme end, if you end up with a large number of separate allocations for s_b, it surely would. > > > 3. Reserve lots of address space and then only use some of it. I hear > > > rumors that some forks of PG have implemented something like this. The > > > idea is that you convince the OS to give you a whole bunch of address > > > space, but you try to avoid having all of it be backed by physical > > > memory. If you later want to increase shared_buffers, you then get the > > > OS to back more of it by physical memory, and if you later want to > > > decrease shared_buffers, you hopefully have some way of giving the OS > > > the memory back. As compared with the previous two approaches, this > > > seems less likely to be noticeable to most PG code. > > > > Another advantage is that you can shrink shared buffers fairly granularly and > > cheaply with that approach, compared to having to move buffers entirely out of > > a larger mapping to be able to unmap it. > > Don't you have to still move buffers entirely out of the region you > want to unmap? Sure. But you can unmap at the granularity of a hardware page (there is some fragmentation cost on the OS / hardware page table level though). Theoretically you could unmap individual 8kB pages. > > > Problems include (1) you have to somehow figure out how much address space > > > to reserve, and that forms an upper bound on how big shared_buffers can grow > > > at runtime, and > > > > Presumably you'd normally not want to reserve more than the physical amount of > > memory on the system.
Sure, memory can be hot added, but IME that's quite > > rare. > > I would think that might not be so rare in a virtualized environment, > which would seem to be one of the most important use cases for this > kind of thing. I've not seen it in production in a long time - but that might be because I've been out of the consulting game for too long. To my knowledge none of the common cloud providers support it, which of course restricts where it could be used significantly. I have far more commonly seen use of "ballooning" to remove unused/rarely used memory from running instances though. > Plus, this would mean we'd need to auto-detect system RAM. I'd rather > not go there, and just fix the upper limit via a GUC. I'd have assumed we'd want a GUC that auto-determines the amount of RAM if set to -1. I don't think it's that hard to detect the available memory. > > A third issue is that it can confuse administrators inspecting the system with > > OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT > > being huge etc. > > Mmph. That's disagreeable but probably not a reason to entirely > abandon any particular approach. Agreed. Greetings, Andres Freund
On Mon, Feb 19, 2024 at 2:05 AM Andres Freund <andres@anarazel.de> wrote: > We probably should address that independently of making shared_buffers > PGC_SIGHUP. The queue gets absurdly large once s_b hits a few GB. It's not > that much memory compared to the buffer blocks themselves, but a sync queue of > many millions of entries just doesn't make sense. And a few hundred MB for > that isn't nothing either, even if it's just a fraction of the space for the > buffers. It makes checkpointer more susceptible to OOM as well, because > AbsorbSyncRequests() allocates an array to copy all requests into local > memory. Sure, that could just be capped, if it makes sense. Although given the thrust of this discussion, it might be even better to couple it to something other than the size of shared_buffers. > I'd say the vast majority of postgres instances in production run with less > than 1GB of s_b. Just because numbers wise the majority of instances are > running on small VMs and/or many PG instances are running on one larger > machine. There are a lot of instances where the total available memory is > less than 2GB. Whoa. That is not my experience at all. If I've ever seen such a small system since working at EDB (since 2010!) it was just one where the initdb-time default was never changed. I can't help wondering if we should have some kind of memory_model GUC, measured in T-shirt sizes or something. We've coupled a bunch of things to shared_buffers mostly as a way of distinguishing small systems from large ones. But if we want to make shared_buffers dynamically changeable and we don't want to make all that other stuff dynamically changeable, decoupling those calculations might be an important thing to do. On a really small system, do we even need the ability to dynamically change shared_buffers at all? If we do, then I suspect the granule needs to be small. But does someone want to take a system with <1GB of shared_buffers and then scale it way, way up? I suppose it would be nice to have the option. But you might have to make some choices, like pick either a 16MB granule or a 128MB granule or a 1GB granule at startup time and then stick with it? I don't know, I'm just spitballing here, because I don't know what the real design is going to look like yet. > > Don't you have to still move buffers entirely out of the region you > > want to unmap? > > Sure. But you can unmap at the granularity of a hardware page (there is some > fragmentation cost on the OS / hardware page table level > though). Theoretically you could unmap individual 8kB pages. I thought there were problems, at least on some operating systems, if the address space mappings became too fragmented. At least, I wouldn't expect that you could use huge pages for shared_buffers and still unmap little tiny bits. How would that even work? -- Robert Haas EDB: http://www.enterprisedb.com
On 2/18/24 15:35, Andres Freund wrote: > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: >> How many people set shared_buffers to something that's not a whole >> number of GB these days? > > I'd say the vast majority of postgres instances in production run with less > than 1GB of s_b. Just because numbers wise the majority of instances are > running on small VMs and/or many PG instances are running on one larger > machine. There are a lot of instances where the total available memory is > less than 2GB. > >> I mean I bet it happens, but in practice if you rounded to the nearest GB, >> or even the nearest 2GB, I bet almost nobody would really care. I think it's >> fine to be opinionated here and hold the line at a relatively large granule, >> even though in theory people could want something else. > > I don't believe that at all unfortunately. Couldn't we scale the rounding, e.g. allow small allocations as we do now, but above some number always round? E.g. maybe >= 2GB round to the nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, etc? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > On 2/18/24 15:35, Andres Freund wrote: > > On 2024-02-18 17:06:09 +0530, Robert Haas wrote: > > > How many people set shared_buffers to something that's not a whole > > > number of GB these days? > > > > I'd say the vast majority of postgres instances in production run with less > > than 1GB of s_b. Just because numbers wise the majority of instances are > > running on small VMs and/or many PG instances are running on one larger > > machine. There are a lot of instances where the total available memory is > > less than 2GB. > > > > > I mean I bet it happens, but in practice if you rounded to the nearest GB, > > > or even the nearest 2GB, I bet almost nobody would really care. I think it's > > > fine to be opinionated here and hold the line at a relatively large granule, > > > even though in theory people could want something else. > > > > I don't believe that at all unfortunately. > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > but above some number always round? E.g. maybe >= 2GB round to the nearest > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > etc? That'd make the translation considerably more expensive. Which is important, given how common an operation this is. Greetings, Andres Freund
On 2/19/24 13:13, Andres Freund wrote: > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: >> Couldn't we scale the rounding, e.g. allow small allocations as we do now, >> but above some number always round? E.g. maybe >= 2GB round to the nearest >> 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, >> etc? > > That'd make the translation considerably more expensive. Which is important, > given how common an operation this is. Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my intent was that the rounding be done/enforced when setting the GUC value which surely cannot be that often. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
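Just to spell out what Joe seems to be describing, a sketch of the rounding as it might run once when the GUC value is set (the function name, units, and thresholds are invented, following his example):

```c
/*
 * Hypothetical scaled rounding for shared_buffers, applied when the value is
 * set rather than in the buffer-lookup path.  Values are in 8kB buffers.
 */
static int
round_shared_buffers(int nbuffers)
{
    const int gb = (1024 * 1024 * 1024) / 8192;  /* buffers per GB */
    int       granule;

    if (nbuffers >= 8 * gb)
        granule = gb;                            /* >= 8GB: nearest 1GB */
    else if (nbuffers >= 4 * gb)
        granule = gb / 2;                        /* >= 4GB: nearest 512MB */
    else if (nbuffers >= 2 * gb)
        granule = gb / 4;                        /* >= 2GB: nearest 256MB */
    else
        return nbuffers;                         /* small settings stay as-is */

    return ((nbuffers + granule / 2) / granule) * granule;
}
```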
Hi, On 2024-02-19 13:54:01 -0500, Joe Conway wrote: > On 2/19/24 13:13, Andres Freund wrote: > > On 2024-02-19 09:19:16 -0500, Joe Conway wrote: > > > Couldn't we scale the rounding, e.g. allow small allocations as we do now, > > > but above some number always round? E.g. maybe >= 2GB round to the nearest > > > 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB, > > > etc? > > > > That'd make the translation considerably more expensive. Which is important, > > given how common an operation this is. > > > Perhaps it is not practical, doesn't help, or maybe I misunderstand, but my > intent was that the rounding be done/enforced when setting the GUC value > which surely cannot be that often. It'd be used for something like WhereIsTheChunkOfBuffers[buffer/CHUNK_SIZE]+(buffer%CHUNK_SIZE)*BLCKSZ; If CHUNK_SIZE isn't a compile-time constant this gets a good bit more expensive. A lot more, if implemented naively (i.e. as actual modulo/division operations, instead of translating to shifts and masks). Greetings, Andres Freund
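To illustrate the point: with a power-of-two CHUNK_SIZE known at compile time, the compiler lowers the division and modulo to a shift and a mask, whereas a runtime granule needs actual divide instructions (or separately maintained shift/mask values). The names below are made up.

```c
#define BLCKSZ     8192
#define CHUNK_SIZE 131072                  /* buffers per 1GB chunk, power of two */

extern char *chunk_map[];
extern int   chunk_size_runtime;           /* what a settable granule would be */

/* Compiles to shift + mask + constant multiply + add. */
char *
translate_constant(int buf)
{
    return chunk_map[buf / CHUNK_SIZE] + (long) (buf % CHUNK_SIZE) * BLCKSZ;
}

/* Needs integer division (and a derived remainder) at runtime. */
char *
translate_runtime(int buf)
{
    return chunk_map[buf / chunk_size_runtime] +
        (long) (buf % chunk_size_runtime) * BLCKSZ;
}
```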