Thread: dynamically allocating chunks from shared memory
Hi,

For quite some time, I've been under the impression that there's still one disadvantage left of using processes instead of threads: we can only use statically sized chunks of shared memory. Every component that wants to use shared memory needs to pre-allocate whatever it thinks is sufficient. It cannot enlarge its share, nor can unused memory be allocated to other components.

Having written a very primitive kind of dynamic memory allocator for imessages [1], I've always wanted a better alternative. So I've investigated a bit, refactored step by step, and finally came up with the attached, lock-based dynamic shared memory allocator. Its interface is as simple as malloc() and free(). A restart of the postmaster should truncate the whole area.

Since it is itself a component that needs to pre-allocate its area of shared memory in advance, you need to define a maximum size for the pool of dynamically allocatable memory. That's currently defined in shmem.h instead of a GUC.

This kind of feature has been requested at the Tokyo Clustering Meeting (by myself) in 2009 and is listed on the Wiki [2].

I'm now using that allocator as the basis for a reworked imessages patch, which I've attached as well. Both are tested as a basis for Postgres-R. While I think other components could use this dynamic memory allocator too, I didn't write any code for that. Imessages are currently the only available user. (So please apply the dynshmem patch first, then imessages.)

Comments?

Greetings from Oxford, and thanks to Joachim Wieland for providing me the required Internet connectivity ;-)

Markus Wanner

[1]: Postgres-R: internal messages
http://archives.postgresql.org/message-id/4886DB0B.1090508@bluegap.ch

[2]: Mentioned Cluster Feature
http://wiki.postgresql.org/wiki/ClusterFeatures#Dynamic_shared_memory_allocation

For git addicts: here's a git repository with both patches applied:
http://git.postgres-r.org/?p=imessages;a=summary
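A rough, self-contained sketch of the idea being proposed here, a malloc()/free()-style interface over a fixed-size, pre-allocated pool guarded by a lock. This is only an illustration with made-up names (a naive first-fit allocator using a pthread mutex, with alignment and coalescing omitted); the actual patch uses PostgreSQL spinlocks and the far more elaborate wamalloc algorithm.

#include <stddef.h>
#include <pthread.h>

#define POOL_SIZE (1024 * 1024)        /* the pre-sized maximum, cf. shmem.h */

typedef struct Block
{
    size_t        size;                /* payload size in bytes */
    int           is_free;
    struct Block *next;
} Block;

static char            pool[POOL_SIZE];   /* would live in shared memory */
static Block          *head;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

void
pool_init(void)
{
    head = (Block *) pool;
    head->size = POOL_SIZE - sizeof(Block);
    head->is_free = 1;
    head->next = NULL;
}

void *
pool_alloc(size_t size)
{
    Block  *b;
    void   *result = NULL;

    pthread_mutex_lock(&pool_lock);
    for (b = head; b != NULL; b = b->next)
    {
        if (!b->is_free || b->size < size)
            continue;

        /* split the block if the remainder can hold another header */
        if (b->size > size + sizeof(Block))
        {
            Block *rest = (Block *) ((char *) (b + 1) + size);

            rest->size = b->size - size - sizeof(Block);
            rest->is_free = 1;
            rest->next = b->next;
            b->size = size;
            b->next = rest;
        }
        b->is_free = 0;
        result = (void *) (b + 1);
        break;
    }
    pthread_mutex_unlock(&pool_lock);

    return result;                     /* NULL when the pool is exhausted */
}

void
pool_free(void *ptr)
{
    Block  *b = ((Block *) ptr) - 1;

    pthread_mutex_lock(&pool_lock);
    b->is_free = 1;                    /* no coalescing in this toy version */
    pthread_mutex_unlock(&pool_lock);
}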
Excerpts from Markus Wanner's message of vie jul 02 19:44:46 -0400 2010: > Having written a very primitive kind of a dynamic memory allocator for > imessages [1], I've always wanted a better alternative. So I've > investigated a bit, refactored step-by-step, and finally came up with > the attached, lock based dynamic shared memory allocator. Its interface > is as simple as malloc() and free(). A restart of the postmaster should > truncate the whole area. Interesting, thanks. I gave it a skim and found that it badly needs a lot more code comments. I'm also unconvinced that spinlocks are the best locking primitive here. Why not lwlocks? > Being a component which needs to pre-allocate its area in shared memory > in advance, you need to define a maximum size for the pool of > dynamically allocatable memory. That's currently defined in shmem.h > instead of a GUC. This should be an easy change; I agree that it needs to be configurable. I'm not sure what kind of resistance you'll see to the idea of a dynamically allocatable shmem area. Maybe we could use this in other areas such as allocating space for heavyweight lock objects. Right now the memory usage for them could grow due to a transitory increase in lock traffic, leading to out-of-memory conditions later in other modules. We've seen reports of that problem, so it'd be nice to be able to fix that with this infrastructure. I didn't look at the imessages patch (except to notice that I didn't very much like the handling of out-of-memory, but you already knew that).
On Tue, Jul 20, 2010 at 1:50 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > I'm not sure what kind of resistance you'll see to the idea of a > dynamically allocatable shmem area. Maybe we could use this in other > areas such as allocating space for heavyweight lock objects. Right now > the memory usage for them could grow due to a transitory increase in > lock traffic, leading to out-of-memory conditions later in other > modules. We've seen reports of that problem, so it'd be nice to be able > to fix that with this infrastructure. Well, you can't really fix that problem with this infrastructure, because this infrastructure only allows shared memory to be dynamically allocated from a pool set aside for such allocations in advance. If a surge in demand can exhaust all the heavyweight lock space in the system, it can also exhaust the shared pool from which more heavyweight lock space can be allocated. The failure might manifest itself in a totally different subsystem though, since the allocation that failed wouldn't necessarily be a heavyweight lock allocation, but some other allocation that failed as a result of space used by the heavyweight locks. It would be more interesting if you could expand (or contract) the size of shared memory as a whole while the system is up and running. Then, perhaps, max_locks_per_transaction and other, similar GUCs could be made PGC_SIGHUP, which would give you a way out of such situations that didn't involve taking down the entire cluster. I'm not too sure how to do that, though. With respect to imessages specifically, what is the motivation for using shared memory rather than something like an SLRU? The new LISTEN implementation uses an SLRU and handles variable-size messages, so it seems like it might be well-suited to this task. Incidentally, the link for the imessages patch on the CommitFest page points to http://archives.postgresql.org/message-id/ab0cd52a64e788f4ecb4515d1e6e4691@localhost - which is the dynamic shmem patch. So I'm not sure where to find the latest imessages patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hello Alvaro,

thank you for looking through this code.

On 07/20/2010 07:50 PM, Alvaro Herrera wrote:
> Interesting, thanks.
>
> I gave it a skim and found that it badly needs a lot more code comments.

Hm.. yeah, the dynshmem stuff could probably use more comments. (The bgworker stuff is probably a better example.)

> I'm also unconvinced that spinlocks are the best locking primitive here.
> Why not lwlocks?

It's derived from a completely lock-free algorithm, as proposed by Maged M. Michael in: Scalable Lock-Free Dynamic Memory Allocator. I dropped all of the CAS primitives with their surrounding retry loops and made further simplifications. Spinlocks simply looked like the simplest thing to fall back to. But yeah, splitting into read and write accesses and using lwlocks might be a win. Or it might not. I honestly don't know.

And it's probably not the best-performing allocator ever. But it's certainly better than nothing.

I did recently release the lock-free variant as well as a lock-based one, see http://www.bluegap.ch/projects/wamalloc/ for more information.

> I'm not sure what kind of resistance you'll see to the idea of a
> dynamically allocatable shmem area.

So far neither resistance nor applause. I'd love to hear more of an echo. Even if it's resistance.

> Maybe we could use this in other areas

..which is why I've published this separately from Postgres-R.

> such as allocating space for heavyweight lock objects. Right now
> the memory usage for them could grow due to a transitory increase in
> lock traffic, leading to out-of-memory conditions later in other
> modules. We've seen reports of that problem, so it'd be nice to be able
> to fix that with this infrastructure.

Maybe, yes. Sounds like a nice idea.

> I didn't look at the imessages patch (except to notice that I didn't
> very much like the handling of out-of-memory, but you already knew that).

As all of the allocation problem has now been ripped out, the imessages patch got quite a bit smaller. imsg.c now consists of only around 370 lines of code.

The handling of out-of-(shared)-memory situations could certainly be improved, yes. Note that I've already separated out an IMessageCreateInternal() method, which simply returns NULL in that case. Is that the API you'd prefer?

Getting back to the dynshmem stuff: I don't mind much *which* allocator to use. I also looked at jemalloc, but haven't been able to integrate it into Postgres. So I've extended my experiment with wamalloc and turned it into something usable for Postgres.

Regards

Markus Wanner
Hi,

On 07/20/2010 08:23 PM, Robert Haas wrote:
> Well, you can't really fix that problem with this infrastructure,

No, but it would allow you to make better use of the existing amount of shared memory, possibly avoiding the problem in certain scenarios.

> The failure might
> manifest itself in a totally different subsystem though, since the
> allocation that failed wouldn't necessarily be a heavyweight lock
> allocation, but some other allocation that failed as a result of space
> used by the heavyweight locks.

Yeah, that's a valid concern. Maybe it could be addressed by keeping track of dynshmem usage per module, and somehow informing the user about the usage pattern in case of OOM.

> It would be more interesting

Sure, but then you'd definitely need a dynamic allocator, no?

> With respect to imessages specifically, what is the motivation for
> using shared memory rather than something like an SLRU? The new
> LISTEN implementation uses an SLRU and handles variable-size messages,
> so it seems like it might be well-suited to this task.

Well, imessages predates the new LISTEN implementation by some moons. They are intended to replace (unix-ish) pipes between processes. I fail to see the immediate link between (S)LRU and inter-process message passing. It might be more useful for multiple LISTENers, but I bet it has slightly different semantics than imessages.

But to be honest, I don't know too much about the new LISTEN implementation. Do you think a loss-less (single)-process-to-(single)-process message passing system could be built on top of it?

> Incidentally, the link for the imessages patch on the CommitFest page
> points to http://archives.postgresql.org/message-id/ab0cd52a64e788f4ecb4515d1e6e4691@localhost
> - which is the dynamic shmem patch. So I'm not sure where to find the
> latest imessages patch.

The archive doesn't display attachments very well, but the imessages patch is part of that mail. Maybe you can still find it in your local mailbox? In the archive view, it starts at the line that says:

*** src/backend/storage/ipc/imsg.c dc149eef487eafb43409a78b8a33c70e7d3c2bfa

(and, well, the dynshmem stuff ends just before that line. Those were two .diff files attached, IIRC).

Regards

Markus Wanner
Excerpts from Markus Wanner's message of mar jul 20 14:36:55 -0400 2010: > > I'm also unconvinced that spinlocks are the best locking primitive here. > > Why not lwlocks? > > It's derived from a completely lock-free algorithm, as proposed by Maged > M. Michael in: Scalable Lock-Free Dynamic Memory Allocator. Hmm, deriving code from a paper published by IBM sounds like bad news -- who knows what patents they hold on the techniques there?
Hi,

On 07/20/2010 09:05 PM, Alvaro Herrera wrote:
> Hmm, deriving code from a paper published by IBM sounds like bad news --
> who knows what patents they hold on the techniques there?

Yeah, that might be an issue. Note, however, that the lock-based variant differs substantially from what's been published. And I sort of doubt their patents cover a lot of stuff that's not lock-free-ish.

But again, I'd also very much welcome any other allocator. In my opinion, it's the most annoying drawback of the process-based design compared to a threaded variant (from the perspective of a developer).

Regards

Markus Wanner
Excerpts from Markus Wanner's message of mar jul 20 14:54:42 -0400 2010: > > With respect to imessages specifically, what is the motivation for > > using shared memory rather than something like an SLRU? The new > > LISTEN implementation uses an SLRU and handles variable-size messages, > > so it seems like it might be well-suited to this task. > > Well, imessages predates the new LISTEN implementation by some moons. > They are intended to replace (unix-ish) pipes between processes. I fail > to see the immediate link between (S)LRU and inter-process message > passing. It might be more useful for multiple LISTENers, but I bet it > has slightly different semantics than imessages. I guess what Robert is saying is that you don't need shmem to pass messages around. The new LISTEN implementation was just an example. imessages aren't supposed to use it directly. Rather, the idea is to store the messages in a new SLRU area. Thus you don't need to mess with dynamically allocating shmem at all. > But to be honest, I don't know too much about the new LISTEN > implementation. Do you think a loss-less > (single)-process-to-(single)-process message passing system could be > built on top of it? I don't think you should build on top of LISTEN but of slru.c. This is probably more similar to multixact (see multixact.c) than to the new LISTEN implementation. I think it should be rather straightforward. There would be a unique append-point; each process desiring to send a new message to another backend would add a new message at that point. There would be one read pointer per backend, and it would be advanced as messages are consumed. Old segments could be trimmed as backends advance their read pointer, similar to how sinval queue is handled.
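A minimal sketch of the shared bookkeeping this design seems to imply; the names, types and sizes are made up for illustration only. The message bodies themselves would live in SLRU pages, so only the pointers need ordinary shared memory.

#include <stdint.h>

#define MAX_BACKENDS 128                        /* assumption for illustration */

typedef uint64_t MsgPointer;                    /* byte position within the SLRU area */

typedef struct ImsgQueueControl
{
    MsgPointer  append_point;                   /* unique append point; writers serialize here */
    MsgPointer  read_pointer[MAX_BACKENDS];     /* advanced as each backend consumes messages */
    MsgPointer  tail;                           /* oldest position still needed; SLRU segments
                                                 * behind it can be truncated */
} ImsgQueueControl;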
On Tue, Jul 20, 2010 at 5:46 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Excerpts from Markus Wanner's message of mar jul 20 14:54:42 -0400 2010:
>
>> > With respect to imessages specifically, what is the motivation for
>> > using shared memory rather than something like an SLRU? The new
>> > LISTEN implementation uses an SLRU and handles variable-size messages,
>> > so it seems like it might be well-suited to this task.
>>
>> Well, imessages predates the new LISTEN implementation by some moons.
>> They are intended to replace (unix-ish) pipes between processes. I fail
>> to see the immediate link between (S)LRU and inter-process message
>> passing. It might be more useful for multiple LISTENers, but I bet it
>> has slightly different semantics than imessages.
>
> I guess what Robert is saying is that you don't need shmem to pass
> messages around. The new LISTEN implementation was just an example.
> imessages aren't supposed to use it directly. Rather, the idea is to
> store the messages in a new SLRU area. Thus you don't need to mess with
> dynamically allocating shmem at all.

Right. I might be full of bull, but that's what I'm saying. :-)

>> But to be honest, I don't know too much about the new LISTEN
>> implementation. Do you think a loss-less
>> (single)-process-to-(single)-process message passing system could be
>> built on top of it?
>
> I don't think you should build on top of LISTEN but of slru.c. This is
> probably more similar to multixact (see multixact.c) than to the new
> LISTEN implementation.
>
> I think it should be rather straightforward. There would be a unique
> append-point; each process desiring to send a new message to another
> backend would add a new message at that point. There would be one read
> pointer per backend, and it would be advanced as messages are consumed.
> Old segments could be trimmed as backends advance their read pointer,
> similar to how sinval queue is handled.

If the messages are mostly unicast, it might be nice to contrive a method whereby backends didn't need to explicitly advance over messages destined only for other backends. Like maybe allocate a small, fixed amount of shared memory sufficient for two "pointers" into the SLRU area per backend, and then use the SLRU to store each message with a header indicating where the next message is to be found. For each backend, you store one pointer to the first queued message and one pointer to the last queued message. New messages can be added by making the current last message point to a newly added message and updating the last message pointer for that backend. You'd need to think about the locking and reference counting carefully to make sure you eventually freed up unused pages, but it seems like it might be doable.

Of course, if the messages are mostly multi/anycast, or if the rate of messaging is low enough that the aforementioned complexity is not worth bothering with, then, what you said.

One big advantage of attacking the problem with an SLRU is that there's no fixed upper limit on the amount of data that can be enqueued at any given time. You can spill to disk or whatever as needed (although hopefully you won't normally do so, for performance reasons).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
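A similarly hedged sketch of the per-backend variant Robert outlines, again with invented names: a small fixed array of head/tail pointers in plain shared memory, plus a header in front of each SLRU-resident message pointing to the recipient's next message.

#include <stdint.h>

#define MAX_BACKENDS 128                /* assumption for illustration */
#define INVALID_MSG  ((uint64_t) -1)

typedef struct ImsgHeader               /* stored in the SLRU, in front of each payload */
{
    uint64_t    next;                   /* position of this backend's next message,
                                         * or INVALID_MSG */
    uint32_t    payload_len;
} ImsgHeader;

typedef struct ImsgBackendQueue         /* one per backend, in plain shared memory */
{
    uint64_t    first;                  /* oldest unconsumed message, or INVALID_MSG */
    uint64_t    last;                   /* newest message; appending updates last->next
                                         * in the SLRU and then this field */
} ImsgBackendQueue;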
On 07/21/2010 01:52 AM, Robert Haas wrote:
> On Tue, Jul 20, 2010 at 5:46 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> I guess what Robert is saying is that you don't need shmem to pass
>> messages around. The new LISTEN implementation was just an example.
>> imessages aren't supposed to use it directly. Rather, the idea is to
>> store the messages in a new SLRU area. Thus you don't need to mess with
>> dynamically allocating shmem at all.

Okay, so I just need to grok the SLRU stuff. Thanks for clarifying.

Note that I sort of /want/ to mess with shared memory. It's what I know how to deal with. It's how threaded programs work as well. Ya know, locks, condition variables, mutexes, all those nice things that allow you to shoot your foot so terribly nicely... Oh, well...

>> I think it should be rather straightforward. There would be a unique
>> append-point;

Unique append-point? Sounds like what I had before. That'd be a step backwards, compared to the per-backend queue and an allocator that hopefully scales well with the number of CPU cores.

>> each process desiring to send a new message to another
>> backend would add a new message at that point. There would be one read
>> pointer per backend, and it would be advanced as messages are consumed.
>> Old segments could be trimmed as backends advance their read pointer,
>> similar to how sinval queue is handled.

That leads to pretty nasty fragmentation. A dynamic allocator should do much better in that regard. (Wamalloc certainly does.)

> If the messages are mostly unicast, it might be nice to contrive a
> method whereby backends didn't need to explicitly advance over
> messages destined only for other backends. Like maybe allocate a
> small, fixed amount of shared memory sufficient for two "pointers"
> into the SLRU area per backend, and then use the SLRU to store each
> message with a header indicating where the next message is to be
> found.

That's pretty much how imessages currently work. A single list of messages queued per backend.

> For each backend, you store one pointer to the first queued
> message and one pointer to the last queued message. New messages can
> be added by making the current last message point to a newly added
> message and updating the last message pointer for that backend. You'd
> need to think about the locking and reference counting carefully to
> make sure you eventually freed up unused pages, but it seems like it
> might be doable.

I've just read through slru.c, but still don't have a clue how it could replace a dynamic allocator.

At the moment, the creator of an imessage allocates memory, copies the payload there and then activates the message by appending it to the recipient's queue. Upon getting signaled, the recipient consumes the message by removing it from the queue and is obliged to release the memory the message occupies after having processed it. Simple and straightforward, IMO.

The queue addition and removal is clear. But how would I do the alloc/free part with SLRU? Its blocks are fixed size (BLCKSZ) and the API with ReadPage and WritePage is rather unlike a pair of alloc() and free().

> One big advantage of attacking the problem with an SLRU is that
> there's no fixed upper limit on the amount of data that can be
> enqueued at any given time. You can spill to disk or whatever as
> needed (although hopefully you won't normally do so, for performance
> reasons).

Yes, imessages shouldn't ever be spilled to disk. There naturally must be an upper limit for them (be it total available memory, as for threaded programs, or a given, size-constrained pool, as is the case for dynshmem).

To me it rather sounds like SLRU is a candidate for using dynamically allocated shared memory underneath, instead of allocating a fixed number of slots in advance. That would allow more efficient use of shared memory. (Given SLRU's ability to spill to disk, it could even be used to 'balance' out anomalies to some extent.)

Regards

Markus Wanner
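For comparison, the dynshmem-based lifecycle described above might look roughly like this. IMessageCreate is referred to elsewhere in this thread, but the other function names are guesses; the real imsg.c API may well differ.

/* Sender side: allocate from the dynamic pool, fill in the payload, activate. */
IMessage   *msg = IMessageCreate(recipient, payload_len);   /* may block or fail if
                                                             * the pool is exhausted */

memcpy(IMessageGetData(msg), payload, payload_len);          /* hypothetical accessor */
IMessageActivate(msg);          /* append to the recipient's queue and signal it */

/* Recipient side, after being signalled: detach, process, release. */
IMessage   *next = IMessageCheck();         /* hypothetical: pop my next message */

if (next != NULL)
{
    process_message(next);
    IMessageRemove(next);       /* hand the memory back to the dynamic pool */
}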
On Wed, Jul 21, 2010 at 4:33 AM, Markus Wanner <markus@bluegap.ch> wrote: > Okay, so I just need to grok the SLRU stuff. Thanks for clarifying. > > Note that I sort of /want/ to mess with shared memory. It's what I know how > to deal with. It's how threaded programs work as well. Ya know, locks, > conditional variables, mutexes, all those nice thing that allow you to shoot > your foot so terribly nicely... Oh, well... For what it's worth, I feel your pain. I think the SLRU method is *probably* better, but I feel your pain anyway. >> For each backend, you store one pointer to the first queued >> message and one pointer to the last queued message. New messages can >> be added by making the current last message point to a newly added >> message and updating the last message pointer for that backend. You'd >> need to think about the locking and reference counting carefully to >> make sure you eventually freed up unused pages, but it seems like it >> might be doable. > > I've just read through slru.c, but still don't have a clue how it could > replace a dynamic allocator. > > At the moment, the creator of an imessage allocs memory, copies the payload > there and then activates the message by appending it to the recipient's > queue. Upon getting signaled, the recipient consumes the message by removing > it from the queue and is obliged to release the memory the messages occupies > after having processed it. Simple and straight forward, IMO. > > The queue addition and removal is clear. But how would I do the alloc/free > part with SLRU? Its blocks are fixed size (BLCKSZ) and the API with ReadPage > and WritePage is rather unlike a pair of alloc() and free(). Given what you're trying to do, it does sound like you're going to need some kind of an algorithm for space management; but you'll be managing space within the SLRU rather than within shared_buffers. For example, you might end up putting a header on each SLRU page or segment and using that to track the available freespace within that segment for messages to be read and written. It'll probably be a bit more complex than the one for listen (see asyncQueueAddEntries). >> One big advantage of attacking the problem with an SLRU is that >> there's no fixed upper limit on the amount of data that can be >> enqueued at any given time. You can spill to disk or whatever as >> needed (although hopefully you won't normally do so, for performance >> reasons). > > Yes, imessages shouldn't ever be spilled to disk. There naturally must be an > upper limit for them. (Be it total available memory, as for threaded things > or a given and size-constrained pool, as is the case for dynshmem). I guess experience has taught me to be wary of things that are wired in memory. Under extreme memory pressure, something's got to give, or the whole system will croak. Consider also the contrary situation, where the imessages stuff is not in use (even for a short period of time, like a few minutes). Then we'd really rather not still have memory carved out for it. > To me it rather sounds like SLRU is a candidate for using dynamically > allocated shared memory underneath, instead of allocating a fixed amount of > slots in advance. That would allow more efficient use of shared memory. > (Given SLRU's ability to spill to disk, it could even be used to 'balance' > out anomalies to some extent). I think what would be even better is to merge the SLRU pools with the shared_buffer pool, so that the two can duke it out for who is in most need of the limited amount of memory available. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
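One conceivable shape for the per-page space management Robert sketches above; this is purely illustrative, and nothing like it exists in slru.c today.

#include <stdint.h>

typedef struct ImsgPageHeader           /* at the start of every imessages SLRU page */
{
    uint16_t    first_free;             /* offset of the first unused byte on the page */
    uint16_t    live_count;             /* messages still unconsumed; the page can be
                                         * recycled once this drops to zero */
} ImsgPageHeader;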
Hi,

First of all, thanks for your feedback; I enjoy the discussion.

On 07/21/2010 07:25 PM, Robert Haas wrote:
> Given what you're trying to do, it does sound like you're going to
> need some kind of an algorithm for space management; but you'll be
> managing space within the SLRU rather than within shared_buffers. For
> example, you might end up putting a header on each SLRU page or
> segment and using that to track the available freespace within that
> segment for messages to be read and written. It'll probably be a bit
> more complex than the one for listen (see asyncQueueAddEntries).

But what would that buy us? Also consider that pretty much all available dynamic allocators use shared memory (either from the OS directly, or via an mmap()'d area).

>> Yes, imessages shouldn't ever be spilled to disk. There naturally must be an
>> upper limit for them (be it total available memory, as for threaded programs,
>> or a given, size-constrained pool, as is the case for dynshmem).
>
> I guess experience has taught me to be wary of things that are wired
> in memory. Under extreme memory pressure, something's got to give, or
> the whole system will croak.

I absolutely agree with that last sentence. However, experience has taught /me/ to be wary of things that needlessly swap to disk for hours before reporting any kind of error (AKA swap hell). I prefer systems that adjust to the OOM condition, instead of just ignoring it and falling back to disk (which doesn't provide infinite space either, so that's just pushing the limits).

The solution for imessages certainly isn't spilling to disk, which would consume even more resources. Instead, the process(es) for which there are pending imessages should be allowed to consume them. That's why, upon OOM, IMessageCreate currently simply blocks the process that wants to create an imessage. And yes, that's not quite perfect (that process should still consume messages for itself), and it might not play well with other potential users of dynamically allocated memory. But it certainly works better than spilling to disk (and yes, I tested that behavior within Postgres-R).

> Consider also the contrary situation,
> where the imessages stuff is not in use (even for a short period of
> time, like a few minutes). Then we'd really rather not still have
> memory carved out for it.

Huh? That's exactly what dynamic allocation could give you: not having memory carved out for stuff you currently don't need, but instead being able to dynamically use memory where it's most needed. SLRU has memory (not disk space) carved out for pretty much every sub-system separately, if I'm reading that code correctly.

> I think what would be even better is to merge the SLRU pools with the
> shared_buffer pool, so that the two can duke it out for who is in most
> need of the limited amount of memory available.

..well, just add the shared_buffer pool to the list of candidates that could use dynamically allocated shared memory. It would need some thinking about boundaries (i.e. when to spill to disk, for those modules that /want/ to spill to disk) and about dealing with OOM situations, but that's about it.

Regards

Markus
On Wed, Jul 21, 2010 at 2:53 PM, Markus Wanner <markus@bluegap.ch> wrote: >> Consider also the contrary situation, >> where the imessages stuff is not in use (even for a short period of >> time, like a few minutes). Then we'd really rather not still have >> memory carved out for it. > > Huh? That's exactly what dynamic allocation could give you: not having > memory carved out for stuff you currently don't need, but instead being able > to dynamically use memory where most needed. SLRU has memory (not disk > space) carved out for pretty much every sub-system separately, if I'm > reading that code correctly. Yeah, I think you are right. :-( >> I think what would be even better is to merge the SLRU pools with the >> shared_buffer pool, so that the two can duke it out for who is in most >> need of the limited amount of memory available. > > ..well, just add the shared_buffer pool to the list of candidates that could > use dynamically allocated shared memory. It would need some thinking about > boundaries (i.e. when to spill to disk, for those modules that /want/ to > spill to disk) and dealing with OOM situations, but that's about it. I'm not sure why merging the SLRU pools with shared_buffers would benefit from dynamically allocated shared memory. I might be at (or possibly beyond) the limit of my ability to comment intelligently on this without looking more at what you want to use these imessages for, but I'm still pretty skeptical about the idea of storing them directly in shared memory. It's possible, though, that I am all wet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 07/22/2010 12:11 AM, Robert Haas wrote: > I'm not sure why merging the SLRU pools with shared_buffers would > benefit from dynamically allocated shared memory. Well, I'm not sure how you'd merge SLRU pools with shared_buffers. IMO that inherently leads to the problem of allocating memory dynamically. With such an allocator, I'd say you just port one module after another to use that, instead of pre-allocated, fixed portions of shared memory. > I might be at (or possibly beyond) the limit of my ability to comment > intelligently on this without looking more at what you want to use > these imessages for, but I'm still pretty skeptical about the idea of > storing them directly in shared memory. It's possible, though, that I > am all wet. Imessages are meant to be a replacement for unix pipes. (To my knowledge, those don't spill to disk either, but are blocking as soon as Linux considers the pipe to be 'full'. Whenever that is. Or am I wrong here?) The reasons for replacing them were: they consume lots of file descriptors, they can only be established between the parent and its child process (at least for anonymous pipes that's the case) and last but not least, I got told they still aren't fully portable. Another nice thing about imessages compared to unix pipes is, that it's a zero-copy approach. Hope that makes my opinions and decisions clearer. Thank you for sharing your concerns and for explaining SLRU to me. Regards Markus Wanner
On Thu, Jul 22, 2010 at 3:01 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 07/22/2010 12:11 AM, Robert Haas wrote: >> >> I'm not sure why merging the SLRU pools with shared_buffers would >> benefit from dynamically allocated shared memory. > > Well, I'm not sure how you'd merge SLRU pools with shared_buffers. IMO that > inherently leads to the problem of allocating memory dynamically. > > With such an allocator, I'd say you just port one module after another to > use that, instead of pre-allocated, fixed portions of shared memory. Well, shared_buffers has to be allocated as one contiguous slab because we index into it that way. So I don't really see how dynamically allocating memory could help. What you'd need is a different system for assigning buffer tags, so that a particular tag could refer to a buffer with either kind of contents. >> I might be at (or possibly beyond) the limit of my ability to comment >> intelligently on this without looking more at what you want to use >> these imessages for, but I'm still pretty skeptical about the idea of >> storing them directly in shared memory. It's possible, though, that I >> am all wet. > > Imessages are meant to be a replacement for unix pipes. (To my knowledge, > those don't spill to disk either, but are blocking as soon as Linux > considers the pipe to be 'full'. Whenever that is. Or am I wrong here?) I think you're right about that. > The reasons for replacing them were: they consume lots of file descriptors, > they can only be established between the parent and its child process (at > least for anonymous pipes that's the case) and last but not least, I got > told they still aren't fully portable. Another nice thing about imessages > compared to unix pipes is, that it's a zero-copy approach. That's sort of approaching the question from the opposite end from what I was concerned about - I was wondering why you need a unicast message-passing system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi,

On 07/22/2010 01:04 PM, Robert Haas wrote:
> Well, shared_buffers has to be allocated as one contiguous slab
> because we index into it that way. So I don't really see how
> dynamically allocating memory could help. What you'd need is a
> different system for assigning buffer tags, so that a particular tag
> could refer to a buffer with either kind of contents.

Hm.. okay, then it might not be that easy. Thanks for pointing that out.

> That's sort of approaching the question from the opposite end from
> what I was concerned about - I was wondering why you need a unicast
> message-passing system.

Well, the initial Postgres-R approach, being based on Postgres 6.4.something, used unix pipes. I coded imessages as a replacement. Postgres-R basically uses imessages to pass around change sets and other information required to keep replicas in sync.

The thinking in terms of message passing seems to originate from the GCS, which in itself is a message passing system (with some nice extras and varying delivery guarantees). In Postgres-R the coordinator process receives messages from the GCS, does some minor controlling and book-keeping, but basically passes on the data via imessages to a background worker.

Of course, as mentioned in the bgworker patch, this could be done differently. Using solely shared memory, or maybe SLRU to store change sets. However, I certainly like the abstraction and guarantees such a message passing system provides. It makes things easier to reason about, IMO.

For another example, see the bgworker patches, steps 1 and 2, where I've changed the current autovacuum infrastructure to use imessages (between launcher and worker).

[ And I've heard it said that current multi-core CPU designs tend to like message passing systems. Not sure how much that applies to imessages and/or how it's used in bgworkers or Postgres-R, though. ]

That much about why I'm using a unicast message-passing system.

Regards

Markus Wanner
Markus Wanner wrote: > On 07/20/2010 09:05 PM, Alvaro Herrera wrote: >> Hmm, deriving code from a paper published by IBM sounds like bad news -- >> who knows what patents they hold on the techniques there? > > Yeah, that might be an issue. Note, however, that the lock-based > variant differs substantially from what's been published. And I sort > of doubt their patents covers a lot of stuff that's not lock-free-ish. There's a fairly good mapping of what techniques are patented and which were only mentioned in research in the Sun dynamic memory patent at http://www.freepatentsonline.com/7328316.html ; that mentions an earlier paper by the author of the technique Markus is using, but this was from before that one was written. It looks like Sun has a large portion of the patent portfolio in this area, which is particularly troublesome now. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg, On 07/22/2010 03:59 PM, Greg Smith wrote: > There's a fairly good mapping of what techniques are patented and which > were only mentioned in research in the Sun dynamic memory patent at > http://www.freepatentsonline.com/7328316.html ; that mentions an earlier > paper by the author of the technique Markus is using, but this was from > before that one was written. It looks like Sun has a large portion of > the patent portfolio in this area, which is particularly troublesome now. Thanks for the pointer, very helpful. Anybody ever checked jemalloc, or any other OSS allocator out there against these patents? Remembering similar patent-discussions, it might be better to not bother too much and just go with something widely used, based on the assumption that such a thing is going to enjoy broad support in case of an attack from a patent troll. What do you think? What'd be your favorite allocator? Regards Markus Wanner
Excerpts from Markus Wanner's message of jue jul 22 08:49:29 -0400 2010: > Of course, as mentioned in the bgworker patch, this could be done > differently. Using solely shared memory, or maybe SLRU to store change > sets. However, I certainly like the abstraction and guarantees such a > message passing system provides. It makes things easier to reason about, > IMO. FWIW I don't think you should be thinking in "replacing imessages with SLRU". I rather think you should be thinking in how can you implement the imessages API on top of SLRU. So as far as the coordinator and background worker are concerned, there wouldn't be any difference -- they keep using the same API they are using today. Also let me repeat my earlier comment about imessages being more similar to multixact than to notify. The content of each multixact entry is just an arbitrary amount of bytes. If imessages are numbered from a monotonically increasing sequence, it should be possible to use a very similar technique, and perhaps you should be able to reduce locking requirements as well (write messages with only a shared lock, after you've determined and reserved the area you're going to write).
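Alvaro's "reserve under the lock, then write outside it" idea could look roughly like the following. LWLockAcquire/LWLockRelease and MAXALIGN are the existing primitives; ImsgAppendLock, ctl and ImsgHeader are hypothetical names used only for this sketch.

uint64      start;

/* Reserve a slot for the message while holding the lock ... */
LWLockAcquire(ImsgAppendLock, LW_EXCLUSIVE);
start = ctl->append_point;
ctl->append_point += MAXALIGN(sizeof(ImsgHeader) + payload_len);
LWLockRelease(ImsgAppendLock);

/*
 * ... then copy the header and payload into the reserved range without any
 * exclusive lock held: no other writer can have reserved the same bytes.
 * The message only becomes visible to its recipient once the recipient's
 * queue pointer is advanced past it, so readers never see a half-written
 * entry.
 */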
Hi, On 07/22/2010 08:31 PM, Alvaro Herrera wrote: > FWIW I don't think you should be thinking in "replacing imessages with > SLRU". I rather think you should be thinking in how can you implement > the imessages API on top of SLRU. Well, I'm rather comparing SLRU with the dynamic allocator. So far I'm unconvinced that SLRU would be a better base for imessages than a dynamic allocator. (And I'm arguing that SLRU should use a dynamic allocator underneath). > So as far as the coordinator and > background worker are concerned, there wouldn't be any difference -- > they keep using the same API they are using today. Agreed, the imessages API to the upper layer doesn't need to care about the underlying stuff. > Also let me repeat my earlier comment about imessages being more similar > to multixact than to notify. The content of each multixact entry is > just an arbitrary amount of bytes. If imessages are numbered from a > monotonically increasing sequence, Well, there's absolutely no need to serialize imessages. So they don't currently carry any such number. And opposed to multixact entries, they are clearly directed at exactly one single consumer. Every consumer has its own receive queue. Sending messages concurrently to different recipients may happen completely parallelized, without any (b)locking in between. The dynamic allocator is the only part of the chain which might need to do some locking to protect the shared resource (memory) against concurrent access. Note, however, that wamalloc (as any modern dynamic allocator) is parallelized to some extent, i.e. concurrent malloc/free calls don't necessarily need to block each other. > it should be possible to use a very > similar technique, and perhaps you should be able to reduce locking > requirements as well (write messages with only a shared lock, after > you've determined and reserved the area you're going to write). Writing to the message is currently (i.e. imessages-on-dynshmem) done without *any* kind of lock held. So that would rather increase locking requirements and lower parallelism, I fear. Regards Markus
On Thu, Jul 22, 2010 at 3:09 PM, Markus Wanner <markus@bluegap.ch> wrote: >> FWIW I don't think you should be thinking in "replacing imessages with >> SLRU". I rather think you should be thinking in how can you implement >> the imessages API on top of SLRU. > > Well, I'm rather comparing SLRU with the dynamic allocator. So far I'm > unconvinced that SLRU would be a better base for imessages than a dynamic > allocator. (And I'm arguing that SLRU should use a dynamic allocator > underneath). Here's another idea. Instead of making imessages use an SLRU, how about having it steal pages from shared_buffers? This would require segmenting messages into small enough chunks that they'd fit, but the nice part is that it would avoid the need to have a completely separate shared memory arena. Ideally, we'd make the infrastructure general enough that things like SLRU could use it also; and get rid of or reduce in size some of the special-purpose chunks we're now allocating. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Excerpts from Robert Haas's message of lun jul 26 08:52:46 -0400 2010: > Here's another idea. Instead of making imessages use an SLRU, how > about having it steal pages from shared_buffers? This would require > segmenting messages into small enough chunks that they'd fit, but the > nice part is that it would avoid the need to have a completely > separate shared memory arena. Ideally, we'd make the infrastructure > general enough that things like SLRU could use it also; and get rid of > or reduce in size some of the special-purpose chunks we're now > allocating. What's the problem you see with "another shared memory arena"? Right now we allocate a single large arena, and the lot of shared_buffers, SLRU pools, locking objects, etc are all allocated from there. If we want another 2 MB for "dynamic shmem", we'd just allocate 2 MB more in that large arena and give those to this new code.
On Mon, Jul 26, 2010 at 10:31 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Robert Haas's message of lun jul 26 08:52:46 -0400 2010: >> Here's another idea. Instead of making imessages use an SLRU, how >> about having it steal pages from shared_buffers? This would require >> segmenting messages into small enough chunks that they'd fit, but the >> nice part is that it would avoid the need to have a completely >> separate shared memory arena. Ideally, we'd make the infrastructure >> general enough that things like SLRU could use it also; and get rid of >> or reduce in size some of the special-purpose chunks we're now >> allocating. > > What's the problem you see with "another shared memory arena"? Right > now we allocate a single large arena, and the lot of shared_buffers, > SLRU pools, locking objects, etc are all allocated from there. If we > want another 2 MB for "dynamic shmem", we'd just allocate 2 MB more in > that large arena and give those to this new code. But that's not a very flexible design. If you discover that you need 3MB instead of 2MB, you get to restart the entire cluster. If you discover that you need 1MB instead of 2MB, you get to either restart the entire cluster, or waste 1MB of shared memory. And since actual usage will almost certainly fluctuate, you'll almost certainly be wasting some shared memory that could otherwise be used for other purposes some of the time. Now, granted, we have this problem already today, and granted also, 2MB is not an enormous amount of memory on today's machines. If we really think that 2MB will always be adequate for every purpose for which we wish to use unicast messaging, then perhaps it's OK, but I'm not convinced that's true. It would be nice to think, for example, that this could be used as infrastructure for parallel query to stream results back from worker processes to the backend connected to the user. If you're using 16 processors to concurrently scan 16 partitions of an appendrel and stream those results back to the master, will 128kB/backend be enough memory to avoid pipeline stalls? What if there's replication going on at the same time? What if there's other concurrent activity that also uses imessages? Or even better, what if there's other concurrent activity that uses the dynamic allocator but NOT imessages? If the point of having a dynamic allocator is that it's eventually going to be used by lots of different subsystems, then we had better have a fairly high degree of confidence that it actually will, but in fact we've made very little effort to characterize who the other users might be and whether the stated implementation limitations will be adequate for them. Frankly, I doubt it. One of the major reasons why malloc() is so powerful is that you don't have to decide in advance how much memory you're going to need, as you would if you put the structure in the data segment. Dynamically allocating out of a 2MB segment gives up most of that flexibility. What I think will end up happening here is that you'll always have to size the segment used by the dynamic allocator considerably larger than the amount of memory you expect to actually be used, so that performance doesn't go into the toilet when it fills up. As Markus pointed out upthread, you'll always need some hard limit on the amount of space that imessages can use, but you can make that limit much larger if it's not reserved for a single purpose. 
If you use the "temporarily allocated shared buffers" method, then you could set the default limit to something like "64MB, but not more than 1/8th of shared buffers". Since the memory won't get used unless it's needed, you don't really have to care whether a particular installation is likely to need some, none, or all of that; whereas if you're allocating nailed-down memory, you're going to want a much smaller default - a couple of MB, at most. Furthermore, if you do happen to be running on a 64GB machine with 8GB of shared_buffers and 64MB isn't adequate, you can easily make it possible to bump that value up by changing a GUC and hitting reload. With the "nailed-down shared memory" approach, you're locked into whatever you decide at postmaster start. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
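The sizing rule Robert suggests ("64MB, but not more than 1/8th of shared buffers") is cheap to express in terms of the existing knobs; imessages_max_buffers is a made-up name for the hypothetical setting.

/* NBuffers and BLCKSZ are the existing shared_buffers knobs; Min() is the
 * usual Postgres macro from c.h. */
int     imessages_max_buffers = Min((64 * 1024 * 1024) / BLCKSZ, NBuffers / 8);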
Hi,

On 07/26/2010 04:31 PM, Alvaro Herrera wrote:
> Excerpts from Robert Haas's message of lun jul 26 08:52:46 -0400 2010:
>> Here's another idea. Instead of making imessages use an SLRU, how
>> about having it steal pages from shared_buffers? This would require
>> segmenting messages into small enough chunks that they'd fit, but the
>> nice part is that it would avoid the need to have a completely
>> separate shared memory arena. Ideally, we'd make the infrastructure
>> general enough that things like SLRU could use it also; and get rid of
>> or reduce in size some of the special-purpose chunks we're now
>> allocating.

To me that sounds like solving the same kind of problem for every module separately and somewhat differently. I tend to like general solutions (often too much, but that's another story), and to me it still seems a completely dynamic memory allocator solves that generically (and far more elegantly than 'stealing pages' sounds).

> Right
> now we allocate a single large arena, and the lot of shared_buffers,
> SLRU pools, locking objects, etc are all allocated from there.

Uh.. they all allocate from different, statically sized pools, don't they?

> If we
> want another 2 MB for "dynamic shmem", we'd just allocate 2 MB more in
> that large arena and give those to this new code.

That's how it could work if we used a dynamic allocator. But currently, if I understand correctly, once the shared_buffers pool is full, it cannot steal memory from the SLRU pools. Or am I mistaken?

Regards

Markus Wanner
Hi,

On 07/26/2010 06:33 PM, Robert Haas wrote:
> It would be nice to think, for example, that this could be used as
> infrastructure for parallel query to stream results back from worker
> processes to the backend connected to the user. If you're using 16
> processors to concurrently scan 16 partitions of an appendrel and
> stream those results back to the master

Now, *that* sounds like music to my ears ;-) Or put another way: yes, I think imessages and the bgworker infrastructure stuff could enable or at least help with that goal.

> Dynamically allocating out of a 2MB
> segment gives up most of that flexibility.

Absolutely, that's why I'd like to see other modules that use the dynamic allocator. The more the better.

> What I think will end up happening here is that you'll always have to
> size the segment used by the dynamic allocator considerably larger
> than the amount of memory you expect to actually be used, so that
> performance doesn't go into the toilet when it fills up. As Markus
> pointed out upthread, you'll always need some hard limit on the amount
> of space that imessages can use, but you can make that limit much
> larger if it's not reserved for a single purpose. If you use the
> "temporarily allocated shared buffers" method, then you could set the
> default limit to something like "64MB, but not more than 1/8th of
> shared buffers".

I've been thinking about such rules as well. They quickly get more complex if you begin to take OOM situations and their counter-measures into account. In a way, fixing every separate pool to its specific size is just the simplest rule-set I can think of. The dynamic allocator buys you more flexibility, but choosing good limits and rules between the sub-systems is another issue.

Regards

Markus Wanner
On Mon, Jul 26, 2010 at 12:51 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> Dynamically allocating out of a 2MB
>> segment gives up most of that flexibility.
>
> Absolutely, that's why I'd like to see other modules that use the dynamic
> allocator. The more the better.

Right, I agree. The problem is that I don't think they can. The elephant in the room is shared_buffers, which I believe to be typically BY FAR the largest consumer of shared memory. It would be absolutely fantastic if we had a shared_buffers implementation that could free up unused buffers when they're not needed, or add more when required. But there are several reasons why I don't believe that will ever happen.

One, much of the code that uses shared_buffers relies on shared_buffers being located at a fixed memory address on a contiguous chunk, and it's hard to see how we could change that assumption without sacrificing performance.

Two, the overall size of the shared memory arena is largely dependent on the size of shared_buffers, so unless you also have the ability to resize the arena on the fly (which is well-nigh impossible with our current architecture, and maybe with any architecture), resizing shared_buffers doesn't actually add that much flexibility.

Three, the need for shared buffers is elastic rather than absolute: stealing a few shared buffers for a defined purpose (like sending imessages) is perfectly reasonable, but it's rarely going to be a good idea for the buffer manager to proactively free up memory just in case some other part of the system might need some. If you have a system that normally has 4GB of shared buffers and some other module borrows 100MB and then returns it, the system will just cache less data while that memory is in use and then start right back up caching more again once it's returned. That's very nice, and it's hard to see how else to achieve that result.

Of course, there are other parts of the system (a whole bunch of them) that use shared memory also, and perhaps some of those could be modified to use the dynamic allocator as well. But they're getting by without it now, so maybe they don't really need it. The SLRU stuff, I think, works more or less like shared buffers (so you have the same set of issues) and I think most of the other users are allocating small, fixed-size chunks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
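The contiguity assumption mentioned in point one is visible directly in the buffer manager, where a buffer number becomes an address by plain pointer arithmetic over one big array. This is slightly simplified from the BufferGetBlock macro in src/include/storage/bufmgr.h, which also has a branch for local buffers.

/* BufferBlocks points at the single contiguous shared_buffers array. */
#define BufferGetBlock(buffer) \
    ((Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ))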
Hi, On 07/26/2010 07:16 PM, Robert Haas wrote: > Of course, there are other parts of the system (a whole bunch of them) > that used shared memory also, and perhaps some of those could be > modified to use the dynamic allocator as well. But they're getting by > without it now, so maybe they don't really need it. The SLRU stuff, I > think, works more or less like shared buffers (so you have the same > set of issues) and I think most of the other users are allocating > small, fixed-size chunks. Yeah, I see your point(s). Note however, that a thread based design doesn't have this problem *at all*. Memory generally is shared (between threads) and you can dynamically allocate more or less (until Linux' OOM killer hits you.. yet another story). The OS reuses memory you don't currently need even for other applications. Users as well as developers know the threaded model (arguably, much better than the process based one). So that's what we get compared to. And what developers (including me) are used to. I think we are getting by with fixed allocations at the moment, because we did a lot to get by with it. By working around these limitations. However, that's just my thinking. Thank you for your inputs. Regards Markus Wanner
On Mon, Jul 26, 2010 at 1:50 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Note however, that a thread based design doesn't have this problem *at all*.
> Memory generally is shared (between threads) and you can dynamically
> allocate more or less (until Linux' OOM killer hits you.. yet another
> story). The OS reuses memory you don't currently need even for other
> applications.
>
> Users as well as developers know the threaded model (arguably, much better
> than the process based one). So that's what we get compared to. And what
> developers (including me) are used to.

I'm sort of used to the process model, myself, but I may be in the minority.

> I think we are getting by with fixed allocations at the moment, because we
> did a lot to get by with it. By working around these limitations.
>
> However, that's just my thinking. Thank you for your inputs.

I completely agree with you that fixed allocations suck. We're just disagreeing (hopefully, in a friendly and collegial fashion) about what to do about it.

I actually think that memory management is one of the weakest elements of our current architecture, though I think for somewhat different reasons than what you're thinking about. Besides the fact that we have various smaller pools of dynamically shared memory (e.g. a separate ring of buffers for each SLRU), I'm also unhappy about some of the things we do with backend-private memory, work_mem being the biggest culprit by far, because it's very difficult for the DBA to set the knobs in a way that uses all of the memory he wants to allocate to the database efficiently, with no overruns and none left over.

The case where you can count on the database and all of your temporary files, etc. to fit in RAM is really an exceptional case: in general, you need to assume that there will be more demand for memory than there will be memory available, and as much as possible you want the system (rather than the user) to decide how it should optimally be allocated. The query planner and executor actually do have most of what is needed to execute queries using more or less memory, but they lack the global intelligence needed for intelligent decision-making. Letting the OS buffer cache rather than the PG buffer cache handle most of the system's memory helps, but it's not a complete solution.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote:
> I actually think that memory management is one of the weakest
> elements of our current architecture

I'm actually pretty impressed by the memory contexts in PostgreSQL. Apparently I'm not alone in that, either; a paper by Hellerstein, Stonebraker, and Hamilton [1] has this in section 7.2 (Memory Allocator):

"The interested reader may want to browse the open-source PostgreSQL code. This utilizes a fairly sophisticated memory allocator."

I think the problem here is that we don't extend that sophistication to shared memory.

-Kevin

[1] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007. Architecture of a Database System. Foundations and Trends(R) in Databases Vol. 1, No. 2 (2007) 141-259.
http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
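For readers who haven't looked at it, the backend-local machinery Kevin refers to is used roughly like this. These are the existing, documented calls; the point of this thread is that there is no comparable facility for shared memory.

#include "postgres.h"
#include "utils/memutils.h"

static void
memory_context_demo(void)
{
    /* Create a private context, allocate in it, then discard it wholesale. */
    MemoryContext mycxt = AllocSetContextCreate(TopMemoryContext,
                                                "demo context",
                                                ALLOCSET_DEFAULT_MINSIZE,
                                                ALLOCSET_DEFAULT_INITSIZE,
                                                ALLOCSET_DEFAULT_MAXSIZE);
    MemoryContext oldcxt = MemoryContextSwitchTo(mycxt);
    char       *buf = palloc(8192);     /* lives in mycxt */

    /* ... use buf ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(mycxt);         /* releases buf and anything else in mycxt */
}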
On Mon, Jul 26, 2010 at 3:16 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> I actually think that memory management is one of the weakest >> elements of our current architecture > > I'm actually pretty impressed by the memory contexts in PostgreSQL. > Apparently I'm not alone in that, either; a paper by Hellerstein, > Stonebraker, and Hamilton[1] has this in section 7.2 (Memory > Allocator): > > "The interested reader may want to browse the open-source PostgreSQL > code. This utilizes a fairly sophisticated memory allocator." > > I think the problem here is that we don't extend that sophistication > to shared memory. That's one aspect of it, and the other is that we don't have much global coordination about how we use it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Markus Wanner wrote: > Hi, > > On 07/26/2010 07:16 PM, Robert Haas wrote: > > Of course, there are other parts of the system (a whole bunch of them) > > that used shared memory also, and perhaps some of those could be > > modified to use the dynamic allocator as well. But they're getting by > > without it now, so maybe they don't really need it. The SLRU stuff, I > > think, works more or less like shared buffers (so you have the same > > set of issues) and I think most of the other users are allocating > > small, fixed-size chunks. > > Yeah, I see your point(s). > > Note however, that a thread based design doesn't have this problem *at > all*. Memory generally is shared (between threads) and you can > dynamically allocate more or less (until Linux' OOM killer hits you.. > yet another story). The OS reuses memory you don't currently need even > for other applications. [ Sorry to be jumping into this thread late.] I am not sure threads would greatly help us. The major problem is that all of our structures are currently contiguous in memory for quick access. I don't see how threading would help with that. We could use realloc(), but we can do the same in shared memory if we had a chunk infrastructure, though concurrent access to that memory would hurt us in either threads or shared memory. Fundamentally, recreating the libc memory allocation routines is not that hard. (Everyone has to detach from the shared memory segment, but they have to stop using it too, so it doesn't seem that hard.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Aug 9, 2010 at 11:02 AM, Bruce Momjian <bruce@momjian.us> wrote: > I am not sure threads would greatly help us. The major problem is that > all of our our structures are currently contiguous in memory for quick > access. I don't see how threading would help with that. We could use > realloc(), but we can do the same in shared memory if we had a chunk > infrastructure, though concurrent access to that memory would hurt us in > either threads or shared memory. > > Fundamentally, recreating the libc memory allocation routines is not > that hard. (Everyone has to detach from the shared memory segment, but > they have to stop using it too, so it doesn't seem that hard.) I actually don't think that's true. The advantage (and disadvantage) of using threads is that everything runs in one address space. So you just allocate more memory and everyone immediately sees it. In a process environment, that's not the case: to expand or shrink the size of the shared memory arena, everyone needs to explicitly change their own mapping. So imagine that thread-or-process A allocates a new chunk of memory and then writes a pointer to the new chunk in a previously allocated section of memory. Thread-or-process B then follows the pointer. In a threaded model, this is guaranteed to be safe. In a process model, it's not: A might have enlarged the shared memory mapping while B has not yet done so. So I think in our model any sort of change to the shared memory segment is going to require extremely careful gymnastics, and be pretty expensive. I don't care to take a position in the religious war over threads vs. processes, but I do think threads simplify the handling of this particular case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 08/09/2010 05:02 PM, Bruce Momjian wrote: > [ Sorry to be jumping into this thread late.] No problem at all. > I am not sure threads would greatly help us. Note that I'm absolutely, certainly not advocating the use of threads for Postgres. > The major problem is that > all of our our structures are currently contiguous in memory for quick > access. I don't see how threading would help with that. We could use > realloc(), but we can do the same in shared memory if we had a chunk > infrastructure, though concurrent access to that memory would hurt us in > either threads or shared memory. I don't quite follow what you are trying to say here. Whether or not structures are contiguous in memory might affect performance, but I don't see the relation to programmer's habits and/or knowledge. With our process-based design, the default is private memory (i.e. not shared). If you need shared memory, you must specify a certain amount in advance. That chunk of shared memory then is reserved and can't ever be used by another subsystem. Even if you barely ever need that much shared memory for the subsystem in question. That's opposed to what lots of people are used to with the threaded approach, where shared memory is the default. And where you can easily and dynamically allocate *shared* memory. Whatever chunk of shared memory one subsystem doesn't need is available to another one (modulo fragmentation of the dynamic allocator, perhaps, but..) > Fundamentally, recreating the libc memory allocation routines is not > that hard. Uh.. well, writing a good, scalable, dynamic allocator certainly poses some very interesting problems. Writing one that doesn't violate any patent or other IP as an additional requirement seems like a pretty tough problem to me. > (Everyone has to detach from the shared memory segment, but > they have to stop using it too, so it doesn't seem that hard.) So far, I only considered dynamically allocating from a pool of shared memory that's initially fixed in size. So as to be able to make better use of shared memory. Resizing the overall pool the easy way, requiring every backend to detach would cost a lot of performance. So that's certainly not something you want to do often. The purpose of such a dynamic allocator as I see it rather is to be able to re-allocate unused memory of one subsystem to another one *on the fly*. Not just for performance, but also for ease of use for the admin and the developer, IMO. Regards Markus Wanner
Robert Haas <robertmhaas@gmail.com> writes: > So imagine that thread-or-process A allocates allocates a new chunk of > memory and then writes a pointer to the new chunk in a previously > allocated section of memory. Thread-or-process B then follows the > pointer. In a threaded model, this is guaranteed to be safe. In a > process model, it's not: A might have enlarged the shared memory > mapping while B has not yet done so. So I think in our model any sort > of change to the shared memory segment is going to require extremely > careful gymnastics, and be pretty expensive. ... and on some platforms, it'll be flat out impossible. We looked at this years ago and concluded that changing the size of the shmem segment after postmaster start was impractical from a portability standpoint. I have not seen anything to change that conclusion. > I don't care to take a position in the religious war over threads vs. > processes, but I do think threads simplify the handling of this > particular case. You meant "I don't think", right? I agree. The only way threads would simplify this is if we went over to a mysql-style model where there was only one process, period, and all backends were threads inside that. No shared memory as such, at all. regards, tom lane
Robert Haas wrote: > On Mon, Aug 9, 2010 at 11:02 AM, Bruce Momjian <bruce@momjian.us> wrote: > > I am not sure threads would greatly help us. The major problem is that > > all of our our structures are currently contiguous in memory for quick > > access. I don't see how threading would help with that. We could use > > realloc(), but we can do the same in shared memory if we had a chunk > > infrastructure, though concurrent access to that memory would hurt us in > > either threads or shared memory. > > > > Fundamentally, recreating the libc memory allocation routines is not > > that hard. (Everyone has to detach from the shared memory segment, but > > they have to stop using it too, so it doesn't seem that hard.) > > I actually don't think that's true. The advantage (and disadvantage) > of using threads is that everything runs in one address space. So you > just allocate more memory and everyone immediately sees it. In a > process environment, that's not the case: to expand or shrink the size > of the shared memory arena, everyone needs to explicitly change their > own mapping. You can't expand the size of malloc'ed memory --- you have to call realloc(), and then you effectively get a new pointer. Shared memory has a similar limitation. If you allocate shared memory in chunks so you don't need to change the location, you are effectively doing another malloc(), like you would in a threaded process. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Markus Wanner wrote: > Hi, > > On 08/09/2010 05:02 PM, Bruce Momjian wrote: > > [ Sorry to be jumping into this thread late.] > > No problem at all. > > > I am not sure threads would greatly help us. > > Note that I'm absolutely, certainly not advocating the use of threads > for Postgres. > > > The major problem is that > > all of our our structures are currently contiguous in memory for quick > > access. I don't see how threading would help with that. We could use > > realloc(), but we can do the same in shared memory if we had a chunk > > infrastructure, though concurrent access to that memory would hurt us in > > either threads or shared memory. > > I don't quite follow what you are trying to say here. Whether or not > structures are contiguous in memory might affect performance, but I > don't see the relation to programmer's habits and/or knowledge. > > With our process-based design, the default is private memory (i.e. not > shared). If you need shared memory, you must specify a certain amount in > advance. That chunk of shared memory then is reserved and can't ever be > used by another subsystem. Even if you barely ever need that much shared > memory for the subsystem in question. Once multiple threads are using the same local memory, you have the same issues of being unable to resize it because repalloc can change the pointer location. > That's opposed to what lots of people are used to with the threaded > approach, where shared memory is the default. And where you can easily > and dynamically allocate *shared* memory. Whatever chunk of shared > memory one subsystem doesn't need is available to another one (modulo > fragmentation of the dynamic allocator, perhaps, but..) Well, this could be done with shared memory as well. My point is that you can treat malloc the same as "add shared memory", to some extent, with the same limitations. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian wrote: > > With our process-based design, the default is private memory (i.e. not > > shared). If you need shared memory, you must specify a certain amount in > > advance. That chunk of shared memory then is reserved and can't ever be > > used by another subsystem. Even if you barely ever need that much shared > > memory for the subsystem in question. > > Once multiple threads are using the same local memory, you have the same > issues of being unable to resize it because repalloc can change the > pointer location. Let me be more concrete. Suppose you are using threads, and you want to increase your shared memory from 20MB to 30MB. How do you do that? If you want it contiguous, you have to use realloc, which might move the pointer. If you allocate another 10MB chunk, you then have shared memory fragments, which is the same as adding another shared memory segment. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, 2010-08-09 at 11:41 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > So imagine that thread-or-process A allocates allocates a new chunk of > > memory and then writes a pointer to the new chunk in a previously > > allocated section of memory. Thread-or-process B then follows the > > pointer. In a threaded model, this is guaranteed to be safe. In a > > process model, it's not: A might have enlarged the shared memory > > mapping while B has not yet done so. So I think in our model any sort > > of change to the shared memory segment is going to require extremely > > careful gymnastics, and be pretty expensive. > > ... and on some platforms, it'll be flat out impossible. We looked at > this years ago and concluded that changing the size of the shmem segment > after postmaster start was impractical from a portability standpoint. > I have not seen anything to change that conclusion. As caches get larger, downtime gets longer. Downtime of more than a few minutes per year is enough to blow claims of high availability. At some point, this project will need to face this particular hurdle. We may need to balance utility for the majority against portability for the minority. We should be laying out an architectural roadmap, not just saying no. We can make multi-year plans if we wish to. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Hi, On 08/09/2010 05:41 PM, Tom Lane wrote: > ... and on some platforms, it'll be flat out impossible. We looked at > this years ago and concluded that changing the size of the shmem segment > after postmaster start was impractical from a portability standpoint. > I have not seen anything to change that conclusion. I haven't tried, but I tend to believe that's true. However, I'd like to get back to the original intent of the posted patch. Which is about dynamically allocating memory *within a fixed size pool*. That's something SLRU or shared_buffers do to some extent, but with lots of limitations. And without the ability to move free memory between sub-systems (i.e. between different SLRU buffers). > You meant "I don't think", right? I agree. The only way threads would > simplify this is if we went over to a mysql-style model where there was > only one process, period, and all backends were threads inside that. > No shared memory as such, at all. That's how the threaded model normally is used, yes. And with that model, allocation of shared memory is very easy. It has none of the pre-allocation requirements we are currently facing. Regards Markus Wanner
Hi, On 08/09/2010 06:10 PM, Bruce Momjian wrote: > My point is that you can treat malloc the same as "add shared memory", > to some extent, with the same limiations. Once one of the SLRU buffers is full, it cannot currently allocate from another SLRU buffer's unused memory area. That memory there is plain wasted at that moment. That's my point and the problem the allocator I posted tries to solve. I fail to see how malloc could help here. malloc() only allocates process-local memory. Regards Markus Wanner
On Mon, Aug 9, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So imagine that thread-or-process A allocates allocates a new chunk of >> memory and then writes a pointer to the new chunk in a previously >> allocated section of memory. Thread-or-process B then follows the >> pointer. In a threaded model, this is guaranteed to be safe. In a >> process model, it's not: A might have enlarged the shared memory >> mapping while B has not yet done so. So I think in our model any sort >> of change to the shared memory segment is going to require extremely >> careful gymnastics, and be pretty expensive. > > ... and on some platforms, it'll be flat out impossible. We looked at > this years ago and concluded that changing the size of the shmem segment > after postmaster start was impractical from a portability standpoint. > I have not seen anything to change that conclusion. I haven't done extensive research into this, but I did take a look at it briefly. It looked to me like the style of shared memory we're using now (I guess it's System V) has no way to resize a shared memory segment at all, and certainly no way that's portable. However it also looked as though POSIX shm (shm_open, etc.) can be resized using ftruncate(). Whether this is portable to all the platforms we run on, or whether the behavior of ftruncate() in combination with shm_open() is in the standard, I'm not sure. I believe I went back and reread the old threads on this topic and it seems like the sticking point as far as POSIX shm goes is that it lacks a readable equivalent of shm_nattch. I think it was proposed to use a small sysv shm and then do the main shared memory arena with shm_open, but at that point you start to wonder why you're messing around with it at all. But I can't help but be intrigued by it, even so. Suppose, for example, that we kept things that were really fixed-size in shared memory but moved, say, shared_buffers to a POSIX shm. Would that allow you to then make shared_buffers PGC_SIGHUP? The obvious answer is "no", because there are a whole bunch of knock-on issues. Changing the size of shared_buffers also means changing the number of LWLocks, changing the number of buffer descriptors, etc. So maybe it can't be done. But I can't stop wondering if there's a way to make it work... >> I don't care to take a position in the religious war over threads vs. >> processes, but I do think threads simplify the handling of this >> particular case. > > You meant "I don't think", right? I agree. The only way threads would > simplify this is if we went over to a mysql-style model where there was > only one process, period, and all backends were threads inside that. > No shared memory as such, at all. I think we're saying the same thing in different ways; I agree with everything in that paragraph that follows the question mark. By "this particular case", I meant "shared memory allocation"; it would amount to just calling malloc() [or palloc()]. But yeah, clearly that only works in a single-process model. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
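For reference, the POSIX calls being discussed fit together roughly as follows: shm_open() creates or opens a named shared memory object, ftruncate() sets its size, and mmap() maps it into the process. A minimal sketch, with an invented object name and size, and not taken from PostgreSQL or from any of the patches in this thread (on some systems it needs -lrt at link time):

    /* Illustrative only: create a POSIX shm object, size it, and map it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *name = "/demo_shm";          /* invented name */
        size_t      size = 64 * 1024 * 1024;     /* invented size: 64 MB */
        int         fd;
        void       *base;

        fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
        {
            perror("shm_open");
            return 1;
        }

        /* set the size of the shared memory object */
        if (ftruncate(fd, (off_t) size) < 0)
        {
            perror("ftruncate");
            return 1;
        }

        base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* ... use the mapping; growing the object later with ftruncate()
         * does not change what this process has already mapped ... */

        munmap(base, size);
        close(fd);
        shm_unlink(name);
        return 0;
    }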
On Mon, Aug 9, 2010 at 12:03 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> On Mon, Aug 9, 2010 at 11:02 AM, Bruce Momjian <bruce@momjian.us> wrote: >> > I am not sure threads would greatly help us. The major problem is that >> > all of our our structures are currently contiguous in memory for quick >> > access. I don't see how threading would help with that. We could use >> > realloc(), but we can do the same in shared memory if we had a chunk >> > infrastructure, though concurrent access to that memory would hurt us in >> > either threads or shared memory. >> > >> > Fundamentally, recreating the libc memory allocation routines is not >> > that hard. (Everyone has to detach from the shared memory segment, but >> > they have to stop using it too, so it doesn't seem that hard.) >> >> I actually don't think that's true. The advantage (and disadvantage) >> of using threads is that everything runs in one address space. So you >> just allocate more memory and everyone immediately sees it. In a >> process environment, that's not the case: to expand or shrink the size >> of the shared memory arena, everyone needs to explicitly change their >> own mapping. > > You can't expand the size of malloc'ed memory --- you have to call > realloc(), and then you effectively get a new pointer. Shared memory > has a similar limitation. If you allocate shared memory in chunks so > you don't need to change the location, you are effectively doing another > malloc(), like you would in a threaded process. The point isn't what happens when you resize individual chunks; it's what happens when you need to expand the arena. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, Aug 9, 2010 at 12:31 PM, Bruce Momjian <bruce@momjian.us> wrote: > Bruce Momjian wrote: >> > With our process-based design, the default is private memory (i.e. not >> > shared). If you need shared memory, you must specify a certain amount in >> > advance. That chunk of shared memory then is reserved and can't ever be >> > used by another subsystem. Even if you barely ever need that much shared >> > memory for the subsystem in question. >> >> Once multiple threads are using the same local memory, you have the same >> issues of being unable to resize it because repalloc can change the >> pointer location. > > Let me be more concrete. Suppose you are using threads, and you want to > increase your shared memory from 20MB to 30MB. How do you do that? If > you want it contiguous, you have to use realloc, which might move the > pointer. If you allocate another 10MB chunk, you then have shared > memory fragments, which is the same as adding another shared memory > segment. You probably wouldn't do either of those things. You'd just allocate small chunks here and there for whatever you need them for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 08/09/2010 06:31 PM, Bruce Momjian wrote: > Let me be more concrete. Suppose you are using threads, and you want to > increase your shared memory from 20MB to 30MB. How do you do that? There's absolutely no need to pre-allocate 20 MB in advance in a threaded environment. You just allocate memory in small chunks. For a threaded model, that memory is shared by default, so the total amount of shared memory can grow and shrink very easily. (And even makes unused memory available to other processes, not just other threads). > If > you want it contiguous, you have to use realloc, which might move the > pointer. If you allocate another 10MB chunk, you then have shared > memory fragments, which is the same as adding another shared memory > segment. Okay, I think I now understand the requirement of contiguity you mentioned earlier already. I agree that with the current approach, we cannot simply use such a dynamic allocator to solve all of our problems. Every subsystem would need to be converted to something that allocates shared memory in smaller chunks for such a dynamic allocator to be of any use. Robert already pointed out that this may be troublesome for shared_buffers, which is by far the largest consumer of shared memory. I didn't look into this, yet. And I'd like to hear more about the feasibility of that approach for other subsystems. Another issue to be discussed would be the limits of sharing free memory between subsystems. Maybe we even reach the conclusion that we absolutely *want* fixed maximum sizes for every single subsystem so as to be able to guarantee a certain amount of multi-xact or SLRU entries at any point in time (otherwise one memory hungry subsystem could possibly eat it all up with another subsystem getting the OOM error when trying to allocate for its very first entry). Thanks for bringing this discussion to life again. Regards Markus Wanner
Markus Wanner wrote: > Hi, > > On 08/09/2010 06:10 PM, Bruce Momjian wrote: > > My point is that you can treat malloc the same as "add shared memory", > > to some extent, with the same limiations. > > Once one of the SLRU buffers is full, it cannot currently allocate from > another SLRU buffer's unused memory area. That memory there is plain > wasted at that moment. That's my point and the problem the allocator I > posted tries to solve. > > I fail to see how malloc could help here. malloc() only allocates > process-local memory. My point is that we have the same limitations with malloc()/threads, as we have with shared memory. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > On Mon, Aug 9, 2010 at 12:31 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Bruce Momjian wrote: > >> > With our process-based design, the default is private memory (i.e. not > >> > shared). If you need shared memory, you must specify a certain amount in > >> > advance. That chunk of shared memory then is reserved and can't ever be > >> > used by another subsystem. Even if you barely ever need that much shared > >> > memory for the subsystem in question. > >> > >> Once multiple threads are using the same local memory, you have the same > >> issues of being unable to resize it because repalloc can change the > >> pointer location. > > > > Let me be more concrete. Suppose you are using threads, and you want to > > increase your shared memory from 20MB to 30MB. How do you do that? If > > you want it contiguous, you have to use realloc, which might move the > > pointer. If you allocate another 10MB chunk, you then have shared > > memory fragments, which is the same as adding another shared memory > > segment. > > You probably wouldn't do either of those things. You'd just allocate > small chunks here and there for whatever you need them for. Well, then we do that with shared memory then --- my point is that it is the same problem with threads or processes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Markus Wanner wrote: > Hi, > > On 08/09/2010 06:31 PM, Bruce Momjian wrote: > > Let me be more concrete. Suppose you are using threads, and you want to > > increase your shared memory from 20MB to 30MB. How do you do that? > > There's absolutely no need to pre-allocate 20 MB in advance in a > threaded environment. You just allocate memory in small chunks. For a > threaded-model, that memory is shared by default, so the total amount of > shared memory can grow and shrink very easily. (And even makes usused > memory available to other processes, not just other threads). > > > If > > you want it contiguous, you have to use realloc, which might move the > > pointer. If you allocate another 10MB chunk, you then have shared > > memory fragments, which is the same as adding another shared memory > > segment. > > Okay, I think I now understand the requirement of continuity you > mentioned earlier already. I agree that with the current approach, we > cannot simply use such a dynamic allocator to solve all of our problems. > > Every subsystem would need to be converted to something that allocates > shared memory in smaller chunks for such a dynamic allocator to be of > any use. Robert already pointed out that this may be troublesome for > shared_buffers, which is by far the largest consumer of shared memory. I > didn't look into this, yet. And I'd like to hear more about the > feasibility of that approach for other subsystems. > > Another issue to be discussed would be the limits of sharing free memory > between subsystems. Maybe we even reach the conclusion that we > absolutely *want* fixed maximum sizes for every single subsystem so as > to be able to guarantee a certain amount of multi-xact or SLRU entries > at any point in time (otherwise one memory hungry subsystem could > possibly eat it all up with another subsystem getting the OOM error when > trying to allocate for its very first entry). Yep, you would have to use chunks in threads/malloc, and you have to do the same thing with shared memory. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 08/09/2010 08:33 PM, Bruce Momjian wrote: > Robert Haas wrote: >> You probably wouldn't do either of those things. You'd just allocate >> small chunks here and there for whatever you need them for. > > Well, then we do that with shared memory then --- my point is that it is > the same problem with threads or processes. That's what my patch allows you to do, yes. Currently you are bound to pre-allocate shared memory at startup. Or how would you allocate small chunks from shared memory at the moment? Regards Markus
On Mon, Aug 9, 2010 at 2:28 PM, Markus Wanner <markus@bluegap.ch> wrote: > Another issue to be discussed would be the limits of sharing free memory > between subsystems. Maybe we even reach the conclusion that we absolutely > *want* fixed maximum sizes for every single subsystem so as to be able to > guarantee a certain amount of multi-xact or SLRU entries at any point in > time (otherwise one memory hungry subsystem could possibly eat it all up > with another subsystem getting the OOM error when trying to allocate for its > very first entry). Yeah, I think that's a real concern. I think we need to distinguish memory needs from memory wants. Ideally, we'd like our entire database to be cached in RAM. But that may or may not be feasible, so we page what we can into shared_buffers and page out as necessary to make room for other things. In contrast, the traditional malloc() approach doesn't give you much flexibility: if it returns NULL, you pretty much have to fail whatever operation you were trying to perform. For some things, that's OK. For other things, it's not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Mon, Aug 9, 2010 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote: >> > Let me be more concrete. Suppose you are using threads, and you want to >> > increase your shared memory from 20MB to 30MB. How do you do that? If >> > you want it contiguous, you have to use realloc, which might move the >> > pointer. If you allocate another 10MB chunk, you then have shared >> > memory fragments, which is the same as adding another shared memory >> > segment. >> >> You probably wouldn't do either of those things. You'd just allocate >> small chunks here and there for whatever you need them for. > > Well, then we do that with shared memory then --- my point is that it is > the same problem with threads or processes. Well, I think your point is wrong, then. :-) It's not the same at all. If you have a bunch of threads in one address space, "shared" memory is really just process-local. You can grow the total amount of allocated space just by calling malloc(). With our architecture, you can't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Markus Wanner wrote: > On 08/09/2010 08:33 PM, Bruce Momjian wrote: > > Robert Haas wrote: > >> You probably wouldn't do either of those things. You'd just allocate > >> small chunks here and there for whatever you need them for. > > > > Well, then we do that with shared memory then --- my point is that it is > > the same problem with threads or processes. > > That's what my patch allows you to do, yes. Currently you are bound to > pre-allocate shared memory at startup. Or how would you allocate small > chunks from shared memory at the moment? We don't --- we allocate it all at startup. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Robert Haas wrote: > On Mon, Aug 9, 2010 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote: > >> > Let me be more concrete. Suppose you are using threads, and you want to > >> > increase your shared memory from 20MB to 30MB. How do you do that? If > >> > you want it contiguous, you have to use realloc, which might move the > >> > pointer. If you allocate another 10MB chunk, you then have shared > >> > memory fragments, which is the same as adding another shared memory > >> > segment. > >> > >> You probably wouldn't do either of those things. You'd just allocate > >> small chunks here and there for whatever you need them for. > > > > Well, then we do that with shared memory then --- my point is that it is > > the same problem with threads or processes. > > Well, I think your point is wrong, then. :-) > > It's not the same at all. If you have a bunch of threads in one > address space, "shared" memory is really just process-local. You can > grow the total amount of allocated space just by calling malloc(). > With our architecture, you can't. You effectively have to add infrastructure to add/remove shared memory segments to match memory requests. It is another step, but it is the same behavior. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 08/09/2010 08:49 PM, Bruce Momjian wrote: > Markus Wanner wrote: >> That's what my patch allows you to do, yes. Currently you are bound to >> pre-allocate shared memory at startup. Or how would you allocate small >> chunks from shared memory at the moment? > > We don't --- we allocate it all at startup. Exactly. And that's the difference to a thread-based approach. The downside of it is that you need to know in advance how much shared memory each of the subsystems is going to need. On the upside is the certainty that you already have the memory allocated and cannot run out of it. You just have what you have. (Note that you could do that as well with the thread-based approach, if you want. Most other programs I know don't choose that approach, though, but instead try to cope with OOM). Regards Markus
On Mon, Aug 9, 2010 at 2:50 PM, Bruce Momjian <bruce@momjian.us> wrote: > Robert Haas wrote: >> On Mon, Aug 9, 2010 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote: >> >> > Let me be more concrete. Suppose you are using threads, and you want to >> >> > increase your shared memory from 20MB to 30MB. How do you do that? If >> >> > you want it contiguous, you have to use realloc, which might move the >> >> > pointer. If you allocate another 10MB chunk, you then have shared >> >> > memory fragments, which is the same as adding another shared memory >> >> > segment. >> >> >> >> You probably wouldn't do either of those things. You'd just allocate >> >> small chunks here and there for whatever you need them for. >> > >> > Well, then we do that with shared memory then --- my point is that it is >> > the same problem with threads or processes. >> >> Well, I think your point is wrong, then. :-) >> >> It's not the same at all. If you have a bunch of threads in one >> address space, "shared" memory is really just process-local. You can >> grow the total amount of allocated space just by calling malloc(). >> With our architecture, you can't. > > You effectively have to add infrastructure to add/remove shared memory > segments to match memory requests. It is another step, but it is the > same behavior. That would be one way to tackle the problem, but there are difficulties. If we just created new shared memory segments at need, we might end up with a lot of shared memory segments. I suspect that would get complicated and present many management difficulties - which is why I'm so far of the opinion that we should try to architect the system to avoid the need for this functionality. I don't think it's going to be too easy to provide, short of (as Tom says) moving to the MySQL model of many threads working in a single process. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas wrote: > On Mon, Aug 9, 2010 at 2:50 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Robert Haas wrote: > >> On Mon, Aug 9, 2010 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote: > >> >> > Let me be more concrete. Suppose you are using threads, and you want to > >> >> > increase your shared memory from 20MB to 30MB. How do you do that? If > >> >> > you want it contiguous, you have to use realloc, which might move the > >> >> > pointer. If you allocate another 10MB chunk, you then have shared > >> >> > memory fragments, which is the same as adding another shared memory > >> >> > segment. > >> >> > >> >> You probably wouldn't do either of those things. You'd just allocate > >> >> small chunks here and there for whatever you need them for. > >> > > >> > Well, then we do that with shared memory then --- my point is that it is > >> > the same problem with threads or processes. > >> > >> Well, I think your point is wrong, then. :-) > >> > >> It's not the same at all. If you have a bunch of threads in one > >> address space, "shared" memory is really just process-local. You can > >> grow the total amount of allocated space just by calling malloc(). > >> With our architecture, you can't. > > > > You effectively have to add infrastructure to add/remove shared memory > > segments to match memory requests. It is another step, but it is the > > same behavior. > > That would be one way to tackle the problem, but there are > difficulties. If we just created new shared memory segments at need, > we might end up with a lot of shared memory segments. I suspect that > would get complicated and present many management difficulties - which > is why I'm so far of the opinion that we should try to architect the > system to avoid the need for this functionality. I don't think it's > going to be too easy to provide, short of (as Tom says) moving to the > MySQL model of many threads working in a single process. You could allocate shared memory in chunks and then pass that out to requestors, the same way sbrk() does it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 08/09/2010 08:50 PM, Bruce Momjian wrote: > You effectively have to add infrastructure to add/remove shared memory > segments to match memory requests. It is another step, but it is the > same behavior. That's of no use without a dynamic allocator, I think. Or else it is a vague description of a dynamic allocator. I'm approaching the problem from another perspective: trying to implement a dynamic allocator on top of a fixed size memory pool, first. Once we have that, we may start to think about dynamically adding or removing underlying segments. Regards Markus Wanner
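To make "a dynamic allocator on top of a fixed size memory pool" concrete, here is a deliberately simplified, hypothetical sketch: a first-fit free list carved out of one pre-sized region, guarded by a single lock. This is not the code from the posted patch; the pool size and names are invented, a static array stands in for the pre-allocated shared memory area, and a plain pthread mutex stands in for PostgreSQL's locking primitives:

    #include <pthread.h>
    #include <stddef.h>

    #define POOL_SIZE (1024 * 1024)    /* invented fixed pool size: 1 MB */

    typedef struct chunk
    {
        size_t        size;            /* usable bytes after this header */
        int           in_use;          /* 0 if free, 1 if handed out */
        struct chunk *next;            /* next chunk, in address order */
    } chunk;

    /* The union only forces suitable alignment of the backing storage,
     * which would be the pre-allocated shared memory area in practice. */
    static union { char bytes[POOL_SIZE]; long long align; } pool;
    static chunk *head;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    void
    pool_init(void)
    {
        head = (chunk *) pool.bytes;
        head->size = POOL_SIZE - sizeof(chunk);
        head->in_use = 0;
        head->next = NULL;
    }

    void *
    pool_alloc(size_t size)
    {
        void  *result = NULL;
        chunk *c;

        /* round the request up to pointer alignment */
        size = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

        pthread_mutex_lock(&pool_lock);
        for (c = head; c != NULL; c = c->next)
        {
            if (c->in_use || c->size < size)
                continue;
            /* split the chunk if the remainder is still worth keeping */
            if (c->size >= size + sizeof(chunk) + sizeof(void *))
            {
                chunk *rest = (chunk *) ((char *) (c + 1) + size);

                rest->size = c->size - size - sizeof(chunk);
                rest->in_use = 0;
                rest->next = c->next;
                c->next = rest;
                c->size = size;
            }
            c->in_use = 1;
            result = (void *) (c + 1);
            break;
        }
        pthread_mutex_unlock(&pool_lock);
        return result;                 /* NULL when the fixed pool is exhausted */
    }

    void
    pool_free(void *p)
    {
        chunk *c = ((chunk *) p) - 1;

        pthread_mutex_lock(&pool_lock);
        c->in_use = 0;
        /* merge with the following chunk if it is free, to limit fragmentation */
        if (c->next != NULL && !c->next->in_use)
        {
            c->size += sizeof(chunk) + c->next->size;
            c->next = c->next->next;
        }
        pthread_mutex_unlock(&pool_lock);
    }

The hard questions start where such a sketch stops: what callers do when NULL comes back, and how the fixed pool is shared fairly between subsystems.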
On 08/09/2010 09:00 PM, Bruce Momjian wrote: > You could allocate shared memory in chunks and then pass that out to > requestors, the same way sbrk() does it. sbrk() is described [1] as a "low-level memory allocator", which "is typically only used by the high-level malloc memory allocator implemented in the C library". Think of my patch as the high(er)-level variant ;-) It's certainly doable using processes and shared memory. Yes. My patch shows one way of how to go a step into that direction. Regards Markus Wanner [1]: http://www.cs.utah.edu/flux/moss/node39.html
On Mon, Aug 9, 2010 at 3:00 PM, Bruce Momjian <bruce@momjian.us> wrote: >> That would be one way to tackle the problem, but there are >> difficulties. If we just created new shared memory segments at need, >> we might end up with a lot of shared memory segments. I suspect that >> would get complicated and present many management difficulties - which >> is why I'm so far of the opinion that we should try to architect the >> system to avoid the need for this functionality. I don't think it's >> going to be too easy to provide, short of (as Tom says) moving to the >> MySQL model of many threads working in a single process. > > You could allocate shared memory in chunks and then pass that out to > requestors, the same way sbrk() does it. Sure. But I don't think that gets you very far. The management of the chunks is really hard. I go back to my previous example: you can't store a pointer that might point to another chunk, because the chunks won't get mapped into all the address spaces synchronously. Even if you don't care about doing that (and I bet you do), mapping and unmapping chunks is a heavyweight operation that requires every backend to notice that it needs to do something (and, incidentally, if any of them fail, you pretty much have to PANIC). I just can't imagine us building a reliable system this way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 08/09/2010 08:45 PM, Robert Haas wrote: > Yeah, I think that's a real concern. I think we need to distinguish > memory needs from memory wants. Ideally, we'd like our entire > database to be cached in RAM. But that may or may not be feasible, so > we page what we can into shared_buffers and page out as necessary to > make room for other things. In contrast, the traditional malloc() > approach doesn't give you much flexibility: if it returns NULL, you > pretty much have to fail whatever operation you were trying to > perform. For some things, that's OK. For other things, it's not. Agreed, it's going to be a difficult compromise and it possibly is very hard to find a good one automatically. However, I doubt our current approach with hard limits between subsystems is the best compromise. Regards Markus Wanner
Markus Wanner <markus@bluegap.ch> writes: > However, I'd like to get back to the original intent of the posted > patch. Which is about dynamically allocating memory *within a fixed size > pool*. > That's something SRLU or shared_buffers do to some extent, but with lots > of limitations. And without the ability to move free memory between > sub-systems (i.e. between different SLRU buffers). As far as SLRU is concerned, the already-agreed-to plan is to get rid of the separate arenas for SLRU and merge those things into the main shared buffers arena. IIRC, the motivation for designing SLRU the way it is was to ensure that SLRU uses couldn't be starved for memory due to high demand for shared buffers. But that was back when people frequently ran PG with only a few meg for shared buffers; I think that worry is obsolete. So I don't see this patch as offering anything at all that we care about so far as the core server is concerned. Maybe there are extensions that need it badly enough to justify such a feature in core, but SLRU is not a good argument for it. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 9, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... and on some platforms, it'll be flat out impossible. We looked at >> this years ago and concluded that changing the size of the shmem segment >> after postmaster start was impractical from a portability standpoint. >> I have not seen anything to change that conclusion. > I haven't done extensive research into this, but I did take a look at > it briefly. It looked to me like the style of shared memory we're > using now (I guess it's System V) has no way to resize a shared memory > segment at all, and certainly no way that's portable. However it also > looked as though POSIX shm (shm_open, etc.) can be resized using > ftruncate(). Whether this is portable to all the platforms we run on, > or whether the behavior of ftruncate() in combination with shm_open() > is in the standard, I'm not sure. It's not portable. That's exactly what we were looking into back when. > I believe I went back and reread > the old threads on this topic and it seems like the sticking point as > far as POSIX shm goes it that it lacks a readable equivalent of > shm_nattch. Yeah, that was another little problem. In principle though we only need one SysV-style shmem segment to get the required interlock, and there could be add-on shmem segments using POSIX or other APIs. But that doesn't get you out from under the portability issue or the memory space management issue (it's unlikely you can enlarge a segment without remapping it). regards, tom lane
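For context, the "readable equivalent of shm_nattch" mentioned above is the attach count System V reports through shmctl(IPC_STAT); it is what lets a new postmaster detect that processes from a previous incarnation are still attached to an old segment. A rough, hypothetical sketch of reading it (PostgreSQL's real interlock, in src/backend/port/sysv_shmem.c, is considerably more careful):

    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Hypothetical helper: is a System V segment for this key still in use,
     * i.e. does it exist and have at least one process attached? */
    static int
    segment_in_use(key_t key)
    {
        struct shmid_ds buf;
        int             shmid = shmget(key, 0, 0);   /* look up, don't create */

        if (shmid < 0)
            return 0;                                 /* no such segment */
        if (shmctl(shmid, IPC_STAT, &buf) < 0)
            return 0;
        return buf.shm_nattch > 0;
    }

POSIX shm objects offer nothing comparable to shm_nattch, which is why the idea of keeping one small SysV segment purely for the interlock keeps coming up.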
Hi, On 08/09/2010 09:14 PM, Tom Lane wrote: > As far as SLRU is concerned, the already-agreed-to plan is to get rid of > the separate arenas for SLRU and merge those things into the main shared > buffers arena. I didn't know about that plan. Sounds good. (I'm personally thinking this is trying to solve the same problem in a more specific fashion). > IIRC, the motivation for designing SLRU the way it is > was to ensure that SLRU uses couldn't be starved for memory due to high > demand for shared buffers. But that was back when people frequently ran > PG with only a few meg for shared buffers; I think that worry is > obsolete. Good to know. > So I don't see this patch as offering anything at all that we care about > so far as the core server is concerned. Maybe there are extensions that > need it badly enough to justify such a feature in core, but SLRU is not > a good argument for it. Fair enough. (Patch is already marked as "returned with feedback" on the commitfest app, thanks again for additional feedback) Regards Markus Wanner
On Mon, Aug 9, 2010 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Aug 9, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> ... and on some platforms, it'll be flat out impossible. We looked at >>> this years ago and concluded that changing the size of the shmem segment >>> after postmaster start was impractical from a portability standpoint. >>> I have not seen anything to change that conclusion. > >> I haven't done extensive research into this, but I did take a look at >> it briefly. It looked to me like the style of shared memory we're >> using now (I guess it's System V) has no way to resize a shared memory >> segment at all, and certainly no way that's portable. However it also >> looked as though POSIX shm (shm_open, etc.) can be resized using >> ftruncate(). Whether this is portable to all the platforms we run on, >> or whether the behavior of ftruncate() in combination with shm_open() >> is in the standard, I'm not sure. > > It's not portable. That's exactly what we were looking into back when. Uggh, that sucks. Can you provide any more details? >> I believe I went back and reread >> the old threads on this topic and it seems like the sticking point as >> far as POSIX shm goes it that it lacks a readable equivalent of >> shm_nattch. > > Yeah, that was another little problem. In principle though we only need > one SysV-style shmem segment to get the required interlock, and there > could be add-on shmem segments using POSIX or other APIs. But that > doesn't get you out from under the portability issue or the memory space > management issue (it's unlikely you can enlarge a segment without > remapping it). Unlikely is probably an understatement. Still, enlarging a segment with remapping might be workable for some useful subset of the cases. But, if enlarging it can't be done portably, then we're pretty much dead in the water. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 9, 2010 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's not portable. That's exactly what we were looking into back when. > Uggh, that sucks. Can you provide any more details? You don't really have to go further than consulting the relevant standards, eg SUS says at http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified. Particular implementations might cope with such cases in useful ways, or then again they might not. And even if your platform does, you've set an upper limit for the possible segment size in your mmap() call. Further down the page, SUS also takes pains to point out that you probably can't have an unlimited number of mapped regions, so adding more mmap'd segments isn't a way out either. regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > I don't think it's going to be too easy to provide, short of (as > Tom says) moving to the MySQL model of many threads working in a > single process. Well, it's a bit misleading to refer to it as the MySQL model. It's used by Microsoft SQL Server, MySQL, Informix, and Sybase. IBM DB2 supports four different process models, and OS threads in a single process is the default for them on an OS with good threading support; otherwise they default to one process per connection. Just because MySQL uses a particular technique doesn't *automatically* mean it's a bad one; it's just not in itself a confidence-builder. ;-) -Kevin
On Mon, Aug 9, 2010 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Aug 9, 2010 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> It's not portable. That's exactly what we were looking into back when. > >> Uggh, that sucks. Can you provide any more details? > > You don't really have to go further than consulting the relevant > standards, eg SUS says at > http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html > > If the size of the mapped file changes after the call to mmap() as a > result of some other operation on the mapped file, the effect of > references to portions of the mapped region that correspond to added > or removed portions of the file is unspecified. > > Particular implementations might cope with such cases in useful ways, or > then again they might not. That doesn't seem like a big problem to me. I was assuming we'd need to remap when the size changed. Also, I was assuming that we were going to use shms, not files. Take a look at this: http://www.opengroup.org/onlinepubs/007908799/xsh/shm_open.html -and- http://www.opengroup.org/onlinepubs/007908799/xsh/ftruncate.html From the ftruncate page: "If fildes references a shared memory object, ftruncate() sets the size of the shared memory object to length." > And even if your platform does, you've set > an upper limit for the possible segment size in your mmap() call. > > Further down the page, SUS also takes pains to point out that you > probably can't have an unlimited number of mapped regions, so adding > more mmap'd segments isn't a way out either. Yeah. I think any approach that is based on allocating new segments as needed is pretty much DOA. I think the point of this would be to be able to resize things like shared_buffers on the fly - that is, an explicit administrator action might trigger a resize-and-remap cycle, but general system activity would not. The reality is that as PostgreSQL is used in more and more 24x7 contexts and people put more and more critical data into it, forced server restarts become more and more of a problem. IMHO, we really need to do some creative thinking about how to crank PGC_POSTMASTER GUCs down to PGC_SIGHUP. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
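A resize-and-remap cycle of the kind being floated here would presumably look something like the sketch below in each backend, after one process has grown the object with ftruncate(): drop the old mapping and map the new size. This is only an illustration of the idea; whether every platform tolerates it, and what to do when mmap() fails or returns a different address, are exactly the open problems in this thread:

    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical per-backend step after a POSIX shm object has been grown
     * to new_size with ftruncate().  Returns the new base, or MAP_FAILED;
     * the caller would have to treat failure as fatal and cope with the
     * mapping moving to a different address. */
    static void *
    remap_segment(int shm_fd, void *old_base, size_t old_size, size_t new_size)
    {
        if (munmap(old_base, old_size) < 0)
            return MAP_FAILED;
        return mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    shm_fd, 0);
    }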
On Mon, Aug 09, 2010 at 02:16:47PM -0400, Robert Haas wrote: > I believe I went back and reread > the old threads on this topic and it seems like the sticking point as > far as POSIX shm goes it that it lacks a readable equivalent of > shm_nattch. I think it was proposed to use a small syv shm and then > do the main shared memory arena with shm_open, but at that point you > start to wonder you're messing around with at all. About using a small sysV segment for nattach and allocating the rest another way: the reason to do it is that "the other way" can be anything other than sysV. Namely, sysV has pathetic default limits whereas you can mmap() a few gig anonymously and the kernel won't bat an eyelid. Even if "the other way" didn't allow you to resize anything (which is what people appear to be talking about here) the benefit of being able to specify useful sizes of shared buffers without having to reconfigure the kernel makes it (ISTM) worthwhile doing irrespective of anything else. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
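Concretely, "mmap() a few gig anonymously" means an anonymous MAP_SHARED mapping created in the postmaster before fork(): every child inherits the same pages, and no SHMMAX-style kernel setting is consulted. A minimal sketch with an invented size (the flag is spelled MAP_ANON on some older systems):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  size = (size_t) 1 << 31;    /* 2 GB, purely illustrative */
        void   *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (base == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* Children forked from here share these pages with the parent. */
        if (fork() == 0)
        {
            ((char *) base)[0] = 1;         /* visible to the parent, too */
            _exit(0);
        }
        return 0;
    }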
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Aug 9, 2010 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Particular implementations might cope with such cases in useful ways, or >> then again they might not. > That doesn't seem like a big problem to me. I was assuming we'd need > to remap when the size changed. Well, as long as you can do that, sure. I'm concerned about what happens if/when remapping fails (not at all unlikely in 32-bit address spaces in particular). You mentioned that that would probably have to be a PANIC condition, which I think I agree with; and that idea pretty much kills any argument that this would be a good way to improve server uptime. Another issue is that if you're doing dynamic remapping you almost certainly can't assume that the segment will appear at the same addresses in every backend. We could live with that for shared buffers without too much pain, but not so much for most other shared datastructures. > Also, I was assuming that we were > going to use shms, not files. It looked to me like the spec for mmap was the same either way. regards, tom lane
On Mon, Aug 9, 2010 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > That doesn't seem like a big problem to me. I was assuming we'd need > to remap when the size changed. I had thought about this in the past too, just for supporting run-time changes to shared_buffers. I always assumed we would just allocate shared memory in chunks and create separate mappings for each chunk. -- greg
On Mon, Aug 9, 2010 at 7:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Aug 9, 2010 at 4:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Particular implementations might cope with such cases in useful ways, or >>> then again they might not. >> That doesn't seem like a big problem to me. I was assuming we'd need >> to remap when the size changed. > Well, as long as you can do that, sure. I'm concerned about what > happens if/when remapping fails (not at all unlikely in 32-bit address > spaces in particular). You mentioned that that would probably have to > be a PANIC condition, which I think I agree with; and that idea pretty > much kills any argument that this would be a good way to improve server > uptime. In some cases, you might be able to get by with FATAL. Still, it's easier to imagine using this in cases for things like resizing shared_buffers (where the alternative is to restart the server anyway) than it is to use it for routine memory allocation. > Another issue is that if you're doing dynamic remapping you almost > certainly can't assume that the segment will appear at the same > addresses in every backend. We could live with that for shared buffers > without too much pain, but not so much for most other shared > datastructures. Hmm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company