Thread: bg worker: patch 1 of 6 - permanent process
This patch turns the existing autovacuum launcher into an always running
process, now called the coordinator. If autovacuum is disabled, the
coordinator process still gets started and sticks around, but it doesn't
dispatch vacuum jobs.

The coordinator process now uses imessages to communicate with background
(autovacuum) workers and to trigger a vacuum job. So please apply the
imessages patches [1] before any of the bg worker ones.

It also adds two new controlling GUCs: min_spare_background_workers and
max_spare_background_workers. The autovacuum_max_workers setting still serves
as a limit on the total number of background/autovacuum workers. (It is going
to be renamed in step 4.)

Interaction with the postmaster has changed a bit. If autovacuum is disabled,
the coordinator isn't started with PMSIGNAL_START_AUTOVAC_LAUNCHER anymore;
instead there is an IMSGT_FORCE_VACUUM message that any backend might want to
send to the coordinator to prevent data loss due to XID wraparound (see
changes in access/transam/varsup.c). The SIGUSR2 from the postmaster to the
coordinator doesn't need to be multiplexed anymore, but is only sent to
inform about fork failures.

A note on the dependency on imessages: for just autovacuum, this message
passing infrastructure isn't absolutely necessary and could be removed.
However, for Postgres-R it turned out to be really helpful, and I think
chances are good that another user of this background worker infrastructure
would also want to transfer data of varying size to and from these workers.

Just as in the current version of Postgres, the background worker terminates
immediately after having performed a vacuum job.

Open issue: if the postmaster fails to fork a new background worker, the
coordinator still waits a whole second after receiving the SIGUSR2
notification signal from the postmaster. That might have been fine with just
autovacuum, but for other jobs, namely changeset application in Postgres-R,
that's not feasible.

[1] dynshmem and imessages patch
http://archives.postgresql.org/message-id/ab0cd52a64e788f4ecb4515d1e6e4691@localhost
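For readers skimming the patch description: the intended semantics of the two
new GUCs (which, as noted later in the thread, only get wired up in a later
step of the series) are presumably along these lines. Everything in this
sketch except the three GUC names is a hypothetical placeholder, not code
from the patch.

/* hypothetical pool-management logic in the coordinator */
static void
adjust_spare_workers(int idle_workers, int total_workers)
{
    if (idle_workers < min_spare_background_workers &&
        total_workers < autovacuum_max_workers)
        request_new_worker();       /* ask the postmaster to fork one more */
    else if (idle_workers > max_spare_background_workers)
        terminate_idle_worker();    /* let one spare worker exit */
}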
On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
> This patch turns the existing autovacuum launcher into an always running
> process, now called the coordinator. If autovacuum is disabled, the
> coordinator process still gets started and sticks around, but it doesn't
> dispatch vacuum jobs.

I think this part is a reasonable proposal, but...

> The coordinator process now uses imessages to communicate with background
> (autovacuum) workers and to trigger a vacuum job.
> It also adds two new controlling GUCs: min_spare_background_workers and
> max_spare_background_workers.

Other changes in the patch don't always seem to be needed for the purpose. In
other words, the patch is not minimal. The original purpose could be achieved
without IMessage. Also, min/max_spare_background_workers are not used in the
patch at all. (BTW, min/max_spare_background_*helpers* in
postgresql.conf.sample is maybe a typo.)

The most questionable point for me is why you didn't add any hook functions
in the coordinator process. With the patch, you can extend the coordinator
protocols with IMessage, but it requires patches to core at
handle_imessage(). If you want fully-extensible workers, we should provide a
method to develop worker code in an external plugin. Is it possible to
develop your own code in the plugin? If possible, you can use IMessage as a
private protocol freely in the plugin. Am I missing something?

--
Itagaki Takahiro
On Wed, Aug 25, 2010 at 9:39 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
> On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> This patch turns the existing autovacuum launcher into an always running
>> process, now called the coordinator. If autovacuum is disabled, the
>> coordinator process still gets started and sticks around, but it doesn't
>> dispatch vacuum jobs.
>
> I think this part is a reasonable proposal, but...

It's not clear to me whether it's better to have a single coordinator process
that handles both autovacuum and other things, or whether it's better to have
two separate processes.

>> The coordinator process now uses imessages to communicate with background
>> (autovacuum) workers and to trigger a vacuum job.
>> It also adds two new controlling GUCs: min_spare_background_workers and
>> max_spare_background_workers.
>
> Other changes in the patch don't always seem to be needed for the purpose.
> In other words, the patch is not minimal.
> The original purpose could be achieved without IMessage.
> Also, min/max_spare_background_workers are not used in the patch at all.
> (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
> maybe a typo.)
>
> The most questionable point for me is why you didn't add any hook functions
> in the coordinator process. With the patch, you can extend the coordinator
> protocols with IMessage, but it requires patches to core at handle_imessage().
> If you want fully-extensible workers, we should provide a method to develop
> worker code in an external plugin.

I agree with this criticism, but the other thing that strikes me as a
nonstarter is having the postmaster participate in the imessages framework.
Our general rule is that the postmaster must avoid touching shared memory;
else a backend that scribbles on shared memory might take out the postmaster,
leading to a failure of the crash-and-restart logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Aug 26, 2010 at 11:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jul 13, 2010 at 11:31 PM, Markus Wanner <markus@bluegap.ch> wrote:
>>> This patch turns the existing autovacuum launcher into an always running
>>> process, now called the coordinator.
>
> It's not clear to me whether it's better to have a single coordinator
> process that handles both autovacuum and other things, or whether it's
> better to have two separate processes.

Ah, we can separate the proposal into two topics:
A. Support to run non-vacuum jobs from the autovacuum launcher
B. Support "user-defined background processes"

A was proposed in the original "1 of 6" patch, but B might be more general.
If we have a separate coordinator, B will be required.

Markus, do you need B? Or are A + standard backend processes enough? If you
need B eventually, starting with B might be better.

--
Itagaki Takahiro
Hi,

thanks for your feedback on this; it sort of got lost below the discussion
about the dynamic shared memory stuff, IMO.

On 08/26/2010 04:39 AM, Robert Haas wrote:
> It's not clear to me whether it's better to have a single coordinator
> process that handles both autovacuum and other things, or whether it's
> better to have two separate processes.

It has been proposed by Alvaro and/or Tom (IIRC) to reduce code duplication.
Compared to the former approach, it certainly seems cleaner that way and it
has helped reduce duplicate code a lot.

I'm envisioning such a coordinator process to handle coordination of other
background processes as well, for example for distributed and/or parallel
querying. Having just one process reduces the amount of interaction required
between coordinators (i.e. autovacuum shouldn't ever start on databases for
which replication hasn't started yet, as the autovacuum worker would be
unable to connect to the database at that stage). It also reduces the number
of extra processes required, and thus I think also general complexity.

What'd be the benefits of having separate coordinator processes? They'd be
doing pretty much the same: coordinate background processes. (And yes, I
clearly consider autovacuum to be just one kind of background process).

> I agree with this criticism, but the other thing that strikes me as a
> nonstarter is having the postmaster participate in the imessages
> framework.

This is simply not the case (anymore). (And one of the reasons a separate
coordinator process is required, instead of letting the postmaster do this
kind of coordination).

> Our general rule is that the postmaster must avoid
> touching shared memory; else a backend that scribbles on shared memory
> might take out the postmaster, leading to a failure of the
> crash-and-restart logic.

That rule is well understood and followed by the bg worker infrastructure
patches. If you find code for which that isn't true, please point at it. The
crash-and-restart logic should work just as it did with the autovacuum
launcher.

Regards

Markus
Itagaki-san,

thanks for reviewing this.

On 08/26/2010 03:39 AM, Itagaki Takahiro wrote:
> Other changes in the patch don't always seem to be needed for the purpose.
> In other words, the patch is not minimal.

Hm.. yeah, maybe the separation between step 1 and step 2 is a bit arbitrary.
I'll look into it.

> The original purpose could be achieved without IMessage.

Agreed, that's the one exception. I've mentioned why that is, and I don't
currently feel like coding an unneeded variant which doesn't use imessages.

> Also, min/max_spare_background_workers are not used in the patch at all.

You are right, it only starts to get used in step 2, so the addition should
probably move there, right.

> (BTW, min/max_spare_background_*helpers* in postgresql.conf.sample is
> maybe a typo.)

Uh, correct, thank you for pointing this out. (I had originally named these
processes helpers. After merging with autovacuum, it made more sense to name
them background *workers*.)

> The most questionable point for me is why you didn't add any hook functions
> in the coordinator process.

Because I'm a hook-hater. ;-) No, seriously: I don't see what problem hooks
could have solved. I'm coding in C and extending the Postgres code. Deciding
on hooks and an API to use them requires good knowledge of where exactly you
want to hook and what API you want to provide. Then that API needs to remain
stable for an extended time. I don't think any part of the bg worker
infrastructure is currently anywhere close to that.

> With the patch, you can extend the coordinator
> protocols with IMessage, but it requires patches to core at handle_imessage().
> If you want fully-extensible workers, we should provide a method to develop
> worker code in an external plugin.

It's originally intended as internal infrastructure. Offering its
capabilities to the outside would require stabilization, security control and
working out an API. All of which is certainly not something I intend to do.

> Is it possible to develop your own code in the plugin? If possible, you can
> use IMessage as a private protocol freely in the plugin. Am I missing something?

Well, what problem(s) are you trying to solve with such a thing? I've no idea
what direction you are aiming at, sorry. However, it's certainly something
different from, or an extension of, bg worker, so it would need to be a
separate patch, IMO.

Regards

Markus Wanner
On 08/26/2010 05:01 AM, Itagaki Takahiro wrote:
> Markus, do you need B? Or are A + standard backend processes enough?
> If you need B eventually, starting with B might be better.

No, I certainly don't need B. Why not just use an ordinary backend to do
"user defined background processing"? It covers all of the API stability and
the security issues I've raised.

Regards

Markus Wanner
On Thu, Aug 26, 2010 at 7:42 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> Markus, do you need B? Or are A + standard backend processes enough?
>
> No, I certainly don't need B.

OK, I see why you proposed a coordinator hook (yeah, I call it a hook :)
rather than adding user-defined processes.

> Why not just use an ordinary backend to do "user defined background
> processing"? It covers all of the API stability and the security issues I've
> raised.

However, we have autovacuum worker processes in addition to normal backend
processes. Doesn't that show that there are some jobs we cannot run in normal
backends?

For example, normal backends cannot do anything in idle time, so a time-based
polling job is difficult in backends. It might be ok to fork processes for
each interval when the polling interval is long, but it is not effective for
short interval cases. I'd like to use that kind of process as an additional
stats collector.

--
Itagaki Takahiro
Itagaki-san,

On 08/26/2010 01:02 PM, Itagaki Takahiro wrote:
> OK, I see why you proposed a coordinator hook (yeah, I call it a hook :)
> rather than adding user-defined processes.

I see. If you call that a hook, I'm definitely not a hook-hater ;-), at least
not according to your definition.

> However, we have autovacuum worker processes in addition to normal backend
> processes. Doesn't that show that there are some jobs we cannot run in
> normal backends?

Hm.. understood. You can use VACUUM from a cron job. And that's the problem
autovacuum solves. So in a way, that's just a convenience feature. You want
the same for general purpose user defined background processing, right?

> For example, normal backends cannot do anything in idle time, so a
> time-based polling job is difficult in backends. It might be ok to
> fork processes for each interval when the polling interval is long,
> but it is not effective for short interval cases. I'd like to use
> that kind of process as an additional stats collector.

Did you follow the discussion I had with Dimitri, who was trying something
similar, IIRC? See the bg worker - overview thread. There might be some
interesting bits pointing in that direction.

Regards

Markus
On Thu, Aug 26, 2010 at 6:07 AM, Markus Wanner <markus@bluegap.ch> wrote:
> What'd be the benefits of having separate coordinator processes? They'd be
> doing pretty much the same: coordinate background processes. (And yes, I
> clearly consider autovacuum to be just one kind of background process).

I dunno. It was just a thought. I haven't actually looked at the code to see
how much synergy there is. (Sorry, been really busy...)

>> I agree with this criticism, but the other thing that strikes me as a
>> nonstarter is having the postmaster participate in the imessages
>> framework.
>
> This is simply not the case (anymore). (And one of the reasons a separate
> coordinator process is required, instead of letting the postmaster do this
> kind of coordination).

Oh, OK. I see now that I misinterpreted what you wrote.

On the more general topic of imessages, I had one other thought that might be
worth considering. Instead of using shared memory, what about using a file
that is shared between the sender and receiver? So for example, perhaps each
receiver will read messages from a file called pg_messages/%d, where %d is
the backend ID. And writers will write into that file. Perhaps both readers
and writers mmap() the file, or perhaps there's a way to make it work with
just read() and write(). If you actually mmap() the file, you could probably
manage it in a fashion pretty similar to what you had in mind for wamalloc,
or some other setup that minimizes locking. In particular, ISTM that if we
want this to be usable for parallel query, we'll want to be able to have one
process streaming data in while another process streams data out, with
minimal interference between these two activities. On the other hand, for
processes that only send and receive messages occasionally, this might just
be overkill (and overhead). You'd be just as well off wrapping the access to
the file in an LWLock: the reader takes the lock, reads the data, marks it
read, and releases the lock. The writer takes the lock, writes data, and
releases the lock.

It almost seems to me that there are two different kinds of messages here:
control messages and data messages. Control messages are things like "vacuum
this database!" or "flush your cache!" or "execute this query and send the
results to backend %d!" or "cancel the currently executing query!". They are
relatively small (in some cases, fixed-size), relatively low-volume, don't
need complex locking, and can generally be processed serially but with high
priority. Data messages are streams of tuples, either from a remote database
from which we are replicating, or between backends that are executing a
parallel query. These messages may be very large and extremely high-volume,
are very sensitive to concurrency problems, but are not high-priority. We
want to process them as quickly as possible, of course, but the work may get
interrupted by control messages. Another point is that it's reasonable, at
least in the case of parallel query, for the action of sending a data message
to *block*. If one part of the query is too far ahead of the rest of the
query, we don't want to queue up results forever, perhaps using CPU or I/O
resources that some other backend needs to catch up, exhausting available
disk space, etc. Instead, at some point, we just block and wait for the queue
to drain. I suppose there's no avoiding the possibility that sending a
control message might also block, but certainly we wouldn't like a control
message to block because the relevant queue is full of data messages.
So I kind of wonder whether we ought to have two separate systems, one for
data and one for control, with somewhat different characteristics. I notice
that one of your bg worker patches is for OOO-messages. I apologize again for
not having read through it, but how much does that resemble separating the
control and data channels?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
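To make the pg_messages/%d idea a little more concrete, here is a minimal
sketch of how a receiver (or a writer) might map such a per-backend message
file; the path, the 16MB size and the lack of error handling are assumptions
for illustration only, not part of any existing patch.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSG_AREA_SIZE (16 * 1024 * 1024)    /* illustrative per-backend size */

static void *
map_message_area(int backend_id)
{
    char        path[64];
    int         fd;
    void       *area;

    snprintf(path, sizeof(path), "pg_messages/%d", backend_id);

    /* the receiver creates its file; writers simply open the same path */
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, MSG_AREA_SIZE) < 0)   /* size the backing file */
    {
        close(fd);
        return NULL;
    }

    /* reader and writers map the same file into their address spaces */
    area = mmap(NULL, MSG_AREA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                              /* the mapping survives the close */

    return (area == MAP_FAILED) ? NULL : area;
}

Whether the written data ever hits disk is then up to the kernel's page
cache, which is the point made later in the thread about mmap()'d files not
necessarily implying disk I/O.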
Robert,

On 08/26/2010 02:44 PM, Robert Haas wrote:
> I dunno. It was just a thought. I haven't actually looked at the
> code to see how much synergy there is. (Sorry, been really busy...)

No problem, was just wondering if there's any benefit you had in mind.

> On the more general topic of imessages, I had one other thought that
> might be worth considering. Instead of using shared memory, what
> about using a file that is shared between the sender and receiver?

What would that buy us? (At the price of more system calls and disk I/O)?
Remember that the current approach (IIRC) uses exactly one syscall to send a
message: kill() to send the (multiplexed) signal. (Except on strange
platforms or setups that don't have a user-space spinlock implementation and
need to use system mutexes).

> So
> for example, perhaps each receiver will read messages from a file
> called pg_messages/%d, where %d is the backend ID. And writers will
> write into that file. Perhaps both readers and writers mmap() the
> file, or perhaps there's a way to make it work with just read() and
> write(). If you actually mmap() the file, you could probably manage
> it in a fashion pretty similar to what you had in mind for wamalloc,
> or some other setup that minimizes locking.

That would still require proper locking, then. So I'm not seeing the benefit.

> In particular, ISTM that
> if we want this to be usable for parallel query, we'll want to be able
> to have one process streaming data in while another process streams
> data out, with minimal interference between these two activities.

That's entirely possible with the current approach. About the only limitation
is that a receiver can only consume the messages in the order they got into
the queue. But pretty much any backend can send messages to any other backend
concurrently. (Well, except that I think there currently are bugs in
wamalloc).

> On
> the other hand, for processes that only send and receive messages
> occasionally, this might just be overkill (and overhead). You'd be
> just as well off wrapping the access to the file in an LWLock: the
> reader takes the lock, reads the data, marks it read, and releases the
> lock. The writer takes the lock, writes data, and releases the lock.

The current approach uses plain spinlocks, which are more efficient. Note
that both, appending as well as removing from the queue are writing
operations, from the point of view of the queue. So I don't think LWLocks buy
you anything here, either.

> It almost seems to me that there are two different kinds of messages
> here: control messages and data messages. Control messages are things
> like "vacuum this database!" or "flush your cache!" or "execute this
> query and send the results to backend %d!" or "cancel the currently
> executing query!". They are relatively small (in some cases,
> fixed-size), relatively low-volume, don't need complex locking, and
> can generally be processed serially but with high priority. Data
> messages are streams of tuples, either from a remote database from
> which we are replicating, or between backends that are executing a
> parallel query. These messages may be very large and extremely
> high-volume, are very sensitive to concurrency problems, but are not
> high-priority. We want to process them as quickly as possible, of
> course, but the work may get interrupted by control messages. Another
> point is that it's reasonable, at least in the case of parallel query,
> for the action of sending a data message to *block*.
> If one part of the query is too far ahead of the rest of the query, we
> don't want to queue up results forever, perhaps using CPU or I/O resources
> that some other backend needs to catch up, exhausting available disk
> space, etc.

I agree that such a thing isn't currently covered. And it might be useful.
However, adding two separate queues with different priority would be very
simple to do. (Note, however, that there already are the standard unix
signals for very simple kinds of control signals. I.e. for aborting a
parallel query, you could simply send SIGINT to all background workers
involved).

I understand the need to limit the amount of data in flight, but I don't
think that sending any type of message should ever block. Messages are atomic
in that regard. Either they are ready to be delivered (in entirety) or not.
Thus the sender needs to hold back the message if the recipient is
overloaded. (Also note that currently imessages are bound to a maximum size
of around 8 KB).

It might be interesting to note that I've just implemented some kind of
streaming mechanism *atop* imessages for Postgres-R. A data stream gets
fragmented into single messages. As you pointed out, there should be some
kind of congestion control. However, in my case, that needs to cover the
inter-node connection as well, not just imessages. So I think the solution to
that problem needs to be found on a higher level. I.e. in the Postgres-R
case, I want to limit the *overall* amount of recovery data that's pending
for a certain node. Not just the amount that's pending on a certain stream
within the imessages system.

Think of imessages as the IP between processes, while streaming of data needs
something akin to TCP on top of it. (OTOH, this comparison is lacking,
because imessages guarantee reliable and ordered delivery of messages).

BTW: why do you think the data heavy messages are sensitive to concurrency
problems? I found the control messages to be rather more sensitive, as state
changes and timing for those control messages are trickier to deal with.

> So I kind of wonder whether we ought to have two separate systems, one
> for data and one for control, with somewhat different characteristics.
> I notice that one of your bg worker patches is for OOO-messages. I
> apologize again for not having read through it, but how much does that
> resemble separating the control and data channels?

It's something that resides within the coordinator process exclusively and
doesn't have much to do with imessages. Postgres-R doesn't require the GCS to
deliver (certain kinds of) messages in any particular order, it only requires
the GCS to guarantee reliability of message delivery (or notification in the
form of excluding the failing node from the group in case delivery failed).
Thus, the coordinator needs to be able to re-order the messages, because bg
workers need to receive the change sets in the correct order. And imessages
guarantee to maintain the ordering.

The reason for doing this within the coordinator is to a) lower requirements
for the GCS and b) gain more control of the data flow. I.e. congestion
control gets much easier if the coordinator knows the amount of data that's
queued. (As opposed to having lots of TCP connections, each of which queues
an unknown amount of data).

As is evident, all of these decisions are rather Postgres-R centric.
However, I still think the simplicity and the level of generalization of
imessages, dynamic shared memory and to some extent even the background
worker infrastructure make these components potentially re-usable.

Regards

Markus Wanner
Markus Wanner <markus@bluegap.ch> writes:
> On 08/26/2010 02:44 PM, Robert Haas wrote:
>> On the more general topic of imessages, I had one other thought that
>> might be worth considering. Instead of using shared memory, what
>> about using a file that is shared between the sender and receiver?

> What would that buy us?

Not having to have a hard limit on the space for unconsumed messages?

> The current approach uses plain spinlocks, which are more efficient.

Please note the coding rule that says that the code should not execute more
than a few straight-line instructions while holding a spinlock. If you're
copying long messages while holding the lock, I don't think spinlocks are
acceptable.

regards, tom lane
On 08/26/2010 09:22 PM, Tom Lane wrote:
> Not having to have a hard limit on the space for unconsumed messages?

Ah, I see. However, spilling to disk is unwanted for the current use cases of
imessages. Instead the sender needs to be able to deal with
out-of-(that-specific-part-of-shared)-memory conditions.

> Please note the coding rule that says that the code should not execute
> more than a few straight-line instructions while holding a spinlock.
> If you're copying long messages while holding the lock, I don't think
> spinlocks are acceptable.

Writing the payload data for imessages to shared memory doesn't need any kind
of lock. (Because the relevant chunk of shared memory got allocated via
wamalloc, which grants the allocating process exclusive control over the
returned chunk). Only appending and removing (the pointer to the data) to and
from the queue requires taking a spinlock. And I think that still qualifies.

However, your concern is valid for wamalloc, which is more critical in that
regard.

Regards

Markus
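To illustrate the split being described here, a minimal sketch follows: the
payload is written into its privately owned chunk without any lock, and only
the pointer enqueue runs under the spinlock. The struct layout and function
name are illustrative placeholders, not the actual imessages code.

#include "postgres.h"
#include "storage/spin.h"

typedef struct IMessage
{
    struct IMessage *next;      /* link within the recipient's queue */
    int         type;
    int         size;
    /* payload data follows the header */
} IMessage;

typedef struct IMessageQueue
{
    slock_t     lock;           /* protects head and tail only */
    IMessage   *head;
    IMessage   *tail;
} IMessageQueue;

static void
imessage_enqueue(IMessageQueue *queue, IMessage *msg)
{
    /*
     * The payload was already written into msg's chunk lock-free; the chunk
     * belongs exclusively to the sender until it is enqueued here.
     */
    msg->next = NULL;

    SpinLockAcquire(&queue->lock);      /* only a handful of instructions */
    if (queue->tail != NULL)
        queue->tail->next = msg;
    else
        queue->head = msg;
    queue->tail = msg;
    SpinLockRelease(&queue->lock);
}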
On Thu, Aug 26, 2010 at 3:03 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> On the more general topic of imessages, I had one other thought that
>> might be worth considering. Instead of using shared memory, what
>> about using a file that is shared between the sender and receiver?
>
> What would that buy us? (At the price of more system calls and disk I/O)?
> Remember that the current approach (IIRC) uses exactly one syscall to send a
> message: kill() to send the (multiplexed) signal. (Except on strange
> platforms or setups that don't have a user-space spinlock implementation and
> need to use system mutexes).

It wouldn't require you to preallocate a big chunk of shared memory without
knowing how much of it you'll actually need. For example, suppose we
implement parallel query. If the message queues can be allocated on the fly,
then you can just say maximum_message_queue_size_per_backend = 16MB and
that'll probably be good enough for most installations. On systems where
parallel query is not used (e.g. because they only have 1 or 2 processors),
it costs nothing. On systems where parallel query is used extensively (e.g.
because they have 32 processors), you'll allocate enough space for the number
of backends that actually need message buffers, and not more than that.
Furthermore, if parallel query is used at some times (say, for nightly
reporting) but not others (say, for daily OLTP queries), the buffers can be
deallocated when the helper backends exit (or paged out if they are idle),
and that memory can be reclaimed for other use.

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER. Being able to change GUCs without
shutting down the postmaster is a *big deal* for people running in 24x7
operations. Even things like wal_level that aren't apt to be changed more
than once in a blue moon are a problem (once you go from "not having a
standby" to "having a standby", you're unlikely to want to go backwards), and
this would likely need more tweaking. You might find that you need more
memory for better throughput, or that you need to reclaim memory for other
purposes. Especially if it's a hard allocation for any number of backends,
rather than something that backends can allocate only as and when they need
it.

As to efficiency, the process is not much different once the initial setup is
completed. Just because you write to a memory-mapped file rather than a
shared memory segment doesn't mean that you're necessarily doing disk I/O. On
systems that support it, you could also choose to map a named POSIX shm
rather than a disk file. Either way, there might be a little more overhead at
startup but that doesn't seem so bad; presumably the amount of work that the
worker is doing is large compared to the overhead of a few system calls, or
you're probably in trouble anyway, since our process startup overhead is
pretty substantial already. The only time it seems like the overhead would be
annoying is if a process is going to use this system, but only lightly. Doing
the extra setup just to send one or two messages might suck. But maybe that
just means this isn't the right mechanism for those cases (e.g. the existing
XID-wraparound logic should still use signal multiplexing rather than this
system). I see the value of this as being primarily for streaming big chunks
of data, not so much for sending individual, very short messages.
>> On
>> the other hand, for processes that only send and receive messages
>> occasionally, this might just be overkill (and overhead). You'd be
>> just as well off wrapping the access to the file in an LWLock: the
>> reader takes the lock, reads the data, marks it read, and releases the
>> lock. The writer takes the lock, writes data, and releases the lock.
>
> The current approach uses plain spinlocks, which are more efficient. Note
> that both, appending as well as removing from the queue are writing
> operations, from the point of view of the queue. So I don't think LWLocks
> buy you anything here, either.

I agree that this might not be useful. We don't really have all the message
types defined yet, though, so it's hard to say.

> I understand the need to limit the amount of data in flight, but I don't
> think that sending any type of message should ever block. Messages are
> atomic in that regard. Either they are ready to be delivered (in entirety)
> or not. Thus the sender needs to hold back the message if the recipient is
> overloaded. (Also note that currently imessages are bound to a maximum size
> of around 8 KB).

That's functionally equivalent to blocking, isn't it? I think that's just a
question of what API you want to expose.

> It might be interesting to note that I've just implemented some kind of
> streaming mechanism *atop* imessages for Postgres-R. A data stream gets
> fragmented into single messages. As you pointed out, there should be some
> kind of congestion control. However, in my case, that needs to cover the
> inter-node connection as well, not just imessages. So I think the solution
> to that problem needs to be found on a higher level. I.e. in the Postgres-R
> case, I want to limit the *overall* amount of recovery data that's pending
> for a certain node. Not just the amount that's pending on a certain stream
> within the imessages system.

For replication, that might be the case, but for parallel query, per-queue
seems about right. At any rate, no design we've discussed will let individual
queues grow without bound.

> Think of imessages as the IP between processes, while streaming of data
> needs something akin to TCP on top of it. (OTOH, this comparison is lacking,
> because imessages guarantee reliable and ordered delivery of messages).

You probably need this, but 8KB seems like a pretty small chunk size. I think
one of the advantages of a per-backend area is that you don't need to worry
so much about fragmentation. If you only need in-order message delivery, you
can just use the whole thing as a big ring buffer. There's no padding or
sophisticated allocation needed. You just need a pointer to the last byte
read (P1), the last byte allowed to be read (P2), and the last byte allocated
(P3). Writers take a spinlock, advance P3, release the spinlock, write the
message, take the spinlock, advance P2, release the spinlock, and signal the
reader. Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.

You might still want to fragment chunks of data to avoid problems if, say,
two writers are streaming data to a single reader. In that case, if the
messages were too large compared to the amount of buffer space available, you
might get poor utilization, or even starvation. But I would think you
wouldn't need to worry about that until the message size got fairly high.

> BTW: why do you think the data heavy messages are sensitive to concurrency
> problems?
> I found the control messages to be rather more sensitive, as state
> changes and timing for those control messages are trickier to deal with.

Well, what I was thinking about is the fact that data messages are bigger. If
I'm writing a 16-byte message once a minute and the reader and I block each
other until the message is fully read or written, it's not really that big of
a deal. If the same thing happens when we're trying to continuously stream
tuple data from one process to another, it halves the throughput; we expect
both processes to be reading/writing almost constantly.

>> So I kind of wonder whether we ought to have two separate systems, one
>> for data and one for control, with somewhat different characteristics.
>> I notice that one of your bg worker patches is for OOO-messages. I
>> apologize again for not having read through it, but how much does that
>> resemble separating the control and data channels?
>
> It's something that resides within the coordinator process exclusively and
> doesn't have much to do with imessages.

Oh, OK.

> As is evident, all of these decisions are rather Postgres-R centric.
> However, I still think the simplicity and the level of generalization of
> imessages, dynamic shared memory and to some extent even the background
> worker infrastructure make these components potentially re-usable.

I think unicast messaging is really useful and I really want it, but the
requirement that it be done through dynamic shared memory allocations feels
very uncomfortable to me (as you've no doubt gathered).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
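Spelled out, the ring buffer described above might look roughly like this
(names, the fixed size and the missing wrap-around handling are purely
illustrative); note that the way P2 is advanced below is exactly the part
that needs more thought once several writers share the buffer, which comes up
again later in the thread.

#include "postgres.h"
#include "storage/spin.h"
#include <string.h>

#define RING_SIZE (16 * 1024 * 1024)        /* illustrative per-backend area */

typedef struct MsgRingBuffer
{
    slock_t     lock;
    uint64      p1;             /* last byte read */
    uint64      p2;             /* last byte allowed to be read */
    uint64      p3;             /* last byte allocated */
    char        data[RING_SIZE];
} MsgRingBuffer;

static void
ring_write(MsgRingBuffer *rb, const char *msg, uint64 len)
{
    uint64      start;

    SpinLockAcquire(&rb->lock);
    start = rb->p3;
    rb->p3 += len;                          /* reserve space for this message */
    SpinLockRelease(&rb->lock);

    /* copy outside the lock; wrap-around is not handled in this sketch */
    memcpy(rb->data + (start % RING_SIZE), msg, len);

    SpinLockAcquire(&rb->lock);
    rb->p2 = start + len;                   /* make the message readable */
    SpinLockRelease(&rb->lock);

    /* finally, signal the reader, e.g. via the existing multiplexed signal */
}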
On Thu, Aug 26, 2010 at 3:40 PM, Markus Wanner <markus@bluegap.ch> wrote:
> On 08/26/2010 09:22 PM, Tom Lane wrote:
>> Not having to have a hard limit on the space for unconsumed messages?
>
> Ah, I see. However, spilling to disk is unwanted for the current use cases
> of imessages. Instead the sender needs to be able to deal with
> out-of-(that-specific-part-of-shared)-memory conditions.

Shared memory can be paged out, too, if it's not being used enough to keep
the OS from deciding to evict it. And I/O to a mmap()'d file or shared memory
region can remain in RAM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:
> It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want a
configurable upper limit. So this can be seen as another approach for an
implementation of a dynamic allocator. (Which should be separate from the
exact imessages implementation, just for the sake of modularization already,
IMO).

> In addition, it means that maximum_message_queue_size_per_backend (or
> whatever it's called) can be changed on-the-fly; that is, it can be
> PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution to just
one subsystem (i.e. imessages), I don't find it half as convincing. If you
are saying it *should* be possible to resize shared memory in a portable way,
why not do it for *all* subsystems right away? I still remember Tom saying
it's not something that's doable in a portable way. Why and how should it be
possible on a per-backend basis? How portable is mmap() really? Why don't we
use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed
boundaries or to dynamically allocate the memory available is one thing,
dynamic resizing is another. If the latter is possible, I'm certainly not
opposed to it. (But would still favor dynamic allocation).

> As to efficiency, the process is not much different once the initial
> setup is completed.

I fully agree with that.

I'm more concerned about ease of use for developers. Simply being able to
alloc() from shared memory makes things easier than having to invent a
separate allocation method for every subsystem, again and again (the argument
that people are more used to the multi-threaded model).

> Doing the extra setup just to send one or two messages
> might suck. But maybe that just means this isn't the right mechanism
> for those cases (e.g. the existing XID-wraparound logic should still
> use signal multiplexing rather than this system). I see the value of
> this as being primarily for streaming big chunks of data, not so much
> for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as you
want to send some data (like which database to vacuum), or require the
delivery guarantee (i.e. no single message gets lost, as opposed to signals),
then imessages should be cheap enough.

>> The current approach uses plain spinlocks, which are more efficient. Note
>> that both, appending as well as removing from the queue are writing
>> operations, from the point of view of the queue. So I don't think LWLocks
>> buy you anything here, either.
>
> I agree that this might not be useful. We don't really have all the
> message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it doesn't
matter what kind of message or what size you want to send. For appending or
removing a pointer to or from a message queue, a spinlock seems to be just
the right thing to use.

>> I understand the need to limit the amount of data in flight, but I don't
>> think that sending any type of message should ever block. Messages are
>> atomic in that regard. Either they are ready to be delivered (in entirety)
>> or not. Thus the sender needs to hold back the message if the recipient is
>> overloaded. (Also note that currently imessages are bound to a maximum size
>> of around 8 KB).
>
> That's functionally equivalent to blocking, isn't it?
> I think that's just a question of what API you want to expose.

Hm.. well, yeah, depends on what level you are arguing. The imessages API can
be used in a completely non-blocking fashion. So a process can theoretically
do other work while waiting for messages.

For parallel querying, the helper/worker backends would probably need to
block if the origin backend is not ready to accept more data, yes. However,
making it accept and process another job in the meantime seems hard to do.
But not an imessages problem per se. (Though with the streaming layer I've
mentioned above, that would not be possible, because it blocks).

> For replication, that might be the case, but for parallel query,
> per-queue seems about right. At any rate, no design we've discussed
> will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same
requirement. However, it's certainly something that can be done atop
imessages. I'm unsure if doing it as part of imessages is a good thing or
not. Given the above requirement, I don't currently think so. Using multiple
queues with different priorities, as you proposed, would probably make it
more feasible.

> You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking, as
it's a compromise between latency and overhead, IMO.

I've chosen 8KB so these messages (together with some GCS and other transport
headers) presumably fit into ethernet jumbo frames. I'd argue that you'd want
even smaller chunk sizes for 1500 byte MTUs, because I don't expect the GCS
to do a better job at fragmenting, than we can do in the upper layer (i.e.
without copying data and w/o additional latency when reassembling the
packet). But again, maybe that should be benchmarked, first.

> I think one of the advantages of a per-backend area is that you don't
> need to worry so much about fragmentation. If you only need in-order
> message delivery, you can just use the whole thing as a big ring
> buffer.

Hm.. interesting idea. It's similar to my initial implementation, except that
I had only a single ring-buffer for all backends.

> There's no padding or sophisticated allocation needed. You
> just need a pointer to the last byte read (P1), the last byte allowed
> to be read (P2), and the last byte allocated (P3). Writers take a
> spinlock, advance P3, release the spinlock, write the message, take
> the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to the
queue at any time).

> Readers take the spinlock, read P1 and P2, release the spinlock, read
> the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward the
message. That's a quick pointer dequeue and enqueue exercise ATM.

> You might still want to fragment chunks of data to avoid problems if,
> say, two writers are streaming data to a single reader. In that case,
> if the messages were too large compared to the amount of buffer space
> available, you might get poor utilization, or even starvation. But I
> would think you wouldn't need to worry about that until the message
> size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e.
during a write operation of a transaction that needs to be replicated, the
backend allocates space for a message at the start of the operation, but only
fills it with change set data during processing. That can possibly take quite
a while. Decoupling memory allocation from message queue management allows
doing this without having to copy the data. The same holds true for
forwarding a message.

> Well, what I was thinking about is the fact that data messages are
> bigger. If I'm writing a 16-byte message once a minute and the reader
> and I block each other until the message is fully read or written,
> it's not really that big of a deal. If the same thing happens when
> we're trying to continuously stream tuple data from one process to
> another, it halves the throughput; we expect both processes to be
> reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate allocator
approach doesn't have that problem, because writing itself is fully
parallelized, even to the same recipient.

> I think unicast messaging is really useful and I really want it, but
> the requirement that it be done through dynamic shared memory
> allocations feels very uncomfortable to me (as you've no doubt
> gathered).

Well, I on the other hand am utterly uncomfortable with having a separate
solution for memory allocation per sub-system (and it definitely is an
inherent problem to lots of our subsystems). Given the ubiquity of dynamic
memory allocators, I don't really understand your discomfort.

Thanks for discussing, I always enjoy respectful disagreement.

Regards

Markus Wanner
On Fri, Aug 27, 2010 at 2:17 PM, Markus Wanner <markus@bluegap.ch> wrote:
>> In addition, it means that maximum_message_queue_size_per_backend (or
>> whatever it's called) can be changed on-the-fly; that is, it can be
>> PGC_SIGHUP rather than PGC_POSTMASTER.
>
> That's certainly a point. However, as you are proposing a solution to just
> one subsystem (i.e. imessages), I don't find it half as convincing.

What other subsystems are you imagining servicing with a dynamic allocator?
If there were a big demand for this functionality, we probably would have
been forced to implement it already, but that's not the case. We've already
discussed the fact that there are massive problems with using it for
something like shared_buffers, which is by far the largest consumer of shared
memory.

> If you are saying it *should* be possible to resize shared memory in a
> portable way, why not do it for *all* subsystems right away? I still
> remember Tom saying it's not something that's doable in a portable way.

I think it would be great if we could bring some more flexibility to our
memory management. There are really two layers of problems here. One is
resizing the segment itself, and one is resizing structures within the
segment. As far as I can tell, there is no portable API that can be used to
resize the shm itself. For so long as that remains the case, I am of the
opinion that any meaningful resizing of the objects within the shm is
basically unworkable. So we need to solve that problem first.

There are a couple of possible solutions, which have been discussed here in
the past. One very appealing option is to use POSIX shm rather than sysv shm.
AFAICT, it is possible to portably resize a POSIX shm using ftruncate(),
though I am not sure to what extent this is supported on Windows. One
significant advantage of using POSIX shm is that the default limits for POSIX
shm on many operating systems are much higher than the corresponding limits
for sysv shm; in fact, some people have expressed the opinion that it might
be worth making the switch for that reason alone, since it is no secret that
a default value of 32MB or less for shared_buffers is not enough to get
reasonable performance on many modern systems. I believe, however, that Tom
Lane thinks we need to get a bit more out of it than that to make it
worthwhile. One obstacle to making the switch is that POSIX shm does not
provide a way to fetch the number of processes attached to the shared memory
segment, which is a critical part of our infrastructure to prevent
accidentally running multiple postmasters on the same data directory at the
same time. Consequently, it seems hard to see how we can make that switch
completely. At a minimum, we'll probably need to maintain a small sysv shm
for interlock purposes.

OK, so let's suppose we use POSIX shm for most of the shared memory segment,
and keep only our fixed-size data structures in the sysv shm. Then what?
Well, then we can potentially resize it. Because we are using a process-based
model, this will require some careful gymnastics. Let's say we're growing the
shm. The backend that is initiating the operation will call ftruncate() and
then signal all of the other backends (using a sinval message or a
multiplexed signal or some such mechanism) to unmap and remap the shared
memory segment.
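For concreteness, the grow path just described could look roughly like the
following, assuming the resizable part of shared memory lives in a named
POSIX shm object; the coordination with other backends is reduced to a
comment, and error handling is minimal.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *
grow_and_remap_shm(const char *name, size_t new_size)
{
    int         fd = shm_open(name, O_RDWR, 0);
    void       *addr;

    if (fd < 0)
        return NULL;

    if (ftruncate(fd, new_size) < 0)    /* extend the segment */
    {
        close(fd);
        return NULL;
    }

    /*
     * Here all other backends would be told (sinval message, multiplexed
     * signal, ...) to munmap() and re-mmap() the segment before anyone
     * allocates from the new space.
     */
    addr = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid */

    return (addr == MAP_FAILED) ? NULL : addr;
}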
Any failure to remap the shared memory segment is at least a FATAL for that
backend, and very likely a PANIC, so this had better not be something we plan
to do routinely - for example, we wouldn't want to do this as a way of
adapting to changing load conditions. It would probably be acceptable to do
it in a situation such as a postgresql.conf reload, to accommodate a change
in a server parameter that can't otherwise be changed without a restart,
since the worst case scenario is, well, we have to restart anyway. Once all
that's done, it's safe to start allocating memory from the newly added
portion of the shm.

Conversely, if we want to shrink the shm, the considerations are similar, but
we have to do everything in the opposite order. First, we must ensure that
the portion of the shm we're about to release is unused. Then, we tell all
the backends to unmap and remap it. Once we've confirmed that they have done
so, we ftruncate() it to the new size.

Next, we have to think about how we're going to resize data structures within
this expandable shm. Many of these structures are not things that we can
easily move without bringing the system to a halt. For example, it's
difficult to see how you could change the base address of shared buffers
without ceasing all system activity, at which point there's not really much
advantage over just forcing a restart. Similarly with LWLocks or the
ProcArray. And if you can't move them, then how will you grow them if (as
will likely be the case) there's something immediately following them in
memory? One possible solution is to divide up these data structures into
"slabs". For example, we might imagine allocating shared_buffers in 1GB
chunks. To make this work, we'd need to change the memory layout so that each
chunk would include all of the miscellaneous stuff that we need to do
bookkeeping for that chunk, such as the LWLocks and buffer descriptors. That
doesn't seem completely impossible, but there would be some performance
penalty, because you could no longer index into shared buffers from a single
base offset. Instead, you'd need to determine which chunk contains the buffer
you want, look up the base address for that chunk, and then index into the
chunk. Maybe that overhead wouldn't be significant (or maybe it would); at
any rate, it's not completely free. There's also the problem of handling the
partial chunk at the end, especially if that happens to be the only chunk. I
think the problems for other arrays are similar, or more severe. I can't see,
for example, how you could resize the ProcArray using this approach.

If you want to deallocate a chunk of shared buffers, it's not impossible to
imagine an algorithm for relocating any dirty buffers in the segment to be
deallocated into the remaining available space, and then chucking the ones
that are not dirty. It might not be real cheap, but that's not the same thing
as not possible. On the other hand, changing the backend ID of a process in
flight seems intractable. Maybe it's not. Or maybe there is some other
approach to resizing these data structures that can work, but it's not real
clear to me what it is.

So basically my feeling is that reworking our memory allocation in general,
while possibly worthwhile, is a whole lot of work. If we focus on getting
imessages done in the most direct fashion possible, it seems like the sort of
thing that could get done in six months to a year. If we take the approach of
reworking our whole approach to memory allocation first, I think it will take
several years.
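As a small illustration of the extra indirection the 1GB-slab idea above
would impose on buffer lookups (hypothetical names, assuming one base address
is remembered per mapped slab):

#include "postgres.h"

#define CHUNK_SIZE          ((Size) 1024 * 1024 * 1024)     /* 1GB slabs */
#define BUFFERS_PER_CHUNK   (CHUNK_SIZE / BLCKSZ)

/* hypothetical: filled in at attach time, one entry per mapped slab */
static char *chunk_base[64];

static char *
ChunkedBufferGetBlock(int buf_id)
{
    int         chunk = (int) (buf_id / BUFFERS_PER_CHUNK);
    Size        offset = (Size) (buf_id % BUFFERS_PER_CHUNK) * BLCKSZ;

    /* two steps instead of a single base offset: find the slab, then index */
    return chunk_base[chunk] + offset;
}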
Assuming the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral benefits
out of it that would be nice to have. However, it's definitely a much larger
project.

> Why and how should it be possible on a per-backend basis?

If we're designing a completely new subsystem, we have a lot more design
flexibility, because we needn't worry about interactions with the existing
users of shared memory. Resizing an arena that is only used for imessages is
a lot more straightforward than resizing the main shared memory arena. If you
can't remap the main shared memory chunk, you won't be able to properly clean
up your state while exiting, and so a PANIC is forced. But if you can't remap
the imessages chunk, and particularly if it only contains messages that were
addressed to you, then you should be able to get by with FATAL, which is
certainly a good thing from a system robustness point of view. And you might
not even need to remap it. The main reason (although perhaps not the only
reason) that someone would likely want to vary a global allocation for
parallel query or replication is if they changed from "not using that
feature" to "using it", or perhaps from "using it" to "using it more
heavily". If the allocations are per-backend and can be made on the fly, that
problem goes away.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can still gain
many of the same advantages - in particular, not PANICing if a remap fails,
and being able to resize the thing on the fly. However, I believe that the
implementation will be more complex if the area is not per-backend. Resizing
is almost certainly a necessity in this case, for the reasons discussed
above, and that will have to be done by having all backends unmap and remap
the area in a coordinated fashion, so it will be more disruptive than
unmapping and remapping a message queue for a single backend, where you only
need to worry about the readers and writers for that particular queue. Also,
you now have to worry about fragmentation: a simple ring buffer is great if
you're processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a great
solution.

> How portable is mmap() really? Why don't we use it in Postgres as of now?

I believe that mmap() is very portable, though there are other people on this
list who know more about exotic, crufty platforms than I do. I discussed the
question of why it's not used for our current shared memory segment above -
no nattch interlock.

>> As to efficiency, the process is not much different once the initial
>> setup is completed.
>
> I fully agree with that.
>
> I'm more concerned about ease of use for developers. Simply being able to
> alloc() from shared memory makes things easier than having to invent a
> separate allocation method for every subsystem, again and again (the
> argument that people are more used to the multi-threaded model).

This goes back to my points further up: what else do you think this could be
used for? I'm much less optimistic about this being reusable than you are,
and I'd like to hear some concrete examples of other use cases.

>>> The current approach uses plain spinlocks, which are more efficient. Note
>>> that both, appending as well as removing from the queue are writing
>>> operations, from the point of view of the queue.
>>> So I don't think LWLocks buy you anything here, either.
>>
>> I agree that this might not be useful. We don't really have all the
>> message types defined yet, though, so it's hard to say.
>
> What does the type of lock used have to do with message types? IMO it
> doesn't matter what kind of message or what size you want to send. For
> appending or removing a pointer to or from a message queue, a spinlock seems
> to be just the right thing to use.

Well, it's certainly nice, if you can make it work. I haven't really thought
about all the cases, though. The main advantages of LWLocks are that you can
take them in either shared or exclusive mode, and that you can hold them for
more than a handful of instructions. If we're trying to design a really
*simple* system for message passing, LWLocks might be just right. Take the
lock, read or write the message, release the lock. But it seems like that's
not really the case we're trying to optimize for, so this may be a dead-end.

>> You probably need this, but 8KB seems like a pretty small chunk size.
>
> For node-internal messaging, I probably agree. Would need benchmarking, as
> it's a compromise between latency and overhead, IMO.
>
> I've chosen 8KB so these messages (together with some GCS and other
> transport headers) presumably fit into ethernet jumbo frames. I'd argue that
> you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
> expect the GCS to do a better job at fragmenting, than we can do in the
> upper layer (i.e. without copying data and w/o additional latency when
> reassembling the packet). But again, maybe that should be benchmarked,
> first.

Yeah, probably. I think designing something that works efficiently over a
network is a somewhat different problem than designing something that works
on an individual node, and we probably shouldn't let the designs influence
each other too much.

>> There's no padding or sophisticated allocation needed. You
>> just need a pointer to the last byte read (P1), the last byte allowed
>> to be read (P2), and the last byte allocated (P3). Writers take a
>> spinlock, advance P3, release the spinlock, write the message, take
>> the spinlock, advance P2, release the spinlock, and signal the reader.
>
> That would block parallel writers (i.e. only one process can write to the
> queue at any time).

I feel like there's probably some variant of this idea that works around that
problem. The problem is that when a worker finishes writing a message, he
needs to know whether to advance P2 only over his own message or also over
some subsequent message that has been fully written in the meantime. I don't
know exactly how to solve that problem off the top of my head, but it seems
like it might be possible.

>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>> the data, take the spinlock, advance P1, and release the spinlock.
>
> It would require copying data in case a process only needs to forward the
> message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a single
messaging area rather than one per backend. But I'm not sure I see why we
would need that sort of capability. Why wouldn't you just arrange for the
sender to deliver the message directly to the final recipient?

>> You might still want to fragment chunks of data to avoid problems if,
>> say, two writers are streaming data to a single reader.
In that case, >> if the messages were too large compared to the amount of buffer space >> available, you might get poor utilization, or even starvation. But I >> would think you wouldn't need to worry about that until the message >> size got fairly high. > > Some of the writers in Postgres-R allocate the chunk for the message in > shared memory way before they send the message. I.e. during a write > operation of a transaction that needs to be replicated, the backend > allocates space for a message at the start of the operation, but only fills > it with change set data during processing. That can possibly take quite a > while. So, they know in advance how large the message will be but not what the contents will be? What are they doing? >> I think unicast messaging is really useful and I really want it, but >> the requirement that it be done through dynamic shared memory >> allocations feels very uncomfortable to me (as you've no doubt >> gathered). > > Well, I on the other hand am utterly uncomfortable with having a separate > solution for memory allocation per sub-system (and it definitely is an > inherent problem to lots of our subsystems). Given the ubiquity of dynamic > memory allocators, I don't really understand your discomfort. Well, the fact that something is commonly used doesn't mean it's right for us. Tabula rasa, we might design the whole system differently, but changing it now is not to be undertaken lightly. Hopefully the above comments shed some light on my concerns. In short, (1) I don't want to preallocate a big chunk of memory we might not use, (2) I fear reducing the overall robustness of the system, and (3) I'm uncertain what other systems would be able to leverage a dynamic allocator of the sort you propose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
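As a rough illustration of the three-offset ring described in the message above, here is a minimal single-reader sketch in C. It is not code from the patch or from PostgreSQL: the names are made up, a C11 atomic flag stands in for a PostgreSQL spinlock, and reader signalling is only hinted at in a comment. Note that advancing P2 by the writer's own length is only safe with a single writer; with several concurrent writers this runs straight into the ordering problem mentioned above.

/* Illustrative sketch only -- not from the patch or from PostgreSQL.
 * A single-reader byte ring using the three offsets described above:
 * p1 = last byte read, p2 = last byte allowed to be read, p3 = last
 * byte allocated.  An atomic flag stands in for a spinlock. */
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8192

typedef struct
{
    atomic_flag lock;           /* initialize with ATOMIC_FLAG_INIT */
    size_t      p1, p2, p3;     /* monotonically increasing offsets */
    char        data[RING_SIZE];
} MsgRing;

static void ring_lock(MsgRing *r)   { while (atomic_flag_test_and_set(&r->lock)) ; }
static void ring_unlock(MsgRing *r) { atomic_flag_clear(&r->lock); }

/* Writer: reserve space under the lock, copy outside it, then publish.
 * Advancing p2 by our own length is only correct for a single writer. */
static int
ring_write(MsgRing *r, const char *msg, size_t len)
{
    size_t start;

    ring_lock(r);
    if (RING_SIZE - (r->p3 - r->p1) < len)
    {
        ring_unlock(r);
        return -1;              /* not enough free space */
    }
    start = r->p3;
    r->p3 += len;               /* space reserved */
    ring_unlock(r);

    for (size_t i = 0; i < len; i++)
        r->data[(start + i) % RING_SIZE] = msg[i];

    ring_lock(r);
    r->p2 += len;               /* message is now readable */
    ring_unlock(r);
    /* ... signal the reader here, e.g. via a signal or latch ... */
    return 0;
}

/* Reader: snapshot p1/p2 under the lock, copy outside it, then consume. */
static size_t
ring_read(MsgRing *r, char *out, size_t maxlen)
{
    size_t start, avail;

    ring_lock(r);
    start = r->p1;
    avail = r->p2 - r->p1;
    ring_unlock(r);

    if (avail > maxlen)
        avail = maxlen;
    for (size_t i = 0; i < avail; i++)
        out[i] = r->data[(start + i) % RING_SIZE];

    ring_lock(r);
    r->p1 += avail;             /* bytes consumed, space freed for writers */
    ring_unlock(r);
    return avail;
}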
Hi, On 08/27/2010 10:46 PM, Robert Haas wrote: > What other subsystems are you imagining servicing with a dynamic > allocator? If there were a big demand for this functionality, we > probably would have been forced to implement it already, but that's > not the case. We've already discussed the fact that there are massive > problems with using it for something like shared_buffers, which is by > far the largest consumer of shared memory. Understood. I certainly plan to look into that for a better understanding of the problems those pose for dynamically allocated memory. > I think it would be great if we could bring some more flexibility to > our memory management. There are really two layers of problems here. Full ACK. > One is resizing the segment itself, and one is resizing structures > within the segment. As far as I can tell, there is no portable API > that can be used to resize the shm itself. For so long as that > remains the case, I am of the opinion that any meaningful resizing of > the objects within the shm is basically unworkable. So we need to > solve that problem first. Why should resizing of the objects within the shmem be unworkable? Doesn't my patch(es) prove the exact opposite? Being able to resize "objects" within the shm requires some kind of underlying dynamic allocation. And I rather like to be in control of that allocator than having to deal with two dozen different implementations on different OSes and their libraries. > There are a couple of possible solutions, which have been discussed > here in the past. I currently don't have much interest in dynamic resizing. Being able to resize the overall amount of shared memory on the fly would be nice, sure. But the total amount of RAM in a server changes rather infrequently. Being able to use what's available more efficiently is what I'm interested in. That doesn't need any kind of additional or different OS level support. It's just a matter of making better use of what's available - within Postgres itself. > Next, we have to think about how we're going to resize data structures > within this expandable shm. Okay, that's where I'm getting interested. > Many of these structures are not things > that we can easily move without bringing the system to a halt. For > example, it's difficult to see how you could change the base address > of shared buffers without ceasing all system activity, at which point > there's not really much advantage over just forcing a restart. > Similarly with LWLocks or the ProcArray. I guess that's what Bruce wanted to point out by saying our data structures are mostly "continuous". I.e. not dynamic lists or hash tables, but plain simple arrays. Maybe that's a subjective impression, but I seem to hear complaints about their fixed size and inflexibility quite often. Try to imagine the flexibility that dynamic lists could give us. > And if you can't move them, > then how will you grow them if (as will likely be the case) there's > something immediately following them in memory. One possible solution > is to divide up these data structures into "slabs". For example, we > might imagine allocating shared_buffers in 1GB chunks. Why 1GB and do yet another layer of dynamic allocation within that? The buffers are (by default) 8K, so allocate in chunks of 8K. Or a tiny bit more for all of the book-keeping stuff. 
> To make this > work, we'd need to change the memory layout so that each chunk would > include all of the miscellaneous stuff that we need to do bookkeeping > for that chunk, such as the LWLocks and buffer descriptors. That > doesn't seem completely impossible, but there would be some > performance penalty, because you could no longer index into shared > buffers from a single base offset. AFAICT we currently have three fixed size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table. It's not obvious to me how these data structures should perform better than a dynamically allocated layout. One could rather argue that combining (some of) the bookkeeping stuff with data itself would lead to better locality and thus perform better. > Instead, you'd need to determine > which chunk contains the buffer you want, look up the base address for > that chunk, and then index into the chunk. Maybe that overhead > wouldn't be significant (or maybe it would); at any rate, it's not > completely free. There's also the problem of handling the partial > chunk at the end, especially if that happens to be the only chunk. This sounds way too complicated, yes. Use 8K chunks and most of the problems vanish. > I think the problems for other arrays are similar, or more severe. I > can't see, for example, how you could resize the ProcArray using this > approach. Try not to think in terms of resizing, but dynamic allocation. I.e. being able to resize ProcArray (and thus being able to alter max_connections on the fly) would take a lot more work. Just using the unoccupied space of the ProcArray for other subsystems that need it more urgently could be done much easier. Again, you'd want to allocate a single PGPROC at a time. (And yes, the benefits aren't as significant as for shared_buffers, simply because PGPROC doesn't occupy that much memory). > If you want to deallocate a chunk of shared buffers, it's > not impossible to imagine an algorithm for relocating any dirty > buffers in the segment to be deallocated into the remaining available > space, and then chucking the ones that are not dirty. Please use the dynamic allocator for that. Don't duplicate that again. Those allocators are designed for efficiently allocating small chunks, down to a few bytes. > It might not be > real cheap, but that's not the same thing as not possible. On the > other hand, changing the backend ID of a process in flight seems > intractable. Maybe it's not. Or maybe there is some other approach > to resizing these data structures that can work, but it's not real > clear to me what it is. Changing to a dynamically allocated memory model certainly requires some thought and lots of work. Yes. It's not for free. > So basically my feeling is that reworking our memory allocation in > general, while possibly worthwhile, is a whole lot of work. Exactly. > If we > focus on getting imessages done in the most direct fashion possible, > it seems like the sort of things that could get done in six months to > a year. Well, it works for Postgres-R as it is, so imessages already exists without a single additional month. And I don't intend to change it back to something that couldn't use a dynamic allocator. I already run into too many problems that way, see below. > If we take the approach of reworking our whole approach to > memory allocation first, I think it will take several years. 
Assuming > the problems discussed above aren't totally intractable, I'd be in > favor of solving them, because I think we can get some collateral > benefits out of it that would be nice to have. However, it's > definitely a much larger project. Agreed. > If the allocations are > per-backend and can be made on the fly, that problem goes away. That might hold true for imessages, which simply lose importance once the (recipient) backend vanishes. But for other shared memory stuff, that would rather complicate shared memory access. > As long as we keep the shared memory area used for imessages/dynamic > allocation separate from, and independent of, the main shm, we can > still gain many of the same advantages - in particular, not PANICing > if a remap fails, and being able to resize the thing on the fly. Separate sub-system allocators, separate code, separate bugs, lots more work. Please not. KISS. > However, I believe that the implementation will be more complex if the > area is not per-backend. Resizing is almost certainly a necessity in > this case, for the reasons discussed above I disagree and see the main reason in making better use of the available resources. Resizing will lose lots of importance, once you can dynamically adjust boundaries between subsystems' use of the single, huge, fixed-size shmem chunk allocated at start. > and that will have to be > done by having all backends unmap and remap the area in a coordinated > fashion, That's assuming resizing capability. > so it will be more disruptive than unmapping and remapping a > message queue for a single backend, where you only need to worry about > the readers and writers for that particular queue. And that's assuming a separate allocation method for the imessages sub-system. > Also, you now have > to worry about fragmentation: a simple ring buffer is great if you're > processing messages on a FIFO basis, but when you have multiple > streams of messages with different destinations, it's probably not a > great solution. Exactly, that's where dynamic allocation shows its real advantages. No silly ring buffers required. > This goes back to my points further up: what else do you think this > could be used for? I'm much less optimistic about this being reusable > than you are, and I'd like to hear some concrete examples of other use > cases. Sure. And well understood. I'll take a try at converting shared_buffers. > Well, it's certainly nice, if you can make it work. I haven't really > thought about all the cases, though. The main advantages of LWLocks > is that you can take them in either shared or exclusive mode As mentioned, the message queue has write accesses exclusively (enqueue and dequeue), so that's unneeded overhead. > and that > you can hold them for more than a handful of instructions. Neither of the two operations needs more than a handful of instructions, so that's plain overhead as well. > If we're > trying to design a really *simple* system for message passing, LWLocks > might be just right. Take the lock, read or write the message, > release the lock. That's exactly how easy it is *with* the dynamic allocator: take the (even simpler) spin lock, enqueue (or dequeue) the message, release the lock again. No locking required for writing or reading the message. Independent (and well multi-process capable / safe) alloc and free routines for memory management. These get called *before* writing the message and *after* reading it.
Entangling memory allocation with queue management is a lot more complicated to design and understand, and less efficient.
(Sorry, need to disable Ctrl-Return, which quite often sends mails earlier than I really want.. continuing my mail) On 08/27/2010 10:46 PM, Robert Haas wrote: > Yeah, probably. I think designing something that works efficiently > over a network is a somewhat different problem than designing > something that works on an individual node, and we probably shouldn't > let the designs influence each other too much. Agreed. Thus I've left out any kind of congestion avoidance stuff from imessages so far. >>> There's no padding or sophisticated allocation needed. You >>> just need a pointer to the last byte read (P1), the last byte allowed >>> to be read (P2), and the last byte allocated (P3). Writers take a >>> spinlock, advance P3, release the spinlock, write the message, take >>> the spinlock, advance P2, release the spinlock, and signal the reader. >> >> That would block parallel writers (i.e. only one process can write to the >> queue at any time). > > I feel like there's probably some variant of this idea that works > around that problem. The problem is that when a worker finishes > writing a message, he needs to know whether to advance P2 only over > his own message or also over some subsequent message that has been > fully written in the meantime. I don't know exactly how to solve that > problem off the top of my head, but it seems like it might be > possible. I've tried pretty much that before. And failed. Because the allocation-order (i.e. the time the message gets created in preparation for writing to it) isn't necessarily the same as the sending-order (i.e. when the process has finished writing and decides to send the message). To satisfy the FIFO property WRT the sending order, you need to decouple allocation form the ordering (i.e. queuing logic). (And yes, it has taken me a while to figure out what's wrong in Postgres-R, before I've even noticed about that design bug). >>> Readers take the spinlock, read P1 and P2, release the spinlock, read >>> the data, take the spinlock, advance P1, and release the spinlock. >> >> It would require copying data in case a process only needs to forward the >> message. That's a quick pointer dequeue and enqueue exercise ATM. > > If we need to do that, that's a compelling argument for having a > single messaging area rather than one per backend. Absolutely, yes. > But I'm not sure I > see why we would need that sort of capability. Why wouldn't you just > arrange for the sender to deliver the message directly to the final > recipient? A process can read and even change the data of the message before forwarding it. Something the coordinator in Postgres-R does sometimes. (As it is the interface to the GCS and thus to the rest of the nodes in the cluster). For parallel querying (on a single node) that's probably less important a feature. > So, they know in advance how large the message will be but not what > the contents will be? What are they doing? Filling the message until it's (mostly) full and then continue with the next one. At least that's how the streaming approach on top of imessages works. But yes, it's somewhat annoying to have to know the message size in advance. I didn't implement realloc so far. Nor can I think of any other solution. Note that separation of allocation and queue ordering is required anyway for the above reasons. > Well, the fact that something is commonly used doesn't mean it's right > for us. Tabula raza, we might design the whole system differently, > but changing it now is not to be undertaken lightly. 
Hopefully the > above comments shed some light on my concerns. In short, (1) I don't > want to preallocate a big chunk of memory we might not use, Isn't that exactly what we do now for lots of sub-systems, and what I'd like to improve (i.e. reduce to a single big chunk)? > (2) I fear > reducing the overall robustness of the system, and Well, that applies to pretty much every new feature you add. > (3) I'm uncertain > what other systems would be able to leverage a dynamic allocator of the > sort you propose. Okay, that's up to me to show evidence (or at least a PoC). Regards Markus Wanner
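To make the "allocate, fill, then enqueue a pointer" scheme from the last two messages concrete, here is a minimal sketch in C. It is not the actual Postgres-R imessages API: the names are invented, malloc() stands in for the dynamic shared memory allocator, and a C11 atomic flag stands in for the spinlock. The point it illustrates is that allocation is decoupled from queue order, and that sending, receiving and forwarding are all just pointer operations under a short spinlock hold.

/* Illustrative sketch only -- not the real imessages code. */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct IMessage
{
    struct IMessage *next;
    int              sender;
    size_t           len;
    char             payload[];   /* filled in by the sender, possibly much later */
} IMessage;

typedef struct
{
    atomic_flag lock;              /* stand-in for a spinlock */
    IMessage   *head;
    IMessage   *tail;
} IMessageQueue;

static void q_lock(IMessageQueue *q)   { while (atomic_flag_test_and_set(&q->lock)) ; }
static void q_unlock(IMessageQueue *q) { atomic_flag_clear(&q->lock); }

/* Allocation is decoupled from queue order: a backend may allocate the
 * message at the start of an operation and fill it in over a long time. */
IMessage *
imsg_alloc(int sender, size_t len)
{
    IMessage   *m = malloc(sizeof(IMessage) + len);   /* dynamic shmem alloc in the real thing */

    m->next = NULL;
    m->sender = sender;
    m->len = len;
    return m;
}

/* Sending is a pointer enqueue under the spinlock; FIFO order is fixed
 * here, at send time, not at allocation time. */
void
imsg_send(IMessageQueue *q, IMessage *m)
{
    q_lock(q);
    m->next = NULL;
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    q_unlock(q);
    /* ... signal the recipient here ... */
}

/* Receiving dequeues the pointer; forwarding a message to another queue
 * is the same pointer operation, with no copying of the payload. */
IMessage *
imsg_receive(IMessageQueue *q)
{
    IMessage   *m;

    q_lock(q);
    m = q->head;
    if (m)
    {
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    q_unlock(q);
    return m;                      /* caller reads the payload, then frees it */
}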
Markus Wanner <markus@bluegap.ch> writes: > AFAICT we currently have three fixed size blocks to manage shared > buffers: the buffer blocks themselves, the buffer descriptors, the > strategy status (for the freelist) and the buffer lookup table. > It's not obvious to me how these data structures should perform better > than a dynamically allocated layout. Let me just point out that awhile back we got a *measurable* performance boost by eliminating a single indirect fetch from the buffer addressing code path. We used to have an array of pointers pointing to the actual buffers, and we removed that in favor of assuming the buffers were laid out in a contiguous array, so that the address of buffer N could be computed with a shift-and-add, eliminating the pointer fetch. I forget exactly what the numbers were, but it was significant enough to make us change it. So I don't have any faith in untested assertions that we can convert these data structures to use dynamic allocation with no penalty. It's very difficult to see how you'd do that without introducing a new layer of indirection, and our experience shows that that layer will cost you. regards, tom lane
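For readers unfamiliar with the addressing detail referred to above, the difference can be sketched as below. BLCKSZ and a contiguous BufferBlocks array do exist in PostgreSQL; the "old" pointer array and the chunked variant (e.g. the 1GB slabs discussed earlier in the thread) use made-up names and are only meant to contrast the number of memory fetches per buffer access.

/* Illustrative sketch only; simplified relative to the real bufmgr code. */
#include <stddef.h>

#define BLCKSZ 8192
typedef char *Block;

/* Former layout: an array of pointers, i.e. one extra indirect fetch. */
extern Block *BufferBlockPointers;                   /* hypothetical name */
#define BufferGetBlockOld(buf_id) (BufferBlockPointers[(buf_id)])

/* Current layout: one contiguous slab, so the address of buffer N is a
 * shift-and-add from a single base pointer. */
extern char *BufferBlocks;
#define BufferGetBlockContiguous(buf_id) \
    ((Block) (BufferBlocks + ((size_t) (buf_id)) * BLCKSZ))

/* A chunked layout (e.g. 1GB slabs) would reintroduce an indirection:
 * find the chunk, then index within it. */
#define BUFS_PER_CHUNK ((1024 * 1024 * 1024) / BLCKSZ)
extern char *BufferChunkBase[];                      /* hypothetical name */
#define BufferGetBlockChunked(buf_id) \
    ((Block) (BufferChunkBase[(buf_id) / BUFS_PER_CHUNK] + \
              ((size_t) ((buf_id) % BUFS_PER_CHUNK)) * BLCKSZ))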
Hi, On 08/30/2010 04:52 PM, Tom Lane wrote: > Let me just point out that awhile back we got a *measurable* performance > boost by eliminating a single indirect fetch from the buffer addressing > code path. I'll take a look a that, thanks. > So I don't have any faith in untested assertions Neither do I. Thus I'm probably going to try my approach. Regards Markus Wanner
On Mon, Aug 30, 2010 at 11:30 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 08/30/2010 04:52 PM, Tom Lane wrote: >> Let me just point out that awhile back we got a *measurable* performance >> boost by eliminating a single indirect fetch from the buffer addressing >> code path. > > I'll take a look a that, thanks. > >> So I don't have any faith in untested assertions > Neither do I. Thus I'm probably going to try my approach. As a matter of project management, I am inclined to think that until we've hammered out this issue, there's not a whole lot useful that can be done on any of the BG worker patches. So I am wondering if we should set those to Returned with Feedback or bump them to a future CommitFest. The good news is that, after a lot of back and forth, I think we've identified the reason underpinning much of why Markus and I have been disagreeing about dynshmem and imessages - namely, whether or not it's possible to allocate shared_buffers as something other than one giant slab without taking an unacceptable performance hit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/14/2010 06:26 PM, Robert Haas wrote: > As a matter of project management, I am inclined to think that until > we've hammered out this issue, there's not a whole lot useful that can > be done on any of the BG worker patches. So I am wondering if we > should set those to Returned with Feedback or bump them to a future > CommitFest. I agree in general. I certainly don't want to hold back the commit fest. What bugs me a bit is that I didn't really get much feedback regarding the *bgworker* portion of code. Especially as that's the part I'm most interested in feedback. However, I currently don't have any time to work on these patches, so I'm fine with dropping them from the current commit fest. > The good news is that, after a lot of back and forth, I think we've > identified the reason underpinning much of why Markus and I have been > disagreeing about dynshmem and imessages - namely, whether or not it's > possible to allocate shared_buffers as something other than one giant > slab without taking an unacceptable performance hit. Agreed. Regards Markus Wanner
Excerpts from Markus Wanner's message of mar sep 14 12:56:59 -0400 2010: > What bugs me a bit is that I didn't really get much feedback regarding > the *bgworker* portion of code. Especially as that's the part I'm most > interested in feedback. I think we've had enough problems with the current design of forking a new autovac process every once in a while, that I'd like to have them as permanent processes instead, waiting for orders from the autovac launcher. From that POV, bgworkers would make sense. I cannot promise a timely review however :-( -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > I think we've had enough problems with the current design of forking a > new autovac process every once in a while, that I'd like to have them as > permanent processes instead, waiting for orders from the autovac > launcher. From that POV, bgworkers would make sense. That seems like a fairly large can of worms to open: we have never tried to make backends switch from one database to another, and I don't think I'd want to start such a project with autovac. regards, tom lane
Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > I think we've had enough problems with the current design of forking a > > new autovac process every once in a while, that I'd like to have them as > > permanent processes instead, waiting for orders from the autovac > > launcher. From that POV, bgworkers would make sense. > > That seems like a fairly large can of worms to open: we have never tried > to make backends switch from one database to another, and I don't think > I'd want to start such a project with autovac. Yeah, what I was thinking is that each worker would still die after completing the run, but a new one would be started immediately; it would go to sleep until a new assignment arrived. (What got me into this was the whole latch thing, actually.) This is a very raw idea however, so don't mind me much. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
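A minimal sketch of the life cycle described above - a freshly started worker sleeps until an assignment arrives, performs the run, then dies and is immediately replaced - might look roughly like this. All names except proc_exit() are hypothetical; in PostgreSQL terms the wait would presumably use the latch facility mentioned in the message.

/* Illustrative sketch only; none of these functions exist as such. */
typedef struct
{
    int dboid;                  /* database to vacuum */
} Assignment;

extern void wait_for_assignment(Assignment *a);   /* hypothetical */
extern void perform_vacuum_job(int dboid);        /* hypothetical */
extern void proc_exit(int code);                  /* exists in PostgreSQL (ipc.h) */

void
worker_main(void)
{
    Assignment  a;

    /* Sleep until the launcher hands out an assignment; in the scheme
     * described above this would be a latch wait, not a poll. */
    wait_for_assignment(&a);

    perform_vacuum_job(a.dboid);

    /* Die after completing the run; the launcher immediately starts a
     * fresh worker, which goes back to sleep until the next assignment. */
    proc_exit(0);
}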
Markus Wanner <markus@bluegap.ch> writes: > On 09/14/2010 07:46 PM, Tom Lane wrote: >> That seems like a fairly large can of worms to open: we have never tried >> to make backends switch from one database to another, and I don't think >> I'd want to start such a project with autovac. > They don't. Even with bgworker, every backend stays connected to the > same database. You configure the min and max amounts of idle backends > *per database*. Plus the overall max of background workers, IIRC. So there is a minimum of one avworker per database? That's a guaranteed nonstarter. There are many people with thousands of databases, but no need for thousands of avworkers. I'm also pretty unclear why you speak of min and max numbers of workers when the proposal (AIUI) is to have the workers there always, rather than have them come and go. regards, tom lane
Hi, On 09/14/2010 07:46 PM, Tom Lane wrote: > Alvaro Herrera<alvherre@commandprompt.com> writes: >> I think we've had enough problems with the current design of forking a >> new autovac process every once in a while, that I'd like to have them as >> permanent processes instead, waiting for orders from the autovac >> launcher. From that POV, bgworkers would make sense. Okay, great. > That seems like a fairly large can of worms to open: we have never tried > to make backends switch from one database to another, and I don't think > I'd want to start such a project with autovac. They don't. Even with bgworker, every backend stays connected to the same database. You configure the min and max amounts of idle backends *per database*. Plus the overall max of background workers, IIRC. Regards Markus Wanner
Hi, I'm glad discussion on this begins. On 09/14/2010 07:55 PM, Tom Lane wrote: > So there is a minimum of one avworker per database? Nope, you can set that to 0. You don't *need* to keep idle workers around. > That's a guaranteed > nonstarter. There are many people with thousands of databases, but no > need for thousands of avworkers. Well, yeah, bgworkers are primarily designed to be used for Postgres-R, where you easily get more background workers than normal backends. And having idle backends around waiting for a next job is certainly preferable over having to re-connect every time. I've advertised the bgworker infrastructure for use for parallel querying as well. Again, that use case easily leads to having more background workers than normal backends. And you don't want to wait for them all to re-connect for every query they need to help with. > I'm also pretty unclear why you speak of min and max numbers of workers > when the proposal (AIUI) is to have the workers there always, rather > than have them come and go. This step 1 of the bgworker set of patches turns the av*launcher* (coordinator) into a permanent process (even if autovacuum is off). The background workers can still come and go. However, they don't necessarily *need* to terminate after having done their job. The coordinator controls them and requests new workers or commands idle ones to terminate *as required*. I don't think there's that much different to the current implementation. Setting both, the min and max number of idle bgworkers to 0 should in fact give you the exact same behavior as we currently have: namely to terminate each av/bgworker after its job is done, never having idle workers around. Which might or might not be the optimal configuration for users with lots of databases, that's hard to predict. And it depends a lot on the load distribution over the databases and on how clever the coordinator manages the bgworkers. Regards Markus Wanner
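To make the knobs discussed here concrete, a configuration along the lines of this sub-thread might look roughly as follows. The parameter names follow the ones used in the discussion; the exact GUC names and defaults in the patch may well differ.

# Illustrative postgresql.conf sketch; names and defaults are approximate.

# Per-database pool of idle, pre-connected background workers:
min_idle_background_workers = 0   # 0 = never keep spare workers around
max_idle_background_workers = 0   # 0 = terminate each worker after its job,
                                  #     mimicking current autovacuum behavior

# Overall cap on background workers across all databases:
#max_background_workers = 8       # hypothetical name for the global limit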
On 09/14/2010 08:06 PM, Robert Haas wrote: > One idea I had was to have autovacuum workers stick around for a > period of time after finishing their work. When we need to autovacuum > a database, first check whether there's an existing worker that we can > use, and if so use him. If not, start a new one. If that puts us > over the max number of workers, kill off the one that's been waiting > the longest. But workers will exit anyway if not reused after a > certain period of time. That's pretty close to how bgworkers are implemented now. Except for the need to terminate after a certain period of time. What is that intended to be good for? Especially considering that the avlauncher/coordinator knows the current amount of work (number of jobs) per database. > The idea here would be to try to avoid all the backend startup costs: > process creation, priming the caches, etc. But I'm not really sure > it's worth the effort. I think we need to look for ways to further > reduce the overhead of vacuuming, but this doesn't necessarily seem > like the thing that would have the most bang for the buck. Well, the pressure has simply been bigger for Postgres-R. It should be possible to do benchmarks using Postgres-R and compare against a max_idle_background_workers = 0 configuration that leads to termination and re-connecting for every remote transaction to be applied. However, that's not going to say anything about whether or not it's worth it for autovacuum. Regards Markus Wanner
On Tue, Sep 14, 2010 at 1:56 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Tom Lane's message of mar sep 14 13:46:17 -0400 2010: >> Alvaro Herrera <alvherre@commandprompt.com> writes: >> > I think we've had enough problems with the current design of forking a >> > new autovac process every once in a while, that I'd like to have them as >> > permanent processes instead, waiting for orders from the autovac >> > launcher. From that POV, bgworkers would make sense. >> >> That seems like a fairly large can of worms to open: we have never tried >> to make backends switch from one database to another, and I don't think >> I'd want to start such a project with autovac. > > Yeah, what I was thinking is that each worker would still die after > completing the run, but a new one would be started immediately; it would > go to sleep until a new assignment arrived. (What got me into this was > the whole latch thing, actually.) > > This is a very raw idea however, so don't mind me much. What would be the advantage of that? One idea I had was to have autovacuum workers stick around for a period of time after finishing their work. When we need to autovacuum a database, first check whether there's an existing worker that we can use, and if so use him. If not, start a new one. If that puts us over the max number of workers, kill off the one that's been waiting the longest. But workers will exit anyway if not reused after a certain period of time. The idea here would be to try to avoid all the backend startup costs: process creation, priming the caches, etc. But I'm not really sure it's worth the effort. I think we need to look for ways to further reduce the overhead of vacuuming, but this doesn't necessarily seem like the thing that would have the most bang for the buck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
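A sketch of the reuse policy described in the message above: prefer an idle worker already bound to the target database, otherwise start a new one, evicting the longest-idle worker if the pool is already at its cap. Everything here is hypothetical; none of these types or functions exist in PostgreSQL.

/* Illustrative sketch only. */
#include <time.h>

typedef struct Worker
{
    struct Worker *next;
    int     dboid;              /* database this worker is connected to */
    int     busy;
    time_t  idle_since;
} Worker;

extern Worker *worker_pool;     /* hypothetical list of live workers */
extern int     n_workers;
extern int     max_workers;

extern void    kill_worker(Worker *w);      /* hypothetical */
extern Worker *start_worker(int dboid);     /* hypothetical: fork + connect */

Worker *
get_worker_for(int dboid)
{
    Worker *w;
    Worker *oldest_idle = NULL;

    /* 1. Prefer an idle worker already bound to the target database. */
    for (w = worker_pool; w != NULL; w = w->next)
    {
        if (w->busy)
            continue;
        if (w->dboid == dboid)
            return w;
        if (oldest_idle == NULL || w->idle_since < oldest_idle->idle_since)
            oldest_idle = w;
    }

    /* 2. Otherwise start a new one; if that would exceed the cap, first
     *    kill off the worker that has been idle the longest. */
    if (n_workers >= max_workers && oldest_idle != NULL)
        kill_worker(oldest_idle);

    return start_worker(dboid);
}

/* Separately, an idle worker would exit on its own after some timeout
 * (the "certain period of time" above, e.g. a few minutes). */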
On Tue, Sep 14, 2010 at 2:26 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/14/2010 08:06 PM, Robert Haas wrote: >> One idea I had was to have autovacuum workers stick around for a >> period of time after finishing their work. When we need to autovacuum >> a database, first check whether there's an existing worker that we can >> use, and if so use him. If not, start a new one. If that puts us >> over the max number of workers, kill of the one that's been waiting >> the longest. But workers will exit anyway if not reused after a >> certain period of time. > > That's pretty close to how bgworkers are implemented, now. Except for the > need to terminate after a certain period of time. What is that intended to > be good for? To avoid consuming system resources forever if they're not being used. > Especially considering that the avlauncher/coordinator knows the current > amount of work (number of jobs) per database. > >> The idea here would be to try to avoid all the backend startup costs: >> process creation, priming the caches, etc. But I'm not really sure >> it's worth the effort. I think we need to look for ways to further >> reduce the overhead of vacuuming, but this doesn't necessarily seem >> like the thing that would have the most bang for the buck. > > Well, the pressure has simply been bigger for Postgres-R. > > It should be possible to do benchmarks using Postgres-R and compare against > a max_idle_background_workers = 0 configuration that leads to termination > and re-connecting for ever remote transaction to be applied. Well, presumably that would be fairly disastrous. I would think, though, that you would not have a min/max number of workers PER DATABASE, but an overall limit on the upper size of the total pool - I can't see any reason to limit the minimum size of the pool, but I might be missing something. > However, that's > not going to say anything about whether or not it's worth it for autovacuum. Personally, my position is that if someone does something that is only a small improvement on its own but which has the potential to help with other things later, that's a perfectly legitimate patch and we should try to accept it. But if it's not a clear (even if small) win then the bar is a lot higher, at least in my book. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/14/2010 08:41 PM, Robert Haas wrote: > To avoid consuming system resources forever if they're not being used. Well, what timeout would you choose. And how would you justify it compared to the amounts of system resources consumed by an idle process sitting there and waiting for a job? I'm not against such a timeout, but so far I felt that unlimited would be the best default. > Well, presumably that would be fairly disastrous. I would think, > though, that you would not have a min/max number of workers PER > DATABASE, but an overall limit on the upper size of the total pool That already exists (in addition to the other parameters). > - I > can't see any reason to limit the minimum size of the pool, but I > might be missing something. I tried to mimic what others do, for example apache pre-fork. Maybe it's just another way of trying to keep the overall resource consumption at a reasonable level. The minimum is helpful to eliminate waits for backends starting up. Note here that the coordinator can only request to fork one new bgworker at a time. It then needs to wait until that new bgworker registers with the coordinator, until it can request further bgworkers from the postmaster. (That's due to the limitation in communication between the postmaster and coordinator). > Personally, my position is that if someone does something that is only > a small improvement on its own but which has the potential to help > with other things later, that's a perfectly legitimate patch and we > should try to accept it. But if it's not a clear (even if small) win > then the bar is a lot higher, at least in my book. I don't think it's an improvement over the current autovacuum behavior. Not intended to be one. But it certainly shouldn't hurt, either. It only has the potential to help with other things, namely parallel querying. And of course replication (Postgres-R). Or any other kind of background job you come up with (where background means not requiring a client connection). Regards Markus Wanner
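The one-fork-request-at-a-time constraint mentioned above could be sketched as follows on the coordinator side. The names are stand-ins; only the behavior - request a single fork, wait for the new worker to register, then request the next - is taken from the message, which attributes it to the limited postmaster/coordinator communication.

/* Illustrative sketch only; names are made up. */
static int  fork_request_pending = 0;   /* is one fork request outstanding? */
static int  spare_workers_wanted = 0;   /* how many more we would like */

extern void request_bgworker_fork(void);        /* hypothetical: signal postmaster */

static void
maybe_request_next_fork(void)
{
    /* Only one outstanding request at a time: further requests have to
     * wait until the previously forked worker has registered with the
     * coordinator. */
    if (spare_workers_wanted > 0 && !fork_request_pending)
    {
        request_bgworker_fork();
        fork_request_pending = 1;
    }
}

static void
on_worker_registered(int worker_pid)
{
    (void) worker_pid;                  /* would be recorded in the pool here */
    fork_request_pending = 0;
    if (spare_workers_wanted > 0)
        spare_workers_wanted--;
    maybe_request_next_fork();          /* ask for the next one, if any */
}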
On Tue, Sep 14, 2010 at 2:59 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/14/2010 08:41 PM, Robert Haas wrote: >> >> To avoid consuming system resources forever if they're not being used. > > Well, what timeout would you choose. And how would you justify it compared > to the amounts of system resources consumed by an idle process sitting there > and waiting for a job? > > I'm not against such a timeout, but so far I felt that unlimited would be > the best default. I don't have a specific number in mind. 5 minutes? >> Well, presumably that would be fairly disastrous. I would think, >> though, that you would not have a min/max number of workers PER >> DATABASE, but an overall limit on the upper size of the total pool > > That already exists (in addition to the other parameters). Hmm. So what happens if you have 1000 databases with a minimum of 1 worker per database and an overall limit of 10 workers? >> - I >> can't see any reason to limit the minimum size of the pool, but I >> might be missing something. > > I tried to mimic what others do, for example apache pre-fork. Maybe it's > just another way of trying to keep the overall resource consumption at a > reasonable level. > > The minimum is helpful to eliminate waits for backends starting up. Note > here that the coordinator can only request to fork one new bgworker at a > time. It then needs to wait until that new bgworker registers with the > coordinator, until it can request further bgworkers from the postmaster. > (That's due to the limitation in communication between the postmaster and > coordinator). Hmm, I see. That's probably not helpful for autovacuum, but I can see it being useful for replication. I still think maybe we ought to try to crack the nut of allowing backends to rebind to a different database. That would simplify things here a good deal, although then again maybe it's too complex to be worth it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 09/15/2010 03:44 AM, Robert Haas wrote: > Hmm. So what happens if you have 1000 databases with a minimum of 1 > worker per database and an overall limit of 10 workers? The first 10 databases would get an idle worker. As soon as real jobs arrive, the idle workers on databases that don't have any pending jobs get terminated in favor of the databases for which there are pending jobs. Admittedly, that mechanism isn't too clever, yet. I.e. if there always are enough jobs for one database, the others could starve. With 1000 databases and a max of only 10 workers, chances for having a spare worker for the database that gets the next job are pretty low, yes. But that's the case with the proposed 5 minute timeout as well. Lowering that timeout wouldn't increase the chance. And while it might make the start of a new bgworker quicker in the above mentioned case, I think there's not much advantage over just setting max_idle_background_workers = 0. OTOH such a timeout would be easy enough to implement. The admin would be faced with yet another GUC, though. > Hmm, I see. That's probably not helpful for autovacuum, but I can see > it being useful for replication. Glad to hear. > I still think maybe we ought to try > to crack the nut of allowing backends to rebind to a different > database. That would simplify things here a good deal, although then > again maybe it's too complex to be worth it. Also note that it would re-introduce some of the costs we try to avoid with keeping the connected bgworker around. And in case you afford having at least a few spare bgworkers around per database (i.e. less than 10 or 20 databases), potential savings seem to be negligible again. Regards Markus Wanner
On Wed, Sep 15, 2010 at 2:48 AM, Markus Wanner <markus@bluegap.ch> wrote: >> Hmm. So what happens if you have 1000 databases with a minimum of 1 >> worker per database and an overall limit of 10 workers? > > The first 10 databases would get an idle worker. As soon as real jobs > arrive, the idle workers on databases that don't have any pending jobs get > terminated in favor of the databases for which there are pending jobs. > Admittedly, that mechanism isn't too clever, yet. I.e. if there always are > enough jobs for one database, the others could starve. I haven't scrutinized your code but it seems like the minimum-per-database might be complicating things more than necessary. You might find that you can make the logic simpler without that. I might be wrong, though. I guess the real issue here is whether it's possible to, and whether you're interested in, extracting a committable subset of this work, and if so what that subset should look like. There's sort of a chicken-and-egg problem with large patches; if you present them as one giant monolithic patch, they're too large to review. But if you break them down into smaller patches, it doesn't really fix the problem unless the pieces have independent value. Even in the two years I've been involved in the project, a number of different contributors have gone through the experience of submitting a patch that only made sense if you assumed that the follow-on patch was also going to get accepted, and as no one was willing to assume that, the first patch didn't get committed either. Where people have been able to break things down into a series of small to medium-sized incremental improvements, things have gone more smoothly. For example, Simon was able to get a patch to start the bgwriter during archive recovery committed to 8.4. That didn't have a lot of independent value, but it had some, and it paved the way for Hot Standby in 9.0. Had someone thought of a way to decompose that project into more than two truly independent pieces, I suspect it might have even gone more smoothly (although of course that's an arguable point and YMMV). >> I still think maybe we ought to try >> to crack the nut of allowing backends to rebind to a different >> database. That would simplify things here a good deal, although then >> again maybe it's too complex to be worth it. > > Also note that it would re-introduce some of the costs we try to avoid with > keeping the connected bgworker around. How? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert, On 09/15/2010 07:23 PM, Robert Haas wrote: > I haven't scrutinized your code but it seems like the > minimum-per-database might be complicating things more than necessary. > You might find that you can make the logic simpler without that. I > might be wrong, though. I still think of that as a feature, not something I'd like to simplify away. Are you arguing it could raise the chance of adoption of bgworkers in Postgres? I'm not seeing your point here. > I guess the real issue here is whether it's possible to, and whether > you're interested in, extracting a committable subset of this work, > and if so what that subset should look like. Well, as it doesn't currently provide any real benefit for autovacuum, it depends on how much hackers like to prepare for something like Postgres-R or parallel querying. > There's sort of a > chicken-and-egg problem with large patches; if you present them as one > giant monolithic patch, they're too large to review. But if you break > them down into smaller patches, it doesn't really fix the problem > unless the pieces have independent value. I don't quite get what you are trying to say here. I split the bgworker project from Postgres-R into 6 separate patches. Are you saying that's too few or too many? You are welcome to argue about independent patches, i.e. this patch 1 of 6 (as $SUBJECT implies) might have some value, according to Alvaro. Admittedly, patch 2 of 6 is the biggest and most important chunk of functionality of the whole set. >> Also note that it would re-introduce some of the costs we try to avoid with >> keeping the connected bgworker around. > > How? I'm talking about the cost of connecting to a database (and disconnecting), most probably flushing caches, and very probably some kind of re-registering with the coordinator. Most of what a normal backend does at startup. About the only thing you'd save here is the fork() and very basic process setup. I really doubt that's worth the effort. Having some more idle processes around doesn't hurt that much and solves the problem, I think. Thanks for your feedback. Regards Markus Wanner
On Wed, Sep 15, 2010 at 2:28 PM, Markus Wanner <markus@bluegap.ch> wrote: >> I guess the real issue here is whether it's possible to, and whether >> you're interested in, extracting a committable subset of this work, >> and if so what that subset should look like. > > Well, as it doesn't currently provide any real benefit for autovacuum, it > depends on how much hackers like to prepare for something like Postgres-R or > parallel querying. I think that the bar for committing to another in-core replication solution right now is probably fairly high. I am pretty doubtful that our current architecture is going to get us to the full feature set we'd eventually like to have - multi-master, partial replication, etc. But we're not ever going to have ten replication solutions in core, so we need to think pretty carefully about what we accept. That conversation probably needs to start from the other end - is the overall architecture correct for us? - before we get down to specific patches. On the other hand, I'm very interested in laying the groundwork for parallel query, and I think there are probably a number of bits of architecture, both from this project and Postgres-XC, that could be valuable contributions to PostgreSQL; however, in neither case do I expect them to be accepted without significant modification. >> There's sort of a >> chicken-and-egg problem with large patches; if you present them as one >> giant monolithic patch, they're too large to review. But if you break >> them down into smaller patches, it doesn't really fix the problem >> unless the pieces have independent value. > > I don't quite get what you are trying to say here. I split the bgworker > project from Postgres-R into 6 separate patches. Are you saying that's too > few or too many? I'm saying it's hard to think about committing any of them because they aren't really independent of each other or of other parts of Postgres-R. I feel like there is an antagonistic thread to this conversation, and some others that we've had. I hope I'm misreading that, because it's not my intent to piss you off. I'm just offering my honest feedback. Your mileage may vary; others may feel differently; none of it is personal. >>> Also note that it would re-introduce some of the costs we try to avoid >>> with >>> keeping the connected bgworker around. >> >> How? > > I'm talking about the cost of connecting to a database (and disconnecting), > most probably flushing caches, and very probably some kind of re-registering > with the coordinator. Most of what a normal backend does at startup. About > the only thing you'd save here is the fork() and very basic process setup. I > really doubt that's worth the effort. > > Having some more idle processes around doesn't hurt that much and solves the > problem, I think. OK, I think I understand what you're trying to say now. I guess I feel like the ideal architecture for any sort of solution that needs a pool of workers would be to keep around the workers that most recently proved to be useful. Upon needing a new worker, you look for one that's available and already bound to the correct database. If you find one, you assign him to the new task. If not, you find the one that's been idle longest and either (a) kill him off and start a new one that is bound to the correct database or, even better, (b) tell him to flush his caches and rebind to the correct database. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 09/15/2010 08:54 PM, Robert Haas wrote: > I think that the bar for committing to another in-core replication > solution right now is probably fairly high. I'm not trying to convince you to accept the Postgres-R patch.. at least not now. <showing-off> BTW, that'd be what I call a huge patch: bgworkers, excluding dynshmem and imessages: 34 files changed, 2910 insertions(+), 1421 deletions(-) from there to Postgres-R: 98 files changed, 14856 insertions(+), 230 deletions(-) </showing-off> > I am pretty doubtful that > our current architecture is going to get us to the full feature set > we'd eventually like to have - multi-master, partial replication, etc. Would be hard to do, due to the (physical) format of WAL, yes. That's why Postgres-R uses its own (logical) wire format. > But we're not ever going to have ten replication solutions in core, > so we need to think pretty carefully about what we accept. That's very understandable. > That > conversation probably needs to start from the other end - is the > overall architecture correct for us? - before we get down to specific > patches. On the other hand, I'm very interested in laying the > groundwork for parallel query Cool. Maybe we should take another look at bgworkers, as soon as a parallel querying feature gets planned? > and I think there are probably a number > of bits of architecture both from this project and Postgres-XC, that > could be valuable contributions to PostgreSQL; (...note that Postgres-R is license compatible, as opposed to the GPL'ed Postgres-XC project...) > however, in neither > case do I expect them to be accepted without significant modification. Sure, that's understandable as well. I've published this part of the infrastructure to get some feedback as early as possible on that part of Postgres-R. As you can certainly imagine, it's important for me that any modification to such a patch from Postgres-R would still be compatible to what I use it for in Postgres-R and not cripple any functionality there, because that'd probably create more work for me than not getting the patch accepted upstream at all. > I'm saying it's hard to think about committing any of them because > they aren't really independent of each other or of other parts of > Postgres-R. As long as you don't consider imessages and dynshmem a part of Postgres-R, they are independent of the rest of Postgres-R in the technical sense. And for any kind of parallel querying feature, imessages and dynshmem might be of help as well. So I currently don't see where I could de-couple these patches any further. If you have a specific requirement, please don't hesitate to ask. > I feel like there is an antagonistic thread to this conversation, and > some others that we've had. I hope I'm misreading that, because it's > not my intent to piss you off. I'm just offering my honest feedback. > Your mileage may vary; others may feel differently; none of it is > personal. That's absolutely fine. I'm thankful for your feedback. Also note that I initially didn't even want to add the bgworker patches to the commit fest. I've de-coupled and published these separate from Postgres-R with a) the hope to get feedback (more than for the overall Postgres-R patch) and b) to show others that such a facility exists and is ready to be reused. I didn't really expect them to get accepted to Postgres core at the moment. But the Postgres team normally asks for sharing concepts and ideas as early as possible... > OK, I think I understand what you're trying to say now. 
I guess I > feel like the ideal architecture for any sort of solution that needs a > pool of workers would be to keep around the workers that most recently > proved to be useful. Upon needing a new worker, you look for one > that's available and already bound to the correct database. If you > find one, you assign him to the new task. That's mostly how bgworkers are designed, yes. The min/max idle background worker GUCs allow a loose control over how many spare processes you want to allow hanging around doing nothing. > If not, you find the one > that's been idle longest and either (a) kill him off and start a new > one that is bound to the correct database or, even better, (b) tell > him to flush his caches and rebind to the correct database. Hm.. sorry if I didn't express this more clearly. What I'm trying to say is that (b) isn't worth implementing, because it doesn't offer enough of an improvement over (a). The only saving would be the fork() and some basic process initialization. Being able to re-use a bgworker connected to the correct database already gives you most of the benefit, namely not having to fork() *and* re-connect to the database for every job. Back at the technical issues, let me try to summarize the feedback and what I do with it. In general, there's not much use for bgworkers for just autovacuum as the only background job. I agree. Tom raised the 'lots of databases' issue. I agree that the bgworker infrastructure isn't optimized for such a work load, but argue that it's configurable to not hurt. If bgworkers ever gets accepted upstream, we'd certainly need to discuss about reasonable defaults for the relevant GUCs. Additionally, more cleverness about when to start or stop (spare) workers from the coordinator couldn't hurt. I had a lengthy discussion with Dimitri about whether or not bgworkers could help him with some kind of PgQ daemon. I think we now agree that bgworkers isn't the right tool for that job. You are questioning, whether the min_idle_bgworkers GUC is really necessary. I'm arguing that it is necessary in Postgres-R to cover load spikes, because starting bgworkers is slow. So, overall, I now got quite a bit of feedback. There doesn't seem to be any stumbling block in the general design of bgworkers. So I'll happily continue to use (and refine) bgworkers for Postgres-R. And I'm looking forward to more discussions once parallel querying gets more serious attention. Regards Markus Wanner
On Thu, Sep 16, 2010 at 4:47 AM, Markus Wanner <markus@bluegap.ch> wrote: > <showing-off> > BTW, that'd be what I call a huge patch: > > bgworkers, excluding dynshmem and imessages: > 34 files changed, 2910 insertions(+), 1421 deletions(-) > > from there to Postgres-R: > 98 files changed, 14856 insertions(+), 230 deletions(-) > </showing-off> Yeah, that's huge. :-) >> That >> conversation probably needs to start from the other end - is the >> overall architecture correct for us? - before we get down to specific >> patches. On the other hand, I'm very interested in laying the >> groundwork for parallel query > > Cool. Maybe we should take another look at bgworkers, as soon as a parallel > querying feature gets planned? Well, that will obviously depend somewhat on the wishes of whoever decides to work on parallel query, but it seems reasonable to me. It would be nice to get some pieces of this committed incrementally but as I say I fear there is too much dependency on what might happen later, at least the way things are structured now. >> and I think there are probably a number >> of bits of architecture both from this project and Postgres-XC, that >> could be valuable contributions to PostgreSQL; > > (...note that Postgres-R is license compatible, as opposed to the GPL'ed > Postgres-XC project...) Yeah. +1 for license compatibility. > As you can certainly imagine, it's important for me that any modification to > such a patch from Postgres-R would still be compatible to what I use it for > in Postgres-R and not cripple any functionality there, because that'd > probably create more work for me than not getting the patch accepted > upstream at all. That's an understandable goal, but it may be difficult to achieve. > As long as you don't consider imessages and dynshmem a part of Postgres-R, > they are independent of the rest of Postgres-R in the technical sense. > > And for any kind of parallel querying feature, imessages and dynshmem might > be of help as well. So I currently don't see where I could de-couple these > patches any further. I agree. I've already said my piece on how I think that stuff would need to be reworked to be acceptable, so we might have to agree to disagree on those, especially if your goal is to get something committed that doesn't involve a major rewrite on your end. > I didn't really expect them to get accepted to Postgres core at the moment. > But the Postgres team normally asks for sharing concepts and ideas as early > as possible... Absolutely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Morning, On 09/16/2010 04:26 PM, Robert Haas wrote: > I agree. I've already said my piece on how I think that stuff would > need to be reworked to be acceptable, so we might have to agree to > disagree on those, especially if your goal is to get something > committed that doesn't involve a major rewrite on your end. Just for clarification: you are referring to imessages and dynshmem here, right? I agree that dynshmem needs to be reworked and rethought. And imessages simply depends on dynshmem. If you are referring to the bgworker stuff, I'm not quite clear about what I could do to make bgworker more acceptable. (Except perhaps for removing the dependency on imessages). Regards Markus Wanner
On Thu, Sep 16, 2010 at 1:20 PM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/16/2010 04:26 PM, Robert Haas wrote: >> >> I agree. I've already said my piece on how I think that stuff would >> need to be reworked to be acceptable, so we might have to agree to >> disagree on those, especially if your goal is to get something >> committed that doesn't involve a major rewrite on your end. > > Just for clarification: you are referring to imessages and dynshmem here, > right? I agree that dynshmem needs to be reworked and rethought. And > imessages simply depends on dynshmem. Yes, I was referring to imessages and dynshmem. > If you are referring to the bgworker stuff, I'm not quite clear about what I > could do to make bgworker more acceptable. (Except perhaps for removing the > dependency on imessages). I'm not sure, either. It would be nice if there were a way to create a general facility here that we could then build various applications on, but I'm not sure whether that's the case. We had some back-and-forth about what is best for replication vs. what is best for vacuum vs. what is best for parallel query. If we could somehow conceive of a system that could serve all of those needs without introducing any more configuration complexity than what we have now, that would of course be very interesting. But judging by your comments I'm not very sure such a thing is feasible, so perhaps wait-and-see is the best approach. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Hi, On 09/16/2010 07:47 PM, Robert Haas wrote: > It would be nice if there were a way to create > a general facility here that we could then build various applications > on, but I'm not sure whether that's the case. We had some > back-and-forth about what is best for replication vs. what is best for > vacuum vs. what is best for parallel query. If we could somehow > conceive of a system that could serve all of those needs without > introducing any more configuration complexity than what we have now, > that would of course be very interesting. Let's think about this again from a little distance. We have the existing autovacuum and the Postgres-R project. Then there are the potential features 'parallel querying' and 'autonomous transactions' that could in principle benefit from the bgworker infrastructure. For all of those, one could head for a multi-threaded, a multi-process or an async, event-based approach. Multi-threading seems to be out of the question for Postgres. We don't have much of an async event framework anywhere, so at least for parallel querying that seems out of the question as well. Only the 'autonomous transactions' feature seems simple enough to be doable within a single process. That approach would still miss the isolation that a separate process features (not sure that's required, but 'autonomous' sounds like it could be a good thing to have). So assume we use the multi-process approach provided by bgworkers for both potential features. What are the requirements?

autovacuum: only very few jobs at a time, not very resource intensive, not passing around lots of data

Postgres-R: lots of concurrent jobs, easily more than normal backends, depending on the number of nodes in the cluster and the read/write ratio, lots of data to be passed around

parallel querying: a couple dozen concurrent jobs (by number of CPUs or spindles available?), more doesn't help, lots of data to be passed around

autonomous transactions: max. one per normal backend (correct?), way fewer should suffice in most cases, only control data to be passed around

So, for both potential features as well as for autovacuum, a ratio of 1:10 (or even less) for max_bgworkers:max_connections would suffice. Postgres-R clearly seems to be the outlier here. It needs special configuration anyway, so I'd have no problem with defaults that target the other use cases. All of the potential users of bgworkers benefit from a pre-connected bgworker. Meaning having at least one spare bgworker around per database could be beneficial, potentially more depending on how often spike loads occur. As long as there are only a few databases, it's easily possible to have at least one spare process around per database, but with thousands of databases, that might get prohibitively expensive (not sure where the boundary between win vs. lose is, though. Idle backends vs. connection cost). Nonetheless, bgworkers would make the above features easier to implement, as they provide the controlled background worker process infrastructure, including job handling (and even queuing) in the coordinator process. Having spare workers available is not a prerequisite for using bgworkers, it's just an optimization. Autovacuum could possibly benefit from bgworkers by enabling a finer-grained choice of what database and table to vacuum when. I didn't look too much into that, though.
Regarding the additional configuration overhead of the bgworkers patch: autovacuum_max_workers gets turned into max_background_workers, so the only additional GUCs currently are: min_spare_background_workers and max_spare_background_workers (sorry, I thought I named them idle workers, looks like I've gone with spare workers for the GUCs). Those are used to control and limit (in both directions) the number of spare workers (per database). It's the simplest possible variant I could think of. But I'm open to other mechanisms, especially ones that require less configuration. Simply keeping spare workers around for a given timeout *could* be a replacement and would save us one GUC. However, I feel like this gives less control over how the bgworkers are used. For example, I'd prefer to be able to prevent the system from allocating all bgworkers to a single database at once. And as mentioned above, it also makes sense to pre-fork some bgworkers, if there are still enough available. The timeout approach doesn't take care of that, but assumes that the past is a good indicator of use for the future. Hope that sheds some more light on how bgworkers could be useful. Maybe I just need to describe the job handling features of the coordinator better as well? (Simon also requested better documentation...) Regards Markus Wanner
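[For illustration only, here is a minimal sketch of the kind of bookkeeping that the min/max spare-worker GUCs imply for the coordinator. The struct, the constants standing in for the GUCs, and the helper names are all hypothetical; this is not code from the bgworker patch, and forking or terminating a worker is reduced to a counter update and a printf.]

#include <stdio.h>

#define MAX_BACKGROUND_WORKERS       8   /* overall cap, like the GUC */
#define MIN_SPARE_BACKGROUND_WORKERS 1   /* per database */
#define MAX_SPARE_BACKGROUND_WORKERS 3   /* per database */

typedef struct DbPool
{
    const char *dbname;
    int         busy;           /* workers currently running jobs */
    int         idle;           /* spare workers, connected but waiting */
} DbPool;

static int
total_workers(DbPool *pools, int npools)
{
    int         total = 0;

    for (int i = 0; i < npools; i++)
        total += pools[i].busy + pools[i].idle;
    return total;
}

/* What the coordinator might do periodically for each known database. */
static void
adjust_spare_workers(DbPool *pool, DbPool *pools, int npools)
{
    /* fork spare workers while below the per-database minimum ... */
    while (pool->idle < MIN_SPARE_BACKGROUND_WORKERS &&
           total_workers(pools, npools) < MAX_BACKGROUND_WORKERS)
    {
        pool->idle++;           /* stands in for "ask the postmaster to fork" */
        printf("start spare worker for %s\n", pool->dbname);
    }

    /* ... and stop the excess above the per-database maximum */
    while (pool->idle > MAX_SPARE_BACKGROUND_WORKERS)
    {
        pool->idle--;           /* stands in for "signal a spare worker to exit" */
        printf("stop spare worker for %s\n", pool->dbname);
    }
}

int
main(void)
{
    DbPool      pools[2] = {{"accounting", 2, 0}, {"reporting", 0, 5}};

    for (int i = 0; i < 2; i++)
        adjust_spare_workers(&pools[i], pools, 2);
    return 0;
}

[With these two knobs, the "accounting" pool gets one spare pre-forked and the "reporting" pool is trimmed from five spares down to three, while the global cap is never exceeded.]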
On Fri, Sep 17, 2010 at 11:29 AM, Markus Wanner <markus@bluegap.ch> wrote: > autonomous transactions: max. one per normal backend (correct?), way fewer > should suffice in most cases, only control data to be passed around Technically, you could start an autonomous transaction from within an autonomous transaction, so I don't think there's a hard maximum of one per normal backend. However, I agree that the expected case is to not have very many. > All of the potential users of bgworkers benefit from a pre-connected > bgworker. Meaning having at least one spare bgworker around per database > could be beneficial, potentially more depending on how often spike loads > occur. As long as there are only few databases, it's easily possible to have > at least one spare process around per database, but with thousands of > databases, that might get prohibitively expensive (not sure where the > boundary between win vs loose is, though. Idle backends vs. connection > cost). I guess it depends on what your goals are. If you're optimizing for ability to respond quickly to a sudden load, keeping idle backends will probably win even when the number of them you're keeping around is fairly high. If you're optimizing for minimal overall resource consumption, though, you'll not be as happy about that. What I'm struggling to understand is this: if there aren't any preforked workers around when the load hits, how much does it slow things down? I would have thought that a few seconds to ramp up to speed after an extended idle period (5 minutes, say) would be acceptable for most of the applications you mention. Is the ramp-up time longer than that, or is even that much delay unacceptable for Postgres-R, or is there some other aspect to the problem I'm failing to grasp? I can tell you have some experience tuning this so I'd like to try to understand where you're coming from. > However, I feel like this gives less control over how the bgworkers are > used. For example, I'd prefer to be able to prevent the system from > allocating all bgworkers to a single database at once. I think this is an interesting example, and worth some further thought. I guess I don't really understand how Postgres-R uses these bgworkers. Are you replicating one transaction at a time, or how does the data get sliced up? I remember you mentioning sync/async/eager/other replication strategies previously - do you have a pointer to some good reading on that topic? > Hope that sheds some more light on how bgworkers could be useful. Maybe I > just need to describe the job handling features of the coordinator better as > well? (Simon also requested better documentation...) That seems like it would be useful, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert, On 09/17/2010 05:52 PM, Robert Haas wrote: > Technically, you could start an autonomous transaction from within an > autonomous transaction, so I don't think there's a hard maximum of one > per normal backend. However, I agree that the expected case is to not > have very many. Thanks for pointing that out. I somehow knew that was wrong... > I guess it depends on what your goals are. Agreed. > If you're optimizing for > ability to respond quickly to a sudden load, keeping idle backends > will probably win even when the number of them you're keeping around > is fairly high. If you're optimizing for minimal overall resource > consumption, though, you'll not be as happy about that. What resources are we talking about here? Are idle backends really that resource-hungry? My feeling so far has been that idle processes are relatively cheap (i.e. some 100 idle processes shouldn't hurt on a modern server). > What I'm > struggling to understand is this: if there aren't any preforked > workers around when the load hits, how much does it slow things down? As the startup code is pretty much the same as for the current avlauncher, the coordinator can only request one bgworker at a time. This means the signal needs to reach the postmaster, which then forks a bgworker process. That new process starts up, connects to the requested database and then sends an imessage to the coordinator to register. Only after having received that registration can the coordinator request another bgworker (note that this one-at-a-time limitation is overall, not per database). I haven't measured the actual time it takes, but given the use case of a connection pool, I so far thought it's obvious that this process takes too long. (It's exactly what apache pre-fork does, no? Is anybody concerned about the idle processes there? Or do they consume much less resources?) > I would have thought that a few seconds to ramp up to speed after an > extended idle period (5 minutes, say) would be acceptable for most of > the applications you mention. A few seconds? That might be sufficient for autovacuum, but most queries are completed in less than one second. So for parallel querying, autonomous transactions and Postgres-R, I certainly don't think that a few seconds are reasonable. Especially considering the cost of idle backends. > Is the ramp-up time longer than that, > or is even that much delay unacceptable for Postgres-R, or is there > some other aspect to the problem I'm failing to grasp? I can tell you > have some experience tuning this so I'd like to try to understand > where you're coming from. I didn't ever compare to a max_spare_background_workers = 0 configuration, so I don't have any hard numbers, sorry. > I think this is an interesting example, and worth some further > thought. I guess I don't really understand how Postgres-R uses these > bgworkers. The given example doesn't only apply to Postgres-R. But with fewer bgworkers in total, you are more likely to want to use them all for one database, yes. > Are you replicating one transaction at a time, or how does > the data get sliced up? Yes, one transaction at a time. One transaction per backend (bgworker). On a cluster with n nodes that only performs writing transactions, at an average rate of m concurrent transactions per node, you ideally end up having m normal backends and (n-1) * m bgworkers that concurrently apply the remote transactions.
> I remember you mentioning > sync/async/eager/other replication strategies previously - do you have > a pointer to some good reading on that topic? Postgres-R mainly is eager multi-master replication. www.postgres-r.org has some links, most up-to-date my concept paper: http://www.postgres-r.org/downloads/concept.pdf > That seems like it would be useful, too. Okay, will try to come up with something, soon(ish). Thank you for your feedback and constructive criticism. Regards Markus Wanner
On Fri, Sep 17, 2010 at 4:49 PM, Markus Wanner <markus@bluegap.ch> wrote: >> If you're optimizing for >> ability to respond quickly to a sudden load, keeping idle backends >> will probably win even when the number of them you're keeping around >> is fairly high. If you're optimizing for minimal overall resource >> consumption, though, you'll not be as happy about that. > > What resources are we talking about here? Are idle backends really that > resource hungry? My feeling so far has been that idle processes are > relatively cheap (i.e. some 100 idle processes shouldn't hurt on a modern > server). Wow, 100 processes??! Really? I guess I don't actually know how large modern proctables are, but on my MacOS X machine, for example, there are only 75 processes showing up right now in "ps auxww". My Fedora 12 machine has 97. That's including a PostgreSQL instance in the first case and an Apache instance in the second case. So 100 workers seems like a ton to me. >> What I'm >> struggling to understand is this: if there aren't any preforked >> workers around when the load hits, how much does it slow things down? > > As the startup code is pretty much the same as for the current avlauncher, > the coordinator can only request one bgworker at a time. > > This means the signal needs to reach the postmaster, which then forks a > bgworker process. That new process starts up, connects to the requested > database and then sends an imessage to the coordinator to register. Only > after having received that registration, the coordinator can request another > bgworker (note that this is a one-overall limitation, not per database). > > I haven't measured the actual time it takes, but given the use case of a > connection pool, I so far thought it's obvious that this process takes too > long. Maybe that would be a worthwhile exercise... > (It's exactly what apache pre-fork does, no? Is anybody concerned about the > idle processes there? Or do they consume much less resources?) I think the kicker here is the idea of having a certain number of extra workers per database. On my vanilla Apache server on the above-mentioned Fedora 12 VM, there are a total of 10 processes running. I am sure that could balloon to 100 or more under load, but it's not keeping around 100 processes on an otherwise idle system. So if you knew you only had 1 database, keeping around 2 or 3 or 5 or even 10 workers might seem reasonable, but since you might have 1 database or 1000 databases, it doesn't. Keeping 2 or 3 or 5 or 10 workers TOTAL around could be reasonable, but not per-database. As Tom said upthread, we don't want to assume that we're the only thing running on the box and are therefore entitled to take up all the available memory/disk/process slots/whatever. And even if we DID feel so entitled, there could be hundreds of databases, and it certainly doesn't seem practical to keep 1000 workers around "just in case". I don't know whether an idle Apache worker consumes more or less memory than an idle PostgreSQL worker, but another difference between the Apache case and the PostgreSQL case is that presumably all those backend processes have attached shared memory and have ProcArray slots. We know that code doesn't scale terribly well, especially in terms of taking snapshots, and that's one reason why high-volume PostgreSQL installations pretty much require a connection pooler. I think the sizes of the connection pools I've seen recommended are considerably smaller than 100, more like 2 * CPUs + spindles, or something like that. 
It seems like if you actually used all 100 workers at the same time performance might be pretty awful. I was taking a look at the Mammoth Replicator code this week (parenthetical note: I couldn't figure out where mcp_server was or how to set it up) and it apparently has a limitation that only one database in the cluster can be replicated. I'm a little fuzzy on how Mammoth works, but apparently this problem of scaling to large numbers of databases is not unique to Postgres-R. >> Is the ramp-up time longer than that, >> or is even that much delay unacceptable for Postgres-R, or is there >> some other aspect to the problem I'm failing to grasp? I can tell you >> have some experience tuning this so I'd like to try to understand >> where you're coming from. > > I didn't ever compare to a max_spare_background_workers = 0 configuration, > so I don't have any hard numbers, sorry. Hmm, OK. >> I think this is an interesting example, and worth some further >> thought. I guess I don't really understand how Postgres-R uses these >> bgworkers. > > The given example doesn't only apply to Postgres-R. But with fewer bgworkers > in total, you are more likely to want to use them all for one database, yes. > >> Are you replicating one transaction at a time, or how does >> the data get sliced up? > > Yes, one transaction at a time. One transaction per backend (bgworker). On a > cluster with n nodes that has only performs writing transactions, avg. at a > rate of m concurrent transactions/node, you ideally end up having m normal > backends and (n-1) * m bgworkers that concurrently apply the remote > transactions. What is the granularity of replication? Per-database? Per-table? How do you accumulate the change sets? Some kind of bespoke hook, WAL scanning, ...? > Thank you for your feedback and constructive criticism. My pleasure. Interesting stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes: > Wow, 100 processes??! Really? I guess I don't actually know how large > modern proctables are, but on my MacOS X machine, for example, there > are only 75 processes showing up right now in "ps auxww". My Fedora > 12 machine has 97. That's including a PostgreSQL instance in the > first case and an Apache instance in the second case. So 100 workers > seems like a ton to me. The part of that that would worry me is open files. PG backends don't have any compunction about holding open hundreds of files. Apiece. You can dial that down but it'll cost you performance-wise. Last I checked, most Unix kernels still had limited-size FD arrays. And as you say, ProcArray manipulations aren't going to be terribly happy about large numbers of idle backends, either. regards, tom lane
On Fri, Sep 17, 2010 at 11:21:13PM -0400, Robert Haas wrote: [...] > Wow, 100 processes??! Really? I guess I don't actually know how large > modern proctables are, but on my MacOS X machine, for example, there > are only 75 processes showing up right now in "ps auxww". My Fedora > 12 machine has 97. That's including a PostgreSQL instance in the > first case and an Apache instance in the second case. So 100 workers > seems like a ton to me. As an equally unscientific data point, on my box, a typical desktop box (actually a netbook, slow CPU, but beefed up to 2GB RAM), I have 5 PostgreSQL processes running, which take away about 1.2 MB (resident) -- not each one, but together! As a contrast, there is *one* mysql daemon (don't ask!), taking away 17 MB. The worst offenders are, by far, the eye-candy thingies, as one has become accustomed to expect :-( What I wanted to say is that the PostgreSQL processes are unusually light-weight by modern standards. Regards -- tomás
On 09/18/2010 05:43 AM, Tom Lane wrote: > The part of that that would worry me is open files. PG backends don't > have any compunction about holding open hundreds of files. Apiece. > You can dial that down but it'll cost you performance-wise. Last > I checked, most Unix kernels still had limited-size FD arrays. Thank you very much, that's a helpful hint. I did some quick testing and managed to fork up to around 2000 backends, at which point my (laptop) system got unresponsive. To be honest, that really surprises me. (I had to increase the SHM and SEM kernel limits to be able to start Postgres with that many processes at all. Obviously, Linux doesn't seem to like that... on a second test I got a kernel panic) > And as you say, ProcArray manipulations aren't going to be terribly > happy about large numbers of idle backends, either. Very understandable, yes. Regards Markus Wanner
Hi, On 09/18/2010 05:21 AM, Robert Haas wrote: > Wow, 100 processes??! Really? I guess I don't actually know how large > modern proctables are, but on my MacOS X machine, for example, there > are only 75 processes showing up right now in "ps auxww". My Fedora > 12 machine has 97. That's including a PostgreSQL instance in the > first case and an Apache instance in the second case. So 100 workers > seems like a ton to me. Well, Apache pre-forks 5 processes in total (by default, that is; for high-volume webservers a higher MinSpareServers setting is certainly not out of the question), while bgworkers currently need to fork min_spare_background_workers processes per database. AIUI, that's the main problem with the current architecture. >> I haven't measured the actual time it takes, but given the use case of a >> connection pool, I so far thought it's obvious that this process takes too >> long. > > Maybe that would be a worthwhile exercise... On my laptop I'm measuring around 18 bgworker starts per second, i.e. roughly 50 ms per bgworker start. That's certainly just a ball-park figure. One could parallelize the communication channel between the coordinator and postmaster, so as to be able to start multiple bgworkers in parallel, but the initial latency remains. It's certainly quick enough for autovacuum. But equally certainly not acceptable for Postgres-R, where latency is the worst enemy in the first place. For autonomous transactions and parallel querying, I'd also rather not have such latency. > I think the kicker here is the idea of having a certain number of > extra workers per database. Agreed, but I don't see any better way, short of a re-connecting feature. > So > if you knew you only had 1 database, keeping around 2 or 3 or 5 or > even 10 workers might seem reasonable, but since you might have 1 > database or 1000 databases, it doesn't. Keeping 2 or 3 or 5 or 10 > workers TOTAL around could be reasonable, but not per-database. As > Tom said upthread, we don't want to assume that we're the only thing > running on the box and are therefore entitled to take up all the > available memory/disk/process slots/whatever. And even if we DID feel > so entitled, there could be hundreds of databases, and it certainly > doesn't seem practical to keep 1000 workers around "just in case". Agreed. Looks like Postgres-R has a slightly different focus, because if you need multi-master replication, you probably don't have 1000s of databases and/or lots of other services on the same machine. > I don't know whether an idle Apache worker consumes more or less > memory than an idle PostgreSQL worker, but another difference between > the Apache case and the PostgreSQL case is that presumably all those > backend processes have attached shared memory and have ProcArray > slots. We know that code doesn't scale terribly well, especially in > terms of taking snapshots, and that's one reason why high-volume > PostgreSQL installations pretty much require a connection pooler. I > think the sizes of the connection pools I've seen recommended are > considerably smaller than 100, more like 2 * CPUs + spindles, or > something like that. It seems like if you actually used all 100 > workers at the same time performance might be pretty awful. Sounds reasonable, yes.
> I was taking a look at the Mammoth Replicator code this week > (parenthetical note: I couldn't figure out where mcp_server was or how > to set it up) and it apparently has a limitation that only one > database in the cluster can be replicated. I'm a little fuzzy on how > Mammoth works, but apparently this problem of scaling to large numbers > of databases is not unique to Postgres-R. Postgres-R is able to replicate multiple databases. Maybe not thousands, but still designed for it. > What is the granularity of replication? Per-database? Per-table? Currently per-cluster (i.e. all your databases at once). > How do you accumulate the change sets? Logical changes get collected at the heapam level. They get serialized and streamed (via imessages and a group communication system) to all nodes. Application of change sets is highly parallelized and should be pretty efficient. Commit ordering is decided by the GCS to guarantee consistency across all nodes, conflicts get resolved by aborting the later transaction. > Some kind of bespoke hook, WAL scanning, ...? No hooks, please! ;-) Regards Markus Wanner
On Mon, Sep 20, 2010 at 11:30 AM, Markus Wanner <markus@bluegap.ch> wrote: > Well, Apache pre-forks 5 processes in total (by default, that is, for > high volume webservers a higher MinSpareServers setting is certainly not > out of question). While bgworkers currently needs to fork > min_spare_background_workers processes per database. > > AIUI, that's the main problem with the current architecture. Assuming that "the main problem" refers more or less to the words "per database", I agree. >>> I haven't measured the actual time it takes, but given the use case of a >>> connection pool, I so far thought it's obvious that this process takes too >>> long. >> >> Maybe that would be a worthwhile exercise... > > On my laptop I'm measuring around 18 bgworker starts per second, i.e. > roughly 50 ms per bgworker start. That's certainly just a ball-park figure.. Gee, that doesn't seem slow enough to worry about to me. If we suppose that you need 2 * CPUs + spindles processes to fully load the system, that means you should be able to ramp up from zero to consuming every available system resource in under a second; except perhaps on a system with a huge RAID array, which might need 2 or 3 seconds. If you parallelize the worker startup, as you suggest, I'd think you could knock quite a bit more off of this, but why all the worry about startup latency? Once the system is chugging along, none of this should matter very much, I would think. If you need to repeatedly kill off some workers bound to one database and start some new ones to bind to a different database, that could be sorta painful, but if you can actually afford to keep around the workers for all the databases you care about, it seems fine. >> How do you accumulate the change sets? > > Logical changes get collected at the heapam level. They get serialized > and streamed (via imessages and a group communication system) to all > nodes. Application of change sets is highly parallelized and should be > pretty efficient. Commit ordering is decided by the GCS to guarantee > consistency across all nodes, conflicts get resolved by aborting the > later transaction. Neat stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
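[As a back-of-the-envelope check of that estimate, the following tiny program just plugs assumed numbers (8 CPUs, 4 spindles) into the 2 * CPUs + spindles rule of thumb together with the roughly 50 ms per serialized bgworker start measured earlier in this thread; the CPU and spindle counts are pure assumptions, the point is only that even fully serialized startup ramps up in about a second.]

#include <stdio.h>

int
main(void)
{
    int         cpus = 8;           /* assumption */
    int         spindles = 4;       /* assumption */
    double      start_ms = 50.0;    /* ~18 bgworker starts/sec, as measured above */

    int         workers = 2 * cpus + spindles;      /* rule of thumb: 20 workers */
    double      ramp_up_s = workers * start_ms / 1000.0;

    printf("workers needed: %d\n", workers);
    printf("serialized ramp-up: %.1f s\n", ramp_up_s);  /* prints 1.0 s */
    return 0;
}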
Robert, On 09/20/2010 06:57 PM, Robert Haas wrote: > Gee, that doesn't seem slow enough to worry about to me. If we > suppose that you need 2 * CPUs + spindles processes to fully load the > system, that means you should be able to ramp up from zero to > consuming every available system resource in under a second; except > perhaps on a system with a huge RAID array, which might need 2 or 3 > seconds. If you parallelize the worker startup, as you suggest, I'd > think you could knock quite a bit more off of this, but why all the > worry about startup latency? Once the system is chugging along, none > of this should matter very much, I would think. If you need to > repeatedly kill off some workers bound to one database and start some > new ones to bind to a different database, that could be sorta painful, > but if you can actually afford to keep around the workers for all the > databases you care about, it seems fine. Hm.. I see. So in other words, you are saying min_spare_background_workers isn't flexible enough in case one has thousands of databases but only uses a few of them frequently. I understand that reasoning and the wish to keep the number of GUCs as low as possible. I'll try to drop the min_spare_background_workers from the bgworker patches. The rest of the bgworker infrastructure should behave pretty much like what you have described. Parallelism in starting bgworkers could be a nice improvement, especially if we kill the min_spare_background_workers mechanism. > Neat stuff. Thanks. Markus Wanner
On Mon, Sep 20, 2010 at 1:45 PM, Markus Wanner <markus@bluegap.ch> wrote: > Hm.. I see. So in other words, you are saying > min_spare_background_workers isn't flexible enough in case one has > thousands of databases but only uses a few of them frequently. Yes, I think that is true. > I understand that reasoning and the wish to keep the number of GUCs as > low as possible. I'll try to drop the min_spare_background_workers from > the bgworker patches. OK. At least for me, what is important is not only how many GUCs there are but how likely they are to require tuning and how easy it will be to know what the appropriate value is. It seems fairly easy to tune the maximum number of background workers, and it doesn't seem hard to tune an idle timeout, either. Both of those are pretty straightforward trade-offs between, on the one hand, consuming more system resources, and on the other hand, better throughput and/or latency. On the other hand, the minimum number of workers to keep around per-database seems hard to tune. If performance is bad, do I raise it or lower it? And it's certainly not really a hard minimum because it necessarily bumps up against the limit on overall number of workers if the number of databases grows too large; one or the other has to give. I think we need to look for a way to eliminate the maximum number of workers per database, too. Your previous point about not wanting one database to gobble up all the available slots makes sense, but again, it's not obvious how to set this sensibly. If 99% of your activity is in one database, you might want to use all the slots for that database, at least until there's something to do in some other database. I feel like the right thing here is for the number of workers for any given database to fluctuate in some natural way that is based on the workload. If one database has all the activity, it gets all the slots, at least until somebody else needs them. Of course, you need to design the algorithm so as to avoid starvation... > The rest of the bgworker infrastructure should behave pretty much like > what you have described. Parallelism in starting bgworkers could be a > nice improvement, especially if we kill the min_space_background_workers > mechanism. Works for me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/21/2010 02:49 AM, Robert Haas wrote: > OK. At least for me, what is important is not only how many GUCs > there are but how likely they are to require tuning and how easy it > will be to know what the appropriate value is. It seems fairly easy > to tune the maximum number of background workers, and it doesn't seem > hard to tune an idle timeout, either. Both of those are pretty > straightforward trade-offs between, on the one hand, consuming more > system resources, and on the other hand, better throughput and/or > latency. Hm.. I thought of it the other way around. It's more obvious and direct for me to determine a min and max for the number of parallel jobs I want to perform at once. Based on the number of spindles, CPUs and/or nodes in the cluster (in case of Postgres-R). Admittedly, not necessarily per database, but at least overall. I wouldn't know what to set a timeout to. And you didn't make a good argument for any specific value so far. Nor did you offer any reasoning for how to find one. It's certainly very workload and feature specific. > On the other hand, the minimum number of workers to keep > around per-database seems hard to tune. If performance is bad, do I > raise it or lower it? The same applies to the timeout value. > And it's certainly not really a hard minimum > because it necessarily bumps up against the limit on overall number of > workers if the number of databases grows too large; one or the other > has to give. I'd consider the case of min_spare_background_workers * number of databases > max_background_workers to be a configuration error, about which the coordinator should warn. > I think we need to look for a way to eliminate the maximum number of > workers per database, too. Okay, might make sense, yes. Dropping both of these per-database GUCs, we'd simply end up with having max_background_workers around all the time. A timeout would mainly help to limit the max amount of time workers sit around idle. I fail to see how that's more helpful than the proposed min/max. Quite the opposite, it's impossible to get any useful guarantees. It assumes that the workload remains the same over time, but doesn't cope well with sudden spikes and changes in the workload. Unlike the proposed min/max combination, which forks new bgworkers in advance, even if the database already uses lots of them. And after the spike, it quickly reduces the number of spare bgworkers to a certain max. While not perfect, it's definitely more adaptive to the workload (at least in the usual case of having only a few databases). Maybe we need a more sophisticated algorithm in the coordinator. For example, measuring the average number of concurrent jobs per database over time and adjusting the number of idle backends according to that, the current workload and max_background_workers, or some such. The min/max GUCs were simply easier to implement, but I'm open to a more sophisticated thing. Regards Markus Wanner
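[One way such a "more sophisticated algorithm" might look, sketched purely under assumptions: an exponential moving average of the observed number of concurrent jobs per database decides how many spare workers to keep, capped by the overall limit. The smoothing factor and every name below are invented for illustration and are not part of any patch.]

#include <stdio.h>

#define MAX_BACKGROUND_WORKERS 8

/*
 * Track a smoothed estimate of concurrent jobs for one database and derive
 * a target number of spare workers from it.
 */
static double smoothed_jobs = 0.0;

static int
target_spare_workers(int jobs_right_now)
{
    const double alpha = 0.2;   /* smoothing factor: an arbitrary assumption */
    int         target;

    smoothed_jobs = alpha * jobs_right_now + (1.0 - alpha) * smoothed_jobs;

    /* keep roughly the smoothed demand in reserve, within the global cap */
    target = (int) (smoothed_jobs + 0.5);
    if (target > MAX_BACKGROUND_WORKERS)
        target = MAX_BACKGROUND_WORKERS;
    return target;
}

int
main(void)
{
    int         samples[] = {0, 0, 4, 6, 5, 1, 0, 0};   /* a made-up load spike */

    for (int i = 0; i < 8; i++)
        printf("jobs=%d -> keep %d spare workers\n",
               samples[i], target_spare_workers(samples[i]));
    return 0;
}

[Compared to a fixed min/max, such a target ramps up during the spike and decays back toward zero afterwards, at the cost of reacting with some lag.]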
On Tue, Sep 21, 2010 at 4:23 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/21/2010 02:49 AM, Robert Haas wrote: >> OK. At least for me, what is important is not only how many GUCs >> there are but how likely they are to require tuning and how easy it >> will be to know what the appropriate value is. It seems fairly easy >> to tune the maximum number of background workers, and it doesn't seem >> hard to tune an idle timeout, either. Both of those are pretty >> straightforward trade-offs between, on the one hand, consuming more >> system resources, and on the other hand, better throughput and/or >> latency. > > Hm.. I thought of it the other way around. It's more obvious and direct > for me to determine a min and max of the amount of parallel jobs I want > to perform at once. Based on the number of spindles, CPUs and/or nodes > in the cluster (in case of Postgres-R). Admittedly, not necessarily per > database, but at least overall. Wait, are we in violent agreement here? An overall limit on the number of parallel jobs is exactly what I think *does* make sense. It's the other knobs I find odd. > I wouldn't known what to set a timeout to. And you didn't make a good > argument for any specific value so far. Nor did you offer a reasoning > for how to find one. It's certainly very workload and feature specific. I think my basic contention is that it doesn't matter very much, so any reasonable value should be fine. I think 5 minutes will be good enough for 99% of cases. But if you find that this leaves too many extra backends around and you start to run out of file descriptors or your ProcArray gets too full, then you might want to drop it down. Conversely, if you want to fine-tune your system for sudden load spikes, you could raise it. > I'd consider the case of min_spare_background_workers * number of > databases > max_background_workers to be a configuration error, about > which the coordinator should warn. The number of databases isn't a configuration parameter. Ideally, users shouldn't have to reconfigure the system because they create more databases. >> I think we need to look for a way to eliminate the maximum number of >> workers per database, too. > > Okay, might make sense, yes. > > Dropping both of these per-database GUCs, we'd simply end up with having > max_background_workers around all the time. > > A timeout would mainly help to limit the max amount of time workers sit > around idle. I fail to see how that's more helpful than the proposed > min/max. Quite the opposite, it's impossible to get any useful guarantees. > > It assumes that the workload remains the same over time, but doesn't > cope well with sudden spikes and changes in the workload. I guess we differ on the meaning of "cope well"... being able to spin up 18 workers in one second seems very fast to me. How many do you expect to ever need?!! > Unlike the > proposed min/max combination, which forks new bgworkers in advance, even > if the database already uses lots of them. And after the spike, it > quickly reduces the amount of spare bgworkers to a certain max. While > not perfect, it's definitely more adaptive to the workload (at least in > the usual case of having only few databases). > > Maybe we need a more sophisticated algorithm in the coordinator. For > example measuring the avg. amount of concurrent jobs per database over > time and adjust the number of idle backends according to that, the > current workload and the max_background_workers, or some such. 
The > min/max GUCs were simply easier to implement, but I'm open to a more > sophisticated thing. Possibly, but I'm still having a hard time understanding why you need all the complexity you already have. The way I'd imagine doing this is: 1. If a new job arrives, and there is an idle worker available for the correct database, then allocate that worker to that job. Stop. 2. Otherwise, if the number of background workers is less than the maximum number allowable, then start a new worker for the appropriate database and allocate it to the new job. Stop. 3. Otherwise, if there is at least one idle background worker, kill it and start a new one for the correct database. Allocate that new worker to the new job. Stop. 4. Otherwise, you're already at the maximum number of background workers and they're all busy. Wait until some worker finishes a job, and then try again beginning with step 1. When a worker finishes a job, it hangs around for a few minutes to see if it gets assigned a new job (as per #1) and then exits. Although there are other tunables that can be exposed, I would expect, in this design, that the only thing most people would need to adjust would be the maximum pool size. It seems (to me) like your design is being driven by start-up latency, which I just don't understand. Sure, 50 ms to start up a worker isn't fantastic, but the idea is that it won't happen much because there will probably already be a worker in that database from previous activity. The only exception is when there's a sudden surge of activity. But I don't think that's the case to optimize for. If a database hasn't had any activity in a while, I think it's better to reclaim the memory and file descriptors and ProcArray slots that we're spending on it so that the rest of the system can run faster. If that means it takes an extra fraction of a second to respond at some later point, I can live with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
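[For what it's worth, here is a minimal sketch of that four-step allocation policy. All types and helper names are made up for illustration; none of this is code from the bgworker patch, and job queuing, the idle timeout, and starvation avoidance are left out.]

#include <stdbool.h>
#include <stddef.h>

#define MAX_BACKGROUND_WORKERS 8

typedef struct Worker
{
    bool        in_use;         /* slot holds a live process */
    bool        busy;           /* currently running a job */
    int         dboid;          /* database the worker is bound to */
} Worker;

static Worker workers[MAX_BACKGROUND_WORKERS];

/* stand-ins for asking the postmaster to fork or a worker to exit */
static void
start_worker(Worker *w, int dboid)
{
    w->in_use = true;
    w->busy = true;
    w->dboid = dboid;
}

static void
stop_worker(Worker *w)
{
    w->in_use = false;
}

/* Returns the worker assigned to the job, or NULL if all are busy (step 4). */
static Worker *
assign_job(int dboid)
{
    Worker     *idle_other_db = NULL;
    int         live = 0;

    for (int i = 0; i < MAX_BACKGROUND_WORKERS; i++)
    {
        Worker     *w = &workers[i];

        if (!w->in_use)
            continue;
        live++;

        if (!w->busy)
        {
            if (w->dboid == dboid)
            {
                w->busy = true; /* step 1: idle worker in the right database */
                return w;
            }
            idle_other_db = w;  /* candidate for step 3 */
        }
    }

    if (live < MAX_BACKGROUND_WORKERS)  /* step 2: start a brand new worker */
    {
        for (int i = 0; i < MAX_BACKGROUND_WORKERS; i++)
        {
            if (!workers[i].in_use)
            {
                start_worker(&workers[i], dboid);
                return &workers[i];
            }
        }
    }

    if (idle_other_db != NULL)  /* step 3: recycle an idle worker */
    {
        stop_worker(idle_other_db);
        start_worker(idle_other_db, dboid);
        return idle_other_db;
    }

    return NULL;                /* step 4: caller waits and retries */
}

int
main(void)
{
    Worker     *w = assign_job(16384);  /* step 2: starts a worker for db 16384 */

    w->busy = false;            /* job done; worker lingers as idle */
    assign_job(16384);          /* step 1: reuses the idle worker */
    assign_job(16385);          /* step 2: a second worker, other database */
    return 0;
}

[A worker that stays idle past the timeout mentioned above would simply be stopped by the coordinator; that is the only part of the policy not shown here.]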
On 09/21/2010 03:46 PM, Robert Haas wrote: > Wait, are we in violent agreement here? An overall limit on the > number of parallel jobs is exactly what I think *does* make sense. > It's the other knobs I find odd. Note that the max setting I've been talking about here is the maximum amount of *idle* workers allowed. It does not include busy bgworkers. > I guess we differ on the meaning of "cope well"... being able to spin > up 18 workers in one second seems very fast to me. Well, it's obviously use-case dependent. For Postgres-R (and sync replication) in general, people are very sensitive to latency. There's the network latency already, but adding a 50ms latency for no good reason is not going to make these people happy. > How many do you expect to ever need?!! Again, very different. For Postgres-R, easily a couple of dozen. The same applies to parallel querying with multiple concurrent parallel queries. > Possibly, but I'm still having a hard time understanding why you need > all the complexity you already have. To make sure we only pay the startup cost on very rare occasions, and not every time the workload changes a bit (or isn't in conformance with an arbitrary timeout). (BTW the min/max is hardly any more complex than a timeout. It doesn't even need a syscall). > It seems (to me) like your design is being driven by start-up latency, > which I just don't understand. Sure, 50 ms to start up a worker isn't > fantastic, but the idea is that it won't happen much because there > will probably already be a worker in that database from previous > activity. The only exception is when there's a sudden surge of > activity. I'm less optimistic about the consistency of the workload. > But I don't think that's the case to optimize for. If a > database hasn't had any activity in a while, I think it's better to > reclaim the memory and file descriptors and ProcArray slots that we're > spending on it so that the rest of the system can run faster. Absolutely. It's what I call a change in workload. The min/max approach is certainly faster at reclaiming unused workers, but (depending on the max setting) doesn't necessarily ever go down to zero. Regards Markus Wanner
On Tue, Sep 21, 2010 at 11:31 AM, Markus Wanner <markus@bluegap.ch> wrote: > On 09/21/2010 03:46 PM, Robert Haas wrote: >> Wait, are we in violent agreement here? An overall limit on the >> number of parallel jobs is exactly what I think *does* make sense. >> It's the other knobs I find odd. > > Note that the max setting I've been talking about here is the maximum > amount of *idle* workers allowed. It does not include busy bgworkers. Oh, wow. Is there another limit on the total number of bgworkers? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 09/21/2010 05:59 PM, Robert Haas wrote: > Oh, wow. Is there another limit on the total number of bgworkers? There currently are three GUCs that control bgworkers:

max_background_workers
min_spare_background_workers
max_spare_background_workers

The first replaces the former autovacuum_max_workers GUC. As before, it is an overall limit, much like max_connections. The latter two are additional. They are per-database lower and upper limits for the number of idle workers at any point in time. These latter two are what I'm referring to as the min/max approach. And what I'm arguing cannot be replaced by a timeout without losing functionality. Regards Markus Wanner
On Sat, Sep 18, 2010 at 4:21 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> (It's exactly what apache pre-fork does, no? Is anybody concerned about the >> idle processes there? Or do they consume much less resources?) .... > > I don't know whether an idle Apache worker consumes more or less > memory than an idle PostgreSQL worker, but another difference between > the Apache case and the PostgreSQL case is that presumably all those Apache, like Postgres, handles a lot of different use cases, and the ideal configuration depends heavily on how you use it. The default configs for Apache are meant to be flexible and handle a mixed workload where some requests are heavyweight scripts which might have waits on a database and others are lightweight requests for static objects. This is a hard configuration to get right, but the default is to ramp up the number of processes dynamically in the hope of reaching some kind of equilibrium. Generally the recommended approach for a high-traffic site is to use a dedicated Apache or thttpd or equivalent install for the static objects -- this one would have hundreds of workers or threads or whatever, and each one would be fairly lightweight. In fact nearly all the RAM can be shared, and the overhead of forking a new process would be too high compared to serving static content from cache to let the number scale dynamically. If you have 200 processes, each of which has only a few kB of private RAM, then nearly all the RAM is available for filesystem cache and requests can be served in milliseconds (mostly network latency). Then the heavyweight scripts can run on a dedicated Apache install where the total number of processes is limited to something sane, like a small multiple of the number of cores -- basically total RAM divided by the RAM required to run the interpreter. If you have 20 processes and each uses 40 MB, then your 2GB machine has about half its RAM available for filesystem cache or other uses. Again you want to run with the 20 processes always running -- this time because the interpreter startup is usually quite slow. The dynamic ramp-up is a feature meant for the default install and for use cases where the system has lots of different users with different needs. -- greg
Greg, On 09/25/2010 08:03 PM, Greg Stark wrote: > The dynamic ramp-up is a feature to deal for the default install and > for use case where the system has lots of different users with > different needs. Thanks for sharing this. From that perspective, neither the current min/max nor the timeout configuration approach would be satisfying, as it's not really possible to configure it to always have a certain number of bgworkers (exactly adjusted to the requirements at hand, with possibly differing requirements per database). I'm unsure about how to continue here. It seems we need the ability to switch between databases not (only) for performance, but for ease of configuration. Regards Markus Wanner