Thread: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!
Kevin didn't send out an official gavel-banging announcement of the end of CommitFest 2009-07 (possibly because I neglected until today to give him privileges to actually change it in the web application), but I'd just like to take a minute to thank him publicly for his efforts. We started this CommitFest with something like 60 patches, which is definitely on the larger side for a CommitFest, and Kevin did a great job staying on top of what was going on with all of them and, I felt, really helped keep us on track. At the same time, I felt he did this with a very light touch that made the whole thing go very smoothly. So -- thanks, Kevin! I also appreciate the efforts of all those who reviewed. Good reviews are really critical to keep the burden from building up on committers, and I appreciate the efforts of everyone who contributed, in many cases probably on their own time. I'm particularly grateful to the people who were vigilant about spelling, grammar, coding style, whitespace, and other nitpicky little issues that are not much fun, but which at least for me are a major time sink if they're still lingering when it comes time to do the actual commit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> wrote: > I'd just like to take a minute to thank him publicly for his > efforts. We started this CommitFest with something like 60 > patches, which is definitely on the larger side for a CommitFest, > and Kevin did a great job staying on top of what was going on with > all of them and, I felt, really helped keep us on track. At the > same time, I felt he did this with a very light touch that made > the whole thing go very smoothly. So -- thanks, Kevin! You're welcome. It was educational for me. I don't think I want to try to handle two in a row, and I think your style is better suited than mine to the final CF for a release, but I might be able to take on the 2010-11 CF if people want that. My hand was not always so light behind the scenes, though -- I sent or received about 100 off-list emails to try to keep things moving. Hopefully nobody was too offended by my nagging. :-) Oh, and thanks for putting together the CF web application. Without that, I couldn't have done half as well as I did. > I also appreciate the efforts of all those who reviewed. Yes, I'll second that! I've always been impressed with the PostgreSQL community, and managing this CF gave me new insights and appreciation for the intelligence, professionalism, and community spirit of its members -- authors, reviewers, and committers. -Kevin
Kevin Grittner wrote: > I don't think I want to try to handle two in a row, and I think your style is better suited > than mine to the final CF for a release, but I might be able to take on the 2010-11 CF if people want that Ha, you just put yourself right back on the hook with that comment, and Robert does seem like the right guy for CF-4 @ 2011-01. Leaving the question of what's going to happen with CF-2 next month. I think the crucial thing with the 2010-09 CF is that we have to get serious progress made sorting out all the sync rep ideas before/during that one. The review Yeb did and subsequent discussion was really helpful, but the scope on that needs to actually get nailed down to *something* concrete if it's going to get built early enough in the 9.1 release to be properly reviewed and tested for more than one round. Parts of the design and scope still feel like they're expanding to me, and I think having someone heavily involved in the next CF who is willing to push on nailing down that particular area is pretty important. Will volunteer myself if I can stay on schedule to make it past the major time commitment sink I've had so far this year by then. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Wed, Aug 18, 2010 at 7:46 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Kevin Grittner wrote: >> >> I don't think I want to try to handle two in a row, and I think your style >> is better suited >> than mine to the final CF for a release, but I might be able to take on >> the 2010-11 CF if people want that > > Ha, you just put yourself right back on the hook with that comment, and > Robert does seem like the right guy for CF-4 @ 2011-01. Leaving the > question of what's going to happen with CF-2 next month. My reputation precedes me, apparently. Although I appreciate everyone so far being willing to avoid mentioning exactly what that reputation might be. :-) > I think the crucial thing with the 2010-09 CF is that we have to get serious > progress made sorting out all the sync rep ideas before/during that one. > The review Yeb did and subsequent discussion was really helpful, but the > scope on that needs to actually get nailed down to *something* concrete if > it's going to get built early enough in the 9.1 release to be properly > reviewed and tested for more than one round. Parts of the design and scope > still feel like they're expanding to me, and I think having someone heavily > involved in the next CF who is willing to push on nailing down that > particular area is pretty important. Will volunteer myself if I can stay on > schedule to make it past the major time commitment sink I've had so far this > year by then. Sitting on Sync Rep is a job and a half by itself, without adding all the other CF work on top of it. Maybe we should try to find two vi^Holunteers: a CommitFest Manager (CFM) and a Major Feature Babysitter (MBS). At any rate, we should definitely NOT wait another month to start thinking about Sync Rep again. I haven't actually looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed the view that the biggest thing standing in the way of a halfway decent Sync Rep implementation was a number of polling loops that needed to be replaced with something that wouldn't introduce up-to-100ms delays. And so far we haven't seen a patch for that. Somebody write one. And then let's get it reviewed and committed RSN.It may seem like we're early in the release cycle yet,but for a feature of this magnitude we are not. We committed way too much big stuff at the very end of the last release cycle; Hot Standby was still being cleaned up in May after commit in November. We'll be lucky to commit sync rep that early. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 19/08/10 04:46, Robert Haas wrote: > At any rate, we should definitely NOT wait another > month to start thinking about Sync Rep again. Agreed. EnterpriseDB is interested in having that feature, so I'm on the hook to spend time on it regardless of commitfests. > I haven't actually > looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed > the view that the biggest thing standing in the way of a halfway > decent Sync Rep implementation was a number of polling loops that > needed to be replaced with something that wouldn't introduce > up-to-100ms delays. Well, that's the only uncontroversial thing about it that doesn't require any fighting over the UI or desired behavior. That's why I've focused on that first, and also because it's useful regardless of synchronous replication. But once that's done, we'll have to nail down how synchronous replication is supposed to behave, and how to configure it. > And so far we haven't seen a patch for that. > Somebody write one. And then let's get it reviewed and committed RSN. Fujii is on vacation, but I've started working on it. The two issues with Fujii's latest patch are that it would not respond promptly on platforms where signals don't interrupt sleep, and it suffers the classic race condition that pselect() was invented for. I'm going to replace pg_usleep() with select(), and use the so called "self-pipe trick" to get over the race condition. I have that written up but I want to do some testing and cleanup before posting the patch. > It may seem like we're early in the release cycle yet, but for a > feature of this magnitude we are not. We committed way too much big > stuff at the very end of the last release cycle; Hot Standby was still > being cleaned up in May after commit in November. We'll be lucky to > commit sync rep that early. Agreed. We need to decide the scope and minimum set of features real soon to get something concrete finished. BTW, on what platforms signals don't interrupt sleep? Although that issue has been discussed many times before, I couldn't find any reference to a real platform in the archives. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > BTW, on what platforms signals don't interrupt sleep? Although that > issue has been discussed many times before, I couldn't find any > reference to a real platform in the archives. I've got one in captivity (my old HPUX box). Happy to test whatever you come up with. Considering that pg_usleep is implemented with select, I'm not following what you mean by "replace pg_usleep() with select()"? regards, tom lane
Excerpts from Heikki Linnakangas's message of jue ago 19 02:02:34 -0400 2010: > On 19/08/10 04:46, Robert Haas wrote: > > And so far we haven't seen a patch for that. > > Somebody write one. And then let's get it reviewed and committed RSN. > > Fujii is on vacation, but I've started working on it. The two issues > with Fujii's latest patch are that it would not respond promptly on > platforms where signals don't interrupt sleep, and it suffers the > classic race condition that pselect() was invented for. I'm going to > replace pg_usleep() with select(), and use the so called "self-pipe > trick" to get over the race condition. I have that written up but I want > to do some testing and cleanup before posting the patch. Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because select() doesn't handle pipes, only sockets. You may need some extra hack to make it work there. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Aug 19, 2010 at 17:08, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Heikki Linnakangas's message of jue ago 19 02:02:34 -0400 2010: >> On 19/08/10 04:46, Robert Haas wrote: > >> > And so far we haven't seen a patch for that. >> > Somebody write one. And then let's get it reviewed and committed RSN. >> >> Fujii is on vacation, but I've started working on it. The two issues >> with Fujii's latest patch are that it would not respond promptly on >> platforms where signals don't interrupt sleep, and it suffers the >> classic race condition that pselect() was invented for. I'm going to >> replace pg_usleep() with select(), and use the so called "self-pipe >> trick" to get over the race condition. I have that written up but I want >> to do some testing and cleanup before posting the patch. > > Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because > select() doesn't handle pipes, only sockets. You may need some extra > hack to make it work there. We have a pipe implementation using sockets that is used on Windows for just this reason, IIRC. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On 19/08/10 18:08, Alvaro Herrera wrote: > Excerpts from Heikki Linnakangas's message of jue ago 19 02:02:34 -0400 2010: >> On 19/08/10 04:46, Robert Haas wrote: > >>> And so far we haven't seen a patch for that. >>> Somebody write one. And then let's get it reviewed and committed RSN. >> >> Fujii is on vacation, but I've started working on it. The two issues >> with Fujii's latest patch are that it would not respond promptly on >> platforms where signals don't interrupt sleep, and it suffers the >> classic race condition that pselect() was invented for. I'm going to >> replace pg_usleep() with select(), and use the so called "self-pipe >> trick" to get over the race condition. I have that written up but I want >> to do some testing and cleanup before posting the patch. > > Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because > select() doesn't handle pipes, only sockets. You may need some extra > hack to make it work there. That's fine, we can do the naive set-a-flag implementation on Windows because on Windows our "signals" are only delivered at CHECK_FOR_INTERRUPTS(), so we don't have the race condition to begin with. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 19/08/10 16:38, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> BTW, on what platforms signals don't interrupt sleep? Although that >> issue has been discussed many times before, I couldn't find any >> reference to a real platform in the archives. > > I've got one in captivity (my old HPUX box). Happy to test whatever you > come up with. > > Considering that pg_usleep is implemented with select, I'm not following > what you mean by "replace pg_usleep() with select()"? Instead of using pg_usleep(), call select() directly, waiting not only for the timeout, but also for data to arrive on the "self-pipe". The signal handler writes a byte to the self-pipe, waking up the select(). That way the select() is interupted by the signal arriving, even if signals per se don't interrupt it. And it closes the race condition involved with setting a flag in the signal handler and checking that in the main loop. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 19/08/10 16:38, Tom Lane wrote: >> Considering that pg_usleep is implemented with select, I'm not following >> what you mean by "replace pg_usleep() with select()"? > Instead of using pg_usleep(), call select() directly, waiting not only > for the timeout, but also for data to arrive on the "self-pipe". The > signal handler writes a byte to the self-pipe, waking up the select(). > That way the select() is interupted by the signal arriving, even if > signals per se don't interrupt it. And it closes the race condition > involved with setting a flag in the signal handler and checking that in > the main loop. Hmm, but couldn't you still do that inside pg_usleep? Signal handlers that do that couldn't know if they were interrupting a sleep per se, so this would have to be a backend-wide convention. regards, tom lane
On 19/08/10 19:57, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 19/08/10 16:38, Tom Lane wrote: >>> Considering that pg_usleep is implemented with select, I'm not following >>> what you mean by "replace pg_usleep() with select()"? > >> Instead of using pg_usleep(), call select() directly, waiting not only >> for the timeout, but also for data to arrive on the "self-pipe". The >> signal handler writes a byte to the self-pipe, waking up the select(). >> That way the select() is interupted by the signal arriving, even if >> signals per se don't interrupt it. And it closes the race condition >> involved with setting a flag in the signal handler and checking that in >> the main loop. > > Hmm, but couldn't you still do that inside pg_usleep? Signal handlers > that do that couldn't know if they were interrupting a sleep per se, > so this would have to be a backend-wide convention. I don't understand, do what inside pg_usleep()? We only need to respond quickly to one signal, the one that tells walsender "there's some new WAL that you should send". We can rely on polling for all the other signals like SIGHUP for config reload or shutdown request, like we do today. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 19/08/10 19:57, Tom Lane wrote: >> Hmm, but couldn't you still do that inside pg_usleep? Signal handlers >> that do that couldn't know if they were interrupting a sleep per se, >> so this would have to be a backend-wide convention. > I don't understand, do what inside pg_usleep()? I was envisioning still using pg_usleep, and having this interaction between signal handlers and the delay be private to pg_usleep, rather than putting such ugly code into forty-nine different places. There are a *lot* of places where we have loops that break down delays into at-most-one-second pg_usleep calls, and if we're going to have a hack like this we should fix them all, not just two or three that SR cares about. But I'm still not seeing how this self-pipe hack avoids a race condition. If the signal handler is sending a byte whenever it executes, then you could have bytes already stacked up in the pipe from previous interrupts that didn't happen to come while inside pg_usleep. If you clear those before sleeping, you have a race condition, and if you don't, then you fail to sleep the intended amount of time even though there was no interrupt this time. regards, tom lane
On 19/08/10 20:18, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 19/08/10 19:57, Tom Lane wrote: >>> Hmm, but couldn't you still do that inside pg_usleep? Signal handlers >>> that do that couldn't know if they were interrupting a sleep per se, >>> so this would have to be a backend-wide convention. > >> I don't understand, do what inside pg_usleep()? > > I was envisioning still using pg_usleep, and having this interaction > between signal handlers and the delay be private to pg_usleep, rather > than putting such ugly code into forty-nine different places. There > are a *lot* of places where we have loops that break down delays > into at-most-one-second pg_usleep calls, and if we're going to have a > hack like this we should fix them all, not just two or three that SR > cares about. Hmm, will need to think about a suitable API for that. The nice thing would be that we could implement it using pselect() where available. (And reliable - the Linux select() man page says that glibc's pselect() is emulated using select(), and suffers from the very same race condition pselect() was invented to solve. How awful is that!?) > But I'm still not seeing how this self-pipe hack avoids a race > condition. If the signal handler is sending a byte whenever it > executes, then you could have bytes already stacked up in the pipe > from previous interrupts that didn't happen to come while inside > pg_usleep. If you clear those before sleeping, you have a race > condition, and if you don't, then you fail to sleep the intended > amount of time even though there was no interrupt this time. You clear the pipe after waking up. Before sending all the pending WAL and deciding that you're fully caught up again: for(;;) { 1. select() 2. clear pipe 3. send any pending WAL } There's more information on the self-pipe trick at e.g http://lwn.net/Articles/177897/ -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 19/08/10 20:18, Tom Lane wrote: >> But I'm still not seeing how this self-pipe hack avoids a race >> condition. If the signal handler is sending a byte whenever it >> executes, then you could have bytes already stacked up in the pipe >> from previous interrupts that didn't happen to come while inside >> pg_usleep. If you clear those before sleeping, you have a race >> condition, and if you don't, then you fail to sleep the intended >> amount of time even though there was no interrupt this time. > You clear the pipe after waking up. Hmm ... that doesn't answer my second objection about failing to sleep the expected amount of time, but on reflection I guess that can't be done: we have to be able to cope with interrupts occurring just before the sleep actually begins, and there's no way to define "just before" except by reference to the calling code having done whatever it might need to do before/after sleeping. > Hmm, will need to think about a suitable API for that. Offhand I'd suggest something like SetSleepInterrupt() -- called by signal handlers, writes pipe ClearSleepInterrupt() -- called by sleep-and-do-something loops, clears pipe pg_usleep() itself remains the same, but it is now guaranteed to return immediately if SetSleepInterrupt is called, or has been called since the last ClearSleepInterrupt. > The nice thing > would be that we could implement it using pselect() where available. > (And reliable - the Linux select() man page says that glibc's pselect() > is emulated using select(), and suffers from the very same race > condition pselect() was invented to solve. How awful is that!?) Ick. So how would you tell if it's trustworthy? regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Aug 19, 2010 at 08:36:09PM +0300, Heikki Linnakangas wrote: [...] > Hmm, will need to think about a suitable API for that. The nice thing would > be that we could implement it using pselect() where available. (And > reliable - the Linux select() man page says that glibc's pselect() is > emulated using select(), and suffers from the very same race condition > pselect() was invented to solve. How awful is that!?) It is indeed. It seems, though, that from Linux kernel 2.6.16 and glibc 2.4 on, things look better [1]. As a reference, Debian stable (not known to adventure too far into the present ;-) is libc 2.7 on kernel 2.6.26. Of course, "enterprise" GNU/Linux distros are said to be even more conservative... [1] <http://lwn.net/Articles/176911/> Regards - -- tomás -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFMbiauBcgs9XrR2kYRAhiwAJ41f29jSIy409epTH0eJRXW17oByACeIkRo CRg2BCw8tn3PkdnNR1i/MCY= =GVMT -----END PGP SIGNATURE-----
On 19/08/10 20:59, Tom Lane wrote: > Offhand I'd suggest something like > > SetSleepInterrupt() -- called by signal handlers, writes pipe > ClearSleepInterrupt() -- called by sleep-and-do-something loops, clears pipe > > pg_usleep() itself remains the same, but it is now guaranteed to return > immediately if SetSleepInterrupt is called, or has been called since the > last ClearSleepInterrupt. Hmm, we have pg_usleep() calls in some fairly low-level functions, like mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we don't want those pg_usleep()s to return immediately. And pg_usleep() is used in some client code too. I think we need a separate sleep function for this. Another idea is to not use unix signals at all, but ProcSendSignal() and ProcWaitForSignal(). We would not need the signal handler at all. Walsender would use ProcWaitForSignal() instead of pg_usleep() and backends that want to wake it up would use ProcSendSignal(). The problem is that there is currently no way to specify a timeout, but I presume the underlying semaphore operations have that capability, and we could expose it. Actually ProcSendSignal()/ProcWaitForSignal() won't work as is, because walsender doesn't have a PGPROC entry, but you could easily build a similar mechanism, using PGSemaphoreLock/Unlock like ProcSendSignal()/WaitForSignal() does. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Hmm, we have pg_usleep() calls in some fairly low-level functions, like > mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we > don't want those pg_usleep()s to return immediately. And pg_usleep() is > used in some client code too. I think we need a separate sleep function > for this. Well, we'd need some careful thought about which sleeps need what, but I don't necessarily have an objection to a separate interruptable sleep function. > Another idea is to not use unix signals at all, but ProcSendSignal() and > ProcWaitForSignal(). We would not need the signal handler at all. > Walsender would use ProcWaitForSignal() instead of pg_usleep() and > backends that want to wake it up would use ProcSendSignal(). You keep on proposing solutions that only work for walsender :-(. Most of the other places where we have pg_usleep actually do want a timed sleep, I believe. It's also unclear that we can always expect ProcSendSignal to be usable --- for example, stuff like SIGHUP would be sent by processes that might not be connected to shared memory. > The problem > is that there is currently no way to specify a timeout, but I presume > the underlying semaphore operations have that capability, and we could > expose it. They don't, or at least the semop-based implementation doesn't. regards, tom lane
On 20/08/10 16:24, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> Hmm, we have pg_usleep() calls in some fairly low-level functions, like >> mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we >> don't want those pg_usleep()s to return immediately. And pg_usleep() is >> used in some client code too. I think we need a separate sleep function >> for this. > > Well, we'd need some careful thought about which sleeps need what, but I > don't necessarily have an objection to a separate interruptable sleep > function. If we have to, we could also support multiple interrupts with multiple self-pipes, so that you can choose at pg_usleep() which ones to wake up on. >> Another idea is to not use unix signals at all, but ProcSendSignal() and >> ProcWaitForSignal(). We would not need the signal handler at all. >> Walsender would use ProcWaitForSignal() instead of pg_usleep() and >> backends that want to wake it up would use ProcSendSignal(). > > You keep on proposing solutions that only work for walsender :-(. Well yes, the other places where we use pg_usleep() are not really a problem as is. If as a side-effect we can make them respond more quickly to signals, with small changes, that's good, but walsender is the only one that's performance critical. That said, a select() based solution is my current favorite. > Most of the other places where we have pg_usleep actually do want > a timed sleep, I believe. It's also unclear that we can always expect > ProcSendSignal to be usable --- for example, stuff like SIGHUP would > be sent by processes that might not be connected to shared memory. > >> The problem >> is that there is currently no way to specify a timeout, but I presume >> the underlying semaphore operations have that capability, and we could >> expose it. > > They don't, or at least the semop-based implementation doesn't. There's semtimedop(). I don't know how portable it is, but it seems to exist at least on Linux, Solaris, HPUX and AIX. On what platforms do we use sysv semaphores? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
[ It's way past time to change the thread title ] Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 20/08/10 16:24, Tom Lane wrote: >> You keep on proposing solutions that only work for walsender :-(. > Well yes, the other places where we use pg_usleep() are not really a > problem as is. Well, yes they are. They cause unnecessary process wakeups and thereby consume cycles even when the database is idle. See for example a longstanding complaint here: https://bugzilla.redhat.com/show_bug.cgi?id=252129 If we're going to go to the trouble of having a mechanism like this, I'd like it to fix that problem so I can close out that bug. > There's semtimedop(). I don't know how portable it is, but it seems to > exist at least on Linux, Solaris, HPUX and AIX. It's not on my HPUX, and I don't see it in the Single Unix Spec. > On what platforms do we use sysv semaphores? AFAIK, everything except Windows and extremely old versions of OS X. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 20/08/10 17:28, Tom Lane wrote: > [ It's way past time to change the thread title ] > > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 20/08/10 16:24, Tom Lane wrote: >>> You keep on proposing solutions that only work for walsender :-(. > >> Well yes, the other places where we use pg_usleep() are not really a >> problem as is. > > Well, yes they are. They cause unnecessary process wakeups and thereby > consume cycles even when the database is idle. See for example a > longstanding complaint here: > https://bugzilla.redhat.com/show_bug.cgi?id=252129 > > If we're going to go to the trouble of having a mechanism like this, > I'd like it to fix that problem so I can close out that bug. Hmm, if you want to put bgwriter and walwriter to deep sleep, then someone will need to wake them up when they have work to do. Currently they poll. Maybe they should just sleep longer, like 10 seconds, if there hasn't been any work to do in the last X wakeups. We've been designing the new sleep facility so that the event that wakes up the sleep is sent from the signal handler in the same process, but it seems that all the potential users would actually want to be woken up from *another* process, so the signal handler seems like an unnecessary middleman. Particularly on Windows where signals are simulated with pipes and threads, while you could just send a Windows event directly from one process to another. A common feature that all the users of this facility want is that once the event is sent, re-sending it is a fast no-op until re-enabled by the receiver. For example, if we need backends to wake up bgwriter after dirtying a buffer, you don't want to waste many cycles determining that bgwriter is already active and doesn't need to be woken up. Let's call these "latches". I'm thinking of something like this: /* Similar to LWLockId */ typedef enum { BgwriterLatch, WalwriterLatch, /* plus one for each walsender */ } LatchId; /* * Wait for given latch to be set. Only one process can wait * for a given latch at a time. */ WaitLatch(LatchId latch, long timeout); /* * Sets latch. Returns quickly if the latch is set already. */ SetLatch(LatchId latch); /* * Clear the latch. Calling WaitLatch after this will sleep, unless * the latch is set again before the WaitLatch call.*/ ResetLatch(LatchId latch); There would be a boolean for each latch in shared memory, to indicate if the latch is "armed", allowing quick return from SetLatch if the latch is already set. Plus a signal to wake up the waiting process (maybe use procsignal.c), and the self-pipe trick within the receiving process to make it race condition free. On Windows, the signal and the self-pipe trick are replaced with Windows events. I'll try out this approach tomorrow.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 20/08/10 17:28, Tom Lane wrote: >> Well, yes they are. They cause unnecessary process wakeups and thereby >> consume cycles even when the database is idle. See for example a >> longstanding complaint here: >> https://bugzilla.redhat.com/show_bug.cgi?id=252129 >> >> If we're going to go to the trouble of having a mechanism like this, >> I'd like it to fix that problem so I can close out that bug. > Hmm, if you want to put bgwriter and walwriter to deep sleep, then > someone will need to wake them up when they have work to do. Currently > they poll. Maybe they should just sleep longer, like 10 seconds, if > there hasn't been any work to do in the last X wakeups. I should probably clarify that I don't expect this patch to solve that problem all by itself. But ISTM that a guaranteed-interruptible version of pg_usleep is a necessary prerequisite before we can even think about dialing down the database's CPU consumption at idle. > We've been designing the new sleep facility so that the event that wakes > up the sleep is sent from the signal handler in the same process, but it > seems that all the potential users would actually want to be woken up > from *another* process, so the signal handler seems like an unnecessary > middleman. No, because there are also lots of cases where the signal is arriving from a non-Postgres process; SIGTERM arriving from init being the killer case that you simply don't get to propose an alternative design for. More generally, I'm uninterested in decoupling cases like SIGHUP for postgresql.conf change from the existing signal mechanism, because it's just too useful to be able to trigger that externally. > Particularly on Windows where signals are simulated with > pipes and threads, while you could just send a Windows event directly > from one process to another. Windows of course has different constraints, and in particular it lacks the constraint that we must be able to respond to system-defined signals. So I wouldn't object to the underlying implementation being different for Windows. > [ "latch" proposal ] This seems reasonably clean as far as signal conditions generated internally to Postgres go, but I remain unclear on how it helps for response to actual signals. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of lun ago 23 19:44:02 -0400 2010: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > > [ "latch" proposal ] > > This seems reasonably clean as far as signal conditions generated > internally to Postgres go, but I remain unclear on how it helps for > response to actual signals. This could probably replace the signalling between postmaster and autovac launcher, as well. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 24/08/10 04:08, Alvaro Herrera wrote: > Excerpts from Tom Lane's message of lun ago 23 19:44:02 -0400 2010: >> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: > >>> [ "latch" proposal ] >> >> This seems reasonably clean as far as signal conditions generated >> internally to Postgres go, but I remain unclear on how it helps for >> response to actual signals. > > This could probably replace the signalling between postmaster and > autovac launcher, as well. Hmm, postmaster needs to stay out of shared memory.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 24/08/10 02:44, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> [ "latch" proposal ] > > This seems reasonably clean as far as signal conditions generated > internally to Postgres go, but I remain unclear on how it helps for > response to actual signals. You can can set the latch in the signal handler. Here's a first attempt at implementing that. To demonstrate how it works, I modified walsender to use the new latch facility, also to respond quickly to SIGHUP and SIGTERM. There's two kinds of latches, local and global. Local latches can only be set from the same process - allowing you to replace pg_usleep() with something that is always interruptible by signals (by setting the latch in the signal handler). The global latches work the same, and indeed the implementation is the same, but the latch resides in shared memory, and can be set by any process attached to shared memory. On Unix, when you set a latch waited for by another process, the setter sends SIGUSR1 to the waiting process, and the signal handler sends the byte to the self-pipe to wake up the select(). On Windows, we use WaitEvent to wait on a latch, and SetEvent to wake up. The difference between global and local latches is that for global latches, the Event object needs to be created upfront at postmaster startup so that its inherited to all child processes, and stored in shared memory. A local Event object can be created only in the process that needs it. I put the code in src/backend/storage/ipc/latch.c now, but it probably ought to go in src/backend/portability instead, with a separate win32_latch.c file for Windows. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Here's a first attempt at implementing that. To demonstrate how it works, I > modified walsender to use the new latch facility, also to respond quickly to > SIGHUP and SIGTERM. Great! > There's two kinds of latches, local and global. Local latches can only be > set from the same process - allowing you to replace pg_usleep() with > something that is always interruptible by signals (by setting the latch in > the signal handler). The global latches work the same, and indeed the > implementation is the same, but the latch resides in shared memory, and can > be set by any process attached to shared memory. On Unix, when you set a > latch waited for by another process, the setter sends SIGUSR1 to the waiting > process, and the signal handler sends the byte to the self-pipe to wake up > the select(). According to this explanation, the latch which walsender uses seems to be local. If it's true, walsender should call InitSharedLatch rather than InitLatch? > /* > * XXX: Should we invent an API to wait for data coming from the > * client connection too? It's not critical, but we could then > * eliminate the timeout altogether and go to sleep for good. > */ Yes, it would be very helpful when walsender waits for the ACK from the standby in upcoming synchronous replication. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 27/08/10 10:39, Fujii Masao wrote: > On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> There's two kinds of latches, local and global. Local latches can only be >> set from the same process - allowing you to replace pg_usleep() with >> something that is always interruptible by signals (by setting the latch in >> the signal handler). The global latches work the same, and indeed the >> implementation is the same, but the latch resides in shared memory, and can >> be set by any process attached to shared memory. On Unix, when you set a >> latch waited for by another process, the setter sends SIGUSR1 to the waiting >> process, and the signal handler sends the byte to the self-pipe to wake up >> the select(). > > According to this explanation, the latch which walsender uses seems to be > local. If it's true, walsender should call InitSharedLatch rather than > InitLatch? Yes, it should. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Greg Smith
Date:
Tom Lane wrote: > Well, yes they are. They cause unnecessary process wakeups and thereby > consume cycles even when the database is idle. See for example a > longstanding complaint here: > https://bugzilla.redhat.com/show_bug.cgi?id=252129 > > If we're going to go to the trouble of having a mechanism like this, > I'd like it to fix that problem so I can close out that bug. > The way the background writer wakes up periodically to absorb fsync requests is already way too infrequent on a busy system. That component of the bug report reeks as "not a bug" to me. Don't like the default? Increase bgwriter_delay; move on with your life. We're not going to increase the default, to screw over regular users, just to instead prioritize people who prefer low power consumption. There's a knob for it, you can tune the other way if you want. And based on this pile of data I'm sorting through the last few weeks, my guess is that if the default is going anywhere in 9.1 it's to make the BGW run *more* often, not less. What those asking for the default change don't realize is that if the BGW stops doing useful work, backends will start doing more disk writes with their own fsync calls, and now you've just messed with out how often the disks have to be powered up. I could probably construct a test case that uses more power with the behavior they think they want than the current one does. The only clear case where this is always a win is when the system it totally idle. The complaint that there's no similar way to detune the logger for lower power use, something you can't really tweak on your own, is a much more reasonable demand. I have a patch that adds a new column to pg_stat_bgwriter to count how often backend fsync calls happen directly, because the background writer couldn't be bothered to absorb them. If this latching idea goes somewhere, that should be a reasonable way to measure if the new implementation is getting in the way of that particular processing requirement. It's important to realize that the fsync absorb function of the background writer has become absolutely critical on modern systems that hit high transaction rates, so any refactoring of its basic design needs to stay responsive to those incoming backend requests. I don't see any high-level issues with the latch design approach Heikki has proposed in regards to that. I'll take a look at some of the other test cases I have here to see if they can help quantify its impact on this aspect of BGW behavior. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Tom Lane wrote: >> >> Well, yes they are. They cause unnecessary process wakeups and thereby >> consume cycles even when the database is idle. See for example a >> longstanding complaint here: >> https://bugzilla.redhat.com/show_bug.cgi?id=252129 >> >> If we're going to go to the trouble of having a mechanism like this, >> I'd like it to fix that problem so I can close out that bug. > > The way the background writer wakes up periodically to absorb fsync requests > is already way too infrequent on a busy system. Maybe instead of a fixed-duration sleep we could wake it up when it needs to do something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> The way the background writer wakes up periodically to absorb fsync requests >> is already way too infrequent on a busy system. > Maybe instead of a fixed-duration sleep we could wake it up when it > needs to do something. *Any* fixed delay is going to be too long for some people and not long enough for others; and the very same system might fall into both categories at different times of day. I don't think "make bgwriter_delay customizable" is an adequate answer. We've put up with that so far because it wasn't possible to do better given the infrastructure we had for waiting; but if we're going to try to improve the infrastructure, we should have the ambition of getting rid of this type of problem. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes: > Tom Lane wrote: >> Well, yes they are. They cause unnecessary process wakeups and thereby >> consume cycles even when the database is idle. See for example a >> longstanding complaint here: >> https://bugzilla.redhat.com/show_bug.cgi?id=252129 > ... The only clear case where this is > always a win is when the system it totally idle. If you'll climb down off that horse for a moment: yeah, the idle case is *exactly* what they're complaining about. In particular, the complaint is that it's unreasonable to have Postgres running on a machine at all unless it's actively being used, because it forces significant CPU power drain anyway. That gets in the way of our plan for world domination, no? If you can't have a PG sitting unobtrusively in the background, waiting for you to have a need for it, it won't get installed in the first place. People will pick mysql, or something else with a smaller footprint, to put on their laptops, and then we lose some more mindshare. I see this as just another facet of the argument about whether it's okay to have default settings that try to take over the entire machine. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Greg Stark
Date:
On Sat, Aug 28, 2010 at 2:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That gets in the way of our plan for world domination, > no? If you can't have a PG sitting unobtrusively in the background, > waiting for you to have a need for it, it won't get installed in the > first place. People will pick mysql, or something else with a smaller > footprint, to put on their laptops, and then we lose some more mindshare. People are always complaining about hosting companies not supporting Postgres but put yourself in the shoes of a hosting copmany that wants to have a few hundred hosting clients all on the same box, either as a shared box or more likely these days as VMs. That's perfectly reasonable if they're all idle nearly all the time but if they're all waking up ever few milliseconds that's going to consume a substantial amount of cpu before you even start handling requests. -- greg
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Greg Smith
Date:
Tom Lane wrote: > Greg Smith <greg@2ndquadrant.com> writes: > >> ... The only clear case where this is >> always a win is when the system it totally idle. >> > If you'll climb down off that horse for a moment: yeah, the idle case is > *exactly* what they're complaining about. I wasn't on a horse here--I saw the specific case they must be most annoyed with. If it's possible to turn this around into a "push" model instead of how it works now, without tanking so that performance (and maybe even power use!) suffers for the worst or even average case, that refactoring could end up outperforming the current model. I'd like the background writer to never go to sleep at all really if enough works come in to keep it always busy, and I think that might fall out nicely as a result of aiming for the other end--making it sleeper deeper when it's not needed. Just pointing out the very specific place where I suspect that will go wrong if it's not done carefully is the fsync absorption, because it's already broken enough that we're reworking it here. PostgreSQL already ship a bad enough default configuration to decide yet another spot should be yielded to somebody else's priorities, unless that actually meets performance goals. I think I can quantify what those should be with a test case for this part. > I see this as just another facet of the argument about whether it's okay > to have default settings that try to take over the entire machine. > Mostly agreed here. So long as the result is automatic enough to not introduce yet another GUC set to the wrong thing for busy systems by default, this could turn out great. The list of things you must change to get the database to work right for serious work is already far too long, and I dread the thought of putting yet another on there just for the sake of lower power use. Another part of the plan for world domination is that Oracle DBAs install the database for a test and say "wow, that wasn't nearly as complicated as I'm used to and it performs well, that was nice". That those people matter too is all I'm saying. I could easily file a bug from their perspective saying "Background writer is a lazy SOB in default config" that would be no less valid than the one being discussed here. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
Here's a 2nd version of the "latch" patch. Now with a Windows implementation. Comments welcome. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Tue, Aug 31, 2010 at 4:06 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Here's a 2nd version of the "latch" patch. Now with a Windows > implementation. Comments welcome. Seems good. Two minor comments: > rc = WaitForSingleObject(latch->event, timeout / 1000); > if (rc == WAIT_FAILED) > { > ereport(ERROR, > (errcode_for_socket_access(), > errmsg("WaitForSingleObject() failed: error code %d", (int) GetLastError()))); > } > if (rc == WAIT_TIMEOUT) > break; /* timeout exceeded */ We should also check "rc == WAIT_OBJECT_0"? > static volatile HANDLE waitingEvent = false; s/false/NULL? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> /* >> * XXX: Should we invent an API to wait for data coming from the >> * client connection too? It's not critical, but we could then >> * eliminate the timeout altogether and go to sleep for good. >> */ > > Yes, it would be very helpful when walsender waits for the ACK from > the standby in upcoming synchronous replication. I propose to change WaitLatch() so that it accepts the socket descriptor as an argument, to wait for data coming from the client connection. WaitLatch() monitors not only the self-pipe but also the given socket. If -1 is supplied, it checks only the self-pipe. The socket should be monitored by using poll() if the platform has it, since poll() is usually more efficient. So I'd like to change Unix implementation of WaitLatch() as follows. Thought? ------------------- define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout) void WaitLatchAndSocket(Latch *latch, int sock, long timeout); { ... FD_SET(selfpipe_readfd, &input_mask); if (sock != -1) FD_SET(sock, &input_mask); #ifdef HAVE_POLL poll(...) #else select(...)#endif /* HAVE_POLL */ ... } ------------------- Windows implementation of WaitLatchAndSocket() seems not to be so simple. We would need to wait for both the latch event and the packet from the socket by using WaitForMultipleObjectsEx(). Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Bruce Momjian
Date:
Tom Lane wrote: > Greg Smith <greg@2ndquadrant.com> writes: > > Tom Lane wrote: > >> Well, yes they are. They cause unnecessary process wakeups and thereby > >> consume cycles even when the database is idle. See for example a > >> longstanding complaint here: > >> https://bugzilla.redhat.com/show_bug.cgi?id=252129 > > > ... The only clear case where this is > > always a win is when the system it totally idle. > > If you'll climb down off that horse for a moment: yeah, the idle case is > *exactly* what they're complaining about. In particular, the complaint > is that it's unreasonable to have Postgres running on a machine at all > unless it's actively being used, because it forces significant CPU power > drain anyway. That gets in the way of our plan for world domination, > no? If you can't have a PG sitting unobtrusively in the background, > waiting for you to have a need for it, it won't get installed in the > first place. People will pick mysql, or something else with a smaller > footprint, to put on their laptops, and then we lose some more mindshare. > I see this as just another facet of the argument about whether it's okay > to have default settings that try to take over the entire machine. FYI, last week I was running PG 8.4.X with default settings on a T43 laptop running XP with 1 Gig of RAM and it caused a racing game I was playing to run jerkily. When I shut down the Postgres server, the problem was fixed. I was kind of surprised that an idle PG server could cause that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 31/08/10 15:47, Fujii Masao wrote: > On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao<masao.fujii@gmail.com> wrote: >>> /* >>> * XXX: Should we invent an API to wait for data coming from the >>> * client connection too? It's not critical, but we could then >>> * eliminate the timeout altogether and go to sleep for good. >>> */ >> >> Yes, it would be very helpful when walsender waits for the ACK from >> the standby in upcoming synchronous replication. > > I propose to change WaitLatch() so that it accepts the socket > descriptor as an argument, to wait for data coming from the > client connection. Yeah, we probably should do that now. > WaitLatch() monitors not only the self-pipe > but also the given socket. If -1 is supplied, it checks only > the self-pipe. The obvious next question is how to wait for multiple sockets and a latch at the same time? Perhaps we should have a select()-like interface where you can pass multiple file descriptors. Then again, looking at the current callers of select() in the backend, apart from postmaster they all wait for only one fd. > The socket should be monitored by using poll() if the platform > has it, since poll() is usually more efficient. Yeah, I guess. > So I'd like to change Unix implementation of WaitLatch() as > follows. Thought? > > ------------------- > define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout) > > void WaitLatchAndSocket(Latch *latch, int sock, long timeout); > { > ... > > FD_SET(selfpipe_readfd,&input_mask); > if (sock != -1) > FD_SET(sock,&input_mask); > > #ifdef HAVE_POLL > poll(...) > #else > select(...) > #endif /* HAVE_POLL */ > > ... > } > ------------------- Yep. > Windows implementation of WaitLatchAndSocket() seems not to be > so simple. We would need to wait for both the latch event and > the packet from the socket by using WaitForMultipleObjectsEx(). Well, we already use WaitForMultipleObjectsEx() to implement select() on Windows, so it should be straightforward to copy that. I'll look into that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The obvious next question is how to wait for multiple sockets and a latch at > the same time? Perhaps we should have a select()-like interface where you > can pass multiple file descriptors. Then again, looking at the current > callers of select() in the backend, apart from postmaster they all wait for > only one fd. Currently backends have not waited for multiple sockets, so I don't think that interface is required for now. Similarly, we don't need to wait for the socket to be ready to *write* because there is no use case for now. >> Windows implementation of WaitLatchAndSocket() seems not to be >> so simple. We would need to wait for both the latch event and >> the packet from the socket by using WaitForMultipleObjectsEx(). > > Well, we already use WaitForMultipleObjectsEx() to implement select() on > Windows, so it should be straightforward to copy that. I'll look into that. Agreed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 02/09/10 06:46, Fujii Masao wrote: > On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> The obvious next question is how to wait for multiple sockets and a latch at >> the same time? Perhaps we should have a select()-like interface where you >> can pass multiple file descriptors. Then again, looking at the current >> callers of select() in the backend, apart from postmaster they all wait for >> only one fd. > > Currently backends have not waited for multiple sockets, so I don't think that > interface is required for now. Similarly, we don't need to wait for the socket > to be ready to *write* because there is no use case for now. Ok, here's an updated patch with WaitLatchOrSocket that let's you do that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Ok, here's an updated patch with WaitLatchOrSocket that let's you do that. Minor code review items: s/There is three/There are three/ in unix_latch.c header comment The header comment points out the correct usage of ResetLatch, but I think it should also emphasize that the correct usage of SetLatch is to set whatever flags indicate work-to-do before SetLatch. I don't care for the Asserts in InitLatch and InitSharedLatch. Initialization functions ought not assume that the struct they are told to initialize contains anything but garbage. And *especially* not assume that without documentation. s/inter-proess/inter-process/, at least 2 places Does ReleaseLatch() have any actual use-case, and if so what would it be? I think we could do without it. The WaitLatch timeout API could use a bit of refinement. I'd suggest defining negative timeout as meaning wait forever, so that timeout = 0 can be used for "check but don't wait". Also, it seems like the function shouldn't just return void but should return a bool to show whether it saw the latch set or timed out. (Yeah, I realize the caller could look into the latch to find that out, but callers really ought to treat latches as opaque structs.) I don't think you have the select-failed logic right in WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid test to make, which I think ain't the case. Just "continue" around the loop. Comment for unix_latch's latch_sigusr1_handler refers to SetEvent, which is a Windows-ism. + * XXX: Is it safe to elog(ERROR) in a signal handler? No, it isn't. It seems like both implementations are #include'ing more than they ought to --- why replication/walsender.h, in particular? I don't think unix_latch needs spin.h either. +typedef struct +{ + volatile sig_atomic_t is_set; + volatile sig_atomic_t owner_pid; +} Latch; I don't believe it is either sane or portable to declare struct fields as volatile. You need to attach the volatile qualifier to the function arguments instead, eg extern WaitLatch(volatile Latch *latch, ...) Also, using sig_atomic_t for owner_pid is entirely not sane. On many platforms sig_atomic_t is only a byte, and besides which you have no need for that field to be settable by a signal handler. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Fri, Sep 3, 2010 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Does ReleaseLatch() have any actual use-case, and if so what would it be? > I think we could do without it. In Unix, probably we can live without that. But in Windows, we need to free SharedEventHandles slot for upcoming process using a latch when ending. > The WaitLatch timeout API could use a bit of refinement. I'd suggest > defining negative timeout as meaning wait forever, so that timeout = 0 > can be used for "check but don't wait". Also, it seems like the > function shouldn't just return void but should return a bool to show > whether it saw the latch set or timed out. (Yeah, I realize the caller > could look into the latch to find that out, but callers really ought to > treat latches as opaque structs.) Agreed. > I don't think you have the select-failed logic right in > WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid > test to make, which I think ain't the case. Just "continue" around > the loop. EINTR already makes us go back to the top of the loop since FD_ISSET(pipe) is not checked. Then we would clear the pipe and break out of the loop because of "latch->is_set == true". > + * XXX: Is it safe to elog(ERROR) in a signal handler? > > No, it isn't. We should use elog(FATAL) or check proc_exit_inprogress, instead? + if (errno != EAGAIN && errno != EWOULDBLOCK) + { + /* + * XXX: Is it safe to elog(ERROR) in a signal handler? + */ + elog(ERROR, "write() on self-pipe failed: %m"); + } + if (errno == EINTR) + goto retry; "errno == EINTR)" seems to be never checked. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes: >> + * XXX: Is it safe to elog(ERROR) in a signal handler? >> >> No, it isn't. > We should use elog(FATAL) or check proc_exit_inprogress, instead? elog(FATAL) is *certainly* not a better idea. I think there's really nothing that can be done, you just have to silently ignore the error. BTW, if we retry, there had probably better be a limit on how many times to retry ... > + if (errno != EAGAIN && errno != EWOULDBLOCK) > + { > + /* > + * XXX: Is it safe to elog(ERROR) in a signal handler? > + */ > + elog(ERROR, "write() on self-pipe failed: %m"); > + } > + if (errno == EINTR) > + goto retry; > "errno == EINTR)" seems to be never checked. Another issue with coding like that is that it supposes elog() won't change errno. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Fujii Masao
Date:
On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >>> + * XXX: Is it safe to elog(ERROR) in a signal handler? >>> >>> No, it isn't. > >> We should use elog(FATAL) or check proc_exit_inprogress, instead? > > elog(FATAL) is *certainly* not a better idea. I think there's really > nothing that can be done, you just have to silently ignore the error. Hmm.. some functions called by a signal handler use elog(FATAL), e.g., RecoveryConflictInterrupt() do that when unknown conflict mode is given as an argument. Are these calls unsafe, too? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 02/09/10 23:13, Tom Lane wrote: > The WaitLatch ...timeout API could use a bit of refinement. I'd suggest > defining negative timeout as meaning wait forever, so that timeout = 0 > can be used for "check but don't wait". Also, it seems like the > function shouldn't just return void but should return a bool to show > whether it saw the latch set or timed out. In case of WaitLatchOrSocket, the caller might want to know if a latch was set, the socket became readable, or it timed out. So we need three different return values. > (Yeah, I realize the caller> could look into the latch to find that out, but callers really ought to> treat latches asopaque structs.) Hmm, maybe we need a TestLatch function to check if a latch is set. > I don't think you have the select-failed logic right in > WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid > test to make, which I think ain't the case. Just "continue" around > the loop. Yep. I also realized that the timeout handling is a bit surprising with interrupts. After EINTR we call select() again with the same timeout, so a signal effectively restarts the timer. We seem to have similar behavior in a couple of other places, in pgstat.c and auth.c. So maybe that's OK and just needs to be documented, but I thought I'd bring it up. > It seems like both implementations are #include'ing more than they > ought to --- why replication/walsender.h, in particular? Windows implementation needs it for the max_wal_senders variable, to allocate enough shared Event objects in LatchShmemInit. In unix_latch.c it's not needed. > Also, using sig_atomic_t for owner_pid is entirely not sane. > On many platforms sig_atomic_t is only a byte, and besides > which you have no need for that field to be settable by a > signal handler. Hmm, true, it doesn't need to be set from signal handler, but is there an atomicity problem if one process calls ReleaseLatch while another process is in SetLatch? ReleaseLatch sets owner_pid to 0, while SetLatch reads it and calls kill() on it. Can we assume that pid_t is atomic, or do we need a spinlock to protect it? (Windows implementation has a similar issue with HANDLE instead of pid_t) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes: > On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> elog(FATAL) is *certainly* not a better idea. �I think there's really >> nothing that can be done, you just have to silently ignore the error. > Hmm.. some functions called by a signal handler use elog(FATAL), e.g., > RecoveryConflictInterrupt() do that when unknown conflict mode is given > as an argument. Are these calls unsafe, too? [ shrug... ] I stated before that the Hot Standby patch is doing utterly unsafe things in signal handlers. Simon rejected that. I am waiting for irrefutable evidence to emerge from the field (and am very confident that it will be forthcoming...) before I argue with him further. Meanwhile, I'm not going to accept anything unsafe in a core facility like this patch is going to be. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 02/09/10 23:13, Tom Lane wrote: >>> (Yeah, I realize the caller >>> could look into the latch to find that out, but callers really ought to >>> treat latches as opaque structs.) > Hmm, maybe we need a TestLatch function to check if a latch is set. +1. It could be a macro for now. I wish that we could declare Latch as an opaque struct, but we probably need to let callers embed it in a larger struct, so we'll just have to rely on callers to code cleanly. > I also realized that the timeout handling is a bit surprising with > interrupts. After EINTR we call select() again with the same timeout, so > a signal effectively restarts the timer. Actually it's platform-dependent. On some machines (I think BSD-derived) the value of the timeout struct gets reduced by the elapsed time before returning, so that if you just loop around you don't get more than the originally specified delay time in total. > We seem to have similar > behavior in a couple of other places, in pgstat.c and auth.c. So maybe > that's OK and just needs to be documented, but I thought I'd bring it up. Yeah. I am hoping that this facility will greatly reduce the need for callers to care about the exact timeout delay, so I think that what we should do for now is just document that the timeout might be several times as long as specified, and see how it goes from there. We could fix the problem if we had to, by adding some gettimeofday() calls and computing the remaining delay time each time through the loop. But let's avoid doing that until it's clear we need it. >> Also, using sig_atomic_t for owner_pid is entirely not sane. > Hmm, true, it doesn't need to be set from signal handler, but is there > an atomicity problem if one process calls ReleaseLatch while another > process is in SetLatch? If there is *any* possibility of that happening then you have far worse problems than whether the field is atomically readable or not: the behavior will be unpredictable at just slightly larger timescales. This is the reason why I think it'd be better if ReleaseLatch simply didn't exist. That'd discourage people from designing dynamic latch structures, which are fundamentally going to be subject to race conditions. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> elog(FATAL) is *certainly* not a better idea. I think there's really >>> nothing that can be done, you just have to silently ignore the error. > >> Hmm.. some functions called by a signal handler use elog(FATAL), e.g., >> RecoveryConflictInterrupt() do that when unknown conflict mode is given >> as an argument. Are these calls unsafe, too? > > [ shrug... ] I stated before that the Hot Standby patch is doing > utterly unsafe things in signal handlers. Simon rejected that. > I am waiting for irrefutable evidence to emerge from the field > (and am very confident that it will be forthcoming...) before > I argue with him further. Meanwhile, I'm not going to accept anything > unsafe in a core facility like this patch is going to be. Oh. I thought you had ignored his objections and fixed it. Why are we releasing 9.0 with this problem again? Surely this is nuts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [ shrug... ] �I stated before that the Hot Standby patch is doing >> utterly unsafe things in signal handlers. �Simon rejected that. >> I am waiting for irrefutable evidence to emerge from the field >> (and am very confident that it will be forthcoming...) before >> I argue with him further. �Meanwhile, I'm not going to accept anything >> unsafe in a core facility like this patch is going to be. > Oh. I thought you had ignored his objections and fixed it. Why are > we releasing 9.0 with this problem again? Surely this is nuts. My original review of hot standby found about half a dozen things I thought were broken: http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php After a *very* long-drawn-out fight I fixed one of them (max_standby_delay), largely still over Simon's objections. I don't have the energy to repeat that another half-dozen times, so I'm going to wait for the suspected problems to be proven by field experience. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> [ shrug... ] I stated before that the Hot Standby patch is doing >>> utterly unsafe things in signal handlers. Simon rejected that. >>> I am waiting for irrefutable evidence to emerge from the field >>> (and am very confident that it will be forthcoming...) before >>> I argue with him further. Meanwhile, I'm not going to accept anything >>> unsafe in a core facility like this patch is going to be. > >> Oh. I thought you had ignored his objections and fixed it. Why are >> we releasing 9.0 with this problem again? Surely this is nuts. > > My original review of hot standby found about half a dozen things > I thought were broken: > http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php > After a *very* long-drawn-out fight I fixed one of them > (max_standby_delay), largely still over Simon's objections. I don't > have the energy to repeat that another half-dozen times, so I'm going > to wait for the suspected problems to be proven by field experience. Bummer. Allow me to cast a vote for doing something about the fact that handle_standby_sig_alarm() thinks it can safely acquire an LWLock in a signal handler. I think we should be making our decisions on what to change in the code based on what is technically sound, rather than based on how much the author complains about changing it. Of course there may be cases where there is a legitimate difference of opinion concerning the best way forward, but I don't believe this is one of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 03/09/10 17:51, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 02/09/10 23:13, Tom Lane wrote: >>> Also, using sig_atomic_t for owner_pid is entirely not sane. > >> Hmm, true, it doesn't need to be set from signal handler, but is there >> an atomicity problem if one process calls ReleaseLatch while another >> process is in SetLatch? > > If there is *any* possibility of that happening then you have far worse > problems than whether the field is atomically readable or not: the > behavior will be unpredictable at just slightly larger timescales. > This is the reason why I think it'd be better if ReleaseLatch simply > didn't exist. That'd discourage people from designing dynamic latch > structures, which are fundamentally going to be subject to race > conditions. Each Walsender needs a latch, and walsenders come and go. I first experimented with had no ReleaseLatch function; instead any process could call WaitLatch on any shared latch, as long as only one process waits on a given latch at a time. But it had exactly the same problem, WaitLatch had to set the pid on the Latch struct to allow other processes to send the signal. Another process could call SetLatch and read the pid field, while WaitLatch is just setting it. I think we'll have to put a spinlock there, if we can't assume that assignment of pid_t is atomic. It's not the end of the world.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 03/09/10 17:51, Tom Lane wrote: >> If there is *any* possibility of that happening then you have far worse >> problems than whether the field is atomically readable or not: the >> behavior will be unpredictable at just slightly larger timescales. > Each Walsender needs a latch, and walsenders come and go. Well, then we need to think extremely hard about the circumstances in which we need to send a cross-process latch signal to walsenders and what the behavior needs to be in the race conditions. > WaitLatch had to set the pid on the Latch struct to allow other > processes to send the signal. Another process could call SetLatch and > read the pid field, while WaitLatch is just setting it. I think we'll > have to put a spinlock there, if we can't assume that assignment of > pid_t is atomic. It's not the end of the world.. Yes it is. Signal handlers can't take spinlocks (what if they interrupt while the mainline is holding the lock?). It's probably not too unreasonable to assume that pid_t assignment is atomic. But I'm still thinking that we have bigger problems than that if there are really cases where SetLatch can execute at approximately the same time as a latch owner is coming or going. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 03/09/10 21:16, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> WaitLatch had to set the pid on the Latch struct to allow other >> processes to send the signal. Another process could call SetLatch and >> read the pid field, while WaitLatch is just setting it. I think we'll >> have to put a spinlock there, if we can't assume that assignment of >> pid_t is atomic. It's not the end of the world.. > > Yes it is. Signal handlers can't take spinlocks (what if they interrupt > while the mainline is holding the lock?). Ok, I see. > It's probably not too unreasonable to assume that pid_t assignment is > atomic. But I'm still thinking that we have bigger problems than that > if there are really cases where SetLatch can execute at approximately > the same time as a latch owner is coming or going. I don't see how to avoid it. A walsender, or any process really, can exit at any time. It can make the latch inaccessible to others before it exits to minimize the window, but it's always going to be possible that another process is just about to call SetLatch when you exit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 03/09/10 21:16, Tom Lane wrote: >> It's probably not too unreasonable to assume that pid_t assignment is >> atomic. But I'm still thinking that we have bigger problems than that >> if there are really cases where SetLatch can execute at approximately >> the same time as a latch owner is coming or going. > I don't see how to avoid it. A walsender, or any process really, can > exit at any time. It can make the latch inaccessible to others before it > exits to minimize the window, but it's always going to be possible that > another process is just about to call SetLatch when you exit. Well, in that case what we need to do is presume that the latch object has a continuing existence but the owner/receiver can come and go. I would suggest that InitLatch needs to initialize the object into a valid but unowned state; there is *no* deinitialize operation; and there are AcquireLatch and ReleaseLatch operations to become owner or stop being owner. We also need to define the semantics of SetLatch on an unowned latch --- does this set a signal condition that will be available to the next owner? This amount of complexity might be overkill for local latches, but I think we need it for shared ones. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Ron Mayer
Date:
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> [ shrug... ] I stated before that the Hot Standby patch is doing >>> utterly unsafe things in signal handlers. Simon rejected that. >>> I am waiting for irrefutable evidence to emerge from the field >>> (and am very confident that it will be forthcoming...) [...] > >> [...]Why are >> we releasing 9.0 with this problem again? Surely this is nuts. Will the docs give enough info so that release note readers will know when they're giving well-informed consent to volunteer to produce such field evidence?
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Sep 3, 2010 at 3:11 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote: > Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> [ shrug... ] I stated before that the Hot Standby patch is doing >>>> utterly unsafe things in signal handlers. Simon rejected that. >>>> I am waiting for irrefutable evidence to emerge from the field >>>> (and am very confident that it will be forthcoming...) [...] >> >>> [...]Why are >>> we releasing 9.0 with this problem again? Surely this is nuts. > > Will the docs give enough info so that release note readers > will know when they're giving well-informed consent to volunteer > to produce such field evidence? Yeah, exactly. Good news: you can now run queries on the standby. Bad news: we've abandoned our policy of not releasing with known bugs.Have fun and enjoy using PostgreSQL, the world's mostadvanced open source databSegmentation fault -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 03/09/10 19:38, Robert Haas wrote: > On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Robert Haas<robertmhaas@gmail.com> writes: >>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>>> [ shrug... ] I stated before that the Hot Standby patch is doing >>>> utterly unsafe things in signal handlers. Simon rejected that. >>>> I am waiting for irrefutable evidence to emerge from the field >>>> (and am very confident that it will be forthcoming...) before >>>> I argue with him further. Meanwhile, I'm not going to accept anything >>>> unsafe in a core facility like this patch is going to be. >> >>> Oh. I thought you had ignored his objections and fixed it. Why are >>> we releasing 9.0 with this problem again? Surely this is nuts. >> >> My original review of hot standby found about half a dozen things >> I thought were broken: >> http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php >> After a *very* long-drawn-out fight I fixed one of them >> (max_standby_delay), largely still over Simon's objections. I don't >> have the energy to repeat that another half-dozen times, so I'm going >> to wait for the suspected problems to be proven by field experience. > > Bummer. Allow me to cast a vote for doing something about the fact > that handle_standby_sig_alarm() thinks it can safely acquire an LWLock > in a signal handler. I think we should be making our decisions on > what to change in the code based on what is technically sound, rather > than based on how much the author complains about changing it. Of > course there may be cases where there is a legitimate difference of > opinion concerning the best way forward, but I don't believe this is > one of them. Hmm, just to make the risk more concrete, here's one scenario that could happen: 1. Startup process tries to acquire cleanup lock on a page. It's pinned, so it has to wait, and calls ResolveRecoveryConflictWithBufferPin(). 2. ResolveRecoveryConflictWithBufferPin enables the standby SIGALRM handler by calling enable_standby_sig_alarm(), and calls ProcWaitForSignal(). 3. ProcWaitForSignal() calls semop() (assuming sysv semaphores here) to wait for the process semaphore 4. Max standby delay is reached and SIGALRM fired. CheckStandbyTimeout() is called in signal handler. CheckStandbyTimeout() calls SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends() 5. CancelDBBackends() tries to acquire ProcArrayLock in exclusive mode. It's being held by another process, so we have to sleep 6. To sleep, LWLockAcquire calls PGSemaphoreLock, which calls semop() to wait on on the process semaphore. So we now have the same process nested twice inside a semop() call. Looking at the Linux signal (7) man page, it is undefined what happens if semop() is re-entered in a signal handler while another semop() call is happening in main line of execution. Assuming it somehow works, which semop() call is going to return when the semaphore is incremented? Maybe that's ok, if I'm reading the deadlock checker code correctly, it also calls semop() to increment the another process' semaphore, and the deadlock checker can be invoked from a signal handler while in semop() to wait on our process' semaphore. BTW, sem_post(), which is the Posix function for incrementing a semaphore, is listed as a safe function to call in a signal handler. But it's certainly fishy. A safer approach would be to just PGSemaphoreUnlock() in the signal handler, and do all the other processing outside it. You'd still call semop() within semop(), but at least it would be closer to the semop() within semop() we already do in the deadlock checker. And there would be less randomness from timing and lock contention involved, making it easier to test the behavior on various platforms. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Sep 3, 2010 at 4:20 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Maybe that's ok, if I'm reading the deadlock checker code correctly, it also > calls semop() to increment the another process' semaphore, and the deadlock > checker can be invoked from a signal handler while in semop() to wait on our > process' semaphore. BTW, sem_post(), which is the Posix function for > incrementing a semaphore, is listed as a safe function to call in a signal > handler. But it's certainly fishy. Color me confused; I may need to backpedal rapidly here. I had thought that what Tom was complaining about was the fact that the signal handler was taking LWLocks, which I would have thought to be totally unsafe. But it seems the deadlock detector does the same thing, more or less because the signal handlers are set up so that they don't do anything unless we're within a limited section of code where nothing too interesting can happen. I'm not too sure why we think that it's safe to invoke the deadlock detector that way, but it's also not too clear to me why this case is any worse. It furthermore appears that Simon's reply to Tom's complaint about this function was: "This was modelled very closely on handle_sig_alarm() and was reviewed by other hackers. I'm not great on that, as you know, so if you can explain what it is I can't do, and how that differs from handle_sig_alarm running the deadlock detector in the same way, then I'll work on it some more." I guess I need the same explanation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > A safer approach would be to just PGSemaphoreUnlock() in the signal > handler, and do all the other processing outside it. I don't see any particularly good reason to assume that PGSemaphoreUnlock is safe either: you're still talking about nested semop operations. The pre-existing SIGALRM handler uses a self-signal (kill(MyProcPid, SIGINT)) to kick the process off any wait it might be doing. I'd rather do something like that. Or maybe the work you're doing on latches would help ... regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Color me confused; I may need to backpedal rapidly here. I had > thought that what Tom was complaining about was the fact that the > signal handler was taking LWLocks, which I would have thought to be > totally unsafe. Well, it's unsafe if the signal could interrupt mainline code that is trying to take an LWLock or already holds the same LWLock --- or, if you consider deadlock risks against other processes, already holding other LWLocks might be problematic too. And trying to use facilities like elog or palloc is also pretty unsafe if the mainline is too. The reason the deadlock checker is okay is that there is only a *very* narrow range of mainline code that it could possibly be interrupting, basically nothing but the PGSemaphoreLock() call in ProcSleep. (There's a lot of other code inside the loop there, but notice that it's protected by tests that ensure that it won't run unless the deadlock checker already did.) One salient thing about ProcSleep is that you shouldn't call it while holding any LWLocks, and another is that if you're sleeping while holding regular heavyweight locks, the deadlock checker is exactly what will get you out of trouble if there's a deadlock. Now the HS case likewise appears to be set up so that the signal can only directly interrupt ProcWaitForSignal, so I think the core issue is whether any deadlock situations are possible. Given that this gets called from a low-level place like LockBufferForCleanup, I don't feel too comfortable about that. I certainly haven't seen any analysis or documentation of what locks can safely be held at that point. The deadlock checker only tries to take the LockMgr LWLocks, so extrapolating from whether it is safe to whether touching the ProcArrayLock is safe seems entirely unfounded. It might be worth pointing out here that LockBufferForCleanup is already known to be a risk factor for undetected deadlocks, even without HS in the picture, because of the possibility of deadlocks involving a chain of both heavyweight locks and LWLocks. Whether HS makes it materially worse may be something that we need field experience to determine. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Robert Haas
Date:
On Fri, Sep 3, 2010 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Now the HS case likewise appears to be set up so that the signal can > only directly interrupt ProcWaitForSignal, so I think the core issue is > whether any deadlock situations are possible. Given that this gets > called from a low-level place like LockBufferForCleanup, I don't feel > too comfortable about that. I certainly haven't seen any analysis or > documentation of what locks can safely be held at that point. > The deadlock checker only tries to take the LockMgr LWLocks, so > extrapolating from whether it is safe to whether touching the > ProcArrayLock is safe seems entirely unfounded. > > It might be worth pointing out here that LockBufferForCleanup is already > known to be a risk factor for undetected deadlocks, even without HS in > the picture, because of the possibility of deadlocks involving a chain > of both heavyweight locks and LWLocks. Whether HS makes it materially > worse may be something that we need field experience to determine. OK, well, that does make it more clear what you're worried about, but it seems that moving the work out of the signal handler isn't going to solve this problem. The core issue appears to be whether the caller might be holding a lock which another process might attempt to take while already holding ProcArrayLock, or (in the case of a three or more way deadlock) whether the caller might be holding a lock that some other process could try to acquire after seizing ProcArrayLock. As far as I can tell from looking through the code, the only locks we ever acquire while holding ProcArrayLock are XidGenLock and known_assigned_xids_lck (which is a spin lock). The latter is never held more than VERY briefly. XidGenLock is generally also held only briefly, not acquiring any other locks in the meantime, but GetNewTransactionId() is not obviously (to me) safe. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Greg Stark
Date:
On Fri, Sep 3, 2010 at 9:20 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > So we now have the same process nested twice inside a semop() call. Looking > at the Linux signal (7) man page, it is undefined what happens if semop() is > re-entered in a signal handler while another semop() call is happening in > main line of execution. Assuming it somehow works, which semop() call is > going to return when the semaphore is incremented? > Fwiw I wouldn't be too worried about semop specifically. It's a syscall and will always return with EINTR. What makes functions async-unsafe is when they might do extra processing manipulating data structures in user space such as mallocing memory. POSIX seems to be giving semop license to do that but realistically I can't imagine any implementation doing so. What I might wonder about is what happens if the signal is called just after the semop completes or just before it starts. And someone mentioned calling elog from within the signal handler -- doesn't elog call palloc() and sprintf? That can't be async-safe. -- greg
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Simon Riggs
Date:
On Fri, 2010-09-03 at 18:24 -0400, Tom Lane wrote: > Now the HS case likewise appears to be set up so that the signal can > only directly interrupt ProcWaitForSignal, so I think the core issue > is > whether any deadlock situations are possible. Given that this gets > called from a low-level place like LockBufferForCleanup, I don't feel > too comfortable about that. LockBufferForCleanup is only ever called during recovery by heap_xlog_clean() or btree_xlog_vacuum(). The actions taken to replay a WAL record are independent of all other WAL records from a locking perspective, so replay of every WAL record starts with no LWlocks held by startup process. LockBufferForCleanup is taken early on in replay a heap or btree cleanup record and so we can easily check that no other LWlocks are held while it is called. > I certainly haven't seen any analysis or > documentation of what locks can safely be held at that point. > The deadlock checker only tries to take the LockMgr LWLocks, so > extrapolating from whether it is safe to whether touching the > ProcArrayLock is safe seems entirely unfounded. So the startup process calls one LWlock, ProcArrayLock, and is not holding any other LWlock when it does. The deadlock checker attempts to get and hold all of the other lock partition locks. So deadlock checker already does the thing you're saying might be dangerous and the startup process doesn't. The ProcArrayLock is only taken as a way of signaling other backends. If that is particularly unsafe we could redesign that aspect. > It might be worth pointing out here that LockBufferForCleanup is > already > known to be a risk factor for undetected deadlocks, even without HS in > the picture, because of the possibility of deadlocks involving a chain > of both heavyweight locks and LWLocks. Whether HS makes it materially > worse may be something that we need field experience to determine. You may be right and that it will be a problem. The deadlock risk we're protecting against is a deadlock involving both normal locks and buffer pins. We're safer having it than not having this code, IMHO. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 03/09/10 21:50, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> On 03/09/10 21:16, Tom Lane wrote: >>> It's probably not too unreasonable to assume that pid_t assignment is >>> atomic. But I'm still thinking that we have bigger problems than that >>> if there are really cases where SetLatch can execute at approximately >>> the same time as a latch owner is coming or going. > >> I don't see how to avoid it. A walsender, or any process really, can >> exit at any time. It can make the latch inaccessible to others before it >> exits to minimize the window, but it's always going to be possible that >> another process is just about to call SetLatch when you exit. > > Well, in that case what we need to do is presume that the latch object > has a continuing existence but the owner/receiver can come and go. > I would suggest that InitLatch needs to initialize the object into a > valid but unowned state; there is *no* deinitialize operation; and > there are AcquireLatch and ReleaseLatch operations to become owner > or stop being owner. I think we have just a terminology issue. What you're describing is exactly how it works now, if you just s/InitLatch/AcquireLatch. At the moment there's no need for an initialization function other than the InitLatch/AcquireLatch that associates the latch with the current process. I can add one for the sake of future-proofing, and to have better-defined behavior for setting a latch that has not been owned by anyone yet, but it's not strictly necessary. > We also need to define the semantics of SetLatch > on an unowned latch --- does this set a signal condition that will be > available to the next owner? At the moment, no. Perhaps that would be useful, separating the Init and Acquire operations is needed to make that sane. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 03/09/10 21:50, Tom Lane wrote: >> Well, in that case what we need to do is presume that the latch object >> has a continuing existence but the owner/receiver can come and go. >> I would suggest that InitLatch needs to initialize the object into a >> valid but unowned state; there is *no* deinitialize operation; and >> there are AcquireLatch and ReleaseLatch operations to become owner >> or stop being owner. > I think we have just a terminology issue. What you're describing is > exactly how it works now, if you just s/InitLatch/AcquireLatch. No, it isn't. What I'm suggesting requires breaking InitLatch into two operations. >> We also need to define the semantics of SetLatch >> on an unowned latch --- does this set a signal condition that will be >> available to the next owner? > At the moment, no. Perhaps that would be useful, separating the Init and > Acquire operations is needed to make that sane. Exactly. I'm not totally sure either if it would be useful, but the current design makes it impossible to allow that. BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit ill chosen: ReleaseLatch sounds way too much like something that would just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or something along that line. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 06/09/10 17:18, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> I think we have just a terminology issue. What you're describing is >> exactly how it works now, if you just s/InitLatch/AcquireLatch. > > No, it isn't. What I'm suggesting requires breaking InitLatch into two > operations. > >>> We also need to define the semantics of SetLatch >>> on an unowned latch --- does this set a signal condition that will be >>> available to the next owner? > >> At the moment, no. Perhaps that would be useful, separating the Init and >> Acquire operations is needed to make that sane. > > Exactly. I'm not totally sure either if it would be useful, but the > current design makes it impossible to allow that. Ok, I've split the Init and Acquire steps into two. > BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit > ill chosen: ReleaseLatch sounds way too much like something that would > just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or > something along that line. Yeah, I see what you mean. Although, maybe it's just me but Own/Disown looks ugly. Anyone have a better suggestion? Here's an updated patch, with all the issues reported this far fixed, except for that naming issue, and Fujii's suggestion to use poll() instead of select() where available. I've also polished it quite a bit, improving comments etc. Magnus, can you take a look at the Windows implementation to check that it's sane? At least it seems to work. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
Hi, On 09/06/2010 06:27 PM, Heikki Linnakangas wrote: > Here's an updated patch, with all the issues reported this far fixed, > except for that naming issue, and Fujii's suggestion to use poll() > instead of select() where available. I've also polished it quite a bit, > improving comments etc. Magnus, can you take a look at the Windows > implementation to check that it's sane? At least it seems to work. Is pselect() really as unportable as stated in the patch? What platforms have problems with pselect()? Using the self-pipe trick, don't we risk running into the open file handles limitation? Or is it just two handles per process? Do I understand correctly that the purpose of this patch is to work around the brokenness of select() on very few platforms? Or is there any additional feature that plain signals don't give us? > + * It's important to reset the latch*before* checking if there's work to > + * do. Otherwise, if someone sets the latch between the check and the > + * ResetLatch call, you will miss it and Wait will block. Why doesn't WaitLatch() clear it? What's the use case for waiting for a latch and *not* wanting to reset it? Regards Markus
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes: > Is pselect() really as unportable as stated in the patch? What platforms > have problems with pselect()? Well, it's not defined in the Single Unix Spec, which is our customary reference for portability. Also, it's alleged that some platforms have it but in a form that's not actually any safer than select(). For example, I read in the Darwin man page for it IMPLEMENTATION NOTES The pselect() function is implemented in the C library as a wrapper around select(). and that man page appears to be borrowed verbatim from FreeBSD. > Using the self-pipe trick, don't we risk running into the open file > handles limitation? Or is it just two handles per process? It's just two handles per process. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
On 09/06/2010 08:46 PM, Tom Lane wrote: > Well, it's not defined in the Single Unix Spec, which is our customary > reference for portability. FWIW, I bet the self-pipe trick isn't mentioned there, either... any good precedence that it actually works as expected on all of the target platforms? Existing users of the self-pipe trick? (You are certainly aware that pselect() is defined in newer versions of POSIX). > Also, it's alleged that some platforms have > it but in a form that's not actually any safer than select(). For > example, I read in the Darwin man page for it > > IMPLEMENTATION NOTES > The pselect() function is implemented in the C library as a wrapper > around select(). Ouch. Indeed, quick googling reveals the following source code for Darwin: http://www.opensource.apple.com/source/Libc/Libc-391.5.22/gen/FreeBSD/pselect.c Now that you are mentioning it, I seem to recall that even glibc had a user-space "implementation" of pselect. Fortunately, that's quite some years ago. > and that man page appears to be borrowed verbatim from FreeBSD. At least FreeBSD seems to have fixed this about 8 months ago: http://svn.freebsd.org/viewvc/base?view=revision&revision=200725 Maybe Darwin catches up eventually? AFAICT the custom select() implementation we are using for Windows could easily be changed to mimic pselect() instead. Thus most reasonably up-to-date Linux distributions plus Windows certainly provide a workable pselect() syscall. Would it be worth using pselect() for those (and maybe others that support pselect() appropriately)? > It's just two handles per process. Good. How about syscall overhead? One more write operation to the self-pipe per signal from within the signal handler and one read to actually clear the 'ready' state of the pipe from the waiter portion of the code, right? Do we plan to replace all (or most) existing internal signals with these latches to circumvent the interruption problem? Or just the ones we need to wait for using pg_usleep()? For Postgres-R, I'd probably have to extend it to call select() not only on the self-pipe, but on at least one other socket as well (to talk to the GCS). As long as that's possible, it looks like a more portable replacement for the pselect() variant that's currently in place there. Regards Markus
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes: > AFAICT the custom select() implementation we are using for Windows could > easily be changed to mimic pselect() instead. Thus most reasonably > up-to-date Linux distributions plus Windows certainly provide a workable > pselect() syscall. Would it be worth using pselect() for those (and > maybe others that support pselect() appropriately)? I don't entirely see the point of opening ourselves up to the risk of using a pselect that's not safe under the hood. In any case, on most modern platforms poll() is preferable to any variant of select(). regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 06/09/10 23:10, Markus Wanner wrote: > Good. How about syscall overhead? One more write operation to the > self-pipe per signal from within the signal handler and one read to > actually clear the 'ready' state of the pipe from the waiter portion of > the code, right? Right. > Do we plan to replace all (or most) existing internal signals with these > latches to circumvent the interruption problem? Or just the ones we need > to wait for using pg_usleep()? At least the poll loops in bgwriter and walwriter need to be replaced if we want to fix the issue Tom mentioned earlier that the server currently wakes up periodically, waking up the CPU which wastes electricity. There's no hurry to replace other code. > For Postgres-R, I'd probably have to extend it to call select() not only > on the self-pipe, but on at least one other socket as well (to talk to > the GCS). As long as that's possible, it looks like a more portable > replacement for the pselect() variant that's currently in place there. Yeah, that would be a straightforward extension. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 06/09/10 20:24, Markus Wanner wrote: > On 09/06/2010 06:27 PM, Heikki Linnakangas wrote: >> + * It's important to reset the latch*before* checking if there's work to >> + * do. Otherwise, if someone sets the latch between the check and the >> + * ResetLatch call, you will miss it and Wait will block. > > Why doesn't WaitLatch() clear it? What's the use case for waiting for a > latch and *not* wanting to reset it? Setting a latch that's already set is very fast, so you want to keep it set until the last moment. See the coding in walsender for example, it goes to some lengths to avoid clearing the latch until it's very sure there's no more work for it to do. That helps to keep the overhead in backends committing transactions low. (no-one has tried to measure that yet, though) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
On 09/07/2010 09:06 AM, Heikki Linnakangas wrote: > Setting a latch that's already set is very fast, so you want to keep it > set until the last moment. See the coding in walsender for example, it > goes to some lengths to avoid clearing the latch until it's very sure > there's no more work for it to do. That helps to keep the overhead in > backends committing transactions low. (no-one has tried to measure that > yet, though) Understood, thanks. Markus
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
Hi, On 09/06/2010 11:03 PM, Tom Lane wrote: > I don't entirely see the point of opening ourselves up to the risk of > using a pselect that's not safe under the hood. It should be possible to reliably determine the platforms that provide an atomic pselect(). For those, I'm hesitant to use a "trick", where pselect() clearly provides a simpler and more "official" alternative. Especially considering that those platforms form the vast majority for running Postgres on. What I'm most concerned about is the write() syscall within the signal handler. If that fails for another reason than those covered, we miss the signal. As Heikki points out in the comment, it's hard to deal with such a failure. Regarding the exact implementation, the positioning of drainSelfPipe in Heikki's implementation seems strange to me. Most descriptions of the self-pipe trick [1] [2] [4] put the drainSelfPipe() just after the select(), where you can be sure there actually is something to read. (Except [3], which recommends putting it inside the signal handler, which I find even more frightening). Maybe you can read more than one byte at a time in drainSelfPipe(), to save some syscalls? Talking about the trick itself again: I found a lot of descriptions and mentioning of the self-pipe trick, but so far I only found an unknown window manager [5] and the custom inetd that's mentioned in the LWN article [4] which really use that trick. Digging deeper revealed that there's a sigsafe library [6] as well as the bglibs [7] which both seems to use the self-pipe trick as well (of which the later doesn't even care about the write()'s return value in the signal handler). None of these two libraries seems to be used in any project of relevance. Overall I got the impression that people like to describe the trick, because it sounds so nifty and clever. However, I'd feel more comfortable if I knew there were some medium to large projects that actually use that trick. But AFAICT not even Bernstein's qmail does. > In any case, on most > modern platforms poll() is preferable to any variant of select(). Only Linux provides a ppoll() variant. And poll() itself doesn't replace pselect(). Overall, I'm glad this gets addressed. Note that this is a long standing issue for Postgres-R and it's covered with a lengthy comment in its TODO file [8]. Regards Markus Wanner [1] D. J. Bernstein, The self-pipe trick http://cr.yp.to/docs/selfpipe.html [2] Emile van Bergen, Avoiding races with Unix signals and select() http://www.xs4all.nl/~evbergen/unix-signals.html [3] Alex Pennace, Safe UNIX Signal Handling Tips http://osiris.978.org/~alex/safesignalhandling.html [4] LWN Article: The new pselect() system call, mentions the self-pipe trick in a comment http://lwn.net/Articles/176911/ [5] Karmen: a window manager http://freshmeat.net/projects/karmen [6] sigsafe library http://www.slamb.org/projects/sigsafe/ [7] Bruce Guenter, one stop library package http://untroubled.org/bglibs/ [8] Postgres-R TOOD entry http://git.postgres-r.org/?p=Postgres-R;a=blob;f=src/backend/replication/TODO;h=7bfc37ee9629943b9ff052d571b9d933ab38a0a8;hb=HEAD#l12
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 08/09/10 20:36, Markus Wanner wrote: > On 09/06/2010 11:03 PM, Tom Lane wrote: >> I don't entirely see the point of opening ourselves up to the risk of >> using a pselect that's not safe under the hood. > > It should be possible to reliably determine the platforms that provide > an atomic pselect(). For those, I'm hesitant to use a "trick", where > pselect() clearly provides a simpler and more "official" alternative. > Especially considering that those platforms form the vast majority for > running Postgres on. Perhaps, but I'm equally concerned that having different implementations for different platforms means that all implementations get less testing than if we use only one. Because of that I'm actually reluctant to even use poll() where available instead of select(). At least in the first phase, until someone demonstrates that there's a measurable difference in performance. We only call poll/select when we're about to sleep, so it's not really that performance critical anyway. > What I'm most concerned about is the write() syscall within the signal > handler. If that fails for another reason than those covered, we miss > the signal. As Heikki points out in the comment, it's hard to deal with > such a failure. Yeah, there isn't much you can do about it. Perhaps you could set a "mayday flag" (a global boolean variable) if it fails, and check that in the main loop, elogging a warning there instead. But I don't think we need to go to such lengths, realistically the write() will never fail or you have bigger problems. > Maybe you can read more than one byte at a time in drainSelfPipe(), to > save some syscalls? Perhaps, although it should be very rare to have more than one byte in the pipe. SetLatch doesn't write another byte if the latch is already set, so you only get multiple bytes in the pipe if many processes set the latch at the same instant. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 08/09/10 20:36, Markus Wanner wrote: >> It should be possible to reliably determine the platforms that provide >> an atomic pselect(). For those, I'm hesitant to use a "trick", where >> pselect() clearly provides a simpler and more "official" alternative. >> Especially considering that those platforms form the vast majority for >> running Postgres on. > Perhaps, but I'm equally concerned that having different implementations > for different platforms means that all implementations get less testing > than if we use only one. There's that, and there's also that Markus' premise is full of holes. Exactly how will you determine that pselect is safe at compile time? Even if you correctly determine that, how can you be sure that the finished executable will only be run against a version of libc that has a safe implementation? Considering that we know that major platforms such as FreeBSD have changed their implementations *very* recently, it seems foolish to assume that an executable built on a machine with corrected pselect could not be run on one with an older implementation. > Because of that I'm actually reluctant to even > use poll() where available instead of select(). At least in the first > phase, until someone demonstrates that there's a measurable difference > in performance. select() is demonstrably a loser whenever the process has a lot of open files. Also, we have plenty of experience with substituting poll() for select(), so I'm not too worried about copy-and-pasting such code. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Martijn van Oosterhout
Date:
On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote: > Do I understand correctly that the purpose of this patch is to work > around the brokenness of select() on very few platforms? Or is there any > additional feature that plain signals don't give us? If the issue is just that select() doesn't get interrupted and we don't care about a couple of syscalls, would it not be better to simply use sigaction to turn on SA_RESTART just prior to the select() and turn it off just after. Or are these systems so broken that select() won't be interrupted, even if the signal handler is explicitly configured to do so? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes: > If the issue is just that select() doesn't get interrupted and we don't > care about a couple of syscalls, would it not be better to simply use > sigaction to turn on SA_RESTART just prior to the select() and turn it > off just after. Or are these systems so broken that select() won't be > interrupted, even if the signal handler is explicitly configured to do > so? I think you mean turn *off* SA_RESTART. We'd have to do that for each signal that we were concerned about allowing to interrupt the select(), so it's more than just two added calls. Another small problem is that the latch code doesn't/shouldn't know what handlers are active, and AFAICS you can't use sigaction() to flip that flag without setting the handler address too. So while maybe we could do it that way, it'd be pretty dang messy. In my mind the main value of the Latch code will be to have a clean platform-independent API for waiting. Why all the angst about whether the implementation underneath is clean or not? It's more important that it *works* and we don't have to worry about whether it will break on platform XYZ. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 08/09/10 23:07, Martijn van Oosterhout wrote: > On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote: >> Do I understand correctly that the purpose of this patch is to work >> around the brokenness of select() on very few platforms? Or is there any >> additional feature that plain signals don't give us? > > If the issue is just that select() doesn't get interrupted and we don't > care about a couple of syscalls, would it not be better to simply use > sigaction to turn on SA_RESTART just prior to the select() and turn it > off just after. Or are these systems so broken that select() won't be > interrupted, even if the signal handler is explicitly configured to do > so? I don't know if SA_RESTART is portable. But in any case, that will do nothing about the race condition where the signal arrives just *before* the select() call. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
On 09/08/2010 08:18 PM, Tom Lane wrote: > Considering that we know that major platforms > such as FreeBSD have changed their implementations *very* recently, > it seems foolish to assume that an executable built on a machine with > corrected pselect could not be run on one with an older implementation. FWIW testing a recent development (i.e. 9.0-devel) version of FreeBSD still failed to properly support pselect(). > Also, we have plenty of experience with substituting poll() for > select(), so I'm not too worried about copy-and-pasting such code. It's certainly a simpler change than the self-pipe trick vs. pselect(), yes. I'm happy to go with the self-pipe trick. A quick micro-benchmark didn't even show any significant difference compared to pselect(), so form that perspective, it's not a big deal. And we'd then have a major project using the self-pipe trick. (I would still like to know what others exist). Regards Markus Wanner
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Markus Wanner
Date:
On 09/08/2010 08:01 PM, Heikki Linnakangas wrote: > Yeah, there isn't much you can do about it. Perhaps you could set a > "mayday flag" (a global boolean variable) if it fails, and check that in > the main loop, elogging a warning there instead. But I don't think we > need to go to such lengths, realistically the write() will never fail or > you have bigger problems. Hm.. I think I'd like to see such a mayday flag. Just so we at least have a chance of finding out that something has gone wrong - whether or not there's a bigger problem. > Perhaps, although it should be very rare to have more than one byte in > the pipe. SetLatch doesn't write another byte if the latch is already > set, so you only get multiple bytes in the pipe if many processes set > the latch at the same instant. Depending on what you use these latches for, it might not be that rare anymore. Trying to read more than one byte at a time doesn't cost anything. Regards Markus
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 06/09/10 19:27, Heikki Linnakangas wrote: > On 06/09/10 17:18, Tom Lane wrote: >> BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit >> ill chosen: ReleaseLatch sounds way too much like something that would >> just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or >> something along that line. > > Yeah, I see what you mean. Although, maybe it's just me but Own/Disown > looks ugly. Anyone have a better suggestion? Magnus suggested AssociateLatch, given that the description of the function is that it associates the latch with the current process. I went with Own/Disown after all, it feels more precise, and having made the changes it doesn't look that ugly to me anymore. > Here's an updated patch, with all the issues reported this far fixed, > except for that naming issue, and Fujii's suggestion to use poll() > instead of select() where available. I've also polished it quite a bit, > improving comments etc. Magnus, can you take a look at the Windows > implementation to check that it's sane? At least it seems to work. We discussed the patch over IM, there's one point minor point I'd like to get into the archives: > It seems that NumSharedLatches() is entirely wrongly placed if it's in > the win32 specific code! That needs to be somewhere shared, so people find it, Yeah. There's a notice of that in OwnLatch(), but it would be nice if we could make it even more prominent. One idea is to put in latch.h as: #define NumSharedLatches() (max_wal_senders /* + something else in the future */ ) When it's a #define, we don't need to put #include "walsender.h" in latch.h, it's enough to put it in win32_latch.c. It's a bit weird to have a #define in one header file that doesn't work unless you #include another header file in where you use it, but it would work. Any opinions on whether that's better than having NumSharedLatches() defined in the Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in win32_latch.c, but I'm not sure. Barring any last-minute objections, I'll commit this in the next few days. This patch doesn't affect walreceiver yet; I think the next step is to use the latches to eliminate the polling loop in walreceiver too, so that as soon as a piece of WAL is fsync'd to disk in the standby, it's applied. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 06/09/10 19:27, Heikki Linnakangas wrote: >> It seems that NumSharedLatches() is entirely wrongly placed if it's in >> the win32 specific code! That needs to be somewhere shared, so people find it, > Yeah. There's a notice of that in OwnLatch(), but it would be nice if we > could make it even more prominent. One idea is to put in latch.h as: > #define NumSharedLatches() (max_wal_senders /* + something else in the > future */ ) > When it's a #define, we don't need to put #include "walsender.h" in > latch.h, it's enough to put it in win32_latch.c. It's a bit weird to > have a #define in one header file that doesn't work unless you #include > another header file in where you use it, but it would work. We have other precedents for that. But having said that ... > Any opinions > on whether that's better than having NumSharedLatches() defined in the > Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in > win32_latch.c, but I'm not sure. I'd leave it alone. I see no very good reason to expose NumSharedLatches globally. > Barring any last-minute objections, I'll commit this in the next few > days. This patch doesn't affect walreceiver yet; I think the next step > is to use the latches to eliminate the polling loop in walreceiver too, > so that as soon as a piece of WAL is fsync'd to disk in the standby, > it's applied. I will do some work as well once it's in. Since I was the one complaining about extra wakeups in the other processes, I'm willing to go fix 'em. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 11/09/10 18:02, Tom Lane wrote: > Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes: >> Barring any last-minute objections, I'll commit this in the next few >> days. This patch doesn't affect walreceiver yet; I think the next step >> is to use the latches to eliminate the polling loop in walreceiver too, >> so that as soon as a piece of WAL is fsync'd to disk in the standby, >> it's applied. > > I will do some work as well once it's in. Since I was the one > complaining about extra wakeups in the other processes, I'm willing > to go fix 'em. Committed. I'll take a look at making walreceiver respond quickly when WAL arrives in the standby, using latches, but that shouldn't interfere with what you're doing. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Jeff Davis
Date:
On Sat, 2010-09-11 at 19:15 +0300, Heikki Linnakangas wrote: > Committed. I'll take a look at making walreceiver respond quickly when > WAL arrives in the standby, using latches, but that shouldn't interfere > with what you're doing. I glanced at the code, and I see (in OwnLatch()): + if (latch->owner_pid != 0) + elog(ERROR, "latch already owned"); + latch->owner_pid = MyProcPid; But it looks like there may be a race there. I assume the callers are supposed to have a lock at this point (and it looks like the current caller is safe, but I didn't look in detail). Something still seems strange to me though -- why throw an ERROR there if it can't happen (or indicates an inconsistent state when it does happen)? And it doesn't look safe to _not_ throw an error (due to a race) if it does happen. It seems like OwnLatch is only supposed to be used when you_not_ to throw an error in the event of a race (I don't think it is). already know that you own it, because there's no error code returned or blocking behavior, it just throws an ERROR. But if that's the case, what's the point of OwnLatch()? Perhaps add a few comments to describe to other users of the API how to properly own a shared latch? Regards,Jeff Davis
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes: > I glanced at the code, and I see (in OwnLatch()): > + if (latch->owner_pid != 0) > + elog(ERROR, "latch already owned"); > + latch->owner_pid = MyProcPid; > But it looks like there may be a race there. Yeah, that error check is only intended to catch gross logic errors, not to guard against race conditions. I don't think we really could prevent a race there without adding a spinlock, which seems like overkill. > ... why throw an ERROR there if it can't happen (or > indicates an inconsistent state when it does happen)? Are you suggesting that an Assert would be sufficient? regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Jeff Davis
Date:
On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote: > > ... why throw an ERROR there if it can't happen (or > > indicates an inconsistent state when it does happen)? > > Are you suggesting that an Assert would be sufficient? > I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts aren't available in production systems, which I assume is why elog was used); but we should be consistent and document that:(a) it shouldn't happen(b) that it's just a sanity check and we're ignoringthe race However, that also means that the whole concept of OwnLatch/DisownLatch is entirely redundant, and only there for asserts because it doesn't do anything else. That seems a little strange to me, as well, so (at minimum) it should be documented that the functions really have no effect on execution and are required only to support debugging. Regards,Jeff Davis
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Jeff Davis
Date:
On Sun, 2010-09-12 at 10:13 -0700, Jeff Davis wrote: > I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts > aren't available in production systems, which I assume is why elog was > used); but we should be consistent and document that: > (a) it shouldn't happen > (b) that it's just a sanity check and we're ignoring the race I should add that Assert seems to imply both of those things (at least to me), so that would solve one of the confusing aspects of the API. The other confusing part that I think still needs comments is that OwnLatch/DisownLatch don't really do anything; they just carry information for sanity checks later. Regards,Jeff Davis
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes: > However, that also means that the whole concept of OwnLatch/DisownLatch > is entirely redundant, and only there for asserts because it doesn't do > anything else. That seems a little strange to me, as well, so (at > minimum) it should be documented that the functions really have no > effect on execution and are required only to support debugging. Uh, this is nonsense. You have to have something like these functions to support transferring ownership of a latch from one process to another, which is required at least for the walreceiver usage. It's correct that the latch code itself isn't trying very hard to avoid a race condition in acquiring ownership, but that doesn't make the whole thing useless, it just means that we're assuming that will be avoided by logic elsewhere. If there is a bug elsewhere that allows two different processes to try to take ownership of the same latch, the current coding will expose that bug soon enough. regards, tom lane
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Jeff Davis
Date:
On Sun, 2010-09-12 at 14:12 -0400, Tom Lane wrote: > Uh, this is nonsense. You have to have something like these functions > to support transferring ownership of a latch from one process to > another, which is required at least for the walreceiver usage. Oh, I see. It's needed to know where to send the signal, of course. Regards,Jeff Davis
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 12/09/10 20:13, Jeff Davis wrote: > On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote: >>> ... why throw an ERROR there if it can't happen (or >>> indicates an inconsistent state when it does happen)? >> >> Are you suggesting that an Assert would be sufficient? > > I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts > aren't available in production systems, which I assume is why elog was > used); Right, OwnLatch is a not performance-critical, so it's better to elog() IMHO. > but we should be consistent and document that: > (a) it shouldn't happen > (b) that it's just a sanity check and we're ignoring the race Would this be sufficient? --- a/src/backend/port/unix_latch.c +++ b/src/backend/port/unix_latch.c @@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch) if (selfpipe_readfd == -1) initSelfPipe(); + /* sanity check */ if (latch->owner_pid != 0) elog(ERROR, "latch already owned"); latch->owner_pid =MyProcPid; Or you want to suggest something better? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Jeff Davis
Date:
On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote: > > but we should be consistent and document that: > > (a) it shouldn't happen > > (b) that it's just a sanity check and we're ignoring the race > > Would this be sufficient? > > --- a/src/backend/port/unix_latch.c > +++ b/src/backend/port/unix_latch.c > @@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch) > if (selfpipe_readfd == -1) > initSelfPipe(); > > + /* sanity check */ > if (latch->owner_pid != 0) > elog(ERROR, "latch already owned"); > latch->owner_pid = MyProcPid; > > Or you want to suggest something better? Perfect. I was just slightly confused reading it the first time, and I think that would have cleared it up. Thanks,Jeff Davis
Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)
From
Heikki Linnakangas
Date:
On 13/09/10 20:43, Jeff Davis wrote: > On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote: >>> but we should be consistent and document that: >>> (a) it shouldn't happen >>> (b) that it's just a sanity check and we're ignoring the race >> >> Would this be sufficient? >> >> --- a/src/backend/port/unix_latch.c >> +++ b/src/backend/port/unix_latch.c >> @@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch) >> if (selfpipe_readfd == -1) >> initSelfPipe(); >> >> + /* sanity check */ >> if (latch->owner_pid != 0) >> elog(ERROR, "latch already owned"); >> latch->owner_pid = MyProcPid; >> >> Or you want to suggest something better? > > Perfect. I was just slightly confused reading it the first time, and I > think that would have cleared it up. Ok, added that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com