Thread: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
Hi, This mail is a answer to http://archives.postgresql.org/message-id/20150110022542.GF12509%40alap3.anarazel.de but I decided that it's a good go move it into a new thread since the patchseries has outgrown its original target. It's invasive and touches many arcane areas of the code, so that I think more eyes than a long deep thread is likely to have, are a good thing. As a short description of the goal of the patchseries: This started out as steps on the way to be able to safely handle terminate_backend() et al when we're reading/writing from the client. E.g. because the client is dead and we haven't noticed yet and are stuck writing, or because we want to implement a idle_in_transaction timeout. Making these changes allowed me to see that not doing significant work in signal handlers can make code much simpler and more robust. The number of bugs postgres had in the past that involved doing too much in signal handlers is significant. Thus I've since extended the patchseries to get rid of nearly all nontrivial work in signal handlers. All the patches in the series up to 0008 hav ecommit messages providing more detail. A short description of each patch follows: 0001: Replace walsender's latch with the general shared latch. New patch that removes ImmediateInteruptOK behaviour from walsender. I think that's a rather good idea, because walsender currently seems to assume WaitLatchOrSocket is reentrant - which I don't think is really guaranteed. Hasn't been reviewed yet, but I think it's not far from being committable. 0002: Use a nonblocking socket for FE/BE communication and block using latches. Has previously been reviewed by Heikki. I think Noah also had a look, although I'm not sure how close that was. I think this can be committed soon. 0003: Introduce and use infrastructure for interrupt processing during client reads. From here on ImmediateInterruptOK isn't set during client communication. Normal interrupts and sinval/async interrupts are processed outside of signal handlers. Especially the sinval/async greatly simplify the respective code. Has been reviewed by Heikki in an earlier version - I think I addressed all the review comments. I'd like somebody else to look at the sinval/async.c changes before committing. I really like the simplification this change allowed. I think this patch might not be safe without 0004 because I can't convince myself that it's safe to interrupt latch waits with work that actually might also use the same latch (sinval interrupts). But it's easier to review this way as 0004 is quite big. 0004: Process 'die' interrupts while reading/writing from the client socket. This is the reason Horiguchi-san started this thread. I think the important debate here is whether we think it's acceptable that there are situations (a full socket buffer, but a alive connection) where the client previously got an error, but not anymore afterwards. I think that's much better than having unkillable connections, but I can see others being of a different opinion. 0005: Move deadlock and other interrupt handling in proc.c out of signal handlers. I'm surprised how comparatively simple this turned out to be. For robustness and understanding I think this is a great improvement. New and not reviewed at all. Needs significant review. But works in my simple testcases. 0006: Don't allow immediate interupts during authentication anymore. So far we've set ImmediateInterruptOK to true during large parts of the client authentication - that's not all that pretty, interrupts might arrive while we're in some system routines. Due to patches 0003/0004 we now are able to safely serve interrupts during client communication which is the major are where we want to adhere to authentication_timeout. I additionally placed some CHECK_FOR_INTERRUPTS()s in some somewhat randomly chosen places in auth.c. Those don't completely get back the previous 'appruptness' (?) of reacting to interrupts, but that's partially for the better, because we don't interrupt foreign code anymore. 0007: Remove the option to service interrupts during PGSemaphoreLock(). Due to patch 0005 there's no user of PGSemaphoreLock(lock, interruptOK = true) anymore, so remove the code to handle that. I find it rather odd that the code did a CHECK_FOR_INTERRUPTS before the semwait even when interruptOK was set to to false - that only happened to work because lwlock.c did a HOLD_INTERRUPTS() around the semaphore code. It's now removed entirely. This is a API break because the signature of PGSemaphoreLock() changed. But I have a hard time seing that as a problem. For one I don't think there's many users of PGSemaphoreLock, for another they should change their interrupt handling. New and not reviewed. 0008: Remove remnants of ImmediateInterruptOK handling. Now that ImmediateInterruptOK is never set to true anymore, we can remove related code and comments. New and not reviewed. 0009: WIP: lwlock: use latches Optional patch that removes the use of semaphores from the lwlock code. There's no benefit for lwlock.c itself due to this. But it does get rid of the last user of semaphores in a normal postgres build. I primarily wrote this to test the performance of latches. I guess we want to do this anyway. 0010: WIP: unix_latch: WIP efficiency hackery Nothing ready, although I think using eventfds on linux is worth the cost. Questions I have about the series right now: 1) Do others agree that getting rid of ImmediateInterruptOK is worthwile. I personally think that we really should pursue that aggressively as a goal. 2) Does anybody see a significant problem with the reduction of cases where we immediately react to authentication_timeout? Since we still react immediately when we're waiting for the client (except maybe in sspi/kerberos?) I think that's ok. The waiting cases are slow ident/radius/kerberos/ldap/... servers. But we really shouldn't jump out of the relevant code anyway. 3) If we do everything (in corrected), upto 0009, there's no remaining user of semaphores in postgres, except the spinlock emulation. Does anybody see a need for PGPROC->sem to remain? Removing it would have the significant benefit that semaphores are the last thing users frequently need to tune to get postgres to start up when using a higher max_connection or multiple clusters. If we remove PGPROC->sem does anybody see a way to get rid of the semaphore code alltogether? I personally still think we should just gank --disable-spinlocks. Now that it's only a couple lines (in an preexisting patch) to rely on compiler intrinsics for spinlocks that doesn't sound like a big deal. Even if not, we could decide to get rid of win32_sem.c... 4) There remains one user of something like ImmediateInterruptOK - walreceiver. It has WalRcvImmediateInterruptOK. We definitely should get rid of that as well, but that's a independent patch that can be written in the future. 5) In a future patch I think we could also handle catchup interrupts during other client reads, not just ReadCommand(). Does anybody see that as a worthwile goal? I can't remember many problems with not enough catchup happening, but I think someone mentioned that as a problem in the past. 6) Review ;) Comments? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Replace-walsender-s-latch-with-the-general-shared-la.patch
- 0002-Use-a-nonblocking-socket-for-FE-BE-communication-and.patch
- 0003-Introduce-and-use-infrastructure-for-interrupt-proce.patch
- 0004-Process-die-interrupts-while-reading-writing-from-th.patch
- 0005-Move-deadlock-and-other-interrupt-handling-in-proc.c.patch
- 0006-Don-t-allow-immediate-interupts-during-authenticatio.patch
- 0007-Remove-the-option-to-service-interrupts-during-PGSem.patch
- 0008-Remove-remnants-of-ImmediateInterruptOK-handling.patch
- 0009-WIP-lwlock.c-use-latches.patch
- 0010-WIP-unix_latch.c-efficiency-hackery.patch
Hello, I'd synced up this at last. I think I should finilize my commitfest item for this issue, with .. "Rejected"? > This mail is a answer to > http://archives.postgresql.org/message-id/20150110022542.GF12509%40alap3.anarazel.de > but I decided that it's a good go move it into a new thread since the > patchseries has outgrown its original target. It's invasive and touches > many arcane areas of the code, so that I think more eyes than a long > deep thread is likely to have, are a good thing. > > As a short description of the goal of the patchseries: > This started out as steps on the way to be able to safely handle > terminate_backend() et al when we're reading/writing from the > client. E.g. because the client is dead and we haven't noticed yet and > are stuck writing, or because we want to implement a idle_in_transaction > timeout. > > Making these changes allowed me to see that not doing significant work > in signal handlers can make code much simpler and more robust. The > number of bugs postgres had in the past that involved doing too much in > signal handlers is significant. Thus I've since extended the > patchseries to get rid of nearly all nontrivial work in signal > handlers. It sounds pretty nice. > All the patches in the series up to 0008 hav ecommit messages providing > more detail. A short description of each patch follows: > > 0001: Replace walsender's latch with the general shared latch. > > New patch that removes ImmediateInteruptOK behaviour from walsender. I > think that's a rather good idea, because walsender currently seems to > assume WaitLatchOrSocket is reentrant - which I don't think is really > guaranteed. > Hasn't been reviewed yet, but I think it's not far from being > committable. Deesn't this patchset containing per-socket basis non-blocking control for win32? It should make the code (above the win32 socket layer itself) more simpler. > 0002: Use a nonblocking socket for FE/BE communication and block using > latches. > > Has previously been reviewed by Heikki. I think Noah also had a > look, although I'm not sure how close that was. > > I think this can be committed soon. > > 0003: Introduce and use infrastructure for interrupt processing during client reads. > > From here on ImmediateInterruptOK isn't set during client > communication. Normal interrupts and sinval/async interrupts are > processed outside of signal handlers. Especially the sinval/async > greatly simplify the respective code. > > Has been reviewed by Heikki in an earlier version - I think I > addressed all the review comments. > > I'd like somebody else to look at the sinval/async.c changes > before committing. I really like the simplification this change > allowed. > > I think this patch might not be safe without 0004 because I can't > convince myself that it's safe to interrupt latch waits with work > that actually might also use the same latch (sinval > interrupts). But it's easier to review this way as 0004 is quite > big. > > 0004: Process 'die' interrupts while reading/writing from the client socket. > > This is the reason Horiguchi-san started this thread. > > I think the important debate here is whether we think it's > acceptable that there are situations (a full socket buffer, but a > alive connection) where the client previously got an error, but > not anymore afterwards. I think that's much better than having > unkillable connections, but I can see others being of a different > opinion. This patch yields a code a bit confusion like following. | secure_raw_write(Port *port, const void *ptr, size_t len) | { .. | w = WaitLatchOrSocket(MyLatch, | if (w & WL_LATCH_SET) ... | errno = EINTR; | else if (w & WL_SOCKET_WRITEABLE) | goto wloop; | | errno = save_errno; The errno set when WL_LATCH_SET always vanishes. Specifically, the EINTR set by SIGTERM(WL_LATCH_SET) is overwritten by EAGAIN. As the result, pg_terminte_backend() cannot kill the backend till the write blocking is released. errno = save_errno should be the alternative of the line "errno = EINTR" and I confirmed that the change leads to the desirable (as of me) behavior. > 0005: Move deadlock and other interrupt handling in proc.c out of signal handlers. > > I'm surprised how comparatively simple this turned out to be. For > robustness and understanding I think this is a great improvement. > > New and not reviewed at all. Needs significant review. But works > in my simple testcases. > > 0006: Don't allow immediate interupts during authentication anymore. > > So far we've set ImmediateInterruptOK to true during large parts > of the client authentication - that's not all that pretty, > interrupts might arrive while we're in some system routines. > > Due to patches 0003/0004 we now are able to safely serve > interrupts during client communication which is the major are > where we want to adhere to authentication_timeout. > > I additionally placed some CHECK_FOR_INTERRUPTS()s in some > somewhat randomly chosen places in auth.c. Those don't completely > get back the previous 'appruptness' (?) of reacting to > interrupts, but that's partially for the better, because we don't > interrupt foreign code anymore. Simplly as a comment on style, this patch introduces checks of ClientAuthInProgress twice successively into ProcessInterrupts(). Isn't it better to make it like following? | /* As in quickdie, ... | if (ClientAuthInProgress) | { | if (whereToSendOutput == DestRemote) whereToSendOutput = DestNone; | ereport(FATAL, > 0007: Remove the option to service interrupts during PGSemaphoreLock(). > > Due to patch 0005 there's no user of PGSemaphoreLock(lock, interruptOK = true) > > anymore, so remove the code to handle that. I find it rather odd > that the code did a CHECK_FOR_INTERRUPTS before the semwait even > when interruptOK was set to to false - that only happened to work > because lwlock.c did a HOLD_INTERRUPTS() around the semaphore > code. It's now removed entirely. > > This is a API break because the signature of PGSemaphoreLock() > changed. But I have a hard time seing that as a problem. For one I > don't think there's many users of PGSemaphoreLock, for another > they should change their interrupt handling. > > New and not reviewed. > > 0008: Remove remnants of ImmediateInterruptOK handling. > > Now that ImmediateInterruptOK is never set to true anymore, we can > remove related code and comments. > > New and not reviewed. walreceiver.c still has WalRcvImmediateInterruptOK as mentioned below, apart from whether it should be changed or not, the following comment remains. | * This is very much like what regular backends do with ImmediateInterruptOK, | * ProcessInterrupts() etc. > 0009: WIP: lwlock: use latches > > Optional patch that removes the use of semaphores from the lwlock > code. There's no benefit for lwlock.c itself due to this. But it > does get rid of the last user of semaphores in a normal postgres > build. I primarily wrote this to test the performance of latches. > > I guess we want to do this anyway. > > 0010: WIP: unix_latch: WIP efficiency hackery > > Nothing ready, although I think using eventfds on linux is worth > the cost. > > > Questions I have about the series right now: > > 1) Do others agree that getting rid of ImmediateInterruptOK is > worthwile. I personally think that we really should pursue that > aggressively as a goal. Just as my two cents, I perfectly agree with you. The code after it dissapears looks quite cleaner and the less number of states makes it more understandable. > 2) Does anybody see a significant problem with the reduction of cases > where we immediately react to authentication_timeout? Since we still > react immediately when we're waiting for the client (except maybe in > sspi/kerberos?) I think that's ok. The waiting cases are slow > ident/radius/kerberos/ldap/... servers. But we really shouldn't jump > out of the relevant code anyway. I don't have definite answer for it, but from the point that stucking or congesting during authentication can raise the number of immature backend process unboundedly, it might be more preferable if we can administratively or automatically kill them for the case. > 3) If we do everything (in corrected), upto 0009, there's no remaining > user of semaphores in postgres, except the spinlock emulation. > > Does anybody see a need for PGPROC->sem to remain? Removing it would > have the significant benefit that semaphores are the last thing users > frequently need to tune to get postgres to start up when using a > higher max_connection or multiple clusters. The less tuning point, the more usability, I suppose, as long as no loss of function. > If we remove PGPROC->sem does anybody see a way to get rid of the > semaphore code alltogether? I personally still think we should just > gank --disable-spinlocks. Now that it's only a couple lines (in an > preexisting patch) to rely on compiler intrinsics for spinlocks that > doesn't sound like a big deal. Even if not, we could decide to get > rid of win32_sem.c... > > 4) There remains one user of something like ImmediateInterruptOK - > walreceiver. It has WalRcvImmediateInterruptOK. We definitely should > get rid of that as well, but that's a independent patch that can be > written in the future. > > 5) In a future patch I think we could also handle catchup interrupts > during other client reads, not just ReadCommand(). Does anybody see > that as a worthwile goal? I can't remember many problems with not > enough catchup happening, but I think someone mentioned that as a > problem in the past. > > 6) Review ;) > > Comments? > > Greetings, > > Andres Freund regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, On 2015-01-15 15:05:08 +0900, Kyotaro HORIGUCHI wrote: > Hello, I'd synced up this at last. > > I think I should finilize my commitfest item for this issue, with > .. "Rejected"? Fine with me. > > All the patches in the series up to 0008 hav ecommit messages providing > > more detail. A short description of each patch follows: > > > > 0001: Replace walsender's latch with the general shared latch. > > > > New patch that removes ImmediateInteruptOK behaviour from walsender. I > > think that's a rather good idea, because walsender currently seems to > > assume WaitLatchOrSocket is reentrant - which I don't think is really > > guaranteed. > > Hasn't been reviewed yet, but I think it's not far from being > > committable. > > Deesn't this patchset containing per-socket basis non-blocking > control for win32? It should make the code (above the win32 > socket layer itself) more simpler. I don't think so - we still rely on it unfortunately. > > 0004: Process 'die' interrupts while reading/writing from the client socket. > > > > This is the reason Horiguchi-san started this thread. > > > > I think the important debate here is whether we think it's > > acceptable that there are situations (a full socket buffer, but a > > alive connection) where the client previously got an error, but > > not anymore afterwards. I think that's much better than having > > unkillable connections, but I can see others being of a different > > opinion. > > > This patch yields a code a bit confusion like following. > > | secure_raw_write(Port *port, const void *ptr, size_t len) > | { > .. > | w = WaitLatchOrSocket(MyLatch, > | if (w & WL_LATCH_SET) > ... > | errno = EINTR; > | else if (w & WL_SOCKET_WRITEABLE) > | goto wloop; > | > | errno = save_errno; > > The errno set when WL_LATCH_SET always vanishes. Specifically, > the EINTR set by SIGTERM(WL_LATCH_SET) is overwritten by > EAGAIN. As the result, pg_terminte_backend() cannot kill the > backend till the write blocking is released. errno = save_errno > should be the alternative of the line "errno = EINTR" and I > confirmed that the change leads to the desirable (as of me) > behavior. Ugh, that's the result stupid last minute "cleanup". You're right. > > 0006: Don't allow immediate interupts during authentication anymore. > > > > So far we've set ImmediateInterruptOK to true during large parts > > of the client authentication - that's not all that pretty, > > interrupts might arrive while we're in some system routines. > > > > Due to patches 0003/0004 we now are able to safely serve > > interrupts during client communication which is the major are > > where we want to adhere to authentication_timeout. > > > > I additionally placed some CHECK_FOR_INTERRUPTS()s in some > > somewhat randomly chosen places in auth.c. Those don't completely > > get back the previous 'appruptness' (?) of reacting to > > interrupts, but that's partially for the better, because we don't > > interrupt foreign code anymore. > > Simplly as a comment on style, this patch introduces checks of > ClientAuthInProgress twice successively into > ProcessInterrupts(). Isn't it better to make it like following? > > | /* As in quickdie, ... > | if (ClientAuthInProgress) > | { > | if (whereToSendOutput == DestRemote) whereToSendOutput = DestNone; > | ereport(FATAL, Hm, yes. > > 0008: Remove remnants of ImmediateInterruptOK handling. > > > > Now that ImmediateInterruptOK is never set to true anymore, we can > > remove related code and comments. > > > > New and not reviewed. > > walreceiver.c still has WalRcvImmediateInterruptOK as mentioned > below, apart from whether it should be changed or not, the > following comment remains. > | * This is very much like what regular backends do with ImmediateInterruptOK, > | * ProcessInterrupts() etc. Yep. As mentioned below, it doesn't use the same infrastructure, so I'd rather treat this separately. This set is more than big enough. Thanks for looking! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hello, > > I think I should finilize my commitfest item for this issue, with > > .. "Rejected"? > > Fine with me. done. > > > 0001: Replace walsender's latch with the general shared latch. > > > > > > New patch that removes ImmediateInteruptOK behaviour from walsender. I > > > think that's a rather good idea, because walsender currently seems to > > > assume WaitLatchOrSocket is reentrant - which I don't think is really > > > guaranteed. > > > Hasn't been reviewed yet, but I think it's not far from being > > > committable. > > > > Deesn't this patchset containing per-socket basis non-blocking > > control for win32? It should make the code (above the win32 > > socket layer itself) more simpler. > > I don't think so - we still rely on it unfortunately. Does "it" mean win32_noblock? Or the nonblocking bare win32 socket? The win32-per-sock-blkng-cntl patch in the below message should cover both of them. http://www.postgresql.org/message-id/54060AE5.5020502@vmware.com If you are saying it should be a patch separate from this, I'll do so. regareds, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Robert Haas
Date:
On Wed, Jan 14, 2015 at 9:03 PM, Andres Freund <andres@2ndquadrant.com> wrote: > 0002: Use a nonblocking socket for FE/BE communication and block using > latches. > > Has previously been reviewed by Heikki. I think Noah also had a > look, although I'm not sure how close that was. > > I think this can be committed soon. Doesn't this significantly increase the number of system calls? I worry there could be a performance issue here. > 0003: Introduce and use infrastructure for interrupt processing during client reads. > > From here on ImmediateInterruptOK isn't set during client > communication. Normal interrupts and sinval/async interrupts are > processed outside of signal handlers. Especially the sinval/async > greatly simplify the respective code. ProcessNotifyInterrupt() seems like it could lead to a failure to respond to other interrupts if there is a sufficiently vigorous stream of notify interrupts. I think there ought to be one loop that goes through and tries to handle each kind of interrupt in turn and then loops until no interrupts remain. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
On 2015-01-30 09:29:59 -0500, Robert Haas wrote: > On Wed, Jan 14, 2015 at 9:03 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > 0002: Use a nonblocking socket for FE/BE communication and block using > > latches. > > > > Has previously been reviewed by Heikki. I think Noah also had a > > look, although I'm not sure how close that was. > > > > I think this can be committed soon. > > Doesn't this significantly increase the number of system calls? I > worry there could be a performance issue here. I've posted benchmarks upthread and I only could start to measure any overhead in pretty absurd cases (i.e. several hundred connections on a few core machine, all doing SELECT 1;statements). As we try the read before the poll/select it's not that bad - there's no superflous work done if we're actually busy. > > 0003: Introduce and use infrastructure for interrupt processing during client reads. > > > > From here on ImmediateInterruptOK isn't set during client > > communication. Normal interrupts and sinval/async interrupts are > > processed outside of signal handlers. Especially the sinval/async > > greatly simplify the respective code. > > ProcessNotifyInterrupt() seems like it could lead to a failure to > respond to other interrupts if there is a sufficiently vigorous stream > of notify interrupts. That's nothing new though. It just used to be executed inside interrupts directly, with looping. And we looped when enabling the notify interrupts. Since I can't recall a report of this being problematic I'm not that inclined to change even more than the patch already does. Given that queuing notifies requires a lock I have a hard time seing this ever fast enough to cause that problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Heikki Linnakangas
Date:
On 01/15/2015 03:03 AM, Andres Freund wrote: > 0004: Process 'die' interrupts while reading/writing from the client socket. > > This is the reason Horiguchi-san started this thread. > + ProcessClientWriteInterrupt(!port->noblock); ... > +/* > + * ProcessClientWriteInterrupt() - Process interrupts specific to client writes > + * > + * This is called just after low-level writes. That might be after the read > + * finished successfully, or it was interrupted via interrupt. 'blocked' tells > + * us whether the > + * > + * Must preserve errno! > + */ > +void > +ProcessClientWriteInterrupt(bool blocked) You're passing port->noblock as argument, but I thought the argument is supposed to mean whether the write would've blocked, i.e. if the write buffer was full. port->noblock doesn't mean that. But perhaps I misunderstood this - the comment on the 'blocked' argument above is a bit incomplete ;-). - Heikki
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Heikki Linnakangas
Date:
On 01/15/2015 03:03 AM, Andres Freund wrote: > 0002: Use a nonblocking socket for FE/BE communication and block using > latches. s/suceeds/succeeds/ > 0004: Process 'die' interrupts while reading/writing from the client socket. > + * Check for interrupts here, in addition to secure_write(), > + * because a interrupted write in secure_raw_write() can possibly > + * only return to here, not to secure_write(). > */ > + ProcessClientWriteInterrupt(true); Took me a while to parse that sentence. How about: Check for interrupts here, in addition to secure_write(), because an interrupted write in secure_raw_write() will return here, and we cannot return to secure_write() until we've written something. > 0005: Move deadlock and other interrupt handling in proc.c out of signal handlers. > > I'm surprised how comparatively simple this turned out to be. For > robustness and understanding I think this is a great improvement. > > New and not reviewed at all. Needs significant review. But works > in my simple testcases. > @@ -799,6 +791,7 @@ ProcKill(int code, Datum arg) > proc = MyProc; > MyProc = NULL; > DisownLatch(&proc->procLatch); > + SetLatch(MyLatch); > > SpinLockAcquire(ProcStructLock); > > @@ -868,6 +861,7 @@ AuxiliaryProcKill(int code, Datum arg) > proc = MyProc; > MyProc = NULL; > DisownLatch(&proc->procLatch); > + SetLatch(MyLatch); > > SpinLockAcquire(ProcStructLock); These SetLatch calls are unnecessary. SwitchBackToLocalLatch() already sets the latch. (According to the comment in SwitchToSharedLatch() even that should not be necessary.) > 0006: Don't allow immediate interupts during authentication anymore. In commit message: s/interupts/interrupts/. > @@ -80,9 +73,6 @@ md5_crypt_verify(const Port *port, const char *role, char *client_pass, > if (*shadow_pass == '\0') > return STATUS_ERROR; /* empty password */ > > - /* Re-enable immediate response to SIGTERM/SIGINT/timeout interrupts */ > - ImmediateInterruptOK = true; > - /* And don't forget to detect one that already arrived */ > CHECK_FOR_INTERRUPTS(); > > /* That lone CHECK_FOR_INTERRUPTS() that remains looks pretty randomly placed now that the "ImmediateInterruptOK = true" is gone. I would just remove it. Add one to the caller if it's needed, but it probably isn't. > 0007: Remove the option to service interrupts during PGSemaphoreLock(). s/don't/doesn't/ in commit message. > Questions I have about the series right now: > > 1) Do others agree that getting rid of ImmediateInterruptOK is > worthwile. I personally think that we really should pursue that > aggressively as a goal. Yes. > 2) Does anybody see a significant problem with the reduction of cases > where we immediately react to authentication_timeout? Since we still > react immediately when we're waiting for the client (except maybe in > sspi/kerberos?) I think that's ok. The waiting cases are slow > ident/radius/kerberos/ldap/... servers. But we really shouldn't jump > out of the relevant code anyway. Yeah, I think that's OK. Doing those things with ImmediateInterruptOK=true was quite scary, and we need to stop doing that. It would be nice to have a wholesale solution for those lookups but I don't see one. So all the ident/radius/kerberos/ldap lookups will need to be done in a non-blocking way to have them react to the timeout. That requires some work, but we can leave that to a followup patch, if someone wants to write it. Overall, 1-8 look good to me, except for that one incomplete comment in ProcessClientWriteInterrupt() I mentioned in a separate email. I haven't reviewed patches 9 and 10 yet. - Heikki
Re: Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
On 2015-01-30 18:59:28 +0100, Heikki Linnakangas wrote: > On 01/15/2015 03:03 AM, Andres Freund wrote: > >0004: Process 'die' interrupts while reading/writing from the client socket. > > > > This is the reason Horiguchi-san started this thread. > > >+ ProcessClientWriteInterrupt(!port->noblock); > ... > >+/* > >+ * ProcessClientWriteInterrupt() - Process interrupts specific to client writes > >+ * > >+ * This is called just after low-level writes. That might be after the read > >+ * finished successfully, or it was interrupted via interrupt. 'blocked' tells > >+ * us whether the > >+ * > >+ * Must preserve errno! > >+ */ > >+void > >+ProcessClientWriteInterrupt(bool blocked) > > You're passing port->noblock as argument, but I thought the argument is > supposed to mean whether the write would've blocked, i.e. if the write > buffer was full. Right. > port->noblock doesn't mean that. But perhaps I misunderstood this - > the comment on the 'blocked' argument above is a bit incomplete ;-). The point here is that we tried the send() and then were interrupted while waiting for the buffer to become writable. That pretty much implies that the write buffer is full. The !port->noblock is because we only are blocked on write if we're doinga blocking send, otherwise we'll just return control to the caller... I agree that this could use some more comments ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
On 2015-01-31 00:52:03 +0100, Heikki Linnakangas wrote: > On 01/15/2015 03:03 AM, Andres Freund wrote: > >0004: Process 'die' interrupts while reading/writing from the client socket. > > >+ * Check for interrupts here, in addition to secure_write(), > >+ * because a interrupted write in secure_raw_write() can possibly > >+ * only return to here, not to secure_write(). > > */ > >+ ProcessClientWriteInterrupt(true); > > Took me a while to parse that sentence. How about: > > Check for interrupts here, in addition to secure_write(), because an > interrupted write in secure_raw_write() will return here, and we cannot > return to secure_write() until we've written something. Yep, that's better. > >0005: Move deadlock and other interrupt handling in proc.c out of signal handlers. > > > > I'm surprised how comparatively simple this turned out to be. For > > robustness and understanding I think this is a great improvement. > > > > New and not reviewed at all. Needs significant review. But works > > in my simple testcases. > > >@@ -799,6 +791,7 @@ ProcKill(int code, Datum arg) > > proc = MyProc; > > MyProc = NULL; > > DisownLatch(&proc->procLatch); > >+ SetLatch(MyLatch); > > > > SpinLockAcquire(ProcStructLock); > > > >@@ -868,6 +861,7 @@ AuxiliaryProcKill(int code, Datum arg) > > proc = MyProc; > > MyProc = NULL; > > DisownLatch(&proc->procLatch); > >+ SetLatch(MyLatch); > > > > SpinLockAcquire(ProcStructLock); > > These SetLatch calls are unnecessary. SwitchBackToLocalLatch() already sets > the latch. (According to the comment in SwitchToSharedLatch() even that > should not be necessary.) Ick, that's some debugging leftovers where I was trying to find a bug that was fundamentally caused by the missing barriers in the latch code... > >0006: Don't allow immediate interupts during authentication anymore. > > In commit message: s/interupts/interrupts/. > > >@@ -80,9 +73,6 @@ md5_crypt_verify(const Port *port, const char *role, char *client_pass, > > if (*shadow_pass == '\0') > > return STATUS_ERROR; /* empty password */ > > > >- /* Re-enable immediate response to SIGTERM/SIGINT/timeout interrupts */ > >- ImmediateInterruptOK = true; > >- /* And don't forget to detect one that already arrived */ > > CHECK_FOR_INTERRUPTS(); > > > > /* > > That lone CHECK_FOR_INTERRUPTS() that remains looks pretty randomly placed > now that the "ImmediateInterruptOK = true" is gone. I would just remove it. > Add one to the caller if it's needed, but it probably isn't. There's some method to the madness here actually - we just did some database stuff that could in theory take a while... Whether it's worthwhile to leave it here I'm not sure. > >2) Does anybody see a significant problem with the reduction of cases > > where we immediately react to authentication_timeout? Since we still > > react immediately when we're waiting for the client (except maybe in > > sspi/kerberos?) I think that's ok. The waiting cases are slow > > ident/radius/kerberos/ldap/... servers. But we really shouldn't jump > > out of the relevant code anyway. > > Yeah, I think that's OK. Doing those things with ImmediateInterruptOK=true > was quite scary, and we need to stop doing that. It would be nice to have a > wholesale solution for those lookups but I don't see one. So all the > ident/radius/kerberos/ldap lookups will need to be done in a non-blocking > way to have them react to the timeout. That requires some work, but we can > leave that to a followup patch, if someone wants to write it. Ok, cool. > Overall, 1-8 look good to me, except for that one incomplete comment in > ProcessClientWriteInterrupt() I mentioned in a separate email. Thanks for the review! I plan to push the fixed up versions sometime next week. > I haven't reviewed patches 9 and 10 yet. Yea, those aren't as close to being ready (and thus marked WIP). I'd like to do the lwlock stuff for 9.5 because then any sane setup (i.e. unless spinlocks are emulated) doesn't need any semaphores anymore, which makes setup easier. Right now users need to adjust system settings when changing max_connctions upwards or run multiple clusters. But it's not really related to getting rid of ImmediateInterruptOK. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Noah Misch
Date:
On Thu, Jan 15, 2015 at 03:03:35AM +0100, Andres Freund wrote: > 0002: Use a nonblocking socket for FE/BE communication and block using > latches. > > Has previously been reviewed by Heikki. I think Noah also had a > look, although I'm not sure how close that was. Negligible. I did little more than scroll past it while reviewing the patch that became commit 4bad60e.
Re: Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
Hi, Here's the last, rebased (took a while...), version of this patchset. I've fixed the things that Heikki mentioned (except the one "stray" CFI, which imo maskes sense). Besides a fair number of cosmetic changes there are two somewhat important ones: * I previously had removed the win32 waitforsinglesocket calls in the openssl code - they're now just always replaced with latch waits. The windows case makes actually much more sense, as we previously just busylooped in the !win32 case. * Previously the patchset didn't handle SIGTERM while InteractiveBackend() was reading from the client. It did handle ctrl-c/d, but since getc() isn't interruptible and can't be replaced with latches... The fix for that isn't super pretty: die(): if (DoingCommandRead && whereToSendOutput != DestRemote) ProcessInterrupts(); but imo acceptable for single user mode. Unless someone announces the intent do review them some more, I plan to push the attached patches fairly soon. I'm not claiming at all they're bug free, but I think at this stage it's better to get them in. I plan to pursue the remaining patches (latch optimizations, lwlock using latches, possibly removing PGPROC.sem) afterwards. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Use-a-nonblocking-socket-for-FE-BE-communication-and.patch
- 0002-Introduce-and-use-infrastructure-for-interrupt-proce.patch
- 0003-Process-die-interrupts-while-reading-writing-from-th.patch
- 0004-Don-t-allow-immediate-interrupts-during-authenticati.patch
- 0005-Move-deadlock-and-other-interrupt-handling-in-proc.c.patch
- 0006-Remove-the-option-to-service-interrupts-during-PGSem.patch
- 0007-Remove-remnants-of-ImmediateInterruptOK-handling.patch
Re: Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Heikki Linnakangas
Date:
On 02/03/2015 08:54 PM, Andres Freund wrote: > * Previously the patchset didn't handle SIGTERM while > InteractiveBackend() was reading from the client. It did handle > ctrl-c/d, but since getc() isn't interruptible and can't be replaced > with latches... The fix for that isn't super pretty: > die(): > if (DoingCommandRead && whereToSendOutput != DestRemote) > ProcessInterrupts(); > but imo acceptable for single user mode. That smells an awful lot like ImmediateInterruptOK. Could you use WaitLatchOrSocket to sleep on stdin? Probably needs to be renamed or copied to WaitLatchOrStdin(), or a new flag to be added to wait on stdin instead of a socket, at least on Windows, but in theory. Might not be worth it if it gets complicated - I can live with the above - but it's a thought. - Heikki
Re: Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
From
Andres Freund
Date:
On 2015-02-03 22:17:22 +0200, Heikki Linnakangas wrote: > On 02/03/2015 08:54 PM, Andres Freund wrote: > >* Previously the patchset didn't handle SIGTERM while > > InteractiveBackend() was reading from the client. It did handle > > ctrl-c/d, but since getc() isn't interruptible and can't be replaced > > with latches... The fix for that isn't super pretty: > > die(): > > if (DoingCommandRead && whereToSendOutput != DestRemote) > > ProcessInterrupts(); > > but imo acceptable for single user mode. > > That smells an awful lot like ImmediateInterruptOK. It awfully does, yes. But we're only exiting, and only in single user mode. And only during command read. We don't need all the other complicated stuff around sinval, async, deadlock checking, etc. > Could you use WaitLatchOrSocket to sleep on stdin? Unfortunately I don't think so. poll() etc only works properly on network handles, pipes etc - but stdin can be a file :(. And I think what exactly happens if it's a file fd isn't super well defined. On linux the file is always marked as ready, which would probably actually work... I think this is better solved in the long run to get rid of single user mode, akin to the patch Tom had a year or two back. Adding more complications for it doesn't seem worth it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services