Thread: Back-branch update releases coming in a couple weeks
Since we've fixed a couple of relatively nasty bugs recently, the core committee has determined that it'd be a good idea to push out PG update releases soon. The current plan is to wrap on Monday Feb 4 for public announcement Thursday Feb 7. If you're aware of any bug fixes you think ought to get included, now's the time to get them done ... regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Since we've fixed a couple of relatively nasty bugs recently, the core > committee has determined that it'd be a good idea to push out PG update > releases soon. The current plan is to wrap on Monday Feb 4 for public > announcement Thursday Feb 7. If you're aware of any bug fixes you think > ought to get included, now's the time to get them done ... Thanks for that. Should we consider including Simon's fix for an issue which Noah noted in this thread?: http://www.postgresql.org/message-id/CA+U5nMKBrqFxyohr=JSDpgxZ6y0nfAdmt=K3hK4Zzfqo1MHSJg@mail.gmail.com Admittedly, it's been this way for 3+ years, but there's concurrence on that thread that it's a bug, so I figured I'd mention it. Thanks, Stephen
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Since we've fixed a couple of relatively nasty bugs recently, the core >> committee has determined that it'd be a good idea to push out PG update >> releases soon. The current plan is to wrap on Monday Feb 4 for public >> announcement Thursday Feb 7. If you're aware of any bug fixes you think >> ought to get included, now's the time to get them done ... > Should we consider including Simon's fix for an issue which Noah noted > in this thread?: > http://www.postgresql.org/message-id/CA+U5nMKBrqFxyohr=JSDpgxZ6y0nfAdmt=K3hK4Zzfqo1MHSJg@mail.gmail.com It sounds like there's something to fix there, but AFAICS the thread is still arguing about the best fix. There's time to do it non-hastily. regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Since we've fixed a couple of relatively nasty bugs recently, the core > committee has determined that it'd be a good idea to push out PG update > releases soon. The current plan is to wrap on Monday Feb 4 for public > announcement Thursday Feb 7. If you're aware of any bug fixes you think > ought to get included, now's the time to get them done ... I've just encountered a serious problem, and I really wish it would be fixed in the upcoming minor release. Could you tell me whether this is already fixed and will be included? I'm using synchronous streaming replication with PostgreSQL 9.1.6 on Linux. There are two nodes: one is primary and the other is a standby. When I performed failover tests by doing "pg_ctl stop -mi" against the primary while some applications are reading/writing the database, the standby crashed while it was performing recovery after receiving promote request: ... LOG: archive recovery complete WARNING: page 506747 of relation base/482272/482304 was uninitialized PANIC: WAL contains references to invalid pages LOG: startup process (PID 8938) was terminated by signal 6: Aborted LOG: terminating any other active server processes (the log ends here) The mentioned relation is an index. The contents of the referred page was all zeros when I looked at it with "od -t x $PGDATA/base/482272/482304.3" after the crash. The subsequent pages of the same file had valid-seeming contents. I searched through PostgreSQL mailing lists with "WAL contains references to invalid pages", and i found 19 messages. Some people encountered similar problem. There were some discussions regarding those problems (Tom and Simon Riggs commented), but those discussions did not reach a solution. I also found a discussion which might relate to this problem. Does this fix the problem? [BUG] lag of minRecoveryPont in archive recovery http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp Regards MauMau
On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> > >> Since we've fixed a couple of relatively nasty bugs recently, the core >> committee has determined that it'd be a good idea to push out PG update >> releases soon. The current plan is to wrap on Monday Feb 4 for public >> announcement Thursday Feb 7. If you're aware of any bug fixes you think >> ought to get included, now's the time to get them done ... > > > I've just encountered a serious problem, and I really wish it would be fixed > in the upcoming minor release. Could you tell me whether this is already > fixed and will be included? > > I'm using synchronous streaming replication with PostgreSQL 9.1.6 on Linux. > There are two nodes: one is primary and the other is a standby. When I > performed failover tests by doing "pg_ctl stop -mi" against the primary > while some applications are reading/writing the database, the standby > crashed while it was performing recovery after receiving promote request: > > ... > LOG: archive recovery complete > WARNING: page 506747 of relation base/482272/482304 was uninitialized > PANIC: WAL contains references to invalid pages > LOG: startup process (PID 8938) was terminated by signal 6: Aborted > LOG: terminating any other active server processes > (the log ends here) > > The mentioned relation is an index. The contents of the referred page was > all zeros when I looked at it with "od -t x $PGDATA/base/482272/482304.3" > after the crash. The subsequent pages of the same file had valid-seeming > contents. > > I searched through PostgreSQL mailing lists with "WAL contains references to > invalid pages", and i found 19 messages. Some people encountered similar > problem. There were some discussions regarding those problems (Tom and > Simon Riggs commented), but those discussions did not reach a solution. > > I also found a discussion which might relate to this problem. Does this fix > the problem? > > [BUG] lag of minRecoveryPont in archive recovery > http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp Yes. Could you check whether you can reproduce the problem on the latest REL9_2_STABLE? Regards, -- Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com> > On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote: >> I searched through PostgreSQL mailing lists with "WAL contains references >> to >> invalid pages", and i found 19 messages. Some people encountered similar >> problem. There were some discussions regarding those problems (Tom and >> Simon Riggs commented), but those discussions did not reach a solution. >> >> I also found a discussion which might relate to this problem. Does this >> fix >> the problem? >> >> [BUG] lag of minRecoveryPont in archive recovery >> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp > > Yes. Could you check whether you can reproduce the problem on the > latest REL9_2_STABLE? I tried to produce the problem by doing "pg_ctl stop -mi" against the primary more than ten times on REL9_2_STABLE, but the problem did not appear. However, I encountered the crash only once out of dozens of failovers, possibly more than a hundred times, on PostgreSQL 9.1.6. So, I'm not sure the problem is fixed in REL9_2_STABLE. I'm wondering if the fix discussed in the above thread solves my problem. I found the following differences between Horiguchi-san's case and my case: (1) Horiguchi-san says the bug outputs the message: WARNING: page 0 of relation base/16384/16385 does not exist On the other hand, I got the message: WARNING: page 506747 of relation base/482272/482304 was uninitialized (2) Horiguchi-san produced the problem when he shut the standby immediately and restarted it. However, I saw the problem during failover. (3) Horiguchi-san did not use any index, but in my case the WARNING message refers to an index. But there's a similar point. Horiguchi-san says the problem occurs after DELETE+VACUUM. In my case, I shut the primary down while the application was doing INSERT/UPDATE. As the below messages show, some vacuuming was running just before the immediate shutdown: ... LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0pages: 0 removed, 36743 remaintuples: 0 removed, 73764 remainsystemusage: CPU 0.09s/0.11u sec elapsed 0.66 sec LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU 0.00s/0.14u sec elapsed 0.32 sec LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0pages: 0 removed, 12101 remaintuples: 40657 removed, 44142remain system usage: CPU 0.06s/0.06u sec elapsed 0.30 sec LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU 0.00s/0.06u sec elapsed 0.14 sec LOG: received immediate shutdown request ... Could you tell me the details of the problem discussed and fixed in the upcoming minor release? I would to like to know the phenomenon and its conditions, and whether it applies to my case. Regards MauMau
On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote: > From: "Fujii Masao" <masao.fujii@gmail.com> >> >> On Thu, Jan 24, 2013 at 7:42 AM, MauMau <maumau307@gmail.com> wrote: >>> >>> I searched through PostgreSQL mailing lists with "WAL contains references >>> to >>> invalid pages", and i found 19 messages. Some people encountered similar >>> problem. There were some discussions regarding those problems (Tom and >>> Simon Riggs commented), but those discussions did not reach a solution. >>> >>> I also found a discussion which might relate to this problem. Does this >>> fix >>> the problem? >>> >>> [BUG] lag of minRecoveryPont in archive recovery >>> >>> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp >> >> >> Yes. Could you check whether you can reproduce the problem on the >> latest REL9_2_STABLE? > > > I tried to produce the problem by doing "pg_ctl stop -mi" against the > primary more than ten times on REL9_2_STABLE, but the problem did not > appear. However, I encountered the crash only once out of dozens of > failovers, possibly more than a hundred times, on PostgreSQL 9.1.6. So, I'm > not sure the problem is fixed in REL9_2_STABLE. You can reproduce the problem in REL9_1_STABLE? Sorry, I pointed wrong version, i.e., REL9_2_STABLE. Since you encountered the problem in 9.1.6, you need to retry the test in REL9_1_STABLE. > > I'm wondering if the fix discussed in the above thread solves my problem. I > found the following differences between Horiguchi-san's case and my case: > > (1) > Horiguchi-san says the bug outputs the message: > > WARNING: page 0 of relation base/16384/16385 does not exist > > On the other hand, I got the message: > > > WARNING: page 506747 of relation base/482272/482304 was uninitialized > > > (2) > Horiguchi-san produced the problem when he shut the standby immediately and > restarted it. However, I saw the problem during failover. > > > (3) > Horiguchi-san did not use any index, but in my case the WARNING message > refers to an index. > > > But there's a similar point. Horiguchi-san says the problem occurs after > DELETE+VACUUM. In my case, I shut the primary down while the application > was doing INSERT/UPDATE. As the below messages show, some vacuuming was > running just before the immediate shutdown: > > ... > LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0 > pages: 0 removed, 36743 remain > tuples: 0 removed, 73764 remain > system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec > LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU > 0.00s/0.14u sec elapsed 0.32 sec > LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0 > pages: 0 removed, 12101 remain > tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec > elapsed 0.30 sec > LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU > 0.00s/0.06u sec elapsed 0.14 sec > LOG: received immediate shutdown request > ... > > > Could you tell me the details of the problem discussed and fixed in the > upcoming minor release? I would to like to know the phenomenon and its > conditions, and whether it applies to my case. http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp Could you read the discussion in the above thread? Regards, -- Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com> > On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote: >> I'm wondering if the fix discussed in the above thread solves my problem. >> I >> found the following differences between Horiguchi-san's case and my case: >> >> (1) >> Horiguchi-san says the bug outputs the message: >> >> WARNING: page 0 of relation base/16384/16385 does not exist >> >> On the other hand, I got the message: >> >> >> WARNING: page 506747 of relation base/482272/482304 was uninitialized >> >> >> (2) >> Horiguchi-san produced the problem when he shut the standby immediately >> and >> restarted it. However, I saw the problem during failover. >> >> >> (3) >> Horiguchi-san did not use any index, but in my case the WARNING message >> refers to an index. >> >> >> But there's a similar point. Horiguchi-san says the problem occurs after >> DELETE+VACUUM. In my case, I shut the primary down while the application >> was doing INSERT/UPDATE. As the below messages show, some vacuuming was >> running just before the immediate shutdown: >> >> ... >> LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0 >> pages: 0 removed, 36743 remain >> tuples: 0 removed, 73764 remain >> system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec >> LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU >> 0.00s/0.14u sec elapsed 0.32 sec >> LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0 >> pages: 0 removed, 12101 remain >> tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec >> elapsed 0.30 sec >> LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU >> 0.00s/0.06u sec elapsed 0.14 sec >> LOG: received immediate shutdown request >> ... >> >> >> Could you tell me the details of the problem discussed and fixed in the >> upcoming minor release? I would to like to know the phenomenon and its >> conditions, and whether it applies to my case. > > http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp > > Could you read the discussion in the above thread? Yes, I've just read the discussion (it was difficult for me...) Although you said the fix will solve my problem, I don't feel it will. The discussion is about the crash when the standby "re"starts after the primary vacuums and truncates a table. On the other hand, in my case, the standby crashed during failover (not at restart), emitting a message that some WAL record refers to an "uninitialized" page (not a non-existent page) of an "index" (not a table). In addition, fujii_test.sh did not reproduce the mentioned crash on PostgreSQL 9.1.6. I'm sorry to cause you trouble, but could you elaborate on how the fix relates to my case? Regards MauMau
On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote: > From: "Fujii Masao" <masao.fujii@gmail.com> >> >> On Thu, Jan 24, 2013 at 11:53 PM, MauMau <maumau307@gmail.com> wrote: >>> >>> I'm wondering if the fix discussed in the above thread solves my problem. >>> I >>> found the following differences between Horiguchi-san's case and my case: >>> >>> (1) >>> Horiguchi-san says the bug outputs the message: >>> >>> WARNING: page 0 of relation base/16384/16385 does not exist >>> >>> On the other hand, I got the message: >>> >>> >>> WARNING: page 506747 of relation base/482272/482304 was uninitialized >>> >>> >>> (2) >>> Horiguchi-san produced the problem when he shut the standby immediately >>> and >>> restarted it. However, I saw the problem during failover. >>> >>> >>> (3) >>> Horiguchi-san did not use any index, but in my case the WARNING message >>> refers to an index. >>> >>> >>> But there's a similar point. Horiguchi-san says the problem occurs after >>> DELETE+VACUUM. In my case, I shut the primary down while the application >>> was doing INSERT/UPDATE. As the below messages show, some vacuuming was >>> running just before the immediate shutdown: >>> >>> ... >>> LOG: automatic vacuum of table "GOLD.scm1.tbl1": index scans: 0 >>> pages: 0 removed, 36743 remain >>> tuples: 0 removed, 73764 remain >>> system usage: CPU 0.09s/0.11u sec elapsed 0.66 sec >>> LOG: automatic analyze of table "GOLD.scm1.tbl1" system usage: CPU >>> 0.00s/0.14u sec elapsed 0.32 sec >>> LOG: automatic vacuum of table "GOLD.scm1.tbl2": index scans: 0 >>> pages: 0 removed, 12101 remain >>> tuples: 40657 removed, 44142 remain system usage: CPU 0.06s/0.06u sec >>> elapsed 0.30 sec >>> LOG: automatic analyze of table "GOLD.scm1.tbl2" system usage: CPU >>> 0.00s/0.06u sec elapsed 0.14 sec >>> LOG: received immediate shutdown request >>> ... >>> >>> >>> Could you tell me the details of the problem discussed and fixed in the >>> upcoming minor release? I would to like to know the phenomenon and its >>> conditions, and whether it applies to my case. >> >> >> >> http://www.postgresql.org/message-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp >> >> Could you read the discussion in the above thread? > > > Yes, I've just read the discussion (it was difficult for me...) > > Although you said the fix will solve my problem, I don't feel it will. The > discussion is about the crash when the standby "re"starts after the primary > vacuums and truncates a table. On the other hand, in my case, the standby > crashed during failover (not at restart), emitting a message that some WAL > record refers to an "uninitialized" page (not a non-existent page) of an > "index" (not a table). > > In addition, fujii_test.sh did not reproduce the mentioned crash on > PostgreSQL 9.1.6. > > I'm sorry to cause you trouble, but could you elaborate on how the fix > relates to my case? Maybe I had not been understanding your problem correctly. Could you show the self-contained test case which reproduces the problem? Is the problem still reproducible in REL9_1_STABLE? Regards, -- Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com> > On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote: >> Although you said the fix will solve my problem, I don't feel it will. >> The >> discussion is about the crash when the standby "re"starts after the >> primary >> vacuums and truncates a table. On the other hand, in my case, the >> standby >> crashed during failover (not at restart), emitting a message that some >> WAL >> record refers to an "uninitialized" page (not a non-existent page) of an >> "index" (not a table). >> >> In addition, fujii_test.sh did not reproduce the mentioned crash on >> PostgreSQL 9.1.6. >> >> I'm sorry to cause you trouble, but could you elaborate on how the fix >> relates to my case? > > Maybe I had not been understanding your problem correctly. > Could you show the self-contained test case which reproduces the problem? > Is the problem still reproducible in REL9_1_STABLE? As I said before, it's very hard to reproduce the problem. All what I did is to repeat the following sequence: 1. run "pg_ctl stop -mi" against the primary while the applications were performing INSERT/UPDATE/SELECT. 2. run "pg_ctl promote" against the standby of synchronous streaming standby. 3. run pg_basebackup on the stopped (original) primary to create a new standby, and start the new standby. I did this failover test dozens of times, probably more than a hundred. And I encountered the crash only once. Although I saw the problem only once, the result is catastrophic. So, I really wish Heiki's patch (in cooperation with Horiguchi-san and you) could fix the issue. Do you think of anything? Regards MauMau
On Sun, Jan 27, 2013 at 11:38 PM, MauMau <maumau307@gmail.com> wrote: > From: "Fujii Masao" <masao.fujii@gmail.com> >> >> On Sun, Jan 27, 2013 at 12:17 AM, MauMau <maumau307@gmail.com> wrote: >>> >>> Although you said the fix will solve my problem, I don't feel it will. >>> The >>> discussion is about the crash when the standby "re"starts after the >>> primary >>> vacuums and truncates a table. On the other hand, in my case, the >>> standby >>> crashed during failover (not at restart), emitting a message that some >>> WAL >>> record refers to an "uninitialized" page (not a non-existent page) of an >>> "index" (not a table). >>> >>> In addition, fujii_test.sh did not reproduce the mentioned crash on >>> PostgreSQL 9.1.6. >>> >>> I'm sorry to cause you trouble, but could you elaborate on how the fix >>> relates to my case? >> >> >> Maybe I had not been understanding your problem correctly. >> Could you show the self-contained test case which reproduces the problem? >> Is the problem still reproducible in REL9_1_STABLE? > > > As I said before, it's very hard to reproduce the problem. All what I did > is to repeat the following sequence: > > 1. run "pg_ctl stop -mi" against the primary while the applications were > performing INSERT/UPDATE/SELECT. > 2. run "pg_ctl promote" against the standby of synchronous streaming > standby. > 3. run pg_basebackup on the stopped (original) primary to create a new > standby, and start the new standby. > > I did this failover test dozens of times, probably more than a hundred. And > I encountered the crash only once. > > Although I saw the problem only once, the result is catastrophic. So, I > really wish Heiki's patch (in cooperation with Horiguchi-san and you) could > fix the issue. > > Do you think of anything? Umm... it's hard to tell whether your problem has been fixed in the latest 9.1, from that information. The bug fix which you mentioned consists of two patches. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7bffc9b7bf9e09ddeddc65117e49829f758e500d http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=970fb12de121941939e777764d6e0446c974bba3 The former seems not to be related to your problem because the problem that patch fixed could basically happen only when restarting the standby. The latter might be related to your problem.... Regards, -- Fujii Masao
backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: "MauMau"
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Since we've fixed a couple of relatively nasty bugs recently, the core > committee has determined that it'd be a good idea to push out PG update > releases soon. The current plan is to wrap on Monday Feb 4 for public > announcement Thursday Feb 7. If you're aware of any bug fixes you think > ought to get included, now's the time to get them done ... I've just encountered another serious bug, which I wish to be fixed in the upcoming minor release. I'm using streaming replication with PostgreSQL 9.1.6 on Linux (RHEL6.2, kernel 2.6.32). But this problem should happen regardless of the use of streaming replication. When I ran "pg_ctl stop -mi" against the primary, some applications connected to the primary did not stop. The cause was that the backends was deadlocked in quickdie() with some call stack like the following. I'm sorry to have left the stack trace file on the testing machine, so I'll show you the precise stack trace tomorrow. some lock function malloc() gettext() errhint() quickdie() <signal handler called because of SIGQUIT> free() ... PostgresMain() ... The root cause is that gettext() is called in the signal handler quickdie() via errhint(). As you know, malloc() cannot be called in a signal handler: http://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy [Excerpt] On most systems, malloc and free are not reentrant, because they use a static data structure which records what memory blocks are free. As a result, no library functions that allocate or free memory are reentrant. This includes functions that allocate space to store a result. And gettext() calls malloc(), as reported below: http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00056.html I think the solution is the typical one. That is, to just remember the receipt of SIGQUIT by setting a global variable and call siglongjmp() in quickdie(), and perform tasks currently done in quickdie() when sigsetjmp() returns in PostgresMain(). What do think about the solution? Could you include the fix? If it's okay and you want, I'll submit the patch. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Tom Lane
"MauMau" <maumau307@gmail.com> writes: > When I ran "pg_ctl stop -mi" against the primary, some applications > connected to the primary did not stop. ... > The root cause is that gettext() is called in the signal handler quickdie() > via errhint(). Yeah, it's a known hazard that quickdie() operates like that. > I think the solution is the typical one. That is, to just remember the > receipt of SIGQUIT by setting a global variable and call siglongjmp() in > quickdie(), and perform tasks currently done in quickdie() when sigsetjmp() > returns in PostgresMain(). I think this cure is considerably worse than the disease. As stated, it's not a fix at all: longjmp'ing out of a signal handler is no better defined than what happens now, in fact it's probably even less safe. We could just set a flag and wait for the mainline code to notice, but that would make SIGQUIT hardly any stronger than SIGTERM --- in particular it couldn't get you out of any loop that wasn't checking for interrupts. The long and the short of it is that SIGQUIT is the emergency-stop panic button. You don't use it for routine shutdowns --- you use it when there is a damn good reason to and you're prepared to do some manual cleanup if necessary. http://en.wikipedia.org/wiki/Big_Red_Switch > What do think about the solution? Could you include the fix? Even if we had an arguably-better solution, I'd be disinclined to risk cramming it into stable branches on such short notice. What might make sense on short notice is to strengthen the documentation's cautions against using SIGQUIT unnecessarily. regards, tom lane
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Andres Freund
On 2013-01-30 10:23:09 -0500, Tom Lane wrote: > "MauMau" <maumau307@gmail.com> writes: > > When I ran "pg_ctl stop -mi" against the primary, some applications > > connected to the primary did not stop. ... > > The root cause is that gettext() is called in the signal handler quickdie() > > via errhint(). > > Yeah, it's a known hazard that quickdie() operates like that. What about not translating those? The messages are static and all memory needed by postgres should be pre-allocated. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-01-30 10:23:09 -0500, Tom Lane wrote: >> Yeah, it's a known hazard that quickdie() operates like that. > What about not translating those? The messages are static and all memory > needed by postgres should be pre-allocated. That would reduce our exposure slightly, but hardly to zero. For instance, if SIGQUIT happened in the midst of handling a regular error, ErrorContext might be pretty full already, necessitating further malloc requests. I thought myself about suggesting that quickdie do something to disable gettext(), but it doesn't seem like that would make it enough safer to justify the loss of user-friendliness for non English speakers. I think the conflict between "we don't want SIGQUIT to interrupt this" and "we do want SIGQUIT to interrupt that" is pretty fundamental, and there's probably not any bulletproof solution (or at least none that would have reasonable development/maintenance cost). If we had more confidence that there were no major loops lacking CHECK_FOR_INTERRUPTS calls, maybe the set-a-flag approach would be acceptable ... but I sure don't have such confidence. regards, tom lane
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: "MauMau"
From: "Tom Lane" <tgl@sss.pgh.pa.us> > "MauMau" <maumau307@gmail.com> writes: >> I think the solution is the typical one. That is, to just remember the >> receipt of SIGQUIT by setting a global variable and call siglongjmp() in >> quickdie(), and perform tasks currently done in quickdie() when >> sigsetjmp() >> returns in PostgresMain(). > > I think this cure is considerably worse than the disease. As stated, > it's not a fix at all: longjmp'ing out of a signal handler is no better > defined than what happens now, in fact it's probably even less safe. > We could just set a flag and wait for the mainline code to notice, > but that would make SIGQUIT hardly any stronger than SIGTERM --- in > particular it couldn't get you out of any loop that wasn't checking for > interrupts. Oh, I was careless. You are right, my suggestion is not a fix at all because free() would continue to hold some lock after siglongjmp(), which malloc() tries to acquire. > The long and the short of it is that SIGQUIT is the emergency-stop panic > button. You don't use it for routine shutdowns --- you use it when > there is a damn good reason to and you're prepared to do some manual > cleanup if necessary. > > http://en.wikipedia.org/wiki/Big_Red_Switch How about the case where some backend crashes due to a bug of PostgreSQL? In this case, postmaster sends SIGQUIT to all backends, too. The instance is expected to disappear cleanly and quickly. Doesn't the hanging backend harm the restart of the instance? How about using SIGKILL instead of SIGQUIT? The purpose of SIGQUIT is to shutdown the processes quickly. SIGKILL is the best signal for that purpose. The WARNING message would not be sent to clients, but that does not justify the inability of immediately shutting down. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Tom Lane
"MauMau" <maumau307@gmail.com> writes: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> The long and the short of it is that SIGQUIT is the emergency-stop panic >> button. You don't use it for routine shutdowns --- you use it when >> there is a damn good reason to and you're prepared to do some manual >> cleanup if necessary. > How about the case where some backend crashes due to a bug of PostgreSQL? > In this case, postmaster sends SIGQUIT to all backends, too. The instance > is expected to disappear cleanly and quickly. Doesn't the hanging backend > harm the restart of the instance? [ shrug... ] That isn't guaranteed, and never has been --- for instance, the process might have SIGQUIT blocked, perhaps as a result of third-party code we have no control over. > How about using SIGKILL instead of SIGQUIT? Because then we couldn't notify clients at all. One practical disadvantage of that is that it would become quite hard to tell from the outside which client session actually crashed, which is frequently useful to know. This isn't an area that admits of quick-fix solutions --- everything we might do has disadvantages. Also, the lack of complaints to date shows that the problem is not so large as to justify panic responses. I'm not really inclined to mess around with a tradeoff that's been working pretty well for a dozen years or more. regards, tom lane
> This isn't an area that admits of quick-fix solutions --- everything > we might do has disadvantages. Also, the lack of complaints to date > shows that the problem is not so large as to justify panic responses. > I'm not really inclined to mess around with a tradeoff that's been > working pretty well for a dozen years or more. What about adding a caution to the docs, something like: "pg_ctl -m -i stop" may cause a PostgreSQL hang if native language support is enabled. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 2013-01-31 08:27:13 +0900, Tatsuo Ishii wrote: > > This isn't an area that admits of quick-fix solutions --- everything > > we might do has disadvantages. Also, the lack of complaints to date > > shows that the problem is not so large as to justify panic responses. > > I'm not really inclined to mess around with a tradeoff that's been > > working pretty well for a dozen years or more. > > What about adding a caution to the doc something like: > > "pg_ctl -m -i stop" may cause a PostgreSQL hang if native laguage support enabled. That doesn't entirely solve the problem, see quote and reply in 6845.1359561252@sss.pgh.pa.us I think adding errmsg_raw() or somesuch that doesn't allocate any memory and only accepts constant strings could solve the problem more completely, at the obvious price of not allowing translated strings directly. Those could be pretranslated during startup, but that's mighty ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-01-31 08:27:13 +0900, Tatsuo Ishii wrote: >> What about adding a caution to the doc something like: >> "pg_ctl -m -i stop" may cause a PostgreSQL hang if native laguage support enabled. > That doesn't entirely solve the problem, see quote and reply in > 6845.1359561252@sss.pgh.pa.us > I think adding errmsg_raw() or somesuch that doesn't allocate any memory > and only accepts constant strings could solve the problem more > completely, at the obvious price of not allowing translated strings > directly. I really doubt that this would make a measurable difference in the probability of failure. The OP's case looks like it might not have occurred if we weren't translating, but (a) that's not actually proven, and (b) there are any number of other, equally low-probability, reasons to have a problem here. Please note for instance that elog.c would still be doing a whole lot of palloc's even if the passed strings were not copied. I think if we want to make it bulletproof we'd have to do what the OP suggested and switch to SIGKILL. I'm not enamored of that for the reasons I mentioned --- but one idea that might dodge the disadvantages is to have the postmaster wait a few seconds and then SIGKILL any backends that hadn't exited. regards, tom lane
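Tom's wait-then-SIGKILL idea can be sketched outside PostgreSQL roughly as below. This is illustrative only: the 5-second grace period and the helper names are assumptions for the example, not the behavior of any released postmaster, and the real logic would be event-driven rather than a sleep loop.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define GRACE_SECONDS 5         /* assumed grace period for this sketch */

/* Send SIGQUIT to every child, wait a bounded time, then SIGKILL survivors. */
static void escalate_shutdown(pid_t *children, int nchildren)
{
    int alive = nchildren;

    for (int i = 0; i < nchildren; i++)
        kill(children[i], SIGQUIT);

    for (int elapsed = 0; alive > 0 && elapsed < GRACE_SECONDS; elapsed++)
    {
        sleep(1);
        for (int i = 0; i < nchildren; i++)
        {
            if (children[i] != 0 &&
                waitpid(children[i], NULL, WNOHANG) == children[i])
            {
                children[i] = 0;        /* child exited and was reaped */
                alive--;
            }
        }
    }

    for (int i = 0; i < nchildren; i++)
        if (children[i] != 0)
            kill(children[i], SIGKILL); /* stragglers get the hard kill */
}

int main(void)
{
    pid_t kids[1];

    kids[0] = fork();
    if (kids[0] < 0)
        return 1;
    if (kids[0] == 0)
    {
        /* A child that ignores SIGQUIT, like the stuck backend in this thread. */
        signal(SIGQUIT, SIG_IGN);
        for (;;)
            pause();
    }
    escalate_shutdown(kids, 1);
    return 0;
}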
>> What about adding a caution to the doc something like: >> >> "pg_ctl -m -i stop" may cause a PostgreSQL hang if native laguage support enabled. > > That doesn't entirely solve the problem, see quote and reply in > 6845.1359561252@sss.pgh.pa.us Oh, I see now. > I think adding errmsg_raw() or somesuch that doesn't allocate any memory > and only accepts constant strings could solve the problem more > completely, at the obvious price of not allowing translated strings > directly. > Those could be pretranslated during startup, but thats mighty ugly. Are you suggesting to call errmsg_raw() instead of errmsg() in quickdie()? Tom said: > That would reduce our exposure slightly, but hardly to zero. For > instance, if SIGQUIT happened in the midst of handling a regular error, > ErrorContext might be pretty full already, necessitating further malloc > requests. If I understand this correctly, I don't think errmsg_raw() solves the particular problem. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
As I promised yesterday, I'll show you the precise call stack:
#0  0x0000003fa0cf542e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003fa0c7bed5 in _L_lock_9323 () from /lib64/libc.so.6
#2  0x0000003fa0c797c6 in malloc () from /lib64/libc.so.6
#3  0x0000003fa0c2fd99 in _nl_make_l10nflist () from /lib64/libc.so.6
#4  0x0000003fa0c2e0a5 in _nl_find_domain () from /lib64/libc.so.6
#5  0x0000003fa0c2d990 in __dcigettext () from /lib64/libc.so.6
#6  0x00000000006f2a71 in errhint ()
#7  0x0000000000634064 in quickdie ()
#8  <signal handler called>
#9  0x0000003fa0c77813 in _int_free () from /lib64/libc.so.6
#10 0x000000000070e329 in AllocSetDelete ()
#11 0x000000000070e8cb in MemoryContextDelete ()
#12 0x0000000000571723 in FreeExprContext ()
#13 0x0000000000571781 in FreeExecutorState ()
#14 0x00000000005dc883 in evaluate_expr ()
#15 0x00000000005ddca0 in simplify_function ()
#16 0x00000000005de69f in eval_const_expressions_mutator ()
#17 0x0000000000599143 in expression_tree_mutator ()
#18 0x00000000005de452 in eval_const_expressions_mutator ()
#19 0x0000000000599143 in expression_tree_mutator ()
#20 0x00000000005de452 in eval_const_expressions_mutator ()
#21 0x00000000005dfa2f in eval_const_expressions ()
#22 0x00000000005cf16d in preprocess_expression ()
#23 0x00000000005d2201 in subquery_planner ()
#24 0x00000000005d23cf in standard_planner ()
#25 0x000000000063426a in pg_plan_query ()
#26 0x0000000000634354 in pg_plan_queries ()
#27 0x0000000000635310 in exec_simple_query ()
#28 0x0000000000636333 in PostgresMain ()
#29 0x00000000005f64e9 in PostmasterMain ()
#30 0x0000000000596e20 in main ()
From: "Tom Lane" <tgl@sss.pgh.pa.us> > "MauMau" <maumau307@gmail.com> writes: >> How about the case where some backend crashes due to a bug of PostgreSQL? >> In this case, postmaster sends SIGQUIT to all backends, too. The >> instance >> is expected to disappear cleanly and quickly. Doesn't the hanging >> backend >> harm the restart of the instance? > > [ shrug... ] That isn't guaranteed, and never has been --- for > instance, the process might have SIGQUIT blocked, perhaps as a result > of third-party code we have no control over. Are you concerned about user-defined C functions? I don't think they need to block signals. So I don't find it too restrictive to say "do not block or send signals in user-defined functions." If it's a real concern, it should be noted in the manul, rather than writing "do not use pg_ctl stop -mi as much as you can, because it can leave hanging backends." >> How about using SIGKILL instead of SIGQUIT? > > Because then we couldn't notify clients at all. One practical > disadvantage of that is that it would become quite hard to tell from > the outside which client session actually crashed, which is frequently > useful to know. How is the message below useful to determine which client session actually crashed? The message doesn't contain information about the crashed session. Are you talking about log_line_prefix? ERROR: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. However, it is not quickdie() but LogChildExit() that emits useful information to tell which session crashed. So I don't think quickdie()'s message is very helpful. > I think if we want to make it bulletproof we'd have to do what the > OP suggested and switch to SIGKILL. I'm not enamored of that for the > reasons I mentioned --- but one idea that might dodge the disadvantages > is to have the postmaster wait a few seconds and then SIGKILL any > backends that hadn't exited. I believe that SIGKILL is the only and simple way to choose. Consider again: the purpose of "pg_ctl stop -mi" is to immediately and reliably shut down the instance. If it is not reliable, what can we do instead? Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Peter Eisentraut
On 1/30/13 9:11 AM, MauMau wrote: > When I ran "pg_ctl stop -mi" against the primary, some applications > connected to the primary did not stop. The cause was that the backends > was deadlocked in quickdie() with some call stack like the following. > I'm sorry to have left the stack trace file on the testing machine, so > I'll show you the precise stack trace tomorrow. I've had similar problems in the past: http://www.postgresql.org/message-id/1253704891.20834.8.camel@fsopti579.F-Secure.com The discussion there never quite concluded. But yes, you need to be prepared that in rare circumstances SIGQUIT won't succeed and you need to use SIGKILL.
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: "MauMau"
From: "Peter Eisentraut" <peter_e@gmx.net> > On 1/30/13 9:11 AM, MauMau wrote: >> When I ran "pg_ctl stop -mi" against the primary, some applications >> connected to the primary did not stop. The cause was that the backends >> was deadlocked in quickdie() with some call stack like the following. >> I'm sorry to have left the stack trace file on the testing machine, so >> I'll show you the precise stack trace tomorrow. > > I've had similar problems in the past: > > http://www.postgresql.org/message-id/1253704891.20834.8.camel@fsopti579.F-Secure.com > > The discussion there never quite concluded. But yes, you need to be > prepared that in rare circumstances SIGQUIT won't succeed and you need > to use SIGKILL. Thank you for sharing your experience. So you also considered making postmaster SIGKILL children like me, didn't you? I bet most of people who encounter this problem would feel like that. It is definitely pg_ctl who needs to be prepared, not the users. It may not be easy to find out postgres processes to SIGKILL if multiple instances are running on the same host. Just doing "pkill postgres" will unexpectedly terminate postgres of other instances. I would like to make a patch which that changes SIGQUIT to SIGKILL when postmaster terminates children. Any other better ideas? Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Kevin Grittner
MauMau <maumau307@gmail.com> wrote: > Just doing "pkill postgres" will unexpectedly terminate postgres > of other instances. Not if you run each instance under a different OS user, and execute pkill with the right user. (Never use root for that!) This is just one of the reasons that you should not run multiple clusters on the same machine with the same OS user. -Kevin
On 2013-01-22 22:19:25 -0500, Tom Lane wrote: > Since we've fixed a couple of relatively nasty bugs recently, the core > committee has determined that it'd be a good idea to push out PG update > releases soon. The current plan is to wrap on Monday Feb 4 for public > announcement Thursday Feb 7. If you're aware of any bug fixes you think > ought to get included, now's the time to get them done ... I think the patch in http://www.postgresql.org/message-id/20130130145521.GB3355@awork2.anarazel.de should get applied before the release if we decide it should be backpatched. And I think backpatching makes sense; the useless full-table scans will surely hurt more than some workload that's accidentally fixed by this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Peter Eisentraut
On 1/31/13 5:42 PM, MauMau wrote: > Thank you for sharing your experience. So you also considered making > postmaster SIGKILL children like me, didn't you? I bet most of people > who encounter this problem would feel like that. > > It is definitely pg_ctl who needs to be prepared, not the users. It may > not be easy to find out postgres processes to SIGKILL if multiple > instances are running on the same host. Just doing "pkill postgres" > will unexpectedly terminate postgres of other instances. In my case, it was one backend process segfaulting, and then some other backend processes didn't respond to the subsequent SIGQUIT sent out by the postmaster. So pg_ctl didn't have any part in it. We ended up addressing that by installing a nagios event handler that checked for this situation and cleaned it up. > I would like to make a patch which that changes SIGQUIT to SIGKILL when > postmaster terminates children. Any other better ideas? That was my idea back then, but there were some concerns about it. I found an old patch that I had prepared for this, which I have attached. YMMV.
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Andres Freund
On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote: > On 1/31/13 5:42 PM, MauMau wrote: > > Thank you for sharing your experience. So you also considered making > > postmaster SIGKILL children like me, didn't you? I bet most of people > > who encounter this problem would feel like that. > > > > It is definitely pg_ctl who needs to be prepared, not the users. It may > > not be easy to find out postgres processes to SIGKILL if multiple > > instances are running on the same host. Just doing "pkill postgres" > > will unexpectedly terminate postgres of other instances. > > In my case, it was one backend process segfaulting, and then some other > backend processes didn't respond to the subsequent SIGQUIT sent out by > the postmaster. So pg_ctl didn't have any part in it. > > We ended up addressing that by installing a nagios event handler that > checked for this situation and cleaned it up. > > > I would like to make a patch which that changes SIGQUIT to SIGKILL when > > postmaster terminates children. Any other better ideas? > > That was my idea back then, but there were some concerns about it. > > I found an old patch that I had prepared for this, which I have > attached. YMMV.
> +static void
> +quickdie_alarm_handler(SIGNAL_ARGS)
> +{
> +	/*
> +	 * We got here if ereport() was blocking, so don't go there again
> +	 * except when really asked for.
> +	 */
> +	elog(DEBUG5, "quickdie aborted by alarm");
> +
It's probably not wise to enter elog.c again, that path might allocate memory and we wouldn't be any wiser. Unfortunately there's not much besides a write(2) to stderr that can safely be done... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
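As an illustration of the "not much besides a write(2) to stderr" constraint, here is a hedged sketch of a handler restricted to async-signal-safe calls; the message is a compile-time constant, so nothing in the handler allocates memory or re-enters elog.c. It is not the patch under discussion and not PostgreSQL source.

#include <signal.h>
#include <unistd.h>

static void safe_quickdie_handler(int signo)
{
    /* A constant, untranslated message: nothing here can allocate memory. */
    static const char msg[] =
        "WARNING: terminating connection because of immediate shutdown request\n";

    /* write() and _exit() are on POSIX's async-signal-safe list. */
    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(2);
}

int main(void)
{
    signal(SIGQUIT, safe_quickdie_handler);

    for (;;)
        pause();                /* wait until told to shut down */
}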
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote: >> I found an old patch that I had prepared for this, which I have >> attached. YMMV. >> +static void >> +quickdie_alarm_handler(SIGNAL_ARGS) >> +{ >> + /* >> + * We got here if ereport() was blocking, so don't go there again >> + * except when really asked for. >> + */ >> + elog(DEBUG5, "quickdie aborted by alarm"); >> + > Its probably not wise to enter elog.c again, that path might allocate > memory and we wouldn't be any wiser. Unfortunately there's not much > besides a write(2) to stderr that can safely be done... Another objection to this is it still doesn't fix the case where the process has blocked SIGQUIT. My faith that libperl, libR, etc will never do that is nil. I think if we want something that's actually trustworthy, the wait-and-then-kill logic has to be in the postmaster. regards, tom lane
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: "MauMau"
Hello, Tom-san, folks, From: "Tom Lane" <tgl@sss.pgh.pa.us> > I think if we want to make it bulletproof we'd have to do what the > OP suggested and switch to SIGKILL. I'm not enamored of that for the > reasons I mentioned --- but one idea that might dodge the disadvantages > is to have the postmaster wait a few seconds and then SIGKILL any > backends that hadn't exited. > > I think if we want something that's actually trustworthy, the > wait-and-then-kill logic has to be in the postmaster. I'm sorry to be late in submitting a patch. I made a fix according to the above comment. Unfortunately, it was not in time for the 9.1.8 release... I hope this patch will be included in 9.1.9. Could you review the patch? The summary of the change is: 1. postmaster waits for children to terminate when it gets an immediate shutdown request, instead of exiting. 2. postmaster sends SIGKILL to remaining children if all of the child processes do not terminate within 10 seconds since the start of immediate shutdown or FatalError condition. 3. On Windows, kill(SIGKILL) calls TerminateProcess(). 4. Documentation change to describe the behavior of immediate shutdown. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera
MauMau escribió: > Could you review the patch? The summary of the change is: > 1. postmaster waits for children to terminate when it gets an > immediate shutdown request, instead of exiting. > > 2. postmaster sends SIGKILL to remaining children if all of the > child processes do not terminate within 10 seconds since the start > of immediate shutdown or FatalError condition. This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15. Is there a rationale behind the 10? If we said 60, that would fit perfectly well within the already existing 60-second loop in postmaster, but that seems way too long. I have only one concern about this patch, which is visible in the documentation proposed change:
  <para>
   This is the <firstterm>Immediate Shutdown</firstterm> mode.
   The master <command>postgres</command> process will send a
-  <systemitem>SIGQUIT</systemitem> to all child processes and exit
-  immediately, without properly shutting itself down. The child processes
-  likewise exit immediately upon receiving
-  <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
+  <systemitem>SIGQUIT</systemitem> to all child processes, wait for
+  them to terminate, and exit. The child processes
+  exit immediately upon receiving
+  <systemitem>SIGQUIT</systemitem>. If any of the child processes
+  does not terminate within 10 seconds for some unexpected reason,
+  the master postgres process will send a <systemitem>SIGKILL</systemitem>
+  to all remaining ones, wait for their termination
+  again, and exit. This will lead to recovery (by
   replaying the WAL log) upon next start-up. This is recommended
   only in emergencies.
  </para>
Note that the previous text said that postmaster will send SIGQUIT, then terminate without checking anything. In the new code, postmaster sends SIGQUIT, then waits, then SIGKILL, then waits again. If there's an unkillable process (say because it's stuck in a noninterruptible sleep) postmaster might never exit. I think it should send SIGQUIT, then wait, then SIGKILL, then exit without checking. I have tweaked the patch a bit and I'm about to commit as soon as we resolve the above two items. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: "MauMau"
First, thank you for the review. From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15. > Is there a rationale behind the 10? If we said 60, that would fit > perfectly well within the already existing 60-second loop in postmaster, > but that seems way too long. There is no good rationale. I arbitrarily chose a short period because this is "immediate" shutdown. I felt more than 10 seconds was long. I think 5 seconds may be better. Although not directly related to this fix, these influenced my choice: 1. According to the man page of init, init sends SIGKILL to all remaining processes 5 seconds after it sends SIGTERM to them. 2. At computer shutdown, Windows proceeds shutdown forcibly after waiting for services to terminate 20 seconds. > I have only one concern about this patch, which is visible in the > documentation proposed change: > > <para> > This is the <firstterm>Immediate Shutdown</firstterm> mode. > The master <command>postgres</command> process will send a > - <systemitem>SIGQUIT</systemitem> to all child processes and exit > - immediately, without properly shutting itself down. The child > processes > - likewise exit immediately upon receiving > - <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by > + <systemitem>SIGQUIT</systemitem> to all child processes, wait for > + them to terminate, and exit. The child processes > + exit immediately upon receiving > + <systemitem>SIGQUIT</systemitem>. If any of the child processes > + does not terminate within 10 seconds for some unexpected reason, > + the master postgres process will send a > <systemitem>SIGKILL</systemitem> > + to all remaining ones, wait for their termination > + again, and exit. This will lead to recovery (by > replaying the WAL log) upon next start-up. This is recommended > only in emergencies. > </para> > > Note that the previous text said that postmaster will send SIGQUIT, then > terminate without checking anything. In the new code, postmaster sends > SIGQUIT, then waits, then SIGKILL, then waits again. If there's an > unkillable process (say because it's stuck in a noninterruptible sleep) > postmaster might never exit. I think it should send SIGQUIT, then wait, > then SIGKILL, then exit without checking. At first I thought the same, but decided not to do that. The purpose of this patch is to make the immediate shutdown "reliable". Here, "reliable" means that the database server is certainly shut down when pg_ctl returns, not telling a lie that "I shut down the server processes for you, so you do not have to be worried that some postgres process might still remain and write to disk". I suppose reliable shutdown is crucial especially in HA cluster. If pg_ctl stop -mi gets stuck forever when there is an unkillable process (in what situations does this happen? OS bug, or NFS hard mount?), I think the DBA has to notice this situation from the unfinished pg_ctl, investigate the cause, and take corrective action. Anyway, in an HA cluster, the clusterware will terminate the node with STONITH, not leaving pg_ctl running forever. > I have tweaked the patch a bit and I'm about to commit as soon as we > resolve the above two items. I appreciate your tweaking, especially in the documentation and source code comments, as English is not my mother tongue. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera
MauMau escribió: > First, thank you for the review. > > From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > >This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15. > >Is there a rationale behind the 10? If we said 60, that would fit > >perfectly well within the already existing 60-second loop in postmaster, > >but that seems way too long. > > There is no good rationale. I arbitrarily chose a short period > because this is "immediate" shutdown. I felt more than 10 second > was long. I think 5 second may be better. Although not directly > related to this fix, these influenced my choice: > > 1. According to the man page of init, init sends SIGKILL to all > remaining processes 5 seconds after it sends SIGTERM to them. > > 2. At computer shutdown, Windows proceeds shutdown forcibly after > waiting for services to terminate 20 seconds. Well, as for the first of these, I don't think it matters whether processes get their SIGKILL from postmaster or from init, when a shutdown or runlevel is changing. Now, one would think that this sets a precedent. I think 20 seconds might be too long; perhaps it's expected in an operating system that forcibly has a GUI. I feel safe to ignore that. I will go with 5 seconds, then. > >Note that the previous text said that postmaster will send SIGQUIT, then > >terminate without checking anything. In the new code, postmaster sends > >SIGQUIT, then waits, then SIGKILL, then waits again. If there's an > >unkillable process (say because it's stuck in a noninterruptible sleep) > >postmaster might never exit. I think it should send SIGQUIT, then wait, > >then SIGKILL, then exit without checking. > > At first I thought the same, but decided not to do that. The > purpose of this patch is to make the immediate shutdown "reliable". My point is that there is no difference. For one thing, once we enter immediate shutdown state, and sigkill has been sent, no further action is taken. Postmaster will just sit there indefinitely until processes are gone. If we were to make it repeat SIGKILL until they die, that would be different. However, repeating SIGKILL is pointless, because if they didn't die when they first received it, they will still not die when they receive it a second time. Also, if they're in uninterruptible sleep and don't die, then they will die as soon as they get out of that state; no further queries will get processed, no further memory access will be done. So there's no harm in them remaining there until underlying storage returns to life, ISTM. > Here, "reliable" means that the database server is certainly shut > down when pg_ctl returns, not telling a lie that "I shut down the > server processes for you, so you do not have to be worried that some > postgres process might still remain and write to disk". I suppose > reliable shutdown is crucial especially in HA cluster. If pg_ctl > stop -mi gets stuck forever when there is an unkillable process (in > what situations does this happen? OS bug, or NFS hard mount?), I > think the DBA has to notice this situation from the unfinished > pg_ctl, investigate the cause, and take corrective action. So you're suggesting that keeping postmaster up is a useful sign that the shutdown is not going well? I'm not really sure about this. What do others think? > >I have tweaked the patch a bit and I'm about to commit as soon as we > >resolve the above two items. > > I appreciate your tweaking, especially in the documentation and > source code comments, as English is not my mother tongue.
IIRC the only other interesting tweak I did was rename the SignalAllChildren() function to TerminateChildren(). I did this because it doesn't really signal all children; syslogger and dead_end backends are kept around. So the original name was a bit misleading. And we couldn't really name it SignalAlmostAllChildren(), could we .. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > I will go with 5 seconds, then. OK, I agree. > My point is that there is no difference. For one thing, once we enter > immediate shutdown state, and sigkill has been sent, no further action > is taken. Postmaster will just sit there indefinitely until processes > are gone. If we were to make it repeat SIGKILL until they die, that > would be different. However, repeating SIGKILL is pointless, because it > they didn't die when they first received it, they will still not die > when they receive it second. Also, if they're in uninterruptible sleep > and don't die, then they will die as soon as they get out of that state; > no further queries will get processed, no further memory access will be > done. So there's no harm in they remaining there until underlying > storage returns to life, ISTM. > >> Here, "reliable" means that the database server is certainly shut >> down when pg_ctl returns, not telling a lie that "I shut down the >> server processes for you, so you do not have to be worried that some >> postgres process might still remain and write to disk". I suppose >> reliable shutdown is crucial especially in HA cluster. If pg_ctl >> stop -mi gets stuck forever when there is an unkillable process (in >> what situations does this happen? OS bug, or NFS hard mount?), I >> think the DBA has to notice this situation from the unfinished >> pg_ctl, investigate the cause, and take corrective action. > > So you're suggesting that keeping postmaster up is a useful sign that > the shutdown is not going well? I'm not really sure about this. What > do others think? I think you are right, and there is no harm in leaving postgres processes in unkillable state. I'd like to leave the decision to you and/or others. One concern is that umount would fail in such a situation because postgres has some open files on the filesystem, which is on the shared disk in case of traditional HA cluster. However, STONITH should resolve the problem by terminating the stuck node... I just feel it is strange for umount to fail due to remaining postgres, because pg_ctl stop -mi reported success. > IIRC the only other interesting tweak I did was rename the > SignalAllChildren() function to TerminateChildren(). I did this because > it doesn't really signal all children; syslogger and dead_end backends > are kept around. So the original name was a bit misleading. And we > couldn't really name it SignalAlmostAllChildren(), could we .. I see. thank you. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Noah Misch
Date:
On Thu, Jun 20, 2013 at 12:33:25PM -0400, Alvaro Herrera wrote: > MauMau escribió: > > Here, "reliable" means that the database server is certainly shut > > down when pg_ctl returns, not telling a lie that "I shut down the > > server processes for you, so you do not have to be worried that some > > postgres process might still remain and write to disk". I suppose > > reliable shutdown is crucial especially in HA cluster. If pg_ctl > > stop -mi gets stuck forever when there is an unkillable process (in > > what situations does this happen? OS bug, or NFS hard mount?), I > > think the DBA has to notice this situation from the unfinished > > pg_ctl, investigate the cause, and take corrective action. > > So you're suggesting that keeping postmaster up is a useful sign that > the shutdown is not going well? I'm not really sure about this. What > do others think? It would be valuable for "pg_ctl -w -m immediate stop" to have the property that a subsequent start attempt will not fail due to the presence of some backend still attached to shared memory. (Maybe that's true anyway or can be achieved a better way; I have not investigated.) -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
Noah Misch escribió: > On Thu, Jun 20, 2013 at 12:33:25PM -0400, Alvaro Herrera wrote: > > MauMau escribió: > > > Here, "reliable" means that the database server is certainly shut > > > down when pg_ctl returns, not telling a lie that "I shut down the > > > server processes for you, so you do not have to be worried that some > > > postgres process might still remain and write to disk". I suppose > > > reliable shutdown is crucial especially in HA cluster. If pg_ctl > > > stop -mi gets stuck forever when there is an unkillable process (in > > > what situations does this happen? OS bug, or NFS hard mount?), I > > > think the DBA has to notice this situation from the unfinished > > > pg_ctl, investigate the cause, and take corrective action. > > > > So you're suggesting that keeping postmaster up is a useful sign that > > the shutdown is not going well? I'm not really sure about this. What > > do others think? > > It would be valuable for "pg_ctl -w -m immediate stop" to have the property > that a subsequent start attempt will not fail due to the presence of some > backend still attached to shared memory. (Maybe that's true anyway or can be > achieved a better way; I have not investigated.) Well, the only case where a process that's been SIGKILLed does not go away, as far as I know, is when it is in some uninterruptible sleep due to in-kernel operations that get stuck. Personally I have never seen this happen in any other case than some network filesystem getting disconnected, or a disk that doesn't respond. And whenever the filesystem starts to respond again, the process gets out of its sleep only to die due to the signal. So a subsequent start attempt will either find that the filesystem is not responding, in which case it'll probably fail to work properly anyway (presumably the filesystem corresponds to part of the data directory), or that it has revived in which case the old backends have already gone away. If we leave postmaster running after SIGKILLing its children, the only thing we can do is have it continue to SIGKILL processes continuously every few seconds until they die (or just sit around doing nothing until they all die). I don't think this will have a different effect than postmaster going away trusting the first SIGKILL to do its job eventually. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
MauMau escribió: > From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > One concern is that umount would fail in such a situation because > postgres has some open files on the filesystem, which is on the > shared disk in case of traditional HA cluster. See my reply to Noah. If postmaster stays around, would this be any different? I don't think so. > >IIRC the only other interesting tweak I did was rename the > >SignalAllChildren() function to TerminateChildren(). I did this because > >it doesn't really signal all children; syslogger and dead_end backends > >are kept around. So the original name was a bit misleading. And we > >couldn't really name it SignalAlmostAllChildren(), could we .. > > I see. thank you. Actually, in further testing I noticed that the fast-path you introduced in BackendCleanup (or was it HandleChildCrash?) in the immediate shutdown case caused postmaster to fail to clean up properly after sending the SIGKILL signal, so I had to remove that as well. Was there any reason for that? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
Actually, I think it would be cleaner to have a new state in pmState, namely PM_IMMED_SHUTDOWN which is entered when we send SIGQUIT. When we're in this state, postmaster is only waiting for the timeout to expire; and when it does, it sends SIGKILL and exits. Pretty much the same you have, except that instead of checking AbortStartTime we check the pmState variable. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
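(Editorial sketch, for illustration only: roughly what the ServerLoop check might look like under this proposal. PM_IMMED_SHUTDOWN is not an existing pmState value, and TerminateChildren()/ExitPostmaster() appear here only because they are the helpers mentioned elsewhere in this thread; none of this is committed code.)

    /* Hypothetical sketch of the proposal above -- not committed code. */
    if (pmState == PM_IMMED_SHUTDOWN &&
        now - AbortStartTime >= 5)      /* grace period being discussed */
    {
        TerminateChildren(SIGKILL);     /* the renamed SignalAllChildren() */
        ExitPostmaster(0);              /* "sends SIGKILL and exits" */
    }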
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Hitoshi Harada
Date:
On Thu, Jun 20, 2013 at 3:40 PM, MauMau <maumau307@gmail.com> wrote:
>>> Here, "reliable" means that the database server is certainly shut
>>> down when pg_ctl returns, not telling a lie that "I shut down the
>>> server processes for you, so you do not have to be worried that some
>>> postgres process might still remain and write to disk". I suppose
>>> reliable shutdown is crucial especially in HA cluster. If pg_ctl
>>> stop -mi gets stuck forever when there is an unkillable process (in
>>> what situations does this happen? OS bug, or NFS hard mount?), I
>>> think the DBA has to notice this situation from the unfinished
>>> pg_ctl, investigate the cause, and take corrective action.
>>
>> So you're suggesting that keeping postmaster up is a useful sign that
>> the shutdown is not going well? I'm not really sure about this. What
>> do others think?
>
> Thanks, I think you are right, and there is no harm in leaving postgres
> processes in unkillable state. I'd like to leave the decision to you
> and/or others.

+1 for leaving processes, not waiting for long (or possibly forever; remember, not all people are running postgres on such cluster ware). I'm sure some users rely on the current behavior of immediate shutdown. If someone wants to ensure processes are finished when pg_ctl returns, isn't that fast shutdown, not immediate shutdown? To me the current immediate shutdown is reliable, in that it without doubt returns control back to the terminal, after killing the postmaster at least.
--
Hitoshi Harada
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Andres Freund
Date:
On 2013-06-20 22:36:45 -0400, Alvaro Herrera wrote: > Noah Misch escribió: > > On Thu, Jun 20, 2013 at 12:33:25PM -0400, Alvaro Herrera wrote: > > > MauMau escribió: > > > > Here, "reliable" means that the database server is certainly shut > > > > down when pg_ctl returns, not telling a lie that "I shut down the > > > > server processes for you, so you do not have to be worried that some > > > > postgres process might still remain and write to disk". I suppose > > > > reliable shutdown is crucial especially in HA cluster. If pg_ctl > > > > stop -mi gets stuck forever when there is an unkillable process (in > > > > what situations does this happen? OS bug, or NFS hard mount?), I > > > > think the DBA has to notice this situation from the unfinished > > > > pg_ctl, investigate the cause, and take corrective action. > > > > > > So you're suggesting that keeping postmaster up is a useful sign that > > > the shutdown is not going well? I'm not really sure about this. What > > > do others think? > > > > It would be valuable for "pg_ctl -w -m immediate stop" to have the property > > that a subsequent start attempt will not fail due to the presence of some > > backend still attached to shared memory. (Maybe that's true anyway or can be > > achieved a better way; I have not investigated.) > > Well, the only case where a process that's been SIGKILLed does not go > away, as far as I know, is when it is in some uninterruptible sleep due > to in-kernel operations that get stuck. Personally I have never seen > this happen in any other case than some network filesystem getting > disconnected, or a disk that doesn't respond. And whenever the > filesystem starts to respond again, the process gets out of its sleep > only to die due to the signal. Those are the situations in which it takes a really long time, yes. But there can be timing issues involved. Consider a backend that's currently stuck in a write() because it hit the dirtying limit. Say you have a postgres cluster that's currently slowing down to a crawl because it's overloaded and hitting the dirty limit. Somebody very well might just want to restart it with -m immediate. In that case a delay of a second or two till enough dirty memory has been written that write() can continue is enough for the postmaster to start up again and try to attach to shared memory where it will find the shared memory to be still in use. I don't really see any argument for *not* waiting. Sure it might take a bit longer till it's shut down, but if it didn't wait that will cause problems down the road. > If we leave postmaster running after SIGKILLing its children, the only > thing we can do is have it continue to SIGKILL processes continuously > every few seconds until they die (or just sit around doing nothing until > they all die). I don't think this will have a different effect than > postmaster going away trusting the first SIGKILL to do its job > eventually. I think it should just wait till all its child processes are dead. No need to repeat sending the signals - as you say, that won't help. What we could do to improve the robustness a bit - at least on linux - is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be killed if the postmaster goes away... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
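(For readers unfamiliar with the call Andres mentions: a minimal, standalone, Linux-only sketch of the prctl(PR_SET_PDEATHSIG) idea. The fork()/pause() scaffolding is purely illustrative and has nothing to do with the actual postmaster code.)

    #include <sys/types.h>
    #include <sys/prctl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    child_main(void)
    {
        /* Ask the kernel to SIGKILL us if our parent exits for any reason. */
        if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0)
            perror("prctl");

        /* Guard against the parent having exited before the prctl() call. */
        if (getppid() == 1)
            raise(SIGKILL);

        for (;;)
            pause();            /* stand-in for normal child work */
    }

    int
    main(void)
    {
        if (fork() == 0)
            child_main();
        sleep(60);              /* parent exits here; the child is then killed */
        return 0;
    }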
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > MauMau escribió: > >> One concern is that umount would fail in such a situation because >> postgres has some open files on the filesystem, which is on the >> shared disk in case of traditional HA cluster. > > See my reply to Noah. If postmaster stays around, would this be any > different? I don't think so. I'll consider this a bit and respond as a reply to Andres-san's mail. > Actually, in further testing I noticed that the fast-path you introduced > in BackendCleanup (or was it HandleChildCrash?) in the immediate > shutdown case caused postmaster to fail to clean up properly after > sending the SIGKILL signal, so I had to remove that as well. Was there > any reason for that? You are talking about the code below at the beginning of HandleChildCrash(), aren't you? /* Do nothing if the child terminated due to immediate shutdown */if (Shutdown == ImmediateShutdown) return; If my memory is correct, this was necessary; without this, HandleChildCrash() or LogChildExit() left extra messages in the server log. LOG: %s (PID %d) exited with exit code %d LOG: terminating any other active server processes These messages are output because postmaster sends SIGQUIT to its children and wait for them to terminate. The children exit with non-zero status when they receive SIGQUIT. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > Actually, I think it would be cleaner to have a new state in pmState, > namely PM_IMMED_SHUTDOWN which is entered when we send SIGQUIT. When > we're in this state, postmaster is only waiting for the timeout to > expire; and when it does, it sends SIGKILL and exits. Pretty much the > same you have, except that instead of checking AbortStartTime we check > the pmState variable. Are you suggesting simplifying the following part in ServerLoop()? I welcome the idea if this condition becomes simpler. However, I cannot imagine how. if (AbortStartTime > 0 && /* SIGKILL only once */ (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) && now- AbortStartTime >= 10) { SignalAllChildren(SIGKILL); AbortStartTime = 0; } I thought of adding some new state of pmState for some reason (that might be the same as your idea). But I refrained from doing that, because pmState has already many states. I was afraid adding a new pmState value for this bug fix would complicate the state management (e.g. state transition in PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was appropriate because postmaster is actually waiting for backends to terminate after sending SIGQUIT. The state name is natural. I don't have strong objection to your idea if it makes the code cleaner and more understandable. Thank you very much. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I will go with 5 seconds, then. I'm uncomfortable with this whole concept, and particularly with such a short timeout. On a very busy system, things can take a LOT longer than we think they should; it can take 30 seconds or more just to get a prompt back from a shell command. 5 seconds is the blink of an eye. More generally, what do we think the point is of sending SIGQUIT rather than SIGKILL in the first place, and why does that point cease to be valid after 5 seconds? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > More generally, what do we think the point is of sending SIGQUIT > rather than SIGKILL in the first place, and why does that point cease > to be valid after 5 seconds? Well, mostly it's about telling the client we're committing hara-kiri. Without that, there's no very good reason to run quickdie() at all. A practical issue with starting to send SIGKILL ourselves is that we will no longer be able to reflexively diagnose "server process died on signal 9" as "the linux OOM killer got you". I'm not at all convinced that the cases where SIGQUIT doesn't work are sufficiently common to justify losing that property. regards, tom lane
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> More generally, what do we think the point is of sending SIGQUIT >> rather than SIGKILL in the first place, and why does that point cease >> to be valid after 5 seconds? > > Well, mostly it's about telling the client we're committing hara-kiri. > Without that, there's no very good reason to run quickdie() at all. That's what I thought, too. It seems to me that if we think that's important, then it's important even if it takes more than 5 seconds for some reason. > A practical issue with starting to send SIGKILL ourselves is that we > will no longer be able to reflexively diagnose "server process died > on signal 9" as "the linux OOM killer got you". I'm not at all > convinced that the cases where SIGQUIT doesn't work are sufficiently > common to justify losing that property. I'm not, either. Maybe this question will provoke many indignant responses, but who in their right mind even uses immediate shutdown on a production server with any regularity? The shutdown checkpoint is sometimes painfully long, but do you really want to run recovery just to avoid it? And in the rare case where an immediate shutdown fails to work, what's wrong with "killall -9 postgres"? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
Andres Freund escribió: > On 2013-06-20 22:36:45 -0400, Alvaro Herrera wrote: > > If we leave postmaster running after SIGKILLing its children, the only > > thing we can do is have it continue to SIGKILL processes continuously > > every few seconds until they die (or just sit around doing nothing until > > they all die). I don't think this will have a different effect than > > postmaster going away trusting the first SIGKILL to do its job > > eventually. > > I think it should just wait till all its child processes are dead. No > need to repeat sending the signals - as you say, that won't help. OK, I can buy that. So postmaster stays around waiting in ServerLoop until all children are gone; and if any persists for whatever reason, well, tough. > What we could do to improve the robustness a bit - at least on linux - > is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be > killed if the postmaster goes away... This is an interesting idea (even if it has no relationship to the patch at hand). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Christopher Browne
Date:
<p dir="ltr">The case where I wanted "routine" shutdown immediate (and I'm not sure I ever actually got it) was when we wereusing IBM HA/CMP, where I wanted a "terminate with a fair bit of prejudice".<p dir="ltr">If we know we want to "switchright away now", immediate seemed pretty much right. I was fine with interrupting user sessions, and there wasn'tas much going on in the way of system background stuff back then.<p dir="ltr">I wasn't keen on waiting on much of anything. The background writer ought to be keeping things from being too desperately out of date.<p dir="ltr">If there'sstuff worth waiting a few seconds for, I'm all ears.<p dir="ltr">But if I have to wait arbitrarily long, colour meunhappy.<p dir="ltr">If I have to distinguish, myself, between a checkpoint nearly done flushing and a backend that's stuckwaiting forlornly for filesystem access, I'm inclined to "kill -9" and hope recovery doesn't take *too* long on thenext node...<p dir="ltr">If shutting a server down in an emergency situation requires a DBA to look in, as opposed toinit.d doing its thing, I think that's pretty much the same problem too.
Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Andres Freund escribió: >> What we could do to improve the robustness a bit - at least on linux - >> is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be >> killed if the postmaster goes away... > This is an interesting idea (even if it has no relationship to the patch > at hand). The traditional theory has been that that would be less robust, not more so. Child backends are (mostly) able to carry out queries whether or not the postmaster is around. True, you can't make new connections, but how does killing the existing children make that better? regards, tom lane
Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Andres Freund escribió: >>> What we could do to improve the robustness a bit - at least on linux - >>> is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be >>> killed if the postmaster goes away... > >> This is an interesting idea (even if it has no relationship to the patch >> at hand). > > The traditional theory has been that that would be less robust, not > more so. Child backends are (mostly) able to carry out queries whether > or not the postmaster is around. I think that's the Tom Lane theory. The Robert Haas theory is that if the postmaster has died, there's no reason to suppose that it hasn't corrupted shared memory on the way down, or that the system isn't otherwise heavily fuxxored in some way. > True, you can't make new connections, > but how does killing the existing children make that better? It allows you to start a new postmaster in a timely fashion, instead of waiting for an idle connection that may not ever terminate without operator intervention. Even if it were possible to start a new postmaster that attached to the existing shared memory segment and began spawning new children, I think I'd be heavily in favor of killing the old ones off first and doing a full system reset just for safety. But it isn't, so what you get is a crippled system that never recovers without operator intervention. And note that I'm not talking about "pg_ctl restart"; that fails because the children have the shmem segment still attached and the postmaster, which is the only thing listed in the PID file, is already dead. I'm talking about "killall -9 postgres", at least. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com>On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> I will go with 5 seconds, then. > > I'm uncomfortable with this whole concept, and particularly with such > a short timeout. On a very busy system, things can take a LOT longer > than they think we should; it can take 30 seconds or more just to get > a prompt back from a shell command. 5 seconds is the blink of an eye. I'm comfortable with 5 seconds. We are talking about the interval between sending SIGQUIT to the children and then sending SIGKILL to them. In most situations, the backends should terminate immediately. However, as I said a few months ago, ereport() call in quickdie() can deadlock indefinitely. This is a PostgreSQL's bug. In addition, Tom san was concerned that some PLs (PL/Perl or PL/Python?) block SIGQUIT while executing the UDF, so they may not be able to respond to the immediate shutdown request. What DBAs want from "pg_ctl stop -mi" is to shutdown the database server as immediately as possible. So I think 5 second is reasonable. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> More generally, what do we think the point is of sending SIGQUIT >>> rather than SIGKILL in the first place, and why does that point cease >>> to be valid after 5 seconds? >> >> Well, mostly it's about telling the client we're committing hara-kiri. >> Without that, there's no very good reason to run quickdie() at all. > > That's what I thought, too. It seems to me that if we think that's > important, then it's important even if it takes more than 5 seconds > for some reason. I guess Tom san is saying "we should be as kind as possible to the client, so try to notify the client of the shutdown". Not complete kindness. Because the DBA requested immediate shutdown by running "pg_ctl stop -mi", the top priority is to shutdown the database server as immediately as possible. The idea here is to try to be friendly to the client as long as the DBA can stand. That's tthe 5 second. >> A practical issue with starting to send SIGKILL ourselves is that we >> will no longer be able to reflexively diagnose "server process died >> on signal 9" as "the linux OOM killer got you". I'm not at all >> convinced that the cases where SIGQUIT doesn't work are sufficiently >> common to justify losing that property. > > I'm not, either. Maybe this question will provoke many indignant > responses, but who in their right mind even uses immediate shutdown on > a production server with any regularity? The shutdown checkpoint is > sometimes painfully long, but do you really want to run recovery just > to avoid it? And in the rare case where an immediate shutdown fails > to work, what's wrong will "killall -9 postgres"? Checkpoint is irrelevant here because we are discussing immediate shutdown. Some problems with "killall -9 postgres" are: 1. It's not available on Windows. 2. It may kill other database server instances running on the same machine. Regards MauMau
Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The traditional theory has been that that would be less robust, not >> more so. Child backends are (mostly) able to carry out queries whether >> or not the postmaster is around. > I think that's the Tom Lane theory. The Robert Haas theory is that if > the postmaster has died, there's no reason to suppose that it hasn't > corrupted shared memory on the way down, or that the system isn't > otherwise heavily fuxxored in some way. Eh? The postmaster does its level best never to touch shared memory (after initialization anyway). >> True, you can't make new connections, >> but how does killing the existing children make that better? > It allows you to start a new postmaster in a timely fashion, instead > of waiting for an idle connection that may not ever terminate without > operator intervention. There may be something in that argument, but I find the other one completely bogus. regards, tom lane
Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Andres Freund
Date:
On 2013-06-21 23:19:27 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> The traditional theory has been that that would be less robust, not > >> more so. Child backends are (mostly) able to carry out queries whether > >> or not the postmaster is around. > > > I think that's the Tom Lane theory. The Robert Haas theory is that if > > the postmaster has died, there's no reason to suppose that it hasn't > > corrupted shared memory on the way down, or that the system isn't > > otherwise heavily fuxxored in some way. > > Eh? The postmaster does its level best never to touch shared memory > (after initialization anyway). I am not sure that will never happen - but I think the chain of argument misses the main point. Normally we rely on the postmaster to kill off all other backends if a backend PANICs or segfaults for all the known reasons. As soon as there's no postmaster anymore we lose that capability. And *that* is scary imo. Especially as I would say the chance of getting PANICs or segfaults increases if there's no postmaster anymore since we might reach code branches we otherwise won't. > >> True, you can't make new connections, > >> but how does killing the existing children make that better? > > > It allows you to start a new postmaster in a timely fashion, instead > > of waiting for an idle connection that may not ever terminate without > > operator intervention. And it's not always easy to figure out which cluster those backends belong to if there are multiple postgres instances running as the same user. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
MauMau escribió: > Are you suggesting simplifying the following part in ServerLoop()? > I welcome the idea if this condition becomes simpler. However, I > cannot imagine how. > if (AbortStartTime > 0 && /* SIGKILL only once */ > (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) && > now - AbortStartTime >= 10) > { > SignalAllChildren(SIGKILL); > AbortStartTime = 0; > } Yes, that's what I'm suggesting. I haven't tried coding it yet. > I thought of adding some new state of pmState for some reason (that > might be the same as your idea). > But I refrained from doing that, because pmState has already many > states. I was afraid adding a new pmState value for this bug fix > would complicate the state management (e.g. state transition in > PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was > appropriate because postmaster is actually waiting for backends to > terminate after sending SIGQUIT. The state name is natural. Well, a natural state name is of no use if we're using it to indicate two different states, which I think is what would be happening here. Basically, with your patch, PM_WAIT_BACKENDS means two different things depending on AbortStartTime. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Andres Freund
Date:
On 2013-06-21 16:56:57 -0400, Alvaro Herrera wrote: > > What we could do to improve the robustness a bit - at least on linux - > > is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be > > killed if the postmaster goes away... > > This is an interesting idea (even if it has no relationship to the patch > at hand). Well, we could just set the deathsig to SIGKILL and exit normally - which would be a twoliner or so. Admittedly the cross-platform issue makes this approach not so great... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Fri, Jun 21, 2013 at 11:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think that's the Tom Lane theory. The Robert Haas theory is that if >> the postmaster has died, there's no reason to suppose that it hasn't >> corrupted shared memory on the way down, or that the system isn't >> otherwise heavily fuxxored in some way. > > Eh? The postmaster does its level best never to touch shared memory > (after initialization anyway). And yet it certainly does - see pmsignal.c, for example. Besides which, as Andres points out, if the postmaster is dead, there is zip for a guarantee that some OTHER backend hasn't panicked. I think it's just ridiculous to suppose that the system can run in any sort of reasonable way without the postmaster. The whole reason why we work so hard to make sure that the postmaster doesn't die in the first place is because we need it to clean up when things go horribly wrong. If that cleanup function is important, then we need a living postmaster at all times. If it's not important, then our extreme paranoia about what operations the postmaster is permitted to engage in is overblown. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Fri, Jun 21, 2013 at 10:02 PM, MauMau <maumau307@gmail.com> wrote: > I'm comfortable with 5 seconds. We are talking about the interval between > sending SIGQUIT to the children and then sending SIGKILL to them. In most > situations, the backends should terminate immediately. However, as I said a > few months ago, ereport() call in quickdie() can deadlock indefinitely. This > is a PostgreSQL's bug. So let's fix that bug. Then we don't need this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > MauMau escribió: >> I thought of adding some new state of pmState for some reason (that >> might be the same as your idea). >> But I refrained from doing that, because pmState has already many >> states. I was afraid adding a new pmState value for this bug fix >> would complicate the state management (e.g. state transition in >> PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was >> appropriate because postmaster is actually waiting for backends to >> terminate after sending SIGQUIT. The state name is natural. > > Well, a natural state name is of no use if we're using it to indicate > two different states, which I think is what would be happening here. > Basically, with your patch, PM_WAIT_BACKENDS means two different things > depending on AbortStartTime. i PROBABLY GOT YOUR FEELING. yOU AREN'T FEELING COMFORTABLE ABOUT USING THE TIME VARIABLE aBORTsTARTtIME AS A STATE VARIABLE TO CHANGE POSTMASTER'S BEHAVIOR, ARE YOU? tHAT MAY BE RIGHT, BUT i'M NOT SURE WELL... BECAUSE IN pm_wait_backends, AS THE NAME INDICATES, POSTMASTER IS INDEED WAITING FOR THE BACKENDS TO TERMINATE REGARDLESS OF aBORTsTARTtIME. aPART FROM THIS, POSTMASTER SEEMS TO CHANGE ITS BEHAVIOR IN THE SAME PMsTATE DEPENDING ON OTHER VARIABLES SUCH AS sHUTDOWN AND fATALeRROR. i'M NOT CONFIDENT IN WHICH IS BETTER, SO i WON'T OBJECT IF THE REVIEWER OR COMMITTER MODIFIES THE CODE. rEGARDS mAUmAU
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Fri, Jun 21, 2013 at 10:02 PM, MauMau <maumau307@gmail.com> wrote: >> I'm comfortable with 5 seconds. We are talking about the interval >> between >> sending SIGQUIT to the children and then sending SIGKILL to them. In >> most >> situations, the backends should terminate immediately. However, as I >> said a >> few months ago, ereport() call in quickdie() can deadlock indefinitely. >> This >> is a PostgreSQL's bug. > > So let's fix that bug. Then we don't need this. tHERE ARE TWO WAYS TO FIX THAT BUG. yOU ARE SUGGESTING 1 OF THE FOLLOWING TWO METHODS, AREN'T YOU? 1. (rOBERT-SAN'S IDEA) uPON RECEIPT OF IMMEDIATE SHUTDOWN REQUEST, POSTMASTER SENDS sigkill TO ITS CHILDREN. 2. (tOM-SAN'S IDEA) uPON RECEIPT OF IMMEDIATE SHUTDOWN REQUEST, POSTMASTER FIRST SENDS sigquit TO ITS CHILDREN, WAIT A WHILE FOR THEM TO TERMINATE, THEN SENDS sigkill TO THEM. aT FIRST i PROPOSED 1. tHEN tOM SAN SUGGESTED 2 SO THAT THE SERVER IS AS FRIENDLY TO THE CLIENTS AS NOW BY NOTIFYING THEM OF THE SERVER SHUTDOWN. i WAS CONVINCED BY THAT IDEA AND CHOSE 2. bASICALLY, i THINK BOTH IDEAS ARE RIGHT. tHEY CAN SOLVE THE ORIGINAL PROBLEM. hOWEVER, RE-CONSIDERING THE MEANING OF "IMMEDIATE" SHUTDOWN, i FEEL 1 IS A BIT BETTER, BECAUSE TRYING TO DO SOMETHING IN QUICKDIE() OR SOMEWHERE DOES NOT MATCH THE IDEA OF "IMMEDIATE". wE MAY NOT HAVE TO BE FRIENDLY TO THE CLIENTS AT THE IMMEDIATE SHUTDOWN. tHE CODE GETS MUCH SIMPLER. iN ADDITION, IT MAY BE BETTER THAT WE SIMILARLY SEND sigkill IN BACKEND CRASH (fATALeRROR) CASE, ELIMINATE THE USE OF sigquit AND REMOVE QUICKDIE() AND OTHER sigquit HANDLERS. wHAT DO YOU THINK? hOW SHOULD WE MAKE CONSENSUS AND PROCEED? rEGARDS mAUmAU
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Noah Misch
Date:
On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote: > From: "Robert Haas" <robertmhaas@gmail.com> >> On Fri, Jun 21, 2013 at 10:02 PM, MauMau <maumau307@gmail.com> wrote: >>> I'm comfortable with 5 seconds. We are talking about the interval >>> between >>> sending SIGQUIT to the children and then sending SIGKILL to them. In >>> most >>> situations, the backends should terminate immediately. However, as I >>> said a >>> few months ago, ereport() call in quickdie() can deadlock >>> indefinitely. This >>> is a PostgreSQL's bug. >> >> So let's fix that bug. Then we don't need this. > There are two ways to fix that bug. You are suggesting 1 of the > following two methods, aren't you? I suspect Robert meant to prevent the deadlock by limiting quickdie() to calling only async-signal-safe functions. Implementing that looked grotty when we last discussed it, though, and it would not help against a library or trusted PL function suspending SIGQUIT. > 1. (Robert-san's idea) > Upon receipt of immediate shutdown request, postmaster sends SIGKILL to > its children. > > 2. (Tom-san's idea) > Upon receipt of immediate shutdown request, postmaster first sends > SIGQUIT to its children, wait a while for them to terminate, then sends > SIGKILL to them. > > > At first I proposed 1. Then Tom san suggested 2 so that the server is as > friendly to the clients as now by notifying them of the server shutdown. > I was convinced by that idea and chose 2. > > Basically, I think both ideas are right. They can solve the original > problem. Agreed. > However, re-considering the meaning of "immediate" shutdown, I feel 1 is > a bit better, because trying to do something in quickdie() or somewhere > does not match the idea of "immediate". We may not have to be friendly to Perhaps so, but more important than the definition of the word "immediate" is what an immediate shutdown has long meant in PostgreSQL. All this time it has made some effort to send a message to the client. Note also that the patch under discussion affects both immediate shutdowns and the automatic reset in response to a backend crash. I think the latter is the more-important use case, and the message is nice to have there. > the clients at the immediate shutdown. The code gets much simpler. In > addition, it may be better that we similarly send SIGKILL in backend > crash (FatalError) case, eliminate the use of SIGQUIT and remove > quickdie() and other SIGQUIT handlers. My take is that the client message has some value, and we shouldn't just discard it to simplify the code slightly. Finishing the shutdown quickly has value, of course. The relative importance of those things should guide the choice of a timeout under method #2, but I don't see a rigorous way to draw the line. I feel five seconds is, if anything, too high. What about using deadlock_timeout? It constitutes a rough barometer on the system's tolerance for failure detection latency, and we already overload it by having it guide log_lock_waits. The default of 1s makes sense to me for this purpose, too. We can always add a separate immediate_shutdown_timeout if there's demand, but I don't expect such demand. (If we did add such a GUC, setting it to zero could be the way to request method 1. If some folks strongly desire method 1, that's probably the way to offer it.) Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
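(If deadlock_timeout were reused this way, the escalation check quoted earlier in the thread might read roughly as follows. This is a purely hypothetical editorial illustration, not existing code; note that DeadlockTimeout is kept in milliseconds, while the quoted check works in whole seconds.)

    /* Hypothetical: reuse deadlock_timeout as the SIGQUIT-to-SIGKILL grace period. */
    if (AbortStartTime > 0 &&
        (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
        (now - AbortStartTime) * 1000 >= DeadlockTimeout)
    {
        TerminateChildren(SIGKILL);
        AbortStartTime = 0;     /* send SIGKILL only once */
    }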
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
MauMau escribió: > From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > >Actually, in further testing I noticed that the fast-path you introduced > >in BackendCleanup (or was it HandleChildCrash?) in the immediate > >shutdown case caused postmaster to fail to clean up properly after > >sending the SIGKILL signal, so I had to remove that as well. Was there > >any reason for that? > > You are talking about the code below at the beginning of > HandleChildCrash(), aren't you? > > /* Do nothing if the child terminated due to immediate shutdown */ > if (Shutdown == ImmediateShutdown) > return; > > If my memory is correct, this was necessary; without this, > HandleChildCrash() or LogChildExit() left extra messages in the > server log. > > LOG: %s (PID %d) exited with exit code %d > LOG: terminating any other active server processes > > These messages are output because postmaster sends SIGQUIT to its > children and wait for them to terminate. The children exit with > non-zero status when they receive SIGQUIT. Yeah, I see that --- after removing that early exit, there are unwanted messages. And in fact there are some signals sent that weren't previously sent. Clearly we need something here: if we're in immediate shutdown handler, don't signal anyone (because they have already been signalled) and don't log any more messages; but the cleaning up of postmaster's process list must still be carried out. Would you please add that on top of the attached cleaned up version of your patch? Noah Misch escribió: > On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote: > > the clients at the immediate shutdown. The code gets much simpler. In > > addition, it may be better that we similarly send SIGKILL in backend > > crash (FatalError) case, eliminate the use of SIGQUIT and remove > > quickdie() and other SIGQUIT handlers. > > My take is that the client message has some value, and we shouldn't just > discard it to simplify the code slightly. Finishing the shutdown quickly has > value, of course. The relative importance of those things should guide the > choice of a timeout under method #2, but I don't see a rigorous way to draw > the line. I feel five seconds is, if anything, too high. There's obviously a lot of disagreement on 5 seconds being too high or too low. We have just followed SysV init's precedent of waiting 5 seconds by default between sending signals TERM and QUIT during a shutdown. I will note that during a normal shutdown, services are entitled to do much more work than just signal all their children to exit immediately; and yet I don't find much evidence that this period is inordinately short. I don't feel strongly that it couldn't be shorter, though. > What about using deadlock_timeout? It constitutes a rough barometer on the > system's tolerance for failure detection latency, and we already overload it > by having it guide log_lock_waits. The default of 1s makes sense to me for > this purpose, too. We can always add a separate immediate_shutdown_timeout if > there's demand, but I don't expect such demand. (If we did add such a GUC, > setting it to zero could be the way to request method 1. If some folks > strongly desire method 1, that's probably the way to offer it.) I dunno. Having this be configurable seems overkill to me. But perhaps that's the way to satisfy most people: some people could set it very high so that they could have postmaster wait longer if they believe their server is going to be overloaded; people wishing immediate SIGKILL could get that too, as you describe. 
I think this should be a separate patch, however. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
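(For illustration, the guard being asked for might look roughly like this near the top of HandleChildCrash(); an editorial sketch of the idea only, not the patch that was eventually committed.)

    /*
     * Sketch only: under an immediate shutdown the children have already
     * been signalled, so skip re-signalling and the extra log messages,
     * but still fall through to the cleanup of postmaster's process lists.
     */
    bool        take_action = (Shutdown != ImmediateShutdown);

    if (take_action)
    {
        LogChildExit(LOG, procname, pid, exitstatus);
        ereport(LOG,
                (errmsg("terminating any other active server processes")));
    }
    /* ... the existing cleanup of BackendList etc. continues unconditionally,
     *     signalling other children only when take_action is true ... */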
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > Yeah, I see that --- after removing that early exit, there are unwanted > messages. And in fact there are some signals sent that weren't > previously sent. Clearly we need something here: if we're in immediate > shutdown handler, don't signal anyone (because they have already been > signalled) and don't log any more messages; but the cleaning up of > postmaster's process list must still be carried out. > > Would you please add that on top of the attached cleaned up version of > your patch? Thanks. I'll do that tomorrow at the earliest. > Noah Misch escribió: >> On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote: > >> > the clients at the immediate shutdown. The code gets much simpler. In >> > addition, it may be better that we similarly send SIGKILL in backend >> > crash (FatalError) case, eliminate the use of SIGQUIT and remove >> > quickdie() and other SIGQUIT handlers. >> >> My take is that the client message has some value, and we shouldn't just >> discard it to simplify the code slightly. Finishing the shutdown quickly >> has >> value, of course. The relative importance of those things should guide >> the >> choice of a timeout under method #2, but I don't see a rigorous way to >> draw >> the line. I feel five seconds is, if anything, too high. > > There's obviously a lot of disagreement on 5 seconds being too high or > too low. We have just followed SysV init's precedent of waiting 5 > seconds by default between sending signals TERM and QUIT during a > shutdown. I will note that during a normal shutdown, services are > entitled to do much more work than just signal all their children to > exit immediately; and yet I don't find much evidence that this period is > inordinately short. I don't feel strongly that it couldn't be shorter, > though. > >> What about using deadlock_timeout? It constitutes a rough barometer on >> the >> system's tolerance for failure detection latency, and we already overload >> it >> by having it guide log_lock_waits. The default of 1s makes sense to me >> for >> this purpose, too. We can always add a separate >> immediate_shutdown_timeout if >> there's demand, but I don't expect such demand. (If we did add such a >> GUC, >> setting it to zero could be the way to request method 1. If some folks >> strongly desire method 1, that's probably the way to offer it.) > > I dunno. Having this be configurable seems overkill to me. But perhaps > that's the way to satisfy most people: some people could set it very > high so that they could have postmaster wait longer if they believe > their server is going to be overloaded; people wishing immediate SIGKILL > could get that too, as you describe. > > I think this should be a separate patch, however. I think so, too. We can add a parameter later if we find it highly necessary after some experience in the field. Regards MauMau
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
"MauMau"
Date:
Hi, Alvaro san, From: "Alvaro Herrera" <alvherre@2ndquadrant.com> > MauMau escribió: > Yeah, I see that --- after removing that early exit, there are unwanted > messages. And in fact there are some signals sent that weren't > previously sent. Clearly we need something here: if we're in immediate > shutdown handler, don't signal anyone (because they have already been > signalled) and don't log any more messages; but the cleaning up of > postmaster's process list must still be carried out. > > Would you please add that on top of the attached cleaned up version of > your patch? I did this. Please find attached the revised patch. I modified HandleChildCrash(). I tested the immediate shutdown, and the child cleanup succeeded. In addition, I added an if condition at the end of the function. This is to prevent resetting AbortStartTime every time one child terminates. Regards MauMau
Attachment
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Alvaro Herrera
Date:
MauMau escribió: Hi, > I did this. Please find attached the revised patch. I modified > HandleChildCrash(). I tested the immediate shutdown, and the child > cleanup succeeded. Thanks, committed. There are two matters pending here: 1. do we want postmaster to exit immediately after sending the SIGKILL, or hang around until all children are gone? 2. how far do we want to backpatch this, if at all? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
From
Robert Haas
Date:
On Fri, Jun 28, 2013 at 6:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > MauMau escribió: > > Hi, > >> I did this. Please find attached the revised patch. I modified >> HandleChildCrash(). I tested the immediate shutdown, and the child >> cleanup succeeded. > > Thanks, committed. > > There are two matters pending here: > > 1. do we want postmaster to exit immediately after sending the SIGKILL, > or hang around until all children are gone? > > 2. how far do we want to backpatch this, if at all? Considering that neither Tom nor I was convinced that this was a good idea AT ALL, I'm surprised you committed it in the first place. I'd be more inclined to revert it than back-patch it, but at least if we only change it in HEAD we have some chance of finding out whether or not it is in fact a bad idea before it's too late to change our mind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company