Thread: 7.0.2 dies when connection dropped mid-transaction
I have a program that does a:

    DECLARE getsitescursor CURSOR FOR select...

I ^C'd it and it didn't properly shut down the channel to
postgresql and I got this crash:

#0  0x4828ffc8 in kill () from /usr/lib/libc.so.4
#1  0x482cbbf2 in abort () from /usr/lib/libc.so.4
#2  0x814442f in ExcAbort () at excabort.c:27
#3  0x81443ae in ExcUnCaught (excP=0x81a6070, detail=0, data=0x0,
    message=0x819a860 "!(AllocSetContains(set, pointer))") at exc.c:170
#4  0x81443f5 in ExcRaise (excP=0x81a6070, detail=0, data=0x0,
    message=0x819a860 "!(AllocSetContains(set, pointer))") at exc.c:187
#5  0x8143ae4 in ExceptionalCondition (
    conditionName=0x819a860 "!(AllocSetContains(set, pointer))",
    exceptionP=0x81a6070, detail=0x0, fileName=0x819a720 "aset.c",
    lineNumber=392) at assert.c:73
#6  0x8147897 in AllocSetFree (set=0x8465134,
    pointer=0x84e0018 "<hashtable 1>") at aset.c:392
#7  0x8148394 in PortalVariableMemoryFree (this=0x846512c,
    pointer=0x84e0018 "<hashtable 1>") at portalmem.c:204
#8  0x8147e99 in MemoryContextFree (context=0x846512c,
    pointer=0x84e0018 "<hashtable 1>") at mcxt.c:245
#9  0x81490e5 in PortalDrop (portalP=0x8467944) at portalmem.c:802
#10 0x8148715 in CollectNamedPortals (portalP=0x0, destroy=1)
    at portalmem.c:442
#11 0x814880f in AtEOXact_portals () at portalmem.c:472
#12 0x80870ad in AbortTransaction () at xact.c:1053
#13 0x80872ec in AbortOutOfAnyTransaction () at xact.c:1552
#14 0x810b3d0 in PostgresMain (argc=9, argv=0xbfbff0e0, real_argc=10,
    real_argv=0xbfbffb40) at postgres.c:1643
#15 0x80f0736 in DoBackend (port=0x8464000) at postmaster.c:2009
#16 0x80f02c9 in BackendStartup (port=0x8464000) at postmaster.c:1776
#17 0x80ef4ed in ServerLoop () at postmaster.c:1037
#18 0x80eeed2 in PostmasterMain (argc=10, argv=0xbfbffb40)
    at postmaster.c:725
#19 0x80bf3df in main (argc=10, argv=0xbfbffb40) at main.c:93
#20 0x8063495 in _start ()

Things go to pot here (aset.c):

387     {
388         AllocChunk  chunk;
389
390         /* AssertArg(AllocSetIsValid(set)); */
391         /* AssertArg(AllocPointerIsValid(pointer)); */
392         AssertArg(AllocSetContains(set, pointer));
393
394         chunk = AllocPointerGetChunk(pointer);
395
396     #ifdef CLOBBER_FREED_MEMORY

(gdb) print *set
$2 = {blocks = 0x0, freelist = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) print pointer
$3 = 0x84e0018 "<hashtable 1>"

These sources are the current CVS sources with the exception of
some removed files by Marc.

Is there any more information I can provide?

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
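The gdb output is the telling detail: blocks = 0x0 means the allocation
set has already been emptied by the time AllocSetFree() is asked to
release the "<hashtable 1>" pointer, so the containment assertion at
aset.c:392 cannot pass. Below is a minimal, self-contained C sketch of
that failure shape -- a toy allocator, emphatically not the real aset.c;
every name in it (ToySet, ToyBlock, toy_*) is invented for illustration:

/*
 * toy_aset.c -- toy model of the aset.c:392 assertion failure.
 * NOT PostgreSQL code; all names are invented.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyBlock {
    struct ToyBlock *next;
    char            *data;
} ToyBlock;

typedef struct {
    ToyBlock *blocks;               /* like AllocSet's block list */
} ToySet;

/* analogous to AllocSetContains(): is ptr inside one of the set's blocks? */
static int toy_contains(ToySet *set, const char *ptr)
{
    for (ToyBlock *b = set->blocks; b != NULL; b = b->next)
        if (b->data == ptr)
            return 1;
    return 0;
}

static char *toy_alloc(ToySet *set, size_t size)
{
    ToyBlock *b = malloc(sizeof(ToyBlock));
    b->data = malloc(size);
    b->next = set->blocks;
    set->blocks = b;
    return b->data;
}

/* tear down the whole set, leaving blocks == 0x0 as in the gdb dump */
static void toy_reset(ToySet *set)
{
    while (set->blocks != NULL) {
        ToyBlock *b = set->blocks;
        set->blocks = b->next;
        free(b->data);
        free(b);
    }
}

/* analogous to AllocSetFree(): assert containment, then release */
static void toy_free(ToySet *set, char *ptr)
{
    assert(toy_contains(set, ptr));     /* the aset.c:392 AssertArg */
    for (ToyBlock **bp = &set->blocks; *bp != NULL; bp = &(*bp)->next) {
        if ((*bp)->data == ptr) {
            ToyBlock *b = *bp;
            *bp = b->next;
            free(b->data);
            free(b);
            return;
        }
    }
}

int main(void)
{
    ToySet set = { NULL };
    char *p = toy_alloc(&set, 16);
    strcpy(p, "<hashtable 1>");
    toy_reset(&set);    /* the set is torn down first ...          */
    toy_free(&set, p);  /* ... then someone frees p anyway: the    */
    return 0;           /* assert aborts, like frame #6 above      */
}

Compiled with assertions enabled, the final toy_free() aborts the same
way frame #6 of the backtrace does: a pointer is handed back to a set
that no longer contains it.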
Alfred Perlstein <bright@wintelcom.net> writes:
> I have a program that does a:
> DECLARE getsitescursor CURSOR FOR select...
> I ^C'd it and it didn't properly shut down the channel to
> postgresql and I got this crash:
> ...
> These sources are the current CVS sources with the exception of
> some removed files by Marc.

I tried this on my copy of 7.0.3:

test7=# begin; declare c cursor for select * from foo;
BEGIN
SELECT
test7=# fetch 1 from c;
 f1
----
  1
(1 row)

[kill -9 on the psql process from another window]

test7=# Killed

The postmaster log shows

pq_recvbuf: unexpected EOF on client connection

and no sign of a crash.  So there's more to this than just killing
a client that has a cursor.  Can you provide a more complete example?

                        regards, tom lane
* Alfred Perlstein <bright@wintelcom.net> [001109 17:07] wrote:
> I have a program that does a:
> DECLARE getsitescursor CURSOR FOR select...
>
> I ^C'd it and it didn't properly shut down the channel to
> postgresql and I got this crash:

[snip]

> These sources are the current CVS sources with the exception of
> some removed files by Marc.
>
> Is there any more information I can provide?

I forgot to mention, this is the latest REL7_0_PATCHES.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
I said:
> So there's more to this than just killing a client that has a cursor.

OK, after digging some more, it seems that the critical requirement
is that the cursor's query contain a hash join.  I've been able to
reproduce a crash here...

                        regards, tom lane
I said:
> OK, after digging some more, it seems that the critical requirement
> is that the cursor's query contain a hash join.

Here's the deal:

test7=# set enable_mergejoin to off;
SET VARIABLE
test7=# begin;
BEGIN
-- I've previously checked that this produces a hash join plan:
test7=# declare c cursor for select * from foo t1, foo t2 where t1.f1=t2.f1;
SELECT
test7=# fetch 1 from c;
 f1 | f1
----+----
  1 |  1
(1 row)

test7=# abort;
NOTICE:  trying to delete portal name that does not exist.
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.

This happens with either 7.0.2 or 7.0.3 (probably with anything back to
6.5, if not before).  It does *not* happen with current development tip.

The problem is that two "portal" structures are used.  One holds the
overall query plan and execution state for the cursor, and the other
holds the hash table for the hash join.  During abort, the portal
manager tries to delete both of them.  BUT: deleting the query plan
causes query cleanup to be executed, which among other things deletes
the hash join's table.  Then the portal manager tries to delete the
already-deleted second portal, which leads first to the above notice
and then to Assert failure (and probably would lead to coredump if
you didn't have Asserts on).  Alternatively, it might try to delete
the hash join portal first, which would leave the query cleanup code
deleting an already-deleted portal, and doubtless still crashing.

Current sources don't show the problem because hashtables aren't kept
in portals anymore.

I've thought for some time that CollectNamedPortals is a horrid kluge,
and really ought to be rewritten.  Hadn't seen it actually do the wrong
thing before, but now...

I guess the immediate question is do we want to hold up 7.0.3 release
for a fix?  This bug is clearly ancient, so I'm not sure it's
appropriate to go through a fire drill to fix it for 7.0.3.
Comments?

                        regards, tom lane
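To make the ordering hazard concrete, here is an illustrative toy model
in C -- emphatically not portalmem.c, and not the actual 7.0.3 patch;
every name (Portal, portal_drop, the named[] table) is invented.  It
models a cursor portal whose cleanup recursively drops a dependent
hash-join portal while an abort-time sweep walks the name table; the
registration check in portal_drop() is one plausible shape of a guard
that makes the second drop a no-op:

/*
 * toy_portal.c -- toy model of the double-drop hazard described above.
 * NOT PostgreSQL code; all names are invented.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PORTALS 8

typedef struct Portal {
    char name[32];
    struct Portal *dependent;   /* e.g. the hash-join portal owned by
                                   a cursor's query plan */
} Portal;

static Portal *named[MAX_PORTALS];  /* toy stand-in for the name table */

static Portal *portal_create(const char *name, Portal *dependent)
{
    Portal *p = calloc(1, sizeof(Portal));
    strncpy(p->name, name, sizeof(p->name) - 1);
    p->dependent = dependent;
    for (int i = 0; i < MAX_PORTALS; i++) {
        if (named[i] == NULL) {
            named[i] = p;
            break;
        }
    }
    return p;
}

static int portal_registered(const Portal *p)
{
    for (int i = 0; i < MAX_PORTALS; i++)
        if (named[i] == p)
            return 1;
    return 0;
}

static void portal_unregister(Portal *p)
{
    for (int i = 0; i < MAX_PORTALS; i++)
        if (named[i] == p)
            named[i] = NULL;
}

static void portal_drop(Portal *p)
{
    /*
     * The guard: a drop of an already-removed portal is a no-op.
     * Without it, the second drop is exactly the use-after-free in
     * the backtrace.  (The toy only compares the stale pointer value
     * against the table, never dereferences it; the real backend did
     * follow the dangling reference, hence the crash.)
     */
    if (!portal_registered(p))
        return;
    portal_unregister(p);
    if (p->dependent != NULL)       /* query cleanup recursively drops */
        portal_drop(p->dependent);  /* the hash-join portal            */
    free(p);
}

int main(void)
{
    Portal *hashjoin = portal_create("<hashtable 1>", NULL);
    Portal *cursor   = portal_create("getsitescursor", hashjoin);
    (void) cursor;

    /* AtEOXact-style sweep over all named portals during abort; here
     * the hash-join portal happens to be dropped first, then the
     * cursor's cleanup tries to drop it again -- the "alternatively"
     * ordering above.  The guard absorbs either ordering. */
    for (int i = 0; i < MAX_PORTALS; i++)
        if (named[i] != NULL)
            portal_drop(named[i]);

    puts("abort cleanup finished without a double drop");
    return 0;
}

In the real backend, the sweep and the recursive query cleanup each
assumed sole responsibility for the hash-join portal; as noted above,
the development sources sidestep the whole problem by not keeping
hashtables in portals at all.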
* Tom Lane <tgl@sss.pgh.pa.us> [001109 18:30] wrote:
> I said:
> > OK, after digging some more, it seems that the critical requirement
> > is that the cursor's query contain a hash join.
>
> Here's the deal:

[snip]

> I guess the immediate question is do we want to hold up 7.0.3 release
> for a fix?  This bug is clearly ancient, so I'm not sure it's
> appropriate to go through a fire drill to fix it for 7.0.3.
> Comments?

I dunno, having the database crash because an errant client
disconnected without shutting down, or needed to abort a transaction,
looks like a show stopper.

We do track CVS and wouldn't have a problem shifting to 7_0_3_PATCHES,
but I'm not sure the rest of the userbase is going to have much fun.

It seems to be a serious problem; I think people wouldn't mind waiting
for you to squash this one.

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
> I guess the immediate question is do we want to hold up 7.0.3 release
> for a fix?  This bug is clearly ancient, so I'm not sure it's
> appropriate to go through a fire drill to fix it for 7.0.3.
> Comments?

We have delayed 7.0.3 already.  Tom is fixing so many bugs, we may find
at some point that Tom never stops fixing bugs long enough for us to do
a release.  I say let's push 7.0.3 out.  We can always do 7.0.4 later
if we wish.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 853-3000
  + If your life is a hard drive,   |  830 Blythe Avenue
  + Christ can be your backup.      |  Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001109 18:55] wrote:
> > I guess the immediate question is do we want to hold up 7.0.3 release
> > for a fix?  This bug is clearly ancient, so I'm not sure it's
> > appropriate to go through a fire drill to fix it for 7.0.3.
> > Comments?
>
> We have delayed 7.0.3 already.  Tom is fixing so many bugs, we may find
> at some point that Tom never stops fixing bugs long enough for us to do
> a release.  I say let's push 7.0.3 out.  We can always do 7.0.4 later
> if we wish.

I think being able to crash the backend just by dropping a connection
during a pretty trivial query is a bad thing, and it'd be more prudent
to wait.  I have no problem syncing with you guys' CVS, but people
using Red Hat RPMs and FreeBSD packages are going to wind up with this
bug if you cut the release before squashing it. :(

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
On Thu, 9 Nov 2000, Alfred Perlstein wrote:

> * Bruce Momjian <pgman@candle.pha.pa.us> [001109 18:55] wrote:
> > We have delayed 7.0.3 already.  I say let's push 7.0.3 out.  We can
> > always do 7.0.4 later if we wish.

[snip]

> I think being able to crash the backend just by dropping a connection
> during a pretty trivial query is a bad thing, and it'd be more prudent
> to wait.  I have no problem syncing with you guys' CVS, but people
> using Red Hat RPMs and FreeBSD packages are going to wind up with this
> bug if you cut the release before squashing it. :(

I'm going to fall in behind Alfred on this one ... something this easy
to reproduce is a show stopper ...

Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
please do ... else, I'll announce 7.0.3 on Saturday night and we'll
leave it with such a large showstopper :(
The Hermit Hacker <scrappy@hub.org> writes:
> Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
> please do ... else, I'll announce 7.0.3 on Saturday night and we'll
> leave it with such a large showstopper :(

I do have an idea for a simple stopgap answer --- testing now ...

                        regards, tom lane
The Hermit Hacker <scrappy@hub.org> writes:
> Tom, if you can plug this one in the next, say, 48hrs (Saturday night),

Done.  Want to generate some new 7.0.3 release-candidate tarballs?

                        regards, tom lane
On Thu, 9 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
>
> Done.  Want to generate some new 7.0.3 release-candidate tarballs?

Done, and just forced a sync to ftp.postgresql.org of the new tarballs
... if nobody reports any probs with this by ~midnight tomorrow night,
I'll finish up the 'release links' and get vince to add release info to
the WWW site, followed by putting out an official announcement ...

Great work, as always :)
* The Hermit Hacker <scrappy@hub.org> [001109 20:19] wrote:
> Done, and just forced a sync to ftp.postgresql.org of the new tarballs
> ... if nobody reports any probs with this by ~midnight tomorrow night,
> I'll finish up the 'release links' and get vince to add release info to
> the WWW site, followed by putting out an official announcement ...
>
> Great work, as always :)

Tom rules.

*thinking freebsd port should add user tgl rather than pgsql* :)

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
* The Hermit Hacker <scrappy@hub.org> [001109 20:19] wrote:
> Done, and just forced a sync to ftp.postgresql.org of the new tarballs
> ... if nobody reports any probs with this by ~midnight tomorrow night,
> I'll finish up the 'release links' and get vince to add release info to
> the WWW site, followed by putting out an official announcement ...

Just wanted to confirm that we haven't experienced the bug since we
applied Tom's patch several days ago.

Thanks for the excellent work!

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."