Thread: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
I have a program that does a:
DECLARE getsitescursor CURSOR FOR select...

I ^C'd it and it didn't properly shut down the channel to
postgresql and I got this crash:

#0  0x4828ffc8 in kill () from /usr/lib/libc.so.4
#1  0x482cbbf2 in abort () from /usr/lib/libc.so.4
#2  0x814442f in ExcAbort () at excabort.c:27
#3  0x81443ae in ExcUnCaught (excP=0x81a6070, detail=0, data=0x0,
    message=0x819a860 "!(AllocSetContains(set, pointer))") at exc.c:170
#4  0x81443f5 in ExcRaise (excP=0x81a6070, detail=0, data=0x0,
    message=0x819a860 "!(AllocSetContains(set, pointer))") at exc.c:187
#5  0x8143ae4 in ExceptionalCondition (
    conditionName=0x819a860 "!(AllocSetContains(set, pointer))",
    exceptionP=0x81a6070, detail=0x0, fileName=0x819a720 "aset.c",
    lineNumber=392) at assert.c:73
#6  0x8147897 in AllocSetFree (set=0x8465134,
    pointer=0x84e0018 "<hashtable 1>") at aset.c:392
#7  0x8148394 in PortalVariableMemoryFree (this=0x846512c,
    pointer=0x84e0018 "<hashtable 1>") at portalmem.c:204
#8  0x8147e99 in MemoryContextFree (context=0x846512c,
    pointer=0x84e0018 "<hashtable 1>") at mcxt.c:245
#9  0x81490e5 in PortalDrop (portalP=0x8467944) at portalmem.c:802
#10 0x8148715 in CollectNamedPortals (portalP=0x0, destroy=1)
    at portalmem.c:442
#11 0x814880f in AtEOXact_portals () at portalmem.c:472
#12 0x80870ad in AbortTransaction () at xact.c:1053
#13 0x80872ec in AbortOutOfAnyTransaction () at xact.c:1552
#14 0x810b3d0 in PostgresMain (argc=9, argv=0xbfbff0e0, real_argc=10,
    real_argv=0xbfbffb40) at postgres.c:1643
#15 0x80f0736 in DoBackend (port=0x8464000) at postmaster.c:2009
#16 0x80f02c9 in BackendStartup (port=0x8464000) at postmaster.c:1776
#17 0x80ef4ed in ServerLoop () at postmaster.c:1037
#18 0x80eeed2 in PostmasterMain (argc=10, argv=0xbfbffb40) at postmaster.c:725
#19 0x80bf3df in main (argc=10, argv=0xbfbffb40) at main.c:93
#20 0x8063495 in _start ()

things go to pot here:
387     {
388             AllocChunk      chunk;
389     
390             /* AssertArg(AllocSetIsValid(set)); */
391             /* AssertArg(AllocPointerIsValid(pointer)); */
392             AssertArg(AllocSetContains(set, pointer));
393     
394             chunk = AllocPointerGetChunk(pointer);
395     
396     #ifdef CLOBBER_FREED_MEMORY
(gdb) print *set
$2 = {blocks = 0x0, freelist = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) print pointer
$3 = 0x84e0018 "<hashtable 1>"

These sources are the current CVS sources with the exception of
some removed files by Marc.

Is there any more information I can provide?

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: 7.0.2 dies when connection dropped mid-transaction

From
Tom Lane
Date:
Alfred Perlstein <bright@wintelcom.net> writes:
> I have a program that does a:
> DECLARE getsitescursor CURSOR FOR select...
> I ^C'd it and it didn't properly shut down the channel to
> postgresql and I got this crash:
> ...
> These sources are the current CVS sources with the exception of
> some removed files by Marc.

I tried this on my copy of 7.0.3:

test7=# begin; declare c cursor for select * from foo;
BEGIN
SELECT
test7=# fetch 1 from c;
 f1
----
  1
(1 row)

[kill -9 on the psql process from another window]

test7=# Killed

The postmaster log shows

pq_recvbuf: unexpected EOF on client connection

and no sign of a crash.  So there's more to this than just killing
a client that has a cursor.  Can you provide a more complete example?
        regards, tom lane


Re: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
* Alfred Perlstein <bright@wintelcom.net> [001109 17:07] wrote:
> I have a program that does a:
> DECLARE getsitescursor CURSOR FOR select...
> 
> I ^C'd it and it didn't properly shut down the channel to
> postgresql and I got this crash:

[snip]

> These sources are the current CVS sources with the exception of
> some removed files by Marc.
> 
> Is there any more information I can provide?

I forgot to mention, this is the latest REL7_0_PATCHES.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: 7.0.2 dies when connection dropped mid-transaction

From
Tom Lane
Date:
I said:
> So there's more to this than just killing
> a client that has a cursor.

OK, after digging some more, it seems that the critical requirement
is that the cursor's query contain a hash join.  I've been able to
reproduce a crash here...
        regards, tom lane


Re: 7.0.2 dies when connection dropped mid-transaction

From
Tom Lane
Date:
I said:
> OK, after digging some more, it seems that the critical requirement
> is that the cursor's query contain a hash join.

Here's the deal:

test7=# set enable_mergejoin to off;
SET VARIABLE
test7=# begin;
BEGIN
-- I've previously checked that this produces a hash join plan:
test7=# declare c cursor for select * from foo t1, foo t2 where t1.f1=t2.f1;
SELECT
test7=# fetch 1 from c;
 f1 | f1
----+----
  1 |  1
(1 row)

test7=# abort;
NOTICE:  trying to delete portal name that does not exist.
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.

This happens with either 7.0.2 or 7.0.3 (probably with anything back to
6.5, if not before).  It does *not* happen with current development tip.

The problem is that two "portal" structures are used.  One holds the
overall query plan and execution state for the cursor, and the other
holds the hash table for the hash join.  During abort, the portal
manager tries to delete both of them.  BUT: deleting the query plan
causes query cleanup to be executed, which among other things deletes
the hash join's table.  Then the portal manager tries to delete the
already-deleted second portal, which leads first to the above notice
and then to Assert failure (and probably would lead to coredump if
you didn't have Asserts on).  Alternatively, it might try to delete
the hash join portal first, which would leave the query cleanup code
deleting an already-deleted portal, and doubtless still crashing.
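
To illustrate the double delete described above, here is a minimal sketch in C. It is a toy model and is not taken from the PostgreSQL sources: `ToyPortal`, `toy_portal_create`, `toy_portal_drop`, and `toy_drop_all` are hypothetical names, and the `live` flag is just one generic way to make a second delete harmless. It is not a description of whatever fix goes into 7.0.x.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical names, not PostgreSQL's portalmem.c API. */
#define MAX_PORTALS 8

typedef struct ToyPortal {
    char name[32];
    bool live;                  /* still registered with the manager? */
    struct ToyPortal *child;    /* e.g. the portal holding the hash table */
    void *memory;               /* storage owned by this portal */
} ToyPortal;

static ToyPortal registry[MAX_PORTALS];

/* Register a portal in the first free slot; returns NULL if none is free. */
ToyPortal *toy_portal_create(const char *name, ToyPortal *child)
{
    for (int i = 0; i < MAX_PORTALS; i++) {
        ToyPortal *p = &registry[i];
        if (p->live)
            continue;
        p->memory = malloc(64);
        if (p->memory == NULL)
            return NULL;
        strncpy(p->name, name, sizeof p->name - 1);
        p->name[sizeof p->name - 1] = '\0';
        p->child = child;
        p->live = true;
        return p;
    }
    return NULL;
}

/* Drop one portal; returns how many portals were actually freed. */
int toy_portal_drop(ToyPortal *p)
{
    if (p == NULL || !p->live)
        return 0;               /* already deleted: skip, do not free twice */
    assert(p->memory != NULL);  /* a live portal always owns storage */
    p->live = false;

    /* "Query cleanup": dropping the plan portal also drops its hash table. */
    int freed = 1 + toy_portal_drop(p->child);
    p->child = NULL;

    free(p->memory);
    p->memory = NULL;
    return freed;
}

/* End-of-transaction sweep. Liveness is re-checked for every entry, so a
 * portal that an earlier drop already removed is simply skipped. */
int toy_drop_all(void)
{
    int freed = 0;
    for (int i = 0; i < MAX_PORTALS; i++)
        freed += toy_portal_drop(&registry[i]);
    return freed;
}
```

In this model, creating a hash-table portal and a cursor portal that points at it, then calling `toy_drop_all()`, frees each portal exactly once in either deletion order. Without the `live` check, the second visit to the hash-table portal would free its memory again, which is the failure mode in the backtrace.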

Current sources don't show the problem because hashtables aren't kept
in portals anymore.

I've thought for some time that CollectNamedPortals is a horrid kluge,
and really ought to be rewritten.  Hadn't seen it actually do the wrong
thing before, but now...

I guess the immediate question is do we want to hold up 7.0.3 release
for a fix?  This bug is clearly ancient, so I'm not sure it's
appropriate to go through a fire drill to fix it for 7.0.3.
Comments?
        regards, tom lane


Re: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [001109 18:30] wrote:
> I said:
> > OK, after digging some more, it seems that the critical requirement
> > is that the cursor's query contain a hash join.
> 
> Here's the deal:
> 
> test7=# set enable_mergejoin to off;
> SET VARIABLE
> test7=# begin;
> BEGIN
> -- I've previously checked that this produces a hash join plan:
> test7=# declare c cursor for select * from foo t1, foo t2 where t1.f1=t2.f1;
> SELECT
> test7=# fetch 1 from c;
>  f1 | f1
> ----+----
>   1 |  1
> (1 row)
> 
> test7=# abort;
> NOTICE:  trying to delete portal name that does not exist.
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> 
> This happens with either 7.0.2 or 7.0.3 (probably with anything back to
> 6.5, if not before).  It does *not* happen with current development tip.
> 
> The problem is that two "portal" structures are used.  One holds the
> overall query plan and execution state for the cursor, and the other
> holds the hash table for the hash join.  During abort, the portal
> manager tries to delete both of them.  BUT: deleting the query plan
> causes query cleanup to be executed, which among other things deletes
> the hash join's table.  Then the portal manager tries to delete the
> already-deleted second portal, which leads first to the above notice
> and then to Assert failure (and probably would lead to coredump if
> you didn't have Asserts on).  Alternatively, it might try to delete
> the hash join portal first, which would leave the query cleanup code
> deleting an already-deleted portal, and doubtless still crashing.
> 
> Current sources don't show the problem because hashtables aren't kept
> in portals anymore.
> 
> I've thought for some time that CollectNamedPortals is a horrid kluge,
> and really ought to be rewritten.  Hadn't seen it actually do the wrong
> thing before, but now...
> 
> I guess the immediate question is do we want to hold up 7.0.3 release
> for a fix?  This bug is clearly ancient, so I'm not sure it's
> appropriate to go through a fire drill to fix it for 7.0.3.
> Comments?

I dunno, having the database crash because an errant client disconnected
without shutting down, or needed to abort a transaction, looks like
a show stopper.

We do track CVS and wouldn't have a problem shifting to 7_0_3_PATCHES,
but I'm not sure if the rest of the userbase is going to have much
fun.

It seems to be a serious problem; I think people wouldn't mind
waiting for you to squash this one.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: 7.0.2 dies when connection dropped mid-transaction

From
Bruce Momjian
Date:
> I guess the immediate question is do we want to hold up 7.0.3 release
> for a fix?  This bug is clearly ancient, so I'm not sure it's
> appropriate to go through a fire drill to fix it for 7.0.3.
> Comments?

We have delayed 7.0.3 already.  Tom is fixing so many bugs, we may find
at some point that Tom never stops fixing bugs long enough for us to do
a release.  I say let's push 7.0.3 out.  We can always do 7.0.4 later if
we wish.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [001109 18:55] wrote:
> > I guess the immediate question is do we want to hold up 7.0.3 release
> > for a fix?  This bug is clearly ancient, so I'm not sure it's
> > appropriate to go through a fire drill to fix it for 7.0.3.
> > Comments?
> 
> We have delayed 7.0.3 already.  Tom is fixing so many bugs, we may find
> at some point that Tom never stops fixing bugs long enough for us to do
> a release.  I say let's push 7.0.3 out.  We can always do 7.0.4 later if
> we wish.

I think being able to crash the backend by just dropping a connection
during a pretty trivial query is a bad thing and it'd be more
prudent to wait.  I have no problem syncing with you guys' CVS,
but people using Red Hat RPMs and FreeBSD packages are going to wind
up with this bug if you cut the release before squashing it. :(

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]


Re: 7.0.2 dies when connection dropped mid-transaction

From
The Hermit Hacker
Date:
On Thu, 9 Nov 2000, Alfred Perlstein wrote:

> * Bruce Momjian <pgman@candle.pha.pa.us> [001109 18:55] wrote:
> > > I guess the immediate question is do we want to hold up 7.0.3 release
> > > for a fix?  This bug is clearly ancient, so I'm not sure it's
> > > appropriate to go through a fire drill to fix it for 7.0.3.
> > > Comments?
> > 
> > We have delayed 7.0.3 already.  Tom is fixing so many bugs, we may find
> > at some point that Tom never stops fixing bugs long enough for us to do
> > a release.  I say let's push 7.0.3 out.  We can always do 7.0.4 later if
> > we wish.
> 
> I think being able to crash the backend by just dropping a connection
> during a pretty trivial query is a bad thing and it'd be more
> prudent to wait.  I have no problem syncing with you guys' CVS,
> but people using Red Hat RPMs and FreeBSD packages are going to wind
> up with this bug if you cut the release before squashing it. :(

I'm going to fall behind Alfred on this one ... something this easy to
reproduce is a show stopper ...

Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
please do ... else, I'll announce 7.0.3 on Saturday night and we'll leave
it with such a large showstopper :(



Re: 7.0.2 dies when connection dropped mid-transaction

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
> please do ... else, I'll announce 7.0.3 on Saturday night and we'll leave
> it with such a large showstopper :(

I do have an idea for a simple stopgap answer --- testing now ...
        regards, tom lane


Re: 7.0.2 dies when connection dropped mid-transaction

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> Tom, if you can plug this one in the next, say, 48hrs (Saturday night),

Done.  Want to generate some new 7.0.3 release-candidate tarballs?
        regards, tom lane


Re: 7.0.2 dies when connection dropped mid-transaction

From
The Hermit Hacker
Date:
On Thu, 9 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
> 
> Done.  Want to generate some new 7.0.3 release-candidate tarballs?

Done, and just forced a sync to ftp.postgresql.org of the new tarballs
... if nobody reports any probs with this by ~midnight tomorrow night,
I'll finish up the 'release links' and get vince to add release info to
the WWW site, followed by putting out an official announcement ...

Great work, as always :)




Re: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
* The Hermit Hacker <scrappy@hub.org> [001109 20:19] wrote:
> On Thu, 9 Nov 2000, Tom Lane wrote:
> 
> > The Hermit Hacker <scrappy@hub.org> writes:
> > > Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
> > 
> > Done.  Want to generate some new 7.0.3 release-candidate tarballs?
> 
> Done, and just forced a sync to ftp.postgresql.org of the new tarballs
> ... if nobody reports any probs with this by ~midnight tomorrow night,
> I'll finish up the 'release links' and get vince to add release info to
> the WWW site, followed by putting out an official announcement ...
> 
> Great work, as always :)

Tom rules.

*thinking freebsd port should add user tgl rather than pgsql*

:)

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: 7.0.2 dies when connection dropped mid-transaction

From
Alfred Perlstein
Date:
* The Hermit Hacker <scrappy@hub.org> [001109 20:19] wrote:
> On Thu, 9 Nov 2000, Tom Lane wrote:
> 
> > The Hermit Hacker <scrappy@hub.org> writes:
> > > Tom, if you can plug this one in the next, say, 48hrs (Saturday night),
> > 
> > Done.  Want to generate some new 7.0.3 release-candidate tarballs?
> 
> Done, and just forced a sync to ftp.postgresql.org of the new tarballs
> ... if nobody reports any probs with this by ~midnight tomorrow night,
> I'll finish up the 'release links' and get vince to add release info to
> the WWW site, followed by putting out an official announcement ...
> 
> Great work, as always :)

Just wanted to confirm that we haven't experienced the bug since we
applied Tom's patch several days ago.

Thanks for the excellent work!

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."