Re: 7.0.2 dies when connection dropped mid-transaction - Mailing list pgsql-hackers

From Alfred Perlstein
Subject Re: 7.0.2 dies when connection dropped mid-transaction
Msg-id 20001109184324.L11449@fw.wintelcom.net
In response to Re: 7.0.2 dies when connection dropped mid-transaction  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
* Tom Lane <tgl@sss.pgh.pa.us> [001109 18:30] wrote:
> I said:
> > OK, after digging some more, it seems that the critical requirement
> > is that the cursor's query contain a hash join.
> 
> Here's the deal:
> 
> test7=# set enable_mergejoin to off;
> SET VARIABLE
> test7=# begin;
> BEGIN
> -- I've previously checked that this produces a hash join plan:
> test7=# declare c cursor for select * from foo t1, foo t2 where t1.f1=t2.f1;
> SELECT
> test7=# fetch 1 from c;
>  f1 | f1
> ----+----
>   1 |  1
> (1 row)
> 
> test7=# abort;
> NOTICE:  trying to delete portal name that does not exist.
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> 
> This happens with either 7.0.2 or 7.0.3 (probably with anything back to
> 6.5, if not before).  It does *not* happen with current development tip.
> 
> The problem is that two "portal" structures are used.  One holds the
> overall query plan and execution state for the cursor, and the other
> holds the hash table for the hash join.  During abort, the portal
> manager tries to delete both of them.  BUT: deleting the query plan
> causes query cleanup to be executed, which among other things deletes
> the hash join's table.  Then the portal manager tries to delete the
> already-deleted second portal, which leads first to the above notice
> and then to Assert failure (and probably would lead to coredump if
> you didn't have Asserts on).  Alternatively, it might try to delete
> the hash join portal first, which would leave the query cleanup code
> deleting an already-deleted portal, and doubtless still crashing.
> 
> Current sources don't show the problem because hashtables aren't kept
> in portals anymore.
> 
> I've thought for some time that CollectNamedPortals is a horrid kluge,
> and really ought to be rewritten.  Hadn't seen it actually do the wrong
> thing before, but now...
> 
> I guess the immediate question is do we want to hold up 7.0.3 release
> for a fix?  This bug is clearly ancient, so I'm not sure it's
> appropriate to go through a fire drill to fix it for 7.0.3.
> Comments?

I dunno, having the database crash because an errant client disconnected
without shutting down, or needed to abort a transaction, looks like
a show stopper.

We do track CVS and wouldn't have a problem shifting to 7_0_3_PATCHES,
but I'm not sure if the rest of the userbase is going to have much
fun.

It seems to be a serious problem; I think people wouldn't mind
waiting for you to squash this one.
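
To make the failure mode concrete, here is a toy model in C of the double
delete described in the quoted analysis. It is only a sketch under stated
assumptions: Portal, portal_attach, abort_all and the "owned" flag are
made-up names for illustration, not the backend's real portal manager code.
The ownership check is just one possible way to avoid the stale delete; it
is not what current sources do (per the quoted text, they simply stopped
keeping hashtables in portals).

```c
#include <stdio.h>
#include <string.h>

#define MAX_PORTALS 8
#define NAME_LEN 32

/* Hypothetical stand-in for a portal; not the real backend struct. */
typedef struct Portal {
    char name[NAME_LEN];
    struct Portal *child;  /* portal released by this portal's cleanup */
    int owned;             /* nonzero if another portal owns this one */
    int in_use;
} Portal;

static Portal portals[MAX_PORTALS];
static int stale_drops;    /* drops that hit an already-deleted portal */

/* Find a live portal by name, or NULL if it does not exist. */
Portal *portal_lookup(const char *name)
{
    for (int i = 0; i < MAX_PORTALS; i++)
        if (portals[i].in_use && strcmp(portals[i].name, name) == 0)
            return &portals[i];
    return NULL;
}

/* Claim the first free slot; returns NULL if the table is full. */
Portal *portal_create(const char *name)
{
    for (int i = 0; i < MAX_PORTALS; i++) {
        if (!portals[i].in_use) {
            memset(&portals[i], 0, sizeof portals[i]);
            snprintf(portals[i].name, NAME_LEN, "%s", name);
            portals[i].in_use = 1;
            return &portals[i];
        }
    }
    return NULL;
}

/* Record that parent's cleanup is responsible for deleting child. */
void portal_attach(Portal *parent, Portal *child)
{
    parent->child = child;
    child->owned = 1;
}

/* Delete a portal by name; its cleanup also deletes its child.
 * A missing name models the "portal name that does not exist" notice. */
void portal_drop(const char *name)
{
    Portal *p = portal_lookup(name);
    if (p == NULL) {
        stale_drops++;
        return;
    }
    p->in_use = 0;
    if (p->child != NULL)
        portal_drop(p->child->name);
}

/* Abort: snapshot the portal names, then drop each one.
 * With respect_ownership == 0 every portal is dropped directly, so an
 * owned portal gets deleted twice. With respect_ownership != 0 only
 * top-level portals are dropped and cleanup handles the rest.
 * Returns the number of stale (double) deletes seen. */
int abort_all(int respect_ownership)
{
    char names[MAX_PORTALS][NAME_LEN];
    int n = 0;

    stale_drops = 0;
    for (int i = 0; i < MAX_PORTALS; i++) {
        if (!portals[i].in_use)
            continue;
        if (respect_ownership && portals[i].owned)
            continue;
        memcpy(names[n++], portals[i].name, NAME_LEN);
    }
    for (int i = 0; i < n; i++)
        portal_drop(names[i]);
    return stale_drops;
}
```

Create a "cursor" portal and a "hash" portal, attach the second to the first
with portal_attach, and call abort_all(0): it reports one stale delete
whichever portal was created first, matching both orderings in the analysis.
abort_all(1) reports none. The toy only counts the stale delete; in the real
backend that is where the Assert failure or crash happens.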

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

