Thread: Crash on UPDATE in 7.0beta3

Crash on UPDATE in 7.0beta3

From: Magnus Hagander

Hello!

When running multiple concurrent backends on my benchmark database, I see
consistent crashing when running with 8 clients, and sporadic crashes when
running with 4. After the crash, at least one index is broken, so if I run
it again, it will crash on that. When I rebuild my indexes, I get back to
the same crash.
It dies on the same query every time (but only when I have many concurrent
backends - I can run it "alone" from psql without any problem). 

My platform is Linux 2.2.14 running on a Dual Pentium-III 550MHz with 384Mb
RAM.

An example gdb backtrace:

#0  ExecEvalVar (variable=0x8239910, econtext=0x8239fe0, isNull=0x823aac9 "")
    at execQual.c:275
#1  0x80a31ab in ExecEvalExpr (expression=0x8239910, econtext=0x8239fe0,
    isNull=0x823aac9 "", isDone=0xbfffe59f "\001\020\022") at execQual.c:1203
#2  0x80a2d07 in ExecEvalFuncArgs (fcache=0x823ab58, econtext=0x8239fe0,
    argList=0x82398f8, argV=0xbfffe5a0, argIsDone=0xbfffe59f "\001\020\022")
    at execQual.c:634
#3  0x80a2dbb in ExecMakeFunctionResult (node=0x82398a8, arguments=0x82398f8,
    econtext=0x8239fe0, isNull=0xbfffe67e "",
    isDone=0xbfffe61f "\bPæÿ¿\"2\n\b\200\230#\bà\237#\b~æÿ¿`\236\023\bP\231#\bà\237#\b`æÿ¿ \t\017\b")
    at execQual.c:710
#4  0x80a2f2d in ExecEvalOper (opClause=0x8239880, econtext=0x8239fe0,
    isNull=0xbfffe67e "") at execQual.c:896
#5  0x80a3222 in ExecEvalExpr (expression=0x8239880, econtext=0x8239fe0,
    isNull=0xbfffe67e "", isDone=0xbfffe67f "\001àæÿ¿no\n\bP\231#\bà\237#\b")
    at execQual.c:1238
#6  0x80a32ee in ExecQual (qual=0x8239950, econtext=0x8239fe0,
    resultForNull=0 '\000') at execQual.c:1365
#7  0x80a6f6e in IndexNext (node=0x8239320) at nodeIndexscan.c:142
#8  0x80a3741 in ExecScan (node=0x8239320, accessMtd=0x80a6e80 <IndexNext>)
    at execScan.c:103
#9  0x80a7123 in ExecIndexScan (node=0x8239320) at nodeIndexscan.c:287
#10 0x80a1ec9 in ExecProcNode (node=0x8239320, parent=0x8238be0)
    at execProcnode.c:272
#11 0x80a85ab in ExecNestLoop (node=0x8238be0, parent=0x8238be0)
    at nodeNestloop.c:192
#12 0x80a1eda in ExecProcNode (node=0x8238be0, parent=0x8238be0)
    at execProcnode.c:280
#13 0x80a1c00 in EvalPlanQualNext (estate=0x8235ed0) at execMain.c:1980
#14 0x80a1bd3 in EvalPlanQual (estate=0x8235ed0, rti=1, tid=0xbfffe8b8)
    at execMain.c:1966
#15 0x80a13f9 in ExecReplace (slot=0x82364a0, tupleid=0xbfffe928,
    estate=0x8235ed0) at execMain.c:1535
#16 0x80a1071 in ExecutePlan (estate=0x8235ed0, plan=0x8235bb8,
    operation=CMD_UPDATE, offsetTuples=0, numberTuples=0,
    direction=ForwardScanDirection, destfunc=0x8237ad8) at execMain.c:1202
#17 0x80a060e in ExecutorRun (queryDesc=0x8236028, estate=0x8235ed0,
    feature=3, limoffset=0x0, limcount=0x0) at execMain.c:325
#18 0x80fdbb5 in ProcessQueryDesc (queryDesc=0x8236028, limoffset=0x0,
    limcount=0x0) at pquery.c:310
#19 0x80fdc41 in ProcessQuery (parsetree=0x82282a8, plan=0x8235bb8,
    dest=Remote) at pquery.c:353
#20 0x80fc4cc in pg_exec_query_dest (
    query_string=0x81ab248 "UPDATE items SET itemsinstock=itemsinstock-order_items.amount WHERE items.itemid=order_items.itemid AND order_items.orderid=467134",
    dest=Remote, aclOverride=0) at postgres.c:719
#21 0x80fc392 in pg_exec_query (
    query_string=0x81ab248 "UPDATE items SET itemsinstock=itemsinstock-order_items.amount WHERE items.itemid=order_items.itemid AND order_items.orderid=467134")
    at postgres.c:607
#22 0x80fd533 in PostgresMain (argc=8, argv=0xbffff050, real_argc=8,
    real_argv=0xbffffa04) at postgres.c:1642
#23 0x80e5f42 in DoBackend (port=0x81d3d80) at postmaster.c:1953
#24 0x80e5b1a in BackendStartup (port=0x81d3d80) at postmaster.c:1722
#25 0x80e4ce9 in ServerLoop () at postmaster.c:994
#26 0x80e46be in PostmasterMain (argc=8, argv=0xbffffa04) at postmaster.c:700
#27 0x80b1dfb in main (argc=8, argv=0xbffffa04) at main.c:93



The line it crashes on is:
275             heapTuple = slot->val;

And this is because:
(gdb) print slot
$1 = (TupleTableSlot *) 0x0

Input to the switch() statement used to set slot is:
(gdb) print variable->varno
$3 = 65001
(gdb) print econtext->ecxt_scantuple
$5 = (TupleTableSlot *) 0x8238a38
(gdb) print econtext->ecxt_innertuple
$6 = (TupleTableSlot *) 0x0
(gdb) print econtext->ecxt_outertuple
$7 = (TupleTableSlot *) 0x0
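
For reference, the slot-selection switch that uses these looks roughly
like this (paraphrased from execQual.c, not quoted verbatim; INNER and
OUTER are the special varno values 65000 and 65001):

    switch (variable->varno)
    {
        case INNER:             /* 65000: tuple from the inner plan node */
            slot = econtext->ecxt_innertuple;
            break;
        case OUTER:             /* 65001: tuple from the outer plan node */
            slot = econtext->ecxt_outertuple;
            break;
        default:                /* tuple from the relation being scanned */
            slot = econtext->ecxt_scantuple;
            break;
    }

    heapTuple = slot->val;      /* line 275: crashes when slot is NULL */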

It seems it wants to use ecxt_outertuple, but that's NULL. That's about
as far as I can get :-)



Table definitions used:
CREATE TABLE items (
  itemid int not null,
  description varchar(128) not null,
  price int,
  itemsinstock int,
  supplier int not null
);
CREATE UNIQUE INDEX items_id_idx ON items (itemid);
CREATE INDEX items_desc_idx ON items (description);
CREATE INDEX items_supp_idx ON items (supplier);

CREATE TABLE order_items (
  orderid int not null,
  itemid int not null,
  amount int not null
);
CREATE UNIQUE INDEX order_items_both_idx ON order_items (orderid, itemid);
CREATE INDEX order_items_order_idx ON order_items (orderid);
CREATE INDEX order_items_item_idx ON order_items (itemid);


items has ~10,000 rows; order_items has ~22 million.


//Magnus


Re: Crash on UPDATE in 7.0beta3

From: Tom Lane

Magnus Hagander <mha@sollentuna.net> writes:
> When running multiple concurrent backends on my benchmark database, I see
> consistent crashing when running with 8 clients, and sporadic crashes when
> running with 4. After the crash, at least one index is broken, so if I run
> it again, it will crash on that. When I rebuild my indexes, I get back to
> the same crash.
> It dies on the same query every time (but only when I have many concurrent
> backends - I can run it "alone" from psql without any problem). 

What are the other backends doing?

After chasing this logic a little bit, my suspicion is focused on
ExecIndexReScan at about nodeIndexscan.c:342.  If an inner indexscan
is restarted in the context of EvalPlanQual (which, in fact, is exactly
where we are according to the backtrace) then this code returns without
doing much, and in particular without setting the indexscan node's
cs_ExprContext->ecxt_outertuple from the outer plan's value.  Perhaps
the next three lines ought to happen before testing for the PlanQual
case, instead of after (Vadim, what do you think?).  But I don't
understand why having other backends running concurrently would affect
this.
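
To spell out the suspicious ordering, the code there reads roughly like
this (a sketch reconstructed from the description above, not the
verbatim source):

    /* If this is re-scanning of PlanQual ... */
    if (estate->es_evTuple != NULL &&
        estate->es_evTuple[node->scan.scanrelid - 1] != NULL)
    {
        estate->es_evTupleNull[node->scan.scanrelid - 1] = false;
        return;         /* bails out before ecxt_outertuple is set below */
    }

    /* it's possible in subselects */
    if (exprCtxt == NULL)
        exprCtxt = node->scan.scanstate->cstate.cs_ExprContext;

    node->scan.scanstate->cstate.cs_ExprContext->ecxt_outertuple =
        exprCtxt->ecxt_outertuple;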
        regards, tom lane


Re: Crash on UPDATE in 7.0beta3

From: Tom Lane

I said:
> ... But I don't
> understand why having other backends running concurrently would affect
> this.

Yes I do: this entire path of control is only invoked if ExecReplace
discovers that the tuple it's trying to update is already updated by
a concurrent transaction.  Evidently no one's tried running concurrent
UPDATEs where the updates use a nestloop+inner indexscan join plan,
because this path is certain to fail in that case.
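
In outline, the path looks about like this (argument shapes taken from
frames #14 and #15 of the backtrace; the surrounding code is paraphrased
from execMain.c, not quoted verbatim):

    /* inside ExecReplace(slot, tupleid, estate): */
    result = heap_update(resultRelationDesc, tupleid, tuple, &ctid);
    switch (result)
    {
        ...
        case HeapTupleUpdated:
            /*
             * A concurrent transaction updated the row first.  In READ
             * COMMITTED mode we recheck the plan quals against the newest
             * version of the tuple: EvalPlanQual() re-runs the plan, which
             * restarts the inner indexscan via ExecIndexReScan().
             */
            epqslot = EvalPlanQual(estate, rti, &ctid);
            ...
    }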

Magnus, try swapping the code segments in ExecIndexReScan()
(executor/nodeIndexscan.c:341 ff) to become
   /* it's possible in subselects */
   if (exprCtxt == NULL)
       exprCtxt = node->scan.scanstate->cstate.cs_ExprContext;

   node->scan.scanstate->cstate.cs_ExprContext->ecxt_outertuple =
       exprCtxt->ecxt_outertuple;

   /* If this is re-scanning of PlanQual ... */
   if (estate->es_evTuple != NULL &&
       estate->es_evTuple[node->scan.scanrelid - 1] != NULL)
   {
       estate->es_evTupleNull[node->scan.scanrelid - 1] = false;
       return;
   }
 

and see if that makes things more reliable.

It looks like nodeTidscan has the same bug...
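Presumably the same reordering applies there, something like the sketch
below; the field names are assumed to mirror the indexscan case and have
not been checked against the sources:

    /* in ExecTidReScan(): set ecxt_outertuple before the PlanQual test */
    if (exprCtxt == NULL)
        exprCtxt = node->scan.scanstate->cstate.cs_ExprContext;
    node->scan.scanstate->cstate.cs_ExprContext->ecxt_outertuple =
        exprCtxt->ecxt_outertuple;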
        regards, tom lane


RE: Crash on UPDATE in 7.0beta3

From: Magnus Hagander

> I said:
> > ... But I don't
> > understand why having other backends running concurrently would affect
> > this.
> 
> Yes I do: this entire path of control is only invoked if ExecReplace
> discovers that the tuple it's trying to update is already updated by
> a concurrent transaction.  Evidently no one's tried running concurrent
> UPDATEs where the updates use a nestloop+inner indexscan join plan,
> because this path is certain to fail in that case.
> 
> Magnus, try swapping the code segments in ExecIndexReScan()
<snip>

Looks much better - at least it doesn't crash. Instead, I get the messages
below. I don't know if this is caused by the same thing, though, since it
runs into areas I never reached before. But these messages do *not* show up
if I run with fewer backends (or when I run with the old code, which just
crashes).

//Magnus


NOTICE:  Buffer Leak: [3205] (freeNext=-3, freePrev=-3, relname=order_items_order_idx, blockNum=13532, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [3214] (freeNext=-3, freePrev=-3, relname=order_items, blockNum=39804, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [5110] (freeNext=-3, freePrev=-3, relname=order_items_order_idx, blockNum=29437, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [5117] (freeNext=-3, freePrev=-3, relname=order_items, blockNum=86602, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [13184] (freeNext=-3, freePrev=-3, relname=order_items_order_idx, blockNum=2115, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [13191] (freeNext=-3, freePrev=-3, relname=order_items, blockNum=6214, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [4248] (freeNext=-3, freePrev=-3, relname=order_items_order_idx, blockNum=29583, flags=0x4, refcount=1 1)
NOTICE:  Buffer Leak: [4443] (freeNext=-3, freePrev=-3, relname=order_items, blockNum=87032, flags=0x4, refcount=1 1)