Thread: subtransaction assert failure
I'm seeing an intermittent assertion failure while running "make check" with current sources. I can usually reproduce the problem at least once out of every 3 or 4 runs of "make check" on each of two different machines (one is an 866 MHz iBook running OSX, the other is a dual Xeon @ 2.8 GHz running Debian w/ kernel 2.6). The assertion failure is:

TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid, RecentXmin))", File: "/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)

The backtrace is:

#0  0x900429ac in kill ()
#1  0x9009eb1c in abort ()
#2  0x001c3b60 in ExceptionalCondition (conditionName=0x0, errorType=0x25 "", fileName=0x95 "", lineNumber=1768842554) at /Users/neilc/pgsql/src/backend/utils/error/assert.c:51
#3  0x000461c8 in SubTransGetTopmostTransaction (xid=6145) at /Users/neilc/pgsql/src/backend/access/transam/subtrans.c:146
#4  0x001e266c in HeapTupleSatisfiesSnapshot (tuple=0x1801, snapshot=0x1801) at /Users/neilc/pgsql/src/backend/utils/time/tqual.c:798
#5  0x00016330 in heap_release_fetch (relation=0xae2a40, snapshot=0x203ec1c, tuple=0x2048f6c, userbuf=0x2048f84, keep_buf=1 '\001', pgstat_info=0x2048fa8) at /Users/neilc/pgsql/src/backend/access/heap/heapam.c:973
#6  0x0001efa4 in index_getnext (scan=0x1, direction=11414080) at /Users/neilc/pgsql/src/backend/access/index/indexam.c:524
#7  0x000cd420 in IndexNext (node=0x27615c) at /Users/neilc/pgsql/src/backend/executor/nodeIndexscan.c:316
#8  0x000c6f5c in ExecScan (node=0x20489d8, accessMtd=0xcd114 <IndexNext>) at /Users/neilc/pgsql/src/backend/executor/execScan.c:98
#9  0x000c1c4c in ExecProcNode (node=0x20489d8) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:307
#10 0x000ceac8 in ExecMergeJoin (node=0x20489d8) at /Users/neilc/pgsql/src/backend/executor/nodeMergejoin.c:754
#11 0x000c1c88 in ExecProcNode (node=0x2048680) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:330
#12 0x000cabe8 in agg_retrieve_direct (aggstate=0x2048388) at /Users/neilc/pgsql/src/backend/executor/nodeAgg.c:762
#13 0x000c1cc4 in ExecProcNode (node=0x2048388) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:353
#14 0x000d0b84 in ExecSort (node=0x20482fc) at /Users/neilc/pgsql/src/backend/executor/nodeSort.c:102
#15 0x000c1cac in ExecProcNode (node=0x20482fc) at /Users/neilc/pgsql/src/backend/executor/execProcnode.c:345
#16 0x000c0294 in ExecutePlan (estate=0x204801c, planstate=0x20482fc, operation=CMD_SELECT, numberTuples=0, direction=1768842554, dest=0xaf2684) at /Users/neilc/pgsql/src/backend/executor/execMain.c:1060
#17 0x000bf4ac in ExecutorRun (queryDesc=0x203ec48, direction=33849372, count=0) at /Users/neilc/pgsql/src/backend/executor/execMain.c:226
#18 0x001521f8 in PortalRunSelect (portal=0x201f5c8, forward=-4 '?', count=33811528, dest=0x0) at /Users/neilc/pgsql/src/backend/tcop/pquery.c:718
[...]

(captured from the OSX machine; note that certain parts of the backtrace seem corrupted, so I wouldn't put _too_ much faith in it.)

I can consistently repro this, provided I have the patience to run "make check" enough times. Can anyone else repro the problem?

-Neil
On Thu, 16 Sep 2004, Neil Conway wrote:

> I can consistently repro this, provided I have the patience to run "make
> check" enough times. Can anyone else repro the problem?

I've managed to recreate this.

I've tried on two Linux 2.4 systems, a one-way and a four-way. Interestingly, I *cannot* recreate it on the single-CPU system, and I cannot get abort() to generate a core. Neil has found the same problem under Linux.

Gavin
On Thu, 16 Sep 2004, Gavin Sherry wrote:

> Interestingly, I *cannot* recreate on the single CPU system and I cannot

I've been trying to reproduce it on a single-CPU system with Fedora 2, but no luck. If anyone else wants to try, don't forget the configure flag --enable-cassert.

-- 
/Dennis Björklund
I am also seeing regular "make check" failures on my test buildfarm system (FC1, 2.4.22 kernel, cassert turned on):

 sysname |      snapshot       | status | stage
---------+---------------------+--------+-------
 cat     | 2004-09-15 15:25:59 |      2 | Check
 cat     | 2004-09-15 23:27:37 |      0 | OK
 cat     | 2004-09-16 00:25:55 |      2 | Check

cheers

andrew

Neil Conway wrote:

> I'm seeing an intermittent assertion failure while running "make
> check" with current sources. [...] The assertion failure is:
>
> TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid,
> RecentXmin))", File:
> "/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)
>
> [...]
>
> I can consistently repro this, provided I have the patience to run
> "make check" enough times. Can anyone else repro the problem?
On Thu, 16 Sep 2004, Gavin Sherry wrote:

> On Thu, 16 Sep 2004, Neil Conway wrote:
>
> > I can consistently repro this, provided I have the patience to run "make
> > check" enough times. Can anyone else repro the problem?
>
> I've managed to recreate this.
>
> I've tried on two linux 2.4 systems. One way and four way systems.
> Interestingly, I *cannot* recreate on the single CPU system and I cannot
> get abort() to generate a core. Neil has found the same problem under
> Linux.

Attached is a bt which does not appear to be corrupted:

#0  0x080adb70 in SubTransGetTopmostTransaction (xid=6384) at subtrans.c:149
#1  0x08201984 in HeapTupleSatisfiesSnapshot (tuple=0x4074eba8, snapshot=0x8388908) at tqual.c:798
#2  0x08083f17 in heapgettup (relation=0x40b17ba0, dir=1, tuple=0x83c09ec, buffer=0x83c0a04, snapshot=0x8388908, nkeys=0, key=0x0, pages=2) at heapam.c:312
#3  0x08084db2 in heap_getnext (scan=0x83c09d8, direction=ForwardScanDirection) at heapam.c:830
#4  0x0812518a in SeqNext (node=0x0) at nodeSeqscan.c:102
#5  0x0811d664 in ExecScan (node=0x83c0594, accessMtd=0x8125110 <SeqNext>) at execScan.c:98
#6  0x081251af in ExecSeqScan (node=0x83c0594) at nodeSeqscan.c:139
#7  0x0811904e in ExecProcNode (node=0x83c0594) at execProcnode.c:303
#8  0x08117a89 in ExecutePlan (estate=0x83bfd44, planstate=0x83c0594, operation=CMD_UPDATE, numberTuples=0, direction=137262828, dest=0x82a1020) at execMain.c:1060
#9  0x08116eeb in ExecutorRun (queryDesc=0x8388934, direction=ForwardScanDirection, count=0) at execMain.c:226
#10 0x0812abbe in _SPI_pquery (queryDesc=0x8388934, tcount=0) at spi.c:1498
#11 0x0812a978 in _SPI_execute_plan (plan=0x83d620c, Values=0x83840dc, Nulls=0x838448c " ~", '\177' <repeats 197 times>..., snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\0', tcount=0) at spi.c:1427
#12 0x081290b4 in SPI_execute_plan (plan=0x83d620c, Values=0x83840dc, Nulls=0x838448c " ~", '\177' <repeats 197 times>..., read_only=0 '\0', tcount=0) at spi.c:309
#13 0x40b48d69 in exec_stmt_execsql (estate=0xbfffcee0, stmt=0x82e76ec) at pl_exec.c:2155
#14 0x40b47782 in exec_stmt (estate=0xbfffcee0, stmt=0x838f988) at pl_exec.c:1108
#15 0x40b47649 in exec_stmts (estate=0xbfffcee0, stmts=0x838fa20) at pl_exec.c:1024
#16 0x40b47a9c in exec_stmt_if (estate=0xbfffcee0, stmt=0x835dbf0) at pl_exec.c:1260
#17 0x40b476ea in exec_stmt (estate=0xbfffcee0, stmt=0x835dbf0) at pl_exec.c:1068
#18 0x40b47649 in exec_stmts (estate=0xbfffcee0, stmts=0x835dc08) at pl_exec.c:1024
#19 0x40b4751d in exec_stmt_block (estate=0xbfffcee0, block=0x835dcb8) at pl_exec.c:979
#20 0x40b46c24 in plpgsql_exec_trigger (func=0x838fa90, trigdata=0xbfffd0f0) at pl_exec.c:655
#21 0x40b4368a in plpgsql_call_handler (fcinfo=0xbfffcfb0) at pl_handler.c:124
#22 0x08104393 in ExecCallTriggerFunc (trigdata=0xbfffd0f0, finfo=0x8338dd4, per_tuple_context=0x832fc74) at trigger.c:1149
#23 0x08105368 in AfterTriggerExecute (event=0x83d133c, rel=0x1, trigdesc=0x8338bc4, finfo=0x0, per_tuple_context=0x832fc74) at trigger.c:1964
#24 0x08105746 in afterTriggerInvokeEvents (events=0x83d12f0, firing_id=0, delete_ok=1 '\001') at trigger.c:2160
#25 0x08105996 in AfterTriggerEndQuery () at trigger.c:2327
#26 0x0818a758 in ProcessQuery (parsetree=0x8338b78, plan=0x832e0f0, params=0x0, dest=0xbfffd360, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:215
#27 0x0818b5bf in PortalRunMulti (portal=0x83369ac, dest=0x83adf08, altdest=0x83adf08, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:988
#28 0x0818aedf in PortalRun (portal=0x83369ac, count=2147483647, dest=0x83adf08, altdest=0x83adf08, completionTag=0xbfffd360 "UPDATE 1") at pquery.c:601
#29 0x0818794c in exec_simple_query (query_string=0x832c4fc "update PField set name = 'PF0_2' where name = 'PF0_X';") at postgres.c:924
#30 0x08189c1f in PostgresMain (argc=4, argv=0x82e7a80, username=0x82e791c "swm") at postgres.c:2967
#31 0x08161593 in BackendRun (port=0x82f6958) at postmaster.c:2848
#32 0x08161031 in BackendStartup (port=0x82f6958) at postmaster.c:2470
#33 0x0815f717 in ServerLoop () at postmaster.c:1215
#34 0x0815ec66 in PostmasterMain (argc=4, argv=0x82e74b0) at postmaster.c:898
#35 0x081318eb in main (argc=4, argv=0x82e74b0) at main.c:265
#36 0x42015574 in __libc_start_main () from /lib/tls/libc.so.6

The following may also be helpful:

(gdb) print RecentXmin
$2 = 6385

The xid passed to SubTransGetTopmostTransaction() is 6384.

Gavin
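Those two numbers are adjacent, which is exactly why the Assert trips: TransactionIdFollowsOrEquals() does a circular, wraparound-aware comparison, and 6384 does not follow-or-equal 6385. Below is a minimal standalone sketch of roughly that comparison, as an illustration only (the helper names and constants are restated here for the example, not copied from the backend source):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Permanent xids (Invalid=0, Bootstrap=1, Frozen=2) sort below all normal ones. */
#define FirstNormalTransactionId   ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

static int
xid_follows_or_equals(TransactionId id1, TransactionId id2)
{
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 >= id2;                   /* plain compare for special xids */
    return ((int32_t) (id1 - id2)) >= 0;     /* modulo-2^32 circular compare */
}

int
main(void)
{
    /* xid 6384 vs RecentXmin 6385: the signed difference is -1, so this
     * prints 0, and an Assert on the same condition would fire. */
    printf("%d\n", xid_follows_or_equals(6384, 6385));
    return 0;
}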
On Sep 16, 2004, at 8:48 PM, Andrew Dunstan wrote:

> I am also seeing regular "make check" failures on my test buildfarm
> system (FC1, 2.4.22 kernel, cassert turned on):

Mmm :) buildfarm! I like the sound of that! Let us know how it goes, eh?

Michael Glaesemann
grzm myrealbox com
On Thu, 16 Sep 2004, Dennis Bjorklund wrote:

> > Interestingly, I *cannot* recreate on the single CPU system and I cannot
>
> I've been trying to reproduce it on a single cpu system with fedora 2,
> but no luck.

I spoke too soon. After having run it even more times I finally got a failure as well.

-- 
/Dennis Björklund
Gavin Sherry <swm@linuxworld.com.au> writes:
> Interestingly, I *cannot* recreate on the single CPU system and I cannot
> get abort() to generate a core.

By that do you mean that you don't see any corefile in the DB directory when you look after the dust settles?

I ran into the same problem yesterday in another connection, and eventually realized that the corefile is getting removed because of the logic I added recently to do WAL replay of CREATE/DROP DATABASE. The regression test sequence is short enough (on modern machines) that there may not be any checkpoint between its start and the point where you have a crash, so that the initial "CREATE DATABASE regression" operation is still in the range of WAL entries to be replayed. In dbcommands.c it sez:

    /*
     * Our theory for replaying a CREATE is to forcibly drop the
     * target subdirectory if present, then re-copy the source data.
     * This may be more work than needed, but it is simple to
     * implement.
     */

So what's happening is that WAL replay is wiping the database directory (including the core file).

I don't really want to change the CREATE DATABASE replay logic, so I was thinking of suggesting that we hack around this by modifying pg_regress to force a checkpoint right after its CREATE DATABASE. Then any crashes during the regression tests wouldn't cause a replay of the CREATE. This is mighty ugly though :-(

			regards, tom lane
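To make the suggested hack concrete, here is a hedged sketch of what "force a checkpoint right after CREATE DATABASE" amounts to. pg_regress itself is a shell script that drives psql, so this little libpq program is only an illustration of the idea; the connection string, the exec_or_die() helper, and the error handling are assumptions for the example, not part of any actual patch:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

/* Run one command, bail out loudly if it fails (illustrative helper). */
static void
exec_or_die(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "\"%s\" failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        exit(1);
    }
    PQclear(res);
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=template1");   /* assumed connect string */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    exec_or_die(conn, "CREATE DATABASE regression");
    /* the hack: push the CREATE behind a checkpoint, so a later crash
     * during the tests cannot replay it and wipe the directory (and
     * any core file) */
    exec_or_die(conn, "CHECKPOINT");

    PQfinish(conn);
    return 0;
}

In practice the equivalent would just be one extra "CHECKPOINT" issued through psql right after pg_regress creates the regression database.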
Michael Glaesemann wrote:
>
> On Sep 16, 2004, at 8:48 PM, Andrew Dunstan wrote:
>
>> I am also seeing regular "make check" failures on my test buildfarm
>> system (FC1, 2.4.22 kernel, cassert turned on):
>
> Mmm :) buildfarm! I like the sound of that! Let us know how it goes, eh?

As you saw, the actual running of tests and upload of results is functional. It's designed as a distributed farm, somewhat along the lines of what Samba uses. In a few days, when I have the registration and display pieces done, I will be calling for a handful of volunteers to help me test it.

Farm members will need internet access to pull from CVS and to push test results, plus Perl with the LWP, HTTP::Request::Common, MIME::Base64, Digest::SHA1 and File::Find modules (most of these come with a base Perl install these days). The whole thing runs from cron - no user action required. Configuration for a farm member would be trivial.

cheers

andrew
Neil Conway <neilc@samurai.com> writes:
> I'm seeing an intermittent assertion failure while running "make check"
> with current sources.
> TRAP: FailedAssertion("!(TransactionIdFollowsOrEquals(xid,
> RecentXmin))", File:
> "/Users/neilc/pgsql/src/backend/access/transam/subtrans.c", Line: 146)

Hmm, I haven't seen this myself but I see what's causing it --- sloppy thinking about the role of RecentXmin.

We need another variable named something like TransactionXmin, which would be the RecentXmin first computed during the current transaction. (This would be a copy of MyProc->xmin; we could just look directly at that but I think I prefer a separate backend-local variable.) The tests that protect access to pg_subtrans need to use this, not RecentXmin, as the cutoff point.

The problem of course is that RecentXmin is the xmin associated with the most recently computed snapshot, but there may be older snapshots still in use in the transaction. The time qual test code needs to be able to get answers for transactions that could be present in any of the transaction's snapshots, not only the most recent.

Mea culpa, will fix.

			regards, tom lane

PS: this points up once again that the regression tests do not do very well at testing concurrent behavior. We need to think about some sort of parallel-test harness that would let us write tests that have better odds of catching bugs like this.
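A hedged sketch of the idea Tom describes (not his actual fix, whose details may differ): remember the xmin of the transaction's first snapshot in a backend-local TransactionXmin and use that, rather than the ever-advancing RecentXmin, as the cutoff below which pg_subtrans need not (and must not) be consulted. The function names and the reset-at-transaction-end detail are illustrative assumptions:

/* Sketch against the backend's own headers; not a drop-in patch. */
#include "postgres.h"

#include "access/subtrans.h"
#include "access/transam.h"

/* xmin of the first snapshot taken in this transaction (copy of MyProc->xmin) */
static TransactionId TransactionXmin = InvalidTransactionId;

/* Call wherever a snapshot's xmin has just been computed. */
static void
remember_snapshot_xmin(TransactionId snapXmin)
{
    RecentXmin = snapXmin;
    if (!TransactionIdIsValid(TransactionXmin))
        TransactionXmin = snapXmin;     /* frozen for the rest of the xact */
}

/* Reset at main-transaction end so the next transaction starts fresh. */
static void
reset_transaction_xmin(void)
{
    TransactionXmin = InvalidTransactionId;
}

/*
 * tqual.c-style lookup: any xid older than the xmin of our *oldest*
 * snapshot is already known committed or aborted, so only xids at or
 * after TransactionXmin can need a pg_subtrans lookup.  Using
 * RecentXmin here is what let the regression-test crash slip through.
 */
static TransactionId
topmost_xid_for_visibility(TransactionId xid)
{
    Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
    return SubTransGetTopmostTransaction(xid);
}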
On Thu, 16 Sep 2004, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > Interestingly, I *cannot* recreate on the single CPU system and I cannot
> > get abort() to generate a core.
>
> By that do you mean that you don't see any corefile in the DB directory
> when you look after the dust settles?

Right. I was actually just doing a find . -name core and came up with no results.

> [...]
>
> So what's happening is that WAL replay is wiping the database directory
> (including the core file).
>
> I don't really want to change the CREATE DATABASE replay logic, so I was
> thinking of suggesting that we hack around this by modifying pg_regress
> to force a checkpoint right after its CREATE DATABASE. Then any crashes
> during the regression tests wouldn't cause a replay of the CREATE. This
> is mighty ugly though :-(

Yes, a bit ugly. Not as ugly as my putting a for(;;) in SubTransGetTopmostTransaction() when the Assert condition would fail, so that I could get a useful backtrace.

Gavin
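Gavin's trick, in a hedged form: a fragment one might temporarily drop into SubTransGetTopmostTransaction() in subtrans.c (not committed code), which parks the backend in a loop instead of aborting, so a debugger can be attached before any crash-and-replay removes the evidence. It assumes <unistd.h> is included for sleep() and getpid(); the log message is made up:

    /* Temporary debugging hack, standing in for the bare Assert: */
    if (!TransactionIdFollowsOrEquals(xid, RecentXmin))
    {
        elog(LOG, "subtrans: xid %u precedes RecentXmin %u; attach gdb to pid %d",
             xid, RecentXmin, (int) getpid());
        for (;;)
            sleep(1);       /* spin here until a debugger is attached and moves us on */
    }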
On Fri, 2004-09-17 at 01:43, Tom Lane wrote:
> PS: this points up once again that the regression tests do not do very
> well at testing concurrent behavior. We need to think about some sort
> of parallel-test harness that would let us write tests that have better
> odds of catching bugs like this.

I agree -- something like this would be very useful. I'm a little stumped about how to actually write such a thing, though...

One incremental improvement to the existing RT system would be to make each test self-contained (i.e. as far as possible it would assume nothing about DBMS state before it ran, and not leave the DBMS state modified when it finished running). If writing completely self-contained tests isn't possible, we could specify a dependency graph for the tests. Then we could have the regression test system generate a randomized grouping of tests to be executed. In order to have reproducible results, we could associate each randomized grouping with a numeric key (there should be some way to produce a 1-to-1 mapping). So someone reporting a regression test problem would say: "I see the following regression test failures, the test key is xyz", and a developer could run the regression tests specifying that key and have the RT system run the same combination of tests in the same order that produced the problem the user saw. This would be a complement to the existing method of running the RTs.

I still think we need a separate way to test parallel behavior though, in addition to the above idea. Can anyone think of a good way to do this?

-Neil
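One hedged way to get the reproducible key-to-ordering mapping Neil describes: treat the key as the seed of a small deterministic PRNG and derive the schedule with a Fisher-Yates shuffle, so the same key always regenerates the same ordering on any platform (not strictly a 1-to-1 mapping onto all orderings, but enough for "rerun with key xyz"). The test names below are placeholders, not the real regression schedule, and the LCG constants are just the classic Numerical Recipes pair:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Tiny deterministic LCG so key -> ordering is identical on every platform. */
static uint32_t prng_state;

static uint32_t
prng_next(void)
{
    prng_state = prng_state * 1664525u + 1013904223u;
    return prng_state;
}

/* Fisher-Yates shuffle driven entirely by the numeric test key. */
static void
shuffle_tests(const char **tests, int ntests, uint32_t key)
{
    prng_state = key;
    for (int i = ntests - 1; i > 0; i--)
    {
        int         j = (int) (prng_next() % (uint32_t) (i + 1));
        const char *tmp = tests[i];

        tests[i] = tests[j];
        tests[j] = tmp;
    }
}

int
main(int argc, char **argv)
{
    const char *tests[] = {"boolean", "numeric", "join", "aggregates", "plpgsql", "triggers"};
    int         ntests = (int) (sizeof(tests) / sizeof(tests[0]));
    uint32_t    key = (argc > 1) ? (uint32_t) strtoul(argv[1], NULL, 10) : 42;

    printf("schedule for test key %u:\n", (unsigned) key);
    shuffle_tests(tests, ntests, key);
    for (int i = 0; i < ntests; i++)
        printf("  %s\n", tests[i]);
    return 0;
}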
On Wed, 22 Sep 2004, Neil Conway wrote:

> On Fri, 2004-09-17 at 01:43, Tom Lane wrote:
> > PS: this points up once again that the regression tests do not do very
> > well at testing concurrent behavior. [...]
>
> One incremental improvement to the existing RT system would be to make
> each test self-contained (i.e. as far as possible it would assume
> nothing about DBMS state before it ran, and not leave the DBMS state
> modified when it finished running). [...] In order to have reproducible
> results, we could associate each randomized grouping with a numeric key
> (there should be some way to produce a 1-to-1 mapping). [...]

If someone puts in the work to make the RTs self-contained, is it worthwhile also including logic to return the system to the state it was in before the test took place? (Self-contained tests kind of imply that, to me anyway.) What this also gives us is the ability to run each test continuously while we run every single other test. Sure, this would take a while to run, but I think it gives us a lot of concurrent testing with a relatively small amount of work.

> I still think we need a separate way to test parallel behavior though,
> in addition to the above idea. Can anyone think of a good way to do
> this?

I think if the tests are self-contained then we have a higher likelihood of hitting concurrency issues. The thing is, the major concurrency issues which have come up recently (from memory) are all race conditions. Running queries in parallel, we cannot predetermine who will win the race, and I cannot see how we could. The best we can do is increase the chances of finding a problem by running the tests more than once.

Gavin
Neil Conway <neilc@samurai.com> writes:
> On Fri, 2004-09-17 at 01:43, Tom Lane wrote:
>> PS: this points up once again that the regression tests do not do very
>> well at testing concurrent behavior. We need to think about some sort
>> of parallel-test harness that would let us write tests that have better
>> odds of catching bugs like this.

> One incremental improvement to the existing RT system would be to make
> each test self-contained (i.e. as far as possible it would assume
> nothing about DBMS state before it ran, and not leave the DBMS state
> modified when it finished running). If writing completely self-contained
> tests isn't possible, we could specify a dependency graph for the tests.
> Then we could have the regression test system generate a randomized
> grouping of tests to be executed.

This is pretty much what we have now, and it's quite useless, because the tests that are run concurrently are very careful to stay out of each others' way. There are certainly no tests that deliberately exercise concurrent-update scenarios, and AFAICS the scheme you are suggesting would not do that either.

There are some speculations in the archives about writing a test engine that could fire up multiple database connections and then issue specific command sequences across the set of connections. It's clear that you could test questions like does-this-lock-really-block-that-operation that way. I'm not sure that this approach does anything much to catch problems with small windows of opportunity, though :-(

Mulling it over ... it seems the real issue is that the regression tests are designed for reproducibility, whereas what we'd really like to probe for this sort of bug is *non*-reproducible behavior. Not sure how to approach that, except that maybe we shouldn't think in terms of tests that every Joe Bob and Harry can run and not need any intelligence to interpret the results of ...

			regards, tom lane
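As a hedged illustration of the multi-connection test engine Tom speculates about (no such harness exists in the tree), a libpq program can open two sessions and drive a fixed interleaving across them, e.g. to check that a row lock in one session really blocks a conflicting update in the other. The statements echo the UPDATE seen in Gavin's backtrace, but the table, connection strings, and the run() helper are assumptions made up for the example:

#include <stdio.h>
#include "libpq-fe.h"

/* Run one command on one session and report what happened. */
static void
run(PGconn *conn, const char *label, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    printf("[%s] %s -> %s\n", label, sql, PQresStatus(PQresultStatus(res)));
    PQclear(res);
}

int
main(void)
{
    PGconn   *a = PQconnectdb("dbname=regression");
    PGconn   *b = PQconnectdb("dbname=regression");
    PGresult *res;

    if (PQstatus(a) != CONNECTION_OK || PQstatus(b) != CONNECTION_OK)
    {
        fprintf(stderr, "could not open both connections\n");
        return 1;
    }

    /* a scripted interleaving of commands across the two sessions */
    run(a, "A", "BEGIN");
    run(b, "B", "BEGIN");
    run(a, "A", "UPDATE pfield SET name = 'PF0_2' WHERE name = 'PF0_X'");

    /* Send B's conflicting update asynchronously; it should block on A's row lock. */
    PQsendQuery(b, "UPDATE pfield SET name = 'PF0_3' WHERE name = 'PF0_X'");

    run(a, "A", "COMMIT");                    /* releases the lock ... */

    while ((res = PQgetResult(b)) != NULL)    /* ... letting B's update complete */
    {
        printf("[B] blocked update -> %s\n", PQresStatus(PQresultStatus(res)));
        PQclear(res);
    }
    run(b, "B", "COMMIT");

    PQfinish(a);
    PQfinish(b);
    return 0;
}

A real harness would read such command sequences from test scripts and diff the observed ordering and results against expected output, much as pg_regress does for single-session tests; as Tom notes, though, fixed interleavings still would not catch narrow race windows like the RecentXmin bug.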