Thread: Regression test failures

Regression test failures

From
Bruce Momjian
Date:
I am still seeing random regression test failures on my SMP BSD/OS
machine.  It basically happens when doing 'gmake check'.

I have tried running repeated tests and can't get it to reproduce, but
when checking patches it has happened perhaps once a week for the past
six weeks.  It happens once and then doesn't happen again.

I will keep investigating.  I reported this perhaps three weeks ago.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Regression test failures

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am still seeing random regression test failures on my SMP BSD/OS
> machine.  It basically happens when doing 'gmake check'.

> I have tried running repeated tests and can't get it to reproduce, but
> when checking patches it has happened perhaps once a week for the past
> six weeks.  It happens once and then doesn't happen again.

> I will keep investigating.  I reported this perhaps three weeks ago.

Do these failures look anything like this?

--- 78,86 ----
          DROP TABLE foo;
          CREATE TABLE bar (a int);
      ROLLBACK TO SAVEPOINT one;
! WARNING:  AbortSubTransaction while in ABORT state
! ERROR:  relation 555088 deleted while still in use
! server closed the connection unexpectedly
!     This probably means the server terminated abnormally
!     before or while processing the request.
! connection to server was lost

I got this once this morning and have been unable to reproduce it.
The OID referenced in the message seemed to correspond to the relation
"bar", created just above the point of error.
        regards, tom lane


Re: Regression test failures

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> 
>>I am still seeing random regression test failures on my SMP BSD/OS
>>machine.  It basically happens when doing 'gmake check'.
> 
> 
>>I have tried running repeated tests and can't get it to reproduce, but
>>when checking patches it has happened perhaps once a week for the past
>>six weeks.  It happens once and then doesn't happen again.
> 
> 
>>I will keep investigating.  I reported this perhaps three weeks ago.
> 
> 
> Do these failures look anything like this?
> 
> --- 78,86 ----
>           DROP TABLE foo;
>           CREATE TABLE bar (a int);
>       ROLLBACK TO SAVEPOINT one;
> ! WARNING:  AbortSubTransaction while in ABORT state
> ! ERROR:  relation 555088 deleted while still in use
> ! server closed the connection unexpectedly
> !     This probably means the server terminated abnormally
> !     before or while processing the request.
> ! connection to server was lost
> 
> I got this once this morning and have been unable to reproduce it.
> The OID referenced in the message seemed to correspond to the relation
> "bar", created just above the point of error.


Just for the record, I had strange errors too on beta1 when playing
with creating/deleting/altering tables and savepoints (not sure if that
is related anyhow).
It once happened twice in a row, but when I tried to build a testcase
to report this issue I couldn't reproduce it again :-(

iirc the error I got was something along the line of:

ERROR:  catalog is missing 1 attribute(s) for relid 17231



Stefan


Re: Regression test failures

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> iirc the error I got was something along the line of:
> ERROR:  catalog is missing 1 attribute(s) for relid 17231

It's possible that that's the same problem but in a form triggered by
ALTER ADD COLUMN.

I was able to reproduce the problem I saw, and have now decided that
there are several interacting bugs involved.  Basically, the sequence
    BEGIN;
    SAVEPOINT x;
    CREATE TABLE foo ...;
    ROLLBACK TO x;

ought to *always* fail (bug #1) but chances not to do so because of bug
#2 --- except that there's a race condition (bug #3) which allows the
failure to emerge if some other backend has done the right thing during
a narrow time window.

Bug #1 is that relcache.c isn't accounting for subtransactions in its
handling of rd_isnew relcache entries.  In the above example, foo is
marked rd_isnew and so relcache.c tries to preserve the relcache entry
until transaction end.  The ROLLBACK will hit it with cache invalidation
actions telling it that the pg_class and pg_attribute entries for the
table have changed.  Normally that would cause the relcache entry to be
dropped, but since it's rd_isnew, relcache.c mulishly tries to rebuild
it instead.  So it's reading catalog entries that are now considered
deleted, and so the "deleted while in use" error is exactly what you'd
expect.

So why don't you get that all the time?  Well, bug #2 is that
TransactionIdIsCurrentTransactionId still considers the already-aborted
subtransaction ID to be current, so the validity tests in tqual.c will
think the catalog rows are still valid.

Except that there is a race condition.  If, between the time that the
subxact is marked aborted in pg_clog and the time relcache.c tries to
re-read these rows, some other backend comes along and examines the
pg_class row in question, it will mark the row as XMIN_INVALID, in
which case tqual.c doesn't bother to check
TransactionIdIsCurrentTransactionId but just declares the row no good.
So, with just the right sort of concurrent activity, it's possible to
observe the error.
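
To make the interaction of bug #2 and the race easier to follow, here is a toy model in Python. It is only a sketch under stated assumptions, not the real tqual.c logic: the names Row, row_is_visible and other_backend_examines, the xmin_invalid field and the transaction ID 101 are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    xmin: int                   # ID of the (sub)transaction that inserted the row
    xmin_invalid: bool = False  # hint bit: the inserter is known to have aborted


def row_is_visible(row, current_xids, aborted_xids):
    """Toy stand-in for the validity test (hypothetical, not tqual.c)."""
    if row.xmin_invalid:
        # Hint bit already set: reject without asking whether xmin is "current".
        return False
    if row.xmin in current_xids:
        # Bug #2: an already-aborted subtransaction ID still counts as current.
        return True
    if row.xmin in aborted_xids:
        row.xmin_invalid = True  # remember the verdict for later readers
        return False
    return True  # toy assumption: anything else is treated as committed


def other_backend_examines(row, aborted_xids):
    """A concurrent backend, for which xmin is never 'current'."""
    return row_is_visible(row, current_xids=set(), aborted_xids=aborted_xids)


SUBXACT = 101               # hypothetical ID of the rolled-back subtransaction
aborted = {SUBXACT}         # already marked aborted in the commit log
still_current = {SUBXACT}   # bug #2: our own backend still thinks it is current

# No concurrent activity: the rebuild sees the row as valid, so no error.
row = Row(xmin=SUBXACT)
no_race = row_is_visible(row, still_current, aborted)

# Another backend looks at the row first and sets the hint bit.
other_backend_examines(row, aborted)
with_race = row_is_visible(row, still_current, aborted)
```

In this model no_race comes out true and with_race comes out false, which matches the description above: the failure only shows up when some other backend touches the row inside the window.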

I think your "catalog is missing 1 attribute" example might be the same
sort of thing, except the XMIN marking happened to a pg_attribute row
instead of a pg_class row.  (I'm not totally convinced by that
explanation though --- the failure should be transient rather than
repeatable, if this was the mechanism.)

After further thought I'm thinking that bug #2 is not so much whether
TransactionIdIsCurrentTransactionId is behaving correctly, but the fact
that it is being invoked at all.  We should never be doing catalog
accesses in transaction-aborted state.  The relcache code therefore
needs to be changed so that it doesn't try to do rebuilds immediately,
but waits until we are back in a "good" state (either out of the failed
subtransaction, or starting a whole fresh transaction if invalidation
happened at the end of a main transaction).  I think this bug exists
in existing releases too, for invalidation events affecting
nailed-in-cache system catalogs.  It's not clear you'd ever see an
actual failure in the field for that case, but it's still wrong.
        regards, tom lane
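
The deferred-rebuild idea in the last paragraph can be sketched as a toy model in Python. This is only an illustration under assumptions, not the relcache.c change itself: the class ToyRelCache, its methods and its fields are hypothetical, and "pg_class" is used merely as an example of a nailed-in-cache catalog.

```python
class ToyRelCache:
    """Toy cache that never reads the 'catalog' while the transaction is aborted."""

    def __init__(self):
        self.in_aborted_state = False
        self.stale = set()        # entries invalidated but not yet rebuilt
        self.catalog_reads = []   # log of rebuilds, kept for illustration

    def _rebuild(self, relname):
        # Stand-in for re-reading the catalog rows of the relation.
        assert not self.in_aborted_state, "catalog access in aborted state"
        self.catalog_reads.append(relname)
        self.stale.discard(relname)

    def invalidate(self, relname):
        # Always record the invalidation; rebuild now only if that is safe.
        self.stale.add(relname)
        if not self.in_aborted_state:
            self._rebuild(relname)

    def leave_aborted_state(self):
        # Called once the failed (sub)transaction has been cleaned up.
        self.in_aborted_state = False
        for relname in sorted(self.stale):
            self._rebuild(relname)


cache = ToyRelCache()
cache.in_aborted_state = True      # we are inside a failed subtransaction
cache.invalidate("pg_class")       # invalidation arrives during abort
reads_while_aborted = list(cache.catalog_reads)

cache.leave_aborted_state()        # back in a "good" state
reads_after = list(cache.catalog_reads)
```

Here the invalidation that arrives during abort is only remembered, and the rebuild happens after the cache is told it is back in a good state, so no catalog read is attempted in between.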