Forward zeroing of pg_clog - Mailing list pgsql-hackers

From Tom Lane
Subject Forward zeroing of pg_clog
Date
Msg-id 8958.1093889989@sss.pgh.pa.us
Responses Re: Forward zeroing of pg_clog  ("Simon Riggs" <simon@2ndquadrant.com>)
Re: Forward zeroing of pg_clog  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
List pgsql-hackers
I just spent some time chasing weird failures ("PANIC: cannot abort
transaction 201109, it was already committed" after some but not all
errors) which I eventually realized were because pg_clog contained
commit and abort flags for several thousand transactions ahead of where
the current XID counter is in my test database.

How did it get that way?  Well, yesterday I was testing the XLOG mods to
support huge COMMIT records, so I ran a test script that would commit a
transaction with 20000 subcommitted subtransactions.  Then I kill -9'd
the backend to force WAL replay of that large transaction.

WAL replay sets the XID counter to one more than the largest XID that it
sees evidence of in the replayed log.  However, it's not looking inside
the COMMIT or ABORT records, and so in this case the largest XID it saw
was that of the parent transaction.  The actual pre-crash XID counter
was of course 20000 more than that.

This particular issue is just a simple oversight in xact_redo, and it's
easily fixed: make sure nextXID gets advanced past all of the committed
or aborted subXIDs too.
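
To make the intended behavior concrete, here is a minimal standalone
sketch, not the actual xact_redo patch.  The helper names
(xid_follows_or_equals, advance_next_xid) are hypothetical, and the
sketch ignores the reserved special XIDs that the real code has to skip
after wraparound.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware "a is at or after b", using modulo-2^32 arithmetic.
 * (Assumption: both XIDs are normal XIDs; special XIDs are ignored here.)
 */
static int
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Hypothetical helper: given the XID counter derived so far during
 * replay, plus the parent XID and the sub-XID array carried by one
 * COMMIT or ABORT record, return a counter that is past all of them.
 */
static TransactionId
advance_next_xid(TransactionId next_xid, TransactionId xid,
                 const TransactionId *subxids, size_t nsubxids)
{
    size_t      i;

    if (xid_follows_or_equals(xid, next_xid))
        next_xid = xid + 1;

    /* The step that was missing: look inside the record's sub-XIDs too. */
    for (i = 0; i < nsubxids; i++)
    {
        if (xid_follows_or_equals(subxids[i], next_xid))
            next_xid = subxids[i] + 1;
    }
    return next_xid;
}
```

With a record for parent 201109 carrying sub-XIDs up to 201115, the
counter ends up at 201116 rather than 201110.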

But thinking about it, I realized that we have some other issues in the
same area.  Because subxact commit sets clog bits but emits no WAL
record, it's at least theoretically possible that post-crash there will
be written-out clog bits for XIDs ahead of every XID of which there is
any record in the WAL data.  RecordTransactionCommit and friends have
other cases in which they think it's sufficient to write a clog entry
and no WAL entry.  Perhaps that's broken, but I think the cleanest fix
is that the clog code ought to forcibly zero all clog entries ahead of
whatever nextXID is settled on by WAL replay.  Otherwise we run some
risk of subtransactions that are still running looking like they are
subcommitted (or worse) in the clog data.

This is already true at the page level: when advancing into a new page
we zero it instead of reading anything from disk.  I am thinking of
adding code to StartupCLOG to zero the remaining portion of the
"current" page too.

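A rough sketch of what zeroing the rest of the "current" page could
look like, written as standalone C rather than as real StartupCLOG
code.  It assumes 2 status bits per transaction, an 8192-byte page, and
lower-numbered XIDs in the low-order bits of each byte; the function
name zero_clog_page_tail is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: 2 bits per transaction, 4 transactions per byte. */
#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define CLOG_PAGE_BYTES      8192
#define CLOG_XACTS_PER_PAGE  (CLOG_PAGE_BYTES * CLOG_XACTS_PER_BYTE)

/*
 * Hypothetical helper: "page" is the in-memory copy of the clog page
 * that contains next_xid.  Clear the entries for next_xid and every
 * later XID on that page, leaving earlier entries untouched.
 */
static void
zero_clog_page_tail(unsigned char *page, uint32_t next_xid)
{
    uint32_t    entry = next_xid % CLOG_XACTS_PER_PAGE;
    uint32_t    byteno = entry / CLOG_XACTS_PER_BYTE;
    uint32_t    bshift = (entry % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    if (bshift != 0)
    {
        /* Partial byte: keep only the bits of XIDs before next_xid. */
        page[byteno] &= (unsigned char) ((1u << bshift) - 1);
        byteno++;
    }

    /* Everything after that byte belongs to XIDs at or past next_xid. */
    memset(page + byteno, 0, CLOG_PAGE_BYTES - byteno);
}
```

The partial-byte mask matters because next_xid usually falls in the
middle of a byte, and the entries sharing that byte with it may belong
to transactions that really did commit.
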
Thoughts?
        regards, tom lane

