Forward zeroing of pg_clog

From
Tom Lane
Date:
I just spent some time chasing weird failures ("PANIC: cannot abort
transaction 201109, it was already committed" after some but not all
errors) which I eventually realized were because pg_clog contained
commit and abort flags for several thousand transactions ahead of where
the current XID counter is in my test database.

How did it get that way?  Well, yesterday I was testing the XLOG mods to
support huge COMMIT records, so I ran a test script that would commit a
transaction with 20000 subcommitted subtransactions.  Then I kill -9'd
the backend to force WAL replay of that large transaction.

WAL replay sets the XID counter to one more than the largest XID that it
sees evidence of in the replayed log.  However, it's not looking inside
the COMMIT or ABORT records, and so in this case the largest XID it saw
was that of the parent transaction.  The actual pre-crash XID counter
was of course 20000 more than that.

This particular issue is just a simple oversight in xact_redo, and it's
easily fixed: make sure nextXID gets advanced past all of the committed
or aborted subXIDs too.
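
A minimal sketch of that fix, for illustration only. The function name
advance_next_xid, its signature, and the simplified wraparound comparison
are assumptions and are not taken from the PostgreSQL source; the sketch
also ignores the reserved special XIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Wraparound-aware test: does XID a logically follow or equal XID b? */
static int
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Hypothetical helper: given the XID counter computed so far during
 * replay, advance it past the parent XID of a COMMIT/ABORT record and
 * past every sub-XID listed inside that record.
 */
static TransactionId
advance_next_xid(TransactionId next_xid, TransactionId xid,
                 const TransactionId *subxids, size_t nsub)
{
    size_t i;

    assert(nsub == 0 || subxids != NULL);

    if (xid_follows_or_equals(xid, next_xid))
        next_xid = xid + 1;

    for (i = 0; i < nsub; i++)
    {
        if (xid_follows_or_equals(subxids[i], next_xid))
            next_xid = subxids[i] + 1;
    }
    return next_xid;
}
```

The point is only that the redo routine has to look at the sub-XID array
carried by the record, not just at the XID in the record header.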

But thinking about it, I realized that we have some other issues in the
same area.  Because subxact commit sets clog bits but emits no WAL
record, it's at least theoretically possible that post-crash there will
be written-out clog bits for XIDs ahead of every XID of which there is
any record in the WAL data.  RecordTransactionCommit and friends have
other cases in which they think it's sufficient to write a clog entry
and no WAL entry.  Perhaps that's broken, but I think the cleanest fix
is that the clog code ought to forcibly zero all clog entries ahead of
whatever nextXID is settled on by WAL replay.  Otherwise we run some
risk of subtransactions that are still running looking like they are
subcommitted (or worse) in the clog data.

This is already true at the page level: when advancing into a new page
we zero it instead of reading anything from disk.  I am thinking of
adding code to StartupCLOG to zero the remaining portion of the
"current" page too.

Thoughts?
        regards, tom lane
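
For illustration, zeroing the remaining portion of the current clog page
could look roughly like the sketch below. It assumes 2 status bits per
transaction, 4 transactions per byte stored low-order bits first, and an
8192-byte page; the function name zero_clog_tail is hypothetical and this
is not the code that was committed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define PAGE_SIZE            8192   /* assumed page size */
#define CLOG_XACTS_PER_PAGE  (PAGE_SIZE * CLOG_XACTS_PER_BYTE)

/*
 * Hypothetical helper: clear the status bits of next_xid and of every
 * later XID that lives on the same clog page.  Entries for XIDs before
 * next_xid are left untouched.
 */
static void
zero_clog_tail(unsigned char *page, TransactionId next_xid)
{
    uint32_t entry  = next_xid % CLOG_XACTS_PER_PAGE;
    uint32_t byteno = entry / CLOG_XACTS_PER_BYTE;
    uint32_t bshift = (entry % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    assert(page != NULL);

    /* Keep the low-order bits (earlier XIDs in this byte), clear the rest. */
    page[byteno] &= (unsigned char) ((1u << bshift) - 1);

    /* Zero every following byte up to the end of the page. */
    memset(page + byteno + 1, 0, PAGE_SIZE - byteno - 1);
}
```

If next_xid falls exactly on a page boundary the whole page is cleared,
which matches the existing page-level behavior described above.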


Re: Forward zeroing of pg_clog

From
"Simon Riggs"
Date:
> Tom Lane wrote

> This is already true at the page level: when advancing into a new page
> we zero it instead of reading anything from disk.  I am thinking of
> adding code to StartupCLOG to zero the remaining portion of the
> "current" page too.
>
> Thoughts?
>

That sounds like the right thing to do. IIRC that means we do it once at the
very end of recovery, which is quick as well as safe.

This is important for point-in-time recovery also, since there would always
be clog entries ahead of the recovery target.

Best Regards, Simon Riggs



Re: Forward zeroing of pg_clog

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> This is important for point-in-time recovery also, since there would always
> be clog entries ahead of the recovery target.

Not really, because they'd not have gotten applied.  AFAICS only crash
recovery really has an issue here.
        regards, tom lane


Re: Forward zeroing of pg_clog

From
Alvaro Herrera
Date:
On Mon, Aug 30, 2004 at 02:19:49PM -0400, Tom Lane wrote:

> This particular issue is just a simple oversight in xact_redo, and it's
> easily fixed: make sure nextXID gets advanced past all of the committed
> or aborted subXIDs too.

Certainly this is the right thing to do.

> [...] but I think the cleanest fix is that the clog code ought to
> forcibly zero all clog entries ahead of whatever nextXID is settled on
> by WAL replay.

I agree with this too.

Thanks for taking care of all this stuff.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"We are who we choose to be", sang the goldfinch
when the sun is high (Sandman)