Re: [PATCHES] WAL logging freezing - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [PATCHES] WAL logging freezing
Msg-id 17552.1162227919@sss.pgh.pa.us
In response to Re: [PATCHES] WAL logging freezing  (Alvaro Herrera <alvherre@commandprompt.com>)
Responses Re: [PATCHES] WAL logging freezing  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-hackers
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Ugh.  Is there another solution to this?  Say, sync the buffer so that
> the hint bits are written to disk?

Yeah.  The original design for all this is explained by the notes for
TruncateCLOG:

 * When this is called, we know that the database logically contains no
 * reference to transaction IDs older than oldestXact.  However, we must
 * not truncate the CLOG until we have performed a checkpoint, to ensure
 * that no such references remain on disk either; else a crash just after
 * the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk.  There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.
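
To make the ordering constraint concrete, here is a minimal sketch in C.
It is a toy model only: ToyTable, toy_flush_and_fsync and the other toy_*
names are hypothetical and do not exist in the PostgreSQL source.  The
only point it illustrates is that relminxid may not advance while
hint-bit changes are still unflushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the PostgreSQL source. */
typedef uint32_t ToyXid;

typedef struct ToyTable
{
    ToyXid relminxid;   /* stand-in for the table's relminxid */
    bool   dirty;       /* true while hint-bit changes are not yet on disk */
} ToyTable;

/* Wraparound-aware "a is older than b", using modulo-2^32 arithmetic. */
static bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    return (int32_t) (a - b) < 0;
}

/* Stand-in for writing out the table's buffers and fsyncing its files. */
static void
toy_flush_and_fsync(ToyTable *t)
{
    assert(t != NULL);
    t->dirty = false;
}

/* Refuse to advance while unflushed changes remain; never move backwards. */
static bool
toy_advance_relminxid(ToyTable *t, ToyXid new_min)
{
    assert(t != NULL);
    if (t->dirty)
        return false;
    if (toy_xid_precedes(new_min, t->relminxid))
        return false;
    t->relminxid = new_min;
    return true;
}

/* The ordering proposed above: flush and fsync first, then advance. */
static bool
toy_vacuum_finish(ToyTable *t, ToyXid new_min)
{
    toy_flush_and_fsync(t);
    return toy_advance_relminxid(t, new_min);
}
```

Calling toy_advance_relminxid directly on a dirty table fails and leaves
relminxid alone; going through toy_vacuum_finish succeeds because the
flush happens first.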

That still leaves us with the problem of hint bits not being updated
during WAL replay.  I think the best solution for this is for WAL replay
to force relvacuumxid to equal relminxid (btw, these field names seem
poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
rather than adopting the value shown in the WAL record.  This probably
is best done by abandoning the generic "overwrite tuple" WAL record type
in favor of something specific to minxid updates.  The effect would then
be that a PITR slave would not truncate its clog beyond the freeze
horizon until it had performed a vacuum of its own.
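
A rough sketch of what such a replay routine might do, again in C.  The
record layout and every toy_* name are invented for illustration; this is
not the actual WAL record format.  It also assumes, as the paragraph
above implies, that relvacuumxid is the value that bounds clog
truncation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the real WAL record format. */
typedef uint32_t ToyXid;

typedef struct ToyMinxidRecord
{
    ToyXid relminxid;       /* freeze horizon computed on the master */
    ToyXid relvacuumxid;    /* master's value; deliberately ignored on replay */
} ToyMinxidRecord;

typedef struct ToyClassEntry
{
    ToyXid relminxid;
    ToyXid relvacuumxid;
} ToyClassEntry;

/*
 * Replay of a minxid-specific record: adopt relminxid from the record, but
 * force relvacuumxid to the same value instead of trusting the record,
 * because the hint bits set on the master were never replayed here.
 */
static void
toy_redo_minxid_update(ToyClassEntry *entry, const ToyMinxidRecord *rec)
{
    assert(entry != NULL && rec != NULL);
    entry->relminxid = rec->relminxid;
    entry->relvacuumxid = rec->relminxid;
}

/* Assumption: clog may be truncated only up to relvacuumxid. */
static ToyXid
toy_clog_truncation_limit(const ToyClassEntry *entry)
{
    assert(entry != NULL);
    return entry->relvacuumxid;
}

/* A vacuum run locally on the slave may then move relvacuumxid forward. */
static void
toy_local_vacuum(ToyClassEntry *entry, ToyXid vacuum_cutoff)
{
    assert(entry != NULL);
    entry->relvacuumxid = vacuum_cutoff;
}
```

In this model a replayed record carrying relminxid 500 and relvacuumxid
900 leaves the slave's truncation limit at 500; only toy_local_vacuum
moves it further.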

The point about aborted xmax being a risk factor is a good one.  I don't
think the risk is material for ordinary crash recovery scenarios,
because ordinarily we'd have many opportunities to set the hint bit
before anything really breaks, but it's definitely an issue for
long-term PITR replay scenarios.

I'll work on this as soon as I get done with the btree-index issue I'm
messing with now.

            regards, tom lane
