I have noticed that a large fraction of the I/O done by 7.1 is
associated with initializing new segments of the WAL for use.
(We have to physically fill each segment with zeroes to ensure that
the system has actually allocated a whole 16MB to it; otherwise we
fall victim to the "hole-saving" allocation technique of most Unix
filesystems.) I just had an idea about how to avoid this cost:
why not recycle old log segments? At the point where the code
currently deletes a no-longer-needed segment, just rename it to
become the next created-in-advance segment.

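To make the rename idea concrete, here is a minimal sketch in Python rather than backend C. The helper names (segment_name, recycle_segment) and the 16-hex-digit file naming are assumptions for illustration only, not the actual xlog.c code.

```python
import os
import tempfile

SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB per WAL segment, as described above


def segment_name(log_id, seg_no):
    # Assumed naming scheme: log file id and segment number, 8 hex digits each.
    return "%08X%08X" % (log_id, seg_no)


def recycle_segment(xlog_dir, old, new):
    """Rename a no-longer-needed segment so that it becomes a future segment.

    'old' and 'new' are (log_id, seg_no) pairs. The file keeps the disk
    blocks it already owns, so no zero-fill pass is needed.
    """
    src = os.path.join(xlog_dir, segment_name(*old))
    dst = os.path.join(xlog_dir, segment_name(*new))
    # Not race-free; good enough for a single-process illustration.
    if os.path.exists(dst):
        raise FileExistsError(dst)  # never clobber an existing segment
    os.rename(src, dst)
    return dst
```

The point of the sketch is only that the rename replaces both the unlink of the old file and the create-and-zero-fill of the new one.
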
With this approach, shortly after installation the system would converge
to a steady state with a constant number of WAL segments (basically
CHECKPOINT_SEGMENTS + WAL_FILES + 1, maybe one or two more if load is
really high). So, in addition to eliminating initialization writes,
we would also reduce the metadata traffic (inode and indirect blocks)
to a very low level. That has to be good both for performance and for
improving the odds that the WAL files will survive a system crash.

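As a rough worked example of that steady state, the arithmetic can be written out as below. The function names and the parameter values in the comment are purely illustrative, not recommended settings.

```python
SEGMENT_SIZE_MB = 16  # each WAL segment is 16MB


def steady_state_segments(checkpoint_segments, wal_files):
    # The estimate from the text: CHECKPOINT_SEGMENTS + WAL_FILES + 1.
    return checkpoint_segments + wal_files + 1


def steady_state_disk_mb(checkpoint_segments, wal_files):
    # Disk space pinned by the constant pool of segments.
    return steady_state_segments(checkpoint_segments, wal_files) * SEGMENT_SIZE_MB


# With illustrative values CHECKPOINT_SEGMENTS = 3 and WAL_FILES = 2,
# the pool settles at 6 segments, i.e. 96MB of permanently allocated WAL.
```
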
The sole disadvantage I can see to this approach is that a recycled
segment would not contain zeroes, but valid WAL records. We'd need
to take care that in a recovery situation, we not mistake old records
beyond the last one we actually wrote for new records we should redo.
While checking the xl_prev back-pointers in each record should be
sufficient to detect this, I'd feel more comfortable if we extended
the XLogPageHeader record to contain the file/segment number that it
belongs to. This'd cost an extra 8 bytes per 8K XLOG page, which seems
worth it to me.

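Here is a small sketch of how a per-page segment identity would let recovery spot leftover pages in a recycled file. The header layout, field order, and magic value below are hypothetical stand-ins, not the real XLogPageHeader definition.

```python
import struct

PAGE_SIZE = 8192
# Hypothetical header: magic (uint16), info (uint16),
# then the proposed 8 extra bytes: log file id (uint32) + segment number (uint32).
HEADER = struct.Struct("<HHII")
MAGIC = 0xD058  # arbitrary illustrative value


def make_page(log_id, seg_no, payload=b""):
    # Build one page stamped with the segment it was written for.
    header = HEADER.pack(MAGIC, 0, log_id, seg_no)
    return header + payload.ljust(PAGE_SIZE - HEADER.size, b"\0")


def page_belongs_to(page, log_id, seg_no):
    magic, _info, page_log, page_seg = HEADER.unpack_from(page)
    return magic == MAGIC and (page_log, page_seg) == (log_id, seg_no)


def first_stale_page(segment_bytes, log_id, seg_no):
    """Return the index of the first page not written for this segment, or None."""
    for offset in range(0, len(segment_bytes), PAGE_SIZE):
        page = segment_bytes[offset:offset + PAGE_SIZE]
        if not page_belongs_to(page, log_id, seg_no):
            return offset // PAGE_SIZE
    return None
```

In this model, replay would stop at the first page whose stamp does not match the file it is reading, regardless of what the old records on that page look like. On cost: 8 bytes out of 8192 is about 0.1%, and a 16MB segment holds 2048 pages, so the extra fields come to 16KB per segment.
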
Another issue is whether the recycling logic should be "always recycle"
(hence the number of extant WAL segments will never decrease), or
something more like "recycle if there are fewer than WAL_FILES advance
segments, else delete". If we were supporting WAL-based UNDO then I
think it'd have to be the latter, so that we could reduce the WAL usage
from a peak created by a long-running transaction. But with the present
logic that the WAL is truncated after each checkpoint, I think it'd
be better just to never delete. Otherwise, the behavior is likely to
be that the system varies between N and N+1 extant segments due to
roundoff effects (i.e., depending on just where you are in the current
segment when a checkpoint happens). That's exactly what we do not want.
A possible answer is "recycle if there are fewer than WAL_FILES + SLOP
advance files, else delete", where SLOP is (say) about three or four
segments. That would avoid unwanted oscillations in the number of
extant files, while still allowing decrease from a peak for UNDO.

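The SLOP rule could be sketched roughly as follows. The function names and the default SLOP of 3 are assumptions for illustration; advance_segments means the number of pre-created segments already waiting beyond the current one.

```python
def segment_disposition(advance_segments, wal_files, slop=3):
    # "recycle if there are fewer than WAL_FILES + SLOP advance files, else delete"
    if advance_segments < wal_files + slop:
        return "recycle"
    return "delete"


def dispose_old_segments(n_obsolete, advance_segments, wal_files, slop=3):
    """Apply the rule to each obsolete segment found after a checkpoint.

    Returns (recycled, deleted). Every recycled file becomes one more
    advance segment, so the pool fills up to the threshold and then stops.
    """
    recycled = deleted = 0
    for _ in range(n_obsolete):
        if segment_disposition(advance_segments + recycled, wal_files, slop) == "recycle":
            recycled += 1
        else:
            deleted += 1
    return recycled, deleted
```

Under this rule a one-segment wobble around WAL_FILES is absorbed by the slop, while a large surplus left behind by a peak is still deleted. Setting slop=0 gives the plain "fewer than WAL_FILES" rule, and "always recycle" corresponds to an unbounded slop.
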
Comments, better ideas?

			regards, tom lane