On 2014-01-21 19:23:57 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-01-21 18:59:13 -0500, Tom Lane wrote:
> >> Another thing to think about is whether we couldn't put a hard limit on
> >> WAL record size somehow. Multi-megabyte WAL records are an abuse of the
> >> design anyway, when you get right down to it. So for example maybe we
> >> could split up commit records, with most of the bulky information dumped
> >> into separate records that appear before the "real commit". This would
> >> complicate replay --- in particular, if we abort the transaction after
> >> writing a few such records, how does the replayer realize that it can
> >> forget about those records? But that sounds probably surmountable.
>
> > I think removing the list of subtransactions from commit records would
> > essentially require not truncating pg_subtrans after a restart
> > anymore.
>
> I'm not suggesting that we stop providing that information! I'm just
> saying that we perhaps don't need to store it all in one WAL record,
> if instead we put the onus on WAL replay to be able to reconstruct what
> it needs from a series of WAL records.
That'd likely require something similar to the incomplete actions used
in btrees (and, until recently, in more places). I think that mechanism
is/was a disaster, and I really don't want to extend it.
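For context, the bulk being complained about lives in the commit record
itself: it carries variable-length arrays of committed subtransaction
XIDs, relation files to drop, and invalidation messages, so a
transaction with tens of thousands of subtransactions or dropped
relations produces a single huge record. A simplified sketch of such a
layout (illustrative names only, not the actual xl_xact_commit
definition):

/*
 * Simplified sketch of a commit record that carries all of its bulky
 * payload inline.  Names and types are illustrative only; this is not
 * the real WAL record definition.
 */
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

typedef struct SketchCommitRecord
{
    int64_t     commit_time;    /* commit timestamp */
    uint32_t    nsubxacts;      /* # of committed subtransaction XIDs */
    uint32_t    nrels;          /* # of relation files to unlink */
    uint32_t    nmsgs;          /* # of shared-invalidation messages */

    /*
     * Followed by variable-length arrays:
     *   TransactionId subxacts[nsubxacts];
     *   Oid           rel_files[nrels];
     *   ...invalidation messages...
     * These arrays are what push the record into the multi-megabyte
     * range for pathological transactions.
     */
} SketchCommitRecord;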
> > We could relatively easily split off logging the dropped files from
> > commit records and log them in groups afterwards; we already have
> > several races that allow files to be leaked.
>
> I was thinking the other way around: emit the subsidiary records before the
> atomic commit or abort record, indeed before we've actually committed.
> Part of the point is to reduce the risk that lack of WAL space would
> prevent us from fully committing.
> Replay would then involve either accumulating the subsidiary records in
> memory, or being willing to go back and re-read them when the real commit
> or abort record is seen.
Well, the reason I suggested doing it the other way round is that we
wouldn't need to reassemble anything (apart from cache invalidations,
which I don't know how to handle that way), which I think is a
significant gain in robustness and reduction in complexity.
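To make the trade-off concrete, here is a minimal sketch (all names
hypothetical, nothing like actual recovery code) of what your variant
asks of replay: subsidiary records emitted before the commit have to be
accumulated per top-level XID and either consumed when the matching
commit/abort record shows up, or discarded once we know the transaction
never completed.

/*
 * Minimal sketch of replay-side accumulation for "subsidiary records
 * before the commit".  Entirely hypothetical; it only illustrates the
 * bookkeeping such a scheme adds to WAL replay.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct PendingPayload
{
    TransactionId   xid;        /* top-level transaction */
    size_t          nsubxacts;  /* subxact XIDs accumulated so far */
    TransactionId  *subxacts;
    struct PendingPayload *next;
} PendingPayload;

static PendingPayload *pending = NULL;

/* Replay of a subsidiary record: stash its payload until commit/abort. */
static void
replay_subxact_chunk(TransactionId xid, const TransactionId *xids, size_t n)
{
    PendingPayload *p;

    for (p = pending; p != NULL; p = p->next)
        if (p->xid == xid)
            break;
    if (p == NULL)
    {
        p = calloc(1, sizeof(PendingPayload));
        p->xid = xid;
        p->next = pending;
        pending = p;
    }
    p->subxacts = realloc(p->subxacts,
                          (p->nsubxacts + n) * sizeof(TransactionId));
    for (size_t i = 0; i < n; i++)
        p->subxacts[p->nsubxacts++] = xids[i];
}

/* Replay of the commit record: consume the accumulated payload. */
static void
replay_commit(TransactionId xid)
{
    PendingPayload **prev;

    for (prev = &pending; *prev != NULL; prev = &(*prev)->next)
    {
        if ((*prev)->xid == xid)
        {
            PendingPayload *p = *prev;

            printf("xid %u committed with %zu accumulated subxacts\n",
                   (unsigned) xid, p->nsubxacts);
            *prev = p->next;
            free(p->subxacts);
            free(p);
            return;
        }
    }
    printf("xid %u committed with no pending payload\n", (unsigned) xid);
}

int
main(void)
{
    TransactionId chunk1[] = {101, 102};
    TransactionId chunk2[] = {103};

    replay_subxact_chunk(100, chunk1, 2);
    replay_subxact_chunk(100, chunk2, 1);
    replay_commit(100);

    /*
     * A transaction that emits chunks but never commits leaves an entry
     * behind unless replay has some rule for expiring it.
     */
    return 0;
}

The open question, i.e. when the replayer may forget entries for
transactions that never reach a commit record, is exactly what the
incomplete-action machinery had to answer, which is the complexity I'd
like to avoid.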
> Also, writing those records afterwards
> increases the risk of a post-commit failure, which is a bad thing.
Well, most of those records could be written outside of a critical
section, possibly just FATALing out on failure. That beats PANICing.
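For anyone following along: inside a WAL critical section any ERROR is
escalated to PANIC and takes the whole cluster down, while work done
outside a critical section can fail with ERROR or FATAL and only costs
that one backend. A rough sketch of the shape being discussed, using
the real critical-section macros but entirely hypothetical helper
functions:

/*
 * Rough sketch only.  START_CRIT_SECTION/END_CRIT_SECTION and ereport()
 * are the real mechanisms; the two write_* helpers are hypothetical.
 */
#include "postgres.h"
#include "miscadmin.h"

extern void write_minimal_commit_record(void);   /* hypothetical */
extern bool write_dropped_file_records(void);    /* hypothetical */

static void
sketch_record_commit(void)
{
    /*
     * The small, atomic commit record has to be written inside a
     * critical section; any failure here escalates to PANIC.
     */
    START_CRIT_SECTION();
    write_minimal_commit_record();
    END_CRIT_SECTION();

    /*
     * The bulky follow-up records (e.g. the list of files to unlink)
     * could be written afterwards, outside the critical section.  A
     * failure here can be reported as FATAL, losing this backend but
     * not the cluster.
     */
    if (!write_dropped_file_records())
        ereport(FATAL,
                (errmsg("could not log dropped files after commit")));
}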
Greetings,
Andres Freund
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services