Re: Cost of XLogInsert CRC calculations - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: Cost of XLogInsert CRC calculations
Msg-id: 1117581552.3844.797.camel@localhost.localdomain
In response to: Re: Cost of XLogInsert CRC calculations (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Tue, 2005-05-31 at 12:27 -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >>> Is the random WAL data really the concern? It seems like a more reliable way
> >>> of dealing with that would be to just accompany every WAL entry with a
> >>> sequential id and stop when the next id isn't the correct one.
> >>
> >> We do that, too (the xl_prev links and page header addresses serve that
> >> purpose). But it's not sufficient given that WAL records can span pages
> >> and therefore may be incompletely written.
>
> Actually, on reviewing the code I notice two missed bets here.
>
> 1. During WAL replay, we aren't actually verifying that xl_prev matches
> the address of the prior WAL record. This means we are depending only
> on the page header addresses to make sure we don't replay stale WAL data
> left over from the previous cycle of use of the physical WAL file. This
> is fairly dangerous, considering the likelihood of partial write of a
> WAL page during a power failure: the first 512-byte sector(s) of a page
> may have been updated but not the rest. If an old WAL record happens to
> start at exactly the sector boundary then we lose.

Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't
used, but passed that by. Damn.

> 2. We store a separate CRC for each backup block attached to a WAL
> record. Therefore the same torn-page problem could hit us if a stale
> backup block starts exactly at an intrapage sector boundary --- there is
> nothing guaranteeing that the backup block really goes with the WAL
> record.
>
> #1 seems like a pretty critical, but also easily fixed, bug. To fix #2
> I suppose we'd have to modify the WAL format to store just one CRC
> covering the whole of a WAL record and its attached backup block(s).
>
> I think the reasoning behind the separate CRCs was to put a limit on
> the amount of data guarded by one CRC, and thereby minimize the risk
> of undetected errors.
> But using the CRCs in this way is failing to
> guard against exactly the problem that we want the CRCs to guard against
> in the first place, namely torn WAL records ... so worrying about
> detection reliability seems misplaced.
>
> The odds of an actual failure from case #2 are fortunately not high,
> since a backup block will necessarily span across at least one WAL page
> boundary and so we should be able to detect stale data by noting that
> the next page's header address is wrong. (If it's not wrong, then at
> least the first sector of the next page is up-to-date, so if there is
> any tearing the CRC should tell us.) Therefore I don't feel any need
> to try to work out a back-patchable solution for #2. But I do think we
> ought to change the WAL format going forward to compute just one CRC
> across a WAL record and all attached backup blocks. There was talk of
> allowing compression of backup blocks, and if we do that we could no
> longer feel any certainty that a page crossing would occur.
>
> Thoughts?

PreAllocXLog was already a reason to have somebody prepare new xlog
files ahead of them being used.

Surely the right solution here is to have that agent prepare
fresh/zeroed files prior to them being required. That way no stale data
can ever occur and both of these bugs go away....

Fixing that can be backpatched so that the backend that switches files
can do the work, rather than bgwriter [ or ?].

Best Regards, Simon Riggs