Thread: Adding new flags to XLogRecord
I'd like to add some new flag bits to XLogRecord. (xlog.h)

Where? xl_prev.

xl_prev is an XLogRecPtr which points backwards to the immediately preceding WAL record. All of the bits are currently used, but I have some observations and a proposal to change that.

We currently compare the whole xl_prev value against the whole XLogRecPtr of the last WAL record. When we are reading back WAL, if a WAL record is valid the xlogid portion of the value seldom differs by more than +1 from the pointer of the current record, since that would imply an xlog record of more than 4GB. If it is incorrect, it will either be garbage or occasionally be a previously valid value from two checkpoints back, before this file was reused.

So we probably don't need to compare the whole of xl_prev against the whole of the last WAL record pointer; we can probably avoid comparing some of the high bits, since the range of valid values is so limited.

How many bits?

checkpoint_segments is limited to INT_MAX, which means the xlogid increase of a single checkpoint is always at most INT_MAX/255. That means that the xl_prev value cannot differ by more than 2 * INT_MAX/255 across two checkpoints. (I make that 134 Petabytes). Alternatively, the checkpoint_timeout is one hour. So we're OK until systems can write WAL at 67 Petabytes/hour.

Which means if

* we never get WAL records of more than 67 Petabytes in size *and*
* the lowest 25 bits of xl_prev do not match the position of the last WAL record

then the XLogRecord is invalid, no matter what the value of the highest 7 bits of xl_prev.

So I would like to propose that we ignore the top 4 bits in xl_prev.xlogid when comparing values, rather than using all 32 bits for comparison. That then frees up 4 new flag bits on XLogRecords. Changing xl_prev handling is only required in 3 places, all in xlog.c, plus some log outputs.

I would simply document the limitation of WAL record sizes. Putting code in for that would be pointless since the test would last years on current systems. (We wouldn't need dtrace to measure the WALInsertLock hold time, we could use tree rings. :-)

These values would vary if we allow XLOG_SEG_SIZE higher than 16MB, but we should probably limit checkpoint_segments according to the setting of XLOG_SEG_SIZE anyhow.

Thoughts?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
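The comparison proposed above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: RecPtrSketch is a simplified stand-in for XLogRecPtr (an xlogid plus a byte offset), and XLP_FLAG_BITS, XLP_XLOGID_MASK, xl_prev_matches, and xl_prev_flags are hypothetical names that do not come from xlog.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr: log file id plus byte offset. */
typedef struct
{
    uint32_t xlogid;   /* high part; the proposal borrows its top bits */
    uint32_t xrecoff;  /* byte offset within the log file */
} RecPtrSketch;

/* Hypothetical constants: 4 borrowed bits, leaving 28 bits of xlogid to compare. */
#define XLP_FLAG_BITS   4
#define XLP_XLOGID_MASK (UINT32_MAX >> XLP_FLAG_BITS)

/* Compare xl_prev with the last record's position, ignoring the top 4 bits of xlogid. */
bool
xl_prev_matches(RecPtrSketch xl_prev, RecPtrSketch last_rec)
{
    return (xl_prev.xlogid & XLP_XLOGID_MASK) == (last_rec.xlogid & XLP_XLOGID_MASK)
        && xl_prev.xrecoff == last_rec.xrecoff;
}

/* Extract the 4 flag bits carried in the top of xlogid. */
uint32_t
xl_prev_flags(RecPtrSketch xl_prev)
{
    return xl_prev.xlogid >> (32 - XLP_FLAG_BITS);
}
```

With this shape, a record whose xl_prev carries flag bits in the top of xlogid still validates against the unflagged position of the previous record, while any mismatch in the low 28 bits of xlogid or in the offset is still rejected.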
Simon Riggs wrote:

> I'd like to add some new flag bits to XLogRecord. (xlog.h)
>
> Where? xl_prev.

I'm more curious about "What?" and "Why?", actually ;-)

> So I would like to propose that we ignore the top 4 bits in
> xl_prev.xlogid when comparing values, rather than using all 32 bits for
> comparison. That then frees up 4 new flag bits on XLogRecords. Changing
> xl_prev handling is only required in 3 places, all in xlog.c, plus some
> log outputs.

Or, we could store only the delta between current record and the previous one. Assuming we know what the current record is, that wouldn't lose any precision. That way xl_prev only needs to be as big as the biggest possible WAL record we can have.

Not that I think the precision in your scheme isn't enough, but I find the delta easier to comprehend.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2008-09-18 at 13:56 +0300, Heikki Linnakangas wrote:

> Simon Riggs wrote:
> > I'd like to add some new flag bits to XLogRecord. (xlog.h)
> >
> > Where? xl_prev.
>
> I'm more curious about "What?" and "Why?", actually ;-)

Just trying to solve the egg/chicken problem of "I want to add a flag"; "but there are no flags available". I'm sure there are a few uses coming up in synch replication also.

I want two flags for Hot Standby, but let's justify them on another post.

> > So I would like to propose that we ignore the top 4 bits in
> > xl_prev.xlogid when comparing values, rather than using all 32 bits for
> > comparison. That then frees up 4 new flag bits on XLogRecords. Changing
> > xl_prev handling is only required in 3 places, all in xlog.c, plus some
> > log outputs.
>
> Or, we could store only the delta between current record and the
> previous one. Assuming we know what the current record is, that wouldn't
> lose any precision. That way xl_prev only needs to be as big as the
> biggest possible WAL record we can have.
>
> Not that I think the precision in your scheme isn't enough, but I find
> the delta easier to comprehend.

That can't work, I know, that was my first thought.

The files are reused, so xl_prev protects against reusing a file and then having a perfectly valid WAL record *after* the correct end of WAL that makes it look like WAL continues. So a delta could be valid data even though the record was invalid.

We don't want to zero the files because that costs too much, so we have to allow for seemingly correct data as well as correct data in WAL.

I think the xl_prev field could be removed completely when streaming, except the new flags of course.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Why not just add a new bitfield for flags if we need them? I'm usually the one worried about data density so perhaps I should be on the other side of the fence here but I'm not sure. The conventional wisdom is that WAL bandwidth is not a major issue.

greg

On 18 Sep 2008, at 12:15, Simon Riggs <simon@2ndQuadrant.com> wrote:

> [full quote of the previous message trimmed]
On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:

> Why not just add a new bitfield for flags if we need them? I'm usually
> the one worried about data density so perhaps I should be on the other
> side of the fence here but I'm not sure. The conventional wisdom is
> that wal bandwidth is not a major issue.

In some cases, but my wish is also to minimise WAL volume as much as possible.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:
>> Why not just add a new bitfield for flags if we need them? I'm usually
>> the one worried about data density so perhaps I should be on the other
>> side of the fence here but I'm not sure. The conventional wisdom is
>> that wal bandwidth is not a major issue.

> In some cases, but my wish is also to minimise WAL volume as much as
> possible.

I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

You still haven't answered the question of what you need four more bits for (and why four more is all that anyone will ever need --- unless you can prove that, we might as well just add another flag field).

regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Or, we could store only the delta between current record and the
> previous one. Assuming we know what the current record is, that wouldn't
> lose any precision. That way xl_prev only needs to be as big as the
> biggest possible WAL record we can have.

The trouble with either approach is that it discards forensic intelligence in the name of bit squeezing. The high bits of xl_prev are the only direct evidence *within* a WAL record of where it thinks it is in the WAL sequence, and the back-comparison against where we thought the previous record was is correspondingly the only really strong protection against a torn page problem within a WAL page, should the sector boundary happen to fall exactly at a WAL record boundary.

I fear that a delta would be completely unacceptable for that check, because it's entirely possible that a lot of different WAL records would be the same size (consider bulk load into a fixed-width table for an example).

Simon's scheme merely removes some of the protection, not all of it ;-). But I don't really like removing any of it. If we need more bits in WAL headers, then so be it --- they'd likely still be smaller than they were a couple releases ago.

regards, tom lane
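To illustrate the objection to a delta, here is a rough C sketch comparing the two checks. It assumes WAL positions can be modelled as flat 64-bit byte offsets, which is a simplification of the two-field XLogRecPtr, and the function names absolute_check and delta_check are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Absolute check: the record stores the full position of its predecessor. */
bool
absolute_check(uint64_t stored_prev, uint64_t actual_prev)
{
    return stored_prev == actual_prev;
}

/* Delta check: the record stores only the distance back to its predecessor. */
bool
delta_check(uint32_t stored_delta, uint64_t cur_pos, uint64_t actual_prev)
{
    return cur_pos - stored_delta == actual_prev;
}
```

Consider a recycled segment that still holds a stale 64-byte record at the same in-file offset where the reader now expects the next record, with the genuine previous record also 64 bytes long. The stale record's delta of 64 satisfies delta_check, so WAL appears to continue, whereas its stored absolute pointer refers to the segment's old position in the WAL sequence and fails absolute_check.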
I wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>> In some cases, but my wish is also to minimise WAL volume as much as
>> possible.

> I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

Wait a minute ... why are we even having this conversation? XLogRecord has at least two entirely-wasted bytes right now, due to alignment. It is entirely not sane to consider messing with xl_prev in preference to using that space.

regards, tom lane
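The alignment point can be checked with a small C sketch. The struct below is a hypothetical header layout, not a copy of xlog.h: only xl_prev and the existence of flag bits are named in this thread, and the other field names and widths are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} RecPtrSketch;

/* Hypothetical record header; field names other than xl_prev are assumed. */
typedef struct
{
    uint32_t     xl_crc;      /* CRC for this record */
    RecPtrSketch xl_prev;     /* pointer to previous record */
    uint32_t     xl_xid;      /* transaction id */
    uint32_t     xl_tot_len;  /* total length of the record */
    uint32_t     xl_len;      /* length of resource-manager data */
    uint8_t      xl_info;     /* existing flag bits */
    uint8_t      xl_rmid;     /* resource manager id */
    /* the compiler pads the struct out to a 4-byte multiple here */
} RecordHeaderSketch;

/* Bytes between the end of the last field and the end of the struct. */
size_t
header_tail_padding(void)
{
    size_t used = offsetof(RecordHeaderSketch, xl_rmid) + sizeof(uint8_t);
    return sizeof(RecordHeaderSketch) - used;
}
```

On common ABIs where uint32_t is 4-byte aligned, the fields occupy 26 bytes and the struct is padded to 28, so two trailing bytes are unused and could hold a new flag field without growing the header.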
On Thu, 2008-09-18 at 08:38 -0400, Tom Lane wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:
> >> Why not just add a new bitfield for flags if we need them? I'm usually
> >> the one worried about data density so perhaps I should be on the other
> >> side of the fence here but I'm not sure. The conventional wisdom is
> >> that wal bandwidth is not a major issue.
>
> > In some cases, but my wish is also to minimise WAL volume as much as
> > possible.
>
> I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

I'm easy, just trying to save bytes.

> You still haven't answered the question of what you need four more bits
> for (and why four more is all that anyone will ever need --- unless you
> can prove that, we might as well just add another flag field).

Just trying to avoid having the same info on multiple threads.

I would like to use 2 flag bits for Hot Standby:

* flag1 to indicate the first WAL record for an xid. This allows Hot Standby to efficiently manage a recovery snapshot.

* flag2 to indicate the first WAL record for a subxid. This will indicate that there is another 4 byte xid at the end of the struct and before the backup blocks that holds the parent xid for this subxid. That allows us to maintain subtrans during Hot Standby.

If we do this the suggested way it will add X bytes for flag bits (2?) to every WAL record, plus 4 bytes to every initial WAL record in a subtransaction.

There will also be occasional WAL records to avoid race conditions, as described in the thread on transaction snapshots and hot standby, though those are rare. Some other WAL records may need to be written in other less common paths, which will be individually justified when I come to them.

The two bits I need will be set to zero in xl_prev until we have written 4x10^18 bytes of WAL, or 4000 Petabytes. So we're not really ever likely to witness them set in our lifetimes for diagnostic purposes.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support
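The 4x10^18 figure can be sanity-checked with a short calculation. The sketch below assumes the default 16MB XLOG_SEG_SIZE mentioned earlier in the thread and 255 segments per xlogid value (the divisor used in the INT_MAX/255 estimate); the function name is hypothetical.

```c
#include <stdint.h>

#define SEG_SIZE_BYTES   (16ULL * 1024 * 1024)  /* assumed XLOG_SEG_SIZE of 16MB */
#define SEGS_PER_XLOGID  255ULL                 /* assumed segments per xlogid value */

/*
 * WAL bytes written before any of the top `nbits` bits of a 32-bit xlogid
 * become nonzero. Valid for nbits in the range 1..32.
 */
uint64_t
wal_bytes_before_top_bits_set(unsigned nbits)
{
    uint64_t first_xlogid = 1ULL << (32 - nbits);
    return first_xlogid * SEGS_PER_XLOGID * SEG_SIZE_BYTES;
}
```

Under those assumptions, calling this with nbits = 2 gives 255 * 2^54 bytes, a little over 4.5x10^18, which is consistent with the rough estimate in the message above.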
On Thu, 2008-09-18 at 09:57 -0400, Tom Lane wrote:

> I wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> >> In some cases, but my wish is also to minimise WAL volume as much as
> >> possible.
>
> > I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.
>
> Wait a minute ... why are we even having this conversation? XLogRecord
> has at least two entirely-wasted bytes right now, due to alignment.
> It is entirely not sane to consider messing with xl_prev in preference
> to using that space.

OK, two bytes it is then.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support