Thread: Adding new flags to XLogRecord

Adding new flags to XLogRecord

From
Simon Riggs
Date:
I'd like to add some new flag bits to XLogRecord. (xlog.h)

Where? xl_prev.

xl_prev is an XLogRecPtr which points backwards to the immediately
preceding WAL record. All of the bits are currently used, but I have
some observations and a proposal to change that.

We currently compare the whole xl_prev value against the whole
XLogRecPtr of the last WAL record.

When we are reading back WAL, if a WAL record is valid the xlogid
portion of the value seldom differs by more than +1 from the pointer of
the current record, since that would imply an xlog record of more than 4GB.
If it is incorrect, it will either be garbage or occasionally be a
previously valid value but from two prior checkpoints back before this
file was reused.

So we probably don't need to compare the whole of xl_prev against the
whole of the last WAL record pointer; we can probably avoid comparing
some of the high bits, since the range of valid values is so limited.
How many bits?

checkpoint_segments is limited to INT_MAX, which means the xlogid
increase of a single checkpoint is always at most INT_MAX/255. That
means that the xl_prev value cannot differ by more than 2* INT_MAX/255
across two checkpoints. (I make that 134 Petabytes). Alternatively, the
checkpoint_timeout is one hour. So we're OK until systems can write WAL
at 67 Petabytes/hour. 

Which means if
* we never get WAL records of more than 67 Petabytes in size *and* 
* the lowest 25 bits of xl_prev do not match the position of the last
WAL record 
then the XLogRecord is invalid, no matter what the value of the highest
7 bits of xl_prev. 

So I would like to propose that we ignore the top 4 bits in
xl_prev.xlogid when comparing values, rather than using all 32 bits for
comparison. That then frees up 4 new flag bits on XLogRecords. Changing
xl_prev handling is only required in 3 places, all in xlog.c, plus some
log outputs.
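
A minimal sketch of what such a masked comparison might look like. The
two-field XLogRecPtr mirrors the xlogid/xrecoff layout referred to above;
the macro and function names (XLP_PREV_FLAG_BITS, XLP_PREV_XLOGID_MASK,
xl_prev_matches, xl_prev_flags) are hypothetical and only illustrate the
idea, they are not taken from xlog.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two-part WAL pointer: log file id plus byte offset. */
typedef struct XLogRecPtr
{
    uint32_t xlogid;    /* log file number */
    uint32_t xrecoff;   /* byte offset within the log file */
} XLogRecPtr;

/* Hypothetical: the top 4 bits of xl_prev.xlogid would carry flags. */
#define XLP_PREV_FLAG_BITS   4
#define XLP_PREV_XLOGID_MASK (UINT32_MAX >> XLP_PREV_FLAG_BITS)

/*
 * Compare a record's xl_prev with the position of the record we last
 * read, ignoring the bits reserved for flags.
 */
static bool
xl_prev_matches(XLogRecPtr xl_prev, XLogRecPtr expected)
{
    return ((xl_prev.xlogid ^ expected.xlogid) & XLP_PREV_XLOGID_MASK) == 0 &&
           xl_prev.xrecoff == expected.xrecoff;
}

/* Extract the flag bits stored in the top of xl_prev.xlogid. */
static unsigned
xl_prev_flags(XLogRecPtr xl_prev)
{
    return xl_prev.xlogid >> (32 - XLP_PREV_FLAG_BITS);
}
```

Under this sketch, a stored xl_prev whose low 28 bits of xlogid and whole
xrecoff agree with the expected pointer is accepted regardless of the flag
bits, while a stale pointer that differs in the low bits is still rejected.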

I would simply document the limitation of WAL record sizes. Putting code
in for that would be pointless since the test would last years on
current systems. (We wouldn't need dtrace to measure the WALInsertLock
hold time, we could use tree rings.:-)

These values would vary if we allow XLOG_SEG_SIZE higher than 16MB, but
we should probably limit checkpoint_segments according to the setting of
XLOG_SEG_SIZE anyhow.

Thoughts?

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Adding new flags to XLogRecord

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> I'd like to add some new flag bits to XLogRecord. (xlog.h)
> 
> Where? xl_prev.

I'm more curious about "What?" and "Why?", actually ;-)

> So I would like to propose that we ignore the top 4 bits in
> xl_prev.xlogid when comparing values, rather than using all 32 bits for
> comparison. That then frees up 4 new flag bits on XLogRecords. Changing
> xl_prev handling is only required in 3 places, all in xlog.c, plus some
> log outputs.

Or, we could store only the delta between current record and the 
previous one. Assuming we know what the current record is, that wouldn't 
lose any precision. That way xl_prev only needs to be as big as the 
biggest possible WAL record we can have.

Not that I think the precision in your scheme isn't enough, but I find 
the delta easier to comprehend.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Adding new flags to XLogRecord

From
Simon Riggs
Date:
On Thu, 2008-09-18 at 13:56 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I'd like to add some new flag bits to XLogRecord. (xlog.h)
> > 
> > Where? xl_prev.
> 
> I'm more curious about "What?" and "Why?", actually ;-)

Just trying to solve the egg/chicken problem of "I want to add a flag";
"but there are no flags available". I'm sure there are a few uses coming
up in synch replication also.

I want two flags for Hot Standby, but let's justify them on another post.

> > So I would like to propose that we ignore the top 4 bits in
> > xl_prev.xlogid when comparing values, rather than using all 32 bits for
> > comparison. That then frees up 4 new flag bits on XLogRecords. Changing
> > xl_prev handling is only required in 3 places, all in xlog.c, plus some
> > log outputs.
> 
> Or, we could store only the delta between current record and the 
> previous one. Assuming we know what the current record is, that wouldn't 
> lose any precision. That way xl_prev only needs to be as big as the 
> biggest possible WAL record we can have.
> 
> Not that I think the precision in your scheme isn't enough, but I find 
> the delta easier to comprehend.

That can't work, I know; that was my first thought.

The files are reused, so xl_prev protects against reusing a file and
then having a perfectly valid WAL record *after* the correct end of WAL
that makes it look like WAL continues. So a delta could be valid data
even though the record was invalid.

We don't want to zero the files because that costs too much, so we have to
allow for seemingly correct data as well as correct data in WAL.

I think the xl_prev field could be removed completely when streaming,
except the new flags of course.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Adding new flags to XLogRecord

From
Greg Stark
Date:
Why not just add a new bitfield for flags if we need them? I'm usually
the one worried about data density so perhaps I should be on the other
side of the fence here but I'm not sure. The conventional wisdom is
that WAL bandwidth is not a major issue.

greg

On 18 Sep 2008, at 12:15, Simon Riggs <simon@2ndQuadrant.com> wrote:

>
> On Thu, 2008-09-18 at 13:56 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I'd like to add some new flag bits to XLogRecord. (xlog.h)
>>>
>>> Where? xl_prev.
>>
>> I'm more curious about "What?" and "Why?", actually ;-)
>
> Just trying to solve the egg/chicken problem of "I want to add a flag";
> "but there are no flags available". I'm sure there are a few uses coming
> up in synch replication also.
>
> I want two flags for Hot Standby, but let's justify them on another post.
>
>>> So I would like to propose that we ignore the top 4 bits in
>>> xl_prev.xlogid when comparing values, rather than using all 32 bits for
>>> comparison. That then frees up 4 new flag bits on XLogRecords. Changing
>>> xl_prev handling is only required in 3 places, all in xlog.c, plus some
>>> log outputs.
>>
>> Or, we could store only the delta between current record and the
>> previous one. Assuming we know what the current record is, that wouldn't
>> lose any precision. That way xl_prev only needs to be as big as the
>> biggest possible WAL record we can have.
>>
>> Not that I think the precision in your scheme isn't enough, but I find
>> the delta easier to comprehend.
>
> That can't work, I know; that was my first thought.
>
> The files are reused, so xl_prev protects against reusing a file and
> then having a perfectly valid WAL record *after* the correct end of WAL
> that makes it look like WAL continues. So a delta could be valid data
> even though the record was invalid.
>
> We don't want to zero the files because that costs too much, so we have to
> allow for seemingly correct data as well as correct data in WAL.
>
> I think the xl_prev field could be removed completely when streaming,
> except the new flags of course.
>
> -- 
> Simon Riggs           www.2ndQuadrant.com
> PostgreSQL Training, Services and Support


Re: Adding new flags to XLogRecord

From
Simon Riggs
Date:
On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:

> Why not just add a new bitfield for flags if we need them? I'm usually
> the one worried about data density so perhaps I should be on the other
> side of the fence here but I'm not sure. The conventional wisdom is
> that WAL bandwidth is not a major issue.

In some cases, but my wish is also to minimise WAL volume as much as
possible.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Adding new flags to XLogRecord

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:
>> Why not just add a new bitfield for flags if we need them? I'm usually
>> the one worried about data density so perhaps I should be on the other
>> side of the fence here but I'm not sure. The conventional wisdom is
>> that WAL bandwidth is not a major issue.

> In some cases, but my wish is also to minimise WAL volume as much as
> possible.

I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

You still haven't answered the question of what you need four more bits
for (and why four more is all that anyone will ever need --- unless you
can prove that, we might as well just add another flag field).
        regards, tom lane


Re: Adding new flags to XLogRecord

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Or, we could store only the delta between current record and the 
> previous one. Assuming we know what the current record is, that wouldn't 
> lose any precision. That way xl_prev only needs to be as big as the 
> biggest possible WAL record we can have.

The trouble with either approach is that it discards forensic
intelligence in the name of bit squeezing.  The high bits of xl_prev are
the only direct evidence *within* a WAL record of where it thinks it is
in the WAL sequence, and the back-comparison against where we thought
the previous record was is correspondingly the only really strong
protection against a torn page problem within a WAL page, should the
sector boundary happen to fall exactly at a WAL record boundary.

I fear that a delta would be completely unacceptable for that check,
because it's entirely possible that a lot of different WAL records would
be the same size (consider bulk load into a fixed-width table for an
example).  Simon's scheme merely removes some of the protection, not all
of it ;-).
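
To make the difference concrete, here is a small sketch of the two checks
side by side. It assumes a simplified single-integer WAL position (WalPos)
instead of the real two-part pointer, and the function names are
illustrative only. The point is that a stale record left in a recycled
file carries the old absolute position, which no longer matches, whereas
its delta still matches whenever the neighbouring records happen to have
the same size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a flat 64-bit WAL position, purely for illustration. */
typedef uint64_t WalPos;

/* Absolute scheme: the record stores the position of its predecessor. */
static bool
check_absolute(WalPos stored_prev, WalPos actual_prev)
{
    return stored_prev == actual_prev;
}

/* Delta scheme: the record stores only the distance back to its predecessor. */
static bool
check_delta(uint32_t stored_delta, WalPos current, WalPos actual_prev)
{
    return (WalPos) stored_delta == current - actual_prev;
}
```

For example, a leftover record written at position 0x1000100 with
predecessor 0x1000000 stores a delta of 0x100. If the file is reused and
the reader arrives at position 0x5000100 having just read a record at
0x5000000, the delta check passes on the stale record while the absolute
check fails, which is the protection being discussed.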

But I don't really like removing any of it.  If we need more bits in WAL
headers, then so be it --- they'd likely still be smaller than they were
a couple releases ago.
        regards, tom lane


Re: Adding new flags to XLogRecord

From
Tom Lane
Date:
I wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> In some cases, but my wish is also to minimise WAL volume as much as
>> possible.

> I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

Wait a minute ... why are we even having this conversation?  XLogRecord
has at least two entirely-wasted bytes right now, due to alignment.
It is entirely not sane to consider messing with xl_prev in preference
to using that space.
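
The padding can be illustrated with a stand-in struct. The layout below is
an approximation of the record header under discussion, written with
fixed-width types; the field list and the names XLogRecordSketch,
XLogRecordWithFlags, xl_flags and trailing_padding are assumptions for
this sketch and should be checked against xlog.h before relying on them.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the two-part WAL pointer. */
typedef struct XLogRecPtr
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;

/* Approximate header layout: two 1-byte fields end the struct. */
typedef struct XLogRecordSketch
{
    uint32_t   xl_crc;      /* CRC for this record */
    XLogRecPtr xl_prev;     /* pointer to previous record */
    uint32_t   xl_xid;      /* xact id */
    uint32_t   xl_tot_len;  /* total length of entire record */
    uint32_t   xl_len;      /* length of resource manager data */
    uint8_t    xl_info;     /* existing flag bits */
    uint8_t    xl_rmid;     /* resource manager id */
    /* the compiler pads here up to the struct's alignment */
} XLogRecordSketch;

/* Same layout with a hypothetical 16-bit flags field in the padding. */
typedef struct XLogRecordWithFlags
{
    uint32_t   xl_crc;
    XLogRecPtr xl_prev;
    uint32_t   xl_xid;
    uint32_t   xl_tot_len;
    uint32_t   xl_len;
    uint8_t    xl_info;
    uint8_t    xl_rmid;
    uint16_t   xl_flags;    /* hypothetical new field */
} XLogRecordWithFlags;

/* Bytes between the end of the last field and the end of the struct. */
static size_t
trailing_padding(void)
{
    return sizeof(XLogRecordSketch) -
           (offsetof(XLogRecordSketch, xl_rmid) + sizeof(uint8_t));
}
```

On platforms where uint32_t is 4-byte aligned, trailing_padding() should
report 2 and both structs should have the same size, so a 16-bit flags
field would not grow the header in this sketch.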
        regards, tom lane


Re: Adding new flags to XLogRecord

From
Simon Riggs
Date:
On Thu, 2008-09-18 at 08:38 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Thu, 2008-09-18 at 12:40 +0100, Greg Stark wrote:
> >> Why not just add a new bitfield for flags if we need them? I'm usually
> >> the one worried about data density so perhaps I should be on the other
> >> side of the fence here but I'm not sure. The conventional wisdom is
> >> that WAL bandwidth is not a major issue.
> 
> > In some cases, but my wish is also to minimise WAL volume as much as
> > possible.
> 
> I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.

I'm easy, just trying to save bytes.

> You still haven't answered the question of what you need four more bits
> for (and why four more is all that anyone will ever need --- unless you
> can prove that, we might as well just add another flag field).

Just trying to avoid having the same info on multiple threads.

I would like to use 2 flag bits for Hot Standby:

* flag1 to indicate the first WAL record for an xid. This allows Hot
Standby to efficiently manage a recovery snapshot.
* flag2 to indicate the first WAL record for a subxid. This will
indicate that there is another 4-byte xid at the end of the struct and
before the backup blocks that holds the parentxid for this subxid. That
allows us to maintain subtrans during Hot Standby.

If we do this the suggested way it will add X bytes for flag bits (2?)
to every WAL record, plus 4 bytes to every initial WAL record in a
subtransaction. 
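
A rough sketch of how the two flags and the optional parent xid could be
expressed. Every name here (XLR_FIRST_XID_RECORD, XLR_FIRST_SUBXID_RECORD,
extra_header_bytes) is hypothetical, and the 16-bit flags width is just
one possible answer to the "X bytes" question above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical flag bits for a new per-record flags field. */
#define XLR_FIRST_XID_RECORD    0x0001  /* first WAL record for an xid */
#define XLR_FIRST_SUBXID_RECORD 0x0002  /* first WAL record for a subxid */

/*
 * Extra bytes carried by a record beyond the fixed header: only the
 * first record of a subtransaction carries the 4-byte parent xid.
 */
static size_t
extra_header_bytes(uint16_t flags)
{
    return (flags & XLR_FIRST_SUBXID_RECORD) ? sizeof(TransactionId) : 0;
}
```

With this shape, ordinary records pay only for the flags field itself,
and the 4-byte parent xid appears once per subtransaction.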

There will also be occasional WAL records to avoid race conditions, as
described in the thread on transaction snapshots and hot standby, though
those are rare. There are also some other WAL records that may need to
be written in other less common paths, which will be individually
justified when I come to them.


The two bits I need will be set to zero in xl_prev until we have written
4x10^18 bytes of WAL, or 4000 Petabytes. So we're not really ever likely
to witness them set in our lifetimes for diagnostic purposes.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Adding new flags to XLogRecord

From
Simon Riggs
Date:
On Thu, 2008-09-18 at 09:57 -0400, Tom Lane wrote:
> I wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> >> In some cases, but my wish is also to minimise WAL volume as much as
> >> possible.
> 
> > I'm with Greg on this one: baroque bit-squeezing schemes are a bad idea.
> 
> Wait a minute ... why are we even having this conversation?  XLogRecord
> has at least two entirely-wasted bytes right now, due to alignment.
> It is entirely not sane to consider messing with xl_prev in preference
> to using that space.

OK, two bytes it is then. 

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support