Thread: Reducing size of WAL record headers

Reducing size of WAL record headers

From

Simon Riggs

Date:

09 January 2013, 23:36:14

Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
waste 4 bytes per record. Or put another way, if we could reduce
record header by 4 bytes, we would actually reduce it by 8 bytes per
record. So looking for ways to do that seems like a good idea.
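
The alignment arithmetic above can be sketched as follows. This is only an
illustration: the 28-byte unpadded header size is an assumption inferred from
the "waste 4 bytes" figure, and maxalign() is a hypothetical Python stand-in
for the MAXALIGN macro on a platform with 8-byte alignment.

```python
MAXIMUM_ALIGNOF = 8  # assumed 8-byte alignment, as in the text above


def maxalign(length: int, align: int = MAXIMUM_ALIGNOF) -> int:
    """Round length up to the next multiple of align (align is a power of two)."""
    return (length + align - 1) & ~(align - 1)


HEADER_BYTES = 28  # assumed unpadded header size (illustrative figure)

padded_now = maxalign(HEADER_BYTES)        # header as stored today
padded_after = maxalign(HEADER_BYTES - 4)  # header with one 4-byte field removed
saving = padded_now - padded_after         # bytes saved per record

print(padded_now, padded_after, saving)    # prints: 32 24 8
```

Removing 4 bytes from the fields drops the padded size by a full alignment
step, which is why the saving is 8 bytes rather than 4.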

The WAL record header starts with xl_tot_len, a 4 byte field. There is
also another field, xl_len. The difference is that xl_tot_len includes
the header, xl_len and any backup blocks. Since the header is fixed,
the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
backup blocks.

We can re-arrange the record layout so that we remove xl_tot_len and
add another (maxaligned) 4 byte field (--> 8 bytes) after the record
header, xl_bkpblock_len that only exists if we have backup blocks.
This will then save 8 bytes from every record that doesn't have backup
blocks, and be the same as now with backup blocks.

The only problem is that we currently allow WAL records to be written
so that the header wraps across pages. This allows us to save space in
WAL when we have between 5 and 32 bytes spare at the end of a page. To
reduce the header size by 8 bytes we would need to ensure that the
whole header, which would now be 24 or 32 bytes, is all on one page.
My math tells me that would waste on average 12 bytes per page because
of the end-of-page wastage, but would gain 8 bytes per record when we
don't have backup blocks. My thinking is that the end of page loss
would be much reduced on average when we had backup blocks, so we
could ignore that case.

Assuming typically 100 records per page when we have no backup blocks,
this is a considerable upside. We would make gains on any page with 3
or more WAL records on it, so low downside even in worst cases. That
seems like a great break-even point for optimisation.
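
As a rough check of the trade-off, the two averages quoted above (12 bytes
lost per page, 8 bytes gained per record without backup blocks) can be
combined as below. The function name is hypothetical, and the model ignores
records that carry backup blocks.

```python
AVG_PAGE_END_LOSS = 12  # average bytes lost per page (figure from the text)
SAVING_PER_RECORD = 8   # bytes saved per record without backup blocks


def net_gain_per_page(records_per_page: int) -> int:
    """Net bytes saved on one WAL page under the averages stated above."""
    return SAVING_PER_RECORD * records_per_page - AVG_PAGE_END_LOSS


for n in (1, 2, 3, 100):
    print(n, net_gain_per_page(n))
# prints:
# 1 -4
# 2 4
# 3 12
# 100 788
```

With these average figures the net is already positive at two records per
page; one possible reading of the "3 or more" threshold is that it leaves
margin for pages whose end-of-page loss is above average.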

Since we've changed the WAL format already this release, another
change seems OK. More to the point, we can remove backup blocks in the
common case without changing WAL format, so this might be the last
time we have the chance to make this change.

Forcing the XLogRecord header to be all on one page makes the format
more robust and simplifies the code that copes with header wrapping.

The format changes would mean that it's still possible to work out the
length of the WAL record precisely:
  SizeOfXLogRecord + (HasBkpBlocks ? SizeOf(uint32) : 0) + xl_len
and so it would then be protected by the WAL record CRC.

Thoughts?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Reducing size of WAL record headers

From

Heikki Linnakangas

Date:

09 January 2013, 23:54:31

On 09.01.2013 22:36, Simon Riggs wrote:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

Agreed.

> The WAL record header starts with xl_tot_len, a 4 byte field. There is
> also another field, xl_len. The difference is that xl_tot_len includes
> the header, xl_len and any backup blocks. Since the header is fixed,
> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> backup blocks.
>
> We can re-arrange the record layout so that we remove xl_tot_len and
> add another (maxaligned) 4 byte field (-->  8 bytes) after the record
> header, xl_bkpblock_len that only exists if we have backup blocks.
> This will then save 8 bytes from every record that doesn't have backup
> blocks, and be the same as now with backup blocks.

Here's a better idea:

Let's keep xl_tot_len as it is, but move xl_len to the very end of the
WAL record, after all the backup blocks. If there are no backup blocks, 
xl_len is omitted. Seems more robust to keep xl_tot_len, so that you 
require less math to figure out where one record ends and where the next 
one begins.
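
A minimal sketch of how a reader might handle this layout, assuming a 24-byte
header once xl_len is removed, a little-endian 4-byte trailing xl_len, and
8-byte MAXALIGN. The helper names and sizes are hypothetical and are not taken
from the PostgreSQL source.

```python
import struct

HEADER_BYTES = 24    # assumed header size once xl_len is removed
MAXIMUM_ALIGNOF = 8  # assumed 8-byte alignment


def maxalign(length: int) -> int:
    """Round length up to the next multiple of MAXIMUM_ALIGNOF."""
    return (length + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def rmgr_data_len(record: bytes, xl_tot_len: int, has_bkp_blocks: bool) -> int:
    """Recover the rmgr data length (xl_len) under the proposed layout."""
    if not has_bkp_blocks:
        # No trailing field: everything after the header is rmgr data.
        return xl_tot_len - HEADER_BYTES
    # With backup blocks, xl_len is stored in the last 4 bytes of the record.
    (xl_len,) = struct.unpack_from("<I", record, xl_tot_len - 4)
    return xl_len


def next_record_offset(start: int, xl_tot_len: int) -> int:
    """Finding the next record needs only xl_tot_len and the alignment rule."""
    return start + maxalign(xl_tot_len)
```

Because xl_tot_len stays at the front of the record, locating the next record
does not depend on whether backup blocks are present.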

> Forcing the XLogRecord header to be all on one page makes the format
> more robust and simplifies the code that copes with header wrapping.

-1 on that. That would essentially revert the changes I made earlier. 
The purpose of allowing the header to be wrapped was that you could 
easily calculate ahead of time exactly how much space a WAL record 
takes. My motivation for that was the XLogInsert scaling patch. Now, I 
admit I haven't had a chance to work further on that patch, so we're not 
gaining much from the format change at the moment. Nevertheless, I don't 
want us to get back to the situation that you sometimes need to add 
padding to the end of a WAL page.

My suggestion above to keep xl_tot_len and remove xl_len from XLogRecord 
doesn't have a problem with crossing page boundaries.

- Heikki

Re: Reducing size of WAL record headers

From

Bruce Momjian

Date:

09 January 2013, 23:59:06

On Wed, Jan  9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
> On 09.01.2013 22:36, Simon Riggs wrote:
> >Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> >waste 4 bytes per record. Or put another way, if we could reduce
> >record header by 4 bytes, we would actually reduce it by 8 bytes per
> >record. So looking for ways to do that seems like a good idea.
> 
> Agreed.
> 
> >The WAL record header starts with xl_tot_len, a 4 byte field. There is
> >also another field, xl_len. The difference is that xl_tot_len includes
> >the header, xl_len and any backup blocks. Since the header is fixed,
> >the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> >backup blocks.
> >
> >We can re-arrange the record layout so that we remove xl_tot_len and
> >add another (maxaligned) 4 byte field (-->  8 bytes) after the record
> >header, xl_bkpblock_len that only exists if we have backup blocks.
> >This will then save 8 bytes from every record that doesn't have backup
> >blocks, and be the same as now with backup blocks.
> 
> Here's a better idea:
> 
> Let's keep xl_tot_len as it is, but move xl_len at the very end of
> the WAL record, after all the backup blocks. If there are no backup
> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
> that you require less math to figure out where one record ends and
> where the next one begins.

OK, crazy idea, but can we just record xl_len as a difference against
xl_tot_len, and shorten the xl_len field?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Reducing size of WAL record headers

From

Heikki Linnakangas

Date:

10 January 2013, 00:02:17

On 09.01.2013 22:59, Bruce Momjian wrote:
> On Wed, Jan  9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
>> On 09.01.2013 22:36, Simon Riggs wrote:
>>> The WAL record header starts with xl_tot_len, a 4 byte field. There is
>>> also another field, xl_len. The difference is that xl_tot_len includes
>>> the header, xl_len and any backup blocks. Since the header is fixed,
>>> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
>>> backup blocks.
>>>
>>> We can re-arrange the record layout so that we remove xl_tot_len and
>>> add another (maxaligned) 4 byte field (-->   8 bytes) after the record
>>> header, xl_bkpblock_len that only exists if we have backup blocks.
>>> This will then save 8 bytes from every record that doesn't have backup
>>> blocks, and be the same as now with backup blocks.
>>
>> Here's a better idea:
>>
>> Let's keep xl_tot_len as it is, but move xl_len at the very end of
>> the WAL record, after all the backup blocks. If there are no backup
>> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
>> that you require less math to figure out where one record ends and
>> where the next one begins.
>
> OK, crazy idea, but can we just record xl_len as a difference against
> xl_tot_len, and shorten the xl_len field?

Hmm, so it would essentially be the length of all the backup blocks.
Perhaps rename it to xl_bkpblk_len.

However, that would cap the total size of backup blocks to 64k. Which 
would not be enough with 32k BLCKSZ.
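
The 64k cap can be checked with simple arithmetic. In the sketch below the
number of backup blocks per record is left as a parameter, and per-block
headers and any removal of unused space within a page are ignored; both are
simplifying assumptions, and the function name is hypothetical.

```python
UINT16_MAX = 0xFFFF  # largest value a 16-bit unsigned length field can hold


def bkp_len_fits_uint16(blcksz: int, n_blocks: int) -> bool:
    """True if n_blocks full-size page images fit in a 16-bit length field.

    Ignores per-block headers and unused-space removal (simplifications).
    """
    return blcksz * n_blocks <= UINT16_MAX


print(bkp_len_fits_uint16(8192, 4))   # prints: True  (32768 bytes)
print(bkp_len_fits_uint16(32768, 1))  # prints: True  (32768 bytes)
print(bkp_len_fits_uint16(32768, 2))  # prints: False (65536 bytes)
```

Two full 32k page images already exceed what a 16-bit field can represent,
which is the overflow described above.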

- Heikki

Re: Reducing size of WAL record headers

From

Simon Riggs

Date:

10 January 2013, 00:15:22

On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

>> OK, crazy idea, but can we just record xl_len as a difference against
>> xl_tot_len, and shorten the xl_len field?
>
>
> Hmm, so it would essentially be the length of all the backup blocks. perhaps
> rename it to xl_bkpblk_len.
>
> However, that would cap the total size of backup blocks to 64k. Which would
> not be enough with 32k BLCKSZ.

Since that requires a recompile anyway, why not make XLogRecord
smaller only for 16k BLCKSZ or less?

The problem with doing that is that xl_len is used extensively in the
_redo routines, so it's a much more invasive patch.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Reducing size of WAL record headers

From

Simon Riggs

Date:

10 January 2013, 00:17:32

On 9 January 2013 20:54, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Here's a better idea:
>
> Let's keep xl_tot_len as it is, but move xl_len at the very end of the WAL
> record, after all the backup blocks. If there are no backup blocks, xl_len
> is omitted. Seems more robust to keep xl_tot_len, so that you require less
> math to figure out where one record ends and where the next one begins.

OK, I avoided tampering with xl_len because it's so widely used. Will look.

>> Forcing the XLogRecord header to be all on one page makes the format
>> more robust and simplifies the code that copes with header wrapping.

> -1 on that. That would essentially revert the changes I made earlier.

OK, idea dropped.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Reducing size of WAL record headers

From

Bruce Momjian

Date:

10 January 2013, 00:43:34

On Wed, Jan  9, 2013 at 09:15:16PM +0000, Simon Riggs wrote:
> On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> 
> >> OK, crazy idea, but can we just record xl_len as a difference against
> >> xl_tot_len, and shorten the xl_len field?
> >
> >
> > Hmm, so it would essentially be the length of all the backup blocks. perhaps
> > rename it to xl_bkpblk_len.
> >
> > However, that would cap the total size of backup blocks to 64k. Which would
> > not be enough with 32k BLCKSZ.
> 
> Since that requires a recompile anyway, why not make XLogRecord
> smaller only for 16k BLCKSZ or less?
> 
> Problem if we do that is that xl_len is used extensively in _redo
> routines, so its a much more invasive patch.

I would just make it int16 on <=16k block size, and int32 on >16k
blocks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Reducing size of WAL record headers

From

Tom Lane

Date:

10 January 2013, 01:06:53

Simon Riggs <simon@2ndQuadrant.com> writes:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

I think this is extremely premature, in view of the ongoing discussions
about shoehorning logical replication and other kinds of data into the
WAL stream.  It seems quite likely that we'll end up eating some of
that padding space to support those features.  So whacking a lot of code
around in service of squeezing the existing padding out could very
easily end up being wasted work, in fact counterproductive if it
degrades either code readability or robustness.

Let's wait till we see where the logical rep stuff ends up before we
worry about saving 4 bytes per WAL record.
        regards, tom lane

Re: Reducing size of WAL record headers

From

Bruce Momjian

Date:

10 January 2013, 22:54:30

On Wed, Jan  9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> > waste 4 bytes per record. Or put another way, if we could reduce
> > record header by 4 bytes, we would actually reduce it by 8 bytes per
> > record. So looking for ways to do that seems like a good idea.
> 
> I think this is extremely premature, in view of the ongoing discussions
> about shoehorning logical replication and other kinds of data into the
> WAL stream.  It seems quite likely that we'll end up eating some of
> that padding space to support those features.  So whacking a lot of code
> around in service of squeezing the existing padding out could very
> easily end up being wasted work, in fact counterproductive if it
> degrades either code readability or robustness.
> 
> Let's wait till we see where the logical rep stuff ends up before we
> worry about saving 4 bytes per WAL record.

Well, we have wal_level to control the amount of WAL traffic.  It is
hard to imagine we are going to want to ship logical WAL information by
default, so most people will not be using logical WAL and would see a
benefit from an optimized WAL stream?  

What percentage of a typical WAL record is 8 bytes?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Reducing size of WAL record headers

From

Tom Lane

Date:

10 January 2013, 23:13:28

Bruce Momjian <bruce@momjian.us> writes:
> On Wed, Jan  9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>> Let's wait till we see where the logical rep stuff ends up before we
>> worry about saving 4 bytes per WAL record.

> Well, we have wal_level to control the amount of WAL traffic.

That's entirely irrelevant.  The point here is that we'll need more bits
to identify what any particular record is, unless we make a decision
that we'll have physically separate streams for logical replication
info, which doesn't sound terribly attractive; and in any case no such
decision has been made yet, AFAIK.
        regards, tom lane

Re: Reducing size of WAL record headers

From

Simon Riggs

Date:

11 January 2013, 03:15:02

On 10 January 2013 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> On Wed, Jan  9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>> Let's wait till we see where the logical rep stuff ends up before we
>>> worry about saving 4 bytes per WAL record.
>
>> Well, we have wal_level to control the amount of WAL traffic.
>
> That's entirely irrelevant.  The point here is that we'll need more bits
> to identify what any particular record is, unless we make a decision
> that we'll have physically separate streams for logical replication
> info, which doesn't sound terribly attractive; and in any case no such
> decision has been made yet, AFAIK.

You were right to say that this is less important than logical
replication. I don't need any more reason than that to stop talking
about it.

I have a patch for this, but as yet no way to submit it while at the
same time saying "put this at the back of the queue".

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Reducing size of WAL record headers

From

Jim Nasby

Date:

24 August 2013, 08:03:45

On 1/10/13 6:14 PM, Simon Riggs wrote:
> On 10 January 2013 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> On Wed, Jan  9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>>> Let's wait till we see where the logical rep stuff ends up before we
>>>> worry about saving 4 bytes per WAL record.
>>
>>> Well, we have wal_level to control the amount of WAL traffic.
>>
>> That's entirely irrelevant.  The point here is that we'll need more bits
>> to identify what any particular record is, unless we make a decision
>> that we'll have physically separate streams for logical replication
>> info, which doesn't sound terribly attractive; and in any case no such
>> decision has been made yet, AFAIK.
>
> You were right to say that this is less important than logical
> replication. I don't need any more reason than that to stop talking
> about it.
>
> I have a patch for this, but as yet no way to submit it while at the
> same time saying "put this at the back of the queue".

Anything ever come of this?
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net