Thread: Reducing size of WAL record headers
Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we waste 4 bytes per record. Or put another way, if we could reduce record header by 4 bytes, we would actually reduce it by 8 bytes per record. So looking for ways to do that seems like a good idea.

The WAL record header starts with xl_tot_len, a 4 byte field. There is also another field, xl_len. The difference is that xl_tot_len includes the header, xl_len and any backup blocks. Since the header is fixed, the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have backup blocks.

We can re-arrange the record layout so that we remove xl_tot_len and add another (maxaligned) 4 byte field (--> 8 bytes) after the record header, xl_bkpblock_len that only exists if we have backup blocks. This will then save 8 bytes from every record that doesn't have backup blocks, and be the same as now with backup blocks.

The only problem is that we currently allow WAL records to be written so that the header wraps across pages. This allows us to save space in WAL when we have between 5 and 32 bytes spare at the end of a page. To reduce the header size by 8 bytes we would need to ensure that the whole header, which would now be 24 or 32 bytes, is all on one page. My math tells me that would waste on average 12 bytes per page because of the end-of-page wastage, but would gain 8 bytes per record when we don't have backup blocks. My thinking is that the end of page loss would be much reduced on average when we had backup blocks, so we could ignore that case.

Assuming typically 100 records per page when we have no backup blocks, this is a considerable upside. We would make gains on any page with 3 or more WAL records on it, so low downside even in worst cases. That seems like a great break-even point for optimisation.

Since we've changed the WAL format already this release, another change seems OK. More to the point, we can remove backup blocks in the common case without changing WAL format, so this might be the last time we have the chance to make this change.

Forcing the XLogRecord header to be all on one page makes the format more robust and simplifies the code that copes with header wrapping.

The format changes would mean that it's still possible to work out the length of the WAL record precisely
= SizeOfXLogRecord + (HasBkpBlocks ? SizeOf(uint32) : 0) + xl_len
and so would then be protected by the WAL record CRC.

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
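The alignment and break-even arithmetic in the message above can be sketched in C. This is a minimal illustration, not PostgreSQL source: the `sketch_` names are hypothetical, the 28-byte unaligned header size is inferred from the thread's own figures (a 32-byte aligned header carrying 4 bytes of padding), and the record length formula is copied from the message as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-byte maximum alignment, as discussed in the thread. */
#define SKETCH_MAXIMUM_ALIGNOF 8

/* Round len up to the next multiple of the (power-of-two) alignment. */
static size_t sketch_maxalign(size_t len)
{
    const size_t mask = (size_t) SKETCH_MAXIMUM_ALIGNOF - 1;

    return (len + mask) & ~mask;
}

/*
 * Record length under the proposed layout, following the formula in the
 * message: header, plus an optional 4-byte backup block length field,
 * plus xl_len bytes of resource manager data.
 */
static uint32_t sketch_record_len(uint32_t hdr_size, bool has_bkp_blocks,
                                  uint32_t xl_len)
{
    return hdr_size
        + (has_bkp_blocks ? (uint32_t) sizeof(uint32_t) : 0)
        + xl_len;
}

/* Net bytes saved on one page: per-record gain minus end-of-page loss. */
static long sketch_net_saving_per_page(long records_per_page,
                                       long saved_per_record,
                                       long lost_per_page)
{
    assert(records_per_page >= 0);
    return records_per_page * saved_per_record - lost_per_page;
}
```

With the thread's figures, a 28-byte header aligns to 32 bytes while a 24-byte header stays at 24, so dropping one 4-byte field saves 8 bytes per record. At 100 records per page and an average of 12 bytes lost at the end of the page, the net gain works out to 788 bytes per page.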
On 09.01.2013 22:36, Simon Riggs wrote:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

Agreed.

> The WAL record header starts with xl_tot_len, a 4 byte field. There is
> also another field, xl_len. The difference is that xl_tot_len includes
> the header, xl_len and any backup blocks. Since the header is fixed,
> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> backup blocks.
>
> We can re-arrange the record layout so that we remove xl_tot_len and
> add another (maxaligned) 4 byte field (--> 8 bytes) after the record
> header, xl_bkpblock_len that only exists if we have backup blocks.
> This will then save 8 bytes from every record that doesn't have backup
> blocks, and be the same as now with backup blocks.

Here's a better idea:

Let's keep xl_tot_len as it is, but move xl_len at the very end of the WAL record, after all the backup blocks. If there are no backup blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so that you require less math to figure out where one record ends and where the next one begins.

> Forcing the XLogRecord header to be all on one page makes the format
> more robust and simplifies the code that copes with header wrapping.

-1 on that. That would essentially revert the changes I made earlier. The purpose of allowing the header to be wrapped was that you could easily calculate ahead of time exactly how much space a WAL record takes. My motivation for that was the XLogInsert scaling patch. Now, I admit I haven't had a chance to work further on that patch, so we're not gaining much from the format change at the moment. Nevertheless, I don't want us to get back to the situation that you sometimes need to add padding to the end of a WAL page.

My suggestion above to keep xl_tot_len and remove xl_len from XLogRecord doesn't have a problem with crossing page boundaries.

- Heikki
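One way to read the layout proposed above is sketched below in C. Everything here is an assumption made for illustration: the `sketch_` names, the 24-byte header size, the idea that xl_tot_len also counts the trailing xl_len field, and the flag telling the reader whether backup blocks are present. Page headers and records that wrap across pages are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: aligned header size once xl_len has left the header. */
#define SKETCH_HDR_SIZE 24u

/*
 * With xl_tot_len kept at the front, the next record starts at the
 * 8-byte-aligned end of this one; no other field is needed.
 */
static uint32_t sketch_next_record_offset(uint32_t rec_start,
                                          uint32_t xl_tot_len)
{
    return (rec_start + xl_tot_len + 7u) & ~7u;
}

/*
 * Length of the resource manager data.  Without backup blocks it is
 * simply the total minus the header; with backup blocks it is read
 * from the 4-byte xl_len stored at the very end of the record.
 */
static uint32_t sketch_rmgr_data_len(const unsigned char *rec,
                                     uint32_t xl_tot_len,
                                     bool has_bkp_blocks)
{
    uint32_t xl_len;

    assert(xl_tot_len >= SKETCH_HDR_SIZE);
    if (!has_bkp_blocks)
        return xl_tot_len - SKETCH_HDR_SIZE;

    /* memcpy avoids an unaligned read of the trailing field. */
    memcpy(&xl_len, rec + xl_tot_len - sizeof(xl_len), sizeof(xl_len));
    return xl_len;
}
```

The point of the sketch is that finding the next record needs only xl_tot_len, which matches the "less math" argument in the message, while xl_len is consulted only for records that carry backup blocks.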
On Wed, Jan 9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
> On 09.01.2013 22:36, Simon Riggs wrote:
> >Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> >waste 4 bytes per record. Or put another way, if we could reduce
> >record header by 4 bytes, we would actually reduce it by 8 bytes per
> >record. So looking for ways to do that seems like a good idea.
>
> Agreed.
>
> >The WAL record header starts with xl_tot_len, a 4 byte field. There is
> >also another field, xl_len. The difference is that xl_tot_len includes
> >the header, xl_len and any backup blocks. Since the header is fixed,
> >the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
> >backup blocks.
> >
> >We can re-arrange the record layout so that we remove xl_tot_len and
> >add another (maxaligned) 4 byte field (--> 8 bytes) after the record
> >header, xl_bkpblock_len that only exists if we have backup blocks.
> >This will then save 8 bytes from every record that doesn't have backup
> >blocks, and be the same as now with backup blocks.
>
> Here's a better idea:
>
> Let's keep xl_tot_len as it is, but move xl_len at the very end of
> the WAL record, after all the backup blocks. If there are no backup
> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
> that you require less math to figure out where one record ends and
> where the next one begins.

OK, crazy idea, but can we just record xl_len as a difference against xl_tot_len, and shorten the xl_len field?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On 09.01.2013 22:59, Bruce Momjian wrote:
> On Wed, Jan 9, 2013 at 10:54:33PM +0200, Heikki Linnakangas wrote:
>> On 09.01.2013 22:36, Simon Riggs wrote:
>>> The WAL record header starts with xl_tot_len, a 4 byte field. There is
>>> also another field, xl_len. The difference is that xl_tot_len includes
>>> the header, xl_len and any backup blocks. Since the header is fixed,
>>> the only time xl_tot_len != SizeOfXLogRecord + xl_len is when we have
>>> backup blocks.
>>>
>>> We can re-arrange the record layout so that we remove xl_tot_len and
>>> add another (maxaligned) 4 byte field (--> 8 bytes) after the record
>>> header, xl_bkpblock_len that only exists if we have backup blocks.
>>> This will then save 8 bytes from every record that doesn't have backup
>>> blocks, and be the same as now with backup blocks.
>>
>> Here's a better idea:
>>
>> Let's keep xl_tot_len as it is, but move xl_len at the very end of
>> the WAL record, after all the backup blocks. If there are no backup
>> blocks, xl_len is omitted. Seems more robust to keep xl_tot_len, so
>> that you require less math to figure out where one record ends and
>> where the next one begins.
>
> OK, crazy idea, but can we just record xl_len as a difference against
> xl_tot_len, and shorten the xl_len field?

Hmm, so it would essentially be the length of all the backup blocks. Perhaps rename it to xl_bkpblk_len.

However, that would cap the total size of backup blocks to 64k. Which would not be enough with 32k BLCKSZ.

- Heikki
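The 64k cap mentioned above follows from the range of a 16-bit length field. A small C sketch of that check is shown here; the `sketch_` name is hypothetical, the number of backup blocks per record is left as a parameter because the thread does not state it, and any per-block header overhead is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Would the worst-case total size of nblocks full-size backup blocks
 * fit in an unsigned 16-bit length field?  Block headers are ignored,
 * so the real worst case is slightly larger than computed here.
 */
static bool sketch_bkp_total_fits_uint16(uint32_t blcksz, uint32_t nblocks)
{
    uint64_t total = (uint64_t) blcksz * nblocks;

    assert(blcksz > 0);
    return total <= UINT16_MAX;
}
```

With a 32k block size, two full backup blocks already total 65536 bytes, one more than a 16-bit field can represent, which illustrates the concern raised in the message.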
On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

>> OK, crazy idea, but can we just record xl_len as a difference against
>> xl_tot_len, and shorten the xl_len field?
>
>
> Hmm, so it would essentially be the length of all the backup blocks. perhaps
> rename it to xl_bkpblk_len.
>
> However, that would cap the total size of backup blocks to 64k. Which would
> not be enough with 32k BLCKSZ.

Since that requires a recompile anyway, why not make XLogRecord smaller only for 16k BLCKSZ or less?

Problem if we do that is that xl_len is used extensively in _redo routines, so it's a much more invasive patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9 January 2013 20:54, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Here's a better idea:
>
> Let's keep xl_tot_len as it is, but move xl_len at the very end of the WAL
> record, after all the backup blocks. If there are no backup blocks, xl_len
> is omitted. Seems more robust to keep xl_tot_len, so that you require less
> math to figure out where one record ends and where the next one begins.

OK, I avoided tampering with xl_len cos it's so widely used. Will look.

>> Forcing the XLogRecord header to be all on one page makes the format
>> more robust and simplifies the code that copes with header wrapping.

> -1 on that. That would essentially revert the changes I made earlier.

OK, idea dropped.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 9, 2013 at 09:15:16PM +0000, Simon Riggs wrote:
> On 9 January 2013 21:02, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
> >> OK, crazy idea, but can we just record xl_len as a difference against
> >> xl_tot_len, and shorten the xl_len field?
> >
> >
> > Hmm, so it would essentially be the length of all the backup blocks. perhaps
> > rename it to xl_bkpblk_len.
> >
> > However, that would cap the total size of backup blocks to 64k. Which would
> > not be enough with 32k BLCKSZ.
>
> Since that requires a recompile anyway, why not make XLogRecord
> smaller only for 16k BLCKSZ or less?
>
> Problem if we do that is that xl_len is used extensively in _redo
> routines, so its a much more invasive patch.

I would just make it int16 on <=16k block size, and int32 on >16k blocks.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
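The compile-time choice suggested above could look roughly like the following C sketch. It is only an illustration under stated assumptions: `sketch_bkp_len_t` and `sketch_bkp_len` are hypothetical names, BLCKSZ is defaulted to 8192 for the example, and unsigned types are used in place of the int16 and int32 named in the message so that the full 64k range discussed earlier in the thread is available.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch only: default to 8k blocks if not set. */
#ifndef BLCKSZ
#define BLCKSZ 8192
#endif

/* Pick the width of the backup block length field at compile time. */
#if BLCKSZ <= 16384
typedef uint16_t sketch_bkp_len_t;
#else
typedef uint32_t sketch_bkp_len_t;
#endif

/*
 * Store the backup block length as the difference between the total
 * record length and the header plus resource manager data.
 */
static sketch_bkp_len_t sketch_bkp_len(uint32_t xl_tot_len,
                                       uint32_t hdr_size,
                                       uint32_t rmgr_len)
{
    uint32_t len;

    assert(xl_tot_len >= hdr_size + rmgr_len);
    len = xl_tot_len - hdr_size - rmgr_len;

    /* Guard against silent truncation in the narrow variant. */
    assert(len <= (sketch_bkp_len_t) -1);
    return (sketch_bkp_len_t) len;
}
```

As noted elsewhere in the thread, the harder part of such a change is that code reading xl_len directly would need to be updated; the typedef only covers the field width.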
Simon Riggs <simon@2ndQuadrant.com> writes:
> Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> waste 4 bytes per record. Or put another way, if we could reduce
> record header by 4 bytes, we would actually reduce it by 8 bytes per
> record. So looking for ways to do that seems like a good idea.

I think this is extremely premature, in view of the ongoing discussions about shoehorning logical replication and other kinds of data into the WAL stream. It seems quite likely that we'll end up eating some of that padding space to support those features. So whacking a lot of code around in service of squeezing the existing padding out could very easily end up being wasted work, in fact counterproductive if it degrades either code readability or robustness.

Let's wait till we see where the logical rep stuff ends up before we worry about saving 4 bytes per WAL record.

regards, tom lane
On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > Overall, the WAL record is MAXALIGN'd, so with 8 byte alignment we
> > waste 4 bytes per record. Or put another way, if we could reduce
> > record header by 4 bytes, we would actually reduce it by 8 bytes per
> > record. So looking for ways to do that seems like a good idea.
>
> I think this is extremely premature, in view of the ongoing discussions
> about shoehorning logical replication and other kinds of data into the
> WAL stream. It seems quite likely that we'll end up eating some of
> that padding space to support those features. So whacking a lot of code
> around in service of squeezing the existing padding out could very
> easily end up being wasted work, in fact counterproductive if it
> degrades either code readability or robustness.
>
> Let's wait till we see where the logical rep stuff ends up before we
> worry about saving 4 bytes per WAL record.

Well, we have wal_level to control the amount of WAL traffic. It is hard to imagine we are going to want to ship logical WAL information by default, so most people will not be using logical WAL and would see a benefit from an optimized WAL stream?

What percentage is 8-bytes in a typical WAL record?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>> Let's wait till we see where the logical rep stuff ends up before we
>> worry about saving 4 bytes per WAL record.

> Well, we have wal_level to control the amount of WAL traffic.

That's entirely irrelevant. The point here is that we'll need more bits to identify what any particular record is, unless we make a decision that we'll have physically separate streams for logical replication info, which doesn't sound terribly attractive; and in any case no such decision has been made yet, AFAIK.

regards, tom lane
On 10 January 2013 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>> Let's wait till we see where the logical rep stuff ends up before we
>>> worry about saving 4 bytes per WAL record.
>
>> Well, we have wal_level to control the amount of WAL traffic.
>
> That's entirely irrelevant. The point here is that we'll need more bits
> to identify what any particular record is, unless we make a decision
> that we'll have physically separate streams for logical replication
> info, which doesn't sound terribly attractive; and in any case no such
> decision has been made yet, AFAIK.

You were right to say that this is less important than logical replication. I don't need any more reason than that to stop talking about it.

I have a patch for this, but as yet no way to submit it while at the same time saying "put this at the back of the queue".

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 1/10/13 6:14 PM, Simon Riggs wrote:
> On 10 January 2013 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> On Wed, Jan 9, 2013 at 05:06:49PM -0500, Tom Lane wrote:
>>>> Let's wait till we see where the logical rep stuff ends up before we
>>>> worry about saving 4 bytes per WAL record.
>>
>>> Well, we have wal_level to control the amount of WAL traffic.
>>
>> That's entirely irrelevant. The point here is that we'll need more bits
>> to identify what any particular record is, unless we make a decision
>> that we'll have physically separate streams for logical replication
>> info, which doesn't sound terribly attractive; and in any case no such
>> decision has been made yet, AFAIK.
>
> You were right to say that this is less important than logical
> replication. I don't need any more reason than that to stop talking
> about it.
>
> I have a patch for this, but as yet no way to submit it while at the
> same time saying "put this at the back of the queue".

Anything ever come of this?

--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net