Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WAL format and API changes (9.5)
Date
Msg-id 20141113150445.GB29112@alap3.anarazel.de
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
> Here's a new version, with big changes again to the record format. Have a
> look at xlogrecord.h for the details, but in a nutshell:
> 
> 1. The overall format is now: XLogRecord, per-block headers, header for main
> data portion, per-block data, main data.
> 
> 2. I removed xl_len field from XLogRecord and rearranged the fields, to
> shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2-
> or 5-byte header for the "main data", after the block headers).
> 
> 3. No alignment padding. (the data chunks are copied to aligned buffers at
> replay, so redo functions can still assume aligned access)

Whoa. That's a large amount of changes... Not bad at all.

> In quick testing, this new WAL format is somewhat more compact than the 9.4
> format.

Well, that's mixing different kinds of changes together to some
degree. Most of this could have been done independently as well. I'm not
generally objecting to doing that, but the space/performance comparison
isn't entirely fair. One could (I'm not!) argue that we should just do a
refactoring like you suggest above, without the additional block
information.

> That also seems to have more than bought back the performance
> regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

> Here are results from my laptop, using the wal-update-testsuite.sh script:

Would be nice if that'd also compute the differences in wal_generated
and duration between two runs... :)

> Aside from the WAL record format changes, this patch adds the "decoded WAL
> record" infrastructure that we talked about with Andres.

Ah, cool.

> XLogReader now has
> a new function, DecodeXLogRecord, which parses the block headers etc. from
> the WAL record, and copies the data chunks to aligned buffers. The redo
> routines are passed a pointer to the XLogReaderState, instead of the plain
> XLogRecord, and the redo routines can use macros and functions defined
> xlogreader.h to access the already-decoded WAL record. The new WAL record
> format is difficult to parse in a piece-meal fashion, so it really needs
> this separate decoding pass to be efficient.

Hm. I'm not sure if there's really a point in exposing
DecodeXLogRecord() outside of xlogreader.c. In which case would a
xlogreader user not want to decode the record? I.e. couldn't we just
always do that and not bother exposing the function?

Sure, you could skip decoding if the record is of a "uninteresting"
rmgr, but since all validation but CRC now happens only during decoding
I'm not sure that's a good idea. You e.g. have a block

+        if (!DecodeXLogRecord(xlogreader_state, &errormsg))
+        {
+            fprintf(stderr, "error in WAL record at %X/%X: %s\n",
+                        (uint32) (xlogreader_state->ReadRecPtr >> 32),
+                        (uint32) xlogreader_state->ReadRecPtr,
+                        errormsg);
+            /*
+             * Parsing the record failed, but it passed CRC check, so in
+             * theory we can continue reading.
+             */
+            continue;
+        }
+

I think that's quite the bad idea. The CRC is only a part of the error
detection, not its entirety.

> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
> here separately for easy reading, if you want to take a quick look at just
> that without applying the whole patch.

After a quick skim I like it in general. There's some details I'd rather
change, but I think it's quite the step forward.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Ants Aasma
Date:
Subject: Re: Failback to old master
Next
From: Tom Lane
Date:
Subject: Re: On the warpath again about ill-considered inclusion nests