Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: WAL format and API changes (9.5)
Date
Msg-id 5469F68E.3080300@vmware.com
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On 11/13/2014 05:04 PM, Andres Freund wrote:
> On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:
>> In quick testing, this new WAL format is somewhat more compact than the 9.4
>> format.
>
> Well, that's mixing different kinds of changes together to some
> degree. Most of this could have been done independently as well. I'm not
> generally objecting to doing that, but the space/performance comparison
> isn't entirely fair. One could (I'm not!) argue that we should just do a
> refactoring like you suggest above, without the additional block
> information.

Yep. It would be interesting to just do the minimal changes to the
current record format, to get a similar reduction in WAL volume, and
then compare the performance and WAL volume of that vs. the big patch.
But I don't think it's really worth it; we do want the bigger changes,
anyway, even if it costs a little.

>> That also seems to have more than bought back the performance
>> regression I saw earlier.
>
> I think we really need to do the performance test with a different CRC
> implementation. So far all the tests in this thread seem to be
> correlating pretty linearly to the size of the record.

Will do.

>> XLogReader now has
>> a new function, DecodeXLogRecord, which parses the block headers etc. from
>> the WAL record, and copies the data chunks to aligned buffers. The redo
>> routines are passed a pointer to the XLogReaderState, instead of the plain
>> XLogRecord, and the redo routines can use macros and functions defined
>> xlogreader.h to access the already-decoded WAL record. The new WAL record
>> format is difficult to parse in a piece-meal fashion, so it really needs
>> this separate decoding pass to be efficient.
>
> Hm. I'm not sure if there's really a point in exposing
> DecodeXLogRecord() outside of xlogreader.c. In which case would a
> xlogreader user not want to decode the record? I.e. couldn't we just
> always do that and not bother exposing the function?
>
> Sure, you could skip decoding if the record is of a "uninteresting"
> rmgr, but since all validation but CRC now happens only during decoding
> I'm not sure that's a good idea.

Well, extracting all the backup block data etc. is not free, so it would
be nice to be able to filter first. I'm thinking of logical decoding in
particular; it could avoid parsing index AM records. Then again, the
parsing isn't very expensive anyway.

I'll change the patch to decode as part of the XLogReadRecord call for
now; it does things a little bit simpler. We can change it back later if
it seems worthwhile.

I nevertheless kept the DecodeXLogRecord function. It can be used to
decode an XLogRecord that's not read from disk, using the xlogreader,
but already in memory. That was needed to support WAL_DEBUG at
insertion; it needs to rm_desc() a record that's not previously read
from disk with an xlogreader.

>> Thoughts on this new WAL record format? I've attached the xlogrecord.h file
>> here separately for easy reading, if you want to take a quick look at just
>> that without applying the whole patch.
>
> After a quick skim I like it in general. There's some details I'd rather
> change, but I think it's quite the step forward.

Glad to hear that.

Here is an updated version of the patch. It's rebased over current
master, and fixes a bunch of issues you and Michael pointed out, and a
ton of other cleanup.

I'll do more comprehensive performance testing with this, and post results.

BTW, I'm planning to replace the malloc() calls in xlogreader.c with
palloc(). It currently uses malloc because xlogreader.c needs to be
usable in frontend code, but nowadays we map palloc automatically to
malloc when building frontend code. A consequence of that is that you
get a straight exit() on out-of-memory, but I think that's acceptable.

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Review of Refactoring code for sync node detection
Next
From: Alex Shulgin
Date:
Subject: Re: Idle transaction cancel/timeout and SSL revisited