Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: WAL format and API changes (9.5)
Date
Msg-id 545B9491.6040409@vmware.com
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: WAL format and API changes (9.5)  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Replying to some of your comments below. The rest were trivial issues 
that I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:
> * Is it really a good idea to separate XLogRegisterBufData() from
>    XLogRegisterBuffer() the way you've done it ? If we ever actually get
>    a record with a large numbers of blocks touched this issentially is
>    O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find 
the right registered_buffer struct, might be expensive? If that ever 
becomes a problem, a simple fix would be to start the linear search from 
the end, and make sure that when you touch a large number of blocks, you 
do all the XLogRegisterBufData() calls right after the corresponding 
XLogRegisterBuffer() call.

I've also though about having XLogRegisterBuffer() return the pointer to 
the struct, and passing it as argument to XLogRegisterBufData. So the 
pattern in WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have 
potential to turn XLogRegisterBufData into a macro or inline function, 
to eliminate the function call overhead. I played with that a little 
bit, but the difference in performance was so small that it didn't seem 
worth it. But passing the registered_buffer pointer, like above, might 
not be a bad thing anyway.

> * There's lots of functions like XLogRecHasBlockRef() that dig through
>    the wal record. A common pattern is something like:
> if (XLogRecHasBlockRef(record, 1))
>      XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> else
>      oldblk = newblk;
>
>    I think doing that repeatedly is quite a bad idea. We should parse the
>    record once and then use it in a sensible format. Not do it in pieces,
>    over and over again. It's not like we ignore backup blocks - so doing
>    this lazily doesn't make sense to me.
>    Especially as ValidXLogRecord() *already* has parsed the whole damn
>    thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a 
fair amount of new infrastructure. Vast majority of WAL records contain 
one, maybe two, block references, so it's not that expensive to find the 
right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I 
ran pgbench for 1000000 transactions with WAL archiving enabled, and 
measured the time it took to replay the generated WAL. I did that with 
and without the patch, and I didn't see any big difference in replay 
times. I also ran "perf" on the startup process, and the profiles looked 
identical. I'll do more comprehensive testing later, with different 
index types, but I'm convinced that there is no performance issue in 
replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to 
have XLogRecGetBlockTag return true/false, indicating if the block 
reference existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))    oldblk != newblk;


On the other hand, decomposing the WAL record into parts, and passing 
the decomposed representation to the redo routines would allow us to 
pack the WAL record format more tightly, as accessing the different 
parts of the on-disk format wouldn't then need to be particularly fast. 
For example, I've been thinking that it would be nice to get rid of the 
alignment padding in XLogRecord, and between the per-buffer data 
portions. We could copy the data to aligned addresses as part of the 
decomposition or parsing of the WAL record, so that the redo routines 
could still assume aligned access.

> * I wonder if it wouldn't be worthwile, for the benefit of the FPI
>    compression patch, to keep the bkp block data after *all* the
>    "headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block 
headers to the beginning of the WAL record, just after the XLogInsert 
struct. Somehow it feels weird to have a bunch of header structs 
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte 
alignment is enough for the XLogRecordBlockData struct, so that would be 
an easy way to make use of the space currently wasted for alignment 
padding in XLogRecord.

> * I think heap_xlog_update is buggy for wal_level=logical because it
>    computes the length of the tuple using
>    tuplen = recdataend - recdata;
>    But the old primary key/old tuple value might be stored there as
>    well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion. 
That tuplen computation is done on backup block 0's data.

> * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
>    to get rid of:
> +       /*
> +        * The new tuple is normally stored as buffer 0's data. But if
> +        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
> +        * data, after the xl_heap_insert struct.
> +        */
> +       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
> +       {
> +           data = XLogRecGetData(record) + SizeOfHeapInsert;
> +           datalen = record->xl_len - SizeOfHeapInsert;
> +       }
> +       else
> +           data = XLogRecGetBlockData(record, 0, &datalen);
>   or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does* 
use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should 
always look into buffer 0's data.

- Heikki




pgsql-hackers by date:

Previous
From: Bernd Helmle
Date:
Subject: Re: Amazon Redshift
Next
From: Robert Haas
Date:
Subject: Re: group locking: incomplete patch, just for discussion