Home > mailing lists

Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: WAL format and API changes (9.5)
Date	November 10, 2014 00:48:02
Msg-id	20141109214755.GB28007@alap3.anarazel.de Whole thread Raw
In response to	Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses	Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>)
List	pgsql-hackers

Tree view

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
> Replying to some of your comments below. The rest were trivial issues that
> I'll just fix.
> 
> On 10/30/2014 09:19 PM, Andres Freund wrote:
> >* Is it really a good idea to separate XLogRegisterBufData() from
> >   XLogRegisterBuffer() the way you've done it ? If we ever actually get
> >   a record with a large numbers of blocks touched this issentially is
> >   O(touched_buffers*data_entries).
> 
> Are you worried that the linear search in XLogRegisterBufData(), to find the
> right registered_buffer struct, might be expensive? If that ever becomes a
> problem, a simple fix would be to start the linear search from the end, and
> make sure that when you touch a large number of blocks, you do all the
> XLogRegisterBufData() calls right after the corresponding
> XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

> I've also though about having XLogRegisterBuffer() return the pointer to the
> struct, and passing it as argument to XLogRegisterBufData. So the pattern in
> WAL generating code would be like:
> 
> registered_buffer *buf0;
> 
> buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
> XLogRegisterBufData(buf0, data, length);
> 
> registered_buffer would be opaque to the callers. That would have potential
> to turn XLogRegisterBufData into a macro or inline function, to eliminate
> the function call overhead. I played with that a little bit, but the
> difference in performance was so small that it didn't seem worth it. But
> passing the registered_buffer pointer, like above, might not be a bad thing
> anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.

> >* There's lots of functions like XLogRecHasBlockRef() that dig through
> >   the wal record. A common pattern is something like:
> >if (XLogRecHasBlockRef(record, 1))
> >     XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> >else
> >     oldblk = newblk;
> >
> >   I think doing that repeatedly is quite a bad idea. We should parse the
> >   record once and then use it in a sensible format. Not do it in pieces,
> >   over and over again. It's not like we ignore backup blocks - so doing
> >   this lazily doesn't make sense to me.
> >   Especially as ValidXLogRecord() *already* has parsed the whole damn
> >   thing once.
> 
> Hmm. Adding some kind of a parsed XLogRecord representation would need a
> fair amount of new infrastructure.

True.

> Vast majority of WAL records contain one,
> maybe two, block references, so it's not that expensive to find the right
> one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

> I ran a quick performance test on WAL replay performance yesterday. I ran
> pgbench for 1000000 transactions with WAL archiving enabled, and measured
> the time it took to replay the generated WAL. I did that with and without
> the patch, and I didn't see any big difference in replay times. I also ran
> "perf" on the startup process, and the profiles looked identical. I'll do
> more comprehensive testing later, with different index types, but I'm
> convinced that there is no performance issue in replay that we'd need to
> worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...

> If it mattered, a simple optimization to the above pattern would be to have
> XLogRecGetBlockTag return true/false, indicating if the block reference
> existed at all. So you'd do:
> 
> if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
>     oldblk != newblk;

That sounds like a good idea anyway?

> On the other hand, decomposing the WAL record into parts, and passing the
> decomposed representation to the redo routines would allow us to pack the
> WAL record format more tightly, as accessing the different parts of the
> on-disk format wouldn't then need to be particularly fast. For example, I've
> been thinking that it would be nice to get rid of the alignment padding in
> XLogRecord, and between the per-buffer data portions. We could copy the data
> to aligned addresses as part of the decomposition or parsing of the WAL
> record, so that the redo routines could still assume aligned access.

Right. I think it'd generally give us a bit more flexibility.

> >* I wonder if it wouldn't be worthwile, for the benefit of the FPI
> >   compression patch, to keep the bkp block data after *all* the
> >   "headers". That'd make it easier to just compress the data.
> 
> Maybe. If we do that, I'd also be inclined to move all the bkp block headers
> to the beginning of the WAL record, just after the XLogInsert struct.

> Somehow it feels weird to have a bunch of header structs sandwiched between
> the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the
> XLogRecordBlockData struct, so that would be an easy way to make use of the
> space currently wasted for alignment padding in XLogRecord.

Hm, ok. Don't really care ;)

> >* I think heap_xlog_update is buggy for wal_level=logical because it
> >   computes the length of the tuple using
> >   tuplen = recdataend - recdata;
> >   But the old primary key/old tuple value might be stored there as
> >   well. Afaics that code has to continue to use xl_heap_header_len.
> 
> No, the old primary key or tuple is stored in the main data portion. That
> tuplen computation is done on backup block 0's data.

Ah. Still living in the "old world" apparently ;). Will look at it again.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

pgsql-hackers by date:

From: Petr Jelinek
Date: 10 November 2014, 00:41:57
Subject: Re: Column/type dependency recording inconsistencies

From: Tom Lane
Date: 10 November 2014, 01:04:37
Subject: Re: Column/type dependency recording inconsistencies

Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

Previous

Next