Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WAL format and API changes (9.5)
Date
Msg-id 20141030191909.GB12193@alap3.anarazel.de
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WAL format and API changes (9.5)
Re: WAL format and API changes (9.5)
List pgsql-hackers
Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)       }   }
-   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",                           long_options, &option_index))
!=-1)   {       switch (c)
 

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we
everactually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries).
 

* At least in Assert mode XLogRegisterBuffer() should ensure we're not registering the same buffer twice. It's a pretty
easyto make mistake to do it twice for some kind of actions (think UPDATE), and the consequences will be far from
obviousafaics.
 

* There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something
like:
if (XLogRecHasBlockRef(record, 1))   XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else   oldblk = newblk;
 I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible
format.Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't
makesense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once.
 

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void* instead of char*? Right now there's lots of
pointlesscasts in callers needed that don't add any safety. The one additional line that's then needed in these
functionsseems well worth it.
 

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number.
Thesestill log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references
aredealt with?
 

* You've added an assert in RestoreBackupBlockContents() that ensures the page isn't new. That wasn't there before,
insteadthere was a special case for it. I can't immediately think how that previously could happen, but I do wonder why
itwas there.
 

* The following comment in +RestoreBackupBlock probably is two lines to early
+   /* Found it, apply the update */
+   if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+       return InvalidBuffer; This new InvalidBuffer case probably could use a comment in either
XLogReadBufferForRedoExtendedor here.
 

* Most of the commentary above RestoreBackupBlock() probably doesn't belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just into flags_fork? Right now it makes at least me
thinkthat it's fork specific flags, which really isn't the actual meaning.
 

* I wonder if it wouldn't be worthwile, for the benefit of the FPI compression patch, to keep the bkp block data after
*all*the "headers". That'd make it easier to just compress the data.
 

* +InitXLogRecordConstruct() seems like a remainder of the xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference
tothe page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how
pg_rewindlike tools can sanely deal with this?
 

* Why did you duplicate the identical cases in btree_desc()? I guess due to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be completely wrong, but won't the output of e.g. updates
beharder to read now because it's not clear which of the references old/new buffers are? Not that it matters that
much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for me.

* I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen =
recdataend- recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to
continueto use xl_heap_header_len.
 

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);or have I misunderstood how that works?

* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)        * XLOG record's LSN, we mustn't mark
thepage all-visible, because        * the subsequent update won't be replayed to clear the flag.        */
 
-       page = BufferGetPage(buffer);
-       PageSetAllVisible(page);
-       MarkBufferDirty(buffer);
+           page = BufferGetPage(buffer);
+           PageSetAllVisible(page);
+           MarkBufferDirty(buffer);   }

* Somewhat long line:XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD |
REGBUF_NO_IMAGE));

Man. This page is friggin huge. Now I'm tired ;)


This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd be very surprised if there weren't bugs hidden
somewhere.I've lost the energy to read them all now.
 
* A good amount of consideration about the increase in WAL volume. I think there's some cases where that's not
inconsiderable.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Next
From: Tom Lane
Date:
Subject: Re: infinite loop in _bt_getstackbuf