Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: WAL format and API changes (9.5) |
Date | |
Msg-id | 20141030191909.GB12193@alap3.anarazel.de Whole thread Raw |
In response to | Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: WAL format and API changes (9.5)
Re: WAL format and API changes (9.5) |
List | pgsql-hackers |
Hi, On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote: > The second patch contains the interesting changes. Heikki's pushed the newest version of this to the git tree. Some things I noticed while reading the patch: * potential mismerge: +++ b/src/bin/pg_basebackup/pg_receivexlog.c @@ -408,7 +408,7 @@ main(int argc, char **argv) } } - while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv", + while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv", long_options, &option_index)) !=-1) { switch (c) * Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you don't need to test both. * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it ? If we everactually get a record with a large numbers of blocks touched this issentially is O(touched_buffers*data_entries). * At least in Assert mode XLogRegisterBuffer() should ensure we're not registering the same buffer twice. It's a pretty easyto make mistake to do it twice for some kind of actions (think UPDATE), and the consequences will be far from obviousafaics. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format.Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't makesense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. * Maybe XLogRegisterData() and XLogRegisterBufData() should accept void* instead of char*? Right now there's lots of pointlesscasts in callers needed that don't add any safety. The one additional line that's then needed in these functionsseems well worth it. * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. Thesestill log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references aredealt with? * You've added an assert in RestoreBackupBlockContents() that ensures the page isn't new. That wasn't there before, insteadthere was a special case for it. I can't immediately think how that previously could happen, but I do wonder why itwas there. * The following comment in +RestoreBackupBlock probably is two lines to early + /* Found it, apply the update */ + if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE)) + return InvalidBuffer; This new InvalidBuffer case probably could use a comment in either XLogReadBufferForRedoExtendedor here. * Most of the commentary above RestoreBackupBlock() probably doesn't belong there anymore. * Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just into flags_fork? Right now it makes at least me thinkthat it's fork specific flags, which really isn't the actual meaning. * I wonder if it wouldn't be worthwile, for the benefit of the FPI compression patch, to keep the bkp block data after *all*the "headers". That'd make it easier to just compress the data. * +InitXLogRecordConstruct() seems like a remainder of the xlogconstruct.c naming. I'd just go for InitXLogInsert() * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference tothe page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewindlike tools can sanely deal with this? * Why did you duplicate the identical cases in btree_desc()? I guess due to rebasing ontop the xlogdump stats series? * I haven't actually looked at the xlogdump output so I might be completely wrong, but won't the output of e.g. updates beharder to read now because it's not clear which of the references old/new buffers are? Not that it matters that much. * s/recdataend/recdata_end/? The former is somewhat hard to parse for me. * I think heap_xlog_update is buggy for wal_level=logical because it computes the length of the tuple using tuplen = recdataend- recdata; But the old primary key/old tuple value might be stored there as well. Afaics that code has to continueto use xl_heap_header_len. * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA to get rid of: + /* + * The new tuple is normally stored as buffer 0's data. But if + * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main + * data, after the xl_heap_insert struct. + */ + if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE) + { + data = XLogRecGetData(record) + SizeOfHeapInsert; + datalen = record->xl_len - SizeOfHeapInsert; + } + else + data = XLogRecGetBlockData(record, 0, &datalen);or have I misunderstood how that works? * spurious reindent @@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record) * XLOG record's LSN, we mustn't mark thepage all-visible, because * the subsequent update won't be replayed to clear the flag. */ - page = BufferGetPage(buffer); - PageSetAllVisible(page); - MarkBufferDirty(buffer); + page = BufferGetPage(buffer); + PageSetAllVisible(page); + MarkBufferDirty(buffer); } * Somewhat long line:XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD | REGBUF_NO_IMAGE)); Man. This page is friggin huge. Now I'm tired ;) This patch needs at the very least: * Careful benchmarking of both primary and standbys * Very careful review of the many changed XLogInsert/replay sites. I'd be very surprised if there weren't bugs hidden somewhere.I've lost the energy to read them all now. * A good amount of consideration about the increase in WAL volume. I think there's some cases where that's not inconsiderable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: