Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: WAL format and API changes (9.5) |
Date | |
Msg-id | CAHGQGwG-uvL8krea3f_ExRSYkPK6t=cn6YzoYpj_vDCfx7H_ng@mail.gmail.com Whole thread Raw |
In response to | Re: WAL format and API changes (9.5) (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: WAL format and API changes (9.5)
|
List | pgsql-hackers |
On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Thanks for the review! > > > On 06/27/2014 09:12 AM, Michael Paquier wrote: >> >> Looking at this patch, it does not compile when assertions are enabled >> because of this assertion in heapam.c: >> Assert(recdata == data + datalen); >> It seems like a thinko as removing it makes the code compile. > > > Fixed. > > >> Some comments about the code: >> 1) In src/backend/access/transam/README: >> 's/no further action is no required/no further action is required/g' > > > Fixed. > > >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and >> XLogGetBlockRefIds are missing in src/backend/access/transam/README. > > > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure > what to do with the latter two. They're not really intended for use by redo > routines, but for things like pg_xlogdump that want to extract information > about any record type. > > >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are >> as well some NOT_USED blocks. > > > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly > stand for code that used to print block numbers or relfilenodes, which > doesn't make much sense to do in an ad hoc way in rmgr-specific desc > routines anymore. Will need to go through them all an decide what's the > important information to print for each record type. > > The few NOT_USED blocks are around code in redo routines that extract some > information from the WAL record, that isn't needed anymore. We could remove > the fields from the WAL records altogether, but might be useful to keep for > debugging purposes. > > >> 4) Current patch crashes when running make check in >> contrib/test_decoding. >> Looking closer, wal_level = logical is broken: >> =# show wal_level; >> wal_level >> ----------- >> logical >> (1 row) >> =# create database foo; >> The connection to the server was lost. Attempting reset: Failed. > > > Fixed. > > >> Looking even closer, log_heap_new_cid is broken: > > > Fixed. > > >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of >> the static XLogRecData entries if buffer_id >= 0 or to >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the >> WAL record). XLogRegisterXLogRecData does something similar, but with an >> entry of XLogRecData already built. What do you think about removing >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the >> interface simpler, and XLogRecData could be kept private in xlog.c. This >> would need more work especially on gin side where I am seeing for example >> constructLeafRecompressWALData directly building a XLogRecData entry. > > > Hmm. I considered that, but punted because I didn't want to rewrite all the > places that currently use XLogRecDatas in less straightforward ways. I kept > XLogRegisterXLogRecData as an escape hatch for those. > > But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData() > calls. XLogRecData struct is now completely private to xlog.c. > > Another big change in the attached patch version is that XLogRegisterData() > now *copies* the data to a working area, instead of merely adding a > XLogRecData entry to the chain. This simplifies things in xlog.c, now that > the XLogRecData concept is not visible to the rest of the system. This is a > subtle semantic change for the callers: you can no longer register a struct > first, and fill the fields afterwards. > > That adds a lot of memcpys to the critical path, which could be bad for > performance. In practice, I think it's OK, or even a win, because a typical > WAL record payload is very small. But I haven't measured that. If it becomes > an issue, we could try to optimize, and link larger data blocks to the rdata > chain, but memcpy small small chunks of data. There are a few WAL record > types that don't fit in a small statically allocated buffer, so I provided a > new function, XLogEnsureSpace(), that can be called before XLogBegin() to > allocate a large-enough working area. > > These changes required changing a couple of places where WAL records are > created: > > * ginPlaceToPage. This function has a strange split of responsibility > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the > WAL record. Then ginPlaceToPage adds some more, and finally calls > XLogInsert(). I simplified the SPLIT case so that we now just create full > page images of both page halves. We were already logging all the data on > both page halves, but we were doing it in a "smarter" way, knowing what kind > of pages they were. For example, for an entry tree page, we included an > IndexTuple struct for every tuple on the page, but didn't include the item > pointers. A straight full page image takes more space, but not much, so I > think in any real application it's going to be lost in noise. (again, > haven't measured it though) > > >> 8) What do you think about adding in the README a description about how to >> retrieve the block list modified by a WAL record, for a flow similar to >> that: >> record = XLogReadRecord(...); >> nblocks = XLogGetBlockRefIds(record, blockids); >> for (i = 0; i < nblocks; i++) >> { >> bkpb = XLogGetBlockRef(record, id, NULL); >> // stuff >> } >> pg_xlogdump is a good example of what can be done as well. > > > Yeah, that's probably in order. I don't actually like the way that works > currently very much. Passing the char[256] blockids array to the function > feels weird. Will have to think a bit harder about that.. > > >> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload? >> + char *payload = XLogRecGetData(record) + >> sizeof(ginxlogInsert); >> +#ifdef NOT_USED >> leftChildBlkno = BlockIdGetBlockNumber((BlockId) >> payload); >> +#endif > > > No. XLogRecGetData() is used to get data from the "main chunk", while > XLogGetPayload() is used to get data associated with a buffer. This does the > former. > > >> 10) Shouldn't the call to XLogBeginInsert come first here? >> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState >> *state, >> { >> XLogRecPtr recptr; >> >> - recptr = XLogInsert(RM_SPGIST_ID, >> XLOG_SPGIST_ADD_NODE, rdata); >> + XLogRegisterBuffer(0, saveCurrent.buffer, >> REGBUF_STANDARD); /* orig page */ >> + XLogRegisterBuffer(1, current->buffer, >> REGBUF_STANDARD); /* new page */ >> + if (xlrec.parentBlk == 2) >> + XLogRegisterBuffer(2, parent->buffer, >> REGBUF_STANDARD); >> + >> + XLogBeginInsert(); >> + XLogRegisterData(-1, (char *) &xlrec, >> sizeof(xlrec)); >> + XLogRegisterData(-1, (char *) newInnerTuple, >> newInnerTuple->size); >> + >> + recptr = XLogInsert(RM_SPGIST_ID, >> XLOG_SPGIST_ADD_NODE); > > > Yeah. Apparently we have no regression test coverage of this codepath, or it > would've triggered an assertion. Fixed now, just by looking at the code, but > there's still no test coverage of this so it's likely still broken in other > ways. Before this is committed, I think we'll need to go through the > coverage report and make sure all the XLogInsert() codepaths (and their redo > routines) are covered. > > >> 11) Any reason for raising a PANIC here? >> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l) >> break; >> } >> if (i < 0) >> - elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l)); >> + elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l)); >> num_held_lwlocks--; >> for (; i < num_held_lwlocks; i++) > > > No, leftover from debugging. Fixed. > > >> 14) SPgist redo is crashing (during make installcheck run on a master, >> redo >> done on a standby at recovery). >> (lldb) bt >> * thread #1: tid = 0x0000, 0x00007fff8a3c2866 >> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP >> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + >> 10 >> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + >> 92 >> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125 >> frame #3: 0x000000010ba8d819 >> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9, >> errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) >> + >> 137 at assert.c:54 >> frame #4: 0x000000010b5c3a50 >> postgres`addOrReplaceTuple(page=0x000000010bfee980, >> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65 >> frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608, >> record=0x00007ffd41822800) + 719 at spgxlog.c:263 >> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608, >> record=0x00007ffd41822800) + 402 at spgxlog.c:988 >> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at >> xlog.c:6780 >> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at >> startup.c:224 >> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2, >> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416 >> frame #10: 0x000000010b864998 >> postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090 >> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3, >> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212 >> frame #12: 0x000000010b7a9335 postgres`main(argc=3, >> argv=0x00007ffd40c04470) + 773 at main.c:227 > > > Fixed. > > >> Also, I wanted to run the WAL replay facility to compare before and after >> buffer images, but code crashed before... I am still planning to do so >> once >> this code gets more stable though. > > > I did that earlier, and did it now again. I got a single difference from a > sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to > investigate it later. With the patch, when I ran the simple test which generates lots of WAL, I got PANIC error. =# create extension pgcrypto ; CREATE EXTENSION =# create extension pg_trgm ; CREATE EXTENSION =# create table hoge (col1 text); CREATE TABLE =# create index hogeidx on hoge using gin (col1 gin_trgm_ops); CREATE INDEX =# insert into hoge select gen_random_uuid() from generate_series(1,10000000); PANIC: too many registered buffers PANIC: too many registered buffers The connection to the server was lost. Attempting reset: Failed. Server log is 2014-07-01 07:12:37 JST PANIC: too many registered buffers 2014-07-01 07:12:37 JST STATEMENT: insert into hoge select gen_random_uuid() from generate_series(1,10000000); 2014-07-01 07:12:40 JST LOG: server process (PID 64020) was terminated by signal 6: Abort trap 2014-07-01 07:12:40 JST DETAIL: Failed process was running: insert into hoge select gen_random_uuid() from generate_series(1,10000000); Regards, -- Fujii Masao
pgsql-hackers by date: