Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: WAL format and API changes (9.5)
Date
Msg-id CAHGQGwG-uvL8krea3f_ExRSYkPK6t=cn6YzoYpj_vDCfx7H_ng@mail.gmail.com
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WAL format and API changes (9.5)
List pgsql-hackers
On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Thanks for the review!
>
>
> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>>
>> Looking at this patch, it does not compile when assertions are enabled
>> because of this assertion in heapam.c:
>> Assert(recdata == data + datalen);
>> It seems like a thinko as removing it makes the code compile.
>
>
> Fixed.
>
>
>> Some comments about the code:
>> 1) In src/backend/access/transam/README:
>> 's/no further action is no required/no further action is required/g'
>
>
> Fixed.
>
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.
>
>
>> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly
> stand for code that used to print block numbers or relfilenodes, which
> doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.
>
>
>>   4) Current patch crashes when running make check in
>> contrib/test_decoding.
>> Looking closer, wal_level = logical is broken:
>> =# show wal_level;
>>   wal_level
>> -----------
>>   logical
>> (1 row)
>> =# create database foo;
>> The connection to the server was lost. Attempting reset: Failed.
>
>
> Fixed.
>
>
>> Looking even closer, log_heap_new_cid is broken:
>
>
> Fixed.
>
>
>> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>
>
> Hmm. I considered that, but punted because I didn't want to rewrite all the
> places that currently use XLogRecDatas in less straightforward ways. I kept
> XLogRegisterXLogRecData as an escape hatch for those.
>
> But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData()
> calls. XLogRecData struct is now completely private to xlog.c.
>
> Another big change in the attached patch version is that XLogRegisterData()
> now *copies* the data to a working area, instead of merely adding a
> XLogRecData entry to the chain. This simplifies things in xlog.c, now that
> the XLogRecData concept is not visible to the rest of the system. This is a
> subtle semantic change for the callers: you can no longer register a struct
> first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it becomes
> an issue, we could try to optimize, and link larger data blocks to the rdata
> chain, but memcpy small small chunks of data. There are a few WAL record
> types that don't fit in a small statically allocated buffer, so I provided a
> new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> allocate a large-enough working area.
>
> These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage
> calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree
> it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the
> WAL record. Then ginPlaceToPage adds some more, and finally calls
> XLogInsert(). I simplified the SPLIT case so that we now just create full
> page images of both page halves. We were already logging all the data on
> both page halves, but we were doing it in a "smarter" way, knowing what kind
> of pages they were. For example, for an entry tree page, we included an
> IndexTuple struct for every tuple on the page, but didn't include the item
> pointers. A straight full page image takes more space, but not much, so I
> think in any real application it's going to be lost in noise. (again,
> haven't measured it though)
>
>
>> 8) What do you think about adding in the README a description about how to
>> retrieve the block list modified by a WAL record, for a flow similar to
>> that:
>> record = XLogReadRecord(...);
>> nblocks = XLogGetBlockRefIds(record, blockids);
>> for (i = 0; i < nblocks; i++)
>> {
>>      bkpb = XLogGetBlockRef(record, id, NULL);
>>      // stuff
>> }
>> pg_xlogdump is a good example of what can be done as well.
>
>
> Yeah, that's probably in order. I don't actually like the way that works
> currently very much. Passing the char[256] blockids array to the function
> feels weird. Will have to think a bit harder about that..
>
>
>> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
>> +               char       *payload = XLogRecGetData(record) +
>> sizeof(ginxlogInsert);
>> +#ifdef NOT_USED
>>                  leftChildBlkno = BlockIdGetBlockNumber((BlockId)
>> payload);
>> +#endif
>
>
> No. XLogRecGetData() is used to get data from the "main chunk", while
> XLogGetPayload() is used to get data associated with a buffer. This does the
> former.
>
>
>> 10) Shouldn't the call to XLogBeginInsert come first here?
>> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
>> *state,
>>                  {
>>                          XLogRecPtr      recptr;
>>
>> -                       recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE, rdata);
>> +                       XLogRegisterBuffer(0, saveCurrent.buffer,
>> REGBUF_STANDARD);     /* orig page */
>> +                       XLogRegisterBuffer(1, current->buffer,
>> REGBUF_STANDARD);                /* new page */
>> +                       if (xlrec.parentBlk == 2)
>> +                               XLogRegisterBuffer(2, parent->buffer,
>> REGBUF_STANDARD);
>> +
>> +                       XLogBeginInsert();
>> +                       XLogRegisterData(-1, (char *) &xlrec,
>> sizeof(xlrec));
>> +                       XLogRegisterData(-1, (char *) newInnerTuple,
>> newInnerTuple->size);
>> +
>> +                       recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE);
>
>
> Yeah. Apparently we have no regression test coverage of this codepath, or it
> would've triggered an assertion. Fixed now, just by looking at the code, but
> there's still no test coverage of this so it's likely still broken in other
> ways. Before this is committed, I think we'll need to go through the
> coverage report and make sure all the XLogInsert() codepaths (and their redo
> routines) are covered.
>
>
>> 11) Any reason for raising a PANIC here?
>> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
>>                          break;
>>          }
>>          if (i < 0)
>> -               elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
>> +               elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
>>          num_held_lwlocks--;
>>          for (; i < num_held_lwlocks; i++)
>
>
> No, leftover from debugging. Fixed.
>
>
>> 14) SPgist redo is crashing (during make installcheck run on a master,
>> redo
>> done on a standby at recovery).
>> (lldb) bt
>> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
>> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
>>    * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill +
>> 10
>>      frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
>> 92
>>      frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
>>      frame #3: 0x000000010ba8d819
>> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
>> errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65)
>> +
>> 137 at assert.c:54
>>      frame #4: 0x000000010b5c3a50
>> postgres`addOrReplaceTuple(page=0x000000010bfee980,
>> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
>>      frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
>> record=0x00007ffd41822800) + 719 at spgxlog.c:263
>>      frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
>> record=0x00007ffd41822800) + 402 at spgxlog.c:988
>>      frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
>> xlog.c:6780
>>      frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
>> startup.c:224
>>      frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
>> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
>>      frame #10: 0x000000010b864998
>> postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
>>      frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
>> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
>>      frame #12: 0x000000010b7a9335 postgres`main(argc=3,
>> argv=0x00007ffd40c04470) + 773 at main.c:227
>
>
> Fixed.
>
>
>> Also, I wanted to run the WAL replay facility to compare before and after
>> buffer images, but code crashed before... I am still planning to do so
>> once
>> this code gets more stable though.
>
>
> I did that earlier, and did it now again. I got a single difference from a
> sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
> investigate it later.

With the patch, when I ran the simple test which generates lots of
WAL, I got PANIC error.

=# create extension pgcrypto ;
CREATE EXTENSION
=# create extension pg_trgm ;
CREATE EXTENSION
=# create table hoge (col1 text);
CREATE TABLE
=# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
CREATE INDEX
=# insert into hoge select gen_random_uuid() from generate_series(1,10000000);
PANIC:  too many registered buffers
PANIC:  too many registered buffers
The connection to the server was lost. Attempting reset: Failed.

Server log is
2014-07-01 07:12:37 JST PANIC:  too many registered buffers
2014-07-01 07:12:37 JST STATEMENT:  insert into hoge select
gen_random_uuid() from generate_series(1,10000000);
2014-07-01 07:12:40 JST LOG:  server process (PID 64020) was
terminated by signal 6: Abort trap
2014-07-01 07:12:40 JST DETAIL:  Failed process was running: insert
into hoge select gen_random_uuid() from generate_series(1,10000000);

Regards,

--
Fujii Masao



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Fresh initdb contains a few deleted B-Tree pages
Next
From: Andres Freund
Date:
Subject: Re: Spinlocks and compiler/memory barriers