Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: WAL format and API changes (9.5)
Date
Msg-id CAB7nPqR2RVPJL1WmF-PXa3GLhD58Pr4DKK9K=0aazyE3dhKuLw@mail.gmail.com
Whole thread Raw
In response to Re: WAL format and API changes (9.5)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WAL format and API changes (9.5)  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers



On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 06/27/2014 09:12 AM, Michael Paquier wrote:
2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure what to do with the latter two. They're not really intended for use by redo routines, but for things like pg_xlogdump that want to extract information about any record type.

That's definitely not part of the redo section or WAL construction section. It could be a new section in the README describing how to get the list of blocks touched by a WAL record. I'd rather have those APIs documented somewhere than nowhere, and the README would be well-suited for that (as well as in-code comments) as those APIs are aimed at developers.


3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly stand for code that used to print block numbers or relfilenodes, which doesn't make much sense to do in an ad hoc way in rmgr-specific desc routines anymore. Will need to go through them all an decide what's the important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract some information from the WAL record, that isn't needed anymore. We could remove the fields from the WAL records altogether, but might be useful to keep for debugging purposes.
They could be removed and be kept as a part of the git history... That's not really a vital point btw.
 

6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.

Another big change in the attached patch version is that XLogRegisterData() now *copies* the data to a working area, instead of merely adding a XLogRecData entry to the chain. This simplifies things in xlog.c, now that the XLogRecData concept is not visible to the rest of the system. This is a subtle semantic change for the callers: you can no longer register a struct first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for performance. In practice, I think it's OK, or even a win, because a typical WAL record payload is very small. But I haven't measured that. If it becomes an issue, we could try to optimize, and link larger data blocks to the rdata chain, but memcpy small small chunks of data. There are a few WAL record types that don't fit in a small statically allocated buffer, so I provided a new function, XLogEnsureSpace(), that can be called before XLogBegin() to allocate a large-enough working area.
 
I just ran the following test on my laptop with your latest patch:
- Patched version:
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32913.419 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32261.045 ms
- master (d97e98e):
=#  CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29651.394 ms
=#  CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29734.887 ms
10% difference... That was just a quick test though.

These changes required changing a couple of places where WAL records are created:

* ginPlaceToPage. This function has a strange split of responsibility between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the WAL record. Then ginPlaceToPage adds some more, and finally calls XLogInsert(). I simplified the SPLIT case so that we now just create full page images of both page halves. We were already logging all the data on both page halves, but we were doing it in a "smarter" way, knowing what kind of pages they were. For example, for an entry tree page, we included an IndexTuple struct for every tuple on the page, but didn't include the item pointers. A straight full page image takes more space, but not much, so I think in any real application it's going to be lost in noise. (again, haven't measured it though)
Thanks, the code looks cleaner this way. Having looked a bit at the way ginPlaceToPage works, it seems a bit sensible to refactor dataPlaceToPageLeaf and dataPlaceToPageInternal to make the WAL operations smarter than they are now. I am not saying that this is impossible, just not that straight-forward, especially considering that this code is like that for a long time.
I actually tried to run some tests on the WAL difference it generates, but got discouraged by the email Fujii-san sent reporting the crash in gin code path.


10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
                 {
                         XLogRecPtr      recptr;

-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath, or it would've triggered an assertion. Fixed now, just by looking at the code, but there's still no test coverage of this so it's likely still broken in other ways. Before this is committed, I think we'll need to go through the coverage report and make sure all the XLogInsert() codepaths (and their redo routines) are covered.

Hm. This is not really related to this patch as it means that even current code is not tested properly. I wouldn't mind taking a closer look at the test suite and come up new tests improving coverage of WAL replay and record construction...


Also, I wanted to run the WAL replay facility to compare before and after
buffer images, but code crashed before... I am still planning to do so once
this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from a sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to investigate it later.
OK, thanks.

I haven't looked at the patch in more details:
1) Noticed a small type => s/reprsent/represent/g
2) This time make installcheck passed for both master and standby, even on contrib stuff...
3) I noticed a bug in gin redo code path when trying to use the WAL replay facility though so I couldn't make a proper run with it.
--
Michael

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Audit of logout
Next
From: Michael Paquier
Date:
Subject: Re: WAL format and API changes (9.5)