Re: WAL format and API changes (9.5) - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: WAL format and API changes (9.5)
Date:
Msg-id: 54611209.705@vmware.com
In response to: Re: WAL format and API changes (9.5) (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: WAL format and API changes (9.5) (Michael Paquier <michael.paquier@gmail.com>)
List: pgsql-hackers
Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

Alvaro: you still have the BRIN WAL-logging code fresh in your memory,
so could you take a look at that part of this patch to check that I
didn't break it? Also, how do you like the new way of writing it,
compared to what you had to do in HEAD?

More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:
> On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
>> Replying to some of your comments below. The rest were trivial issues that
>> I'll just fix.
>>
>> On 10/30/2014 09:19 PM, Andres Freund wrote:
>>> * Is it really a good idea to separate XLogRegisterBufData() from
>>>    XLogRegisterBuffer() the way you've done it? If we ever actually get
>>>    a record with a large number of blocks touched, this is essentially
>>>    O(touched_buffers*data_entries).
>>
>> Are you worried that the linear search in XLogRegisterBufData(), to find the
>> right registered_buffer struct, might be expensive? If that ever becomes a
>> problem, a simple fix would be to start the linear search from the end, and
>> make sure that when you touch a large number of blocks, you do all the
>> XLogRegisterBufData() calls right after the corresponding
>> XLogRegisterBuffer() call.
>
> Yes, that was what I was (mildly) worried about. Since you specified a
> high limit of buffers I'm sure someone will come up with a use case for
> it ;)
>
>> I've also thought about having XLogRegisterBuffer() return the pointer to the
>> struct, and passing it as argument to XLogRegisterBufData. So the pattern in
>> WAL generating code would be like:
>>
>> registered_buffer *buf0;
>>
>> buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
>> XLogRegisterBufData(buf0, data, length);
>>
>> registered_buffer would be opaque to the callers. That would have potential
>> to turn XLogRegisterBufData into a macro or inline function, to eliminate
>> the function call overhead. I played with that a little bit, but the
>> difference in performance was so small that it didn't seem worth it. But
>> passing the registered_buffer pointer, like above, might not be a bad thing
>> anyway.
>
> Yes, that was roughly what I was thinking of as well. It's not all that
> pretty, but it generally does seem like a good idea to me anyway.

I ended up doing something different: The block_id is now used as the
index into the registered_buffers array. So looking up the right struct
is now just &registered_buffers[block_id]. That obviously gets rid of
the linear search. It doesn't seem to make any difference in the quick
performance testing I've been doing.
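
To illustrate how that looks in the code (just a sketch, with a made-up
struct layout and a made-up maximum, not the actual code from the patch):

/* Illustrative sketch only; field names and the limit are assumptions. */
#define MAX_REGISTERED_BUFFERS  32

typedef struct
{
	bool        in_use;     /* slot filled by XLogRegisterBuffer()? */
	Buffer      buffer;     /* buffer being WAL-logged */
	uint8       flags;      /* REGBUF_* flags */
	char       *data;       /* per-buffer data (simplified to one chunk) */
	uint32      data_len;
} registered_buffer;

static registered_buffer registered_buffers[MAX_REGISTERED_BUFFERS];

/*
 * Because block_id doubles as the array index, attaching data to a
 * previously registered buffer is a direct lookup, not a linear search.
 */
static void
XLogRegisterBufData_sketch(uint8 block_id, char *data, uint32 len)
{
	registered_buffer *regbuf = &registered_buffers[block_id];

	Assert(regbuf->in_use);  /* XLogRegisterBuffer() must have been called */
	regbuf->data = data;
	regbuf->data_len = len;
}

The caller still passes the same block_id to both calls; the only change
is how the struct is looked up.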

>>> * There's lots of functions like XLogRecHasBlockRef() that dig through
>>>    the wal record. A common pattern is something like:
>>> if (XLogRecHasBlockRef(record, 1))
>>>      XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
>>> else
>>>      oldblk = newblk;
>>>
>>>    I think doing that repeatedly is quite a bad idea. We should parse the
>>>    record once and then use it in a sensible format. Not do it in pieces,
>>>    over and over again. It's not like we ignore backup blocks - so doing
>>>    this lazily doesn't make sense to me.
>>>    Especially as ValidXLogRecord() *already* has parsed the whole damn
>>>    thing once.
>>
>> Hmm. Adding some kind of a parsed XLogRecord representation would need a
>> fair amount of new infrastructure.
>
> True.
>
>> The vast majority of WAL records contain one,
>> maybe two, block references, so it's not that expensive to find the right
>> one, even if you do it several times.
>
> I'm not convinced. It's not an infrequent thing these days to hear
> people being bottlenecked by replay. And grovelling repeatedly through
> larger records isn't *that* cheap.

I haven't done anything about this. We might want to do that, but I'd
like to get this committed now, with all the changes to the AMs, and
then spend some time trying to further optimize the WAL format. I might
add the "deparsed WAL record" infrastructure as part of that
optimization work.
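
Just to sketch what I have in mind there (purely hypothetical at this
point; none of these struct or field names exist in the patch):

/* Hypothetical "deparsed" record, parsed once right after the CRC check. */
#define MAX_BLOCK_ID    32      /* placeholder for the real per-record limit */

typedef struct
{
	bool        in_use;     /* was this block_id referenced at all? */
	RelFileNode rnode;
	ForkNumber  forknum;
	BlockNumber blkno;
	bool        has_image;  /* full-page image included? */
	char       *data;       /* per-block data, if any */
	uint16      data_len;
} DecodedBlockRef;

typedef struct
{
	XLogRecord *header;     /* the fixed-size record header */
	char       *main_data;  /* rmgr-specific "main" data */
	uint32      main_data_len;
	int         max_block_id;   /* highest block_id referenced, or -1 */
	DecodedBlockRef blocks[MAX_BLOCK_ID + 1];
} DecodedXLogRecord;

Redo routines could then test blocks[id].in_use directly, instead of
grovelling through the raw record for every XLogRecHasBlockRef() call.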

>> I ran a quick performance test on WAL replay performance yesterday. I ran
>> pgbench for 1000000 transactions with WAL archiving enabled, and measured
>> the time it took to replay the generated WAL. I did that with and without
>> the patch, and I didn't see any big difference in replay times. I also ran
>> "perf" on the startup process, and the profiles looked identical. I'll do
>> more comprehensive testing later, with different index types, but I'm
>> convinced that there is no performance issue in replay that we'd need to
>> worry about.
>
> Interesting. What checkpoint_segments/timeout and what scale did you
> use? Since that heavily influences the average size of the record that's
> quite relevant...

Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s.
pg_xlogdump --stats says:

> Type              N      (%)    Record size      (%)       FPI size      (%)   Combined size      (%)
> ----              -      ---    -----------      ---       --------      ---   -------------      ---
> XLOG           1434 (  0.01)          48912 (  0.00)       10149668 (  0.42)        10198580 (  0.29)
> Transaction 4000767 ( 16.24)      176048028 ( 16.75)              0 (  0.00)       176048028 (  5.04)
> Storage         232 (  0.00)          11136 (  0.00)              0 (  0.00)           11136 (  0.00)
> CLOG            123 (  0.00)           4428 (  0.00)              0 (  0.00)            4428 (  0.00)
> Database          2 (  0.00)             96 (  0.00)              0 (  0.00)              96 (  0.00)
> Tablespace        0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> MultiXact         0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> RelMap           14 (  0.00)           7784 (  0.00)              0 (  0.00)            7784 (  0.00)
> Standby          94 (  0.00)           6016 (  0.00)              0 (  0.00)            6016 (  0.00)
> Heap2       3563327 ( 14.47)      142523126 ( 13.56)     1332675761 ( 54.52)      1475198887 ( 42.20)
> Heap       16880705 ( 68.54)      726291821 ( 69.09)      972350846 ( 39.78)      1698642667 ( 48.59)
> Btree        182603 (  0.74)        6342386 (  0.60)      129238796 (  5.29)       135581182 (  3.88)
> Hash              0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> Gin               0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> Gist              0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> Sequence          0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> SPGist            0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
> BRIN              0 (  0.00)              0 (  0.00)              0 (  0.00)               0 (  0.00)
>             --------               ---------               ----------              ----------
> Total      24629301             1051283733 [30.07%]     2444415071 [69.93%]      3495698804 [100%]

So full-page writes indeed form most of the volume.

I ran this again with full_page_writes=off. With that, the patched
version generated about 3% more WAL than master. Replay seems to be a
few percentage points slower, too, which can be explained by the
increased WAL volume.

Attached is a shell script I used to create the WAL.


My next steps for this are:

1. Do more correctness testing of WAL replay routines. I've spent some
time looking at the coverage report generated by "make coverage-html",
after running "make installcheck" on a master-server setup, and crafting
test cases to hit the redo routines that are not otherwise covered by
our regression tests. That has already uncovered a couple of bugs in the
patch that I've now fixed. (That's also how I found the bug I just fixed
in 1961b1c1.)

2. Reduce the WAL volume. Any increase in WAL volume is painful on its
own, and also reduces performance in general. This patch seems to add 1-5%,
depending on the workload. There are some padding bytes in the WAL format
that I could use. I've refrained from using them up for now, because I
consider that cheating: we could have kept the current WAL format and
used the padding bytes to make that even smaller/faster. But if we
accept that this patch is worth it, even if it leads to some WAL bloat,
we'll want to buy that back if we can so that users won't see a regression.

If I can find some ways to reduce the WAL volume, I might still commit
this patch more or less as it is first, and commit the optimizations
separately later. That depends on what those optimizations look like.

- Heikki


