
Batching page logging during B-tree build

From
"Andrey M. Borodin"
Date:
Hi!

There's a thread about GiST build [0]. We've found out there that logging a newly created index in batches of 32
pages is slightly faster. Heikki implemented a new logging routine, log_newpages(), for the GiST build, and it makes
the code for this batching nice and clean.
Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.

Thanks!

Best regards, Andrey Borodin.

[0]
https://www.postgresql.org/message-id/flat/12CC4B83-9A5D-4A59-85A1-215CF9D1AB4B%40yandex-team.ru#6b93ee95fcc51ba710715d2ff6017b7f


Re: Batching page logging during B-tree build

From
Peter Geoghegan
Date:
On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.

It doesn't seem to make any difference on my machine, which has an
NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
sync time isn't exceptional. My test case is "reindex index
pgbench_accounts_pkey", with pgbench scale 500. I thought that this
would be a sympathetic case, since it's bottlenecked on writing the
index, with relatively little time spent scanning and sorting in
parallel workers.

Can you provide a test case that is sympathetic towards the patch?

BTW, I noticed that the index build is absurdly bottlenecked on
compressing WAL with wal_compression=on. It's almost 3x slower with
compression turned on!

-- 
Peter Geoghegan



Re: Batching page logging during B-tree build

From
Andres Freund
Date:
Hi,

On 2020-09-23 11:19:18 -0700, Peter Geoghegan wrote:
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.

I wonder what the effect of logging WAL records as huge as this (~256kb)
is on concurrent sessions. I think it's possible that logging 32 pages
at once would cause latency increases for concurrent OLTP-ish
writes. And that a smaller batch size would reduce that, while still
providing most of the speedup.


> It doesn't seem to make any difference on my machine, which has an
> NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional.

Yeah, they are surprisingly slow at syncing; somewhat disappointing for
the upper end of consumer-oriented devices.


> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!

Really should replace WAL compression with lz4 (or possibly zstd).

Greetings,

Andres Freund



Re: Batching page logging during B-tree build

From
Peter Geoghegan
Date:
On Wed, Sep 23, 2020 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
> I wonder what the effect of logging WAL records as huge as this (~256kb)
> is on concurrent sessions. I think it's possible that logging 32 pages
> at once would cause latency increases for concurrent OLTP-ish
> writes. And that a smaller batch size would reduce that, while still
> providing most of the speedup.

Something to consider, but I cannot see any speedup, and so have no
way of evaluating the idea right now.

> > It doesn't seem to make any difference on my machine, which has an
> > NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional.
>
> Yeah, they are surprisingly slow at syncing; somewhat disappointing for
> the upper end of consumer-oriented devices.

I hesitate to call anything about this SSD disappointing, since
overall it performs extremely well -- especially when you consider the
price tag.

Detachable storage is a trend that's here to stay, so the storage
latency is still probably a lot lower than what you'll see on many
serious production systems. For better or worse.

> Really should replace WAL compression with lz4 (or possibly zstd).

Yeah. WAL compression is generally a good idea, and we should probably
find a way to enable it by default (in the absence of some better way
of dealing with the FPI bottleneck, at least). I had no idea that
compression could hurt this much with index builds until now, though.
To be clear: the *entire* index build takes 3 times more wall clock
time, start to finish -- if I drilled down to the portion of the index
build that actually writes WAL then it would be an even greater
bottleneck.

I know that we've tested different compression methods in the past,
but perhaps index build performance was overlooked. My guess is that
the compression algorithm matters a lot less with pure OLTP workloads.
Also, parallel CREATE INDEX may be a bit of an outlier here. Even
still, it's a very important outlier.

-- 
Peter Geoghegan



Re: Batching page logging during B-tree build

From
Andres Freund
Date:
Hi,

On 2020-09-23 12:02:42 -0700, Peter Geoghegan wrote:
> On Wed, Sep 23, 2020 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
> > Really should replace WAL compression with lz4 (or possibly zstd).
> 
> Yeah. WAL compression is generally a good idea, and we should probably
> find a way to enable it by default (in the absence of some better way
> of dealing with the FPI bottleneck, at least). I had no idea that
> compression could hurt this much with index builds until now, though.
> To be clear: the *entire* index build takes 3 times more wall clock
> time, start to finish -- if I drilled down to the portion of the index
> build that actually writes WAL then it would be an even greater
> bottleneck.

I have definitely seen this before. The faster the storage, the bigger
the issue (because there's little to be gained by reducing the volume of
WAL).  In my experience the problem tends to be bigger when there's
*less* concurrency, because without the concurrency it's much less
likely for WAL writes to be a bottleneck.


> My guess is that the compression algorithm matters a lot less with
> pure OLTP workloads.

I don't think that's quite right. A quick benchmark shows a ~1.4x
slowdown for a single-client pgbench on a 1TB 970 Pro during the phase
where FPIs are emitted (pgbench -i -s 100, then a -T10 -c 1 run), with
~45% of the time spent in pglz.

My experience in the past has been that wal_compression is a loser until
the first storage limits are reached; then it's a winner until CPU time
becomes the primary bottleneck, at which point the difference gets
smaller and smaller, sometimes reaching the point where wal_compression
loses again.


> I know that we've tested different compression methods in the past,
> but perhaps index build performance was overlooked.

I am pretty sure we have known that pglz for this was much much slower
than alternatives. I seem to recall somebody posting convincing numbers,
but can't find them just now.

Greetings,

Andres Freund



Re: Batching page logging during B-tree build

From
Andrey Borodin
Date:

> On 23 Sep 2020, at 23:19, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.
>
> It doesn't seem to make any difference on my machine, which has an
> NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional. My test case is "reindex index
> pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> would be a sympathetic case, since it's bottlenecked on writing the
> index, with relatively little time spent scanning and sorting in
> parallel workers.
> Can you provide a test case that is sympathetic towards the patch?
Thanks for looking into this!

I've tried this test on my machine (a 2019 MacBook) at scale 10 for 20 seconds.
With the patch I consistently get ~ tps = 2.403440; without the patch, ~ tps = 1.951975.
On scale 500 with patch
postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 21577,640 ms (00:21,578)
without patch
postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 26139,175 ms (00:26,139)

I think it's hardware-dependent; I will try it on servers.
>
> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!



> On 24 Sep 2020, at 00:33, Andres Freund <andres@anarazel.de> wrote:
>
>> I know that we've tested different compression methods in the past,
>> but perhaps index build performance was overlooked.
>
> I am pretty sure we have known that pglz for this was much much slower
> than alternatives. I seem to recall somebody posting convincing numbers,
> but can't find them just now.

There was a thread about different compression methods [0]. It was demonstrated there that lz4 is 10 times faster at
compression.
We have a patch to speed up pglz compression by 1.43x [1], but I was hoping we would go the lz4/zstd way. It seems to
me now that I actually should finish that speedup patch; it's a very focused local refactoring.

Thanks!

Best regards, Andrey Borodin.


[0]
https://www.postgresql.org/message-id/flat/ea57b49a-ecf0-481a-a77b-631833354f7d%40postgrespro.ru#dcac101f8a73dfce98924066f6a12a13
[1]
https://www.postgresql.org/message-id/flat/169163A8-C96F-4DBE-A062-7D1CECBE9E5D%40yandex-team.ru#996a194c12bacd2d093be2cb7ac54ca6


Re: Batching page logging during B-tree build

From
Andrey Borodin
Date:

> On 23 Sep 2020, at 23:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-23 11:19:18 -0700, Peter Geoghegan wrote:
>> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>> Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.
>
> I wonder what the effect of logging WAL records as huge as this (~256kb)
> is on concurrent sessions. I think it's possible that logging 32 pages
> at once would cause latency increases for concurrent OLTP-ish
> writes. And that a smaller batch size would reduce that, while still
> providing most of the speedup.

Indeed, on my machine performance is indistinguishable between batches of 8, 16, and 32 pages, while a difference
(~3-8%) is still observable with 8 against 4.

Best regards, Andrey Borodin.


Re: Batching page logging during B-tree build

From
Dmitry Dolgov
Date:
> On Fri, Oct 09, 2020 at 11:08:42PM +0500, Andrey Borodin wrote:
>
> > On 23 Sep 2020, at 23:19, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >> Here is a PoC that ports the same routine to B-tree. It allows building B-trees ~10% faster on my machine.
> >
> > It doesn't seem to make any difference on my machine, which has an
> > NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional. My test case is "reindex index
> > pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> > would be a sympathetic case, since it's bottlenecked on writing the
> > index, with relatively little time spent scanning and sorting in
> > parallel workers.
> > Can you provide a test case that is sympathetic towards the patch?
> Thanks for looking into this!
>
> I've tried this test on my machine (a 2019 MacBook) at scale 10 for 20 seconds.
> With the patch I consistently get ~ tps = 2.403440; without the patch, ~ tps = 1.951975.
> On scale 500 with patch
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 21577,640 ms (00:21,578)
> without patch
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 26139,175 ms (00:26,139)
>
> I think it's hardware-dependent; I will try it on servers.

Hi,

Thanks for the patch! Out of curiosity I've tried experimenting with it,
and in the reindex scenario Peter mentioned above I also can't see any
visible difference. Interestingly enough, _bt_blwritepage tracing (IIUC
after reading the patch, it's the main target) shows the influence of
batching, but apparently in my case it hits some other bottleneck (e.g.
profiling shows an increase in LockBuffer and LogicalTapeRead).