Thread: Batching page logging during B-tree build
Hi!

There's a thread about GiST build [0]. And we've found out there that logging a newly created index with batches of 32 pages is slightly faster. Heikki implemented a new logging routine log_newpages() for GiST build and it makes the code for this batching nice and clean.

Here is a PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/12CC4B83-9A5D-4A59-85A1-215CF9D1AB4B%40yandex-team.ru#6b93ee95fcc51ba710715d2ff6017b7f
Attachment
On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> Here is PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.

It doesn't seem to make any difference on my machine, which has an NVMe SSD (a Samsung 970 Pro). This is quite a fast SSD, though the sync time isn't exceptional. My test case is "reindex index pgbench_accounts_pkey", with pgbench scale 500. I thought that this would be a sympathetic case, since it's bottlenecked on writing the index, with relatively little time spent scanning and sorting in parallel workers.

Can you provide a test case that is sympathetic towards the patch?

BTW, I noticed that the index build is absurdly bottlenecked on compressing WAL with wal_compression=on. It's almost 3x slower with compression turned on!

--
Peter Geoghegan
Hi,

On 2020-09-23 11:19:18 -0700, Peter Geoghegan wrote:
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> > Here is PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.

I wonder what the effect of logging WAL records as huge as this (~256kB) is on concurrent sessions. I think it's possible that logging 32 pages at once would cause latency increases for concurrent OLTP-ish writes. And that a smaller batch size would reduce that, while still providing most of the speedup.

> It doesn't seem to make any difference on my machine, which has an
> NVMe SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional.

Yea, they are surprisingly slow at syncing, somewhat disappointing for the upper end of the consumer oriented devices.

> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!

Really should replace WAL compression with lz4 (or possibly zstd).

Greetings,

Andres Freund
On Wed, Sep 23, 2020 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
> I wonder what the effect of logging WAL records as huge as this (~256kb)
> is on concurrent sessions. I think it's possible that logging 32 pages
> at once would cause latency increases for concurrent OLTP-ish
> writes. And that a smaller batch size would reduce that, while still
> providing most of the speedup.

Something to consider, but I cannot see any speedup, and so have no way of evaluating the idea right now.

> > It doesn't seem to make any difference on my machine, which has an
> > NVMe SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional.
>
> Yea, they are surprisingly slow at syncing, somewhat disappointing for
> the upper end of the consumer oriented devices.

I hesitate to call anything about this SSD disappointing, since overall it performs extremely well -- especially when you consider the price tag. Detachable storage is a trend that's here to stay, so the storage latency is still probably a lot lower than what you'll see on many serious production systems. For better or worse.

> Really should replace WAL compression with lz4 (or possibly zstd).

Yeah. WAL compression is generally a good idea, and we should probably find a way to enable it by default (in the absence of some better way of dealing with the FPI bottleneck, at least). I had no idea that compression could hurt this much with index builds until now, though. To be clear: the *entire* index build takes 3 times more wall clock time, start to finish -- if I drilled down to the portion of the index build that actually writes WAL then it would be an even greater bottleneck.

I know that we've tested different compression methods in the past, but perhaps index build performance was overlooked. My guess is that the compression algorithm matters a lot less with pure OLTP workloads. Also, parallel CREATE INDEX may be a bit of an outlier here. Even still, it's a very important outlier.
-- Peter Geoghegan
Hi,

On 2020-09-23 12:02:42 -0700, Peter Geoghegan wrote:
> On Wed, Sep 23, 2020 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
> > Really should replace WAL compression with lz4 (or possibly zstd).
>
> Yeah. WAL compression is generally a good idea, and we should probably
> find a way to enable it by default (in the absence of some better way
> of dealing with the FPI bottleneck, at least). I had no idea that
> compression could hurt this much with index builds until now, though.
> To be clear: the *entire* index build takes 3 times more wall clock
> time, start to finish -- if I drilled down to the portion of the index
> build that actually writes WAL then it would be an even greater
> bottleneck.

I have definitely seen this before. The faster the storage, the bigger the issue (because there's little to be gained by reducing the volume of WAL). In my experience the problem tends to be bigger when there's *less* concurrency, because without the concurrency it's much less likely for WAL writes to be a bottleneck.

> My guess is that the compression algorithm matters a lot less with
> pure OLTP workloads.

I don't think that's quite right. A quick benchmark shows a ~1.4x slowdown for a single client pgbench on a 1TB 970 Pro during the phase FPIs are emitted (pgbench -i -s 100, then a -T10 -c 1 run), with ~45% of the time spent in pglz.

My experience in the past has been that wal_compression is a loser until the first storage limits are reached, then it's a winner until CPU time becomes the primary bottleneck, at which time the difference gets smaller and smaller, sometimes reaching the point where wal_compression loses again.

> I know that we've tested different compression methods in the past,
> but perhaps index build performance was overlooked.

I am pretty sure we have known that pglz for this was much much slower than alternatives. I seem to recall somebody posting convincing numbers, but can't find them just now.

Greetings,

Andres Freund
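For anyone reproducing the compression numbers above: on the PostgreSQL releases current at the time of this thread, wal_compression is a plain boolean GUC and pglz is the only compressor, which is exactly why replacing it comes up here. A minimal postgresql.conf fragment:

```
# Compress full-page images (FPIs) written to WAL. On these releases this
# is a boolean and the compressor is hardwired to pglz; the lz4/zstd
# alternatives discussed in this thread would require a new setting.
wal_compression = on
```

Since it is a superuser-settable GUC, it can also be flipped for a single session (SET wal_compression = on) before running the reindex, making an A/B comparison easy without a server restart.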
> On 23 Sep 2020, at 23:19, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> Here is PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.
>
> It doesn't seem to make any difference on my machine, which has an
> NVMe SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional. My test case is "reindex index
> pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> would be a sympathetic case, since it's bottlenecked on writing the
> index, with relatively little time spent scanning and sorting in
> parallel workers.
> Can you provide a test case that is sympathetic towards the patch?

Thanks for looking into this!

I've tried this test on my machine (2019 MacBook) on scale 10 for 20 seconds. With the patch I get consistently ~ tps = 2.403440, without the patch ~ tps = 1.951975.

On scale 500, with the patch:

postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 21577,640 ms (00:21,578)

without the patch:

postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 26139,175 ms (00:26,139)

I think it's hardware dependent, I will try on servers.

> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!

> On 24 Sep 2020, at 00:33, Andres Freund <andres@anarazel.de> wrote:
>
>> I know that we've tested different compression methods in the past,
>> but perhaps index build performance was overlooked.
>
> I am pretty sure we have known that pglz for this was much much slower
> than alternatives. I seem to recall somebody posting convincing numbers,
> but can't find them just now.

There was a thread about different compression methods [0]. It was demonstrated there that lz4 is 10 times faster on compression. We have a patch to speed up pglz compression x1.43 [1], but I was hoping that we would go the lz4/zstd way. It seems to me now I actually should finish that speedup patch, it's a very focused local refactoring.

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/ea57b49a-ecf0-481a-a77b-631833354f7d%40postgrespro.ru#dcac101f8a73dfce98924066f6a12a13
[1] https://www.postgresql.org/message-id/flat/169163A8-C96F-4DBE-A062-7D1CECBE9E5D%40yandex-team.ru#996a194c12bacd2d093be2cb7ac54ca6
> On 23 Sep 2020, at 23:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-23 11:19:18 -0700, Peter Geoghegan wrote:
>> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>> Here is PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.
>
> I wonder what the effect of logging WAL records as huge as this (~256kb)
> is on concurrent sessions. I think it's possible that logging 32 pages
> at once would cause latency increases for concurrent OLTP-ish
> writes. And that a smaller batch size would reduce that, while still
> providing most of the speedup.

Indeed, on my machine performance is indistinguishable between batches of 8, 16 and 32 pages, while the speedup is still observable (~3...8%) with 8 against 4.

Best regards, Andrey Borodin.
> On Fri, Oct 09, 2020 at 11:08:42PM +0500, Andrey Borodin wrote:
> > On 23 Sep 2020, at 23:19, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> >> Here is PoC with porting that same routine to B-tree. It allows to build B-trees ~10% faster on my machine.
> >
> > It doesn't seem to make any difference on my machine, which has an
> > NVMe SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional. My test case is "reindex index
> > pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> > would be a sympathetic case, since it's bottlenecked on writing the
> > index, with relatively little time spent scanning and sorting in
> > parallel workers.
> > Can you provide a test case that is sympathetic towards the patch?
>
> Thanks for looking into this!
>
> I've tried this test on my machine (2019 MacBook) on scale 10 for 20 seconds.
> With the patch I get consistently ~ tps = 2.403440, without the patch ~ tps = 1.951975.
>
> On scale 500, with the patch:
>
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 21577,640 ms (00:21,578)
>
> without the patch:
>
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 26139,175 ms (00:26,139)
>
> I think it's hardware dependent, I will try on servers.

Hi,

Thanks for the patch! Out of curiosity I've tried to experiment with it, and in the reindex scenario Peter mentioned above I also can't see any visible difference. Interestingly enough, _bt_blwritepage tracing (IIUC after reading the patch it's the main target) shows the influence of batching, but apparently in my case it hits some other bottleneck (e.g. profiling shows increases in LockBuffer and LogicalTapeRead).