Thread: Processing btree walks as a batch to parallelize IO
$SUBJECT is still a very loosely formed idea, so forgive lack of detail or things I've likely missed, but I wanted to get it out there to see if it sounded at all intriguing to people.
Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.
A specific area where this is particularly painful is btree index reads. Walking the tree to leaf pages isn't naturally prefetchable, and so for each level you pay the random page cost. Of course higher levels in the tree will almost certainly exhibit emergent behavior such that they (just by fact of the LRU caching) will be in the buffer cache, but for a large index lower levels likely won't be.
If we squint a bit, insertions look a whole lot like reads as well, since we have to walk the tree to find the leaf insertion page for a new tuple. This is particularly true for indexes where the inserted data is roughly randomly distributed, like a uuid.
The read-for-lookups problem is harder to solve, but the cost as it relates to table inserts is possibly more tractable. Tables typically have more than one index to update, so the obvious approach is "let's just parallelize the index insertions". Of course we know that's difficult given the multi-process approach Postgres uses for parallelism.
Another approach that at first glance seems like it fits better into Postgres (I'm not claiming it's easy or a small patch) would be to process a batch of indexes at once. For example, if the index access methods were extended to allow being given a list of indexes that need to be walked, then the btree code could process each layer in the walk as a group -- issuing IO fetches for all of the first-level blocks in the tree, then computing all of the next-level blocks needed and issuing those IO requests at once, and so on.
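To make the shape of that concrete, here's a rough sketch of the kind of per-level loop this would enable. The struct and _bt_batched_step() are hypothetical and buffer locking / concurrent-split handling is ignored entirely; only PrefetchBuffer(), ReadBuffer(), and ReleaseBuffer() are existing functions.

    /*
     * Hypothetical sketch only: descend a batch of btree indexes one level
     * at a time, issuing prefetch requests for every index's next block
     * before reading any of them.
     */
    typedef struct BatchedDescent
    {
        Relation    index;          /* index being descended */
        void       *insert_key;     /* per-index insertion key (stand-in) */
        BlockNumber next_blkno;     /* block to visit next; starts at the root */
        bool        reached_leaf;   /* descent finished? */
    } BatchedDescent;

    static void
    bt_batched_descend(BatchedDescent *descents, int ndescents)
    {
        bool        more = true;

        while (more)
        {
            more = false;

            /* Pass 1: queue reads for every index's next block. */
            for (int i = 0; i < ndescents; i++)
            {
                if (!descents[i].reached_leaf)
                    PrefetchBuffer(descents[i].index, MAIN_FORKNUM,
                                   descents[i].next_blkno);
            }

            /*
             * Pass 2: read each block (hopefully already in flight) and
             * compute the child block to visit at the next level.
             */
            for (int i = 0; i < ndescents; i++)
            {
                Buffer      buf;

                if (descents[i].reached_leaf)
                    continue;

                buf = ReadBuffer(descents[i].index, descents[i].next_blkno);

                /*
                 * _bt_batched_step() is hypothetical: search the page for
                 * the insertion key; report true if this is the leaf,
                 * otherwise return the downlink to follow in *next_blkno.
                 */
                descents[i].reached_leaf =
                    _bt_batched_step(buf, descents[i].insert_key,
                                     &descents[i].next_blkno);
                ReleaseBuffer(buf);

                if (!descents[i].reached_leaf)
                    more = true;
            }
        }
    }

The point is that the first pass can queue all of a level's reads before the second pass blocks on any of them, so inserting into N indexes costs roughly one random-read latency per tree level rather than one per level per index.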
In some workloads we've been testing I believe such an approach could plausibly improve table insert (and update) performance by several hundred percent.
I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.
Thoughts?
James
On 4/9/21 7:33 PM, James Coleman wrote:
> $SUBJECT is still a very loosely formed idea, so forgive lack of detail or things I've likely missed, but I wanted to get it out there to see if it sounded at all intriguing to people.
> Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.
> A specific area where this is particularly painful is btree index reads. Walking the tree to leaf pages isn't naturally prefetchable, and so for each level you pay the random page cost. Of course higher levels in the tree will almost certainly exhibit emergent behavior such that they (just by fact of the LRU caching) will be in the buffer cache, but for a large index lower levels likely won't be.
What do you consider a large index level?
Consider a 1TB table, with just a single UUID column - that's ~25B rows, give or take. Real tables will have more columns, so this seems like a reasonable model of the largest number of rows per relation. With ~32B per index tuple, that's about 100M leaf pages, and with ~256 branches per internal page, that's still only ~5 levels. I think it's quite rare to see indexes with more than 6 or 7 levels.
And the internal pages are maybe 0.5% of the whole index (so ~4GB out of 750GB). I think the usual expectation is that most of that will fit into RAM, but of course there may be more indexes competing for that.
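For reference, that estimate can be spelled out with a toy back-of-the-envelope calculation (same round numbers as above, 8KB pages assumed, nothing measured):

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double      nrows = 25e9;                   /* ~25B rows in the 1TB table */
        double      tuples_per_leaf = 8192 / 32.0;  /* ~32B per index tuple */
        double      fanout = 256;                   /* downlinks per internal page */

        double      leaf_pages = nrows / tuples_per_leaf;   /* ~100M */
        double      pages = leaf_pages;
        double      internal_pages = 0;
        int         levels = 1;                     /* count the leaf level itself */

        while (pages > 1)
        {
            pages = ceil(pages / fanout);
            internal_pages += pages;
            levels++;
        }

        printf("leaf pages: %.0f, internal pages: %.0f (%.2f%% of the index), levels: %d\n",
               leaf_pages, internal_pages,
               100.0 * internal_pages / (leaf_pages + internal_pages), levels);
        /* prints roughly: 100M leaf pages, ~380K internal pages (~0.4%), 5 levels */
        return 0;
    }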
I think the index level is not really the crucial bit - it's more about the total amount of indexes in the DB.
> If we squint a bit, insertions look a whole lot like reads as well, since we have to walk the tree to find the leaf insertion page for a new tuple. This is particularly true for indexes where the inserted data is roughly randomly distributed, like a uuid.
Yep. We need to walk the index to the leaf pages in both cases, both for read and insert workloads.
> The read-for-lookups problem is harder to solve, but the cost as it relates to table inserts is possibly more tractable. Tables typically have more than one index to update, so the obvious approach is "let's just parallelize the index insertions". Of course we know that's difficult given the multi-process approach Postgres uses for parallelism.
Hmm. Not sure if reads are harder to deal with, but I think you're right those two cases (reads and writes) may look similar at the level of a single index, but may need rather different approaches exactly because inserts have to deal with all indexes, while reads only really deal with a single index.
FWIW I think there are a couple options for improving reads, at least in some cases.
1) I wonder if e.g. _bt_readnextpage could prefetch at least one page ahead. We can't look further ahead, but perhaps this would help.
2) In some cases (e.g. nested loop with inner index scan) we could collect an array of values and then look them up at once, which should allow us to do at least some of the I/O in parallel, I think. That's similar to what you propose for writes, except that it works against the same index.
> Another approach that at first glance seems like it fits better into Postgres (I'm not claiming it's easy or a small patch) would be to process a batch of indexes at once. For example, if the index access methods were extended to allow being given a list of indexes that need to be walked, then the btree code could process each layer in the walk as a group -- issuing IO fetches for all of the first-level blocks in the tree, then computing all of the next-level blocks needed and issuing those IO requests at once, and so on.
Yeah, I agree having a way to say "prefetch all pages needed to insert these keys into these indexes" might be better than just parallelizing it in a "naive" way.
Not sure how complex it would be - I think the API would need to allow traversing the index with each step split into phases:
1) determine the page needed for the next step, return it to caller
2) the caller collects pages from all indexes, initiates prefetch
3) instruct indexes to actually do the next step, stop if it's a leaf page (otherwise go to (1))
And then we might just do index inserts in a serial way, just like we do today, hoping to hit the prefetched pages.
FWIW while this probably helps saturate the I/O, it unfortunately does nothing to reduce the write amplification - we still need to modify the same amount of leaf pages in all indexes, produce the same amount of WAL etc. I think there were some proposals to add small internal buffers, and instead of pushing the inserts all the way down to the leaf page, just add them to the internal buffer. And when the buffer gets full, propagate the contents to the next level of buffers.
For example, each internal page might have one "buffer" page, so the index size would not really change (the internal pages would double, but it's still just ~1% of the total index size). Of course, this makes lookups more complex/expensive, because we need to check the internal buffers. But it does reduce the write amplification, because it combines changes to leaf pages.
> In some workloads we've been testing I believe such an approach could plausibly improve table insert (and update) performance by several hundred percent.
> I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.
> Thoughts?
I think you're right indexes may be a serious bottleneck in some cases, so exploring ways to improve that seems useful. Ultimately I think we should be looking for ways to reduce the amount of work we need to do, but parallelizing it (i.e. doing the same amount of work but in multiple processes) is a valid approach too.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 9, 2021 at 4:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 4/9/21 7:33 PM, James Coleman wrote:
> > $SUBJECT is still a very loosely formed idea, so forgive lack of detail or things I've likely missed, but I wanted to get it out there to see if it sounded at all intriguing to people.
> > Background: One of the big problems with non-local storage such as AWS EBS volumes or a SAN is that when a large database (really, the working set, where the working set includes reads) exceeds the size of the buffer cache (and page cache), the cost of random page reads hitting the underlying disk system dominates. This is because networked disks have an order of magnitude higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage). In some of our experiments on Aurora I've seen a 10x difference versus pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is similar.
> > A specific area where this is particularly painful is btree index reads. Walking the tree to leaf pages isn't naturally prefetchable, and so for each level you pay the random page cost. Of course higher levels in the tree will almost certainly exhibit emergent behavior such that they (just by fact of the LRU caching) will be in the buffer cache, but for a large index lower levels likely won't be.
> What do you consider a large index level?
In general it's probably all levels but the leaves (though it depends on cache and index size, etc.)
> Consider a 1TB table, with just a single UUID column - that's ~25B rows, give or take. Real tables will have more columns, so this seems like a reasonable model of the largest number of rows per relation. With ~32B per index tuple, that's about 100M leaf pages, and with ~256 branches per internal page, that's still only ~5 levels. I think it's quite rare to see indexes with more than 6 or 7 levels.
> And the internal pages are maybe 0.5% of the whole index (so ~4GB out of 750GB). I think the usual expectation is that most of that will fit into RAM, but of course there may be more indexes competing for that.
> I think the index level is not really the crucial bit - it's more about the total amount of indexes in the DB.
I suppose? If the combined size of the tables/indexes/etc. is sufficiently large relative to the cache size, the quantity won't matter.
> > If we squint a bit, insertions look a whole lot like reads as well, since we have to walk the tree to find the leaf insertion page for a new tuple. This is particularly true for indexes where the inserted data is roughly randomly distributed, like a uuid.
> Yep. We need to walk the index to the leaf pages in both cases, both for read and insert workloads.
> > The read-for-lookups problem is harder to solve, but the cost as it relates to table inserts is possibly more tractable. Tables typically have more than one index to update, so the obvious approach is "let's just parallelize the index insertions". Of course we know that's difficult given the multi-process approach Postgres uses for parallelism.
> Hmm. Not sure if reads are harder to deal with, but I think you're right those two cases (reads and writes) may look similar at the level of a single index, but may need rather different approaches exactly because inserts have to deal with all indexes, while reads only really deal with a single index.
Right. In practice it's harder to deal with a single index scan because you don't have multiple such scans to parallelize.
> FWIW I think there are a couple options for improving reads, at least in some cases.
> 1) I wonder if e.g. _bt_readnextpage could prefetch at least one page ahead. We can't look further ahead, but perhaps this would help.
> 2) In some cases (e.g. nested loop with inner index scan) we could collect an array of values and then look them up at once, which should allow us to do at least some of the I/O in parallel, I think. That's similar to what you propose for writes, except that it works against the same index.
The "collect an array of values" approach isn't one I'd considered, but it seems likely to be interesting.
> > Another approach that at first glance seems like it fits better into Postgres (I'm not claiming it's easy or a small patch) would be to process a batch of indexes at once. For example, if the index access methods were extended to allow being given a list of indexes that need to be walked, then the btree code could process each layer in the walk as a group -- issuing IO fetches for all of the first-level blocks in the tree, then computing all of the next-level blocks needed and issuing those IO requests at once, and so on.
> Yeah, I agree having a way to say "prefetch all pages needed to insert these keys into these indexes" might be better than just parallelizing it in a "naive" way.
> Not sure how complex it would be - I think the API would need to allow traversing the index with each step split into phases:
> 1) determine the page needed for the next step, return it to caller
> 2) the caller collects pages from all indexes, initiates prefetch
> 3) instruct indexes to actually do the next step, stop if it's a leaf page (otherwise go to (1))
> And then we might just do index inserts in a serial way, just like we do today, hoping to hit the prefetched pages.
Correct; this is roughly what I was envisioning.
> FWIW while this probably helps saturate the I/O, it unfortunately does nothing to reduce the write amplification - we still need to modify the same amount of leaf pages in all indexes, produce the same amount of WAL etc. I think there were some proposals to add small internal buffers, and instead of pushing the inserts all the way down to the leaf page, just add them to the internal buffer. And when the buffer gets full, propagate the contents to the next level of buffers.
> For example, each internal page might have one "buffer" page, so the index size would not really change (the internal pages would double, but it's still just ~1% of the total index size). Of course, this makes lookups more complex/expensive, because we need to check the internal buffers. But it does reduce the write amplification, because it combines changes to leaf pages.
I think I've seen that discussion, and it's very interesting, but I think it's still orthogonal to this.
> > In some workloads we've been testing I believe such an approach could plausibly improve table insert (and update) performance by several hundred percent.
> > I don't have any code at the moment to show here, but I wanted to get the idea out there to see if there were any immediate reactions or other thoughts on the topic.
> > Thoughts?
> I think you're right indexes may be a serious bottleneck in some cases, so exploring ways to improve that seems useful. Ultimately I think we should be looking for ways to reduce the amount of work we need to do, but parallelizing it (i.e. doing the same amount of work but in multiple processes) is a valid approach too.
Thanks for the feedback.
James
On Fri, 9 Apr 2021 at 16:58, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 4/9/21 7:33 PM, James Coleman wrote:
> > A specific area where this is particularly painful is btree index reads. Walking the tree to leaf pages isn't naturally prefetchable, and so for each level you pay the random page cost. Of course higher levels in the tree will almost certainly exhibit emergent behavior such that they (just by fact of the LRU caching) will be in the buffer cache, but for a large index lower levels likely won't be.
We've talked before about buffering inserts even just for disk-based indexes, much like how GIN buffers inserts and periodically flushes them out. We talked about doing a local buffer in each session, since no other session even needs to see these buffered inserts until commit anyway. And we can more efficiently merge in multiple keys at once than doing them one by one.
But that was just for disk I/O. For something with longer latency it would be an even bigger win. Buffer the inserted keys in local memory in case you do lookups in the same session, and start the I/O to insert the rows into the index, but handle that in the background or in a separate process without blocking the transaction until commit.
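As a very rough sketch of that shape (all names below are hypothetical; unique-index enforcement, memory limits, and abort handling are ignored, and GIN's pending list is only an analogue, not shared code):

    /*
     * Hypothetical sketch of a session-local buffer of pending index
     * insertions that is flushed at pre-commit.  In a real patch the
     * entries would live in a transaction-lifetime memory context.
     */
    typedef struct PendingIndexInsert
    {
        Oid         indexoid;                   /* target index */
        Datum       values[INDEX_MAX_KEYS];     /* key column values */
        bool        isnull[INDEX_MAX_KEYS];
        ItemPointerData heap_tid;               /* heap tuple the entry points at */
    } PendingIndexInsert;

    static List *pending_index_inserts = NIL;   /* per-backend, hypothetical */

    /* Called in place of an immediate index insertion during the transaction. */
    static void
    buffer_index_insert(Relation index, int nkeys, Datum *values, bool *isnull,
                        ItemPointer heap_tid)
    {
        PendingIndexInsert *entry = palloc(sizeof(PendingIndexInsert));

        entry->indexoid = RelationGetRelid(index);
        memcpy(entry->values, values, sizeof(Datum) * nkeys);
        memcpy(entry->isnull, isnull, sizeof(bool) * nkeys);
        entry->heap_tid = *heap_tid;
        pending_index_inserts = lappend(pending_index_inserts, entry);
    }

    /*
     * At pre-commit (or when the buffer grows too large): sort the entries
     * by index and key so consecutive entries tend to land on the same leaf
     * page, then apply them -- ideally prefetching each batch's descents.
     * Both helpers below are hypothetical.
     */
    static void
    flush_pending_index_inserts(void)
    {
        sort_pending_entries(pending_index_inserts);
        insert_sorted_batches(pending_index_inserts);
        pending_index_inserts = NIL;
    }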
> What do you consider a large index level?
> Consider a 1TB table, with just a single UUID column - that's ~25B rows, give or take. Real tables will have more columns, so this seems like a reasonable model of the largest number of rows per relation. With ~32B per index tuple, that's about 100M leaf pages, and with ~256 branches per internal page, that's still only ~5 levels. I think it's quite rare to see indexes with more than 6 or 7 levels.
That's a good model for a well-designed schema with an efficient index. There are plenty of less-than-optimal schemas with indexes on longer column lists or fairly large text fields....
--
greg
On Fri, May 7, 2021 at 3:34 PM Greg Stark <stark@mit.edu> wrote:
> We've talked before about buffering inserts even just for disk-based indexes, much like how GIN buffers inserts and periodically flushes them out. We talked about doing a local buffer in each session, since no other session even needs to see these buffered inserts until commit anyway. And we can more efficiently merge in multiple keys at once than doing them one by one.
Mark Callaghan's high-level analysis of the trade-offs here is worth a read, too.
> That's a good model for a well-designed schema with an efficient index. There are plenty of less-than-optimal schemas with indexes on longer column lists or fairly large text fields....
Suffix truncation can take care of this -- all you really need is a minimally distinguishing separator key to delineate which values belong on which page one level down. It is almost always possible for leaf page splits to find a way to make the new high key (also the key to be inserted in the parent level) much smaller than your typical key. Granted, we don't have what I've called "classic" suffix truncation (truncation within a text column) yet, so this analysis isn't going to work with long text keys (we only truncate at the attribute granularity currently).
Even if we're pessimistic about suffix truncation, the logarithmic rate of growth still wins -- Tomas' analysis is sound. You cannot realistically make a Postgres B-Tree have more than about 1% of all pages as internal pages, unless you make the indexed keys ludicrously large -- as in several hundred bytes each (~0.5% is typical in practice). I think that 6 levels is very pessimistic, even with a massive B-Tree with weirdly large keys. My mental model for internal pages is that they are practically guaranteed to be in shared_buffers at all times, which is about as accurate as any generalization like that ever can be.
I once wrote a test harness that deliberately created a B-Tree that was as tall as possible -- something with the largest possible index tuples on the leaf level (had to disable TOAST for this). I think that it was about 7 or 8 levels deep. The CPU overhead of the test case made it excruciatingly slow, but it wasn't I/O bound at all (pretty sure it all fitted in shared_buffers).
--
Peter Geoghegan
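To make "minimally distinguishing separator key" concrete, here is a small standalone illustration of within-column suffix truncation, which as noted above Postgres does not do yet (it only truncates whole attributes). It uses plain byte comparison as a stand-in for a real collation-aware comparator:

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy illustration: find the shortest prefix of the right-hand key that
     * still sorts strictly after the left-hand key, and use that prefix as
     * the separator key in the parent page.  Assumes right sorts strictly
     * after left, which is always true at a leaf page split point.
     */
    static int
    truncated_separator_len(const char *left, const char *right)
    {
        int         len = 0;

        /* Keep adding bytes of "right" until the prefix compares greater. */
        while (strncmp(left, right, len + 1) >= 0)
            len++;
        return len + 1;
    }

    int
    main(void)
    {
        const char *left = "Anderson, Jane";    /* last key on the left leaf page */
        const char *right = "Baker, Tom";       /* first key on the right leaf page */
        int         len = truncated_separator_len(left, right);

        printf("separator: \"%.*s\" (%d byte(s) instead of %zu)\n",
               len, right, len, strlen(right));
        /* prints: separator: "B" (1 byte(s) instead of 10) */
        return 0;
    }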