Re: Processing btree walks as a batch to parallelize IO - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Processing btree walks as a batch to parallelize IO
Date
Msg-id 3507221b-f664-17ab-ddc3-6113f2032c51@enterprisedb.com
In response to Processing btree walks as a batch to parallelize IO  (James Coleman <jtc331@gmail.com>)
Responses Re: Processing btree walks as a batch to parallelize IO  (James Coleman <jtc331@gmail.com>)
Re: Processing btree walks as a batch to parallelize IO  (Greg Stark <stark@mit.edu>)
List pgsql-hackers

On 4/9/21 7:33 PM, James Coleman wrote:
> $SUBJECT is still a very loosely formed idea, so forgive lack of detail
> or things I've likely missed, but I wanted to get it out there to see if
> it sounded at all intriguing to people. 
> 
> Background: One of the big problems with non-local storage such as AWS
> EBS volumes or a SAN is that when a large database (really, the working
> set, where the working set includes reads) exceeds the size of the
> buffer cache (and page cache), the cost of random page reads hitting
> the underlying disk system dominates. This is because networked disks
> have an order of magnitude higher latency than a bunch of RAIDed SSDs
> (even more so with NVMe storage). In some of our experiments on Aurora
> I've seen a 10x difference versus pretty good physical hardware, and
> I'd assume RDS (since it's EBS-backed) is similar.
> 
> A specific area where this is particularly painful is btree index reads.
> Walking the tree to leaf pages isn't naturally prefetchable, and so for
> each level you pay the random page cost. Of course higher levels in the
> tree will almost certainly exhibit emergent behavior such that they
> (just by fact of the LRU caching) will be in the buffer cache, but for a
> large index lower levels likely won't be. 
> 

What do you consider a large index here, in terms of levels?

Consider a 1TB table, with just a single UUID column - that's ~25B rows,
give or take. Real tables will have more columns, so this seems like a
reasonable model of the largest number of rows per relation. With ~32B
per index tuple, that's about 100M leaf pages, and with ~256 branches
per internal page, that's still only ~5 levels. I think it's quite rare
to see indexes with more than 6 or 7 levels.

And the internal pages are maybe 0.5% of the whole index (so ~4GB out of
750GB). I think the usual expectation is that most of that will fit into
RAM, but of course there may be more indexes competing for that.
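
Just to make the arithmetic explicit, here's a rough back-of-the-envelope
check of those numbers (the ~32B index tuples and fanout of ~256 are the
assumptions from above):

#include <stdio.h>

int
main(void)
{
	double		rows = 25e9;				/* ~25B rows in the 1TB table */
	double		tuples_per_leaf = 8192 / 32;	/* ~32B per index tuple */
	double		fanout = 256;				/* branches per internal page */
	double		leaf_pages = rows / tuples_per_leaf;	/* ~100M */
	double		pages = leaf_pages;
	int			levels = 1;					/* the leaf level itself */

	while (pages > 1)
	{
		pages /= fanout;					/* one more internal level */
		levels++;
	}

	printf("leaf pages ~%.0fM, levels %d, internal pages ~%.1f%%\n",
		   leaf_pages / 1e6, levels, 100.0 / fanout);

	return 0;
}

which prints ~98M leaf pages, 5 levels, and internal pages at ~0.4% of
the index.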

I think the number of index levels is not really the crucial bit - it's
more about the total amount of index data in the DB.

> If we squint a bit, insertions look a whole lot like reads as well since
> we have to walk the tree to find the leaf insertion page for a new
> tuple. This is particularly true for indexes where the inserted values
> are roughly randomly distributed, like a uuid.
> 

Yep. We need to walk the index down to the leaf pages in both cases,
reads and inserts.

> The read-for-lookups problem is harder to solve, but the cost as it
> relates to table inserts is possibly more tractable. Tables typically
> have more than one index to update, so the obvious approach is "let's
> just parallelize the index insertions". Of course we know that's
> difficult given the multi-process approach Postgres uses for parallelism. 
> 

Hmm. Not sure if reads are harder to deal with, but I think you're
right that the two cases (reads and writes) may look similar at the
level of a single index, yet need rather different approaches, exactly
because inserts have to deal with all indexes, while reads only really
deal with a single index.

FWIW I think there are a couple of options for improving reads, at
least in some cases.

1) I wonder if e.g. _bt_readnextpage could prefetch at least one page
ahead. We can't look further ahead than that, but perhaps even this
would help - see the sketch after (2).

2) In some cases (e.g. nested loop with an inner index scan) we could
collect an array of values and then look them up at once, which should
allow us to do at least some of the I/O in parallel, I think. That's
similar to what you propose for writes, except that it works against the
same index.
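
For (1), here's a minimal sketch of what I mean - it just prefetches the
right sibling (via the btpo_next link) before we process the current
page. Not actual code, just to illustrate the idea:

/*
 * Sketch only (needs access/nbtree.h and storage/bufmgr.h): prefetch
 * the right sibling of the current page, so the read of the next page
 * overlaps with processing this one. A real patch would also need to
 * handle backward scans (btpo_prev) etc.
 */
static void
_bt_prefetch_right_sibling(Relation rel, Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/* btpo_next is the right-sibling link, P_NONE on the rightmost page */
	if (opaque->btpo_next != P_NONE)
		PrefetchBuffer(rel, MAIN_FORKNUM, opaque->btpo_next);
}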


> Another approach that at first glance seems like it fits better into
> Postgres (I'm not claiming it's easy or a small patch) would be to
> process a batch of indexes at once. For example, if the index access
> methods were extended to allow being given a list of indexes that need
> to be walked, then the btree code could process each layer in the walk
> as a group -- issuing IO fetches for all of the first level blocks in
> the tree, and then computing all of the next level blocks needed and
> issuing those IO requests all at once, and so on.
> 

Yeah, I agree having a way to say "prefetch all pages needed to insert
these keys into these indexes" might be better than just parallelizing
it in a "naive" way.

Not sure how complex that would be - I think the API would need to allow
traversing the index with each step split into two phases:

1) determine the page needed for the next step, return it to caller

2) the caller collects pages from all indexes, initiates prefetch

3) instruct indexes to actually do the next step, stop if it's a leaf
page (otherwise go to (1))

And then we might just do index inserts in a serial way, just like we do
today, hoping to hit the prefetched pages.
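
Something like this, maybe (a very rough sketch - IndexDescentState and
index_descend_step() are made up, and the per-index step would have to
go through the AM API):

/*
 * Rough sketch of the batched descent - all names are made up. Each
 * index descends one level per iteration, and the blocks needed for
 * the next level are prefetched for all indexes before any of them is
 * actually read.
 */
typedef struct IndexDescentState
{
	Relation	indexRel;		/* index being descended */
	BlockNumber	nextBlock;		/* block needed for the next step */
	bool		reachedLeaf;	/* true once we hit the leaf level */
} IndexDescentState;

static void
batch_descend(IndexDescentState *state, int nindexes)
{
	bool		done = false;

	while (!done)
	{
		/* (1) + (2): collect the next-level blocks, prefetch them all */
		for (int i = 0; i < nindexes; i++)
		{
			if (!state[i].reachedLeaf)
				PrefetchBuffer(state[i].indexRel, MAIN_FORKNUM,
							   state[i].nextBlock);
		}

		/* (3): do the actual step in each index */
		done = true;
		for (int i = 0; i < nindexes; i++)
		{
			if (state[i].reachedLeaf)
				continue;

			/*
			 * index_descend_step() is made up - it would read nextBlock
			 * (hopefully already in shared buffers thanks to the
			 * prefetch), pick the downlink for the key, and either set
			 * nextBlock for the next round or set reachedLeaf.
			 */
			index_descend_step(&state[i]);

			if (!state[i].reachedLeaf)
				done = false;
		}
	}
}

The per-index step would be the only new piece the AM has to expose; the
batching/prefetching loop itself could live in the generic index insert
code.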


FWIW while this probably helps saturate the I/O, it unfortunately does
nothing to reduce the write amplification - we still need to modify the
same number of leaf pages in all indexes, produce the same amount of WAL
etc. I think there were some proposals to add small internal buffers,
and instead of pushing the inserts all the way down to the leaf page,
just add them to the internal buffer. And when the buffer gets full,
propagate the contents to the next level of buffers.

For example, each internal page might have one "buffer" page, so the
index size would not really change (the number of internal pages would
double, but that's still just ~1% of the total index size). Of course,
this makes lookups more complex/expensive, because we need to check the
internal buffers too. But it does reduce the write amplification,
because it combines changes to leaf pages.
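
A sketch of what the insert path might look like with such buffers
(again, the struct and helper functions are made up):

/*
 * Sketch only - buffer_page_has_room(), buffer_flush_to_children() and
 * buffer_page_add() don't exist. The point is that an insert only
 * touches the buffer page attached to the internal page, and the child
 * pages get modified in batches when the buffer is flushed, which is
 * where the reduction in write amplification comes from.
 */
typedef struct BufferedInternalPage
{
	Buffer		internalBuf;	/* the regular internal page */
	Buffer		insertBuf;		/* its associated "buffer" page */
} BufferedInternalPage;

static void
buffered_index_insert(BufferedInternalPage *node, IndexTuple itup)
{
	if (!buffer_page_has_room(node->insertBuf, itup))
		buffer_flush_to_children(node);	/* push tuples one level down */

	buffer_page_add(node->insertBuf, itup);
}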

> In some workloads we've been testing I believe such an approach could
> plausibly improve table insert (and update) performance by multiple
> hundreds of percent. 
> 
> I don't have any code at the moment to show here, but I wanted to get
> the idea out there to see if there were any immediate reactions or other
> thoughts on the topic.
> 
> Thoughts?
> 

I think you're right indexes may be a serious bottleneck in some cases,
so exploring ways to improve that seems useful. Ultimately I think we
should be looking for ways to reduce the amount of work we need to do,
but parallelizing it (i.e. doing the same amount of work but in multiple
processes) is a valid approach too.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


