Thread: Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

Hi,

On 2024-08-26 18:06:04 +0200, Tomas Vondra wrote:
> I'm getting back to work on the index prefetching patch [1], but one
> annoying aspect of that patch is that it's limited to the context of a
> single executor node. It can be very effective when there's an index
> scan with many matches for a key, but it's entirely useless for plans
> with many tiny index scans.

Right.



> The patch does this:
> --------------------
>
> 1) ExecPrefetch executor callback
>
> This call is meant to do the actual prefetching - the parent node sets
> everything up almost as if for ExecProcNode(), but does not expect the
> actual result. The child either does some prefetching or nothing.
>
> 2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()
>
> This simply says whether a given node supports prefetching. The only
> place calling this is the nested loop, to enable prefetching only for
> nested loops with (parameterized) index scans.
>
> 3) ExecPrefetchIndexScan doing prefetching in index scans
>
> This is just a trivial IndexNext() variant, getting TIDs and calling
> PrefetchBuffer() on them. Right now it just prefetches everything, but
> that seems wrong - this is where the original index prefetching patch
> should kick in.
>
> 4) ExecNestLoop changes
>
> This is where the actual magic happens - if the inner child knows how to
> prefetch stuff (per ExecSupportsPrefetch), this switches to reading
> batches of outer slots, and calls ExecPrefetch() on them. Right now the
> batch size is hardcoded to 32, but it might use effective_io_concurrency
> or something like that. It's a bit naive in other aspects too - it
> always reads and prefetches the whole batch at once, instead of ramping
> up and then consuming and prefetching slots one by one. Good enough for
> a PoC, but probably needs improvements.
>
> 5) adds enable_nestloop_prefetch to enable/disable this easily

Hm. Doing this via more executor tree traversals doesn't seem optimal; that's
not exactly free. And because the prefetching can be beneficial even if there
are nodes above the inner parameterized index node, we IMO would want to
iterate through multiple node levels.

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter?  That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree.  And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.
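
Very roughly, I'm imagining something like the sketch below on the
nodeNestloop.c side. This is just to illustrate the shape of it - the
batch size, the int8 key type, and names like joinkeyattno are all
invented:

    /* gather a batch of outer join keys (NULLs ignored for brevity) */
    #define BATCH_SIZE 32

    ParamExecData *prm;
    Datum          keys[BATCH_SIZE];
    bool           isnull;
    int            nkeys = 0;

    while (nkeys < BATCH_SIZE)
    {
        TupleTableSlot *outerslot = ExecProcNode(outerPlanState(node));

        if (TupIsNull(outerslot))
            break;
        keys[nkeys++] = slot_getattr(outerslot, joinkeyattno, &isnull);
    }

    /* pass the whole batch down via one PARAM_EXEC slot, as int8[] */
    prm = &(econtext->ecxt_param_exec_vals[paramno]);
    prm->value = PointerGetDatum(construct_array(keys, nkeys, INT8OID,
                                                 8, FLOAT8PASSBYVAL,
                                                 TYPALIGN_DOUBLE));
    prm->isnull = false;
    innerPlanState(node)->chgParam =
        bms_add_member(innerPlanState(node)->chgParam, paramno);

The index scan side would then rebuild its scan keys from the array,
much like it does for a ScalarArrayOpExpr qual.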



> benchmark
> ---------
> 
> Of course, the main promise of this is faster queries, so I did a simple
> benchmark, with a query like this:
> 
>   SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
>           WHERE f.r < 0.0001;
> 
> The "r" is simply a random value, allowing to select arbitrary fraction
> of the large fact table "f". Here it selects 0.01%, so ~10k rows from
> 100M table. Dimensions have 10M rows. See the .sql file attached.
> 
> For a variable number of dimensions (1 to 4) the results look like this:
> 
>   prefetch       1       2       3       4
>   ----------------------------------------
>        off    3260    6193    8993   11980
>         on    2145    3897    5531    7352
>   ----------------------------------------
>                66%     63%     62%     61%
> 
> This is on "cold" data, with a restart + drop caches between runs. The
> results suggest the prefetching makes it about twice as fast. I was
> hoping for more, but not bad for a Poc, chances are it can be improved.

I think that's indeed a pretty nice win.



> One minor detail is that I ran into some issues with tuple slots. I need
> a bunch of them to stash the slots received from the outer plan, so I
> created a couple of slots with TTSOpsVirtual. And that mostly works, except
> that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
> fails because tts_virtual_getsomeattrs() says:
>
>   elog(ERROR, "getsomeattrs is not required to be called on a virtual
>                tuple table slot");
>
> OK, that call is not necessary for virtual slots, it's a no-op. But it's
> not me calling that, the interpreter does that for some reason. I did
> comment that error out in the patch, but I wonder what's the proper way
> to make this work ...

Hm, that's odd. slot_getsomeattrs() only calls tts_virtual_getsomeattrs() if
slot->tts_nvalid is smaller than the requested attnum. Which should never be
the case for a virtual slot. So I suspect there might be some bug leading to
wrong contents being stored in slots?
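
For reference, the fast path in question is just this (paraphrasing
src/include/executor/tuptable.h):

    static inline void
    slot_getsomeattrs(TupleTableSlot *slot, int attnum)
    {
        /* only reaches the tts_ops callback when attrs are missing */
        if (slot->tts_nvalid < attnum)
            slot_getsomeattrs_int(slot, attnum);
    }

so a fully-populated virtual slot should never reach the callback that
raises that error.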


Greetings,

Andres Freund



On Tue, Aug 27, 2024 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> Have you considered instead expanding the parameterized scan logic? Right now
> nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
> that to allow nodes, e.g. nestloop in this case, to pass down multiple values
> in one parameter?  That'd e.g. allow passing down multiple rows to fetch from
> nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
> state tree.

This sounds a bit like block nested loop join.

> And it might be more powerful than just doing prefetching -
> e.g. we could use one ScalarArrayOps scan in the index instead of doing a
> separate scan for each of the to-be-prefetched values.

ScalarArrayOps within nbtree are virtually the same thing as regular
index scans these days. That could make a big difference (perhaps this
is obvious).

One reason to do it this way is that it cuts down on index descent
costs and other executor overheads. But it is likely to make
prefetching itself more effective, too -- just because prefetching
will naturally end up with fewer, larger batches of logically related
work.

--
Peter Geoghegan



On 8/27/24 20:40, Andres Freund wrote:
> Hi,
> 
> On 2024-08-26 18:06:04 +0200, Tomas Vondra wrote:
>> I'm getting back to work on the index prefetching patch [1], but one
>> annoying aspect of that patch is that it's limited to the context of a
>> single executor node. It can be very effective when there's an index
>> scan with many matches for a key, but it's entirely useless for plans
>> with many tiny index scans.
> 
> Right.
> 
> 
> 
>> The patch does this:
>> --------------------
>>
>> 1) ExecPrefetch executor callback
>>
>> This call is meant to do the actual prefetching - the parent node sets
>> everything up almost as if for ExecProcNode(), but does not expect the
>> actual result. The child either does some prefetching or nothing.
>>
>> 2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()
>>
>> This simply says whether a given node supports prefetching. The only
>> place calling this is the nested loop, to enable prefetching only for
>> nested loops with (parameterized) index scans.
>>
>> 3) ExecPrefetchIndexScan doing prefetching in index scans
>>
>> This is just a trivial IndexNext() variant, getting TIDs and calling
>> PrefetchBuffer() on them. Right now it just prefetches everything, but
>> that seems wrong - this is where the original index prefetching patch
>> should kick in.
>>
>> 4) ExecNestLoop changes
>>
>> This is where the actual magic happens - if the inner child knows how to
>> prefetch stuff (per ExecSupportsPrefetch), this switches to reading
>> batches of outer slots, and calls ExecPrefetch() on them. Right now the
>> batch size is hardcoded to 32, but it might use effective_io_concurrency
>> or something like that. It's a bit naive in other aspects too - it
>> always reads and prefetches the whole batch at once, instead of ramping
>> up and then consuming and prefetching slots one by one. Good enough for
>> a PoC, but probably needs improvements.
>>
>> 5) adds enable_nestloop_prefetch to enable/disable this easily
> 
> Hm. Doing this via more executor tree traversals doesn't seem optimal; that's
> not exactly free.

Good point. I was wondering what the cost of the executor call might be
too, so I did a test with cached data (the results I presented in the
first message were with a restart + page cache drop before each query,
i.e. "best case" for prefetching).

If I run each query twice - uncached and cached - I get this:

              | prefetch=off      | prefetch=on       | on / off
  dimensions  | cache    nocache  | cache    nocache  | cache   nocache
  ---------------------------------------------------------------------
           1  |    61       3314  |    74       2172  |  121%       66%
           2  |   100       6327  |   129       3900  |  129%       62%
           3  |   137       9374  |   177       5637  |  129%       60%
           4  |   169      12211  |   225       7319  |  133%       60%

The last two columns are (prefetch=on)/(prefetch=off). This shows that
for uncached data we get a ~40% speedup, while for cached data there's
a ~30% regression. That's not great, but where does the regression
come from?

Per flamegraphs, the vast majority of that is due to doing btgettuple
twice: ExecPrefetchIndexScan simply does index_getnext_tid() too, just
like IndexNext().

If I remove that, leaving ExecPrefetchIndexScan() empty, the difference
all but disappears - what remains is ~1%, maybe. So at least in this
case the overhead of the tree traversal itself is negligible. I'm
actually surprised that copying slots and building the parameters twice
does not cause a regression, but that's what I see.
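
For context, the callback is essentially just this (a simplified
sketch, with error handling and other details elided):

    static void
    ExecPrefetchIndexScan(IndexScanState *node)
    {
        IndexScanDesc scandesc = node->iss_ScanDesc;
        ItemPointer   tid;

        /* walk the index exactly like IndexNext() would ... */
        while ((tid = index_getnext_tid(scandesc,
                                        ForwardScanDirection)) != NULL)
        {
            /* ... but only hint the buffer manager about the heap page */
            PrefetchBuffer(scandesc->heapRelation, MAIN_FORKNUM,
                           ItemPointerGetBlockNumber(tid));
        }
    }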

> And because the prefetching can be beneficial even if there
> are nodes above the inner parameterized index node, we IMO would want
> to iterate through multiple node levels.

I don't think we'd actually want that. It makes it very hard to
determine how far ahead to prefetch, because how would you know what
the child nodes are doing? I think it'd be easy to end up prefetching
way too much data. It's also a problem because the subnodes would
likely need to do sync I/O to do *their* prefetching.

I mean, what if the inner path has another nestloop? Surely that needs
to get the outer tuple first. If we get those tuples, should we
prefetch the matching inner tuples too? Wouldn't that mean we could be
prefetching an exponential number of tuples?

I honestly don't know - maybe there are cases where this makes sense,
but I'm not sure why that would be "incompatible" with something like
ExecutorPrefetch().

In any case, I think it'd be fine to have prefetching at least for
simple cases, where we know it can help. It wasn't my ambition to make
the whole executor somehow asynchronous ;-)


> Have you considered instead expanding the parameterized scan logic? Right now
> nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
> that to allow nodes, e.g. nestloop in this case, to pass down multiple values
> in one parameter?  That'd e.g. allow passing down multiple rows to fetch from
> nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
> state tree.  And it might be more powerful than just doing prefetching -
> e.g. we could use one ScalarArrayOps scan in the index instead of doing a
> separate scan for each of the to-be-prefetched values.
> 

I have not, but it seems "batching" the prefetches in some way might be
the way to go. I'm not sure it'll be much more efficient, though - and
not just for btree: what about the other index AMs?

But then that really starts to look like BNL - why would we even batch
prefetches and then do the rest row-by-row? We could just as well pass
down the batch to the index scan, and let it handle the prefetches.

That'd be much less about prefetching and more about allowing batching
for some nodes.

> 
> 
>> benchmark
>> ---------
>>
>> Of course, the main promise of this is faster queries, so I did a simple
>> benchmark, with a query like this:
>>
>>   SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
>>           WHERE f.r < 0.0001;
>>
>> The "r" is simply a random value, allowing to select arbitrary fraction
>> of the large fact table "f". Here it selects 0.01%, so ~10k rows from
>> 100M table. Dimensions have 10M rows. See the .sql file attached.
>>
>> For a variable number of dimensions (1 to 4) the results look like this:
>>
>>   prefetch       1       2       3       4
>>   ----------------------------------------
>>        off    3260    6193    8993   11980
>>         on    2145    3897    5531    7352
>>   ----------------------------------------
>>                66%     63%     62%     61%
>>
>> This is on "cold" data, with a restart + drop caches between runs. The
>> results suggest the prefetching makes it about twice as fast. I was
>> hoping for more, but not bad for a Poc, chances are it can be improved.
> 
> I think that's indeed a pretty nice win.
> 

Yeah. It's a bit of a "best case" workload, though.

> 
> 
>> One minor detail is that I ran into some issues with tuple slots. I need
>> a bunch of them to stash the slots received from the outer plan, so I
>> created a couple of slots with TTSOpsVirtual. And that mostly works, except
>> that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
>> fails because tts_virtual_getsomeattrs() says:
>>
>>   elog(ERROR, "getsomeattrs is not required to be called on a virtual
>>                tuple table slot");
>>
>> OK, that call is not necessary for virtual slots, it's a no-op. But it's
>> not me calling that, the interpreter does that for some reason. I did
>> comment that error out in the patch, but I wonder what's the proper way
>> to make this work ...
> 
> Hm, that's odd. slot_getsomeattrs() only calls tts_virtual_getsomeattrs() if
> slot->tts_nvalid is smaller than the requested attnum. Which should never be
> the case for a virtual slot. So I suspect there might be some bug leading to
> wrong contents being stored in slots?
> 

Hmmm, it seems I can no longer reproduce this :-( Chances are it was
happening because of some other bug that I fixed, and didn't realize it
was causing this too. Sorry :-/


regards

-- 
Tomas Vondra




On 8/27/24 20:53, Peter Geoghegan wrote:
> On Tue, Aug 27, 2024 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
>> Have you considered instead expanding the parameterized scan logic? Right now
>> nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
>> that to allow nodes, e.g. nestloop in this case, to pass down multiple values
>> in one parameter?  That'd e.g. allow passing down multiple rows to fetch from
>> nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
>> state tree.
> 
> This sounds a bit like block nested loop join.
> 

Yeah.

>> And it might be more powerful than just doing prefetching -
>> e.g. we could use one ScalarArrayOps scan in the index instead of doing a
>> separate scan for each of the to-be-prefetched values.
> 
> ScalarArrayOps within nbtree are virtually the same thing as regular
> index scans these days. That could make a big difference (perhaps this
> is obvious).
> 
> One reason to do it this way is that it cuts down on index descent
> costs and other executor overheads. But it is likely to make
> prefetching itself more effective, too -- just because prefetching
> will naturally end up with fewer, larger batches of logically related
> work.
> 

Perhaps. So nestloop would pass down multiple values, the inner subplan
would do whatever it wants (including prefetching), and then return the
matching rows, somehow? It's not very clear to me how we would return
the tuples for many matches, but it seems to shift the prefetching
closer to the "normal" index prefetching discussed elsewhere.


-- 
Tomas Vondra



On Tue, Aug 27, 2024 at 6:44 PM Tomas Vondra <tomas@vondra.me> wrote:
> > One reason to do it this way is that it cuts down on index descent
> > costs and other executor overheads. But it is likely to make
> > prefetching itself more effective, too -- just because prefetching
> > will naturally end up with fewer, larger batches of logically
> > related work.
> >
>
> Perhaps.

I expect this to be particularly effective whenever there is naturally
occurring locality. I think that's fairly common. We'll sort the SAOP
array on the nbtree side, as we always do.

> So nestloop would pass down multiple values, the inner subplan
> would do whatever it wants (including prefetching), and then return the
> matching rows, somehow?

Right.

> It's not very clear to me how would we return
> the tuples for many matches, but it seems to shift the prefetching
> closer to the "normal" index prefetching discussed elsewhere.

It'll be necessary to keep track of which outer-side rows relate to
which inner-side array values (within a given batch/block). Some new
data structure will be needed to manage that bookkeeping.

Currently, we deduplicate arrays for SAOP scans. I suppose that it
works that way because it's not really clear what it would mean for
the scan to have duplicate array keys. I don't see any need to change
that for block nested loop join/whatever this is. We would have to use
the new data structure to "pair up" outer-side tuples with their
associated inner-side result sets, at the end of processing each
batch/block. That way we avoid repeating the same inner index scan
within a given block/batch -- a little like with a memoize node.
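
Roughly the shape I have in mind (a sketch only - the struct and field
names are invented):

    /*
     * One entry per deduplicated array key within a batch/block. It
     * remembers every outer-side row that supplied the key, plus the
     * inner-side matches once the single SAOP-style scan has run, so
     * outer rows can be paired back up with their result sets at the
     * end of the batch.
     */
    typedef struct BatchKeyEntry
    {
        Datum            key;        /* deduplicated inner index key */
        List            *outerRows;  /* outer tuples that supplied it */
        Tuplestorestate *matches;    /* inner tuples found for the key */
    } BatchKeyEntry;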

Obviously, that's the case where we can exploit naturally occurring
locality most effectively -- the case where multiple duplicate inner
index scans are literally combined into only one. But, as I already
touched on, locality will be important in a variety of cases, not just
this one.

--
Peter Geoghegan