Re: Hybrid Hash/Nested Loop joins and caching results from subplans - Mailing list pgsql-hackers

From David Rowley
Subject Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date
Msg-id CAApHDvrX9o35_WUoL5c5arJ0XbJFN-cDHckjL57-PR-Keeypdw@mail.gmail.com
In response to Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
On Wed, 26 Aug 2020 at 05:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> On Tue, Aug 25, 2020 at 11:53 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2020-08-25 20:48:37 +1200, David Rowley wrote:
>> > Also, just in case anyone is misunderstanding this Andres' argument.
>> > It's entirely based on the performance impact of having an additional
>> > node.
>>
>> Not entirely, no. It's also just that it doesn't make sense to have two
>> nodes setting parameters that then half magically picked up by a special
>> subsidiary node type and used as a cache key. This is pseudo modularity,
>> not real modularity. And makes it harder to display useful information
>> in explain etc. And makes it harder to e.g. clear the cache in cases we
>> know that there's no further use of the current cache. At least without
>> piercing the abstraction veil.
>>
>>
>> > However, given the correct planner choice, there will never be
>> > a gross slowdown due to having the extra node.
>>
>> There'll be a significant reduction in increase in performance.
>
>
> If this is a key blocking factor for this topic, I'd like to do a simple hack
> to put the cache function into the subplan node, then do some tests to
> show the real difference.  But it is better to decide how much difference
> can be thought of as a big difference.  And  for education purposes,
> I'd like to understand where these differences come from.  For my
> current knowledge,  my basic idea is it saves some function calls?

If testing this, the cache hit ratio will be pretty key to the
results. You'd notice the overhead much less with a larger cache hit
ratio, since you're not pulling the tuple from as deeply nested a
node. I'm unsure how you'd determine what a good cache hit ratio to
test with is. The lower the expected cache hit ratio, the higher the
cost of the Result Cache node will be, so the planner has less chance
of choosing to use it. Maybe some experiments will turn up a case
where the planner picks a Result Cache plan with a low hit ratio,
which could then be tested.

Say you find a case with the hit ratio of 90%.  Going by [1] I found
pulling a tuple through an additional node to cost about 12
nanoseconds on an intel 4712HQ CPU.  With a hit ratio of 90% we'll
only pull 10% of tuples through the additional node, so that's about
1.2 nanoseconds per tuple, or 1.2 milliseconds per million tuples. It
might become hard to measure above the noise. More costly inner scans
will have the planner choose Result Cache with lower estimated hit
ratios, but in that case, pulling the tuple through the additional
node during a cache miss will be less noticeable due to the more
costly inner side of the join.
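To make the arithmetic above concrete, here's a small sketch (the 12
nanoseconds per tuple is the figure from [1], and the 90% hit ratio is
just the assumed example value):

```python
# Estimate the per-tuple overhead of the extra Result Cache node.
# Assumptions: ~12 ns to pull a tuple through one extra node (from [1]),
# and only cache misses pull the tuple through the deeper inner node.
ns_per_node = 12          # nanoseconds per tuple per additional node
hit_ratio = 0.90          # fraction of lookups answered from the cache

miss_fraction = 1.0 - hit_ratio
overhead_ns_per_tuple = ns_per_node * miss_fraction   # ~1.2 ns/tuple

# Scale up to one million tuples: ns -> ms is a divide by 1,000,000.
total_ms_per_million = overhead_ns_per_tuple * 1_000_000 / 1_000_000

print(round(overhead_ns_per_tuple, 3))   # ~1.2 ns per tuple
print(round(total_ms_per_million, 3))    # ~1.2 ms per million tuples
```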

You could likely measure just the overhead without going to the
trouble of adapting the code to make SubPlan and Nested Loop do the
caching internally.  If you just modify ExecResultCache() to have it
simply return its subnode, then measure the performance with and
without enable_resultcache, you should get an idea of the per-tuple
overhead of pulling the tuple through the additional node on your CPU.
After you know that number, you could put the code back to what the
patches have and then experiment with a number of cases to find a case
that chooses Result Cache and gets a low hit ratio.
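As a rough analogy for that experiment (not PostgreSQL code, just a
Python sketch of the idea), you can measure the cost of pulling values
through one extra pass-through layer, which is essentially what a
no-op Result Cache node would be:

```python
import time

def scan(n):
    # stand-in for the inner scan producing tuples
    for i in range(n):
        yield i

def passthrough(node):
    # stand-in for a no-op Result Cache node: it just returns
    # its subnode's tuples unchanged
    for row in node:
        yield row

def consume(node):
    # stand-in for the parent node pulling every tuple
    count = 0
    for _ in node:
        count += 1
    return count

n = 1_000_000
t0 = time.perf_counter()
direct = consume(scan(n))                  # without the extra node
t1 = time.perf_counter()
wrapped = consume(passthrough(scan(n)))    # with the extra node
t2 = time.perf_counter()

# the difference approximates the per-tuple cost of the extra node
extra_ns_per_tuple = ((t2 - t1) - (t1 - t0)) * 1e9 / n
print(extra_ns_per_tuple)
```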


For example, from the plan I used in the initial email on this thread:

             ->  Index Only Scan using lookup_a_idx on lookup l
(actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms

I don't have the exact per-tuple overhead on the machine I ran that
on, but it's an AMD 3990x CPU, so I'll guess the overhead is about 8
nanoseconds per tuple, given I found it to be 12 nanoseconds on a 2014
CPU.  If that's right, then the overhead is something like 8 * 100
(rows) * 1000 (loops) = 800,000 nanoseconds = 0.8 milliseconds. If I
compare that to the execution time of the query, it's about 0.04%.
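The same back-of-envelope calculation, using the guessed 8 ns/tuple
figure and the row/loop counts from the EXPLAIN output above:

```python
# Back-of-envelope overhead estimate for the plan above.
# Assumption: ~8 ns per tuple pulled through the extra node on this CPU.
ns_per_tuple = 8
rows = 100
loops = 1000
execution_ms = 1876.741   # Execution Time from the EXPLAIN output

overhead_ms = ns_per_tuple * rows * loops / 1_000_000  # ns -> ms
pct = 100 * overhead_ms / execution_ms

print(overhead_ms)        # 0.8 ms
print(round(pct, 2))      # ~0.04 (%)
```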

I imagine we'll need to find something with a much worse hit ratio so
we can actually measure the overhead.

David

[1] https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com


