Re: Hybrid Hash/Nested Loop joins and caching results from subplans - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date
Msg-id CANP8+jLkh_g7Eu-=WoNP7P-Mi2Ye=Kt-5kceho80Xnd8xWrvfw@mail.gmail.com
Whole thread Raw
In response to Hybrid Hash/Nested Loop joins and caching results from subplans  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (David Rowley <dgrowleyml@gmail.com>)
Re: Hybrid Hash/Nested Loop joins and caching results from subplans  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
On Wed, 20 May 2020 at 12:44, David Rowley <dgrowleyml@gmail.com> wrote:
Hackers,

Over on [1], Heikki mentioned about the usefulness of caching results
from parameterized subplans so that they could be used again for
subsequent scans which have the same parameters as a previous scan.
On [2], I mentioned that parameterized nested loop joins could see
similar gains with such a cache. I suggested there that instead of
adding code that only allows this to work for subplans, that instead,
we add a new node type that can handle the caching for us.  We can
then just inject that node type in places where it seems beneficial.

Very cool
 
I've attached a patch which implements this.  The new node type is
called "Result Cache".  I'm not particularly wedded to keeping that
name, but if I change it, I only want to do it once. I've got a few
other names I mind, but I don't feel strongly or confident enough in
them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized
nodes that are rescanned with different parameters. The cache itself
uses a hash table using the simplehash.h implementation.  The memory
consumption is limited to work_mem. The code maintains an LRU list and
when we need to add new entries but don't have enough space to do so,
we free off older items starting at the top of the LRU list.  When we
get a cache hit, we move that entry to the end of the LRU list so that
it'll be the last to be evicted.

When should we cache:

For nested loop joins, the decision is made purely based on cost.

I thought the main reason to do this was the case when the nested loop subplan was significantly underestimated and we realize during execution that we should have built a hash table. So including this based on cost alone seems to miss a trick.
 
The patch does rely heavily on good ndistinct estimates. 

Exactly. We know we seldom get those with many-way joins.

So +1 for adding this technique. My question is whether it should be added as an optional facility of a parameterised sub plan, rather than an always-needed full-strength node. That way the choice of whether to use it can happen at execution time once we notice that we've been called too many times.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

pgsql-hackers by date:

Previous
From: Atsushi Torikoshi
Date:
Subject: Re: Is it useful to record whether plans are generic or custom?
Next
From: Vik Fearing
Date:
Subject: Re: SEARCH and CYCLE clauses