Re: Subplan result caching - Mailing list pgsql-hackers

From David Rowley
Subject Re: Subplan result caching
Date
Msg-id CAKJS1f-kAk1cGVvzg9TXCLhPsxx_oFVOrTGSR5yTRXKWntTVFA@mail.gmail.com
Whole thread Raw
In response to Subplan result caching  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Subplan result caching  (Robert Haas <robertmhaas@gmail.com>)
Re: Subplan result caching  (Michael Paquier <michael@paquier.xyz>)
Re: Subplan result caching  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
On 23 May 2018 at 21:31, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I've been working on a patch to add a little cache to SubPlans, to speed up
> queries with correlated subqueries, where the same subquery is currently
> executed multiple times with the same parameters. The idea is to cache the
> result of the subplan, with the correlation vars as the cache key.

Hi,

This seems like an interesting area to make improvements, so I've
signed up to review the patch.

From looking at the code I see that the caching is being done inside
nodeSubplan.c. I don't think this is the right approach to the
problem. The problem exists for any parameterized path, so I think a
more general approach would be much better.

We already have Materialize nodes to cache the results of an entire
subplan, and this seems to have quite a bit in common with that, only
we'd want to cache multiple results with a key to determine which
result set should be returned. Due to the similarities with
Materialize, I think that the cache should be a node itself and not
bury the cache logic in some other node type that's meant for some
other purpose.

"LazyMaterialize" seems like a good option for a name. It seems better
than "LazyHash" since you may not want to restrict it to a hash table
based cache in the future.  A binary search tree may be a good option
for types that cannot be hashed.

Materialize nodes are injected above the inner side node of MergeJoins
based on cost, so I think this node type could just do the same. Maybe
something like estimate_num_groups(<exprs being compared to params>) /
path->rows is below some defined constant, perhaps something like 0.5.
Although experimentation would be required. It might be good to take
into account some other cost factors too.

I imagine we'd want to only allow this optimisation for hashjoinable
types. This seems pretty natural since your cache implementation is a
hash table, so, of course, we're going to need a hash function.

Wondering your thoughts on this idea.

I'll mark as waiting on author in the meantime.

It's great to see someone working on this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: Let's remove DSM_IMPL_NONE.
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Re: Non-reserved replication slots and slot advancing