Hybrid Hash/Nested Loop joins and caching results from subplans - Mailing list pgsql-hackers
From: David Rowley
Subject: Hybrid Hash/Nested Loop joins and caching results from subplans
Msg-id: CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com
List: pgsql-hackers

Hackers,

Over on [1], Heikki mentioned the usefulness of caching results from parameterized subplans so that they could be used again for subsequent scans which have the same parameters as a previous scan. On [2], I mentioned that parameterized nested loop joins could see similar gains with such a cache. I suggested there that, instead of adding code that only allows this to work for subplans, we add a new node type that can handle the caching for us. We can then just inject that node type in places where it seems beneficial.

I've attached a patch which implements this. The new node type is called "Result Cache". I'm not particularly wedded to keeping that name, but if I change it, I only want to do it once. I've got a few other names in mind, but I don't feel strongly or confident enough in them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized nodes that are rescanned with different parameters. The cache itself is a hash table built on the simplehash.h implementation, with memory consumption limited to work_mem. The code maintains an LRU list, and when we need to add new entries but don't have enough space to do so, we free off older items starting at the top of the LRU list. When we get a cache hit, we move that entry to the end of the LRU list so that it'll be the last to be evicted.

When should we cache:

For nested loop joins, the decision is made purely based on cost. The costing model looks at the expected number of calls, the distinct value estimate and the work_mem size. It then determines how many items can be cached, and goes on to estimate an expected cache hit ratio and an eviction ratio. It adjusts the input costs according to those ratios and adds some additional charges for caching and cache lookups.

For subplans, since we plan subplans before we're done planning the outer plan, there's very little information to go on about the number of times the cache will be looked up. For now, I've coded things so the cache is always used for EXPR_SUBLINK type subplans. There may be other types of subplan that could support caching too, but I've not really gone through them all yet to determine which. I certainly know there are some that we can't cache for.

Why caching might be good:

With hash joins, it's sometimes not so great that we have to hash the entire inner plan when we only probe a very small number of values. If we were able to fill the hash table with only the values that are needed, then a lot of time and memory could be saved. Effectively, the patch does exactly this with the combination of a parameterized nested loop join and a Result Cache node above the inner scan. For subplans, the gains can be greater, since subplans are often much more expensive to execute than what might go on the inside of a parameterized nested loop join.

Current problems and some ways to make it better:

The patch does rely heavily on good ndistinct estimates. One unfortunate problem is that if the planner has no statistics for whatever it's trying to estimate, it'll default to returning DEFAULT_NUM_DISTINCT (200). That may cause the Result Cache to appear much more favourable than it should. One way I can think of to work around that would be to have another function, similar to estimate_num_groups(), which accepts a default value to return if it was unable to find statistics to use. In this case, such a function could just be called passing the number of input rows as the default, which would make the costing code think each value is unique, which would not be favourable for caching. I've not done anything like that in what I've attached here. That solution would also do nothing if the ndistinct estimate was available but was just incorrect, as it often is.
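
To show how sensitive the costing is to that default, here's a small standalone illustration. This is not the costing code from the patch; est_hit_ratio and the numbers are invented purely to demonstrate the effect:

    /* Illustrative sketch only; not the patch's cost model. */
    #include <stdio.h>

    /*
     * Guess the fraction of cache lookups that will be hits, given the
     * expected number of lookups, the ndistinct estimate for the cache
     * key and how many entries we expect to fit within work_mem.
     */
    static double
    est_hit_ratio(double calls, double ndistinct, double est_entries)
    {
        /* Every distinct value fits: only the first lookup of each misses. */
        if (ndistinct <= est_entries)
            return (calls - ndistinct) / calls;

        /* Otherwise only the cached fraction of values can ever hit. */
        return (calls - ndistinct) / calls * (est_entries / ndistinct);
    }

    int
    main(void)
    {
        /* 100,000 lookups, ndistinct fell back to DEFAULT_NUM_DISTINCT */
        printf("default ndistinct (200):  %.3f\n",
               est_hit_ratio(100000.0, 200.0, 10000.0));    /* 0.998 */

        /* the same lookups if the true ndistinct were 50,000 */
        printf("actual ndistinct (50000): %.3f\n",
               est_hit_ratio(100000.0, 50000.0, 10000.0));  /* 0.100 */

        return 0;
    }

With the 200 default, almost any large call count produces a near-perfect estimated hit ratio, regardless of what the true ndistinct is.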
There are currently a few compiler warnings with the patch due to the scope of the simplehash.h hash table. Because the scope is static rather than extern, there's a load of unused-function warnings. I'm not sure yet of the best way to deal with this; I don't want to change the scope to extern just to keep compilers quiet.

Also, during cache_reduce_memory(), I'm performing a hash table lookup followed by a hash table delete. I already have the entry to delete, but there's no simplehash.h function that allows deletion by element pointer, only by key, so this wastes a hash table lookup. I'll likely make an adjustment to the simplehash.h code to export the delete code as a separate function to fix this.
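
For anyone following along, here is roughly what that eviction loop looks like in concept. This is a sketch rather than the code in the patch: ResultCacheState, ResultCacheEntry, lru_node, mem_used, mem_limit and cache_remove_tuples() are stand-in names, dlist_* comes from lib/ilist.h, and resultcache_delete() stands for whatever delete-by-key function simplehash generates for the table's SH_PREFIX:

    static void
    cache_reduce_memory(ResultCacheState *rcstate, Size extra_needed)
    {
        /* Evict least-recently-used entries until the new tuples will fit. */
        while (rcstate->mem_used + extra_needed > rcstate->mem_limit &&
               !dlist_is_empty(&rcstate->lru_list))
        {
            /* The least recently used entry lives at the head of the list. */
            ResultCacheEntry *entry = dlist_head_element(ResultCacheEntry,
                                                         lru_node,
                                                         &rcstate->lru_list);

            /* Unlink it from the LRU list and release its cached tuples. */
            dlist_delete(&entry->lru_node);
            rcstate->mem_used -= cache_remove_tuples(rcstate, entry);

            /*
             * We already hold the entry, but simplehash only offers deletion
             * by key, so this costs the extra hash lookup mentioned above.
             */
            resultcache_delete(rcstate->hashtable, entry->key);
        }
    }

Exporting a delete-by-element function from simplehash.h would let that final call skip the redundant lookup.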
Demo:

# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                        QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4  Cache Evictions: 0  Cache Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056 rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)

# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056 rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)

-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x, generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;

# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451 rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000  Cache Evictions: 0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)

# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039 rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)

Cache Overflows:

You might have noticed "Cache Overflows" in the EXPLAIN ANALYZE output. An overflow happens if a single scan of the inner node exhausts the cache memory. In this case, all the other entries will already have been evicted in an attempt to make space for the current scan's tuples, and if we still see an overflow then the size of the results from that single scan alone must have exceeded work_mem. There might be some tweaking to do here, as it seems a shame that a single overly large scan would flush the entire cache. I doubt it would be too hard to limit the flushing to some percentage of work_mem, similar to how large seqscans don't entirely flush shared_buffers.

Current Status:

I've spent quite a bit of time getting this working, and I'd like to take a serious go at making this happen for PG14. For now, it all seems to work. I have some concerns about bad statistics causing nested loop joins to be favoured more than they were previously, due to the Result Cache further lowering their cost when the cache hit ratio is thought to be high.

For now, the node type is parallel_safe, but not parallel_aware. I can see that a parallel_aware version would be useful, but I've not done that here; anything in that area will not be part of my initial effort. The unfortunate part about that is that the actual hit ratio will drop with more parallel workers, since the caches of each worker are separate.

Some tests show a 10x speedup on TPC-H Q2.

I'm interested in getting feedback on this before doing much further work on it. Does it seem like something we might want for PG14?

David

[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com