Hybrid Hash/Nested Loop joins and caching results from subplans - Mailing list pgsql-hackers
From: David Rowley
Subject: Hybrid Hash/Nested Loop joins and caching results from subplans
Msg-id: CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com
List: pgsql-hackers

Hackers,

Over on [1], Heikki mentioned the usefulness of caching results from parameterized subplans so that they could be used again for subsequent scans which have the same parameters as a previous scan. On [2], I mentioned that parameterized nested loop joins could see similar gains with such a cache. I suggested there that, instead of adding code that only allows this to work for subplans, we add a new node type that can handle the caching for us. We can then just inject that node type in places where it seems beneficial.

I've attached a patch which implements this. The new node type is called "Result Cache". I'm not particularly wedded to keeping that name, but if I change it, I only want to do it once. I've got a few other names in mind, but I don't feel strongly or confident enough in them to go and do the renaming.

How the caching works:

First off, it's only good for plugging in on top of parameterized nodes that are rescanned with different parameters. The cache itself is a hash table built on the simplehash.h implementation, with memory consumption limited to work_mem. The code maintains an LRU list, and when we need to add new entries but don't have enough space to do so, we free off older items starting at the top of the LRU list. When we get a cache hit, we move that entry to the end of the LRU list so that it'll be the last to be evicted.

When should we cache:

For nested loop joins, the decision is made purely based on cost. The costing model looks at the expected number of calls, the distinct value estimate and the work_mem size. It then determines how many items can be cached, and goes on to estimate an expected cache hit ratio and an eviction ratio. It adjusts the input costs according to those ratios and adds some additional charges for caching and cache lookups.

For subplans, since we plan subplans before we're done planning the outer plan, there's very little information to go on about the number of times the cache will be looked up. For now, I've coded things so the cache is always used for EXPR_SUBLINK type subplans. There may be other types of subplan that could support caching too, but I've not really gone through them all yet to determine which. I certainly know there are some that we can't cache for.

Why caching might be good:

With hash joins, it's sometimes not so great that we have to hash the entire inner plan when we only probe a very small number of values. If we were able to fill the hash table with only the values that are needed, then a lot of time and memory could be saved. Effectively, the patch does exactly this with the combination of a parameterized nested loop join and a Result Cache node above the inner scan. For subplans, the gains can be greater, since subplans are often much more expensive to execute than what might go on the inside of a parameterized nested loop join.

Current problems and some ways to make it better:

The patch does rely heavily on good ndistinct estimates. One unfortunate problem is that if the planner has no statistics for whatever it's trying to estimate, it'll default to returning DEFAULT_NUM_DISTINCT (200). That may cause the Result Cache to appear much more favourable than it should. One way I can think of to work around that would be to have another function, similar to estimate_num_groups(), which accepts a default value to return if it was unable to find statistics to use. In this case, such a function could just be called passing the number of input rows as the default, which would make the costing code think each value is unique, which would not be favourable for caching. I've not done anything like that in what I've attached here. That solution would also do nothing if the ndistinct estimate was available but was just incorrect, as it often is.
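
To show how sensitive the costing is to that default, here's a small standalone illustration. This is not the costing code from the patch; est_hit_ratio and the numbers are invented purely to demonstrate the effect:

    /* Illustrative sketch only; not the patch's cost model. */
    #include <stdio.h>

    /*
     * Guess the fraction of cache lookups that will be hits, given the
     * expected number of lookups, the ndistinct estimate for the cache
     * key and how many entries we expect to fit within work_mem.
     */
    static double
    est_hit_ratio(double calls, double ndistinct, double est_entries)
    {
        /* Every distinct value fits: only the first lookup of each misses. */
        if (ndistinct <= est_entries)
            return (calls - ndistinct) / calls;

        /* Otherwise only the cached fraction of values can ever hit. */
        return (calls - ndistinct) / calls * (est_entries / ndistinct);
    }

    int
    main(void)
    {
        /* 100,000 lookups, ndistinct fell back to DEFAULT_NUM_DISTINCT */
        printf("default ndistinct (200):  %.3f\n",
               est_hit_ratio(100000.0, 200.0, 10000.0));    /* 0.998 */

        /* the same lookups if the true ndistinct were 50,000 */
        printf("actual ndistinct (50000): %.3f\n",
               est_hit_ratio(100000.0, 50000.0, 10000.0));  /* 0.100 */

        return 0;
    }

With the 200 default, almost any large call count produces a near-perfect estimated hit ratio, regardless of what the true ndistinct is.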
There are currently a few compiler warnings with the patch due to the scope of the simplehash.h hash table. Because the scope is static rather than extern, there's a load of unused-function warnings. I'm not sure yet of the best way to deal with this; I don't want to change the scope to extern just to keep compilers quiet.

Also, during cache_reduce_memory(), I'm performing a hash table lookup followed by a hash table delete. I already have the entry to delete, but there's no simplehash.h function that allows deletion by element pointer, only by key, so this wastes a hash table lookup. I'll likely make an adjustment to the simplehash.h code to export the delete code as a separate function to fix this.
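
For anyone following along, here is roughly what that eviction loop looks like in concept. This is a sketch rather than the code in the patch: ResultCacheState, ResultCacheEntry, lru_node, mem_used, mem_limit and cache_remove_tuples() are stand-in names, dlist_* comes from lib/ilist.h, and resultcache_delete() stands for whatever delete-by-key function simplehash generates for the table's SH_PREFIX:

    static void
    cache_reduce_memory(ResultCacheState *rcstate, Size extra_needed)
    {
        /* Evict least-recently-used entries until the new tuples will fit. */
        while (rcstate->mem_used + extra_needed > rcstate->mem_limit &&
               !dlist_is_empty(&rcstate->lru_list))
        {
            /* The least recently used entry lives at the head of the list. */
            ResultCacheEntry *entry = dlist_head_element(ResultCacheEntry,
                                                         lru_node,
                                                         &rcstate->lru_list);

            /* Unlink it from the LRU list and release its cached tuples. */
            dlist_delete(&entry->lru_node);
            rcstate->mem_used -= cache_remove_tuples(rcstate, entry);

            /*
             * We already hold the entry, but simplehash only offers deletion
             * by key, so this costs the extra hash lookup mentioned above.
             */
            resultcache_delete(rcstate->hashtable, entry->key);
        }
    }

Exporting a delete-by-element function from simplehash.h would let that final call skip the redundant lookup.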
Demo:

# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                        QUERY PLAN
----------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.069..0.470 rows=391 loops=1)
   SubPlan 1
     ->  Result Cache (actual time=0.001..0.001 rows=1 loops=391)
           Cache Key: c1.relkind
           Cache Hits: 387  Cache Misses: 4  Cache Evictions: 0  Cache Overflows: 0
           ->  Aggregate (actual time=0.062..0.062 rows=1 loops=4)
                 ->  Seq Scan on pg_class c2 (actual time=0.007..0.056 rows=98 loops=4)
                       Filter: (c1.relkind = relkind)
                       Rows Removed by Filter: 293
 Planning Time: 0.047 ms
 Execution Time: 0.536 ms
(11 rows)

# set enable_resultcache=0; -- disable result caching
SET
# explain (analyze, costs off) select relname,(select count(*) from pg_class c2 where c1.relkind = c2.relkind) from pg_class c1;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on pg_class c1 (actual time=0.070..24.619 rows=391 loops=1)
   SubPlan 1
     ->  Aggregate (actual time=0.062..0.062 rows=1 loops=391)
           ->  Seq Scan on pg_class c2 (actual time=0.009..0.056 rows=120 loops=391)
                 Filter: (c1.relkind = relkind)
                 Rows Removed by Filter: 271
 Planning Time: 0.042 ms
 Execution Time: 24.653 ms
(8 rows)

-- Demo with parameterized nested loops
create table hundredk (hundredk int, tenk int, thousand int, hundred int, ten int, one int);
insert into hundredk select x%100000,x%10000,x%1000,x%100,x%10,1 from generate_Series(1,100000) x;
create table lookup (a int);
insert into lookup select x from generate_Series(1,100000)x, generate_Series(1,100);
create index on lookup(a);
vacuum analyze lookup, hundredk;

# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=1876.710..1876.710 rows=1 loops=1)
   ->  Nested Loop (actual time=0.013..1371.690 rows=9990000 loops=1)
         ->  Seq Scan on hundredk hk (actual time=0.005..8.451 rows=100000 loops=1)
         ->  Result Cache (actual time=0.000..0.005 rows=100 loops=100000)
               Cache Key: hk.thousand
               Cache Hits: 99000  Cache Misses: 1000  Cache Evictions: 0  Cache Overflows: 0
               ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.002..0.011 rows=100 loops=1000)
                     Index Cond: (a = hk.thousand)
                     Heap Fetches: 0
 Planning Time: 0.113 ms
 Execution Time: 1876.741 ms
(11 rows)

# set enable_resultcache=0;
SET
# explain (analyze, costs off) select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2401.351..2401.352 rows=1 loops=1)
   ->  Merge Join (actual time=28.412..1890.905 rows=9990000 loops=1)
         Merge Cond: (l.a = hk.thousand)
         ->  Index Only Scan using lookup_a_idx on lookup l (actual time=0.005..10.170 rows=99901 loops=1)
               Heap Fetches: 0
         ->  Sort (actual time=28.388..576.783 rows=9990001 loops=1)
               Sort Key: hk.thousand
               Sort Method: quicksort  Memory: 7760kB
               ->  Seq Scan on hundredk hk (actual time=0.005..11.039 rows=100000 loops=1)
 Planning Time: 0.123 ms
 Execution Time: 2401.379 ms
(11 rows)

Cache Overflows:

You might have noticed "Cache Overflows" in the EXPLAIN ANALYZE output. An overflow happens if a single scan of the inner node exhausts the cache memory. In this case, all the other entries will already have been evicted in an attempt to make space for the current scan's tuples, and if we still see an overflow then the size of the results from that single scan alone must have exceeded work_mem. There might be some tweaking to do here, as it seems a shame that a single overly large scan would flush the entire cache. I doubt it would be too hard to limit the flushing to some percentage of work_mem, similar to how large seqscans don't entirely flush shared_buffers.

Current Status:

I've spent quite a bit of time getting this working, and I'd like to take a serious go at making this happen for PG14. For now, it all seems to work. I have some concerns about bad statistics causing nested loop joins to be favoured more than they were previously, due to the Result Cache further lowering their cost when the cache hit ratio is thought to be high.

For now, the node type is parallel_safe, but not parallel_aware. I can see that a parallel_aware version would be useful, but I've not done that here; anything in that area will not be part of my initial effort. The unfortunate part about that is that the actual hit ratio will drop with more parallel workers, since the caches of each worker are separate.

Some tests show a 10x speedup on TPC-H Q2.

I'm interested in getting feedback on this before doing much further work on it. Does it seem like something we might want for PG14?

David

[1] https://www.postgresql.org/message-id/daceb327-9a20-51f4-fe6c-60b898692305%40iki.fi
[2] https://www.postgresql.org/message-id/CAKJS1f8oNXQ-LqjK%3DBOFDmxLc_7s3uFr_g4qi7Ncrjig0JOCiA%40mail.gmail.com