Thread: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Lukas Fittl

Date:

05 March 2023, 00:20:59

Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the involved "bad" plan was including Memoize - though I don't necessarily think that Memoize is to blame (and this isn't any of the problems recently fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan, is that the Memoize output was a bit hard to reason about - especially since the plan involving Memoize was expensive to run, and so I was mostly running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

-> Nested Loop (cost=1.00..971672.56 rows=119623 width=0)
-> Index Only Scan using table1_idx on table1 (cost=0.43..372676.50 rows=23553966 width=8)
-> Memoize (cost=0.57..0.61 rows=1 width=8)
Cache Key: table1.table2_id
Cache Mode: logical

-> Index Scan using table2_idx on table2 (cost=0.56..0.60 rows=1 width=8)
Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose instead), had a total cost of 1,451,807.35 -- and so I was trying to figure out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total cost of14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot of the smaller costs. Thus, in this example, it appears Memoize made the plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize only being implicitly visible in the Nested Loop's total cost - and with nothing useful on the Memoize node itself - since the rescan costs are not shown.

I think explicitly adding the estimated cache hit ratio for Memoize nodes might make this easier to reason about, like this:

-> Memoize (cost=0.57..0.61 rows=1 width=8)
Cache Key: table1.table2_id
Cache Mode: logical

Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in cost_memoize_rescan - since that's the most significant contributor to the cache hit ratio (and you can influence that directly by improving the ndistinct statistics).

See attached a patch that implements showing the cache hit ratio as a discussion starter.

I'll park this in the July commitfest for now.

Thanks,

Lukas

Lukas Fittl

Attachment

v1-0001-Add-Estimated-Cache-Hit-Ratio-for-Memoize-plan-no.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

07 March 2023, 09:51:20

On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:
> Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in
cost_memoize_rescan- since that's the most significant contributor to the cache hit ratio (and you can influence that
directlyby improving the ndistinct statistics). 

I think the ndistinct estimate plus the est_entries together would be
useful. I think showing just the hit ratio number might often just
raise too many questions about how that's calculated. To calculate the
hit ratio we need to estimate the number of entries that can be kept
in the cache at once and also the number of input rows and the number
of distinct values.  We can see the input rows by looking at the outer
side of the join in EXPLAIN, but we've no idea about the ndistinct or
how many items the planner thought could be kept in the cache at once.

The plan node already has est_entries, so it should just be a matter
of storing the ndistinct estimate in the Path and putting it into the
Plan node so the executor has access to it during EXPLAIN.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Daniel Gustafsson

Date:

06 July 2023, 07:56:18

> On 7 Mar 2023, at 10:51, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:
>> Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in
cost_memoize_rescan- since that's the most significant contributor to the cache hit ratio (and you can influence that
directlyby improving the ndistinct statistics). 
>
> I think the ndistinct estimate plus the est_entries together would be
> useful. I think showing just the hit ratio number might often just
> raise too many questions about how that's calculated. To calculate the
> hit ratio we need to estimate the number of entries that can be kept
> in the cache at once and also the number of input rows and the number
> of distinct values.  We can see the input rows by looking at the outer
> side of the join in EXPLAIN, but we've no idea about the ndistinct or
> how many items the planner thought could be kept in the cache at once.
>
> The plan node already has est_entries, so it should just be a matter
> of storing the ndistinct estimate in the Path and putting it into the
> Plan node so the executor has access to it during EXPLAIN.

Lukas: do you have an updated patch for this commitfest to address David's
comments?

--
Daniel Gustafsson

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Lukas Fittl

Date:

06 July 2023, 08:27:26

On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson <daniel@yesql.se> wrote:

Lukas: do you have an updated patch for this commitfest to address David's
comments?

I have a draft - I should be able to post an updated patch in the next days. Thanks for checking!

Thanks,

Lukas

Lukas Fittl

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

20 March, 11:47:56

On 06.07.2023 11:27, Lukas Fittl wrote:

On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson <daniel@yesql.se> wrote:
Lukas: do you have an updated patch for this commitfest to address David's
comments?

I have a draft - I should be able to post an updated patch in the next days. Thanks for checking!

Thanks,
Lukas

--
Lukas Fittl

Hi hackers,

While debugging a query execution plan involving Memoize, it'd be nice to determine how many unique keys would fit into the cache. The est_entries value provides some insight, but without knowing ndistinct, it is unclear whether the cache is large enough to hold all unique keys or if some will be evicted.

Given its potential usefulness, I would like to work for this. I attached v2 patch with changes.

Example from memoize.sql

EXPLAIN SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.thousand
WHERE t2.unique1 < 1200;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
Aggregate (cost=815.12..815.13 rows=1 width=40)
   -> Nested Loop (cost=0.30..809.12 rows=1200 width=4)
         -> Seq Scan on tenk1 t2 (cost=0.00..470.00 rows=1200 width=4)
               Filter: (unique1 < 1200)
         -> Memoize (cost=0.30..0.41 rows=1 width=4)
               Cache Key: t2.thousand
               Cache Mode: logical
               Cache Estimated Entries: 655
               Cache Estimated NDistinct: 721
               -> Index Only Scan using tenk1_unique1 on tenk1 t1 (cost=0.29..0.40 rows=1 width=4)
                     Index Cond: (unique1 = t2.thousand)
(11 rows)

Additionally, since this information would only be shown in EXPLAIN when costs are enabled, it should not cause any performance regression in normal execution. However, reviewers should be especially careful when verifying test outputs, as this change could affect plan details in regression tests.

Any thoughts?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v2-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

20 March, 13:37:35

On Thu, 20 Mar 2025 at 21:48, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
>          ->  Memoize  (cost=0.30..0.41 rows=1 width=4)
>                Cache Key: t2.thousand
>                Cache Mode: logical
>                Cache Estimated Entries: 655
>                Cache Estimated NDistinct: 721
>                ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..0.40 rows=1 width=4)
>                      Index Cond: (unique1 = t2.thousand)
> (11 rows)
>
> Additionally, since this information would only be shown in EXPLAIN when costs are enabled, it should not cause any
performanceregression in normal execution. However, reviewers should be especially careful when verifying test outputs,
asthis change could affect plan details in regression tests. 
>
> Any thoughts?

> + ExplainIndentText(es);
> + appendStringInfo(es->str, "Cache Estimated Entries: %d\n", ((Memoize *) plan)->est_entries);
> + ExplainIndentText(es);
> + appendStringInfo(es->str, "Cache Estimated NDistinct: %0.0f\n", ((Memoize *) plan)->ndistinct);

est_entries is a uint32, so %u is the correct format character for that type.

I don't think you need to prefix all these properties with "Cache"
just because the other two properties have that prefix. I also don't
think the names you've chosen really reflect the meaning. How about
something like: "Estimated Distinct Lookup Keys: 721  Estimated
Capacity: 655", in that order. I think maybe having that as one line
for format=text is better than 2 lines. The EXPLAIN output is already
often taking up more lines in v18 than in v17, would be good to not
make that even worse unnecessarily.

I see the existing code there could use ExplainPropertyText rather
than have a special case for text and non-text formats. That's likely
my fault. If we're touching this code, then we should probably tidy
that up. Do you want to create a precursor fixup patch for that?

+ double ndistinct; /* Estimated number of distinct memoization keys,
+ * used for cache size evaluation. Kept for EXPLAIN */

Maybe this field in MemoizePath needs a better name. How about
"est_unique_keys"? and also do the rename in struct Memoize.

I'm also slightly concerned about making struct Memoize bigger. I had
issues with a performance regression [1] for 908a96861 when increasing
the WindowAgg struct size last year and the only way I found to make
it go away was to shuffle the fields around so that the struct size
didn't increase. I think we'll need to see a benchmark of a query that
hits Memoize quite hard with a small cache size to see if the
performance decreases as a result of adding the ndistinct field. It's
unfortunate that we'll not have the luxury of squeezing this double
into padding if we do see a slowdown.

David

[1] https://postgr.es/m/flat/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

20 March, 15:04:43

On 20.03.2025 13:37, David Rowley wrote:

est_entries is a uint32, so %u is the correct format character for that type.

I don't think you need to prefix all these properties with "Cache"
just because the other two properties have that prefix. I also don't
think the names you've chosen really reflect the meaning. How about
something like: "Estimated Distinct Lookup Keys: 721  Estimated
Capacity: 655", in that order. I think maybe having that as one line
for format=text is better than 2 lines. The EXPLAIN output is already
often taking up more lines in v18 than in v17, would be good to not
make that even worse unnecessarily.

LGTM

I see the existing code there could use ExplainPropertyText rather
than have a special case for text and non-text formats. That's likely
my fault. If we're touching this code, then we should probably tidy
that up. Do you want to create a precursor fixup patch for that?

I fully agree with this suggestion. Then I'll begin with this on another new thread.

+ double ndistinct; /* Estimated number of distinct memoization keys,
+ * used for cache size evaluation. Kept for EXPLAIN */

Maybe this field in MemoizePath needs a better name. How about
"est_unique_keys"? and also do the rename in struct Memoize.

LGTM

I'm also slightly concerned about making struct Memoize bigger. I had
issues with a performance regression [1] for 908a96861 when increasing
the WindowAgg struct size last year and the only way I found to make
it go away was to shuffle the fields around so that the struct size
didn't increase. I think we'll need to see a benchmark of a query that
hits Memoize quite hard with a small cache size to see if the
performance decreases as a result of adding the ndistinct field. It's
unfortunate that we'll not have the luxury of squeezing this double
into padding if we do see a slowdown.

I tried rearranging the fields in the structure with 'ndistinct' field, but unfortunately, sizeof(Memoize) did not remain the same. Therefore, benchmarking Memoize is definitely necessary. Thanks for the information!

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

20 March, 15:32:41

On 20/3/2025 11:37, David Rowley wrote:
> I'm also slightly concerned about making struct Memoize bigger. I had
> issues with a performance regression [1] for 908a96861 when increasing
> the WindowAgg struct size last year and the only way I found to make
> it go away was to shuffle the fields around so that the struct size
> didn't increase. I think we'll need to see a benchmark of a query that
> hits Memoize quite hard with a small cache size to see if the
> performance decreases as a result of adding the ndistinct field. It's
> unfortunate that we'll not have the luxury of squeezing this double
> into padding if we do see a slowdown.
I quite frequently need the number of distinct values (or groups) 
predicted during the Memoize node creation to understand why caching is 
sometimes employed or not.
But I had thought about an alternative way: having an extensible EXPLAIN 
(thanks to Robert), we may save optimisation-stage data (I have the same 
necessity in the case of IncrementalSort, for example) and put it into 
the Plan node on-demand. So, the way I want to go is a Plan::extlist 
node and create_plan hook, which may allow copying best_path data to the 
final plan. So, here, we will add a new parameter and avoid touching the 
core code.
But I would give +1 to current approach if it were done in a shorter time.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

20 March, 17:03:25

On 20.03.2025 15:32, Andrei Lepikhov wrote:
> I quite frequently need the number of distinct values (or groups) 
> predicted during the Memoize node creation to understand why caching 
> is sometimes employed or not.
> But I had thought about an alternative way: having an extensible 
> EXPLAIN (thanks to Robert), we may save optimisation-stage data (I 
> have the same necessity in the case of IncrementalSort, for example) 
> and put it into the Plan node on-demand. So, the way I want to go is a 
> Plan::extlist node and create_plan hook, which may allow copying 
> best_path data to the final plan. So, here, we will add a new 
> parameter and avoid touching the core code.
> But I would give +1 to current approach if it were done in a shorter 
> time.

Then before proceeding further, I think we need to benchmark this change 
to ensure there are no performance regressions. If performance issues 
arise, then using extensible EXPLAIN might be the only viable approach.

I have addressed most of the review comments except for the 
ExplainPropertyText change. I am attaching v3 of the patch with these 
updates. If anyone notices any performance issues, please let me know. 
Issue with ExplainPropertyText for this thread I'll fix in the next 
patches if it will be necessary.

So far, I have tested it on small machines. There are no performance 
issues yet. I'll run benchmarks on larger ones soon.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v3-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

20 March, 21:54:07

On 20/3/2025 15:03, Ilia Evdokimov wrote:
> 
> On 20.03.2025 15:32, Andrei Lepikhov wrote:
>> I quite frequently need the number of distinct values (or groups) 
>> predicted during the Memoize node creation to understand why caching 
>> is sometimes employed or not.
>> But I had thought about an alternative way: having an extensible 
>> EXPLAIN (thanks to Robert), we may save optimisation-stage data (I 
>> have the same necessity in the case of IncrementalSort, for example) 
>> and put it into the Plan node on-demand. So, the way I want to go is a 
>> Plan::extlist node and create_plan hook, which may allow copying 
>> best_path data to the final plan. So, here, we will add a new 
>> parameter and avoid touching the core code.
>> But I would give +1 to current approach if it were done in a shorter 
>> time.
> 
> 
> Then before proceeding further, I think we need to benchmark this change 
> to ensure there are no performance regressions. If performance issues 
> arise, then using extensible EXPLAIN might be the only viable approach.
> 
> I have addressed most of the review comments except for the 
> ExplainPropertyText change. I am attaching v3 of the patch with these 
> updates. If anyone notices any performance issues, please let me know. 
> Issue with ExplainPropertyText for this thread I'll fix in the next 
> patches if it will be necessary.
> 
> So far, I have tested it on small machines. There are no performance 
> issues yet. I'll run benchmarks on larger ones soon.
I have some doubts here.
The number of distinct values says something only when it has taken 
together with the number of calls.
Frequently, one of the caching keys is from outer table A (10 tuples), 
and another is from outer table B (100 tuples). Calculating the number 
of successful cache fetches predicted by the planner may not be evident 
in the case of a composite cache key.

What I may propose here is:
1. Use fraction of calls, for example - 50% duplicated key values.
2. Show the calculated hit and eviction ratio.

The second option looks better to me. It is pretty understandable by a 
user and may be compared to the numbers obtained during the execution.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

21 March, 05:50:11

On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov@gmail.com> wrote:
> I have some doubts here.
> The number of distinct values says something only when it has taken
> together with the number of calls.

Couldn't the reader just look at the Nested Loop's outer side row
estimate for that?

> Frequently, one of the caching keys is from outer table A (10 tuples),
> and another is from outer table B (100 tuples). Calculating the number
> of successful cache fetches predicted by the planner may not be evident
> in the case of a composite cache key.
>
> What I may propose here is:
> 1. Use fraction of calls, for example - 50% duplicated key values.
> 2. Show the calculated hit and eviction ratio.

My concern with showing just the estimated hit and evict ratios it
that it might lead to more questions, like how those were calculated.
However, I can see the argument for having one or both of these in
addition to the expected unique keys and expected cache capacity.

I think the primary factors in how useful Memoize is are: 1) How many
items do we expect to be able to store in the cache concurrently, and;
2) How many unique lookups keys do we expect to be looked up, and; 3)
The total number of expected lookups.   #1 is quite difficult to
figure out (maybe by looking at row width and row estimates) and
there's just no information about #2. #3 is already shown, in the
Nested Loop's outer side.

The reason that the hit and evict ratios might also be useful are that
it you need a bit of inside knowledge on how the 3 input values are
used to calculate these. For example you might come up with a higher
estimated hit ratio if you assumed the keys arrived in order vs
unordered.  If they're unordered and you don't have room for all keys
in the cache at once then that increases the chances that Memoize had
to evict something that will be needed for a future lookup.

Also, I was just looking back at the concern I had with increasing the
size of struct Memoize. I suspect that might not be that much of a
concern. The WindowAgg problem I mentioned was with the executor
state, not the plan node. I see the only time we access the plan node
for Memoize during execution is when we call build_hash_table().

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

21 March, 12:02:36

On 21/3/2025 03:50, David Rowley wrote:
> On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> I have some doubts here.
>> The number of distinct values says something only when it has taken
>> together with the number of calls.
> 
> Couldn't the reader just look at the Nested Loop's outer side row
> estimate for that?
In my cases, key sources are usually spread across dozens of joins, and 
it is visually hard to find out (especially when we have an EXPLAIN 
ANALYSE VERBOSE) the JOIN operator to extract the number of calls. The 
hit ratio, meanwhile, may be analysed locally in the Memoize node. For 
example, 80% (0.8) is evidently a good one, 40% is questionable, and 5% 
is too low and we should avoid Memoize here.
May it be beaten by just printing the "calls" number at the Memoize output?
> 
>> Frequently, one of the caching keys is from outer table A (10 tuples),
>> and another is from outer table B (100 tuples). Calculating the number
>> of successful cache fetches predicted by the planner may not be evident
>> in the case of a composite cache key.
>>
>> What I may propose here is:
>> 1. Use fraction of calls, for example - 50% duplicated key values.
>> 2. Show the calculated hit and eviction ratio.
> 
> I think the primary factors in how useful Memoize is are: 1) How many
> items do we expect to be able to store in the cache concurrently, and;
> 2) How many unique lookups keys do we expect to be looked up, and; 3)
> The total number of expected lookups.   #1 is quite difficult to
> figure out (maybe by looking at row width and row estimates) and
> there's just no information about #2. #3 is already shown, in the
> Nested Loop's outer side.
It depends on the task. If you are looking for the answer to how precise 
the group's estimation has been (to check statistics), I agree. In cases 
I have seen before, the main question is how effective was (or maybe) a 
Memoize node == how often the incoming key fits the cache. In that case, 
the hit ratio fraction is more understandable for a broad audience.
That's why according to my experience in case of a good cache 
reusability factor, users are usually okay with increasing the cache 
size to the necessary numbers and avoiding evictions at all costs. So, 
the predicted evict_ratio also tells us about incrementing work_mem to 
enhance the chances of Memoisation.
Having written the last sentence I came back to the point why work_mem 
is so universal and is used at each node as a criteria of memory 
allocation size? But it is a different story, I think.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

21 March, 13:08:14

After considering the opinions expressed in this discussion, I tend to 
agree more with David. If we add info about estimated unique keys and 
estimated capacity, then any additional information - such as 
evict_ratio and hit_ratio - can also be calculated, as EXPLAIN ANALYZE 
provides all the necessary details to compute these values.

For now, I’m attaching a rebased v4 patch, which includes the fix for 
ExplainPropertyText.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v4-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

24 March, 00:16:18

On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov <lepihov@gmail.com> wrote:
> In cases
> I have seen before, the main question is how effective was (or maybe) a
> Memoize node == how often the incoming key fits the cache. In that case,
> the hit ratio fraction is more understandable for a broad audience.
> That's why according to my experience in case of a good cache
> reusability factor, users are usually okay with increasing the cache
> size to the necessary numbers and avoiding evictions at all costs. So,
> the predicted evict_ratio also tells us about incrementing work_mem to
> enhance the chances of Memoisation.

Can you explain why "Estimated Capacity" and "Estimated Distinct
Lookup Keys" don't answer that?  If there are more distinct lookup
keys than there is capacity to store them, then some will be evicted.

Once again, I'm not necessarily objecting to hit and evict ratios
being shown, I just want to know they're actually useful enough to
show and don't just bloat the EXPLAIN output needlessly. So far your
arguments aren't convincing me that they are.

> Having written the last sentence I came back to the point why work_mem
> is so universal and is used at each node as a criteria of memory
> allocation size? But it is a different story, I think.

We have to set the limit somehow.  We could have done this by having a
GUC per node type that uses memory, but it looks like something more
universal was decided, perhaps to save on GUCs. I don't know the exact
history, but once upon a time, sort_mem existed. Perhaps that
disappeared because we grew more node types that needed to allocate
large, otherwise unbounded amounts of memory.  We did more recently
grow a hash_mem_multiplier GUC, so it's not true to say that work_mem
solely controls the limits of each node's memory allocation sizes.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

25 March, 00:23:18

On 3/23/25 22:16, David Rowley wrote:
> On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov <lepihov@gmail.com> wrote:
> Can you explain why "Estimated Capacity" and "Estimated Distinct
> Lookup Keys" don't answer that?  If there are more distinct lookup
> keys than there is capacity to store them, then some will be evicted.
I wouldn't say these parameters don't answer. I try to debate how usable 
they are. To be more practical, let me demonstrate the example:

EXPLAIN (COSTS OFF) SELECT * FROM t1,t2 WHERE t1.x=t2.x;

  Nested Loop  (cost=0.44..7312.65 rows=211330 width=33)
    ->  Seq Scan on t1  (cost=0.00..492.00 rows=30000 width=22)
    ->  Memoize  (cost=0.44..3.82 rows=7 width=11)
          Cache Key: t1.x
          Cache Mode: logical
          Estimated Capacity: 1001  Estimated Distinct Lookup Keys: 1001
          ->  Index Scan using t2_x_idx2 on t2  (cost=0.43..3.81 rows=7)
                Index Cond: (x = t1.x)

At first, I began to look for documentation because it was unclear what 
both new parameters specifically meant. Okay, there was no documentation 
but trivial code, and after a short discovery, I realised the meaning.
The first fact I see from this EXPLAIN is that Postgres estimates it has 
enough memory to fit all the entries. Okay, but what does it give me? I 
may just increase work_mem and provide the query with more memory if 
needed. My main concern is how frequently this cache is planned to be 
used. Doing some mental effort, I found the line "rows=30000." 
Calculating a bit more, I may suppose it means that we have a 95% chance 
to reuse the cache. Okay, I got it.
Now, see:
1. I needed to discover the meaning of the new parameters because they 
were different from the earlier "hit" and "miss."
2. I need to find a common JOIN for keys of this node. Imagine a typical 
200-row EXPLAIN with 2-3 Memoization keys from different tables.
3. I need to make calculations

On the opposite, the hit ratio, written instead, already known by 
analogy, already provides me with necessary cache efficacy data; no need 
to watch outside the node; it may be easily compared with the actual 
value. Am I wrong?

Both approaches provide the data, but each one is more usable?
I think we may ask more people, for example, Nikolay Samokhvalov, who, 
as I heard, works hard with explains.

> 
> Once again, I'm not necessarily objecting to hit and evict ratios
> being shown, I just want to know they're actually useful enough to
> show and don't just bloat the EXPLAIN output needlessly. So far your
> arguments aren't convincing me that they are.
I'm -1 for this redundancy.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

25 March, 01:05:16

On Tue, 25 Mar 2025 at 10:23, Andrei Lepikhov <lepihov@gmail.com> wrote:
>
> On 3/23/25 22:16, David Rowley wrote:
> > Once again, I'm not necessarily objecting to hit and evict ratios
> > being shown, I just want to know they're actually useful enough to
> > show and don't just bloat the EXPLAIN output needlessly. So far your
> > arguments aren't convincing me that they are.

> I'm -1 for this redundancy.

I'm not following what the -1 is for. Is it for showing both hit and
evict ratios?  And your vote is only for adding hit ratio?

Just to make it clear, the evict ratio isn't redundant because we show
hit ratio. If you have 1000 calls and 1000 distinct values and enough
memory to store those, then that's a 0% hit ratio since the first
lookup is a miss. If you change the calls to 2000 then that's a 50%
hit ratio and still 0% evict.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Tom Lane

Date:

25 March, 01:15:34

David Rowley <dgrowleyml@gmail.com> writes:
> I'm not following what the -1 is for. Is it for showing both hit and
> evict ratios?  And your vote is only for adding hit ratio?

> Just to make it clear, the evict ratio isn't redundant because we show
> hit ratio. If you have 1000 calls and 1000 distinct values and enough
> memory to store those, then that's a 0% hit ratio since the first
> lookup is a miss. If you change the calls to 2000 then that's a 50%
> hit ratio and still 0% evict.

FWIW, I share these doubts about whether these values are useful
enough to include in the default EXPLAIN output.  My main beef
with them though is that they are basically numbers derived along
the way to producing a cost estimate, and I don't think we break
out such intermediate results for other node types.

It's looking like Robert's "pg_overexplain" will hit the tree soon,
so maybe there could be a case for teaching that to emit additional
costing details such as these?  I'd kind of like to see a larger
scope than just Memoize for such an addition, though I'm not sure
exactly which other intermediate estimates are worth showing.

            regards, tom lane

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Lukas Fittl

Date:

25 March, 01:30:03

On Mon, Mar 24, 2025 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I share these doubts about whether these values are useful
enough to include in the default EXPLAIN output. My main beef
with them though is that they are basically numbers derived along
the way to producing a cost estimate, and I don't think we break
out such intermediate results for other node types.

The main argument I had initially when proposing this, is that Memoize is different from other plan nodes, in that it makes the child node costs "cheaper". Clearly seeing the expected cache hit/ratio (that drives that costing modification) helps interpret why the planner came up with a given plan.

Put differently, the reason this should be in the default EXPLAIN output (or at least the VERBOSE output), is because Memoize's impact on costing is counterintuitive (in my experience), and breaks the user's understanding of planner costs you can usually derive by looking at the per-node cost details, which typically flows up (i.e. gets larger as you step higher up in the tree).

It's looking like Robert's "pg_overexplain" will hit the tree soon,
so maybe there could be a case for teaching that to emit additional
costing details such as these?

I don't think that addresses the end-user usability sufficiently - from what I gathered "pg_overexplain" was meant for a hacker audience, not DBAs/etc interpreting Postgres plan choices.

Thanks,

Lukas

Lukas Fittl

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Tom Lane

Date:

25 March, 01:45:54

Lukas Fittl <lukas@fittl.com> writes:
> The main argument I had initially when proposing this, is that Memoize is
> different from other plan nodes, in that it makes the child node costs
> "cheaper". Clearly seeing the expected cache hit/ratio (that drives that
> costing modification) helps interpret why the planner came up with a given
> plan.

Memoize is hardly unique in that respect.  Merge Join sometimes
expects that it won't have to read the inner input to completion,
and reduces its cost estimate accordingly, and that confuses people.
LIMIT also reduces the cost estimate of its input, though perhaps
that doesn't surprise people.

As I said, I'm not necessarily averse to showing these numbers
somehow.  But I don't think they belong in the default output,
and I'm not even convinced that VERBOSE is the right place.
pg_overexplain seems like it could be an ideal home for this
sort of detail.

            regards, tom lane

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

25 March, 06:32:56

On Tue, 25 Mar 2025 at 11:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I share these doubts about whether these values are useful
> enough to include in the default EXPLAIN output.  My main beef
> with them though is that they are basically numbers derived along
> the way to producing a cost estimate, and I don't think we break
> out such intermediate results for other node types.

The only likeness I can think of is the "(originally N)" for the
buckets and batches in EXPLAIN ANALYZE for Hash Joins. I see that that
only shows when one of those isn't the same as the planner's
expectations

Maybe there's room to show additional details for Memoize in some
similar way only during EXPLAIN ANALYZE. EXPLAIN ANALYZE for Memoize
currently shows things like the number of hits, misses and evictions.
We could calculate what the planner expected those values to be and
show those. For this case, they're almost certain not to match what
the planner expected, but maybe that's ok.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

25 March, 09:40:47

On 24/3/2025 23:05, David Rowley wrote:
> On Tue, 25 Mar 2025 at 10:23, Andrei Lepikhov <lepihov@gmail.com> wrote:
>>
>> On 3/23/25 22:16, David Rowley wrote:
>>> Once again, I'm not necessarily objecting to hit and evict ratios
>>> being shown, I just want to know they're actually useful enough to
>>> show and don't just bloat the EXPLAIN output needlessly. So far your
>>> arguments aren't convincing me that they are.
> 
>> I'm -1 for this redundancy.
> 
> I'm not following what the -1 is for. Is it for showing both hit and
> evict ratios?  And your vote is only for adding hit ratio?
Oh, pardon me. I wanted to say that IMO, the couple (capacity, distinct 
lookup keys) makes redundant (hit ratio, eviction ratio) and vice versa.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

25 March, 12:32:52

On 3/24/25 23:45, Tom Lane wrote:
> Lukas Fittl <lukas@fittl.com> writes:
>> The main argument I had initially when proposing this, is that Memoize is
>> different from other plan nodes, in that it makes the child node costs
>> "cheaper". Clearly seeing the expected cache hit/ratio (that drives that
>> costing modification) helps interpret why the planner came up with a given
>> plan.
> As I said, I'm not necessarily averse to showing these numbers
> somehow.  But I don't think they belong in the default output,
> and I'm not even convinced that VERBOSE is the right place.
> pg_overexplain seems like it could be an ideal home for this
> sort of detail.
I prefer not to see these numbers in the default EXPLAIN output, not 
only because they can fluctuate but mainly because I like to view the 
basic query structure without additional details.
As I mentioned earlier, most of the information we typically need to 
explore query plans stays in path nodes and does not transfer to the 
Plan node. I believe this should stay as is, as we deal with numerous 
cases and a vast amount of data.
It would be beneficial to expose extra data in an external extension. By 
implementing a `create_plan` hook and an extensible list node in both 
Path and Plan structures, we could provide a machinery for writing an 
extension that can expose any planning-stage information in EXPLAIN on 
demand.
This could encourage developers to create a "pg_extended_explain" 
extension, which would address most user needs without requiring 
additional changes to the core system.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Robert Haas

Date:

26 March, 22:37:34

On Mon, Mar 24, 2025 at 6:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I said, I'm not necessarily averse to showing these numbers
> somehow.  But I don't think they belong in the default output,
> and I'm not even convinced that VERBOSE is the right place.
> pg_overexplain seems like it could be an ideal home for this
> sort of detail.

I think we're going to be sad if we start shunting things that
non-developers need into pg_overexplain. On the one hand, PostgreSQL
consultants will be annoyed because they'll have to get a contrib
module installed in order to be able to do their jobs, and I don't
think that should be a requirement. On the other hand, PostgreSQL
developers will also be annoyed because once the consultants start
using it they'll complain when we change things, and I think we want
to have the freedom to change things in pg_overexplain. For that
reason, I think that if we choose to display anything here, it should
either be displayed all the time or gated by some in-core option such
as VERBOSE.

I do acknowledge the argument that we don't show details of how costs
are derived in other cases. While I think that point has some
validity, the flip side is that I spend a fairly significant amount of
time attempting to reverse-engineer what the planner did from the
EXPLAIN output, and I find that pretty unenjoyable. The recent change
to show two decimal places on row-count estimation is one that comes
up a heck of a lot, and several people have thanked me for getting
that patch committed because that problem affected them, too. But it's
surely not the only example of a case where it's hard to determine
what happened in the planner from what shows up in EXPLAIN output, and
I think that trying to find ways to improve on that situation is
worthwhile.

I also don't think that we should be too concerned about bloating the
EXPLAIN output in the context of a patch that only affects Memoize.
Memoize nodes are not incredibly common in the query plans that I see,
so even if we added another line or three to the output for each one,
I don't think that would create major problems. On the other hand,
maybe there's an argument that what this patch touches is only the tip
of the iceberg, and that we're eventually going to want the same kinds
of things for Nested Loop and Hash Joins and Merge Joins, and maybe
even more detail that can be displayed in say 3 lines. In that case,
there's a double concern. On the one hand, that really would make the
output a whole lot more verbose, and on the other hand, it might
generate a fair amount of work to maintain it across future planner
changes. I can see deciding to reject changes of that sort on the
grounds that we're not prepared to maintain it, or deciding to gate it
behind a new option on the grounds that it is so verbose that even
people who say EXPLAIN VERBOSE are going to be sad if they get all
that crap by default. I'm not saying that we SHOULD make those
decisions -- I think exposing more detail here could be pretty useful
to people trying to solve query plan problems, including me, so I hope
we don't just kick that idea straight to the curb without due thought
-- but I would understand them.

The part I'm least sure about with respect to the proposed patch is
the actual stuff being displayed. I don't have the experience to know
whether it's useful for tracking down issues. If it's not, then I
agree we shouldn't display it. If it is, then I'm tentatively in favor
of showing it in standard EXPLAIN, possibly only with VERBOSE, with
the caveats from the previous paragraph: if more-common node types are
also going to have a bunch of stuff like this, then we need to think
more carefully. If Memoize is exceptional in needing additional
information displayed, then I think it's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

27 March, 13:12:38

On 26.03.2025 22:37, Robert Haas wrote:

I also don't think that we should be too concerned about bloating the
EXPLAIN output in the context of a patch that only affects Memoize.
Memoize nodes are not incredibly common in the query plans that I see,
so even if we added another line or three to the output for each one,
I don't think that would create major problems. On the other hand,
maybe there's an argument that what this patch touches is only the tip
of the iceberg, and that we're eventually going to want the same kinds
of things for Nested Loop and Hash Joins and Merge Joins, and maybe
even more detail that can be displayed in say 3 lines. In that case,
there's a double concern. On the one hand, that really would make the
output a whole lot more verbose, and on the other hand, it might
generate a fair amount of work to maintain it across future planner
changes. I can see deciding to reject changes of that sort on the
grounds that we're not prepared to maintain it, or deciding to gate it
behind a new option on the grounds that it is so verbose that even
people who say EXPLAIN VERBOSE are going to be sad if they get all
that crap by default. I'm not saying that we SHOULD make those
decisions -- I think exposing more detail here could be pretty useful
to people trying to solve query plan problems, including me, so I hope
we don't just kick that idea straight to the curb without due thought
-- but I would understand them.

The part I'm least sure about with respect to the proposed patch is
the actual stuff being displayed. I don't have the experience to know
whether it's useful for tracking down issues. If it's not, then I
agree we shouldn't display it. If it is, then I'm tentatively in favor
of showing it in standard EXPLAIN, possibly only with VERBOSE, with
the caveats from the previous paragraph: if more-common node types are
also going to have a bunch of stuff like this, then we need to think
more carefully. If Memoize is exceptional in needing additional
information displayed, then I think it's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com

I understand the concerns raised about the risk of opening the door to more diagnostic detail across various plan nodes. However, in Hash Join, Merge Join, and Nested Loop, EXPLAIN typically reveals at least some of the planner’s expectations. For example, Hash Join shows the number of batches and originally expected buckets, giving insight into whether the hash table fit in memory. Merge Join shows unexpected Sort nodes when presorted inputs were assumed. Nested Loop reflects planner assumptions via loops and row estimates. In other words, these nodes expose at least some information about what the planner thought would happen.

Memoize is unique in that it shows runtime statistics (hits, misses, evictions), but reveals nothing about the planner’s expectations. We don’t see how many distinct keys were estimated or how many entries the planner thought would fit in memory. This makes it very difficult to understand whether Memoize was a good choice or not, or how to fix it when it performs poorly.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Robert Haas

Date:

27 March, 15:46:07

On Thu, Mar 27, 2025 at 6:12 AM Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
> I understand the concerns raised about the risk of opening the door to more diagnostic detail across various plan
nodes.However, in Hash Join, Merge Join, and Nested Loop, EXPLAIN typically reveals at least some of the planner’s
expectations.For example, Hash Join shows the number of batches and originally expected buckets, giving insight into
whetherthe hash table fit in memory. Merge Join shows unexpected Sort nodes when presorted inputs were assumed. Nested
Loopreflects planner assumptions via loops and row estimates. In other words, these nodes expose at least some
informationabout what the planner thought would happen. 
>
> Memoize is unique in that it shows runtime statistics (hits, misses, evictions), but reveals nothing about the
planner’sexpectations. We don’t see how many distinct keys were estimated or how many entries the planner thought would
fitin memory. This makes it very difficult to understand whether Memoize was a good choice or not, or how to fix it
whenit performs poorly. 

Right. Without taking a strong position on this particular patch, I
think it's entirely reasonable, as a concept, to think of making
Memoize work similarly to what other nodes already do.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

27 March, 16:46:50

On 3/27/25 11:12, Ilia Evdokimov wrote:
> Memoize is unique in that it shows runtime statistics (hits, misses, 
> evictions), but reveals nothing about the planner’s expectations. We 
> don’t see how many distinct keys were estimated or how many entries the 
> planner thought would fit in memory.
It is acceptable for me to have Memoize estimations covered by the COSTS 
option.
This data may help both developers and the users. For us, it provides 
information on how good group estimations are and lets us identify gaps 
in the prediction code. A user may find insights about ways to optimise 
the query, detecting stale statistics or defining extended statistics.

  This makes it very difficult to
> understand whether Memoize was a good choice or not, or how to fix it 
> when it performs poorly.
I think user already has enough data to determine whether Memoize was 
the right choice - hits/misses/evictions show that.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

28 March, 15:20:44

Then we need to decide clearly what exactly to display in EXPLAIN for 
the Memoize node: absolute values (estimated distinct keys and estimated 
cache capacity) or ratios (hit_ratio and evict_ratio). Ratios have the 
advantage of quickly reflecting the overall effectiveness of Memoize. 
However, absolute values have a significant advantage as they explicitly 
reveal the reason of Memoize's poor performance, making problem 
diagnosis simpler.

With absolute values, users can directly understand the underlying 
reason for poor performance. For example: insufficient memory (capacity 
< distinct keys), inaccurate planner statistics (distinct keys 
significantly different from actual values), poorly ordered keys 
(capacity ~ distinct keys, but frequent evictions as seen in the 
Evictions parameter), or Memoize simply not being beneficial (capacity ~ 
distinct keys ~ calls). Ratios, by contrast, only reflect the final 
outcome without clearly indicating the cause or the specific steps 
needed to resolve the issue.

Thus, absolute values do more than just inform users that a problem 
exists; they provide actionable details that enable users to directly 
address the problem (increase work_mem, refresh statistics, create 
extended statistics, or disable Memoize entirely). Additionally, no 
other plan nodes in PostgreSQL currently use a similar ratio-based 
approach - everywhere else absolute values are consistently shown (e.g., 
number of rows, buckets, batches, memory used, etc.). Using absolute 
values in Memoize maintains consistency with existing practice.

I've updated the patch to v5, since the new parameter est_unique_keys in 
make_memoize() is now placed near est_entries, which is more logical and 
readable than putting it at the end.

Any thoughts?

--
Best Regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v5-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

01 April, 10:52:52

On 28.03.2025 15:20, Ilia Evdokimov wrote:

> Then we need to decide clearly what exactly to display in EXPLAIN for 
> the Memoize node: absolute values (estimated distinct keys and 
> estimated cache capacity) or ratios (hit_ratio and evict_ratio). 
> Ratios have the advantage of quickly reflecting the overall 
> effectiveness of Memoize. However, absolute values have a significant 
> advantage as they explicitly reveal the reason of Memoize's poor 
> performance, making problem diagnosis simpler.
>
> With absolute values, users can directly understand the underlying 
> reason for poor performance. For example: insufficient memory 
> (capacity < distinct keys), inaccurate planner statistics (distinct 
> keys significantly different from actual values), poorly ordered keys 
> (capacity ~ distinct keys, but frequent evictions as seen in the 
> Evictions parameter), or Memoize simply not being beneficial (capacity 
> ~ distinct keys ~ calls). Ratios, by contrast, only reflect the final 
> outcome without clearly indicating the cause or the specific steps 
> needed to resolve the issue.
>
> Thus, absolute values do more than just inform users that a problem 
> exists; they provide actionable details that enable users to directly 
> address the problem (increase work_mem, refresh statistics, create 
> extended statistics, or disable Memoize entirely). Additionally, no 
> other plan nodes in PostgreSQL currently use a similar ratio-based 
> approach - everywhere else absolute values are consistently shown 
> (e.g., number of rows, buckets, batches, memory used, etc.). Using 
> absolute values in Memoize maintains consistency with existing practice.
>
> I've updated the patch to v5, since the new parameter est_unique_keys 
> in make_memoize() is now placed near est_entries, which is more 
> logical and readable than putting it at the end.
>
> Any thoughts?
>
> -- 
> Best Regards,
> Ilia Evdokimov,
> Tantor Labs LLC.


With the feature freeze coming up soon, I’d like to ask: do we plan to 
include this patch in v18?

Please let me know if there’s anything I can do to help move it forward.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

02 April, 00:06:59

On Tue, 1 Apr 2025 at 20:52, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
> With the feature freeze coming up soon, I’d like to ask: do we plan to
> include this patch in v18?

I don't think there's a clear enough consensus yet on what EXPLAIN
should display. We'd need that beforehand. There are now less than 7
days to get that, so it might be unrealistic.

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

David

[1] https://postgr.es/m/CAApHDvqPkWmz1Lq23c9CD+mAW=hgPrD289tC30f3H1f6Ng59+g@mail.gmail.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

02 April, 12:35:33

On 02.04.2025 00:06, David Rowley wrote:

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

David

[1] https://postgr.es/m/CAApHDvqPkWmz1Lq23c9CD+mAW=hgPrD289tC30f3H1f6Ng59+g@mail.gmail.com

Apologies for not addressing your earlier suggestion properly. After reconsidering, I agree that showing ndistinct and est_entries in EXPLAIN ANALYZE when there are actual cache misses is the best approach. This is typically when users notice performance regressions and start investigating the cause, so surfacing planner expectations at that point can be the most useful. I attached v6 patch with changes.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v6-0001-Display-estimated-distinct-keys-and-cache-capacity.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

14 April, 12:02:41

As there seems to be general support for the idea behind this patch, and 
no one has raised strong objections recently, but also no clear 
consensus on the exact final shape of the output has been reached yet - 
and I understand other priorities might be taking precedence right now - 
I’ve registered the patch in the upcoming CommitFest 2025-07 to ensure 
it stays in the review process.

https://commitfest.postgresql.org/patch/5697/

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Robert Haas

Date:

14 April, 19:31:22

On Tue, Apr 1, 2025 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I tried to move things along to address Tom's concern about not
> wanting to show this in EXPLAIN's standard output. I suggested in [1]
> that we could use EXPLAIN ANALYZE, but nobody commented on that.
> EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
> put these in EXPLAIN ANALYZE then the viewer can more easily compare
> planned vs actual. I had mentioned that Hash Join does something like
> this for buckets.

I don't think we should use ANALYZE for this because, IME, that should
just be about whether the query gets executed. Since this looks like
information that is available at plan time, I think it should be
displayed all the time or made contingent on VERBOSE. It would be sad
if you had to run the query to get information that was available
without needing to run the query. Personally, I think we can probably
just display it all the time. I mean, the typical plan is not going to
contain enough Memoize nodes that a little extra chatter for each one
impedes readability significantly.

As far as what to display, I have sympathy with Lukas's initial
complaint that it was hard to understand the costing of Memoize. I
haven't faced this exact problem, I think because Memoize isn't that
often used in plans I have seen, but I've faced a lot of similar
problems where you have to try to work backwards painfully to figure
out the chain of events that led to the value you're actually seeing,
and I'm +1 for efforts to make it more clear.

Having looked at v6, I think it would help, at least if the reader is
sufficiently knowledgeable. From the values displayed, it looks like
someone could reconstruct the evict_ratio value in
cost_memoize_rescan(), and if they know the loop count, also the
hit_ratio. But that seems hard: if you weren't reading the code, how
would you know how to do it? Even if you are reading the code, are you
sure you'd reconstruct it correctly? I wonder why we think it's better
to display this than a more "cooked" number like the estimated hit
ratio that was proposed in v1?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

14 April, 22:49:23

On 4/14/25 18:31, Robert Haas wrote:
> On Tue, Apr 1, 2025 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> I tried to move things along to address Tom's concern about not
>> wanting to show this in EXPLAIN's standard output. I suggested in [1]
>> that we could use EXPLAIN ANALYZE, but nobody commented on that.
>> EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
>> put these in EXPLAIN ANALYZE then the viewer can more easily compare
>> planned vs actual. I had mentioned that Hash Join does something like
>> this for buckets.
> Having looked at v6, I think it would help, at least if the reader is
> sufficiently knowledgeable. From the values displayed, it looks like
> someone could reconstruct the evict_ratio value in
> cost_memoize_rescan(), and if they know the loop count, also the
> hit_ratio. But that seems hard: if you weren't reading the code, how
> would you know how to do it? Even if you are reading the code, are you
> sure you'd reconstruct it correctly? I wonder why we think it's better
> to display this than a more "cooked" number like the estimated hit
> ratio that was proposed in v1?
+1.
My reasons: a) it more "physical"; b) We don't need to look for the 
corresponding NestLoop JOIN node - my explans usually contain 150+ lines 
and it is hard to get into the analysis something out of current screen.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

14 April, 23:53:05

On Tue, 15 Apr 2025 at 04:31, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think we should use ANALYZE for this because, IME, that should
> just be about whether the query gets executed. Since this looks like
> information that is available at plan time, I think it should be
> displayed all the time or made contingent on VERBOSE. It would be sad
> if you had to run the query to get information that was available
> without needing to run the query. Personally, I think we can probably
> just display it all the time. I mean, the typical plan is not going to
> contain enough Memoize nodes that a little extra chatter for each one
> impedes readability significantly.

You might be right there. I just offered that as a possible solution
to the concern of bloating the EXPLAIN output.

> Having looked at v6, I think it would help, at least if the reader is
> sufficiently knowledgeable. From the values displayed, it looks like
> someone could reconstruct the evict_ratio value in
> cost_memoize_rescan(), and if they know the loop count, also the
> hit_ratio. But that seems hard: if you weren't reading the code, how
> would you know how to do it? Even if you are reading the code, are you
> sure you'd reconstruct it correctly? I wonder why we think it's better
> to display this than a more "cooked" number like the estimated hit
> ratio that was proposed in v1?

I think the problem with only showing the estimated hit ratio is that
it's difficult to then know what you might change to improve that in
cases where you might have other join types disabled and are
experimenting to figure out why Nested Loop / Memoize isn't being
chosen.  If you see the estimated capacity isn't big enough to store
all the rows for all distinct keys then you know you can increase
work_mem to assist. If the Estimated Unique Keys is too high, then you
might be able to adjust some ndistinct estimates or increase stats
targets somewhere to improve that.  If we only showed the estimated
hit ratio, then you wouldn't know where to start to get the planner to
pick the Memoize plan.

As I mentioned in [1], I wouldn't object to having the estimated hit
ratio shown in addition to the components that are used to calculate
it. Per [2], Andrei didn't seem to like the idea due to the
duplication. FWIW, I do agree with you that the hit ratio isn't
exactly easy to calculate on the fly when looking at the input numbers
that are used to calculate it. I just wasn't 100% certain that it
would be needed in its calculated form. Maybe it depends on the reason
you're looking at EXPLAIN. Maybe the reason I mentioned above for
looking at EXPLAIN is just one of many and I'm not thinking hard
enough to consider the other ones.

If we can't get consensus for everything people want to add at once
then maybe the patch could be broken into two, with 0001 being pretty
much the v4 patch and then have 0002 add the Estimated Hit Ratio.
Having the physical patches and being able to try it out or view the
regression test changes might lower the bar on people chipping in with
their views.

David

[1] https://postgr.es/m/CAApHDvoE5S5FkvEq+N3-J9LfaVUpWLOnczOYAOEvBMCY20=pdg@mail.gmail.com
[2] https://postgr.es/m/e76f1635-a933-47bf-a826-045e20d5cfff@gmail.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

15 April, 02:14:37

On 14.04.2025 23:53, David Rowley wrote:

If we can't get consensus for everything people want to add at once
then maybe the patch could be broken into two, with 0001 being pretty
much the v4 patch and then have 0002 add the Estimated Hit Ratio.
Having the physical patches and being able to try it out or view the
regression test changes might lower the bar on people chipping in with
their views.

Agreed - I’ll split the patch into two: one adds Estimated Capacity and Estimated Distinct Keys, and the second adds Estimated Hit Ratio. I’ll also add regression test differences to make it easier to evaluate the usefulness of each part.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

15 April, 03:23:11

On Tue, 15 Apr 2025 at 11:14, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
> On 14.04.2025 23:53, David Rowley wrote:
> If we can't get consensus for everything people want to add at once
> then maybe the patch could be broken into two, with 0001 being pretty
> much the v4 patch and then have 0002 add the Estimated Hit Ratio.
> Having the physical patches and being able to try it out or view the
> regression test changes might lower the bar on people chipping in with
> their views.
>
>
> Agreed - I’ll split the patch into two: one adds Estimated Capacity and Estimated Distinct Keys, and the second adds
EstimatedHit Ratio. I’ll also add regression test differences to make it easier to evaluate the usefulness of each
part.

Thanks for the patches.

I'm just looking for ways to allow us to group all three of these
together for the TEXT format type. I see the BUFFERS are displayed
quite compactly, e.g. "Buffers: shared hit=17 read=4".
show_wal_usage() does something similar and so does
show_modifytable_info() for MERGE.

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

I'd also like to vote that you modify explain.c and multiply the
hit_ratio by 100 and use "%" as the unit in ExplainPropertyFloat().
Just looking at the raw number of "1.00" in the expected output, it
isn't obvious if the planner expects every lookup to be a cache hit or
just 1% of them.

Also, to get something commitable, you'll also need to modify
explain_memoize() at the top of memoize.sql and add handling to mask
out the value of these new properties. These are not going to be
anywhere near stable enough across platforms to have these shown in
the expected output files.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

15 April, 12:44:13

On 15.04.2025 03:23, David Rowley wrote:

I'm just looking for ways to allow us to group all three of these
together for the TEXT format type. I see the BUFFERS are displayed
quite compactly, e.g. "Buffers: shared hit=17 read=4".
show_wal_usage() does something similar and so does
show_modifytable_info() for MERGE.

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

I'd also like to vote that you modify explain.c and multiply the
hit_ratio by 100 and use "%" as the unit in ExplainPropertyFloat().
Just looking at the raw number of "1.00" in the expected output, it
isn't obvious if the planner expects every lookup to be a cache hit or
just 1% of them.

+1. I'll attach three patches with 'capacity=N distinct keys=N' output, 'lookups=N' and 'ratio=N.N%'


Also, to get something commitable, you'll also need to modify
explain_memoize() at the top of memoize.sql and add handling to mask
out the value of these new properties. These are not going to be
anywhere near stable enough across platforms to have these shown in
the expected output files.

David

There are other regression tests - such as create_index, subselect, join, and partition_join - that also use Memoize nodes. These tests currently don't have any masking in place, so the new output fields may cause instability across platforms.

Wrapping the line in costs or verbose would help with test stability, but doing that only to simplify test output doesn't look like the right reason. Maybe it makes more sense to mask these values in other tests that use Memoize too. What do you think?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Robert Haas

Date:

15 April, 19:24:57

On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Maybe we could compress the text output a bit with:
>
> "Estimates: capacity=N distinct keys=N hit ratio=N.N%"

Sure, that works for me.

> If we get it that compact, maybe lookups could fit in there too, i.e.:
>
> "Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

Is lookups=N here the estimated number of lookups i.e. what we think
nloops will end up being?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

15 April, 23:20:00

On Tue, 15 Apr 2025 at 21:44, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
> Wrapping the line in costs or verbose would help with test stability, but doing that only to simplify test output
doesn'tlook like the right reason. Maybe it makes more sense to mask these values in other tests that use Memoize too.
Whatdo you think?
 

Disregard what I said about the explain_memoize() function. The new
output just shouldn't be shown with EXPLAIN (costs off).

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

15 April, 23:49:46

On Wed, 16 Apr 2025 at 04:25, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > "Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"
>
> Is lookups=N here the estimated number of lookups i.e. what we think
> nloops will end up being?

Yes. The estimate is the "calls" variable in cost_memoize_rescan(),
which is fairly critical in the hit ratio estimate calculation.

Technically this is just the Nested Loop's outer_path->rows. There was
an argument earlier in the thread for putting this in along with the
other fields to make things easier to read. I did argue that it was
redundant due to the fact that the reader can look at the row estimate
for the outer side of the Nest Loop, but maybe it's small enough to go
in using the above format.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

16 April, 01:09:20

I've prepared the updated patches as discussed, including the addition 
of estimated lookups in the EXPLAIN output for Memoize. Please let me 
know if you have any feedback or further suggestions.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Maciek Sakrejda

Date:

16 April, 19:09:41

On Tue, Apr 15, 2025 at 1:50 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 16 Apr 2025 at 04:25, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > "Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"
> >
> > Is lookups=N here the estimated number of lookups i.e. what we think
> > nloops will end up being?
>
> Yes. The estimate is the "calls" variable in cost_memoize_rescan(),
> which is fairly critical in the hit ratio estimate calculation.
>
> Technically this is just the Nested Loop's outer_path->rows. There was
> an argument earlier in the thread for putting this in along with the
> other fields to make things easier to read. I did argue that it was
> redundant due to the fact that the reader can look at the row estimate
> for the outer side of the Nest Loop, but maybe it's small enough to go
> in using the above format.

This kind of thing is nice to have in the text format, but it's really
nice to have when working with structured formats. Currently, to get a
loops estimate for the inner node, I need to find the parent, find the
outer node in the parent's children, and get that sibling's Plan Rows
(I think). It'd be nice to make things like that easier.

Thanks,
Maciek

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

David Rowley

Date:

18 April, 00:29:37

On Wed, 16 Apr 2025 at 10:09, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:
> I've prepared the updated patches as discussed, including the addition
> of estimated lookups in the EXPLAIN output for Memoize. Please let me
> know if you have any feedback or further suggestions.

While this is fresh, as I might forget before the July CF...

I know I did suggest that the hit_ratio should be a percent and it
should be multiplied by 100.0.  This seems fine for the text format as
you can have the % unit there. However, looking at
ExplainPropertyFloat(), we don't print the unit for non-text formats.
I wonder if printing a number between 0 and 100 there will be
confusing. I don't believe we have anything else in EXPLAIN that shows
a percentage. I don't quite know what to do about this. One thought I
had was to only * 100 for text format, but that might be more
confusing.

Aside from that, I'd prefer the new fields in struct Memoize to be
prefixed with "est_" the same as the existing "est_entries" field. I'm
unsure why MemoizePath.calls becomes Memoize.lookups. Seems
unnecessary and just means more brain space being used to maintain a
mental model.

David

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

27 April, 22:58:52

Sorry for the delayed reply.

On 18.04.2025 00:29, David Rowley wrote:

I know I did suggest that the hit_ratio should be a percent and it
should be multiplied by 100.0.  This seems fine for the text format as
you can have the % unit there. However, looking at
ExplainPropertyFloat(), we don't print the unit for non-text formats.
I wonder if printing a number between 0 and 100 there will be
confusing. I don't believe we have anything else in EXPLAIN that shows
a percentage. I don't quite know what to do about this. One thought I
had was to only * 100 for text format, but that might be more
confusing.

I agree that displaying the percentage instead of the fraction is a better choice, because it is much more convenient for analysis. If this value were intended to be passed into some API for further computations, it would make sense to keep it as a fraction to avoid unnecessary multiplications and divisions by 100. However, such use cases seem extremely unlikely.

Therefore, I think it is better to report percentages directly. Since non-text EXPLAIN formats do not display units, I propose to rename the field to "hit_percent" in all formats, including the text format. This way, the meaning of the value remains clear without needing additional context. What do you think about this approach?


Aside from that, I'd prefer the new fields in struct Memoize to be
prefixed with "est_" the same as the existing "est_entries" field. I'm
unsure why MemoizePath.calls becomes Memoize.lookups. Seems
unnecessary and just means more brain space being used to maintain a
mental model.

Agree.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

01 May, 15:22:34

On 27.04.2025 22:58, Ilia Evdokimov wrote:
> Therefore, I think it is better to report percentages directly. Since 
> non-text EXPLAIN formats do not display units, I propose to rename the 
> field to "hit_percent" in all formats, including the text format. This 
> way, the meaning of the value remains clear without needing additional 
> context. What do you think about this approach?


I attached updated v9 patch with the suggested changes. The updated line 
in the EXPLAIN looks like this:

Text format:
     Estimates: capacity=1 distinct keys=1 lookups=2 hit percent=50.00%

Non-text format:
     Estimated Capacity: 1
     Estimated Distinct Lookup Keys: 1
     Estimated Lookups: 2
     Estimated Hit Percent: 50.00

Any suggestions?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment

v9-0001-Expose-cache-hit-statistics-in-Memoize-node-EXPLAIN.patch

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Andrei Lepikhov

Date:

01 May, 16:37:52

On 5/1/25 14:22, Ilia Evdokimov wrote:
>      Estimated Hit Percent: 50.00
> 
> Any suggestions?
I still think that two meaningful digits are enough for an EXPLAIN. We 
usually need to estimate the tuple set's size, not a precise number of 
tuples or groups. And definitely, not an arbitrarily chosen two digits 
after the point. So, IMO, good examples may look like the following:
50%, 5.0%, 0.00051%.

-- 
regards, Andrei Lepikhov

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From

Ilia Evdokimov

Date:

01 May, 22:10:56

On 01.05.2025 16:37, Andrei Lepikhov wrote:
> I still think that two meaningful digits are enough for an EXPLAIN. We 
> usually need to estimate the tuple set's size, not a precise number of 
> tuples or groups. And definitely, not an arbitrarily chosen two digits 
> after the point. So, IMO, good examples may look like the following:
> 50%, 5.0%, 0.00051%.
>

The idea is not bad. But I think it might be better to raise this in a 
new thread, because this is not the only place [0] where we might want 
to output numbers like that. We could also discuss there what else might 
benefit from such formatting.

[0]: 
https://www.postgresql.org/message-id/3365160.1739385615%40sss.pgh.pa.us

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.