Thread: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From
Lukas Fittl
Date:
Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the involved "bad" plan was including Memoize - though I don't necessarily think that Memoize is to blame (and this isn't any of the problems recently fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan, is that the Memoize output was a bit hard to reason about - especially since the plan involving Memoize was expensive to run, and so I was mostly running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

     ->  Nested Loop  (cost=1.00..971672.56 rows=119623 width=0)
           ->  Index Only Scan using table1_idx on table1  (cost=0.43..372676.50 rows=23553966 width=8)
           ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
                 Cache Key: table1.table2_id
                 Cache Mode: logical
                 ->  Index Scan using table2_idx on table2  (cost=0.56..0.60 rows=1 width=8)
                       Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose instead), had a total cost of 1,451,807.35 -- and so I was trying to figure out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total cost of14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot of the smaller costs. Thus, in this example, it appears Memoize made the plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize only being implicitly visible in the Nested Loop's total cost - and with nothing useful on the Memoize node itself - since the rescan costs are not shown.

I think explicitly adding the estimated cache hit ratio for Memoize nodes might make this easier to reason about, like this:

->  Memoize  (cost=0.57..0.61 rows=1 width=8)
     Cache Key: table1.table2_id
     Cache Mode: logical
     Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in cost_memoize_rescan - since that's the most significant contributor to the cache hit ratio (and you can influence that directly by improving the ndistinct statistics).

See attached a patch that implements showing the cache hit ratio as a discussion starter.

I'll park this in the July commitfest for now.

Thanks,
Lukas

--
Lukas Fittl
Attachment

Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From
David Rowley
Date:
On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:
> Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in
cost_memoize_rescan- since that's the most significant contributor to the cache hit ratio (and you can influence that
directlyby improving the ndistinct statistics). 

I think the ndistinct estimate plus the est_entries together would be
useful. I think showing just the hit ratio number might often just
raise too many questions about how that's calculated. To calculate the
hit ratio we need to estimate the number of entries that can be kept
in the cache at once and also the number of input rows and the number
of distinct values.  We can see the input rows by looking at the outer
side of the join in EXPLAIN, but we've no idea about the ndistinct or
how many items the planner thought could be kept in the cache at once.

The plan node already has est_entries, so it should just be a matter
of storing the ndistinct estimate in the Path and putting it into the
Plan node so the executor has access to it during EXPLAIN.

David



Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From
Daniel Gustafsson
Date:
> On 7 Mar 2023, at 10:51, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:
>> Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in
cost_memoize_rescan- since that's the most significant contributor to the cache hit ratio (and you can influence that
directlyby improving the ndistinct statistics). 
>
> I think the ndistinct estimate plus the est_entries together would be
> useful. I think showing just the hit ratio number might often just
> raise too many questions about how that's calculated. To calculate the
> hit ratio we need to estimate the number of entries that can be kept
> in the cache at once and also the number of input rows and the number
> of distinct values.  We can see the input rows by looking at the outer
> side of the join in EXPLAIN, but we've no idea about the ndistinct or
> how many items the planner thought could be kept in the cache at once.
>
> The plan node already has est_entries, so it should just be a matter
> of storing the ndistinct estimate in the Path and putting it into the
> Plan node so the executor has access to it during EXPLAIN.

Lukas: do you have an updated patch for this commitfest to address David's
comments?

--
Daniel Gustafsson




On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson <daniel@yesql.se> wrote:
Lukas: do you have an updated patch for this commitfest to address David's
comments?

I have a draft - I should be able to post an updated patch in the next days. Thanks for checking!

Thanks,
Lukas

--
Lukas Fittl