Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment - Mailing list pgsql-hackers

From Lukas Fittl
Subject Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Date
Msg-id CAP53Pky29GWAVVk3oBgKBDqhND0BRBN6yTPeguV_qSivFL5N_g@mail.gmail.com
Whole thread Raw
Responses Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the involved "bad" plan was including Memoize - though I don't necessarily think that Memoize is to blame (and this isn't any of the problems recently fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan, is that the Memoize output was a bit hard to reason about - especially since the plan involving Memoize was expensive to run, and so I was mostly running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

     ->  Nested Loop  (cost=1.00..971672.56 rows=119623 width=0)
           ->  Index Only Scan using table1_idx on table1  (cost=0.43..372676.50 rows=23553966 width=8)
           ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
                 Cache Key: table1.table2_id
                 Cache Mode: logical
                 ->  Index Scan using table2_idx on table2  (cost=0.56..0.60 rows=1 width=8)
                       Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose instead), had a total cost of 1,451,807.35 -- and so I was trying to figure out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total cost of14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot of the smaller costs. Thus, in this example, it appears Memoize made the plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize only being implicitly visible in the Nested Loop's total cost - and with nothing useful on the Memoize node itself - since the rescan costs are not shown.

I think explicitly adding the estimated cache hit ratio for Memoize nodes might make this easier to reason about, like this:

->  Memoize  (cost=0.57..0.61 rows=1 width=8)
     Cache Key: table1.table2_id
     Cache Mode: logical
     Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in cost_memoize_rescan - since that's the most significant contributor to the cache hit ratio (and you can influence that directly by improving the ndistinct statistics).

See attached a patch that implements showing the cache hit ratio as a discussion starter.

I'll park this in the July commitfest for now.

Thanks,
Lukas

--
Lukas Fittl
Attachment

pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Request for comment on setting binary format output per session
Next
From: Dave Cramer
Date:
Subject: Re: Request for comment on setting binary format output per session