Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment - Mailing list pgsql-hackers

From Andrei Lepikhov
Subject Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Date
Msg-id 11c709e1-9b17-4c00-9ead-93609a09b9db@gmail.com
Whole thread Raw
In response to Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
List pgsql-hackers
On 3/23/25 22:16, David Rowley wrote:
> On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov <lepihov@gmail.com> wrote:
> Can you explain why "Estimated Capacity" and "Estimated Distinct
> Lookup Keys" don't answer that?  If there are more distinct lookup
> keys than there is capacity to store them, then some will be evicted.
I wouldn't say these parameters don't answer. I try to debate how usable 
they are. To be more practical, let me demonstrate the example:

EXPLAIN (COSTS OFF) SELECT * FROM t1,t2 WHERE t1.x=t2.x;

  Nested Loop  (cost=0.44..7312.65 rows=211330 width=33)
    ->  Seq Scan on t1  (cost=0.00..492.00 rows=30000 width=22)
    ->  Memoize  (cost=0.44..3.82 rows=7 width=11)
          Cache Key: t1.x
          Cache Mode: logical
          Estimated Capacity: 1001  Estimated Distinct Lookup Keys: 1001
          ->  Index Scan using t2_x_idx2 on t2  (cost=0.43..3.81 rows=7)
                Index Cond: (x = t1.x)

At first, I began to look for documentation because it was unclear what 
both new parameters specifically meant. Okay, there was no documentation 
but trivial code, and after a short discovery, I realised the meaning.
The first fact I see from this EXPLAIN is that Postgres estimates it has 
enough memory to fit all the entries. Okay, but what does it give me? I 
may just increase work_mem and provide the query with more memory if 
needed. My main concern is how frequently this cache is planned to be 
used. Doing some mental effort, I found the line "rows=30000." 
Calculating a bit more, I may suppose it means that we have a 95% chance 
to reuse the cache. Okay, I got it.
Now, see:
1. I needed to discover the meaning of the new parameters because they 
were different from the earlier "hit" and "miss."
2. I need to find a common JOIN for keys of this node. Imagine a typical 
200-row EXPLAIN with 2-3 Memoization keys from different tables.
3. I need to make calculations

On the opposite, the hit ratio, written instead, already known by 
analogy, already provides me with necessary cache efficacy data; no need 
to watch outside the node; it may be easily compared with the actual 
value. Am I wrong?

Both approaches provide the data, but each one is more usable?
I think we may ask more people, for example, Nikolay Samokhvalov, who, 
as I heard, works hard with explains.

> 
> Once again, I'm not necessarily objecting to hit and evict ratios
> being shown, I just want to know they're actually useful enough to
> show and don't just bloat the EXPLAIN output needlessly. So far your
> arguments aren't convincing me that they are.
I'm -1 for this redundancy.

-- 
regards, Andrei Lepikhov



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: vacuum_truncate configuration parameter and isset_offset
Next
From: Tomas Vondra
Date:
Subject: Re: Snapshot related assert failure on skink