From: Andrei Lepikhov
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Msg-id: ff747c1d-1925-45ca-9c47-2763be45cba0@gmail.com
In response to: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On 21/3/2025 03:50, David Rowley wrote:
> On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> I have some doubts here.
>> The number of distinct values says something only when taken
>> together with the number of calls.
> 
> Couldn't the reader just look at the Nested Loop's outer side row
> estimate for that?
In my cases, key sources are usually spread across dozens of joins, and 
it is visually hard to find the join operator from which to extract the 
number of calls (especially in EXPLAIN ANALYSE VERBOSE output). The hit 
ratio, meanwhile, can be analysed locally at the Memoize node. For 
example, 80% (0.8) is evidently good, 40% is questionable, and 5% is 
too low, so we should avoid Memoize there.
Could that be beaten by just printing the "calls" number in the Memoize 
output?
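To illustrate what I mean (the labels and numbers below are 
hypothetical; the exact presentation is the subject of this thread), 
everything needed to judge the node would sit right on it:

   ->  Memoize  (cost=0.30..4.89 rows=10 width=8)
         Cache Key: a.x
         Estimated Distinct Keys: 200  Estimated Calls: 1000
         Estimated Hit Ratio: 0.80  Estimated Evict Ratio: 0.00

Here, 0.80 reads as good at a glance, with no need to hunt for the 
corresponding join operator elsewhere in the plan.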
> 
>> Frequently, one of the caching keys is from outer table A (10 tuples),
>> and another is from outer table B (100 tuples). Calculating the number
>> of successful cache fetches predicted by the planner may not be evident
>> in the case of a composite cache key.
>>
>> What I may propose here is:
>> 1. Use fraction of calls, for example - 50% duplicated key values.
>> 2. Show the calculated hit and eviction ratio.
> 
> I think the primary factors in how useful Memoize is are: 1) How many
> items do we expect to be able to store in the cache concurrently, and;
> 2) How many unique lookup keys do we expect to be looked up, and; 3)
> The total number of expected lookups.   #1 is quite difficult to
> figure out (maybe by looking at row width and row estimates) and
> there's just no information about #2. #3 is already shown, in the
> Nested Loop's outer side.
It depends on the task. If you are asking how precise the group 
estimation was (to check the statistics), I agree. In the cases I have 
seen, the main question is how effective a Memoize node was (or may 
be), i.e. how often the incoming key hits the cache. In that case, the 
hit ratio fraction is more understandable for a broad audience.
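For reference, here is a simplified sketch (not the actual PostgreSQL 
code, just a paraphrase of the logic in cost_memoize_rescan() in 
costsize.c; the function names are mine) of how the three factors you 
list combine into the two ratios I proposed to show:

#include <math.h>               /* fmin(), fmax() */

/* calls is #3, ndistinct is #2, est_cache_entries is #1 */
static double
estimate_hit_ratio(double calls, double ndistinct, double est_cache_entries)
{
    /* every repeated key value is a potential cache hit ... */
    double repeat_frac = (calls - ndistinct) / calls;

    /* ... scaled by the share of keys the cache can hold at once */
    double fit_frac = est_cache_entries / fmax(ndistinct, est_cache_entries);

    return repeat_frac * fit_frac;
}

static double
estimate_evict_ratio(double ndistinct, double est_cache_entries)
{
    /* fraction of distinct keys that cannot all stay cached at once */
    return 1.0 - fmin(est_cache_entries, ndistinct) / ndistinct;
}

With calls = 1000, ndistinct = 200, and a cache large enough for all 
200 keys, this gives hit_ratio = 0.8 and evict_ratio = 0.0, as in the 
example above.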
That is why, in my experience, when the cache reusability factor is 
good, users are usually okay with increasing the cache size to the 
necessary numbers and avoiding evictions at all costs. So, the 
predicted evict_ratio also tells us whether incrementing work_mem would 
enhance the chances of memoisation.
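(The reason work_mem enters the picture at all: paraphrasing 
cost_memoize_rescan() again, the cache capacity estimate is roughly

    est_cache_entries = floor(hash_mem_bytes / est_entry_bytes);

where hash_mem_bytes is work_mem scaled by hash_mem_multiplier, so 
raising work_mem raises est_cache_entries and drives the evict_ratio in 
the sketch above toward zero.)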
Having written that last sentence, I came back to the question of why 
work_mem is so universal and is used at each node as the criterion for 
memory allocation size. But that is a different story, I think.

-- 
regards, Andrei Lepikhov


