Re: adjust_limit_rows_costs algorithm - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: adjust_limit_rows_costs algorithm
Date
Msg-id 843c6b1e-7c75-4383-9b0d-694d93b69d16@iki.fi
Whole thread Raw
In response to adjust_limit_rows_costs algorithm  (wenhui qiu <qiuwenhuifx@gmail.com>)
List pgsql-hackers
On 24/12/2024 08:20, wenhui qiu wrote:
> sysbench=# explain analyze  select userid from dba_users where  username 
> like '%aaaaaaaaaaaa%' order by userid limit 1;
>                                                                   QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
>   Limit  (cost=0.43..408.59 rows=1 width=4) (actual 
> time=1433.469..1433.470 rows=0 loops=1)
>     ->  Index Scan using dba_users_pkey on dba_users 
>   (cost=0.43..244892.51 rows=600 width=4) (actual 
> time=1433.468..1433.468 rows=0 loops=1)
>           Filter: ((username)::text ~~ '%aaaaaaaaaaaa%'::text)
>           Rows Removed by Filter: 6000000
>   Planning Time: 0.384 ms
>   Execution Time: 1433.489 ms
> (6 rows)

The problem seems to be that the planner estimates the LIKE qual to 
match 600 rows, but in reality it matches 0.

> I think the total_cost calculation method cannot be linear, because the 
> data distribution is not linear in reality. I try to change it to a 
> logarithmic

Why? I don't see anything wrong with the LIMIT estimation here.

One factor here is that the execution time of the Index Scan + Limit 
plan depends heavily on whether there are 0 matches, or more. With 0 
matches, the Index Scans needs to scan the whole table. With 1 rows, it 
needs to scan just half of the table on average before finding the 
match. With 10 rows, just 10 % of the table, and so forth. So the 
execution time is indeed highly non-linear depending on whether there 
are any matches or not. Is that what you meant?

It'd be nice to somehow take that non-linearity into account in the 
estimation. But I don't know how, and I don't think that change in LIMIT 
estimation is the right way.

> The reality may be more complicated, I mean there is no better way for 
> the community leaders to solve this problem, because the community will 
> never say no to the hint, and there are so many such problems

It's a difficult topic because "hints" mean different things to 
different people. Meanwhile, we have added things like "ALTER FUNCTION 
... ROWS <rows>" and support functions which can do more accurate 
estimation of functions. That's a kind of a hint. If someone comes up 
with more ways to help the planner with selectivity estimation, it's 
worth considering.

Or maybe you could improve the selectivity estimator of the LIKE 
operator to be more accurate to begin with.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Vladlen Popolitov
Date:
Subject: Re: Windows UTF8 system locale
Next
From: Andrei Lepikhov
Date:
Subject: Re: ERROR: corrupt MVNDistinct entry