On 24/12/2024 08:20, wenhui qiu wrote:
> sysbench=# explain analyze select userid from dba_users where username
> like '%aaaaaaaaaaaa%' order by userid limit 1;
> QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.43..408.59 rows=1 width=4) (actual
> time=1433.469..1433.470 rows=0 loops=1)
> -> Index Scan using dba_users_pkey on dba_users
> (cost=0.43..244892.51 rows=600 width=4) (actual
> time=1433.468..1433.468 rows=0 loops=1)
> Filter: ((username)::text ~~ '%aaaaaaaaaaaa%'::text)
> Rows Removed by Filter: 6000000
> Planning Time: 0.384 ms
> Execution Time: 1433.489 ms
> (6 rows)
The problem seems to be that the planner estimates the LIKE qual to
match 600 rows, but in reality it matches 0.
> I think the total_cost calculation method cannot be linear, because the
> data distribution is not linear in reality. I try to change it to a
> logarithmic
Why? I don't see anything wrong with the LIMIT estimation here.
One factor here is that the execution time of the Index Scan + Limit
plan depends heavily on whether there are 0 matches, or more. With 0
matches, the Index Scans needs to scan the whole table. With 1 rows, it
needs to scan just half of the table on average before finding the
match. With 10 rows, just 10 % of the table, and so forth. So the
execution time is indeed highly non-linear depending on whether there
are any matches or not. Is that what you meant?
It'd be nice to somehow take that non-linearity into account in the
estimation. But I don't know how, and I don't think that change in LIMIT
estimation is the right way.
> The reality may be more complicated, I mean there is no better way for
> the community leaders to solve this problem, because the community will
> never say no to the hint, and there are so many such problems
It's a difficult topic because "hints" mean different things to
different people. Meanwhile, we have added things like "ALTER FUNCTION
... ROWS <rows>" and support functions which can do more accurate
estimation of functions. That's a kind of a hint. If someone comes up
with more ways to help the planner with selectivity estimation, it's
worth considering.
Or maybe you could improve the selectivity estimator of the LIKE
operator to be more accurate to begin with.
--
Heikki Linnakangas
Neon (https://neon.tech)