Re: Postgres picks suboptimal index after building of an extended statistics - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Postgres picks suboptimal index after building of an extended statistics
Msg-id: d3ce7422-978e-16a1-aa0e-7c749615797b@enterprisedb.com
In response to: Re: Postgres picks suboptimal index after building of an extended statistics (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
List: pgsql-hackers
On 9/25/23 06:30, Andrey Lepikhov wrote:
> ...
> I can't stop thinking about this issue. It is bizarre when Postgres
> chooses a non-unique index if a unique index gives us proof of minimum
> scan.

That's true, but no one has implemented this heuristic. So the "proof of
minimum scan" is merely hypothetical - the optimizer is unaware of it.

> I don't see a reason to teach the clauselist_selectivity() routine to
> estimate UNIQUE indexes. We add some cycles, but it will work with btree
> indexes only.

I'm not sure I understand what this is meant to say. Can you elaborate?
We only allow UNIQUE for btree indexes anyway, so what exactly is the
problem here?

> Maybe to change compare_path_costs_fuzzily() and add some heuristic, for
> example:
> "If selectivity of both paths gives us no more than 1 row, prefer to use
> a unique index or an index with least selectivity."

I don't understand how this would work. What do you mean by "selectivity
of a path"? AFAICS the problem here is that we estimate a path to return
more rows (while we know there can only be 1, thanks to the UNIQUE
index). So how would you know the path does not give us more than 1 row?
Surely you're not proposing that compare_path_costs_fuzzily() do
something expensive like analyzing the paths.

Also, how would it work at the upper levels? If you just change which
path we keep, but leave the inaccurate row estimate in place, that may
fix that level, but it's certainly going to confuse later planning
steps.

IMHO the problem here is that we produce a wrong estimate, so we'd
better fix that, instead of adding band-aids to other places.

This happens because functional dependencies are a very simple type of
statistics - they come with some expectations about the input data and
about the queries executed on it. For example, they assume the data is
reasonably homogeneous, so that we can calculate a global "degree".
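For reference, a sketch of how that global degree feeds into the
estimate: the functional-dependency code combines the per-clause
selectivities roughly as P(a = x AND b = y) = P(a = x) * [degree +
(1 - degree) * P(b = y)]. The numeric inputs below are made up for
illustration; this is a simplified model of the formula, not the actual
planner code:

```python
def combined_selectivity(s_a, s_b, degree):
    """Simplified model of how a functional-dependency degree blends
    two per-clause selectivities into one joint selectivity."""
    return s_a * (degree + (1.0 - degree) * s_b)

s_a = s_b = 0.001  # assumed per-clause selectivities

# With a perfect dependency (degree = 1.0), the joint selectivity
# collapses to P(a = x) alone:
print(combined_selectivity(s_a, s_b, 1.0))   # 0.001

# With degree = 0.5 (the mixed data set), the estimate sits roughly
# halfway between P(a = x) and full independence, which is how the
# "more than 1 row" estimate arises:
print(combined_selectivity(s_a, s_b, 0.5))   # ~0.0005
```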
But the data in the example directly violates that - it has 1000 rows
that are very random (so we'd find no dependencies), and 1000 rows with
perfect dependencies. Hence we end up with degree = 0.5, which applies
the dependency to all the data. Not great, true, but that's the price
for the simplicity of this kind of statistics. So the simplest solution
is to not build dependencies on such data sets. It's a bit
inconvenient/unfortunate that we build dependencies by default, and
people may not be aware of their assumptions.

Perhaps we could/should make dependency_degree() more pessimistic when
we find some "violations" of the rule (we intentionally are not strict
about it, because data are imperfect). I don't have a good idea how to
change the formulas, but I'm open to the idea in principle.

The other thing we could do is check for unique indexes in
clauselist_selectivity(), and if we find a match, just skip the extended
statistics altogether. Not sure how expensive this would be, but for
typical cases (with a modest number of indexes) it's perhaps OK. It
wouldn't work without a unique index, but I don't have a better idea.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
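To see how a half-random/half-perfect data set lands on degree = 0.5,
here is a toy model of the degree computation. It loosely mimics what
dependency_degree() measures (the real code in
src/backend/statistics/dependencies.c works on a sorted sample); the
data layout is invented for illustration:

```python
import random
from collections import defaultdict

def degree(rows):
    """Fraction of rows living in groups where the value of "a"
    fully determines the value of "b" - a loose model of the
    dependency degree, not the actual PostgreSQL algorithm."""
    b_values = defaultdict(set)   # a -> set of b values seen with it
    group_size = defaultdict(int) # a -> number of rows in the group
    for a, b in rows:
        b_values[a].add(b)
        group_size[a] += 1
    supporting = sum(group_size[a] for a in b_values
                     if len(b_values[a]) == 1)
    return supporting / len(rows)

random.seed(42)
# 1000 rows with a perfect dependency: b is a function of a ...
perfect = [(i, i % 10) for i in range(1000)]
# ... plus 1000 "random" rows: a handful of a values, each paired
# with many different b values, so no dependency holds there
noisy = [(random.randint(2000, 2009), random.randint(0, 999))
         for _ in range(1000)]

print(degree(perfect + noisy))  # 0.5
```

Exactly half the rows support the dependency, so the model reports 0.5,
and that single global number then gets applied to the perfectly
dependent half and the random half alike.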