Re: Postgres picks suboptimal index after building of an extended statistics - Mailing list pgsql-hackers
| From | Andrei Lepikhov |
|---|---|
| Subject | Re: Postgres picks suboptimal index after building of an extended statistics |
| Date | |
| Msg-id | 6fb43191-2df3-4791-b307-be754e648276@postgrespro.ru |
| In response to | Re: Postgres picks suboptimal index after building of an extended statistics (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
| Responses | Re: Postgres picks suboptimal index after building of an extended statistics |
| List | pgsql-hackers |
Thanks for the detailed answer.

On 3/11/2023 00:37, Tomas Vondra wrote:
> On 9/25/23 06:30, Andrey Lepikhov wrote:
>> ...
>> I can't stop thinking about this issue. It is bizarre when Postgres
>> chooses a non-unique index if a unique index gives us proof of minimum
>> scan.
>
> That's true, but no one implemented this heuristics. So the "proof of
> minimum scan" is merely hypothetical - the optimizer is unaware of it.

See the simple patch in the attachment. There, I have attempted to resolve situations of uncertainty to avoid making decisions based solely on the order of indexes in the list.

>> I don't see a reason to teach the clauselist_selectivity() routine to
>> estimate UNIQUE indexes. We add some cycles, but it will work with btree
>> indexes only.
>
> I'm not sure I understand what this is meant to say. Can you elaborate?
> We only allow UNIQUE for btree indexes anyway, so what exactly is the
> problem here?

You partly answered this yourself below: we have unique-index estimation in a few estimation calls, but we would go through the list of indexes each time. Also, for this sake, we would have to add some input parameters, usually NULL, because many estimations don't involve indexes at all.

>> Maybe to change compare_path_costs_fuzzily() and add some heuristic, for
>> example:
>> "If selectivity of both paths gives us no more than 1 row, prefer to use
>> a unique index or an index with least selectivity."
>
> I don't understand how this would work. What do you mean by "selectivity
> of a path"? AFAICS the problem here is that we estimate a path to return
> more rows (while we know there can only be 1, thanks to UNIQUE index).

Oops, I meant cardinality. See the patch in the attachment.

> So how would you know the path does not give us more than 1 row? Surely
> you're not proposing compare_path_costs_fuzzily() to do something
> expensive like analyzing the paths, or something.
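The heuristic quoted above ("prefer a unique index when both paths give no more than 1 row") can be sketched with a simplified model. This is only an illustration of the tie-break idea, not the actual compare_path_costs_fuzzily() code; the dict fields `total_cost`, `rows`, and `is_unique` are hypothetical stand-ins for the planner's path data:

```python
# Simplified model of the proposed tie-break: when two paths have
# fuzzily-equal costs and both are estimated to return at most one row,
# prefer the path backed by a unique index over list order.

def pick_path(path_a, path_b, fuzz_factor=1.01):
    """Each path is a dict with 'total_cost', 'rows', 'is_unique'."""
    # Costs clearly differ: the cheaper path wins, as today.
    if path_a["total_cost"] > path_b["total_cost"] * fuzz_factor:
        return path_b
    if path_b["total_cost"] > path_a["total_cost"] * fuzz_factor:
        return path_a
    # Fuzzily equal costs and both promise at most one row:
    # break the tie in favor of the unique-index path.
    if path_a["rows"] <= 1 and path_b["rows"] <= 1:
        if path_a["is_unique"] and not path_b["is_unique"]:
            return path_a
        if path_b["is_unique"] and not path_a["is_unique"]:
            return path_b
    return path_a  # otherwise keep the first one (list order), as before

unique_scan = {"total_cost": 8.3, "rows": 1, "is_unique": True}
plain_scan  = {"total_cost": 8.3, "rows": 1, "is_unique": False}
print(pick_path(plain_scan, unique_scan) is unique_scan)  # True
```

With equal costs and one-row estimates on both sides, the unique-index path wins regardless of which one was built first, which is the "consistency" point made below.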
I only propose to make the optimizer more consistent in its decisions: if both path candidates are estimated to return one row, use the uniqueness of the index or the selectivity value as one more parameter.

> Also, how would it work in upper levels? If you just change which path
> we keep, but leave the inaccurate row estimate in place, that may fix
> that level, but it's certainly going to confuse later planning steps.

It is designed for the scan level only.

> IMHO the problem here is that we produce wrong estimate, so we better
> fix that, instead of adding band-aids to other places.

Agreed. I am looking for a solution to help users resolve such problems somehow. As an alternative, I can propose a selectivity hook or (maybe even better) the pathlist approach: add indexes into the index list in some predefined order, placing unique indexes with more columns at the first positions, etc.

> This happens because functional dependencies are very simple type of
> statistics - it has some expectations about the input data and also the
> queries executed on it. For example it assumes the data is reasonably
> homogeneous, so that we can calculate a global "degree".
>
> But the data in the example directly violates that - it has 1000 rows
> that are very random (so we'd find no dependencies), and 1000 rows with
> perfect dependencies. Hence we end with degree=0.5, which approximates
> the dependencies to all data. Not great, true, but that's the price for
> simplicity of this statistics kind.
>
> So the simplest solution is to disable dependencies on such data sets.
> It's a bit inconvenient/unfortunate we build dependencies by default,
> and people may not be aware of these assumptions.
>
> Perhaps we could/should make dependency_degree() more pessimistic when
> we find some "violations" of the rule (we intentionally are not strict
> about it, because data are imperfect). I don't have a good idea how to
> change the formulas, but I'm open to the idea in principle.
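The degree=0.5 outcome described above can be reproduced numerically. This is a sketch of how a dependency degree blends the per-clause selectivities (in the spirit of the formula applied in dependencies.c, s = f*s1 + (1-f)*s1*s2); the column cardinalities are invented for the example:

```python
# Sketch: a functional-dependency degree interpolates between the fully
# dependent estimate (s_a) and the independence assumption (s_a * s_b).
# All numbers below are hypothetical, chosen only to illustrate the effect.

def dependency_selectivity(degree, s_a, s_b):
    # degree: fraction of rows where a -> b holds;
    # s_a, s_b: per-clause selectivities of "a = x" and "b = y"
    return degree * s_a + (1.0 - degree) * s_a * s_b

# 1000 rows with a perfect a -> b dependency mixed with 1000 random rows:
# the global degree lands around 0.5.
degree = 1000 / 2000.0

s_a = 1 / 100.0   # hypothetical: 100 distinct values of column a
s_b = 1 / 100.0   # hypothetical: 100 distinct values of column b

est = dependency_selectivity(degree, s_a, s_b)          # ~0.00505
independent = s_a * s_b                                 # 0.0001
print(round(est / independent))                         # ~50x overestimate
```

The blended estimate is roughly 50 times the independence-based one here, which is how a degree averaged over heterogeneous data inflates row counts for the "random" half of the table.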
Thanks for the explanation!

> The other thing we could do is checking for unique indexes in
> clauselist_selectivity, and if we find a match we can just skip the
> extended statistics altogether. Not sure how expensive this would be,
> but for typical cases (with modest number of indexes) perhaps OK. It
> wouldn't work without a unique index, but I don't have a better idea.

It looks a bit expensive to me, but I am ready to try it if the current solution doesn't look applicable.

--
regards, Andrei Lepikhov
Postgres Professional
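For reference, the unique-index check suggested in the quote above might look roughly like this. It is a hedged sketch under simplified assumptions, not Postgres code: `clauses` is modeled as the set of columns with equality conditions, and `indexes` as (columns, is_unique) pairs standing in for the relation's IndexOptInfo list:

```python
# Sketch of the proposed shortcut: if the equality clauses cover all
# columns of some unique index, the scan can return at most one row,
# so selectivity is capped at 1/ntuples regardless of what extended
# statistics would otherwise estimate.

def clauselist_selectivity(clauses, indexes, ntuples, fallback_estimate):
    """clauses: set of columns under equality conditions;
    indexes: list of (columns, is_unique) tuples."""
    eq_cols = set(clauses)
    for columns, is_unique in indexes:
        if is_unique and set(columns) <= eq_cols:
            # Unique index fully covered by equality clauses: <= 1 row.
            return min(fallback_estimate, 1.0 / ntuples)
    return fallback_estimate  # no proof of uniqueness; normal estimation

indexes = [(("a", "b"), True), (("c",), False)]
sel = clauselist_selectivity({"a", "b"}, indexes, ntuples=2000,
                             fallback_estimate=0.00505)
print(sel)  # 0.0005, i.e. exactly one row out of 2000
```

The cost concern raised in the reply corresponds to the loop over `indexes`, executed on every estimation call even when no unique index applies.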
Attachment