Thread: Patch: improve selectivity estimation for IN/NOT IN
Hello, the attached patch improves the array selectivity estimation for = ANY and <> ALL, hence for the IN/NOT IN operators, to avoid the shortcoming described in <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>. Cheers, -- Daniele
Attachment
Daniele Varrazzo <daniele.varrazzo@gmail.com> writes: > the attached patch improves the array selectivity estimation for = ANY > and <> ALL, hence for the IN/NOT IN operators, to avoid the > shortcoming described in > <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>. In connection with Alexander Korotkov's array-estimation patch, I just committed some code into scalararraysel() that checks whether the operator is equality or inequality of the array element type. It does that by consulting the default btree or hash opclass for the element type. I did that with the thought that it could be used to attack this issue too, but I see that you've done it another way, ie check to see whether the operator uses eqsel() or neqsel() as selectivity estimator. I'm not sure offhand which way is better. It could be argued that yours is more appropriate because if the operator isn't btree equality, but acts enough like it to use eqsel() as estimator, then it's still appropriate for scalararraysel() to treat it as equality. On the other side of the coin, an operator might be equality but have reason to use some operator-specific estimator rather than eqsel(). We have real examples of the former (such as the approximate-equality geometric operators) but I think the latter case is just hypothetical. Another thing that has to be thought about is that there are numerous cross-type operators that use eqsel, such as date-vs-timestamp, and it's far from clear whether it's appropriate for scalararraysel() to use the modified stats calculation when dealing with one of these. The btree-based logic is impervious to that since it won't match any cross-type operator. Thoughts? (BTW, in any case I don't trust function pointer comparison to be portable. It'd be a lot safer to look at the function OID.) regards, tom lane
On Sun, Mar 4, 2012 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Daniele Varrazzo <daniele.varrazzo@gmail.com> writes: >> the attached patch improves the array selectivity estimation for = ANY >> and <> ALL, hence for the IN/NOT IN operators, to avoid the >> shortcoming described in >> <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>. > > In connection with Alexander Korotkov's array-estimation patch, > I just committed some code into scalararraysel() that checks whether the > operator is equality or inequality of the array element type. It looks like you have grand plans for array estimation. My patch has a much more modest scope, and I'm hoping it could be applied to currently maintained PG versions, as I consider the currently produced estimations a bug. > It does > that by consulting the default btree or hash opclass for the element > type. I did that with the thought that it could be used to attack this > issue too, but I see that you've done it another way, ie check to see > whether the operator uses eqsel() or neqsel() as selectivity estimator. > > I'm not sure offhand which way is better. It could be argued that yours > is more appropriate because if the operator isn't btree equality, but acts > enough like it to use eqsel() as estimator, then it's still appropriate > for scalararraysel() to treat it as equality. On the other side of the > coin, an operator might be equality but have reason to use some > operator-specific estimator rather than eqsel(). We have real examples > of the former (such as the approximate-equality geometric operators) > but I think the latter case is just hypothetical. Another thing that > has to be thought about is that there are numerous cross-type operators > that use eqsel, such as date-vs-timestamp, and it's far from clear > whether it's appropriate for scalararraysel() to use the modified stats > calculation when dealing with one of these. The btree-based logic is > impervious to that since it won't match any cross-type operator. > > Thoughts? > > (BTW, in any case I don't trust function pointer comparison to be > portable. It'd be a lot safer to look at the function OID.) My original idea was to compare the comparison function with the catalog: I just don't know how to inspect the catalog/perform the check. Studying how to do it I've noticed the fn_addr referring to the "well known" eqsel/neqsel functions and thought about using it as indicators of the right operator semantics. If you are still interested in the patch, for sake of bugfixing, and somebody provides an example in the source about how to compare the functions oid, I can try to improve it. -- Daniele
On 04-03-2012 00:20, Daniele Varrazzo wrote: > It looks like you have grand plans for array estimation. My patch has > a much more modest scope, and I'm hoping it could be applied to > currently maintained PG versions, as I consider the currently produced > estimations a bug. > We don't normally add new features to stable branches unless it is a bug. In the optimizer case, planner regression is a bug (that this case is not). -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte24x7 e Treinamento
On Sun, Mar 4, 2012 at 1:44 PM, Euler Taveira de Oliveira <euler@timbira.com> wrote: > On 04-03-2012 00:20, Daniele Varrazzo wrote: >> It looks like you have grand plans for array estimation. My patch has >> a much more modest scope, and I'm hoping it could be applied to >> currently maintained PG versions, as I consider the currently produced >> estimations a bug. >> > We don't normally add new features to stable branches unless it is a bug. In > the optimizer case, planner regression is a bug (that this case is not). Please note that we are talking about planning errors leading to estimates of records in the millions instead of in the units, in queries where all the elements are known (most common elements, with the right stats, included in the query), even with a partial index whose size could never physically contain all the estimated rows screaming "something broken here!". So, it's not a regression as the computation has always been broken, but I think it can be hardly considered not a bug. OTOH, I expect the decision of leaving things as they are could be taken on the basis of the possibility that some program may stop working in reaction of an altered query plan: I'm not going to argue about this, although I feel it unlikely. -- Daniele
Daniele Varrazzo <daniele.varrazzo@gmail.com> writes: > On Sun, Mar 4, 2012 at 1:44 PM, Euler Taveira de Oliveira > <euler@timbira.com> wrote: >> On 04-03-2012 00:20, Daniele Varrazzo wrote: >>> It looks like you have grand plans for array estimation. My patch has >>> a much more modest scope, and I'm hoping it could be applied to >>> currently maintained PG versions, as I consider the currently produced >>> estimations a bug. >> We don't normally add new features to stable branches unless it is a bug. In >> the optimizer case, planner regression is a bug (that this case is not). > Please note that we are talking about planning errors leading to > estimates of records in the millions instead of in the units, We're also talking about a proposed patch that is not clearly a bug fix, but is a change in a heuristic, meaning it has the potential to make things worse in some cases. (Notably, for arrays that *don't* contain all-distinct values, the estimates are likely to get worse.) So I wasn't thinking of it as being back-patchable anyway. It's generally better not to destabilize planner behavior in minor releases, even if it's a 90%-of-the-time improvement. regards, tom lane
I wrote: > Daniele Varrazzo <daniele.varrazzo@gmail.com> writes: >> the attached patch improves the array selectivity estimation for = ANY >> and <> ALL, hence for the IN/NOT IN operators, to avoid the >> shortcoming described in >> <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>. I've committed a modified version of this patch. > I'm not sure offhand which way is better. It could be argued that yours > is more appropriate because if the operator isn't btree equality, but acts > enough like it to use eqsel() as estimator, then it's still appropriate > for scalararraysel() to treat it as equality. On the other side of the > coin, an operator might be equality but have reason to use some > operator-specific estimator rather than eqsel(). After some reflection I decided it was probably sane to use both methods, that is apply the disjoint-probabilities calculation if the operator appears to be equality/inequality according to either rule. However, I also put in the safety valve I suggested in the -performance thread, that the code fall back to the old calculation if the assumption of disjoint probabilities yields an impossible result. The patch as you had it would have just clamped such a result to 1 or 0, which didn't seem to me to be the best we could do. regards, tom lane