Thread: Minimum selectivity estimate for LIKE 'prefix%'
Peter complained awhile back about unrealistically small selectivity estimates for LIKE with a fixed-prefix pattern: http://archives.postgresql.org/pgsql-hackers/2007-12/msg00939.php I had suspected that this was related to the locale-specific problems we fixed a few months ago, but having finally had a chance to look at the data off-list, there doesn't seem to be any such component to the problem. What it boils down to is that when we generate a range constraint based on the prefix, such as col >= 'prefix' AND col < 'prefiy' if the prefix is more than a few characters long then the two endpoint values are indistinguishable as far as comparison to the histogram is concerned, and so we come out with a selectivity estimate that is zero to within roundoff error. This is unreasonably optimistic and can lead to bad plan choices. What I propose doing about this is a small variant on Peter's original suggestion: compute the estimated selectivity for col = 'prefix' and clamp the result of prefix_selectivity to be at least that. This is plausible on intuitive grounds since the range constraint must surely include at least these values. Furthermore, it eliminates what had been an entirely ad-hoc choice of a lower bound (the code was clamping to at least 1e-10, which is surely unreasonably optimistic). The end result of this, for the case Peter is interested in where there are no especially common values, is that the minimum selectivity estimate for LIKE 'prefix%' will be essentially 1/ndistinct where ndistinct is the estimated number of distinct values in the column, because that's what eqsel() does in such cases. I attach a proposed patch against HEAD for this. It's a bit long but most of the bulk is refactoring eqsel() to make it easy to use from prefix_selectivity(). The intellectual content is just in the last diff hunk. To help out Peter's client, who's running 8.1.x, we'd have to backpatch at least that far. We backpatched the last round of LIKE selectivity fixes to 8.1, so I don't have too much hesitation about doing the same here. Comments? regards, tom lane
Attachment
Am Donnerstag, 6. März 2008 schrieb Tom Lane: > To help out Peter's client, who's running 8.1.x, we'd have to backpatch > at least that far. We backpatched the last round of LIKE selectivity > fixes to 8.1, so I don't have too much hesitation about doing the same > here. Thanks for the offer. Personally, I'd be OK with not backpatching and dealing with the patch myself. We've patched that area before, but I'm not sure what the original cause of that was, but this here sounds more like a feature improvement.
Am Donnerstag, 6. März 2008 schrieb Tom Lane: > I attach a proposed patch against HEAD for this. It's a bit long > but most of the bulk is refactoring eqsel() to make it easy to use from > prefix_selectivity(). The intellectual content is just in the last > diff hunk. Looking at the patch as committed here <http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>, I am confused by one hunk: *************** prefix_selectivity(VariableStatData *var *** 4452,4468 **** * "x < greaterstr". *------- */ - cmpopr = get_opfamily_member(opfamily, vartype, vartype, - BTLessStrategyNumber); - if (cmpopr == InvalidOid) - elog(ERROR, "no < operator for opfamily %u", opfamily); - fmgr_info(get_opcode(cmpopr), &opproc); - greaterstrcon = make_greater_string(prefixcon, &opproc); if (greaterstrcon) { Selectivity topsel; topsel = ineq_histogram_selectivity(vardata, &opproc, false, greaterstrcon->constvalue, greaterstrcon->consttype); --- 4503,4519 ---- * "x < greaterstr". *------- */ greaterstrcon = make_greater_string(prefixcon, &opproc); if (greaterstrcon) { Selectivity topsel; + cmpopr = get_opfamily_member(opfamily, vartype, vartype, + BTLessStrategyNumber); + if (cmpopr == InvalidOid) + elog(ERROR, "no < operator for opfamily %u", opfamily); + fmgr_info(get_opcode(cmpopr), &opproc); + topsel = ineq_histogram_selectivity(vardata, &opproc, false, greaterstrcon->constvalue, greaterstrcon->consttype); make_greater_string() is documented as * The caller must provide the appropriate "less than" comparison function * for testing the strings. but using the new code flow it appears that opproc would still refer to the >= operator looked up earlier. The 8.1 code calls this variable ltproc because it didn't need the >= operator. Perhaps using two variables would also be clearer here.
Peter Eisentraut <peter_e@gmx.net> writes: > Am Donnerstag, 6. M�rz 2008 schrieb Tom Lane: >> I attach a proposed patch against HEAD for this. �It's a bit long >> but most of the bulk is refactoring eqsel() to make it easy to use from >> prefix_selectivity(). �The intellectual content is just in the last >> diff hunk. > Looking at the patch as committed here > <http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>, > I am confused by one hunk: Ack, that's a bug --- I was looking at the use of opproc by ineq_histogram_selectivity and failed to notice make_greater_string using it too. Will fix. regards, tom lane