Thread: Minimum selectivity estimate for LIKE 'prefix%'

Minimum selectivity estimate for LIKE 'prefix%'

From

Tom Lane

Date:

06 March 2008, 17:17:02

Peter complained awhile back about unrealistically small selectivity
estimates for LIKE with a fixed-prefix pattern:
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00939.php
I had suspected that this was related to the locale-specific problems
we fixed a few months ago, but having finally had a chance to look
at the data off-list, there doesn't seem to be any such component
to the problem.  What it boils down to is that when we generate a
range constraint based on the prefix, such as
    col >= 'prefix' AND col < 'prefiy'
if the prefix is more than a few characters long then the two endpoint
values are indistinguishable as far as comparison to the histogram
is concerned, and so we come out with a selectivity estimate that
is zero to within roundoff error.  This is unreasonably optimistic
and can lead to bad plan choices.

What I propose doing about this is a small variant on Peter's original
suggestion: compute the estimated selectivity for
    col = 'prefix'
and clamp the result of prefix_selectivity to be at least that.
This is plausible on intuitive grounds since the range constraint
must surely include at least these values.  Furthermore, it eliminates
what had been an entirely ad-hoc choice of a lower bound (the code
was clamping to at least 1e-10, which is surely unreasonably
optimistic).

The end result of this, for the case Peter is interested in where there
are no especially common values, is that the minimum selectivity
estimate for LIKE 'prefix%' will be essentially 1/ndistinct where
ndistinct is the estimated number of distinct values in the column,
because that's what eqsel() does in such cases.

I attach a proposed patch against HEAD for this.  It's a bit long
but most of the bulk is refactoring eqsel() to make it easy to use from
prefix_selectivity().  The intellectual content is just in the last
diff hunk.

To help out Peter's client, who's running 8.1.x, we'd have to backpatch
at least that far.  We backpatched the last round of LIKE selectivity
fixes to 8.1, so I don't have too much hesitation about doing the same
here.

Comments?

            regards, tom lane

Attachment

msg-22127-28399.bin

Re: Minimum selectivity estimate for LIKE 'prefix%'

From

Peter Eisentraut

Date:

07 March 2008, 14:29:19

Am Donnerstag, 6. März 2008 schrieb Tom Lane:
> To help out Peter's client, who's running 8.1.x, we'd have to backpatch
> at least that far.  We backpatched the last round of LIKE selectivity
> fixes to 8.1, so I don't have too much hesitation about doing the same
> here.

Thanks for the offer.  Personally, I'd be OK with not backpatching and dealing
with the patch myself.  We've patched that area before, but I'm not sure what
the original cause of that was, but this here sounds more like a feature
improvement.

Re: Minimum selectivity estimate for LIKE 'prefix%'

From

Peter Eisentraut

Date:

17 March 2008, 13:19:51

Am Donnerstag, 6. März 2008 schrieb Tom Lane:
> I attach a proposed patch against HEAD for this.  It's a bit long
> but most of the bulk is refactoring eqsel() to make it easy to use from
> prefix_selectivity().  The intellectual content is just in the last
> diff hunk.

Looking at the patch as committed here
<http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>,
I am confused by one hunk:

*************** prefix_selectivity(VariableStatData *var
*** 4452,4468 ****
       *    "x < greaterstr".
       *-------
       */
-     cmpopr = get_opfamily_member(opfamily, vartype, vartype,
-                                  BTLessStrategyNumber);
-     if (cmpopr == InvalidOid)
-         elog(ERROR, "no < operator for opfamily %u", opfamily);
-     fmgr_info(get_opcode(cmpopr), &opproc);
-
      greaterstrcon = make_greater_string(prefixcon, &opproc);
      if (greaterstrcon)
      {
          Selectivity topsel;

          topsel = ineq_histogram_selectivity(vardata, &opproc, false,
                                              greaterstrcon->constvalue,
                                              greaterstrcon->consttype);
--- 4503,4519 ----
       *    "x < greaterstr".
       *-------
       */
      greaterstrcon = make_greater_string(prefixcon, &opproc);
      if (greaterstrcon)
      {
          Selectivity topsel;

+         cmpopr = get_opfamily_member(opfamily, vartype, vartype,
+                                      BTLessStrategyNumber);
+         if (cmpopr == InvalidOid)
+             elog(ERROR, "no < operator for opfamily %u", opfamily);
+         fmgr_info(get_opcode(cmpopr), &opproc);
+
          topsel = ineq_histogram_selectivity(vardata, &opproc, false,
                                              greaterstrcon->constvalue,
                                              greaterstrcon->consttype);

make_greater_string() is documented as

 * The caller must provide the appropriate "less than" comparison function
 * for testing the strings.

but using the new code flow it appears that opproc would still refer to
the >= operator looked up earlier.

The 8.1 code calls this variable ltproc because it didn't need the >= operator.
Perhaps using two variables would also be clearer here.

Re: Minimum selectivity estimate for LIKE 'prefix%'

From

Tom Lane

Date:

17 March 2008, 14:02:32

Peter Eisentraut <peter_e@gmx.net> writes:
> Am Donnerstag, 6. M�rz 2008 schrieb Tom Lane:
>> I attach a proposed patch against HEAD for this. �It's a bit long
>> but most of the bulk is refactoring eqsel() to make it easy to use from
>> prefix_selectivity(). �The intellectual content is just in the last
>> diff hunk.

> Looking at the patch as committed here
> <http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c.diff?r1=1.243;r2=1.244>,
> I am confused by one hunk:

Ack, that's a bug --- I was looking at the use of opproc by
ineq_histogram_selectivity and failed to notice make_greater_string
using it too.  Will fix.

            regards, tom lane