Thread: LIKE indexing

LIKE indexing

From
Peter Eisentraut
Date:
Here's the patch for review.  It adds the non-locale operator classes for
all three character types and the analogous selectivity estimation
changes.  I'm confident this works, but since some people seem to have
doubts, I'm posting it here first.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter

Attachment

Re: LIKE indexing

From
Tom Lane
Date:
I'm fairly close to committing a wholesale rearrangement of pg_opclass
and friends (per previous discussions, mostly with Oleg).  This is going
to create some conflicts with your patch :-(.  Who gets to go first?

            regards, tom lane

Re: LIKE indexing

From
Peter Eisentraut
Date:
Tom Lane writes:

> I'm fairly close to committing a wholesale rearrangement of pg_opclass
> and friends (per previous discussions, mostly with Oleg).  This is going
> to create some conflicts with your patch :-(.  Who gets to go first?

If your changes only conflict in the system catalog headers I can redo
those when you're done.  (Presuming there's documentation coming along.)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: LIKE indexing

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> If your changes only conflict in the system catalog headers I can redo
> those when you're done.  (Presuming there's documentation coming along.)

Actually, all those uses of op_class() are going to conflict too...
op_class doesn't need an AM OID parameter anymore (and besides which,
I renamed it to op_in_opclass).

If you feel ready to commit, do so, and I'll do the merge.

            regards, tom lane

Re: LIKE indexing

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Here's the patch for review.

A few gripes:

+    The optimizer can also use a B-Tree index for queries involving the
+    pattern matching operators <literal>LIKE</>,
+    <literal>ILIKE</literal>, <literal>~</literal>, and
+    <literal>~*</literal>, <emphasis>if</emphasis> the pattern is
+    anchored to the beginning of the string, e.g., <literal>col LIKE
+    'foo%'</literal> or <literal>col ~ '^foo'</literal>, but not
+    <literal>col LIKE 'bar'</literal>.  However, if your server does

The "but not" part is wrong: col LIKE 'bar' works perfectly fine as
an indexable LIKE query.  Perhaps you meant "but not col LIKE '%foo'".
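
To illustrate the anchoring rule, here is a minimal Python sketch of how a
fixed prefix could be pulled out of a LIKE pattern and turned into an index
condition.  It is not the code from the patch: the function names, the
backslash escape character, and the plain code-point ordering are all
assumptions made for the example.

```python
def like_fixed_prefix(pattern, escape="\\"):
    """Return (prefix, exact) for a LIKE pattern.

    prefix is the literal text before the first wildcard; exact is True
    when the pattern contains no unescaped wildcard at all.
    """
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "%_":
            return "".join(prefix), False
        if ch == escape and i + 1 < len(pattern):
            # An escaped character is taken literally.
            i += 1
            ch = pattern[i]
        prefix.append(ch)
        i += 1
    return "".join(prefix), True


def index_range(pattern):
    """Describe a hypothetical index condition for `col LIKE pattern`.

    Returns ("eq", value), ("range", lower, upper) meaning
    lower <= col < upper, or None when the pattern is not anchored.
    Assumes code-point ordering and ignores overflow at the highest
    code point.
    """
    prefix, exact = like_fixed_prefix(pattern)
    if exact:
        return ("eq", prefix)
    if not prefix:
        return None
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return ("range", prefix, upper)


print(index_range("foo%"))   # ('range', 'foo', 'fop')
print(index_range("bar"))    # ('eq', 'bar')
print(index_range("%foo"))   # None
```

Under these assumptions 'bar' is the best case (a plain equality probe),
while '%foo' has no fixed prefix and so gives the index nothing to work with.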

While it's okay to treat text and varchar alike, I object to treating
bpchar as equivalent to the other two.  Shouldn't the bpchar versions of
these functions strip trailing spaces before comparing?

Seems to me you should provide "$<>$" operators for completeness, even
though they're not essential for btree opclasses.  I think that these
operators may be useful for more than just this one purpose, so we
shouldn't set up artificial roadblocks.

I don't like the fact that you added expected-output rows to opr_sanity;
seems like tweaking the queries to allow $<$ etc as expected names would
be more appropriate.

            regards, tom lane

Re: LIKE indexing

From
Peter Eisentraut
Date:
Tom Lane writes:

> The "but not" part is wrong: col LIKE 'bar' works perfectly fine as
> an indexable LIKE query.  Perhaps you meant "but not col LIKE '%foo'".

Thanks.  That was a mixup with the POSIX regexp style.

> While it's okay to treat text and varchar alike, I object to treating
> bpchar as equivalent to the other two.  Shouldn't the bpchar versions of
> these functions strip trailing spaces before comparing?

I thought about this for a long time and couldn't see a reason to.
The LIKE operator for bpchar does take the blanks into account, so it
effectively doesn't care whether the blanks are the result of padding or
explicit input.  E.g.,

peter=# set enable_indexscan to off;
SET VARIABLE
peter=# create table test1 (a char(5));
CREATE
peter=# insert into test1 values ('four');
INSERT 16560 1
peter=# select * from test1 where a like 'four'::bpchar;
 a
---
(0 rows)

/*
 * If we had stripped spaces here we would have gotten a false positive.
 */

peter=# select * from test1 where a like 'fou_'::bpchar;
 a
---
(0 rows)

/*
 * Since the padding here is after the wildcard character and is thus
 * stripped in the analysis, the augmented expression still holds.
 */

peter=# select * from test1 where a like 'fou%'::bpchar;
   a
-------
 four
(1 row)

/* same here */
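
The three results above can be reproduced with a small Python model.  This
is only a sketch: pad() stands in for char(5) storage, like() is a
simplified LIKE matcher without escape handling that treats blanks as
significant, and neither is PostgreSQL's actual implementation.

```python
import re


def pad(value, width):
    """Blank-pad a value the way a char(width) column would store it."""
    return value.ljust(width)


def like(value, pattern):
    """Simplified LIKE: % matches any string, _ matches one character.

    Blanks are ordinary characters here, and escapes are not handled.
    """
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, value, re.DOTALL) is not None


stored = pad("four", 5)          # 'four ' with one trailing blank

print(like(stored, "four"))      # False: the trailing blank is not matched
print(like(stored, "fou_"))      # False: pattern covers only four characters
print(like(stored, "fou%"))      # True: % absorbs 'r' and the padding
```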

I would also argue that the notion of a direct binary comparison would
not benefit from space stripping.

> Seems to me you should provide "$<>$" operators for completeness, even
> though they're not essential for btree opclasses.

Will do.

> I don't like the fact that you added expected-output rows to opr_sanity;
> seems like tweaking the queries to allow $<$ etc as expected names would
> be more appropriate.

Ok.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: LIKE indexing

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> peter=# create table test1 (a char(5));
> CREATE
> peter=# insert into test1 values ('four');
> INSERT 16560 1
> peter=# select * from test1 where a like 'four'::bpchar;
>  a
> ---
> (0 rows)

I think this is an erroneous result, actually, seeing as how

regression=# select 'four '::bpchar = 'four'::bpchar;
 ?column?
----------
 t
(1 row)

How can A = B not imply A LIKE B?  (This may be related to Hiroshi's
concerns.)
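
The inconsistency can be stated compactly in Python.  This is a sketch
under two assumptions: bpchar equality is modelled as comparison after
stripping trailing blanks (PAD SPACE), and LIKE with a wildcard-free
pattern is modelled as exact string equality with blanks significant.
Both helper names are made up for the example.

```python
def bpchar_eq(a, b):
    """PAD SPACE equality: trailing blanks do not affect the result."""
    return a.rstrip(" ") == b.rstrip(" ")


def like_literal(value, pattern):
    """LIKE for a pattern without wildcards, blanks treated as significant."""
    return value == pattern


a, b = "four ", "four"

print(bpchar_eq(a, b))       # True
print(like_literal(a, b))    # False

# The implication "A = B implies A LIKE B" fails for this pair.
print((not bpchar_eq(a, b)) or like_literal(a, b))   # False
```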

I dug into the spec to see what it has to say, and came up with this
rather opaque prose:

                 4) If the i-th substring specifier of PCV is neither an
                   arbitrary character specifier nor an arbitrary string
                   specifier, then the i-th substring of MCV is equal to
                   that substring specifier according to the collating
                   sequence of the <like predicate>, without the appending
                   of <space> characters to MCV, and has the same length as
                   that substring specifier.

The bit about "without the appending of <space> characters" *might*
mean that LIKE is always supposed to treat trailing blanks as
significant, but I'm not sure.  The text does seem to say that it's okay
to add trailing blanks to the pattern to produce a match, when the
collating sequence is PAD SPACE type (bpchar in our terms).

In any case, Hiroshi is dead right that LIKE is supposed to perform
collating-sequence-dependent comparison, and this probably means that
this whole approach is a dead end :-(
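
To make the concern concrete, here is a Python sketch of how a binary range
scan and a collation-aware match can disagree.  The case-insensitive
"collation" built on str.casefold() is purely a stand-in chosen for the
example; it is not a claim about any locale PostgreSQL supports, and the
row values are invented.

```python
rows = ["Foobar", "foobar", "fop"]

# Binary (code-point) range derived from the pattern 'foo%':
# 'foo' <= col < 'fop'.
binary_scan = [r for r in rows if "foo" <= r < "fop"]

# Prefix match under the hypothetical case-insensitive collation.
collation_aware = [r for r in rows if r.casefold().startswith("foo")]

print(binary_scan)        # ['foobar']
print(collation_aware)    # ['Foobar', 'foobar']
```

In this model the binary range misses 'Foobar', which the collation-aware
comparison would accept.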

            regards, tom lane