Thread: LIKE indexing proposal

LIKE indexing proposal

From
Peter Eisentraut
Date:
I would like to re-table my proposal from many moons ago to allow
pattern-matching operations to use indexes under any locale.

The idea is that since pattern matching works on a character-by-character
basis, we also need to interact with the index using operators that do a
comparison strictly on a character-by-character basis.  So we define a
separate set of comparison operators and operator classes for each
character type (text, varchar, bpchar, name), teach the optimizer to use
it for suitably anchored pattern-matching operations, and tell users to
create indexes using that special operator class if they want pattern
matching to use indexes.

Here are a couple of details to discuss:

I named the operators #<#, #>=#, etc.  If someone can think of better
names, let me know.

Since character-by-character comparison is essentially binary comparison,
I named the operator classes, text_binary_ops, etc.  Another idea is to
name them text_like_ops or text_pattern_ops or whatever, so that if some
change in the pattern matching operations would later force us to alter
the behavior of the operators away from being strictly memcmp(), we would
be free to change them.

Should there be a special case for the C locale?  Without extra code, if
you use the C locale and would like to use indexes for both equality and
pattern comparisons, you need two indexes with different operator classes.
That might seem wasteful, but then all locales would really work the same.

I'm unclear on how the selectivity estimation should work.  The system
doesn't collect statistics based on the new #<#-style operators, so the
estimates are based on the normal comparison, which might have little to
do with reality.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: LIKE indexing proposal

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> I would like to re-table my proposal from many moons ago to allow
> pattern-matching operations to use indexes under any locale.

Do you have better answers to the objections that were raised the last
time?

In particular, I wonder whether it's worth putting effort into this at
all, rather than trying to make the contrib full-text-indexing support
mainstream-quality.


> Since character-by-character comparison is essentially binary comparison,
> I named the operator classes, text_binary_ops, etc.  Another idea is to
> name them text_like_ops or text_pattern_ops or whatever,

text_binary_ops seems a bit of a contradiction in terms :-(.  I'd prefer
either of the other choices.

> Should there be a special case for the C locale?

Yes, if only on backwards-compatibility grounds.  Database setups that
worked well before should not get arbitrarily broken.

> I'm unclear on how the selectivity estimation should work.  The system
> doesn't collect statistics based on the new #<#-style operators, so the
> estimates are based on the normal comparison, which might have little to
> do with reality.

Not sure that this really matters; the selectivity estimation for LIKE
is so crude that I doubt it would get significantly worse.  But one
could conceive of storing statistics computed both ways in pg_statistic.
        regards, tom lane



Re: LIKE indexing proposal

From
Peter Eisentraut
Date:
Tom Lane writes:

> Peter Eisentraut <peter_e@gmx.net> writes:
> > I would like to re-table my proposal from many moons ago to allow
> > pattern-matching operations to use indexes under any locale.
>
> Do you have better answers to the objections that were raised the last
> time?

The objection last time was that, according to the SQL standard, LIKE
patterns should use the locale-specific collation order for the fixed
parts (instead of character-by-character comparisons), and that under that
rule my proposed solution would fall down.

Time has shown that this objection is poor on four grounds:

1. LIKE does not work that way, and omitting optimizations now based on
possible future changes is a stupid approach.

2. If LIKE were to be changed that way, we can always take the
optimization out.

3. Even if LIKE were changed, certainly POSIX regexps would never be
changed, so you can still use the optimization there.

4. Nobody is seriously proposing to change LIKE that way.  And if someone
did, I would object, because the established collation rules make getting
consistent results from this approach impossible.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: LIKE indexing proposal

From
"Zeugswetter Andreas SB SD"
Date:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > I would like to re-table my proposal from many moons ago to allow
> > pattern-matching operations to use indexes under any locale.

Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting
with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending)
index fom x >= 'ABC' up to the first key that does not have 'ABC' as first
characters ?

Seems this would be of more general use, than an extra index for LIKE, no ?
Some upper bound would still be needed for a selectivity estimate, but for
estimation purposes it would not really need to be watertight.

Andreas



Re: LIKE indexing proposal

From
"Zeugswetter Andreas SB SD"
Date:
> The objection last time was that, according to the SQL standard, LIKE
> patterns should use the locale-specific collation order for the fixed
> parts (instead of character-by-character comparisons),

Yes, I think that LIKE definitely needs to be character-by-character
and I can think of no example in the german language that would make
general sense (like ss and single byte sharp s (ß) would definitely be
nonsense if handled as equal for LIKE).

Andreas



Re: LIKE indexing proposal

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting 
> with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending) 
> index fom x >= 'ABC' up to the first key that does not have 'ABC' as first 
> characters ?

You've apparently forgotten all our previous history on that subject
:-(.  The above does not work in the presence of special sort rules for
digraphs, etc.  For example, that LIKE should certainly match ABCH ...
but there are locales in which "CH" sorts after "D" and would not be
found by an indexscan that runs from ABC to ABD.

Even with no digraphs, the optimization is broken by locales that treat
spaces as second-class citizens, prefer caseless to case-sensitive
comparisons, etc.  It doesn't work in en_US locale, for example.
        regards, tom lane



Re: LIKE indexing proposal

From
"Zeugswetter Andreas SB SD"
Date:
> You've apparently forgotten all our previous history on that subject
> :-(.  The above does not work in the presence of special sort rules for
> digraphs, etc.  For example, that LIKE should certainly match ABCH ...
> but there are locales in which "CH" sorts after "D" and would not be
> found by an indexscan that runs from ABC to ABD.

Sorry, so is the start condition of >= 'ABC' also not valid ?

A new index scan method with the pattern as input would need to know when it can
safely stop (which is the problem I understand), but would have the advantage of
beeing able to filter unwanted tuples before the heap access.

Andreas