Home > mailing lists

Yet another LIKE-indexing scheme - Mailing list pgsql-hackers

From	Tom Lane
Subject	Yet another LIKE-indexing scheme
Date	September 2, 2000 13:39:59
Msg-id	1429.967916387@sss.pgh.pa.us Whole thread Raw
Responses	Re: Yet another LIKE-indexing scheme Re: Yet another LIKE-indexing scheme
List	pgsql-hackers

Tree view

I had another thought about fixing our problems with deriving index
bounds for LIKE patterns in non-ASCII locales.  (If you don't remember
the gory details here, please re-read threadSigh, LIKE indexing is *still* broken in foreign locales
from pgsql-hackers archives of 7 to 10 June, 2000; there are also many
previous go-rounds about this long-standing issue.)

The problems that I've been told about seem to center around one- and
two-character patterns that have special sort rules in some locales.
Could we work around these problems by dropping one or perhaps two
characters from the end of the given LIKE prefix?  For example, givenWHERE name LIKE 'foobar%'
drop the last fixed character ('r') and generate index bounds from what
remains, using the same algorithm as in 7.0.  So the index bounds would
becomeWHERE name >= 'fooba' AND name < 'foobb'
(at least in ASCII locale --- to make the upper bound, we'd search for
a string considered greater than 'fooba' by the local strcmp()).

The truncation would need to be multibyte-aware, of course.

This would, for example, fix the example given by Erich Stamberger:

> Another interresting feature of Czech collation is:
> 
> H < "CH" < I
> 
> and:
> 
> B < C < C + CARON < D .. < H < "CH" < I
> 
> So what happens with "WHERE name like 'Czec%`" ?

Our existing code fails because it generates WHERE name >= 'Czec' AND
name < 'Czed'; it will therefore not find names beginning 'Czech'
because those are in another part of the index, between 'Czeh' and
'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

Are there examples where this still doesn't work?  (Funny sort rules
for trigraphs would break it, I'm sure, unless we drop two characters
instead of just one.)

Obviously we could still keep the last character in ASCII locale.
That would be a good thing since it'd reduce the number of tuples
scanned.  Is there a portable way to determine whether it's safe to
do so in other locales?  (Some inquiry function about whether the sort
ordering has any digraph or two-to-one rules might help, but I don't
know if there is one.)
        regards, tom lane

pgsql-hackers by date:

From: Alfred Perlstein
Date: 02 September 2000, 13:03:25
Subject: Really bad/weird stuff with views over tables in 7.0.2

From: Stephan Szabo
Date: 02 September 2000, 14:01:42
Subject: Re: Really bad/weird stuff with views over tables in 7.0.2

Yet another LIKE-indexing scheme - Mailing list pgsql-hackers

Previous

Next