Yet another LIKE-indexing scheme - Mailing list pgsql-hackers

From Tom Lane
Subject Yet another LIKE-indexing scheme
Date
Msg-id 1429.967916387@sss.pgh.pa.us
Whole thread Raw
Responses Re: Yet another LIKE-indexing scheme
Re: Yet another LIKE-indexing scheme
List pgsql-hackers
I had another thought about fixing our problems with deriving index
bounds for LIKE patterns in non-ASCII locales.  (If you don't remember
the gory details here, please re-read threadSigh, LIKE indexing is *still* broken in foreign locales
from pgsql-hackers archives of 7 to 10 June, 2000; there are also many
previous go-rounds about this long-standing issue.)

The problems that I've been told about seem to center around one- and
two-character patterns that have special sort rules in some locales.
Could we work around these problems by dropping one or perhaps two
characters from the end of the given LIKE prefix?  For example, givenWHERE name LIKE 'foobar%'
drop the last fixed character ('r') and generate index bounds from what
remains, using the same algorithm as in 7.0.  So the index bounds would
becomeWHERE name >= 'fooba' AND name < 'foobb'
(at least in ASCII locale --- to make the upper bound, we'd search for
a string considered greater than 'fooba' by the local strcmp()).

The truncation would need to be multibyte-aware, of course.

This would, for example, fix the example given by Erich Stamberger:

> Another interresting feature of Czech collation is:
> 
> H < "CH" < I
> 
> and:
> 
> B < C < C + CARON < D .. < H < "CH" < I
> 
> So what happens with "WHERE name like 'Czec%`" ?

Our existing code fails because it generates WHERE name >= 'Czec' AND
name < 'Czed'; it will therefore not find names beginning 'Czech'
because those are in another part of the index, between 'Czeh' and
'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

Are there examples where this still doesn't work?  (Funny sort rules
for trigraphs would break it, I'm sure, unless we drop two characters
instead of just one.)

Obviously we could still keep the last character in ASCII locale.
That would be a good thing since it'd reduce the number of tuples
scanned.  Is there a portable way to determine whether it's safe to
do so in other locales?  (Some inquiry function about whether the sort
ordering has any digraph or two-to-one rules might help, but I don't
know if there is one.)
        regards, tom lane


pgsql-hackers by date:

Previous
From: Alfred Perlstein
Date:
Subject: Really bad/weird stuff with views over tables in 7.0.2
Next
From: Stephan Szabo
Date:
Subject: Re: Really bad/weird stuff with views over tables in 7.0.2