Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
Date
Msg-id 29189.928944532@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem  (Bruce Momjian <maillist@candle.pha.pa.us>)
Responses Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem  (Bruce Momjian <maillist@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> This certainly explains it.  With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is
to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:fld LIKE 'abc%' ==> fld <= 'abc\377'
is not even really right in ASCII locale, because it will reject a
data value like 'abc\377x'.

I think what we really want is to generate the "next value of the
same length" and use a < comparison.  In ASCII locale this meansfld LIKE 'abc%' ==> fld < 'abd'
which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot
about the local character set and collation sequence.  I had
been thinking about a brute force method: generate a string and
check to see whether strcmp claims it's greater than the original
or not; if not, increment the last byte and try again.  You'd
also have to be able to back up and increment earlier bytes if
you maxed out the last one.  For example, in French locale,fld LIKE 'ab\376%'
you'd first produce 'ab\377' but discover that it's less than
'ab\376' (because \377 is y-dieresis which sorts like 'y').
Your next try must be 'ac\377' which will succeed.

But I am worried whether this trick will work in multibyte locales ---
incrementing the last byte might generate an invalid character sequence
and produce unpredictable results from strcmp.  So we need some help
from someone who knows a lot about collation orders and multibyte
character representations.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Oleg Bartunov
Date:
Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
Next
From: The Hermit Hacker
Date:
Subject: Re: [HACKERS] Re: [PORTS] Anyone working on linux Alpha?