Re: [HACKERS] indexable and locale - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] indexable and locale
Date
Msg-id 5492.940095061@sss.pgh.pa.us
In response to Re: [HACKERS] indexable and locale  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Responses Re: [HACKERS] indexable and locale  (Tatsuo Ishii <t-ishii@sra.co.jp>)
Re: [HACKERS] indexable and locale  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Attached is a patch to the old problem discussed feverishly before 6.5.

> ... I think your patches break
> non-ASCII multi-byte character set data and should be surrounded by
> #ifdef LOCALE rather than replacing the current code surrounded by
> #ifndef LOCALE.

I am worried about this patch too.  Under MULTIBYTE, could it
generate invalid characters?  Also, do all non-ASCII locales sort
codes 0-126 in the same order as ASCII?  I don't think they do,
but I'm not an expert.
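
One rough way to probe the sort-order question for a given locale is to
compare adjacent single-byte codes with strcoll() and count how often the
result disagrees with plain byte order.  The sketch below is only an
illustration: the function name count_ascii_order_mismatches is made up
here, code 0 is skipped because it is the string terminator, and it only
tests one-character strings, so it says nothing about context-sensitive
rules.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: switch LC_COLLATE to locale_name and count the
 * adjacent pairs (c, c+1) with c in 1..125 for which strcoll() does NOT
 * place c strictly before c+1.  Returns -1 if the locale is unavailable.
 */
int
count_ascii_order_mismatches(const char *locale_name)
{
    int         mismatches = 0;
    int         c;

    assert(locale_name != NULL);

    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;              /* locale not installed on this system */

    for (c = 1; c < 126; c++)
    {
        char        lo[2];
        char        hi[2];

        lo[0] = (char) c;
        lo[1] = '\0';
        hi[0] = (char) (c + 1);
        hi[1] = '\0';

        /* byte order says lo < hi; see whether the locale agrees */
        if (strcoll(lo, hi) >= 0)
            mismatches++;
    }
    return mismatches;
}
```

In the "C" locale strcoll() compares like strcmp(), so the count is zero.
A nonzero count for some other locale name would show that the locale does
not sort these codes in ASCII order.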

The approach I was considering for fixing the problem was to use a
loop that would repeatedly try to generate a string greater than the
prefix string.  The basic loop step would increment the rightmost
byte as Goran has done (or, if it's already up to the limit, chop
it off and increment the next character position).  Then test to
see whether the '<' operator actually believes the result is
greater than the given prefix, and repeat if not.  This avoids making
any strong assumptions about the sort order of different character
codes.  However, there are two significant issues that would have
to be surmounted to make it work reliably:

1. In MULTIBYTE mode incrementing the rightmost byte might yield
an illegal multibyte character.  Some way to prevent or detect this
would be needed, lest it confuse the comparison operator.  I think
we have some multibyte routines that could be used to check for
a valid result, but I haven't looked into it.

2. I think there are some locales out there that have
context-sensitive sorting rules, i.e., a given character string may sort
differently than you'd expect from considering the characters in
isolation.  For example, in German isn't "ss" treated specially?
If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
premise of *both* sides of the LIKE optimization falls apart,
because you can't be sure what will happen when comparing a prefix
string like "pqrs" against longer strings from the database.
I do not know if this is really a problem, nor what we could do
to avoid it if it is.
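
The loop described above, together with the two checks it would need, can
be sketched roughly as follows.  This is an illustration under stated
assumptions, not the actual backend code: the names make_greater_candidate,
is_valid_multibyte, sorts_between and cmp_fn are hypothetical, the
comparison callback stands in for the '<' operator, and the validity check
uses the C library's mbsrtowcs() for the current locale rather than any of
the backend's multibyte routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* strcmp-style comparison callback; stands in for the '<' operator */
typedef int (*cmp_fn) (const char *a, const char *b);

/* Issue 1: true if s decodes cleanly as multibyte text in the current locale */
bool
is_valid_multibyte(const char *s)
{
    mbstate_t   state;
    const char *p = s;

    memset(&state, 0, sizeof(state));
    /* with a NULL destination, mbsrtowcs() only validates and counts */
    return mbsrtowcs(NULL, &p, 0, &state) != (size_t) -1;
}

/* Issue 2: true if lo < mid < hi under the current locale's collation */
bool
sorts_between(const char *lo, const char *mid, const char *hi)
{
    return strcoll(lo, mid) < 0 && strcoll(mid, hi) < 0;
}

/*
 * Repeatedly increment the rightmost byte; when it reaches the limit,
 * chop it off and continue with the byte before it.  Return the first
 * candidate that is valid multibyte text and that cmp says is greater
 * than prefix.  The result is malloc'd; NULL means no candidate exists.
 */
char *
make_greater_candidate(const char *prefix, cmp_fn cmp)
{
    size_t      len;
    char       *work;

    assert(prefix != NULL && cmp != NULL);

    len = strlen(prefix);
    work = malloc(len + 1);
    if (work == NULL)
        return NULL;
    memcpy(work, prefix, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &work[len - 1];

        while (*last < 255)
        {
            (*last)++;
            /* skip illegal sequences, then ask the comparison itself */
            if (is_valid_multibyte(work) && cmp(prefix, work) < 0)
                return work;
        }
        work[--len] = '\0';     /* byte is at the limit: chop it off */
    }

    free(work);
    return NULL;
}
```

With strcmp as the comparison, the prefix "pqrs" yields "pqrt".  Passing
strcoll instead makes the loop follow the current locale.  Note that a
candidate that compares greater than the prefix is only a safe upper bound
if the locale has no context-sensitive rules of the kind described in
point 2; sorts_between("pqrs", "pqrss", "pqrt") is one way to spot-check
that for a particular locale.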
        regards, tom lane

