Re: Sorting Problem - Mailing list pgsql-general

From Gianni Mariani
Subject Re: Sorting Problem
Msg-id 3F3A75C3.3060006@mariani.ws
In response to Re: Sorting Problem  (Dennis Gearon <gearond@cvc.net>)
Responses Re: Sorting Problem  (Dennis Gearon <gearond@cvc.net>)
List pgsql-general
Dennis Gearon wrote:

> Got a link to that section of the standard, or better yet, to a
> 'interpreted' version of the standard? :-)
>
> Stephan Szabo wrote:
>
>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>
>>
>>> Dennis Björklund wrote:
>>>
>>>
>>>> In the future we need indexes that depend on the locale (and a lot
>>>> of other changes).
>>>>
>>>
>>> I agree. I've been looking at the web on this subject a lot lately. I
>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user define a
>>> language(maybe encoding) down to the column level!
>>>
>>> I've been reading on GNU-C and on languages, encoding, and
>>> localization.
>>>
>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>
>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>
>>>
>>>
>>> There are three basic approaches to doing different languages in
>>> computerized text:
>>>
>>>    A/ various adaptations of the 8 bit character set, I.E. the
>>> ISO-8859-x series.
>>>    B/ wide characters
>>>    ********This should be how Postgress stores data internally.********
>>>    C/ Multibyte characters
>>>    ********This is how Postgress should default to sending data OUT
>>> of the application,
>>>            i.e. to the display or the web, or other system
>>> applications********
>>
>>
>>
>> SQL has a system for defining character set specifications,
>> collations and
>> such (per column/literal in some cases).  We should probably look at it
>> before making decisions on how to do things.
>

I thought UNIX (SCOTM) systems also had a way of being able to define
collation order.

see:
    ftp://dkuug.dk/i18n/WG15-collection/locales

for a collection of all ISO standardized locales (the WG15 ISO work
group's stuff).

Do a "man localedef" on most Linuxen or UNIXen.

As for wide characters vs multibyte, there is no clear winner.  The
right answer DEPENDS on the situation.

Wide characters on some platforms are 16 bits, which means that when you
do Unicode you'll still have problems with surrogate pairs: a character
outside the 16 bit range takes two wide chars, so it's still multi (wide)
char and you still have all the problems of multi-byte encodings.
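
To make the surrogate pair point concrete, here is a small sketch in Python (chosen only for illustration; the helper name utf16_code_units is made up for this example). It shows that a character outside the 16 bit range needs two 16-bit units, so code that assumes "one wide char = one character" is wrong for it:

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


# U+00E9 fits in one 16-bit unit.
print([hex(u) for u in utf16_code_units("\u00e9")])      # ['0xe9']

# U+1D11E (musical G clef) does not: it becomes a surrogate pair.
print([hex(u) for u in utf16_code_units("\U0001D11E")])  # ['0xd834', '0xdd1e']
```

Any code that indexes, truncates or compares 16-bit wide strings has to check is_surrogate() style ranges, which is the same kind of bookkeeping a multibyte encoding needs.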

You could decide to process everything in a PG-specific 4 byte wide char
and do all text in Unicode, but the overhead of processing up to 4 times
the data (the worst case, for mostly ASCII text) is quite significant.
The other option is to store all data in utf-8 and have all text code
become utf-8 aware.
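
As a rough illustration of the size difference between the two options, this sketch compares the byte counts of the same text in utf-8 and in a fixed 4 byte encoding (UTF-32 stands in here for the hypothetical PG wide char; encoded_sizes is a made-up helper):

```python
def encoded_sizes(text):
    """Byte counts for text in utf-8 and in a fixed 4-byte encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # utf-32-le: 4 bytes per code point, no byte order mark
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes("Sorting Problem"))  # {'utf-8': 15, 'utf-32': 60}
print(encoded_sizes("Björklund"))        # {'utf-8': 10, 'utf-32': 36}
```

The factor of 4 is the worst case for plain ASCII; text in other scripts needs 2 to 4 bytes per character in utf-8, so the gap narrows there.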

I have found in practice that the utf-8 option is significantly easier
to implement, 100% Unicode compliant and the best performer (because of
reduced memory requirements).
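
For what "utf-8 aware" means in practice, here is a minimal sketch (the function names are invented for this example, and it assumes the input is already valid utf-8). The trick is that continuation bytes always look like 10xxxxxx, so character boundaries can be found without decoding anything:

```python
def utf8_length(data):
    """Count characters in valid UTF-8 bytes by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_chars(data):
    """Yield each character of valid UTF-8 bytes as its own byte slice."""
    start = 0
    for i in range(1, len(data) + 1):
        # A new character starts at every byte that is not 10xxxxxx.
        if i == len(data) or (data[i] & 0xC0) != 0x80:
            yield data[start:i]
            start = i


raw = "Björklund".encode("utf-8")
print(len(raw), utf8_length(raw))  # 10 9
print(list(utf8_chars("aö".encode("utf-8"))))  # [b'a', b'\xc3\xb6']
```

Byte-wise comparison of valid utf-8 also sorts in code point order, which is not locale collation but is at least consistent.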

The POSIX APIs for locales are not very good for modern day programs.
I'm not sure where the "mbr*" and the "wcr*" APIs are in the
standardization process, but if these are not well supported, you're on
your own and will need to implement similar functionality from scratch.
For that matter, the collation functions all operate on a "current"
locale, which is really difficult to work with in multi-locale applications.
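
The "current locale" problem can be sketched with Python's locale module, which wraps the same C library calls (setlocale and strxfrm). The helpers collation and sort_in_locale are made up for this example. Because the locale is process-wide state, sorting under a different collation means switching it and remembering to switch it back:

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation(name):
    """Temporarily switch the process-wide LC_COLLATE setting."""
    saved = locale.setlocale(locale.LC_COLLATE)  # no second arg: query only
    locale.setlocale(locale.LC_COLLATE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def sort_in_locale(words, name):
    """Sort words using the collation rules of the named locale."""
    with collation(name):
        return sorted(words, key=locale.strxfrm)


print(sort_in_locale(["b", "a", "B"], "C"))  # ['B', 'a', 'b']
```

The "C" locale is used here because it always exists; a name such as "de_DE.UTF-8" only works if that locale is installed on the machine. Since the switch affects the whole process, this pattern is not safe when several threads sort under different locales at once, which is exactly the difficulty for multi-locale applications.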






