Re: Sorting Problem - Mailing list pgsql-general

From Dennis Gearon
Subject Re: Sorting Problem
Date
Msg-id 3F3AB0EC.6040307@cvc.net
Whole thread Raw
In response to Re: Sorting Problem  (Kathy zhu <Kathy.zhu@Sun.COM>)
List pgsql-general
From what I've read, the db will ONLY sort on the language it was 'initdb'd on. I guess the 'C' locale should would
sortit sort of, in other languages. 

Kathy zhu wrote:

> Ok, thanks for all the discussion followed, vey educational :-))
> But nobody really followed up my question :-(
>
> For example, you have a table that is displayed in the browser. You want
> to let the user do sorting on one or multible columns, including those
> which contain localized strings.
>
> If the db supports sorting for multi-linguages, you can retreive all the
> rows and do "ORDER BY" on the columns directly.
>
> But if db doesn't support that, in Java, you will have to retrieve all
> the rows first, sort the objects on multiple columnes listed in the
> "ORDER BY" in java. It is much slower than the previous one.
>
> My problem is that if initdb is done with en_US, if a customer in Japan
> request the page with that sorted table, the text is not sorted in a
> Japanese one. Then I have a problem.
>
> Any suggestions on that ??
>
>
> Dennis Gearon wrote:
>
>> I agree, mostly. In the case of a database, I would bet that the
>> INTERNAL, IN-APPLICATION processing FAAAAAAAAAAARRRRRRRRRRR exceeds
>> that of sending and receiving it.
>>
>>     i.e. comparisons, sorts, triggers, indexes, views, functions,
>> logging to tables, ordering by,
>>         grouping, etc.
>>
>>     except backups, restores, logging to text files (these would be
>> good in UTF8)
>>
>> Bruce Momjian wrote:
>>
>>> I think the question is how often are you passing data around/storing it
>>> _in_ your application and how often are you processing it.
>>>
>>> ---------------------------------------------------------------------------
>>>
>>>
>>> Dennis Gearon wrote:
>>>
>>>> I agree with all of that except for one caveat:
>>>>
>>>>     all my reading, and just general off the cuff thinking, says
>>>> that processing variable width characters SIGNIFICANTLY slows an
>>>> application. It seems better to PROCESS fixed width characters
>>>> (1,2,4 byte), and TRANSMIT variable width characters (avoiding the
>>>> null problem.)
>>>>
>>>> Gianni Mariani wrote:
>>>>
>>>>
>>>>> Dennis Gearon wrote:
>>>>>
>>>>>
>>>>>> Got a link to that section of the standard, or better yet, to a
>>>>>> 'interpreted' version of the standard? :-)
>>>>>>
>>>>>> Stephan Szabo wrote:
>>>>>>
>>>>>>
>>>>>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Dennis Bj?rklund wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> In the future we need indexes that depend on the locale (and a
>>>>>>>>> lot of other changes).
>>>>>>>>>
>>>>>>>>
>>>>>>>> I agree. I've been looking at the web on this subject a lot
>>>>>>>> lately. I
>>>>>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user
>>>>>>>> define a
>>>>>>>> language(maybe encoding) down to the column level!
>>>>>>>>
>>>>>>>> I've been reading on GNU-C and on languages, encoding, and
>>>>>>>> localization.
>>>>>>>>
>>>>>>>> http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
>>>>>>>>
>>>>>>>> http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> There are three basic approaches to doing different langauges in
>>>>>>>> computerized text:
>>>>>>>>
>>>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>>> ISO-8859-x series.
>>>>>>>>   B/ wide characters
>>>>>>>>   ********This should be how Postgress stores data
>>>>>>>> internally.********
>>>>>>>>   C/ Multibyte characters
>>>>>>>>   ********This is how Postgress should default to sending data
>>>>>>>> OUT of the application,
>>>>>>>>           i.e. to the display or the web, or other system
>>>>>>>> applications********
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> SQL has a system for defining character set specifications,
>>>>>>> collations and
>>>>>>> such (per column/literal in some cases).  We should probably look
>>>>>>> at it
>>>>>>> before making decisions on how to do things.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> I thought UNIX (SCOTM) systems also had a way of being able to
>>>>> define collation order.
>>>>>
>>>>> see:
>>>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>>>
>>>>> for a collection of all ISO standardized locales (the WG15 ISO work
>>>>> group's stuff).
>>>>>
>>>>> Do a "man localedef" on most Linuxen or UNIXen.
>>>>>
>>>>> As for wide characters vs multibyte, there is no clear winner.  The
>>>>> right answer DEPENDS on the situation.
>>>>>
>>>>> Wide characters on some platforms are 16 bit which means that when
>>>>> you do Unicode you'll still have problems with surrogate pairs
>>>>> (meaning that it's still multi (wide) char) so you still have all
>>>>> the problems of multi-byte encodings.
>>>>>
>>>>> You could decide to process everything in a PG specific 4 byte wide
>>>>> char and do all text in Unicode but the overhead in processing 4
>>>>> times the data is quite significant.  The other option is to store
>>>>> all data in utf-8 and have all text code become utf-8 aware.
>>>>>
>>>>> I have found in practice that the utf-8 option is significantly
>>>>> easier to implement, 100% Unicode compliant and the best performer
>>>>> (because of reduced memory requirements).
>>>>> The Posix API's for locales are not very good for modern day
>>>>> programs, I'm not sure where the "mbr*" and the "wcr*" apis are in
>>>>> the standardization process but if these are not well supported,
>>>>> you're on your own and will need to implement similar functionality
>>>>> from scratch and for that matter, the collation functions all
>>>>> operate on a "current" locate which is really difficult to work
>>>>> with on multi-locale applications.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------(end of
>>>>> broadcast)---------------------------
>>>>> TIP 8: explain analyze is your friend
>>>>>
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 7: don't forget to increase your free space map settings
>>>>
>>>
>>>
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 9: the planner will ignore your desire to choose an index scan if
>> your
>>      joining column's datatypes do not match
>
>
>


pgsql-general by date:

Previous
From: Jason Godden
Date:
Subject: Re: importing db as text files
Next
From: Ron Johnson
Date:
Subject: Re: How to get the total number of rows with a query