Re: Sorting Problem - Mailing list pgsql-general

From Maksim Likharev
Subject Re: Sorting Problem
Date
Msg-id 56510AAEF435D240958D1CE8C6B1770A016D2DEE@mailc03.aurigin.com
Whole thread Raw
In response to Sorting Problem  ("Tim Edwards" <mor4321@hotmail.com>)
List pgsql-general
When I said use en_US as collate, I meant that you language set
fall into "Western European" group, another word "Latin1 General"
collate,
still as Dennis Björklund pointed that Swedish has different sorting
rules,
( That's why I guess they have separate Swedish collate )

Generally speaking if you language set is unlimited you effectively
stuck?
with sorting on client side by using particular language rules.
So then use 'C' collate for DB,
I do not know what kind of encoding will be good for you tho, Unicode??.

So you can sort on a client using xsl:sort and specifing language or any
other means that you
feel comfortable with.
Or just use binary sorting rules, that what 'C' collate does.
Choices, choices...






-----Original Message-----
From: Kathy zhu [mailto:Kathy.zhu@Sun.COM]
Sent: Wednesday, August 13, 2003 2:13 PM
To: gearond@cvc.net
Cc: Bruce Momjian; Gianni Mariani; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Sorting Problem


Ok, thanks for all the discussion followed, vey educational :-))
But nobody really followed up my question :-(

For example, you have a table that is displayed in the browser. You want
to let
the user do sorting on one or multible columns, including those which
contain
localized strings.

If the db supports sorting for multi-linguages, you can retreive all the
rows
and do "ORDER BY" on the columns directly.

But if db doesn't support that, in Java, you will have to retrieve all
the rows
first, sort the objects on multiple columnes listed in the "ORDER BY" in
java.
It is much slower than the previous one.

My problem is that if initdb is done with en_US, if a customer in Japan
request
the page with that sorted table, the text is not sorted in a Japanese
one. Then
I have a problem.

Any suggestions on that ??


Dennis Gearon wrote:
> I agree, mostly. In the case of a database, I would bet that the
> INTERNAL, IN-APPLICATION processing FAAAAAAAAAAARRRRRRRRRRR exceeds
that
> of sending and receiving it.
>
>     i.e. comparisons, sorts, triggers, indexes, views, functions,
> logging to tables, ordering by,
>         grouping, etc.
>
>     except backups, restores, logging to text files (these would be
good
> in UTF8)
>
> Bruce Momjian wrote:
>
>> I think the question is how often are you passing data around/storing
it
>> _in_ your application and how often are you processing it.
>>
>>
------------------------------------------------------------------------
---
>>
>>
>> Dennis Gearon wrote:
>>
>>> I agree with all of that except for one caveat:
>>>
>>>     all my reading, and just general off the cuff thinking, says
that
>>> processing variable width characters SIGNIFICANTLY slows an
>>> application. It seems better to PROCESS fixed width characters
(1,2,4
>>> byte), and TRANSMIT variable width characters (avoiding the null
>>> problem.)
>>>
>>> Gianni Mariani wrote:
>>>
>>>
>>>> Dennis Gearon wrote:
>>>>
>>>>
>>>>> Got a link to that section of the standard, or better yet, to a
>>>>> 'interpreted' version of the standard? :-)
>>>>>
>>>>> Stephan Szabo wrote:
>>>>>
>>>>>
>>>>>> On Wed, 13 Aug 2003, Dennis Gearon wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Dennis Bj?rklund wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> In the future we need indexes that depend on the locale (and a
>>>>>>>> lot of other changes).
>>>>>>>>
>>>>>>>
>>>>>>> I agree. I've been looking at the web on this subject a lot
>>>>>>> lately. I
>>>>>>> am **NOT** a microslop fan, but SQL-SERVER even lets a user
define a
>>>>>>> language(maybe encoding) down to the column level!
>>>>>>>
>>>>>>> I've been reading on GNU-C and on languages, encoding, and
>>>>>>> localization.
>>>>>>>
>>>>>>>
http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-
1.html
>>>>>>>
>>>>>>>
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2
366,1222,00.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> There are three basic approaches to doing different langauges in

>>>>>>> computerized text:
>>>>>>>
>>>>>>>   A/ various adaptations of the 8 bit character set, I.E. the
>>>>>>> ISO-8859-x series.
>>>>>>>   B/ wide characters
>>>>>>>   ********This should be how Postgress stores data
>>>>>>> internally.********
>>>>>>>   C/ Multibyte characters
>>>>>>>   ********This is how Postgress should default to sending data
>>>>>>> OUT of the application,
>>>>>>>           i.e. to the display or the web, or other system
>>>>>>> applications********
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> SQL has a system for defining character set specifications,
>>>>>> collations and
>>>>>> such (per column/literal in some cases).  We should probably look

>>>>>> at it
>>>>>> before making decisions on how to do things.
>>>>>
>>>>>
>>>>>
>>>> I thought UNIX (SCOTM) systems also had a way of being able to
>>>> define collation order.
>>>>
>>>> see:
>>>>   ftp://dkuug.dk/i18n/WG15-collection/locales
>>>>
>>>> for a collection of all ISO standardized locales (the WG15 ISO work

>>>> group's stuff).
>>>>
>>>> Do a "man localedef" on most Linuxen or UNIXen.
>>>>
>>>> As for wide characters vs multibyte, there is no clear winner.  The

>>>> right answer DEPENDS on the situation.
>>>>
>>>> Wide characters on some platforms are 16 bit which means that when
>>>> you do Unicode you'll still have problems with surrogate pairs
>>>> (meaning that it's still multi (wide) char) so you still have all
>>>> the problems of multi-byte encodings.
>>>>
>>>> You could decide to process everything in a PG specific 4 byte wide

>>>> char and do all text in Unicode but the overhead in processing 4
>>>> times the data is quite significant.  The other option is to store
>>>> all data in utf-8 and have all text code become utf-8 aware.
>>>>
>>>> I have found in practice that the utf-8 option is significantly
>>>> easier to implement, 100% Unicode compliant and the best performer
>>>> (because of reduced memory requirements).
>>>> The Posix API's for locales are not very good for modern day
>>>> programs, I'm not sure where the "mbr*" and the "wcr*" apis are in
>>>> the standardization process but if these are not well supported,
>>>> you're on your own and will need to implement similar functionality

>>>> from scratch and for that matter, the collation functions all
>>>> operate on a "current" locate which is really difficult to work
with
>>>> on multi-locale applications.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 8: explain analyze is your friend
>>>>
>>>
>>>
>>> ---------------------------(end of
broadcast)---------------------------
>>> TIP 7: don't forget to increase your free space map settings
>>>
>>
>>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
your
>      joining column's datatypes do not match


--



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

pgsql-general by date:

Previous
From: "Gregory S. Williamson"
Date:
Subject: Re: importing db as text files
Next
From: Ron Johnson
Date:
Subject: Support contracts (was Re: Commercial support?)