Thread: COLLATE

COLLATE

From

"Filip Rembiałkowski"

Date:

05 June 2006, 12:04:36

Hi all,

let's assume that we keep Unicode text  data in the column.
sometimes we want to sort it according to specific collation order.
how can we force collation when running a query?
ideal solution would be having SQL92 standard COLLATE clauses in
SELECT statements. I heard it's work in progress, what's the status?

are there any "partial" solutions to the problem?

regards,

Filip Rembiałkowski

Re: COLLATE

From

Martijn van Oosterhout

Date:

05 June 2006, 13:03:51

On Mon, Jun 05, 2006 at 05:04:25PM +0200, Filip Rembia??kowski wrote:
> Hi all,
>
> let's assume that we keep Unicode text  data in the column.
> sometimes we want to sort it according to specific collation order.
> how can we force collation when running a query?
> ideal solution would be having SQL92 standard COLLATE clauses in
> SELECT statements. I heard it's work in progress, what's the status?

Yeah, I was working on it but got stuck on the planner/optimiser
changes. In the mean time the tree drifted and lack of interest, which
gets us where we are now...

> are there any "partial" solutions to the problem?

Not that I know of.

Have a ncie day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

signature.asc

Re: COLLATE

From

Greg Stark

Date:

05 June 2006, 13:54:20

"Filip Rembiaâkowski" <plk.zuber@gmail.com> writes:

> let's assume that we keep Unicode text  data in the column.
> sometimes we want to sort it according to specific collation order.
> how can we force collation when running a query?
> ideal solution would be having SQL92 standard COLLATE clauses in
> SELECT statements. I heard it's work in progress, what's the status?
>
> are there any "partial" solutions to the problem?

I don't know the status of the full COLLATE support.

But the closest partial solution suggested so far is the pg_xfrm function that
has been implemented and posted at least three times by three different
posters to the postgres mailing lists. In the interest of avoiding a fourth
independent implementation I'll attach the one I use below, it's not big.

--
greg

Attachment

Re: COLLATE

From

"Nikolay Samokhvalov"

Date:

06 June 2006, 17:40:56

On 6/5/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
>
> Yeah, I was working on it but got stuck on the planner/optimiser
> changes. In the mean time the tree drifted and lack of interest, which
> gets us where we are now...

Very bad news :-(
I were looking forward to this feature... So many troubles in my
projects would be vanished...

What is 'lack of interest'? Interest from community, or major
developers, or your personal one?

Re: COLLATE

From

"Filip Rembiałkowski"

Date:

07 June 2006, 10:09:25

(group dupe)

05 Jun 2006 12:53:57 -0400, Greg Stark <gsstark@mit.edu>:
>
> But the closest partial solution suggested so far is the pg_xfrm function that
> has been implemented and posted at least three times by three different
> posters to the postgres mailing lists. In the interest of avoiding a fourth
> independent implementation I'll attach the one I use below, it's not big.
>

Thanks! It may be the only solution available at the moment.

But I have no idea how to compile/install it.
I imagine there should be separate /contrib/pg_strxfrm directory, with
the makefile etc.
is it correct? if so, how should the makefile look like?

I tried compiling but it failed:

pgdba@sulaco:~/src/postgresql-8.1.3/src$ gcc -I ./include -fPIC -c pg_strxfrm.c
pg_strxfrm.c: In function 'pg_strxfrm':
pg_strxfrm.c:98: error: 'Warn_restart' undeclared (first use in this function)
pg_strxfrm.c:98: error: (Each undeclared identifier is reported only once
pg_strxfrm.c:98: error: for each function it appears in.)

Re: COLLATE

From

Martijn van Oosterhout

Date:

07 June 2006, 18:11:43

On Wed, Jun 07, 2006 at 12:40:49AM +0400, Nikolay Samokhvalov wrote:
> I were looking forward to this feature... So many troubles in my
> projects would be vanished...
>
> What is 'lack of interest'? Interest from community, or major
> developers, or your personal one?

Kind of all three, feeding off eachother. There's just not enough
interest from any front to really get it moving. It's a fairly invasive
change and without significant support and interest from somewhere,
chances of completion let alone acceptance are pretty slim...

Have a nice day
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

signature.asc

Re: COLLATE

From

"Tomi NA"

Date:

07 June 2006, 18:37:55

On 6/7/06, Martijn van Oosterhout <kleptog@svana.org> wrote:

> Kind of all three, feeding off eachother. There's just not enough
> interest from any front to really get it moving. It's a fairly invasive
> change and without significant support and interest from somewhere,
> chances of completion let alone acceptance are pretty slim...

I don't get it. Maybe it's me or my environment.
About half of the work we do in my company is building webs on the
(excellent) eZ publish CMS. We routinely use MySQL even though the CMS
(at least declaratively) wolks on top of pgsql. We use MySQL on
inertia: it's the default db under eZ publish and it's easy to get
hosting.
On the other hand, I'd rather bite my hand off than use MySQL in some
other Internet apps (e-shop with a card payment interface and others).
Where am I going with all this?
Well, if you don't live in an english-speaking country, there's no
such thing as a single language web. Because of collation limitations,
however, postgresql would be the first to be crossed out on my list no
matter how good it is in all other respects.
I understand that the needed change is uncomfortably invasive, but not
beeing able to collate correctly is a show-stopping problem on a
professional site. The user opens a drop-down with several dozens of
cities, scrolls down a bit to where his city should be, dooesn't see
it because the collator places the weird letter all the way down at
the end of the list - and the user walks away. The other user can't
get info about the bus lines to the city or he doesn't buy a product
he want's because he doesn't see it where it's supposed to be. Another
user notices the error and dissregards the site as amateurish.
I understand I'm talking about a specific area of use, but isn't that
where the biggest growth in both the number of new applications as
well as the number of users is?
As it is, I'm happy to have a great RDBMS to build intranet, rich
client apps on - but that's about it.
I've used MySQL and MSSQL (unfortunately) and they both support
collation much better than pgsql. Not perfect, but much much better.
Maybe postgresql should try to set less ambitious goals and instead of
going for the holy grail of collation management (which is usualy
praiseworthy) try to provide at least db-level collation definitions
if table/row/cell level collation settings should proove too
challenging at the moment.

t.n.a.

Re: COLLATE

From

Tom Lane

Date:

07 June 2006, 18:42:32

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Wed, Jun 07, 2006 at 12:40:49AM +0400, Nikolay Samokhvalov wrote:
>> What is 'lack of interest'? Interest from community, or major
>> developers, or your personal one?

> Kind of all three, feeding off eachother. There's just not enough
> interest from any front to really get it moving. It's a fairly invasive
> change and without significant support and interest from somewhere,
> chances of completion let alone acceptance are pretty slim...

I think there was also considerable concern about introducing a
dependency on a very large chunk of outside software (viz, ICU).
Loss of control, licensing questions, etc.

Of course, doing it *without* ICU is even more daunting :-(
but I'd like to think we'll get there someday.

            regards, tom lane

Re: COLLATE

From

Jan Behrens

Date:

14 June 2006, 10:12:44

Filip Rembiałkowski wrote:
> let's assume that we keep Unicode text  data in the column.
> sometimes we want to sort it according to specific collation order.
> how can we force collation when running a query?

Hi Filip,

I had the same problem you have. As a solution I implemented a wrapper
function named collkey(), which transforms a Unicode string into a
sortable collation key using IBM's ICU library. Of course it's not as
nice as having a standard SQL command for it, but i think it's a nice
solution, until a good collation support is provided by PostgreSQL itself.

You can find the source here:
http://www.flexiguided.de/publications.pgcollkey.en.html

I can understand that people don't want PostgreSQL being dependent of
other libraries. The best solution for future would be implementing
independent collation functions inside of PostgreSQL, which also support
standard SQL syntax, but that's a lot of work, i think.

Jan Behrens

Re: COLLATE

From

Kevin Murphy

Date:

16 August 2006, 17:32:53

Filip Rembiałkowski wrote:
> Greg Stark wrote:
>
>     But the closest partial solution suggested so far is the pg_xfrm (sic) function that
>     has been implemented and posted at least three times by three different
>     posters to the postgres mailing lists. In the interest of avoiding a fourth
>     independent implementation I'll attach the one I use below, it's not big.
>
>
> But I have no idea how to compile/install it.
>
> I tried compiling but it failed:
>
> pgdba ( at ) sulaco:~/src/postgresql-8.1.3/src$ gcc -I ./include -fPIC -c pg_strxfrm.c
> pg_strxfrm.c: In function 'pg_strxfrm':
> pg_strxfrm.c:98: error: 'Warn_restart' undeclared (first use in this function)
> pg_strxfrm.c:98: error: (Each undeclared identifier is reported only once
> pg_strxfrm.c:98: error: for each function it appears in.)

What is the answer to Filip's question?  I didn't see an answer in the list archives.  I've seen several copies of Joe
Conway'spg_strxfrm.c code on the web, and it always refers to the Warn_restart variable, which doesn't seem to exist in
the8.1.4 code that I'm using. 

I am working with a database in UTF-8 encoding using "C" collation - but I'd occasionally like to ORDER BY columns
containingreal UTF-8 data.  

Would the pg_strxfrm() function get used in a new operator class function?   I'll read up on operator classes in
chapter32.14 of the docs, but if someone has a simple example, it might help other searchers of the archives. 

Thanks,
Kevin Murphy

Re: COLLATE

From

Tom Lane

Date:

16 August 2006, 18:02:36

Kevin Murphy <murphy@genome.chop.edu> writes:
> What is the answer to Filip's question?  I didn't see an answer in the list archives.  I've seen several copies of
JoeConway's pg_strxfrm.c code on the web, and it always refers to the Warn_restart variable, which doesn't seem to
existin the 8.1.4 code that I'm using. 

Warn_restart hasn't existed since PG 7.4.  I would imagine that the code
needs to be tweaked to use a PG_TRY construct instead of direct setjmp
hacking.

            regards, tom lane

Re: COLLATE

From

Kevin Murphy

Date:

17 August 2006, 12:01:14

Tom Lane wrote:
> Kevin Murphy <murphy@genome.chop.edu> writes:
>> What is the answer to Filip's question?  I didn't see an answer in the list archives.  I've seen several copies of
JoeConway's pg_strxfrm.c code on the web, and it always refers to the Warn_restart variable, which doesn't seem to
existin the 8.1.4 code that I'm using. 
>
> Warn_restart hasn't existed since PG 7.4.  I would imagine that the code
> needs to be tweaked to use a PG_TRY construct instead of direct setjmp
> hacking.

Yes, I'm a user, not a hacker.  I was hoping that someone had done this
already.  Anyway, I gave PG_TRY a try, and the code superficially works.
I have no idea what I'm doing; you can see what I did below.

Confirm that instead of:

        memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
        if (sigsetjmp(Warn_restart, 1) != 0)
        {
                memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
                newlocale = setlocale(LC_COLLATE, oldlocale);
                if (!newlocale)
                        elog(PANIC, "setlocale failed to reset locale:
%s", localestr);
                siglongjmp(Warn_restart, 1);
        }
        ...
        code here
        ...
        memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));


it should be:

        PG_TRY();
        {
             ...
             code here
             ...
        }
        PG_CATCH();
        {
             newlocale = setlocale(LC_COLLATE, oldlocale);
             if (!newlocale)
                     elog(PANIC, "setlocale failed to reset locale: %s",
localestr);

        }
        PG_END_TRY();

Thanks,
Kevin Murphy