Thread: Locale agnostic unicode text


From: Dawid Kuroczko
Hello!

One of my least-liked PostgreSQL features is the need to specify
LC_CTYPE and LC_COLLATE at initdb time.  Especially if you intend to
put texts in different languages into the DB (say, Polish, French,
German and Russian) and use functions like lower() or ORDER BY on
these texts. :)

I guess the need to select these two locales at initdb time is to
avoid problems with corrupted indexes (a column first indexed
with the lower() function under setlocale('C'), then accessed under
setlocale('ru_RU')... oops, etc.).  Probably there are more cases
like that. :)

To solve this I thought about creating function lower(text, locale), say:
   lower ('Name', 'pl_PL.utf8');
Simple enough; I used plperl (plperlu actually) to make it happen, and
while doing so I noticed that perl does unicode lowercasing/uppercasing
on its own, doesn't need locales for it, and does it pretty well.
So the resulting function is:

CREATE DOMAIN unitext text;
CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
        utf8::decode($_[0]);
        return lc($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;

And it seems to work fine regardless of locale set.
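[Editor's note: a rough analogue of the plperl function above, not part of the original mail. Python 3's str.lower(), like Perl's lc() on a decoded string, applies the Unicode default case mapping with no locale setup at all:]

```python
# Unicode default case mapping needs no locale configuration:
# Python 3's str.lower() behaves here like Perl's lc() on a
# decoded (utf8::decode'd) string.
assert "ŁÓDŹ".lower() == "łódź"    # Polish
assert "ÉTÉ".lower() == "été"      # French
assert "СЛОВО".lower() == "слово"  # Russian
print("locale-independent lowercasing works")
```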


So... I thought, why not use this unitext to sort texts?
So I've created functions, operators and an operator class.
This time setlocale() was needed to get the behaviour
I needed (database initdb'ed to 'C', my sort order set to 'pl_PL',
or whatever locale I need at a given moment).

I've attached a 'draft' of unitext.sql, which should create a
unitext datatype capable of sorting according to the Polish locale.
It does not work as it should and I don't know how to make
it work.  For example:

SELECT * FROM uni_tab ORDER BY uni_column;

...sorts according to 'C' (initdb's) locale.

I can force my way by issuing:

SELECT * FROM uni_tab ORDER BY uni_column USING <;

...but I would like to force ORDER BY using operators
provided by me without this 'USING <' clause.

Any hints how to do it?

  Regards,
     Dawid

PS: I like perl's lc() and uc() behaviour in utf8 mode.  I'm thinking about
trying to "port" it from the perl source as a C-language extension for PostgreSQL.
What do you think about it?


Re: Locale agnostic unicode text

From: Tom Lane
Dawid Kuroczko <qnex42@gmail.com> writes:
> So... I thought, why not use this unitext to sort texts?
> So I've created functions, operators and operator class,
> This time setlocale() was needed to get the behaviour
> I needed (database initdb'ed to 'C', my order set to 'pl_PL',
> or whatever locale I need at given moment).

I would imagine that the performance is spectacularly awful :-(.
Have you benchmarked it?  A large sort on a unitext column,
for instance, would be revealing.

> ...but I would like to force ORDER BY using operators
> provided by me without this 'USING <' clause.

Hmm, the existence of the default btree operator class should be
sufficient.

> CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
>     utf8::decode($_[0]);
>     return lc($_[0]);
> $$ LANGUAGE plperlu IMMUTABLE;

AFAIK upper/lower cannot be considered to be locale-independent
(see Turkish I/i business for a counterexample).
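[Editor's note: the Turkish counterexample is easy to reproduce with any default Unicode case mapping; a sketch, not from the thread:]

```python
# The Unicode default case mapping (what Perl's lc() and Python's
# str.lower() use) maps U+0049 'I' to U+0069 'i', but Turkish
# orthography lowercases dotless 'I' to U+0131 'ı'.
assert "I".lower() == "i"   # default Unicode mapping
assert "I".lower() != "ı"   # wrong result for Turkish text
```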
        regards, tom lane


Re: Locale agnostic unicode text

From: Dawid Kuroczko
On Sat, 22 Jan 2005 17:09:42 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > This time setlocale() was needed to get the behaviour
> > I needed (database initdb'ed to 'C', my order set to 'pl_PL',
> > or whatever locale I need at given moment).
> I would imagine that the performance is spectacularly awful :-(.
> Have you benchmarked it?  A large sort on a unitext column,
> for instance, would be revealing.

True.  Yet it would still be better than nothing ("C").  Actually
I was thinking that maybe functional indexes could be
used to boost the speed (at least for ordering).

> > ...but I would like to force ORDER BY using operators
> > provided by me without this 'USING <' clause.
> Hmm, the existence of the default btree operator class should be
> sufficient.

If you (or anyone) could try that SQL file and find the
missing clause... :)

I guess the issue is that DOMAIN unitext is not quite a
distinct type, so text's default operators sometimes take
precedence over unitext's own. :)

> > CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
> >       utf8::decode($_[0]);
> >       return lc($_[0]);
> > $$ LANGUAGE plperlu IMMUTABLE;
> 
> AFAIK upper/lower cannot be considered to be locale-independent
> (see Turkish I/i business for a counterexample).

I imagine it is not possible to make a 'one size fits all' lower(),
yet perl's uc()/lc() is, in my opinion, for some cases still
better than choosing one locale or using the "C" locale.

  Regards,
     Dawid


Re: Locale agnostic unicode text

From: Greg Stark
> On Sat, 22 Jan 2005 17:09:42 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > This time setlocale() was needed to get the behaviour
> > > I needed (database initdb'ed to 'C', my order set to 'pl_PL',
> > > or whatever locale I need at given moment).
> > I would imagine that the performance is spectacularly awful :-(.
> > Have you benchmarked it?  A large sort on a unitext column,
> > for instance, would be revealing.

Why do you persist in believing this? I sent timing results of doing a
setlocale for every record here about a year ago. Sorting on the pg_strxfrm
I posted (and Conway rewrote) was about twice as slow as sorting without
using it. So it's slow, but not spectacularly awful.

This depends on having a good setlocale implementation, but glibc at least
seems to be satisfactory.

-- 
greg



Re: Locale agnostic unicode text

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> On Sat, 22 Jan 2005 17:09:42 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I would imagine that the performance is spectacularly awful :-(.
>> Have you benchmarked it?  A large sort on a unitext column,
>> for instance, would be revealing.

> Why do you persist in believing this? I sent timing results of doing a
> setlocale for every record here about a year ago. Sorting on the pg_strxfrm I
> posted (and Conway rewrote) was about twice as slow as sorting without using
> it. So it's slow but not spectacularly awful.

glibc is not the world.  I tried Dawid's functions on Mac OS X, being a
random non-glibc platform that I happen to use.  Using some text data
I had handy (44500 lines, 1.9MB) I made a single-column text table and
timed
	explain analyze select * from foo order by f1;
The results were
	In C locale, SQL_ASCII encoding:	  820 ms
	In C locale, UNICODE encoding:  	  825 ms
	Using Dawid's functions:        	62010 ms
	Stripped-down functions:        	21010 ms

The "stripped down" functions were the same functions without the 
locale overhead, eg

CREATE OR REPLACE FUNCTION unitext_le(unitext,unitext) RETURNS boolean AS $$
    my $ret = ($_[0] le $_[1]) ? 't' : 'f';
    return $ret;
$$ LANGUAGE plperlu STABLE;

so we may conclude that about one-third of the overhead is plperl's
fault and the other two-thirds is setlocale's fault.  But it's still
a factor of 50 slowdown to do it this way (actually worse, since not
all of the EXPLAIN ANALYZE total runtime went into sorting).

I'm not sure what your threshold of "spectacularly awful" is, but that
meets mine.
        regards, tom lane


Re: Locale agnostic unicode text

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> >
> > So it's slow but not spectacularly awful.
> 
> glibc is not the world.  

Sorry, I should have said "It's not *necessarily* spectacularly awful"

> I tried Dawid's functions on Mac OS X, being a
> random non-glibc platform that I happen to use.  Using some text data
> I had handy (44500 lines, 1.9MB) I made a single-column text table and
> timed
>     explain analyze select * from foo order by f1;
> The results were
>     In C locale, SQL_ASCII encoding:    820 ms
>     In C locale, UNICODE encoding:        825 ms
>     Using Dawid's functions:        62010 ms
>     Stripped-down functions:        21010 ms

I don't think these are fair comparisons though. The C locale probably
short-circuits much of the work that strxfrm/strcoll have to do for other
locales. I think the fair comparison is to compare a database initdb'd in a
non-C locale like en_US using strcoll with no setlocale calls against one
calling setlocale twice for every record.

In any case it's true, some platforms have bad implementations of things.

But if you have to do this (and I have to do this too) it doesn't really
matter that some platforms don't handle it well. This just means those
platforms aren't feasible and I'm forced to use glibc-based platforms. It
doesn't mean I should dismiss Postgres for the project.

Incidentally Dawid, if you are on a platform like OSX with a performance
problem with this there is a possible optimization you can use. If you store
and update the data rarely but sort it frequently you can store the output of
strxfrm in a bytea column. Then you can sort on that column without having to
call setlocale repeatedly.

If you have few queries that can be optimized to always use indexes you can
even store this information in a functional index instead of denormalizing the
table.
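[Editor's note: the precomputed-key idea above can be sketched outside the database too. A minimal Python illustration follows (Python and the "C" locale are assumptions of the sketch; in PostgreSQL the key would live in a bytea column or a functional index):]

```python
import locale

# Precompute each row's collation key (the strxfrm output) once,
# instead of collating on every comparison -- the same idea as
# storing strxfrm() output in a bytea column and sorting on it.
locale.setlocale(locale.LC_COLLATE, "C")  # any installed locale works

words = ["banana", "Apple", "cherry"]

# "Store the key with the row", then sort on the stored key...
keyed = [(locale.strxfrm(w), w) for w in words]
keyed.sort()
precomputed = [w for _, w in keyed]

# ...which yields the same order as collating during the sort itself.
assert precomputed == sorted(words, key=locale.strxfrm)
```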

-- 
greg



Re: Locale agnostic unicode text

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> The results were
>> In C locale, SQL_ASCII encoding:    820 ms
>> In C locale, UNICODE encoding:        825 ms
>> Using Dawid's functions:        62010 ms
>> Stripped-down functions:        21010 ms

> I don't think these are fair comparisons though. The C locale probably
> short-circuits much of the work that strxfrm/strcoll have to do for other
> locales. I think the fair comparison is to compare a database initdb'd in a
> non-C locale like en_US using strcoll with no setlocale calls against one
> calling setlocale twice for every record.

OK, that's fair:
	In en_US locale:	1900 ms

which makes it only a factor of 20 slowdown to use setlocale every time...
        regards, tom lane