Thread: Per-column collation

Per-column collation

From
Peter Eisentraut
Date:
Here is the next patch in this epic series. [0]

I have addressed most of the issues pointed out in previous reviews and
removed all major outstanding problems that were marked in the code.  So
it might just almost really work.

The documentation now also covers everything that's interesting, so
newcomers can start with that.


For those who have previously reviewed this, two major changes:

* The locales to be loaded are now computed by initdb, no longer during
the build process.

* The regression test file has been removed from the main test set.  To
run it, use

make check MULTIBYTE=UTF8 EXTRA_TESTS=collate


Stuff that still cannot be expected to work:

* no CREATE COLLATION yet, maybe later

* no support for regular expression searches

* no text search support

These would not be release blockers, I think.


[0] http://archives.postgresql.org/message-id/1284583568.4696.20.camel@vanquo.pezone.net

Attachment

Re: Per-column collation

From
Pavel Stehule
Date:
Hello

I am checking a patch. I found a problem with initdb

[postgres@pavel-stehule postgresql]$ /usr/local/pgsql/bin/initdb -D
/usr/local/pgsql/data/
could not change directory to "/home/pavel/src/postgresql"
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale en_US.utf8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".

fixing permissions on existing directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 24MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ...initdb: locale name has non-ASCII characters, skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8"
could not determine encoding for locale "ka_GE": codeset is "GEORGIAN-PS"
could not determine encoding for locale "ka_GE.georgianps": codeset is
"GEORGIAN-PS"
could not determine encoding for locale "kk_KZ": codeset is "PT154"
could not determine encoding for locale "kk_KZ.pt154": codeset is "PT154"
could not determine encoding for locale "tg_TJ": codeset is "KOI8-T"
could not determine encoding for locale "tg_TJ.koi8t": codeset is "KOI8-T"
could not determine encoding for locale "thai": codeset is "TIS-620"
could not determine encoding for locale "th_TH": codeset is "TIS-620"
could not determine encoding for locale "th_TH.tis620": codeset is "TIS-620"
could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1"
FATAL:  invalid byte sequence for encoding "UTF8": 0xe56c27
child process exited with exit code 1
initdb: removing contents of data directory "/usr/local/pgsql/data"

Tested on Fedora 13.

[postgres@pavel-stehule local]$ locale -a| wc -l
731

Regards

Pavel Stehule





Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
> I am checking a patch. I found a problem with initdb

Ah, late night brain farts, it appears.  Here is a corrected version.


Attachment

Re: Per-column collation

From
Pavel Stehule
Date:
Hello

2010/11/15 Peter Eisentraut <peter_e@gmx.net>:
> On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
>> I am checking a patch. I found a problem with initdb
>
> Ah, late night brain farts, it appears.  Here is a corrected version.
>
>

Yes, it's OK now.

I still see a few issues:

a) the default encoding for a collation isn't the same as the database's default encoding

This is unfriendly, to say the least - the most widely used encoding is
UTF8, but in most cases users still have to write locale.utf8.

b) there is a bug - the default collation (the database collation) is ignored


postgres=# show lc_collate;
 lc_collate
────────────
 cs_CZ.UTF8
(1 row)

Time: 0.518 ms
postgres=# select * from jmena order by v;
     v
───────────
 Chromečka
 Crha
 Drobný
 Čečetka
(4 rows)

postgres=# select * from jmena order by v collate "cs_CZ.utf8";
     v
───────────
 Crha
 Čečetka
 Drobný
 Chromečka
(4 rows)

Both results should be the same.

Isn't there a problem with case-sensitive collation names? When I use
the lc_collate value, I get an error message:

postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
ERROR:  collation "cs_CZ.UTF8" for current database encoding "UTF8"
does not exist
LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";

The problem arises when a table is created without an explicit collation.

Regards

Pavel Stehule


Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
> a) the default encoding for a collation isn't the same as the database's default encoding
> 
> This is unfriendly, to say the least - the most widely used encoding is
> UTF8, but in most cases users still have to write locale.utf8.

I don't understand what you are trying to say.  Please provide more
detail.

> b) there is a bug - the default collation (the database collation) is ignored
> 
> 
> postgres=# show lc_collate;
>  lc_collate
> ────────────
>  cs_CZ.UTF8
> (1 row)
> 
> Time: 0.518 ms
> postgres=# select * from jmena order by v;
>      v
> ───────────
>  Chromečka
>  Crha
>  Drobný
>  Čečetka
> (4 rows)
> 
> postgres=# select * from jmena order by v collate "cs_CZ.utf8";
>      v
> ───────────
>  Crha
>  Čečetka
>  Drobný
>  Chromečka
> (4 rows)
> 
> Both results should be the same.

I tried to reproduce this here but got the expected results.  Could you
try to isolate a complete test script?

> Isn't there a problem with case-sensitive collation names? When I use
> the lc_collate value, I get an error message:
> 
> postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
> ERROR:  collation "cs_CZ.UTF8" for current database encoding "UTF8"
> does not exist
> LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";
> 
> The problem arises when a table is created without an explicit collation.

Well, I agree it's not totally nice, but we have to do something, and I
think it's logical to use the locale names as collation names by
default, and collation names are SQL identifiers.  Do you have any ideas
for improving this?
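
To illustrate the identifier point with Pavel's example: collation names
follow the usual quoting rules, so the two statements below name two
distinct collations, and only the spelling imported by initdb exists.

select * from jmena order by v collate "cs_CZ.utf8";  -- the name as imported
select * from jmena order by v collate "cs_CZ.UTF8";  -- a different identifier; fails unless created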




Re: Per-column collation

From
Pavel Stehule
Date:
Hello

2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
> On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
>> a) the default encoding for a collation isn't the same as the database's default encoding
>>
>> This is unfriendly, to say the least - the most widely used encoding is
>> UTF8, but in most cases users still have to write locale.utf8.
>
> I don't understand what you are trying to say.  Please provide more
> detail.

See below.

>
>> b) there is a bug - the default collation (the database collation) is ignored
>>
>>
>> postgres=# show lc_collate;
>>  lc_collate
>> ────────────
>>  cs_CZ.UTF8
>> (1 row)
>>
>> Time: 0.518 ms
>> postgres=# select * from jmena order by v;
>>      v
>> ───────────
>>  Chromečka
>>  Crha
>>  Drobný
>>  Čečetka
>> (4 rows)
>>
>> postgres=# select * from jmena order by v collate "cs_CZ.utf8";
>>      v
>> ───────────
>>  Crha
>>  Čečetka
>>  Drobný
>>  Chromečka
>> (4 rows)
>>
>> Both results should be the same.
>
> I tried to reproduce this here but got the expected results.  Could you
> try to isolate a complete test script?
>

I can't reproduce it now either, on a different system and computer.
Maybe I did something wrong. Sorry.


>> Isn't there a problem with case-sensitive collation names? When I use
>> the lc_collate value, I get an error message:
>>
>> postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
>> ERROR:  collation "cs_CZ.UTF8" for current database encoding "UTF8"
>> does not exist
>> LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";
>>
>> The problem arises when a table is created without an explicit collation.
>
> Well, I agree it's not totally nice, but we have to do something, and I
> think it's logical to use the locale names as collation names by
> default, and collation names are SQL identifiers.  Do you have any ideas
> for improving this?

Yes - my first question is: why do we need to specify the encoding when
only one encoding is supported? I can't use cs_CZ.iso88592 when my DB
uses UTF8 - btw, this message is wrong:

yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR:  collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
                                           ^

I don't know why, but the preferred encoding for Czech is iso88592 now -
yet I can't use it - so I can't use the names "czech" or "cs_CZ". I
always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
that moment my application depends on the encoding used first - I can't
change the encoding without refactoring SQL statements, because the
encoding is hard-coded there (in the collation clause).

So I don't understand why you fill the pg_collation table with a
thousand collations that cannot be used. If I use UTF8, then there
should be just UTF8-based collations. And if you need to work with a
wide set of collations, then I am for preferring UTF8 - at least for
the central European region. If somebody wants to use collations here,
they will use a combination of cs, de, en - so they must use latin2 and
latin1, or UTF8. I think the encoding should not be part of the
collation name when possible.

Regards

Pavel







Re: Per-column collation

From
Peter Eisentraut
Date:
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
> Yes - my first question is: why do we need to specify the encoding when
> only one encoding is supported? I can't use cs_CZ.iso88592 when my DB
> uses UTF8 - btw, this message is wrong:
> 
> yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
> ERROR:  collation "cs_CZ.iso88592" for current database encoding
> "UTF8" does not exist
> LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
>                                            ^

Sorry, is there some mistake in that message?

> I don't know why, but the preferred encoding for Czech is iso88592 now -
> yet I can't use it - so I can't use the names "czech" or "cs_CZ". I
> always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
> that moment my application depends on the encoding used first - I can't
> change the encoding without refactoring SQL statements, because the
> encoding is hard-coded there (in the collation clause).

I can only look at the locales that the operating system provides.  We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop?  Locale names
on Windows look different too.  But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier?  You might hope, for example, that we could look through the
list of operating system locale names and map, say,

cs_CZ        -> "czech"
cs_CZ.iso88592  -> "czech"
cs_CZ.utf8      -> "czech"
czech           -> "czech"

but we have no way to actually know that these are semantically similar,
so this illustrated mapping is AI-complete.  We need to take the locale
names as is, and that may or may not carry encoding information.

> So I don't understand why you fill the pg_collation table with a
> thousand collations that cannot be used. If I use UTF8, then there
> should be just UTF8-based collations. And if you need to work with a
> wide set of collations, then I am for preferring UTF8 - at least for
> the central European region. If somebody wants to use collations here,
> they will use a combination of cs, de, en - so they must use latin2 and
> latin1, or UTF8. I think the encoding should not be part of the
> collation name when possible.

Different databases can have different encodings, but the pg_collation
catalog is copied from the template database in any case.  We can't make
any changes to system catalogs as we create new databases, so the
"useless" collations have to be there.  There are only a few hundred,
actually, so it's not really a lot of wasted space.
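
For the record, what got loaded can be inspected directly - a sketch,
assuming the patch's pg_collation columns collname and collencoding:

SELECT collencoding, count(*)
FROM pg_collation
GROUP BY collencoding
ORDER BY collencoding;

Rows whose collencoding differs from the current database encoding are
the "useless" ones mentioned above.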




Re: Per-column collation

From
Pavel Stehule
Date:
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
> On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
>> Yes - my first question is: why do we need to specify the encoding when
>> only one encoding is supported? I can't use cs_CZ.iso88592 when my DB
>> uses UTF8 - btw, this message is wrong:
>>
>> yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
>> ERROR:  collation "cs_CZ.iso88592" for current database encoding
>> "UTF8" does not exist
>> LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
>>                                            ^
>
> Sorry, is there some mistake in that message?
>

It is unclear - I expect something like "cannot use collation
cs_CZ.iso88592, because your database uses the UTF8 encoding".

>> I don't know why, but the preferred encoding for Czech is iso88592 now -
>> yet I can't use it - so I can't use the names "czech" or "cs_CZ". I
>> always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
>> that moment my application depends on the encoding used first - I can't
>> change the encoding without refactoring SQL statements, because the
>> encoding is hard-coded there (in the collation clause).
>
> I can only look at the locales that the operating system provides.  We
> could conceivably make some simplifications like stripping off the
> ".utf8", but then how far do we go and where do we stop?  Locale names
> on Windows look different too.  But in general, how do you suppose we
> should map an operating system locale name to an "acceptable" SQL
> identifier?  You might hope, for example, that we could look through the
> list of operating system locale names and map, say,
>
> cs_CZ           -> "czech"
> cs_CZ.iso88592  -> "czech"
> cs_CZ.utf8      -> "czech"
> czech           -> "czech"
>
> but we have no way to actually know that these are semantically similar,
> so this illustrated mapping is AI-complete.  We need to take the locale
> names as is, and that may or may not carry encoding information.
>
>> So I don't understand why you fill the pg_collation table with a
>> thousand collations that cannot be used. If I use UTF8, then there
>> should be just UTF8-based collations. And if you need to work with a
>> wide set of collations, then I am for preferring UTF8 - at least for
>> the central European region. If somebody wants to use collations here,
>> they will use a combination of cs, de, en - so they must use latin2 and
>> latin1, or UTF8. I think the encoding should not be part of the
>> collation name when possible.
>
> Different databases can have different encodings, but the pg_collation
> catalog is copied from the template database in any case.  We can't do
> any changes in system catalogs as we create new databases, so the
> "useless" collations have to be there.  There are only a few hundred,
> actually, so it's not really a lot of wasted space.
>

I don't have a problem with the size. I just think the current
behaviour isn't practical. When the database encoding is UTF8, I expect
"cs_CZ" or "czech" to mean UTF8. I understand that template0 must have
all the locales, and I understand why the current behaviour is what it
is, but it is very user-unfriendly. Actually, only old applications in
the Czech Republic use latin2, almost all use UTF-8, yet latin2 is now
the preferred one. This is bad and should be solved.

Regards

Pavel



Re: Per-column collation

From
marcin mank
Date:
> I can only look at the locales that the operating system provides.  We
> could conceivably make some simplifications like stripping off the
> ".utf8", but then how far do we go and where do we stop?  Locale names
> on Windows look different too.  But in general, how do you suppose we
> should map an operating system locale name to an "acceptable" SQL
> identifier?  You might hope, for example, that we could look through the

It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"

(that's how it works now, correct?)

Greetings
Marcin Mańk


Re: Per-column collation

From
Pavel Stehule
Date:
2010/11/16 marcin mank <marcin.mank@gmail.com>:
>> I can only look at the locales that the operating system provides.  We
>> could conceivably make some simplifications like stripping off the
>> ".utf8", but then how far do we go and where do we stop?  Locale names
>> on Windows look different too.  But in general, how do you suppose we
>> should map an operating system locale name to an "acceptable" SQL
>> identifier?  You might hope, for example, that we could look through the
>
> It would be nice if we could have some mapping of locale names built
> in, so one doesn't have to write alternative SQL depending on the DB
> server OS:

+1

Pavel

> select * from tab order by foo collate "Polish, Poland"
> select * from tab order by foo collate "pl_PL.UTF-8"
>
> (that's how it works now, correct?)
>
> Greetings
> Marcin Mańk
>


Re: Per-column collation

From
Peter Eisentraut
Date:
On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
> 2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
> > On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
> >> Yes - my first question is: why do we need to specify the encoding when
> >> only one encoding is supported? I can't use cs_CZ.iso88592 when my DB
> >> uses UTF8 - btw, this message is wrong:
> >>
> >> yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
> >> ERROR:  collation "cs_CZ.iso88592" for current database encoding
> >> "UTF8" does not exist
> >> LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
> >>                                            ^
> >
> > Sorry, is there some mistake in that message?
> >
> 
> It is unclear - I expect something like "cannot use collation
> cs_CZ.iso88592, because your database uses the UTF8 encoding".

No, the namespace for collations is per encoding.  (This is per SQL
standard.)  So you *could* have a collation called "cs_CZ.iso88592" for
the UTF8 encoding.
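
For illustration - again assuming the patch's collname/collencoding
columns - the per-encoding namespace is visible in the catalog, where one
name may appear once per encoding:

SELECT collname, collencoding
FROM pg_collation
WHERE collname LIKE 'cs_CZ%'
ORDER BY collname, collencoding;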

> I don't have a problem with the size. I just think the current
> behaviour isn't practical. When the database encoding is UTF8, I expect
> "cs_CZ" or "czech" to mean UTF8. I understand that template0 must have
> all the locales, and I understand why the current behaviour is what it
> is, but it is very user-unfriendly. Actually, only old applications in
> the Czech Republic use latin2, almost all use UTF-8, yet latin2 is now
> the preferred one. This is bad and should be solved.

Again, we can only look at the locale names that the operating system
gives us.  Mapping that to the names you expect is an AI problem.  If
you have a solution, please share it.




Re: Per-column collation

From
Peter Eisentraut
Date:
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
> It would be nice if we could have some mapping of locale names built
> in, so one doesn't have to write alternative SQL depending on the DB
> server OS:
> select * from tab order by foo collate "Polish, Poland"
> select * from tab order by foo collate "pl_PL.UTF-8"

Sure that would be nice, but how do you hope to do that?



Re: Per-column collation

From
Pavel Stehule
Date:
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
> On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
>> 2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
>> > On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
>> >> Yes - my first question is: why do we need to specify the encoding when
>> >> only one encoding is supported? I can't use cs_CZ.iso88592 when my DB
>> >> uses UTF8 - btw, this message is wrong:
>> >>
>> >> yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
>> >> ERROR:  collation "cs_CZ.iso88592" for current database encoding
>> >> "UTF8" does not exist
>> >> LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
>> >>                                            ^
>> >
>> > Sorry, is there some mistake in that message?
>> >
>>
>> It is unclear - I expect something like "cannot use collation
>> cs_CZ.iso88592, because your database uses the UTF8 encoding".
>
> No, the namespace for collations is per encoding.  (This is per SQL
> standard.)  So you *could* have a collation called "cs_CZ.iso88592" for
> the UTF8 encoding.
>
>> I don't have a problem with the size. I just think the current
>> behaviour isn't practical. When the database encoding is UTF8, I expect
>> "cs_CZ" or "czech" to mean UTF8. I understand that template0 must have
>> all the locales, and I understand why the current behaviour is what it
>> is, but it is very user-unfriendly. Actually, only old applications in
>> the Czech Republic use latin2, almost all use UTF-8, yet latin2 is now
>> the preferred one. This is bad and should be solved.
>
> Again, we can only look at the locale names that the operating system
> gives us.  Mapping that to the names you expect is an AI problem.  If
> you have a solution, please share it.
>

OK, then we should be able to define this alias manually,

something like - CREATE COLLATE "czech" FOR LOCALE "cs_CZ.UTF8"

or something similar. Without this, applications or stored procedures
can be non-portable between UNIX and Windows.

Peter, initdb now checks the relation between encoding and locale - and
this check is portable. Can we use this code?

Pavel



Re: Per-column collation

From
Peter Eisentraut
Date:
On tis, 2010-11-16 at 21:40 +0100, Pavel Stehule wrote:
> OK, then we should be able to define this alias manually,
> 
> something like - CREATE COLLATE "czech" FOR LOCALE "cs_CZ.UTF8"
> 
> or something similar. Without this, applications or stored procedures
> can be non-portable between UNIX and Windows.

Yes, such a command will be provided.  You can already do it manually.
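
One possible shape for it, as a sketch (the actual syntax was still
undecided at this point; "czech" is just the alias Pavel asked for):

CREATE COLLATION czech FROM "cs_CZ.utf8";
SELECT * FROM jmena ORDER BY v COLLATE czech;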

> Peter, initdb now checks the relation between encoding and locale - and
> this check is portable. Can we use this code?

Hmm, not really, but something similar, I suppose.  Only that the
mapping list would be much longer and more volatile.




Re: Per-column collation

From
Martijn van Oosterhout
Date:
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
> On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
> > It would be nice if we could have some mapping of locale names built
> > in, so one doesn't have to write alternative SQL depending on the DB
> > server OS:
> > select * from tab order by foo collate "Polish, Poland"
> > select * from tab order by foo collate "pl_PL.UTF-8"
>
> Sure that would be nice, but how do you hope to do that?

Given that each operating system comes with a different set of
collations, it seems unlikely you could find two collations on
different OSes that even correspond. There's not a lot of
standardisation here (well, except for the Unicode collation
algorithm, but that doesn't help with language variations).

I don't think this is a big deal for now, perhaps after per-column
collation is implemented we can work on the portability issues. Make it
work, then make it better.

</me ducks>

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Per-column collation

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
>> On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
>>> It would be nice if we could have some mapping of locale names built
>>> in, so one doesn't have to write alternative SQL depending on the DB
>>> server OS:

>> Sure that would be nice, but how do you hope to do that?

> Given that each operating system comes with a different set of
> collations, it seems unlikely you could find two collations on
> different OSes that even correspond.

Yeah, the *real* portability problem here is that the locale behavior is
likely to be different, not just the name.  I don't think we'd be doing
people many favors by masking behavioral differences behind a forced
common name.
        regards, tom lane


Re: Per-column collation

From
Pavel Stehule
Date:
2010/11/16 Tom Lane <tgl@sss.pgh.pa.us>:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
>>> On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
>>>> It would be nice if we could have some mapping of locale names built
>>>> in, so one doesn't have to write alternative SQL depending on the DB
>>>> server OS:
>
>>> Sure that would be nice, but how do you hope to do that?
>
>> Given that each operating system comes with a different set of
>> collations, it seems unlikely you could even find two collations on
>> different OSes that even correspond.
>
> Yeah, the *real* portability problem here is that the locale behavior is
> likely to be different, not just the name.  I don't think we'd be doing
> people many favors by masking behavioral differences behind a forced
> common name.
>

No, at minimum cs_CZ.utf8 and cs_CZ.iso88592 behave the same.
But without an "alias", users have to modify their source code when
they change the encoding.

Pavel



Re: Per-column collation

From
Heikki Linnakangas
Date:
On 15.11.2010 21:42, Peter Eisentraut wrote:
> On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
>> I am checking a patch. I found a problem with initdb
>
> Ah, late night brain farts, it appears.  Here is a corrected version.

Some random comments:

In syntax.sgml:

> +    The <literal>COLLATE</literal> clause overrides the collation of
> +    an expression.  It is appended to the expression at applies to:

That last sentence doesn't parse.


Would it be possible to eliminate the ExecEvalCollateClause function 
somehow? It just calls through the argument. How about directly 
returning the argument ExprState in ExecInitExpr?

get_collation_name() returns the plain name without schema, so it's not 
good enough for use in ruleutils.c. pg_dump is also ignoring collation's 
schema.

Have you done any performance testing? Functions like text_cmp can be a 
hotspot in sorting, so any extra overhead there might be show up in tests.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Per-column collation

From
Itagaki Takahiro
Date:
On Tue, Nov 16, 2010 at 04:42, Peter Eisentraut <peter_e@gmx.net> wrote:
> On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
>> I am checking a patch. I found a problem with initdb
> Ah, late night brain farts, it appears.  Here is a corrected version.

This version cannot be applied cleanly any more. Please update it.
(I think you don't have to include changes for catversion.h)
./src/backend/optimizer/util/plancat.c.rej
./src/backend/optimizer/plan/createplan.c.rej
./src/backend/optimizer/path/indxpath.c.rej
./src/include/catalog/catversion.h.rej

I didn't compile or run the patched server, but I found a couple of
issues in the design and source code:

* COLLATE information must be explicitly passed by the caller in the patch,
but we might forget the handover when we write new code. Is it possible
to pass it automatically, say using a global variable? If we could do so,
existing extensions might work with collation without being rewritten.

* Did you check the regression test on Windows? We probably cannot use
en_US.utf8 on Windows. Also, some output of the test includes non-ASCII
characters. How will we test the COLLATE feature on non-UTF8 databases?

[src/test/regress/sql/collate.sql]
+CREATE TABLE collate_test1 (
+    a int,
+    b text COLLATE "en_US.utf8" not null
+);

* Did you see any performance regression by collation?
I found a bug in lc_collate_is_c(); result >= 0 should be
checked before any other checks. SearchSysCache1() here
would be a performance regression.

[src/backend/utils/adt/pg_locale.c]
-lc_collate_is_c(void)
+lc_collate_is_c(Oid collation)
+{
...
+        tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collation));
...
HERE =>    if (result >= 0)
               return (bool) result;

--
Itagaki Takahiro


Re: Per-column collation

From
Peter Eisentraut
Date:
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote:
> Have you done any performance testing? Functions like text_cmp can be
> a hotspot in sorting, so any extra overhead there might be show up in
> tests.

Without having optimized it very much yet, the performance for a 1GB
ORDER BY is

* without COLLATE clause: about the same as without the patch

* with COLLATE clause: about 30%-50% slower

I can imagine that there is some optimization potential in the latter
case.  But in any case, it's not awfully slow.




Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
> * COLLATE information must be explicitly passed by the caller in the patch,
> but we might forget the handover when we write new code. Is it possible
> to pass it automatically, say using a global variable? If we could do so,
> existing extensions might work with collation without being rewritten.

I don't see how that is supposed to work.  I understand the concern, but
the system is fairly robust against this becoming a problem.

> * Did you check the regression test on Windows? We probably cannot use
> en_US.utf8 on Windows. Also, some output of the test includes non-ASCII
> characters. How will we test the COLLATE feature on non-UTF8 databases?

I attempted to discuss this here:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00464.php

For lack of a solution, the approach now is to run the regression test
file manually, and we provide separate test files for as many platforms
and encodings or whatever we want to support.




Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
> * Did you see any performance regression by collation?
> I found a bug in lc_collate_is_c(); result >= 0 should be
> checked before any other checks. SearchSysCache1() here
> would be a performance regression.

That code turned out to be buggy anyway, because it was using the
"result" cache variable independent of the collation parameter.

I did some profiling with this now.  The problem is that this function
lc_collate_is_c() would need to cache the "C-ness" property for any
number of collations.  Depending on what call pattern you expect or want
to optimize for, you might end up caching most of the pg_collation
catalog, which is actually the mandate of SearchSysCache, but the
profile shows that SearchSysCache takes a large chunk of the additional
run time.

If I remove that branch altogether, that is, don't treat the C locale
specially at all in the nondefault collation case, then using non-C
locales as nondefault collation is almost as fast as using non-C locales
as default collation.  However, using the C locale as a nondefault
collation would then be quite slow (still faster than non-C locales).

The solution would perhaps be a custom, lightweight caching system, but
I haven't thought of one yet.
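
To illustrate the kind of thing that might qualify, a one-entry memo in
front of the syscache lookup - a sketch only, not code from the patch,
and the catalog-lookup helper is assumed:

/* Remember the verdict for the most recently seen collation OID. */
static Oid  cached_coll = InvalidOid;
static bool cached_is_c = false;

bool
lc_collate_is_c(Oid collation)
{
    if (collation == cached_coll)   /* hit: no catalog access at all */
        return cached_is_c;
    /* assumed helper wrapping the SearchSysCache1(COLLOID, ...) lookup */
    cached_is_c = collation_is_c_from_catalog(collation);
    cached_coll = collation;
    return cached_is_c;
}

That would favor sorting a whole column under a single collation, but
thrash when two collations alternate within one query.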



Re: Per-column collation

From
Peter Eisentraut
Date:
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote:
> On 15.11.2010 21:42, Peter Eisentraut wrote:
> > On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
> >> I am checking a patch. I found a problem with initdb
> >
> > Ah, late night brain farts, it appears.  Here is a corrected version.
> 
> Some random comments:
> 
> In syntax.sgml:
> 
> > +    The <literal>COLLATE</literal> clause overrides the collation of
> > +    an expression.  It is appended to the expression at applies to:
> 
> That last sentence doesn't parse.

Fixed in the patch I'm about to send out.

> Would it be possible to eliminate the ExecEvalCollateClause function 
> somehow? It just calls through the argument. How about directly 
> returning the argument ExprState in ExecInitExpr?

This currently mirrors the handling of RelabelType.  Perhaps both could
be improved, but that is an independent issue.

> get_collation_name() returns the plain name without schema, so it's not 
> good enough for use in ruleutils.c. pg_dump is also ignoring collation's 
> schema.

Fixed schema handling in ruleutils and pg_dump in new patch.




Re: Per-column collation

From
Peter Eisentraut
Date:
On ons, 2010-11-24 at 22:22 +0200, Peter Eisentraut wrote:
> On mån, 2010-11-22 at 11:58 +0900, Itagaki Takahiro wrote:
> > * Did you see any performance regression by collation?
> > I found a bug in lc_collate_is_c(); result >= 0 should be
> > checked before any other checks. SearchSysCache1() here
> > would be a performance regression.
> 
> That code turned out to be buggy anyway, because it was using the
> "result" cache variable independent of the collation parameter.

Since I don't have a short-term solution for this, I have ripped out the
caching of C-ness for nondefault locales.

> I did some profiling with this now.  The problem is that this function
> lc_collate_is_c() would need to cache the "C-ness" property for any
> number of collations.  Depending on what call pattern you expect or want
> to optimize for, you might end up caching most of the pg_collation
> catalog, which is actually the mandate of SearchSysCache, but the
> profile shows that SearchSysCache takes a large chunk of the additional
> run time.
> 
> If I remove that branch altogether, that is, don't treat the C locale
> specially at all in the nondefault collation case, then using non-C
> locales as nondefault collation is almost as fast as using non-C locales
> as default collation.  However, using the C locale as a nondefault
> collation would then be quite slow (still faster than non-C locales).
> 
> The solution would perhaps be a custom, lightweight caching system, but
> I haven't thought of one yet.




Re: Per-column collation

From
Peter Eisentraut
Date:
Here is an updated patch to address the issues discussed during this
commitfest.


Attachment

Re: Per-column collation

From
Itagaki Takahiro
Date:
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote:
> Here is an updated patch to address the issues discussed during this
> commitfest.

Here are comments and questions after I tested the latest patch:

==== Issues ====
* initdb itself seems to succeed, but it prints "could not determine
encoding for locale" messages for any combination of encoding=utf8/eucjp
and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this expected behavior?
----
creating collations ...initdb: locale name has non-ASCII characters,
skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8"
... (a dozen lines) ...
could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1"
ok
----

* contrib/citext raises an encoding error when COLLATE is specified,
even if the collation is the same as the database default.
We might need some special treatment for the C locale.
=# SHOW lc_collate;  ==> C
=# SELECT ('A'::citext) = ('a'::citext);  ==> false
=# SELECT ('A'::citext) = ('a'::citext) COLLATE "C";
ERROR:  invalid multibyte character for locale
HINT:  The server's LC_CTYPE locale is probably incompatible with the
database encoding.

* pg_dump would generate unportable files for different platforms,
because the collation names are platform-specific.

==== Source codes ====
* PG_GETARG_COLLATION might be a better name than PG_GET_COLLATION.

* What is the difference between InvalidOid and DEFAULT_COLLATION_OID
for collation OIDs? The patch replaces DirectFunctionCall with
DirectFunctionCallC in some places, but we could shrink the diff size
if we could use InvalidOid instead of DEFAULT_COLLATION_OID.

* I still think explicitly passing collations from function to function
is horrible because we might forget it in some places, and almost no
existing third-party module will work.  Is it possible to make it a global
variable, and push/pop the state when it changes? Sorry if I'm missing
something, but I think we could treat the collation setting like a GUC
setting.

--
Itagaki Takahiro


Re: Per-column collation

From
"David E. Wheeler"
Date:
On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote:

> * contrib/citext raises an encoding error when COLLATE is specified,
> even if the collation is the same as the database default.
> We might need some special treatment for the C locale.

I've been wondering if this patch will support case-insensitive
collations. If so, then citext should probably be revised to use one.

Best,

David



Re: Per-column collation

From
Pavel Stehule
Date:
2010/12/6 David E. Wheeler <david@kineticode.com>:
> On Dec 6, 2010, at 4:06 AM, Itagaki Takahiro wrote:
>
>> * contrib/citext raises an encoding error when COLLATE is specified,
>> even if the collation is the same as the database default.
>> We might need some special treatment for the C locale.
>
> I've been wondering if this patch will support case-insensitive
> collations. If so, then citext should probably be revised to use one.

As far as I know - no. It supports only system-provided collations.

Pavel



Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-12-06 at 21:06 +0900, Itagaki Takahiro wrote:
> On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote:
> > Here is an updated patch to address the issues discussed during this
> > commitfest.
> 
> Here are comments and questions after I tested the latest patch:
> 
> ==== Issues ====
> * initdb itself seems to succeed, but it prints "could not determine
> encoding for locale" messages for any combination of encoding=utf8/eucjp
> and locale=ja_JP.utf8/ja_JP.eucjp/C. Is this expected behavior?
> ----
> creating collations ...initdb: locale name has non-ASCII characters,
> skipped: bokmål
> initdb: locale name has non-ASCII characters, skipped: français
> could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8"
> ... (a dozen lines) ...
> could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1"
> ok
> ----

What this does is take the output of locale -a and populate the
pg_collation catalog with the locales it finds.  When it finds an
operating system locale that uses an encoding that is not recognized,
you will see this warning.

I understand that that would probably annoy users.  We could hide the
warning and silently skip those locales.  But then could that hide
genuine configuration problems?

> * contrib/citext raises an encoding error when COLLATE is specified,
> even if the collation is the same as the database default.
> We might need some special treatment for the C locale.
> =# SHOW lc_collate;  ==> C
> =# SELECT ('A'::citext) = ('a'::citext);  ==> false
> =# SELECT ('A'::citext) = ('a'::citext) COLLATE "C";
> ERROR:  invalid multibyte character for locale
> HINT:  The server's LC_CTYPE locale is probably incompatible with the
> database encoding.

OK, I can reproduce that.  That's fallout from the lc_ctype_is_c()
optimization that I removed, as explained in another email.  I'll have
to think about that again.

> * pg_dump would generate unportable files for different platforms,
> because the collation names are platform-specific.

pg_dump can already produce unportable files for a number of other
reasons, including per-database locale, tablespaces, OS-dependent
configuration settings.

The way I imagine this working is that someone who wants to design a
genuinely portable application using this feature would create their own
collation based on the existing, OS-specific collation (using a
to-be-added CREATE COLLATION command).  As mentioned earlier, however,
we can't actually solve the problem that the OS locales may not behave
the same across systems.
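
In other words, the intended workflow is roughly this sketch (the CREATE
COLLATION syntax shown is tentative, and the locale string is the Linux
spelling; on Windows it would differ):

-- run once per installation, with the platform's spelling of the locale:
CREATE COLLATION app_czech (LOCALE = 'cs_CZ.utf8');
-- application SQL then references only the stable name:
CREATE TABLE jmena (v text COLLATE app_czech);
SELECT * FROM jmena ORDER BY v;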

> ==== Source codes ====
> * PG_GETARG_COLLATION might be a better name than PG_GET_COLLATION.

It's not the collation of a function argument, it's the collation of a
function call.  (You could conceivably also fetch the collation of a
function argument, but that isn't used in any way.)

> * What is the difference between InvalidOid and DEFAULT_COLLATION_OID
> for collation OIDs? The patch replaces DirectFunctionCall with
> DirectFunctionCallC in some places, but we could shrink the diff size
> if we could use InvalidOid instead of DEFAULT_COLLATION_OID.

Think of DEFAULT_COLLATION_OID as analogous to UNKNOWNOID.  A long time
ago we used InvalidOid for all kinds of types, including unknown,
pseudotypes, cstring, and no type at all.  The reason we changed this
was that this masked errors and made processing of the unknown type
difficult/impossible.  I know this makes the code bigger, but it's
necessary.  I originally coded the patch using InvalidOid for
everything, but that wasn't very robust.

This also ties into the next question ...

> * I still think explicitly passing collations from function to function
> is horrible because we might forget it in some places, and almost no
> existing third-party module will work.  Is it possible to make it a global
> variable, and push/pop the state when it changes? Sorry if I'm missing
> something, but I think we could treat the collation setting like a GUC
> setting.

A collation is a property of a datum or an expression.  You might as
well argue that we don't keep track of types of expressions and instead
store it globally.  Doesn't make sense.

Extensions are not required to support collations.  Those that might
want to will usually end up calling one of the locale-enabled functions
such as varstr_cmp(), and there the function prototype will ensure that
specifying a collation cannot be missed.
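
For illustration, a minimal collation-aware comparison function in an
extension comes out roughly like this (a hypothetical function; the fmgr
macros and the varstr_cmp() signature are as introduced by the patch):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_text_cmp);

Datum
my_text_cmp(PG_FUNCTION_ARGS)
{
    text   *a = PG_GETARG_TEXT_PP(0);
    text   *b = PG_GETARG_TEXT_PP(1);

    /* PG_GET_COLLATION() hands over the call site's collation; since
     * varstr_cmp() takes it as a required argument, forgetting it is a
     * compile error rather than silent misbehavior. */
    PG_RETURN_INT32(varstr_cmp(VARDATA_ANY(a), VARSIZE_ANY_EXHDR(a),
                               VARDATA_ANY(b), VARSIZE_ANY_EXHDR(b),
                               PG_GET_COLLATION()));
}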

Additionally, the distinction of InvalidOid and DEFAULT_COLLATION_OID
does a great deal to ensure that in case a collation is unspecified or
missing in some new code, you will get a proper error message instead of
unspecified behavior.




Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote:
> I've been wondering if this patch will support case-insensitive
> collations. If so, then citext should probably be revised to use one.

This has been touched upon several times during the discussions on past
patches.

Essentially, the current patch only arranges that you can specify a sort
order for data.  The system always breaks ties using a binary
comparison.  This could conceivably be changed, but it's a separate
problem.  Some of the necessary investigation work has presumably
already been done in the context of citext.




Re: Per-column collation

From
"David E. Wheeler"
Date:
On Dec 6, 2010, at 11:29 AM, Peter Eisentraut wrote:

> This has been touched upon several times during the discussions on past
> patches.
> 
> Essentially, the current patch only arranges that you can specify a sort
> order for data.  The system always breaks ties using a binary
> comparison.  This could conceivably be changed, but it's a separate
> problem.  Some of the necessary investigation work has presumably
> already been done in the context of citext.

Okay, thanks, good to know.

Best,

David



Re: Per-column collation

From
Alexandre Riveira
Date:
Please

It would be very important to us that LIKE worked with the Brazilian
collation, and if possible case-insensitively and accent-insensitively.

Thanks

Alexandre Riveira
Brazil

Peter Eisentraut wrote:
> On mån, 2010-12-06 at 10:01 -0800, David E. Wheeler wrote:
>   
>> I've been wondering if this patch will support case-insensitive
>> collations. If so, then citext should probably be revised to use one.
>
> This has been touched upon several times during the discussions on past
> patches.
>
> Essentially, the current patch only arranges that you can specify a sort
> order for data.  The system always breaks ties using a binary
> comparison.  This could conceivably be changed, but it's a separate
> problem.  Some of the necessary investigation work has presumably
> already been done in the context of citext.
>
>
>
>   



Re: Per-column collation

From
Itagaki Takahiro
Date:
On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote:
> Here is an updated patch to address the issues discussed during this
> commitfest.

I found another issue in the patch; ILIKE in the WHERE clause doesn't work.
It was surprising because LIKE in the WHERE clause and ILIKE in the SELECT
list work as expected.
 - SELECT * FROM pg_class WHERE relname LIKE 'pg%'
 - SELECT relname ILIKE 'pg%' FROM pg_class;

----
postgres=# SELECT name, setting FROM pg_settings
 WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding');
      name       | setting
-----------------+---------
 lc_collate      | C
 lc_ctype        | C
 server_encoding | UTF8
(3 rows)

postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%';
ERROR:  no collation was derived
----

-- 
Itagaki Takahiro


Re: Per-column collation

From
Peter Eisentraut
Date:
On tis, 2010-12-07 at 11:46 +0900, Itagaki Takahiro wrote:
> On Sun, Dec 5, 2010 at 01:04, Peter Eisentraut <peter_e@gmx.net> wrote:
> > Here is an updated patch to address the issues discussed during this
> > commitfest.
> 
> I found another issue in the patch; ILIKE in the WHERE clause doesn't work.
> It was surprising because LIKE in the WHERE clause and ILIKE in the SELECT
> list work as expected.
>  - SELECT * FROM pg_class WHERE relname LIKE 'pg%'
>  - SELECT relname ILIKE 'pg%' FROM pg_class;
> 
> ----
> postgres=# SELECT name, setting FROM pg_settings
>  WHERE name IN ('lc_ctype', 'lc_collate', 'server_encoding');
>       name       | setting
> -----------------+---------
>  lc_collate      | C
>  lc_ctype        | C
>  server_encoding | UTF8
> (3 rows)
> 
> postgres=# SELECT * FROM pg_class WHERE relname ILIKE 'pg%';
> ERROR:  no collation was derived

This is fixed in the 20101213 patch I'm about to send out.



Re: Per-column collation

From
Peter Eisentraut
Date:
On mån, 2010-12-06 at 21:26 +0200, Peter Eisentraut wrote:
> 
> > * contrib/citext raises an encoding error when COLLATE is specified,
> > even if the collation is the same as the database default.
> > We might need some special treatment for the C locale.
> > =# SHOW lc_collate;  ==> C
> > =# SELECT ('A'::citext) = ('a'::citext);  ==> false
> > =# SELECT ('A'::citext) = ('a'::citext) COLLATE "C";
> > ERROR:  invalid multibyte character for locale
> > HINT:  The server's LC_CTYPE locale is probably incompatible with the
> > database encoding.
> 
> OK, I can reproduce that.  That's fallout from the lc_ctype_is_c()
> optimization that I removed, as explained in another email.  I'll have
> to think about that again.

This is fixed in the 20101213 patch I'm about to send out.



Re: Per-column collation

From
Peter Eisentraut
Date:
On lör, 2010-12-04 at 18:04 +0200, Peter Eisentraut wrote:
> Here is an updated patch to address the issues discussed during this
> commitfest.

And another one, that fixes the problems pointed out since.


Attachment

Re: Per-column collation

From
Robert Haas
Date:
On Sun, Dec 12, 2010 at 5:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On lör, 2010-12-04 at 18:04 +0200, Peter Eisentraut wrote:
>> Here is an updated patch to address the issues discussed during this
>> commitfest.
>
> And another one, that fixes the problems pointed out since.

I don't really have a position on whether or not this patch is ready
to commit... but I do think that this is the sort of patch that is
very likely to have some bugs almost no matter when we commit it.
Therefore, I would argue that we ought not to let the process of
getting it committed drag out too long.  Even if we find 3 more bugs,
we might have found 10 more bugs if it had been in the tree for the
same time period.  Of course, any conceptual issues must be worked out
before commit.  But there are likely to be some loose ends however we
do it.  So when we think it's pretty close, we should move forward.
All IMHO, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Per-column collation

From
Greg Smith
Date:
Robert Haas wrote:
> I don't really have a position on whether or not this patch is ready
> to commit... but I do think that this is the sort of patch that is
> very likely to have some bugs almost no matter when we commit it

I just updated the CF app to track Peter's latest update, which remains 
untested by anyone else for whether it fixes all the issues brought up.  
It would be nice to get a re-review to confirm things are still working 
in advance of CF 2011-01.  Given the reviewer here is also a committer, 
that means it's possible this can go into the tree after that if 
everything checks out even outside of the regular CF schedule.  In the 
interest of closing out this CF, I'm updating this one as returned for 
now though.  That doesn't prevent it from going in anyway once it's 
confirmed ready, and I agree the sooner the better to help find 
breakage.  But I don't think that's so important that it should block 
the critical path for the next alpha.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us




Re: Per-column collation

From
Itagaki Takahiro
Date:
On Thu, Dec 16, 2010 at 19:37, Greg Smith <greg@2ndquadrant.com> wrote:
> I just updated the CF app to track Peter's latest update, which remains
> untested by anyone else for whether it fixes all the issues brought up.  It
> would be nice to get a re-review to confirm things are still working in
> advance of CF 2011-01.

Sorry, I don't have enough time to review it. Also, multiple reviewers
using different kinds of settings would be preferable for such a feature.

# We might need "previous reviewers" and "active reviewers" in commit-fest
# app. Or, should non-active reviewers delete their names?

I tested it at the program level, but a code-level review is still needed.
I'm worried about whether "no holes" checks are required in the COLLATE
hand-off from function to function. If so, the review will be harder
work than if the framework guaranteed the hand-off.

--
Itagaki Takahiro


Re: Per-column collation

From
Greg Smith
Date:
Itagaki Takahiro wrote:
> We might need "previous reviewers" and "active reviewers" in commit-fest
> app. Or, should non-active reviewers delete their names?
>   

This is only really an issue with patches that get moved from one CF to 
the next, which doesn't happen that often.  Patches that are marked 
"Returned With Feedback" get a new entry in the next CF instead, 
which avoids this problem.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books