Thread: Open issues for collations

Open issues for collations

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think some discussion of which of the things on the open
> item lists need to be done before beta might be helpful, and we ought
> to add any items that are not there but are blockers.

Here's a quick enumeration of some things I think need discussion about
the collations patch:

* Are we happy yet with the collation assignment behavior (see
parse_collate.c)?  A couple of specific subtopics:

** Selecting a field from a record-returning function's output.
Currently, we'll use the field's declared collation; except that
if the field has default collation, we'll replace that with the common
collation of the function's inputs, if any.  Is either part of that
sane?  Do we need to make this work for functions invoked with other
syntax than a plain function call, eg operator or cast syntax?

** What to do with domains whose declaration includes a COLLATE clause?
Currently, we'll impute that collation to the result of a cast to the
domain type --- even if the cast's input expression includes an
explicit COLLATE clause.  It's not clear that that's per spec.  If it
is correct, should we behave similarly for functions that are declared
to return a domain type?  Should it matter if the cast-to-domain is
explicit or implicit?  Perhaps it'd be best if domain collations only
mattered for columns declared with that domain type.  Then we'd have
a general rule that collations only come into play in an expression
as a result of (a) the declared type of a column reference or (b)
an explicit COLLATE clause.


* In plpgsql, is it OK for declared local variables to inherit the
function's input collation?  Should we provide a COLLATE option in
variable declarations to let that be overridden?  If Oracle understands
COLLATE, probably we should look at what they do in PL/SQL.

* RI triggers should insert COLLATE clauses in generated queries to
satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
referenced column's collation.  Right now you may get either table's
collation depending on which query type is involved.  I think an obvious
failure may not be possible so long as equality means the same thing in
all collations, but it's definitely possible that the planner might
decide it can't use the referenced column's unique index, which would
suck for performance.  (Note: this rule seems to prove that the
committee assumes equality can mean different things in different
collations, else they'd not have felt the need to specify.)

* It'd sure be nice if we had some nontrivial test cases that work in
encodings besides UTF8.  I'm still bothered that the committed patch
failed to cover single-byte-encoding cases in upper/lower/initcap.

* Remove initdb's warning about useless locales?  Seems like pointless
noise, or at least something that can be relegated to debug mode.

* Is it worth adding a cares-about-collation flag to pg_proc?  Probably
too late to be worrying about such refinements for 9.1.

There are a bunch of other minor issues that I'm still working through,
but these are the ones that seem to merit discussion.
        regards, tom lane


Re: Open issues for collations

From
Simon Riggs
Date:
On Sat, Mar 26, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think some discussion of which of the things on the open
>> item lists need to be done before beta might be helpful, and we ought
>> to add any items that are not there but are blockers.
>
> Here's a quick enumeration of some things I think need discussion about
> the collations patch:
>
> * Are we happy yet with the collation assignment behavior (see
> parse_collate.c)?  A couple of specific subtopics:
>
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?
>
> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause.  It's not clear that that's per spec.  If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type?  Should it matter if the cast-to-domain is
> explicit or implicit?  Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type.  Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.
>
>
> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation?  Should we provide a COLLATE option in
> variable declarations to let that be overridden?  If Oracle understands
> COLLATE, probably we should look at what they do in PL/SQL.
>
> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)
>
> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.
>
> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.
>
> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.
>
> There are a bunch of other minor issues that I'm still working through,
> but these are the ones that seem to merit discussion.

That's a long list and I think it's clear that we won't resolve all of
those issues to everybody's satisfaction in a single release, let
alone in next week or so. We need a way forwards.

What I think we should do is add detailed documentation on how it
works now. There are many people that would love to help, but not
everybody can visualise exactly the points you are making above, I
would confess that I can't. Having docs that clearly explain a neat
new capability and the various possible gotcha/caveats will help
others come up with test cases and ideas.

It seems to me likely that in real usage many of those gotchas will
drop away because they represent unlikely or perverse use cases.

I don't see anything bad in releasing software that has unresolved
questions, as long as those items are clearly flagged up and we
specifically ask for feedback on them. That looks to me to be a
many-eyeballs approach to the problem.
http://en.wikipedia.org/wiki/Linus%27_Law

Tucking our shoelaces into our shoes doesn't mean the loose ends are
resolved fully.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Open issues for collations

From
Greg Stark
Date:
On Sat, Mar 26, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

Either of those sounds reasonable but the combination seems weird.
Some example functions might help:

list_words('foo bar bar') -> ('foo'), ('bar'), ('baz')
fetch_users_by_lastname('Smith') -> (1,'John','Smith','Great
Britain','GB'), (2,'Jürgen','Smith','DE')
fetch_users_by_countrycode('DE') -> (2,'Jürgen','Smith','DE')

The first looks like it should definitely inherit. The second the
result set is heterogeneous and inheriting might be the best
compromise but it would produce very strange results for columns like
the country-code which should just use their defined collation of C.
The third case inheriting the country code's collation of C would be
very strange and definitely wrong.

It occurs to me that if we have any inherited cases people might come
to depend on that behaviour and there would be no out for us. Whereas
if we say record return values always use the record type's field's
collations then we could always later add a collation of type
_inherited or _anycollation or some such that indicated that that
column should inherit the arguments' collation and it wouldn't affect
any existing code.



--
greg


Re: Open issues for collations

From
Robert Haas
Date:
On Mar 26, 2011, at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

I am not an expert on this topic in any way. That having been said, the first part of that rule seems quite sane. The
secondpart seems less clear, but probably also sane. 

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause.

I would have thought that an explicit COLLATE clause would trump any action at a distance.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation?  Should we provide a COLLATE option in
> variable declarations to let that be overridden?  If Oracle understands
> COLLATE, probably we should look at what they do in PL/SQL.

I don't know what Oracle does, but a collate option in variable declarations seems like a very good idea.  Inheriting
theinput collation if not specified seems good too. I also suspect we might need something like COLLATE FROM $1, but
maybethat's a 9.2 feature. 

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

No idea what to do about this.

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

Or this.

> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.

+1.

> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.

Depends how much knock-on work it'll create.

...Robert

Re: Open issues for collations

From
Greg Stark
Date:
On Sat, Mar 26, 2011 at 3:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> ** What to do with domains whose declaration includes a COLLATE clause?
>> Currently, we'll impute that collation to the result of a cast to the
>> domain type --- even if the cast's input expression includes an
>> explicit COLLATE clause.
>
> I would have thought that an explicit COLLATE clause would trump any action at a distance.
>

I think an explicit COLLATE *outside* the cast would. But inside the
cast? The question comes down to whether a domain with a collate
clause is explicitly providing a collation or implicitly.

So again, examples:

CREATE DOMAIN name AS text COLLATE english;
CREATE DOMAIN countrycode AS char(2) COLLATE C;


1) SELECT * from users where country = 'DE' order by first_name COLLATE german;

2) SELECT * from users where country = 'DE' order by
(tolower(first_name) COLLATE german)::name;

3) SELECT * from users order by substr(address,1,2)::countrycode COLLATE english

4) SELECT * from users order by (substr(address,1,2) COLLATE
english)::countrycode

The ones with the collation expressions inside the casts seem very
strange and the behaviour following the domain don't seem
unreasonable. The behaviour with the collate clauses outside the cast
should definitely be follow the explicit collate clause.

-- 
greg


Re: Open issues for collations

From
Robert Haas
Date:
On Mar 26, 2011, at 12:34 PM, Greg Stark <gsstark@mit.edu> wrote:
> The ones with the collation expressions inside the casts seem very
> strange and the behaviour following the domain don't seem
> unreasonable. The behaviour with the collate clauses outside the cast
> should definitely be follow the explicit collate clause.

+1.  That's exactly what I was trying to say; thanks for saying it better.

...Robert


Re: Open issues for collations

From
Martijn van Oosterhout
Date:
On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

That seems all a bit weird. I spent some time reading through the SQL
spec to see if I could came up with a few ideas about what they thought
relevent. I think the gist of it is that I think the result row should
have for each column its declared collation in all cases.

Firstly, the SQL doesn't go into the general case where the collate
result of a function is based in it inputs. But in any case, if the
function returns a record, the whole record would have that collation,
which is absurd. I think letting it go to the contained columns is just
weird.

Secondly, I think the derivation algorithm is for determing the
collation of expressions which have no otherwise declared collation.
Anything returning a predefined record type has a predefined collation
and it should be used. If you're in a query referring to rowvar.field
and rowvar has a type, that's what should be used.

(No doubt you have some corner cases in mind?)

As for operators, they should behave like functions wherever possible,
otherwise it's just introducing unnecessary differences.

The cast-case is related to below.

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause.  It's not clear that that's per spec.  If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type?  Should it matter if the cast-to-domain is
> explicit or implicit?  Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type.  Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.

The SQL spec considers the collation to be part of the datatype, so if
you're casting to a domain (or type) you get the collation associated
with that domain (or type). As per the spec:

"The collation derivation of a declared type with a declared type
collation that is explicitly or implicitly specified by a <data type>
is implicit."

So the result of a cast would be the collation of the specified
type/domain with state implicit.

Also, apparently the COLLATE clause as allowed anywhere where a
datatype is permitted. So you can say:

CAST( foo AS TEXT COLLATE "en_US" )

Not sure if that works now. The result would be implicit state, as
opposed to if the COLLATE clause appears elsewhere.

Incidently, a function returning a domain seems weird to me. What does
it mean: (1) the function returns this type, Postgres assumes this is
true, or (2) function returns something, Postgres does an implicit
cast?

In any case, I'd suggest it is treated as being included in the
resolving of the return collation with the arguments so if the result
is a domain and you apply the normal rules you get:

(1) explicit states in the arguments will override it
(2) if arguments are implicit state and conflict with domain, the
result is no-collation, otherwise implicitly whatever the domain was
(3) no arguments have collation, which means you get the domain
default.

Which all seems eminently reasonable.

So I'd agree with your rules, but add a case (c) result of a cast.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation?  Should we provide a COLLATE option in
> variable declarations to let that be overridden?  If Oracle understands
> COLLATE, probably we should look at what they do in PL/SQL.

If COLLATE is allowed anywhere where the datatype is allowed, then the
COLLATE clause should be permitted there. Otherwise they become the
specified type with whatever the default is for that type. In
expressions the coercible-default state will get overridden
by the IMPLICIT state from the arguments as appropriate.

I note I'm using the term coercible default here, because that's what
Transact-SQL calls the state for any variable or value that's not a column
reference. I'm just checking and don't see any support for it in the
SQL standard. While it seemed to me to be extremely useful, since it
allows column references to override literals.

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

Using the referenced collation makes it clear that you don't want to
throw an error if the collations don't match. So yes, the generated
queries need to have the correct collation clauses.

Collations which redefine equality are on the horizon, case and accent
insensetivity being the obvious cases. Better get it right now.

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

That'd be nice. Perhaps one the existing cases can be converted to some
latin-X encoding. It does rely on the collation existing on the test
machine.

> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.

Long term it will be needed. It's been skipped now, but it will let you
produce better errors messages in the future.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Open issues for collations

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote:
>> ** Selecting a field from a record-returning function's output.
>> Currently, we'll use the field's declared collation; except that
>> if the field has default collation, we'll replace that with the common
>> collation of the function's inputs, if any.  Is either part of that
>> sane?  Do we need to make this work for functions invoked with other
>> syntax than a plain function call, eg operator or cast syntax?

> That seems all a bit weird. I spent some time reading through the SQL
> spec to see if I could came up with a few ideas about what they thought
> relevent. I think the gist of it is that I think the result row should
> have for each column its declared collation in all cases.

That interpretation would be fine with me.  It would let us get rid of
the special-case code at lines 307-324 of parse_collate.c, which I put
in only because there are cases in the collate.linux.utf8.sql regression
test that fail without it.  But I'm perfectly happy to conclude that
those test cases are mistaken.

>> ** What to do with domains whose declaration includes a COLLATE clause?

> The SQL spec considers the collation to be part of the datatype, so if
> you're casting to a domain (or type) you get the collation associated
> with that domain (or type). As per the spec:

> "The collation derivation of a declared type with a declared type
> collation that is explicitly or implicitly specified by a <data type>
> is implicit."

> So the result of a cast would be the collation of the specified
> type/domain with state implicit.

Hm.  That makes sense for explicit CAST syntax, but what about a
function returning a collatable type?  In particular, applying this
rule to the || operator would have us conclude that
x || y COLLATE "foo"

doesn't have an overall collation of "foo", which seems clearly wrong.
But if you claim that it's all based on the result type of the function
then it's hard to avoid that outcome.

> Also, apparently the COLLATE clause as allowed anywhere where a
> datatype is permitted. So you can say:

> CAST( foo AS TEXT COLLATE "en_US" )

No, you can't; this case is specifically disallowed by SQL:2008 6.12
<cast specification> syntax rule 1b: <data type> shall not contain a
<collate clause>.  Which is a pretty weird thing for them to say if
you believe that collations are nothing more nor less than a data type
property.  It kinda leads me to think that that is not how the committee
sees collations ... although the spec is pretty unhelpful at suggesting
exactly how you should think of them instead.

> Incidently, a function returning a domain seems weird to me. What does
> it mean: (1) the function returns this type, Postgres assumes this is
> true, or (2) function returns something, Postgres does an implicit
> cast?

> In any case, I'd suggest it is treated as being included in the
> resolving of the return collation with the arguments so if the result
> is a domain and you apply the normal rules you get:

> (1) explicit states in the arguments will override it
> (2) if arguments are implicit state and conflict with domain, the
> result is no-collation, otherwise implicitly whatever the domain was
> (3) no arguments have collation, which means you get the domain
> default.

> Which all seems eminently reasonable.

It's reasonable by itself, but it's also rather different from what you
just argued the behavior should be for field selection.  I don't see
the rationale for treating those cases differently.  In fact, given that
Postgres has always treated f(x) and x.f as equivalent notations,
I think there's a pretty strong argument for wanting their effects on
collation choice to be equivalent too.

>> * In plpgsql, is it OK for declared local variables to inherit the
>> function's input collation?  Should we provide a COLLATE option in
>> variable declarations to let that be overridden?  If Oracle understands
>> COLLATE, probably we should look at what they do in PL/SQL.

> If COLLATE is allowed anywhere where the datatype is allowed, then the
> COLLATE clause should be permitted there. Otherwise they become the
> specified type with whatever the default is for that type. In
> expressions the coercible-default state will get overridden
> by the IMPLICIT state from the arguments as appropriate.

No, I don't care for that, because then it matters whether an expression
contains any direct references to the input arguments versus containing
just local variables.  In particular, assigning a parameter to a local
variable and then using the local variable instead of the parameter
could change the function's results.  That seems mighty surprising and
bug-prone to me.  So I think it's important that local variables default
to the function's input collation.  I have no problem with adding
explicit COLLATE to the declaration syntax for plpgsql variables,
though.
        regards, tom lane


Re: Open issues for collations

From
Martijn van Oosterhout
Date:
On Sun, Mar 27, 2011 at 03:14:37PM -0400, Tom Lane wrote:
> > So the result of a cast would be the collation of the specified
> > type/domain with state implicit.
>
> Hm.  That makes sense for explicit CAST syntax, but what about a
> function returning a collatable type?  In particular, applying this
> rule to the || operator would have us conclude that
>
>     x || y COLLATE "foo"
>
> doesn't have an overall collation of "foo", which seems clearly wrong.
> But if you claim that it's all based on the result type of the function
> then it's hard to avoid that outcome.

I don't see this. There's no explicit cast in the above expression. The
return type says something, but that gets back to the question of
whether the return type indicates a cast to that type or not. In any
case, the SQL specifically states what should happen in the above case.

The question started with what happens when the function is declared to
return a domain type. I'm really not sure what to do there. I'd suggest
similar to how the typmod is propegated but I'm afraid that the answer
there is "we don't". What I suggested with combining it with the
collations of the input types is about the only concrete suggestion I
have.

> > Also, apparently the COLLATE clause as allowed anywhere where a
> > datatype is permitted. So you can say:
>
> > CAST( foo AS TEXT COLLATE "en_US" )
>
> No, you can't; this case is specifically disallowed by SQL:2008 6.12

Maybe they just thought it confusing that COLLATE would mean different
things in different contexts and forbade it.

> > (1) explicit states in the arguments will override it
> > (2) if arguments are implicit state and conflict with domain, the
> > result is no-collation, otherwise implicitly whatever the domain was
> > (3) no arguments have collation, which means you get the domain
> > default.
>
> > Which all seems eminently reasonable.
>
> It's reasonable by itself, but it's also rather different from what you
> just argued the behavior should be for field selection.  I don't see
> the rationale for treating those cases differently.  In fact, given that
> Postgres has always treated f(x) and x.f as equivalent notations,
> I think there's a pretty strong argument for wanting their effects on
> collation choice to be equivalent too.

Well, that's a good argument. But I'd suggest that here x.f is not a
column reference, so there's no reason it should be treated like one.
The SQL standard says column references determine collations but for
various strings operations it's the inputs that count. This makes sense
if you consider the return type declaration to be merely an assertion
and not a cast.

<snip about default collation on declared variables>

> No, I don't care for that, because then it matters whether an expression
> contains any direct references to the input arguments versus containing
> just local variables.  In particular, assigning a parameter to a local
> variable and then using the local variable instead of the parameter
> could change the function's results.  That seems mighty surprising and
> bug-prone to me.  So I think it's important that local variables default
> to the function's input collation.  I have no problem with adding
> explicit COLLATE to the declaration syntax for plpgsql variables,
> though.

Ok, you've convinced me here. You effectively set the default collation
for the period of the function, which I think is perfectly defensible.
We might allow people in the future to override it (a la search_path)
but this is a good start.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Open issues for collations

From
Peter Eisentraut
Date:
On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

Right, but we don't support that yet, so I don't consider that that has
to be addressed right now.  Rather it could go on a "list of things to
fix when supporting collations which redefine equality".  The index
mismatch issue is also not urgent.  It's not a regression and it's more
like don't-do-that-then or do-it-differently-then.

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

Well, how do we want to maintain these test cases without doing too much
duplication?  It would be easy to run a small sed script over
collate.linux.utf8.sql to create, say, a latin1 version out of it.
Since it's Linux only, it might be valid to do it that way without
having to make it super-portable in C.

> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.

Fine with me.

> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.

Probably.  It would open up a bunch of new cases to change and
fine-tune.




Re: Open issues for collations

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
>> * It'd sure be nice if we had some nontrivial test cases that work in
>> encodings besides UTF8.  I'm still bothered that the committed patch
>> failed to cover single-byte-encoding cases in upper/lower/initcap.

> Well, how do we want to maintain these test cases without doing too much
> duplication?  It would be easy to run a small sed script over
> collate.linux.utf8.sql to create, say, a latin1 version out of it.

I tried.  The upper/lower test cases require Turkish characters that
aren't in Latin1.  I'm not sure if we can readily produce test cases
that cover both sorting changes and case-folding changes in just one
single-byte encoding --- anybody?

One thing I noticed but didn't push to committing is that the test case
has a largely-unnecessary assumption about how the local system's locale
names spell "utf8".  We could eliminate that by having it use the
trimmed locale names created by initdb.  I would've made more of a push
for that if it resulted in a test case that passed on OS X, but it turns
out that once you get past the locale name spelling, you find out that
Macs still can't sort UTF8 strings correctly :-(
        regards, tom lane


Re: Open issues for collations

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of lun mar 28 21:02:40 -0300 2011:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> >> * It'd sure be nice if we had some nontrivial test cases that work in
> >> encodings besides UTF8.  I'm still bothered that the committed patch
> >> failed to cover single-byte-encoding cases in upper/lower/initcap.
> 
> > Well, how do we want to maintain these test cases without doing too much
> > duplication?  It would be easy to run a small sed script over
> > collate.linux.utf8.sql to create, say, a latin1 version out of it.
> 
> I tried.  The upper/lower test cases require Turkish characters that
> aren't in Latin1.  I'm not sure if we can readily produce test cases
> that cover both sorting changes and case-folding changes in just one
> single-byte encoding --- anybody?

ISO-8859-9?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Open issues for collations

From
Euler Taveira de Oliveira
Date:
Em 28-03-2011 22:27, Alvaro Herrera escreveu:
> Excerpts from Tom Lane's message of lun mar 28 21:02:40 -0300 2011:
>> I tried.  The upper/lower test cases require Turkish characters that
>> aren't in Latin1.  I'm not sure if we can readily produce test cases
>> that cover both sorting changes and case-folding changes in just one
>> single-byte encoding --- anybody?
>
> ISO-8859-9?
>
I'm afraid we have to map lang to single byte character set. Not all languages 
prefer ISO-8859.


--   Euler Taveira de Oliveira  http://www.timbira.com/


Re: Open issues for collations

From
Martijn van Oosterhout
Date:
On Mon, Mar 28, 2011 at 08:02:40PM -0400, Tom Lane wrote:
> One thing I noticed but didn't push to committing is that the test case
> has a largely-unnecessary assumption about how the local system's locale
> names spell "utf8".  We could eliminate that by having it use the
> trimmed locale names created by initdb.  I would've made more of a push
> for that if it resulted in a test case that passed on OS X, but it turns
> out that once you get past the locale name spelling, you find out that
> Macs still can't sort UTF8 strings correctly :-(

Yeah, and I don't think it's likely they're ever going to fix it
either. :( On their own website they explain how to do locale based
sorting, and they provide their own interfaces for that which,
unsurprisingly, uses UTF-16 and ICU underneath.


http://developer.apple.com/library/mac/#documentation/CoreFoundation/Conceptual/CFLocales/Articles/CFLocaleConcepts.html

I think the rule that we sort the same as the command-line sort program
is still true however, so I don't think it's surprising as such. Just
unfortunate.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: Open issues for collations

From
Robert Haas
Date:
Reading through this thread...

On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?

There were a couple of different ideas about which way we ought to go
with this, but I'm happy to defer to what Tom and Martijn hashed out:

MO> That seems all a bit weird. I spent some time reading through the SQL
MO> spec to see if I could came up with a few ideas about what they thought
MO> relevent. I think the gist of it is that I think the result row should
MO> have for each column its declared collation in all cases.

TL> That interpretation would be fine with me.  It would let us get rid of
TL> the special-case code at lines 307-324 of parse_collate.c, which I put
TL> in only because there are cases in the collate.linux.utf8.sql regression
TL> test that fail without it.  But I'm perfectly happy to conclude that
TL> those test cases are mistaken.

I'm not sure whether that's been done, though, or whether we're even
going to do it.

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause.  It's not clear that that's per spec.  If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type?  Should it matter if the cast-to-domain is
> explicit or implicit?  Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type.  Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.

I think we had agreement than a cast to a domain type with a collation
should stomp on any existing collation on the contained expression.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation?  Should we provide a COLLATE option in
> variable declarations to let that be overridden?

I think everyone who responded said "yes" to both questions.

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

I tested this and, indeed, if the collations don't match, the index
can't be used.

CREATE TABLE me (x character varying COLLATE "en_US");
CREATE TABLE me2 (x character varying COLLATE "es_ES");
CREATE TABLE you (x character varying COLLATE "es_ES" NOT NULL,
PRIMARY KEY (x));
ALTER TABLE me ADD FOREIGN KEY (x) REFERENCES you(x);
ALTER TABLE me2 ADD FOREIGN KEY (x) REFERENCES you(x);
SET enable_seqscan=false;
SET enable_hashjoin=false;

With that setup, this still does a seqscan-and-sort:

EXPLAIN select * from me, you where me.x = you.x;

But this uses the index:

EXPLAIN select * from me2, you where me2.x = you.x;

I found another problem, too:

rhaas=# insert into you values ('1');
INSERT 0 1
rhaas=# insert into me values ('1');
INSERT 0 1
rhaas=# alter table me alter column x set data type varchar collate "en_GB";
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.
CONTEXT:  SQL statement "SELECT fk."x" FROM ONLY "public"."me" fk LEFT
OUTER JOIN ONLY "public"."you" pk ON ( pk."x"::pg_catalog.text
OPERATOR(pg_catalog.=) fk."x"::pg_catalog.text) WHERE pk."x" IS NULL
AND (fk."x" IS NOT NULL)"

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

Seems like no one knows how to do this.

> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.

Everyone was in favor of this.

> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.

Probably too late for this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Open issues for collations

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Reading through this thread...
> On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ** Selecting a field from a record-returning function's output.
>> Currently, we'll use the field's declared collation; except that
>> if the field has default collation, we'll replace that with the common
>> collation of the function's inputs, if any. �Is either part of that
>> sane? �Do we need to make this work for functions invoked with other
>> syntax than a plain function call, eg operator or cast syntax?

> There were a couple of different ideas about which way we ought to go
> with this, but I'm happy to defer to what Tom and Martijn hashed out:

> MO> That seems all a bit weird. I spent some time reading through the SQL
> MO> spec to see if I could came up with a few ideas about what they thought
> MO> relevent. I think the gist of it is that I think the result row should
> MO> have for each column its declared collation in all cases.

> TL> That interpretation would be fine with me.  It would let us get rid of
> TL> the special-case code at lines 307-324 of parse_collate.c, which I put
> TL> in only because there are cases in the collate.linux.utf8.sql regression
> TL> test that fail without it.  But I'm perfectly happy to conclude that
> TL> those test cases are mistaken.

> I'm not sure whether that's been done, though, or whether we're even
> going to do it.

I looked a bit more closely at this, and I think I finally get the point
of what those regression test cases involving the dup() function are
about.  Consider a trivial polymorphic function such as
create function dummy(anyelement) returns anyelement as'select $1' language sql;

When applied to a textual argument, this is a function taking and
returning string, and so collation does (and should, I think) propagate
through it.  Thus in
select dummy(x) from tbl order by 1;

you will get ordering by the declared collation of tbl.x, whatever that
is.  But now consider
create function dup(in anyelement, a out anyelement, b out anyelement)as 'select $1, $2' language sql;
select dup(x).a from tbl order by 1;

It's not unreasonable to think that this should also order by tbl.x's
collation --- if collation propagates through dummy(), why not through
dup()?  And in fact those regression test cases are expecting that it
does propagate in such a case.

Now the discussion that we had earlier in this thread was implicitly
assuming that we were talking about FieldSelect from a known composite
type.  If dup() were declared to return a named composite type, then
using the collation that is declared for that type's "a" column seems
reasonable.  But when you're dealing with an anonymous record type,
which is what dup() actually returns here, there is no such declaration;
and what's more, the fact that there's a record type at all is just an
implementation detail to most users.

If we take out the kluge in parse_collate.c's handling of FieldSelects,
then what we will get in this example is ordering by the database
default collation.  We can justify that on a narrow language-lawyering
basis by saying "dup() returns a composite type, which has no collation,
therefore collation does not propagate through from its arguments to
any column you might select from its result".  But it's going to feel
a bit surprising to anyone who thinks of this in terms of OUT arguments
rather than an anonymous composite type.

I'm inclined to think that we should take out the kluge and rest on the
language-lawyering viewpoint, because otherwise there are going to be
umpteen other corner cases where people are going to expect collation to
propagate and it's not going to work without very major kluging.

Comments?
        regards, tom lane


Re: Open issues for collations

From
Peter Eisentraut
Date:
On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.

Quick question on this:  Should we at least warn if zero suitable
locales were found or some other problem scenario?  Or should we just
wait around and see what actual problems, if any, will be reported?



Re: Open issues for collations

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
>> * Remove initdb's warning about useless locales?  Seems like pointless
>> noise, or at least something that can be relegated to debug mode.

> Quick question on this:  Should we at least warn if zero suitable
> locales were found or some other problem scenario?  Or should we just
> wait around and see what actual problems, if any, will be reported?

Well, my opinion is that normal users never see the output of initdb
at all, so I don't think there's that much value in complaining there.
But I don't have a serious objection to complaining if we couldn't find
any usable locales at all, either.
        regards, tom lane


Re: Open issues for collations

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> >> * Remove initdb's warning about useless locales?  Seems like pointless
> >> noise, or at least something that can be relegated to debug mode.
> 
> > Quick question on this:  Should we at least warn if zero suitable
> > locales were found or some other problem scenario?  Or should we just
> > wait around and see what actual problems, if any, will be reported?
> 
> Well, my opinion is that normal users never see the output of initdb
> at all, so I don't think there's that much value in complaining there.

Those users are not going to have those problems anyway.  The
problematic users are going to be those running on unusual platforms.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Open issues for collations

From
Peter Eisentraut
Date:
On fre, 2011-04-08 at 16:14 -0300, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011:
> > Peter Eisentraut <peter_e@gmx.net> writes:
> > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> > >> * Remove initdb's warning about useless locales?  Seems like pointless
> > >> noise, or at least something that can be relegated to debug mode.
> > 
> > > Quick question on this:  Should we at least warn if zero suitable
> > > locales were found or some other problem scenario?  Or should we just
> > > wait around and see what actual problems, if any, will be reported?
> > 
> > Well, my opinion is that normal users never see the output of initdb
> > at all, so I don't think there's that much value in complaining there.
> 
> Those users are not going to have those problems anyway.  The
> problematic users are going to be those running on unusual platforms.

So what is your opinion on the original question?




Re: Open issues for collations

From
Alvaro Herrera
Date:
Excerpts from Peter Eisentraut's message of sáb abr 09 01:32:28 -0300 2011:
> On fre, 2011-04-08 at 16:14 -0300, Alvaro Herrera wrote:
> > Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011:
> > > Peter Eisentraut <peter_e@gmx.net> writes:
> > > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> > > >> * Remove initdb's warning about useless locales?  Seems like pointless
> > > >> noise, or at least something that can be relegated to debug mode.
> > > 
> > > > Quick question on this:  Should we at least warn if zero suitable
> > > > locales were found or some other problem scenario?  Or should we just
> > > > wait around and see what actual problems, if any, will be reported?
> > > 
> > > Well, my opinion is that normal users never see the output of initdb
> > > at all, so I don't think there's that much value in complaining there.
> > 
> > Those users are not going to have those problems anyway.  The
> > problematic users are going to be those running on unusual platforms.
> 
> So what is your opinion on the original question?

I feel throwing a warning is appropriate here, because people for which
problems are most likely are going to see them.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Open issues for collations

From
Peter Eisentraut
Date:
On mån, 2011-03-28 at 20:02 -0400, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote:
> >> * It'd sure be nice if we had some nontrivial test cases that work
> >> in encodings besides UTF8.  I'm still bothered that the committed
> >> patch failed to cover single-byte-encoding cases in >>
> upper/lower/initcap.
> 
> > Well, how do we want to maintain these test cases without doing too
> > much duplication?  It would be easy to run a small sed script over
> > collate.linux.utf8.sql to create, say, a latin1 version out of it.
> 
> I tried.  The upper/lower test cases require Turkish characters that
> aren't in Latin1.  I'm not sure if we can readily produce test cases
> that cover both sorting changes and case-folding changes in just one
> single-byte encoding --- anybody?
> 
> One thing I noticed but didn't push to committing is that the test
> case has a largely-unnecessary assumption about how the local system's
> locale names spell "utf8".  We could eliminate that by having it use
> the trimmed locale names created by initdb.

I see you went for the latter option.  That works pretty well already.
I've also been playing around with separating out the "Turkish" tests
into a separate file.  That would then probably get the remaining
"latin1" file passing, if we also dropped the encoding mention from this
error message:

ERROR:  collation "foo" for encoding "UTF8" does not exist

I had thought hard about this in the past and didn't want to do it, but
since we are now making every effort to effectively hide collations with
the wrong encoding, this would possibly be acceptable.

I'm also seeing promising signs that we might get this test (minus
Turkish, perhaps) passing on Windows.




Re: Open issues for collations

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On mån, 2011-03-28 at 20:02 -0400, Tom Lane wrote:
>> One thing I noticed but didn't push to committing is that the test
>> case has a largely-unnecessary assumption about how the local system's
>> locale names spell "utf8".  We could eliminate that by having it use
>> the trimmed locale names created by initdb.

> I see you went for the latter option.  That works pretty well already.
> I've also been playing around with separating out the "Turkish" tests
> into a separate file.  That would then probably get the remaining
> "latin1" file passing, if we also dropped the encoding mention from this
> error message:

> ERROR:  collation "foo" for encoding "UTF8" does not exist

> I had thought hard about this in the past and didn't want to do it, but
> since we are now making every effort to effectively hide collations with
> the wrong encoding, this would possibly be acceptable.

Not sure.  If we had the test refactored to the point where that was the
only diff you got with a different server encoding, maybe it'd be worth
changing; but right now we're still a long way from there.  I was seeing
this change as mainly targeted towards making the test useful on more
platforms, and since that encoding name is ours and not
platform-specific, it doesn't create any portability issues to show it.
        regards, tom lane


Re: Open issues for collations

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ assorted comments on original issues ]

I believe that all the collation issues I complained about on 26 March
are now resolved, except for the question of getting some more test
cases, and the question of adding a cares-about-collation flag to
pg_proc.  I've added the latter to the TODO list since everybody agreed
it was too late to consider it for 9.1.

However, I've come across a new issue that maybe requires discussion:
what collation should be associated with a multi-row VALUES in FROM?
For instance, in
SELECT ... FROM  (VALUES (1, 'foo'), (2, 'bar' COLLATE "C")) v(a,b),  ...

what collation should be imputed to references to v.b?

The way the code currently works is that the first row of the VALUES
list is inspected to determine what collation to report --- so in this
example, you'd get default collation, ignoring the COLLATE clause in the
second row.  There are several problems with this:

1. I think it violates the SQL spec.  SQL:2008 7.3 <table value
constructor> says that the column types of a VALUES construct are
resolved per the rules of section 9.3, and 9.3 is the one with the
standard verbiage about resolving a common collation, so it's hard
to see how "use the first row and ignore the rest" satisfies the spec.

2. It doesn't seem to satisfy the POLA --- in the above example,
ignoring the explicit COLLATE clause is rather surprising.  We could
document that that's what it does, perhaps, but if you can attach
COLLATE to any input of an operator or function and get the same
results, it's hard to explain why the same isn't true of a VALUES
column.  Especially when VALUES resolves data types in a much more
symmetrical fashion, eg this works:
  (VALUES (1, '1'), (2, '2'::real)) v(a,b)

3. It's not hard to imagine people thinking they can get row-by-row-
varying collation behavior from something like
SELECT ... FROM  (VALUES (1, 'foo' COLLATE "en_US"), (2, 'bar' COLLATE "C")) v(a,b),  ...

So not throwing an error, but rather silently doing something other than
what this SQL seems to say, seems pretty unfriendly to me.

The reason I'm expending so much verbiage on this is that fixing it
seems to require an addition to struct RangeTblEntry, ie, a catversion
bump and forced initdb, so that we can store a list of the resolved
column collations for an RTE_VALUES RTE.  I don't really think there's
much choice though.

Comments?
        regards, tom lane


Re: Open issues for collations

From
Peter Eisentraut
Date:
On Mon, 2011-04-18 at 10:41 -0400, Tom Lane wrote:
> However, I've come across a new issue that maybe requires discussion:
> what collation should be associated with a multi-row VALUES in FROM?
> For instance, in
> 
>         SELECT ... FROM
>           (VALUES (1, 'foo'), (2, 'bar' COLLATE "C")) v(a,b),
>           ...
> 
> what collation should be imputed to references to v.b?
> 
> The way the code currently works is that the first row of the VALUES
> list is inspected to determine what collation to report

Hmm, I do see this on my list of things to address, so yes, it should be
fixed.  I likely got stuck because it's pretty complicated.