Thread: Open issues for collations
Robert Haas <robertmhaas@gmail.com> writes: > I think some discussion of which of the things on the open > item lists need to be done before beta might be helpful, and we ought > to add any items that are not there but are blockers. Here's a quick enumeration of some things I think need discussion about the collations patch: * Are we happy yet with the collation assignment behavior (see parse_collate.c)? A couple of specific subtopics: ** Selecting a field from a record-returning function's output. Currently, we'll use the field's declared collation; except that if the field has default collation, we'll replace that with the common collation of the function's inputs, if any. Is either part of that sane? Do we need to make this work for functions invoked with other syntax than a plain function call, eg operator or cast syntax? ** What to do with domains whose declaration includes a COLLATE clause? Currently, we'll impute that collation to the result of a cast to the domain type --- even if the cast's input expression includes an explicit COLLATE clause. It's not clear that that's per spec. If it is correct, should we behave similarly for functions that are declared to return a domain type? Should it matter if the cast-to-domain is explicit or implicit? Perhaps it'd be best if domain collations only mattered for columns declared with that domain type. Then we'd have a general rule that collations only come into play in an expression as a result of (a) the declared type of a column reference or (b) an explicit COLLATE clause. * In plpgsql, is it OK for declared local variables to inherit the function's input collation? Should we provide a COLLATE option in variable declarations to let that be overridden? If Oracle understands COLLATE, probably we should look at what they do in PL/SQL. * RI triggers should insert COLLATE clauses in generated queries to satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the referenced column's collation. Right now you may get either table's collation depending on which query type is involved. I think an obvious failure may not be possible so long as equality means the same thing in all collations, but it's definitely possible that the planner might decide it can't use the referenced column's unique index, which would suck for performance. (Note: this rule seems to prove that the committee assumes equality can mean different things in different collations, else they'd not have felt the need to specify.) * It'd sure be nice if we had some nontrivial test cases that work in encodings besides UTF8. I'm still bothered that the committed patch failed to cover single-byte-encoding cases in upper/lower/initcap. * Remove initdb's warning about useless locales? Seems like pointless noise, or at least something that can be relegated to debug mode. * Is it worth adding a cares-about-collation flag to pg_proc? Probably too late to be worrying about such refinements for 9.1. There are a bunch of other minor issues that I'm still working through, but these are the ones that seem to merit discussion. regards, tom lane
On Sat, Mar 26, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I think some discussion of which of the things on the open >> item lists need to be done before beta might be helpful, and we ought >> to add any items that are not there but are blockers. > > Here's a quick enumeration of some things I think need discussion about > the collations patch: > > * Are we happy yet with the collation assignment behavior (see > parse_collate.c)? A couple of specific subtopics: > > ** Selecting a field from a record-returning function's output. > Currently, we'll use the field's declared collation; except that > if the field has default collation, we'll replace that with the common > collation of the function's inputs, if any. Is either part of that > sane? Do we need to make this work for functions invoked with other > syntax than a plain function call, eg operator or cast syntax? > > ** What to do with domains whose declaration includes a COLLATE clause? > Currently, we'll impute that collation to the result of a cast to the > domain type --- even if the cast's input expression includes an > explicit COLLATE clause. It's not clear that that's per spec. If it > is correct, should we behave similarly for functions that are declared > to return a domain type? Should it matter if the cast-to-domain is > explicit or implicit? Perhaps it'd be best if domain collations only > mattered for columns declared with that domain type. Then we'd have > a general rule that collations only come into play in an expression > as a result of (a) the declared type of a column reference or (b) > an explicit COLLATE clause. > > > * In plpgsql, is it OK for declared local variables to inherit the > function's input collation? Should we provide a COLLATE option in > variable declarations to let that be overridden? If Oracle understands > COLLATE, probably we should look at what they do in PL/SQL. > > * RI triggers should insert COLLATE clauses in generated queries to > satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the > referenced column's collation. Right now you may get either table's > collation depending on which query type is involved. I think an obvious > failure may not be possible so long as equality means the same thing in > all collations, but it's definitely possible that the planner might > decide it can't use the referenced column's unique index, which would > suck for performance. (Note: this rule seems to prove that the > committee assumes equality can mean different things in different > collations, else they'd not have felt the need to specify.) > > * It'd sure be nice if we had some nontrivial test cases that work in > encodings besides UTF8. I'm still bothered that the committed patch > failed to cover single-byte-encoding cases in upper/lower/initcap. > > * Remove initdb's warning about useless locales? Seems like pointless > noise, or at least something that can be relegated to debug mode. > > * Is it worth adding a cares-about-collation flag to pg_proc? Probably > too late to be worrying about such refinements for 9.1. > > There are a bunch of other minor issues that I'm still working through, > but these are the ones that seem to merit discussion. That's a long list and I think it's clear that we won't resolve all of those issues to everybody's satisfaction in a single release, let alone in next week or so. We need a way forwards. What I think we should do is add detailed documentation on how it works now. There are many people that would love to help, but not everybody can visualise exactly the points you are making above, I would confess that I can't. Having docs that clearly explain a neat new capability and the various possible gotcha/caveats will help others come up with test cases and ideas. It seems to me likely that in real usage many of those gotchas will drop away because they represent unlikely or perverse use cases. I don't see anything bad in releasing software that has unresolved questions, as long as those items are clearly flagged up and we specifically ask for feedback on them. That looks to me to be a many-eyeballs approach to the problem. http://en.wikipedia.org/wiki/Linus%27_Law Tucking our shoelaces into our shoes doesn't mean the loose ends are resolved fully. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Mar 26, 2011 at 4:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ** Selecting a field from a record-returning function's output. > Currently, we'll use the field's declared collation; except that > if the field has default collation, we'll replace that with the common > collation of the function's inputs, if any. Is either part of that > sane? Do we need to make this work for functions invoked with other > syntax than a plain function call, eg operator or cast syntax? Either of those sounds reasonable but the combination seems weird. Some example functions might help: list_words('foo bar bar') -> ('foo'), ('bar'), ('baz') fetch_users_by_lastname('Smith') -> (1,'John','Smith','Great Britain','GB'), (2,'Jürgen','Smith','DE') fetch_users_by_countrycode('DE') -> (2,'Jürgen','Smith','DE') The first looks like it should definitely inherit. The second the result set is heterogeneous and inheriting might be the best compromise but it would produce very strange results for columns like the country-code which should just use their defined collation of C. The third case inheriting the country code's collation of C would be very strange and definitely wrong. It occurs to me that if we have any inherited cases people might come to depend on that behaviour and there would be no out for us. Whereas if we say record return values always use the record type's field's collations then we could always later add a collation of type _inherited or _anycollation or some such that indicated that that column should inherit the arguments' collation and it wouldn't affect any existing code. -- greg
On Mar 26, 2011, at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ** Selecting a field from a record-returning function's output. > Currently, we'll use the field's declared collation; except that > if the field has default collation, we'll replace that with the common > collation of the function's inputs, if any. Is either part of that > sane? Do we need to make this work for functions invoked with other > syntax than a plain function call, eg operator or cast syntax? I am not an expert on this topic in any way. That having been said, the first part of that rule seems quite sane. The secondpart seems less clear, but probably also sane. > ** What to do with domains whose declaration includes a COLLATE clause? > Currently, we'll impute that collation to the result of a cast to the > domain type --- even if the cast's input expression includes an > explicit COLLATE clause. I would have thought that an explicit COLLATE clause would trump any action at a distance. > * In plpgsql, is it OK for declared local variables to inherit the > function's input collation? Should we provide a COLLATE option in > variable declarations to let that be overridden? If Oracle understands > COLLATE, probably we should look at what they do in PL/SQL. I don't know what Oracle does, but a collate option in variable declarations seems like a very good idea. Inheriting theinput collation if not specified seems good too. I also suspect we might need something like COLLATE FROM $1, but maybethat's a 9.2 feature. > * RI triggers should insert COLLATE clauses in generated queries to > satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the > referenced column's collation. Right now you may get either table's > collation depending on which query type is involved. I think an obvious > failure may not be possible so long as equality means the same thing in > all collations, but it's definitely possible that the planner might > decide it can't use the referenced column's unique index, which would > suck for performance. (Note: this rule seems to prove that the > committee assumes equality can mean different things in different > collations, else they'd not have felt the need to specify.) No idea what to do about this. > * It'd sure be nice if we had some nontrivial test cases that work in > encodings besides UTF8. I'm still bothered that the committed patch > failed to cover single-byte-encoding cases in upper/lower/initcap. Or this. > * Remove initdb's warning about useless locales? Seems like pointless > noise, or at least something that can be relegated to debug mode. +1. > * Is it worth adding a cares-about-collation flag to pg_proc? Probably > too late to be worrying about such refinements for 9.1. Depends how much knock-on work it'll create. ...Robert
On Sat, Mar 26, 2011 at 3:16 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> ** What to do with domains whose declaration includes a COLLATE clause? >> Currently, we'll impute that collation to the result of a cast to the >> domain type --- even if the cast's input expression includes an >> explicit COLLATE clause. > > I would have thought that an explicit COLLATE clause would trump any action at a distance. > I think an explicit COLLATE *outside* the cast would. But inside the cast? The question comes down to whether a domain with a collate clause is explicitly providing a collation or implicitly. So again, examples: CREATE DOMAIN name AS text COLLATE english; CREATE DOMAIN countrycode AS char(2) COLLATE C; 1) SELECT * from users where country = 'DE' order by first_name COLLATE german; 2) SELECT * from users where country = 'DE' order by (tolower(first_name) COLLATE german)::name; 3) SELECT * from users order by substr(address,1,2)::countrycode COLLATE english 4) SELECT * from users order by (substr(address,1,2) COLLATE english)::countrycode The ones with the collation expressions inside the casts seem very strange and the behaviour following the domain don't seem unreasonable. The behaviour with the collate clauses outside the cast should definitely be follow the explicit collate clause. -- greg
On Mar 26, 2011, at 12:34 PM, Greg Stark <gsstark@mit.edu> wrote: > The ones with the collation expressions inside the casts seem very > strange and the behaviour following the domain don't seem > unreasonable. The behaviour with the collate clauses outside the cast > should definitely be follow the explicit collate clause. +1. That's exactly what I was trying to say; thanks for saying it better. ...Robert
On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote: > ** Selecting a field from a record-returning function's output. > Currently, we'll use the field's declared collation; except that > if the field has default collation, we'll replace that with the common > collation of the function's inputs, if any. Is either part of that > sane? Do we need to make this work for functions invoked with other > syntax than a plain function call, eg operator or cast syntax? That seems all a bit weird. I spent some time reading through the SQL spec to see if I could came up with a few ideas about what they thought relevent. I think the gist of it is that I think the result row should have for each column its declared collation in all cases. Firstly, the SQL doesn't go into the general case where the collate result of a function is based in it inputs. But in any case, if the function returns a record, the whole record would have that collation, which is absurd. I think letting it go to the contained columns is just weird. Secondly, I think the derivation algorithm is for determing the collation of expressions which have no otherwise declared collation. Anything returning a predefined record type has a predefined collation and it should be used. If you're in a query referring to rowvar.field and rowvar has a type, that's what should be used. (No doubt you have some corner cases in mind?) As for operators, they should behave like functions wherever possible, otherwise it's just introducing unnecessary differences. The cast-case is related to below. > ** What to do with domains whose declaration includes a COLLATE clause? > Currently, we'll impute that collation to the result of a cast to the > domain type --- even if the cast's input expression includes an > explicit COLLATE clause. It's not clear that that's per spec. If it > is correct, should we behave similarly for functions that are declared > to return a domain type? Should it matter if the cast-to-domain is > explicit or implicit? Perhaps it'd be best if domain collations only > mattered for columns declared with that domain type. Then we'd have > a general rule that collations only come into play in an expression > as a result of (a) the declared type of a column reference or (b) > an explicit COLLATE clause. The SQL spec considers the collation to be part of the datatype, so if you're casting to a domain (or type) you get the collation associated with that domain (or type). As per the spec: "The collation derivation of a declared type with a declared type collation that is explicitly or implicitly specified by a <data type> is implicit." So the result of a cast would be the collation of the specified type/domain with state implicit. Also, apparently the COLLATE clause as allowed anywhere where a datatype is permitted. So you can say: CAST( foo AS TEXT COLLATE "en_US" ) Not sure if that works now. The result would be implicit state, as opposed to if the COLLATE clause appears elsewhere. Incidently, a function returning a domain seems weird to me. What does it mean: (1) the function returns this type, Postgres assumes this is true, or (2) function returns something, Postgres does an implicit cast? In any case, I'd suggest it is treated as being included in the resolving of the return collation with the arguments so if the result is a domain and you apply the normal rules you get: (1) explicit states in the arguments will override it (2) if arguments are implicit state and conflict with domain, the result is no-collation, otherwise implicitly whatever the domain was (3) no arguments have collation, which means you get the domain default. Which all seems eminently reasonable. So I'd agree with your rules, but add a case (c) result of a cast. > * In plpgsql, is it OK for declared local variables to inherit the > function's input collation? Should we provide a COLLATE option in > variable declarations to let that be overridden? If Oracle understands > COLLATE, probably we should look at what they do in PL/SQL. If COLLATE is allowed anywhere where the datatype is allowed, then the COLLATE clause should be permitted there. Otherwise they become the specified type with whatever the default is for that type. In expressions the coercible-default state will get overridden by the IMPLICIT state from the arguments as appropriate. I note I'm using the term coercible default here, because that's what Transact-SQL calls the state for any variable or value that's not a column reference. I'm just checking and don't see any support for it in the SQL standard. While it seemed to me to be extremely useful, since it allows column references to override literals. > * RI triggers should insert COLLATE clauses in generated queries to > satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the > referenced column's collation. Right now you may get either table's > collation depending on which query type is involved. I think an obvious > failure may not be possible so long as equality means the same thing in > all collations, but it's definitely possible that the planner might > decide it can't use the referenced column's unique index, which would > suck for performance. (Note: this rule seems to prove that the > committee assumes equality can mean different things in different > collations, else they'd not have felt the need to specify.) Using the referenced collation makes it clear that you don't want to throw an error if the collations don't match. So yes, the generated queries need to have the correct collation clauses. Collations which redefine equality are on the horizon, case and accent insensetivity being the obvious cases. Better get it right now. > * It'd sure be nice if we had some nontrivial test cases that work in > encodings besides UTF8. I'm still bothered that the committed patch > failed to cover single-byte-encoding cases in upper/lower/initcap. That'd be nice. Perhaps one the existing cases can be converted to some latin-X encoding. It does rely on the collation existing on the test machine. > * Is it worth adding a cares-about-collation flag to pg_proc? Probably > too late to be worrying about such refinements for 9.1. Long term it will be needed. It's been skipped now, but it will let you produce better errors messages in the future. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sat, Mar 26, 2011 at 12:36:43AM -0400, Tom Lane wrote: >> ** Selecting a field from a record-returning function's output. >> Currently, we'll use the field's declared collation; except that >> if the field has default collation, we'll replace that with the common >> collation of the function's inputs, if any. Is either part of that >> sane? Do we need to make this work for functions invoked with other >> syntax than a plain function call, eg operator or cast syntax? > That seems all a bit weird. I spent some time reading through the SQL > spec to see if I could came up with a few ideas about what they thought > relevent. I think the gist of it is that I think the result row should > have for each column its declared collation in all cases. That interpretation would be fine with me. It would let us get rid of the special-case code at lines 307-324 of parse_collate.c, which I put in only because there are cases in the collate.linux.utf8.sql regression test that fail without it. But I'm perfectly happy to conclude that those test cases are mistaken. >> ** What to do with domains whose declaration includes a COLLATE clause? > The SQL spec considers the collation to be part of the datatype, so if > you're casting to a domain (or type) you get the collation associated > with that domain (or type). As per the spec: > "The collation derivation of a declared type with a declared type > collation that is explicitly or implicitly specified by a <data type> > is implicit." > So the result of a cast would be the collation of the specified > type/domain with state implicit. Hm. That makes sense for explicit CAST syntax, but what about a function returning a collatable type? In particular, applying this rule to the || operator would have us conclude that x || y COLLATE "foo" doesn't have an overall collation of "foo", which seems clearly wrong. But if you claim that it's all based on the result type of the function then it's hard to avoid that outcome. > Also, apparently the COLLATE clause as allowed anywhere where a > datatype is permitted. So you can say: > CAST( foo AS TEXT COLLATE "en_US" ) No, you can't; this case is specifically disallowed by SQL:2008 6.12 <cast specification> syntax rule 1b: <data type> shall not contain a <collate clause>. Which is a pretty weird thing for them to say if you believe that collations are nothing more nor less than a data type property. It kinda leads me to think that that is not how the committee sees collations ... although the spec is pretty unhelpful at suggesting exactly how you should think of them instead. > Incidently, a function returning a domain seems weird to me. What does > it mean: (1) the function returns this type, Postgres assumes this is > true, or (2) function returns something, Postgres does an implicit > cast? > In any case, I'd suggest it is treated as being included in the > resolving of the return collation with the arguments so if the result > is a domain and you apply the normal rules you get: > (1) explicit states in the arguments will override it > (2) if arguments are implicit state and conflict with domain, the > result is no-collation, otherwise implicitly whatever the domain was > (3) no arguments have collation, which means you get the domain > default. > Which all seems eminently reasonable. It's reasonable by itself, but it's also rather different from what you just argued the behavior should be for field selection. I don't see the rationale for treating those cases differently. In fact, given that Postgres has always treated f(x) and x.f as equivalent notations, I think there's a pretty strong argument for wanting their effects on collation choice to be equivalent too. >> * In plpgsql, is it OK for declared local variables to inherit the >> function's input collation? Should we provide a COLLATE option in >> variable declarations to let that be overridden? If Oracle understands >> COLLATE, probably we should look at what they do in PL/SQL. > If COLLATE is allowed anywhere where the datatype is allowed, then the > COLLATE clause should be permitted there. Otherwise they become the > specified type with whatever the default is for that type. In > expressions the coercible-default state will get overridden > by the IMPLICIT state from the arguments as appropriate. No, I don't care for that, because then it matters whether an expression contains any direct references to the input arguments versus containing just local variables. In particular, assigning a parameter to a local variable and then using the local variable instead of the parameter could change the function's results. That seems mighty surprising and bug-prone to me. So I think it's important that local variables default to the function's input collation. I have no problem with adding explicit COLLATE to the declaration syntax for plpgsql variables, though. regards, tom lane
On Sun, Mar 27, 2011 at 03:14:37PM -0400, Tom Lane wrote: > > So the result of a cast would be the collation of the specified > > type/domain with state implicit. > > Hm. That makes sense for explicit CAST syntax, but what about a > function returning a collatable type? In particular, applying this > rule to the || operator would have us conclude that > > x || y COLLATE "foo" > > doesn't have an overall collation of "foo", which seems clearly wrong. > But if you claim that it's all based on the result type of the function > then it's hard to avoid that outcome. I don't see this. There's no explicit cast in the above expression. The return type says something, but that gets back to the question of whether the return type indicates a cast to that type or not. In any case, the SQL specifically states what should happen in the above case. The question started with what happens when the function is declared to return a domain type. I'm really not sure what to do there. I'd suggest similar to how the typmod is propegated but I'm afraid that the answer there is "we don't". What I suggested with combining it with the collations of the input types is about the only concrete suggestion I have. > > Also, apparently the COLLATE clause as allowed anywhere where a > > datatype is permitted. So you can say: > > > CAST( foo AS TEXT COLLATE "en_US" ) > > No, you can't; this case is specifically disallowed by SQL:2008 6.12 Maybe they just thought it confusing that COLLATE would mean different things in different contexts and forbade it. > > (1) explicit states in the arguments will override it > > (2) if arguments are implicit state and conflict with domain, the > > result is no-collation, otherwise implicitly whatever the domain was > > (3) no arguments have collation, which means you get the domain > > default. > > > Which all seems eminently reasonable. > > It's reasonable by itself, but it's also rather different from what you > just argued the behavior should be for field selection. I don't see > the rationale for treating those cases differently. In fact, given that > Postgres has always treated f(x) and x.f as equivalent notations, > I think there's a pretty strong argument for wanting their effects on > collation choice to be equivalent too. Well, that's a good argument. But I'd suggest that here x.f is not a column reference, so there's no reason it should be treated like one. The SQL standard says column references determine collations but for various strings operations it's the inputs that count. This makes sense if you consider the return type declaration to be merely an assertion and not a cast. <snip about default collation on declared variables> > No, I don't care for that, because then it matters whether an expression > contains any direct references to the input arguments versus containing > just local variables. In particular, assigning a parameter to a local > variable and then using the local variable instead of the parameter > could change the function's results. That seems mighty surprising and > bug-prone to me. So I think it's important that local variables default > to the function's input collation. I have no problem with adding > explicit COLLATE to the declaration syntax for plpgsql variables, > though. Ok, you've convinced me here. You effectively set the default collation for the period of the function, which I think is perfectly defensible. We might allow people in the future to override it (a la search_path) but this is a good start. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > * RI triggers should insert COLLATE clauses in generated queries to > satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the > referenced column's collation. Right now you may get either table's > collation depending on which query type is involved. I think an obvious > failure may not be possible so long as equality means the same thing in > all collations, but it's definitely possible that the planner might > decide it can't use the referenced column's unique index, which would > suck for performance. (Note: this rule seems to prove that the > committee assumes equality can mean different things in different > collations, else they'd not have felt the need to specify.) Right, but we don't support that yet, so I don't consider that that has to be addressed right now. Rather it could go on a "list of things to fix when supporting collations which redefine equality". The index mismatch issue is also not urgent. It's not a regression and it's more like don't-do-that-then or do-it-differently-then. > * It'd sure be nice if we had some nontrivial test cases that work in > encodings besides UTF8. I'm still bothered that the committed patch > failed to cover single-byte-encoding cases in upper/lower/initcap. Well, how do we want to maintain these test cases without doing too much duplication? It would be easy to run a small sed script over collate.linux.utf8.sql to create, say, a latin1 version out of it. Since it's Linux only, it might be valid to do it that way without having to make it super-portable in C. > * Remove initdb's warning about useless locales? Seems like pointless > noise, or at least something that can be relegated to debug mode. Fine with me. > * Is it worth adding a cares-about-collation flag to pg_proc? Probably > too late to be worrying about such refinements for 9.1. Probably. It would open up a bunch of new cases to change and fine-tune.
Peter Eisentraut <peter_e@gmx.net> writes: > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: >> * It'd sure be nice if we had some nontrivial test cases that work in >> encodings besides UTF8. I'm still bothered that the committed patch >> failed to cover single-byte-encoding cases in upper/lower/initcap. > Well, how do we want to maintain these test cases without doing too much > duplication? It would be easy to run a small sed script over > collate.linux.utf8.sql to create, say, a latin1 version out of it. I tried. The upper/lower test cases require Turkish characters that aren't in Latin1. I'm not sure if we can readily produce test cases that cover both sorting changes and case-folding changes in just one single-byte encoding --- anybody? One thing I noticed but didn't push to committing is that the test case has a largely-unnecessary assumption about how the local system's locale names spell "utf8". We could eliminate that by having it use the trimmed locale names created by initdb. I would've made more of a push for that if it resulted in a test case that passed on OS X, but it turns out that once you get past the locale name spelling, you find out that Macs still can't sort UTF8 strings correctly :-( regards, tom lane
Excerpts from Tom Lane's message of lun mar 28 21:02:40 -0300 2011: > Peter Eisentraut <peter_e@gmx.net> writes: > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > >> * It'd sure be nice if we had some nontrivial test cases that work in > >> encodings besides UTF8. I'm still bothered that the committed patch > >> failed to cover single-byte-encoding cases in upper/lower/initcap. > > > Well, how do we want to maintain these test cases without doing too much > > duplication? It would be easy to run a small sed script over > > collate.linux.utf8.sql to create, say, a latin1 version out of it. > > I tried. The upper/lower test cases require Turkish characters that > aren't in Latin1. I'm not sure if we can readily produce test cases > that cover both sorting changes and case-folding changes in just one > single-byte encoding --- anybody? ISO-8859-9? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Em 28-03-2011 22:27, Alvaro Herrera escreveu: > Excerpts from Tom Lane's message of lun mar 28 21:02:40 -0300 2011: >> I tried. The upper/lower test cases require Turkish characters that >> aren't in Latin1. I'm not sure if we can readily produce test cases >> that cover both sorting changes and case-folding changes in just one >> single-byte encoding --- anybody? > > ISO-8859-9? > I'm afraid we have to map lang to single byte character set. Not all languages prefer ISO-8859. -- Euler Taveira de Oliveira http://www.timbira.com/
On Mon, Mar 28, 2011 at 08:02:40PM -0400, Tom Lane wrote: > One thing I noticed but didn't push to committing is that the test case > has a largely-unnecessary assumption about how the local system's locale > names spell "utf8". We could eliminate that by having it use the > trimmed locale names created by initdb. I would've made more of a push > for that if it resulted in a test case that passed on OS X, but it turns > out that once you get past the locale name spelling, you find out that > Macs still can't sort UTF8 strings correctly :-( Yeah, and I don't think it's likely they're ever going to fix it either. :( On their own website they explain how to do locale based sorting, and they provide their own interfaces for that which, unsurprisingly, uses UTF-16 and ICU underneath. http://developer.apple.com/library/mac/#documentation/CoreFoundation/Conceptual/CFLocales/Articles/CFLocaleConcepts.html I think the rule that we sort the same as the command-line sort program is still true however, so I don't think it's surprising as such. Just unfortunate. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Reading through this thread... On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ** Selecting a field from a record-returning function's output. > Currently, we'll use the field's declared collation; except that > if the field has default collation, we'll replace that with the common > collation of the function's inputs, if any. Is either part of that > sane? Do we need to make this work for functions invoked with other > syntax than a plain function call, eg operator or cast syntax? There were a couple of different ideas about which way we ought to go with this, but I'm happy to defer to what Tom and Martijn hashed out: MO> That seems all a bit weird. I spent some time reading through the SQL MO> spec to see if I could came up with a few ideas about what they thought MO> relevent. I think the gist of it is that I think the result row should MO> have for each column its declared collation in all cases. TL> That interpretation would be fine with me. It would let us get rid of TL> the special-case code at lines 307-324 of parse_collate.c, which I put TL> in only because there are cases in the collate.linux.utf8.sql regression TL> test that fail without it. But I'm perfectly happy to conclude that TL> those test cases are mistaken. I'm not sure whether that's been done, though, or whether we're even going to do it. > ** What to do with domains whose declaration includes a COLLATE clause? > Currently, we'll impute that collation to the result of a cast to the > domain type --- even if the cast's input expression includes an > explicit COLLATE clause. It's not clear that that's per spec. If it > is correct, should we behave similarly for functions that are declared > to return a domain type? Should it matter if the cast-to-domain is > explicit or implicit? Perhaps it'd be best if domain collations only > mattered for columns declared with that domain type. Then we'd have > a general rule that collations only come into play in an expression > as a result of (a) the declared type of a column reference or (b) > an explicit COLLATE clause. I think we had agreement than a cast to a domain type with a collation should stomp on any existing collation on the contained expression. > * In plpgsql, is it OK for declared local variables to inherit the > function's input collation? Should we provide a COLLATE option in > variable declarations to let that be overridden? I think everyone who responded said "yes" to both questions. > * RI triggers should insert COLLATE clauses in generated queries to > satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the > referenced column's collation. Right now you may get either table's > collation depending on which query type is involved. I think an obvious > failure may not be possible so long as equality means the same thing in > all collations, but it's definitely possible that the planner might > decide it can't use the referenced column's unique index, which would > suck for performance. (Note: this rule seems to prove that the > committee assumes equality can mean different things in different > collations, else they'd not have felt the need to specify.) I tested this and, indeed, if the collations don't match, the index can't be used. CREATE TABLE me (x character varying COLLATE "en_US"); CREATE TABLE me2 (x character varying COLLATE "es_ES"); CREATE TABLE you (x character varying COLLATE "es_ES" NOT NULL, PRIMARY KEY (x)); ALTER TABLE me ADD FOREIGN KEY (x) REFERENCES you(x); ALTER TABLE me2 ADD FOREIGN KEY (x) REFERENCES you(x); SET enable_seqscan=false; SET enable_hashjoin=false; With that setup, this still does a seqscan-and-sort: EXPLAIN select * from me, you where me.x = you.x; But this uses the index: EXPLAIN select * from me2, you where me2.x = you.x; I found another problem, too: rhaas=# insert into you values ('1'); INSERT 0 1 rhaas=# insert into me values ('1'); INSERT 0 1 rhaas=# alter table me alter column x set data type varchar collate "en_GB"; ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. CONTEXT: SQL statement "SELECT fk."x" FROM ONLY "public"."me" fk LEFT OUTER JOIN ONLY "public"."you" pk ON ( pk."x"::pg_catalog.text OPERATOR(pg_catalog.=) fk."x"::pg_catalog.text) WHERE pk."x" IS NULL AND (fk."x" IS NOT NULL)" > * It'd sure be nice if we had some nontrivial test cases that work in > encodings besides UTF8. I'm still bothered that the committed patch > failed to cover single-byte-encoding cases in upper/lower/initcap. Seems like no one knows how to do this. > * Remove initdb's warning about useless locales? Seems like pointless > noise, or at least something that can be relegated to debug mode. Everyone was in favor of this. > * Is it worth adding a cares-about-collation flag to pg_proc? Probably > too late to be worrying about such refinements for 9.1. Probably too late for this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Reading through this thread... > On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ** Selecting a field from a record-returning function's output. >> Currently, we'll use the field's declared collation; except that >> if the field has default collation, we'll replace that with the common >> collation of the function's inputs, if any. �Is either part of that >> sane? �Do we need to make this work for functions invoked with other >> syntax than a plain function call, eg operator or cast syntax? > There were a couple of different ideas about which way we ought to go > with this, but I'm happy to defer to what Tom and Martijn hashed out: > MO> That seems all a bit weird. I spent some time reading through the SQL > MO> spec to see if I could came up with a few ideas about what they thought > MO> relevent. I think the gist of it is that I think the result row should > MO> have for each column its declared collation in all cases. > TL> That interpretation would be fine with me. It would let us get rid of > TL> the special-case code at lines 307-324 of parse_collate.c, which I put > TL> in only because there are cases in the collate.linux.utf8.sql regression > TL> test that fail without it. But I'm perfectly happy to conclude that > TL> those test cases are mistaken. > I'm not sure whether that's been done, though, or whether we're even > going to do it. I looked a bit more closely at this, and I think I finally get the point of what those regression test cases involving the dup() function are about. Consider a trivial polymorphic function such as create function dummy(anyelement) returns anyelement as'select $1' language sql; When applied to a textual argument, this is a function taking and returning string, and so collation does (and should, I think) propagate through it. Thus in select dummy(x) from tbl order by 1; you will get ordering by the declared collation of tbl.x, whatever that is. But now consider create function dup(in anyelement, a out anyelement, b out anyelement)as 'select $1, $2' language sql; select dup(x).a from tbl order by 1; It's not unreasonable to think that this should also order by tbl.x's collation --- if collation propagates through dummy(), why not through dup()? And in fact those regression test cases are expecting that it does propagate in such a case. Now the discussion that we had earlier in this thread was implicitly assuming that we were talking about FieldSelect from a known composite type. If dup() were declared to return a named composite type, then using the collation that is declared for that type's "a" column seems reasonable. But when you're dealing with an anonymous record type, which is what dup() actually returns here, there is no such declaration; and what's more, the fact that there's a record type at all is just an implementation detail to most users. If we take out the kluge in parse_collate.c's handling of FieldSelects, then what we will get in this example is ordering by the database default collation. We can justify that on a narrow language-lawyering basis by saying "dup() returns a composite type, which has no collation, therefore collation does not propagate through from its arguments to any column you might select from its result". But it's going to feel a bit surprising to anyone who thinks of this in terms of OUT arguments rather than an anonymous composite type. I'm inclined to think that we should take out the kluge and rest on the language-lawyering viewpoint, because otherwise there are going to be umpteen other corner cases where people are going to expect collation to propagate and it's not going to work without very major kluging. Comments? regards, tom lane
On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > * Remove initdb's warning about useless locales? Seems like pointless > noise, or at least something that can be relegated to debug mode. Quick question on this: Should we at least warn if zero suitable locales were found or some other problem scenario? Or should we just wait around and see what actual problems, if any, will be reported?
Peter Eisentraut <peter_e@gmx.net> writes: > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: >> * Remove initdb's warning about useless locales? Seems like pointless >> noise, or at least something that can be relegated to debug mode. > Quick question on this: Should we at least warn if zero suitable > locales were found or some other problem scenario? Or should we just > wait around and see what actual problems, if any, will be reported? Well, my opinion is that normal users never see the output of initdb at all, so I don't think there's that much value in complaining there. But I don't have a serious objection to complaining if we couldn't find any usable locales at all, either. regards, tom lane
Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011: > Peter Eisentraut <peter_e@gmx.net> writes: > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > >> * Remove initdb's warning about useless locales? Seems like pointless > >> noise, or at least something that can be relegated to debug mode. > > > Quick question on this: Should we at least warn if zero suitable > > locales were found or some other problem scenario? Or should we just > > wait around and see what actual problems, if any, will be reported? > > Well, my opinion is that normal users never see the output of initdb > at all, so I don't think there's that much value in complaining there. Those users are not going to have those problems anyway. The problematic users are going to be those running on unusual platforms. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On fre, 2011-04-08 at 16:14 -0300, Alvaro Herrera wrote: > Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011: > > Peter Eisentraut <peter_e@gmx.net> writes: > > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > > >> * Remove initdb's warning about useless locales? Seems like pointless > > >> noise, or at least something that can be relegated to debug mode. > > > > > Quick question on this: Should we at least warn if zero suitable > > > locales were found or some other problem scenario? Or should we just > > > wait around and see what actual problems, if any, will be reported? > > > > Well, my opinion is that normal users never see the output of initdb > > at all, so I don't think there's that much value in complaining there. > > Those users are not going to have those problems anyway. The > problematic users are going to be those running on unusual platforms. So what is your opinion on the original question?
Excerpts from Peter Eisentraut's message of sáb abr 09 01:32:28 -0300 2011: > On fre, 2011-04-08 at 16:14 -0300, Alvaro Herrera wrote: > > Excerpts from Tom Lane's message of vie abr 08 15:27:15 -0300 2011: > > > Peter Eisentraut <peter_e@gmx.net> writes: > > > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > > > >> * Remove initdb's warning about useless locales? Seems like pointless > > > >> noise, or at least something that can be relegated to debug mode. > > > > > > > Quick question on this: Should we at least warn if zero suitable > > > > locales were found or some other problem scenario? Or should we just > > > > wait around and see what actual problems, if any, will be reported? > > > > > > Well, my opinion is that normal users never see the output of initdb > > > at all, so I don't think there's that much value in complaining there. > > > > Those users are not going to have those problems anyway. The > > problematic users are going to be those running on unusual platforms. > > So what is your opinion on the original question? I feel throwing a warning is appropriate here, because people for which problems are most likely are going to see them. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On mån, 2011-03-28 at 20:02 -0400, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On lör, 2011-03-26 at 00:36 -0400, Tom Lane wrote: > >> * It'd sure be nice if we had some nontrivial test cases that work > >> in encodings besides UTF8. I'm still bothered that the committed > >> patch failed to cover single-byte-encoding cases in >> > upper/lower/initcap. > > > Well, how do we want to maintain these test cases without doing too > > much duplication? It would be easy to run a small sed script over > > collate.linux.utf8.sql to create, say, a latin1 version out of it. > > I tried. The upper/lower test cases require Turkish characters that > aren't in Latin1. I'm not sure if we can readily produce test cases > that cover both sorting changes and case-folding changes in just one > single-byte encoding --- anybody? > > One thing I noticed but didn't push to committing is that the test > case has a largely-unnecessary assumption about how the local system's > locale names spell "utf8". We could eliminate that by having it use > the trimmed locale names created by initdb. I see you went for the latter option. That works pretty well already. I've also been playing around with separating out the "Turkish" tests into a separate file. That would then probably get the remaining "latin1" file passing, if we also dropped the encoding mention from this error message: ERROR: collation "foo" for encoding "UTF8" does not exist I had thought hard about this in the past and didn't want to do it, but since we are now making every effort to effectively hide collations with the wrong encoding, this would possibly be acceptable. I'm also seeing promising signs that we might get this test (minus Turkish, perhaps) passing on Windows.
Peter Eisentraut <peter_e@gmx.net> writes: > On mån, 2011-03-28 at 20:02 -0400, Tom Lane wrote: >> One thing I noticed but didn't push to committing is that the test >> case has a largely-unnecessary assumption about how the local system's >> locale names spell "utf8". We could eliminate that by having it use >> the trimmed locale names created by initdb. > I see you went for the latter option. That works pretty well already. > I've also been playing around with separating out the "Turkish" tests > into a separate file. That would then probably get the remaining > "latin1" file passing, if we also dropped the encoding mention from this > error message: > ERROR: collation "foo" for encoding "UTF8" does not exist > I had thought hard about this in the past and didn't want to do it, but > since we are now making every effort to effectively hide collations with > the wrong encoding, this would possibly be acceptable. Not sure. If we had the test refactored to the point where that was the only diff you got with a different server encoding, maybe it'd be worth changing; but right now we're still a long way from there. I was seeing this change as mainly targeted towards making the test useful on more platforms, and since that encoding name is ours and not platform-specific, it doesn't create any portability issues to show it. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > [ assorted comments on original issues ] I believe that all the collation issues I complained about on 26 March are now resolved, except for the question of getting some more test cases, and the question of adding a cares-about-collation flag to pg_proc. I've added the latter to the TODO list since everybody agreed it was too late to consider it for 9.1. However, I've come across a new issue that maybe requires discussion: what collation should be associated with a multi-row VALUES in FROM? For instance, in SELECT ... FROM (VALUES (1, 'foo'), (2, 'bar' COLLATE "C")) v(a,b), ... what collation should be imputed to references to v.b? The way the code currently works is that the first row of the VALUES list is inspected to determine what collation to report --- so in this example, you'd get default collation, ignoring the COLLATE clause in the second row. There are several problems with this: 1. I think it violates the SQL spec. SQL:2008 7.3 <table value constructor> says that the column types of a VALUES construct are resolved per the rules of section 9.3, and 9.3 is the one with the standard verbiage about resolving a common collation, so it's hard to see how "use the first row and ignore the rest" satisfies the spec. 2. It doesn't seem to satisfy the POLA --- in the above example, ignoring the explicit COLLATE clause is rather surprising. We could document that that's what it does, perhaps, but if you can attach COLLATE to any input of an operator or function and get the same results, it's hard to explain why the same isn't true of a VALUES column. Especially when VALUES resolves data types in a much more symmetrical fashion, eg this works: (VALUES (1, '1'), (2, '2'::real)) v(a,b) 3. It's not hard to imagine people thinking they can get row-by-row- varying collation behavior from something like SELECT ... FROM (VALUES (1, 'foo' COLLATE "en_US"), (2, 'bar' COLLATE "C")) v(a,b), ... So not throwing an error, but rather silently doing something other than what this SQL seems to say, seems pretty unfriendly to me. The reason I'm expending so much verbiage on this is that fixing it seems to require an addition to struct RangeTblEntry, ie, a catversion bump and forced initdb, so that we can store a list of the resolved column collations for an RTE_VALUES RTE. I don't really think there's much choice though. Comments? regards, tom lane
On Mon, 2011-04-18 at 10:41 -0400, Tom Lane wrote: > However, I've come across a new issue that maybe requires discussion: > what collation should be associated with a multi-row VALUES in FROM? > For instance, in > > SELECT ... FROM > (VALUES (1, 'foo'), (2, 'bar' COLLATE "C")) v(a,b), > ... > > what collation should be imputed to references to v.b? > > The way the code currently works is that the first row of the VALUES > list is inspected to determine what collation to report Hmm, I do see this on my list of things to address, so yes, it should be fixed. I likely got stuck because it's pretty complicated.