Thread: How to pass around collation information
I have been thinking about this collation support business a bit. Ignoring for the moment where we would get the actual collation routines from, I wonder how we are going to pass this information around in the system. Someone declares a collation on a column in a table, and somehow this information needs to arrive in bttextcmp() and friends. Also, functions that take in a string and return one (e.g., substring), need to take in this information and return it back out. How should this work? Option 1, make it part of the datum. That way it will pass through the system just fine, but it would waste a lot of storage and break just about everything that operates on string types now, as well as pg_upgrade. So that's probably out. Option 2, invent some new mechanism that accompanies a datum or a type whereever it goes. Kind of like typmod, but not really. Then the collation information would presumably be made available to functions through the fmgr interface. The binary representation of data values stays the same. Option 2a, while we are at it, are there any other things of this nature that would be worth supporting at the same time? I could imagine that having the option to pass around the ctype locale or the text search dictionary in a similar way could be useful. Is this something that could be combined with typmod or other dormant data type metadata requirements (PostGIS?, XML?)? Ideas?
Excerpts from Peter Eisentraut's message of vie may 28 12:27:52 -0400 2010: > Option 2, invent some new mechanism that accompanies a datum or a type > whereever it goes. Kind of like typmod, but not really. Then the > collation information would presumably be made available to functions > through the fmgr interface. The binary representation of data values > stays the same. Is the collation a property of the datum, or one of the comparison? If the latter, should it be really be made a sidecar of a datum, or would it make more sense to attach it to the operation being performed? I wonder if instead of trying to pass it down multiple layers till bttextcmp and further down, it would make more sense to set a global variable somewhere in the high levels, and only have it checked in varstr_cmp. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 28/05/10 19:27, Peter Eisentraut wrote: > I have been thinking about this collation support business a bit. > Ignoring for the moment where we would get the actual collation routines > from, I wonder how we are going to pass this information around in the > system. Someone declares a collation on a column in a table, and > somehow this information needs to arrive in bttextcmp() and friends. Yes. Comparison operators need it, as do functions like isalpha(). > Also, functions that take in a string and return one (e.g., substring), > need to take in this information and return it back out. How should > this work? Hmm, I don't see what substring would need collation for. And it certainly shouldn't be returning it. Collation is a property of the comparison operators (and isalpha etc.), and the planner needs to deduce the right collation for each such operation in the query. That involves looking at the tables and columns involved, as well as per-user information and any explicit COLLATE clauses in the query, but all that happens at plan-time. > Option 1, make it part of the datum. That way it will pass through the > system just fine, but it would waste a lot of storage and break just > about everything that operates on string types now, as well as > pg_upgrade. So that's probably out. It's also fundamentally wrong, collation is not a property of a datum but of the operation. > Option 2, invent some new mechanism that accompanies a datum or a type > whereever it goes. Kind of like typmod, but not really. Then the > collation information would presumably be made available to functions > through the fmgr interface. The binary representation of data values > stays the same. Something like that. I'm thinking that bttextcmp() and friends will simply take an extra argument indicating the collation, and we'll teach the operator / operator class infrastructure about that too. One way to approach this is to realize that it's already possible to use multiple collations in a database. You just have to define separate < = > operators and operator classes for every collation,and change all your queries to use the right operator depending on the desired collation everywhere where you use < = > (including ORDER BYs, with the USING <operator> syntax). The behavior is exactly what we want, it's just completely inpractical, so we need something to do the same in a less cumbersome way. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On fre, 2010-05-28 at 20:22 +0300, Heikki Linnakangas wrote: > It's also fundamentally wrong, collation is not a property of a datum > but of the operation. > One way to approach this is to realize that it's already possible to > use > multiple collations in a database. You just have to define separate < > = > > operators and operator classes for every collation, and change all > your queries to use the right operator depending on the desired > collation everywhere where you use < = > (including ORDER BYs, with > the > USING <operator> syntax). The behavior is exactly what we want, it's > just completely inpractical, so we need something to do the same in a > less cumbersome way. Well, maybe we should step back a little and work out what sort of feature we actually want, if any. The feature I'm thinking of is what people might call "per-column locale", and the SQL standard defines that. It would look like this: CREATE TABLE test ( a varchar COLLATE de, b varchar COLLATE fr ); SELECT * FROM test WHERE a > 'baz' ORDER BY b; So while it's true that the collation is used by the operations (> and ORDER BY), the information which collation to use comes with the data values. It's basically saying, a is in language "de", so sort it like that unless told otherwise. There is also an override syntax available, like this: SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; But here again the collation is attached to a data value, and only from there it is passed to the operator. What is actually happening is SELECT * FROM test WHERE (a COLLATE en) > 'baz' ORDER BY (b COLLATE sv); What you appear to be describing is a "per-operation locale", which also sounds valid, but it would be a different thing. It might be thought of as this: SELECT * FROM test WHERE a (> COLLATE en) 'baz' ORDER BY COLLATE sv b; with some suitable global default. So which one of these should it be?
Peter Eisentraut <peter_e@gmx.net> writes: > So while it's true that the collation is used by the operations (> and > ORDER BY), the information which collation to use comes with the data > values. It's basically saying, a is in language "de", so sort it like > that unless told otherwise. There is also an override syntax available, > like this: > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; That seems fairly bizarre. What does this mean: WHERE a COLLATE en > b COLLATE de ? If it's an error, why is this not an error WHERE a COLLATE en > b if b is marked as COLLATE de in its table? I guess the more general question is whether the spec expects that collation settings can be derived statically (like type information) or whether they might sometimes only be known at runtime. We also need to think about whether we're okay with only applying collation to built-in types (text, varchar, char) or whether we need the feature to work for add-on types as well. In particular, is citext still a meaningful feature if we have this, or is it superseded by COLLATE? In the abstract I'd prefer to let it work for user-defined types, but if we can have a much simpler implementation by not doing so, it might be better to give that up. Is COLLATE a property that can be attached to a domain over text? regards, tom lane
On Fri, May 28, 2010 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> So while it's true that the collation is used by the operations (> and >> ORDER BY), the information which collation to use comes with the data >> values. It's basically saying, a is in language "de", so sort it like >> that unless told otherwise. There is also an override syntax available, >> like this: > >> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; > > That seems fairly bizarre. What does this mean: > > WHERE a COLLATE en > b COLLATE de > > ? If it's an error, why is this not an error > > WHERE a COLLATE en > b > > if b is marked as COLLATE de in its table? I think we need to think of the comparison operators as ternary, and the COLLATE syntax applied to columns or present in queries as various ways of setting defaults or explicit overrides for what the third argument will end up being. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote: > I think we need to think of the comparison operators as ternary, and > the COLLATE syntax applied to columns or present in queries as various > ways of setting defaults or explicit overrides for what the third > argument will end up being. How could this extend to things like isalpha() or upper() that would need access to ctype information?
On fre, 2010-05-28 at 14:48 -0400, Tom Lane wrote: > > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; > > That seems fairly bizarre. What does this mean: > > WHERE a COLLATE en > b COLLATE de > > ? If it's an error, why is this not an error > > WHERE a COLLATE en > b > > if b is marked as COLLATE de in its table? The way I understand it, a collation "derivation" can be explicit or implicit. Explicit derivations override implicit derivations. If in the argument set of an operation, explicit collation derivations exist, they must all be the same. > I guess the more general question is whether the spec expects that > collation settings can be derived statically (like type information) > or whether they might sometimes only be known at runtime. It looks like it is treated like type information. The derivation and validation rules are part of the Syntax Rules. > We also need to think about whether we're okay with only applying > collation to built-in types (text, varchar, char) or whether we need > the feature to work for add-on types as well. In particular, is citext > still a meaningful feature if we have this, or is it superseded by > COLLATE? In the abstract I'd prefer to let it work for user-defined > types, but if we can have a much simpler implementation by not doing > so, it might be better to give that up. I think if we get this done using the strcoll_l() API to do the work, which looks like the path of least resistance at the moment, citext would still be useful because all the standard locales would still be case sensitive. > Is COLLATE a property that can be attached to a domain over text? According to the spec, yes.
Hi, Peter Eisentraut <peter_e@gmx.net> writes: > On fre, 2010-05-28 at 20:22 +0300, Heikki Linnakangas wrote: >> USING <operator> syntax). The behavior is exactly what we want, it's >> just completely inpractical, so we need something to do the same in a >> less cumbersome way. For an example, here is something I did to better understand the system a while ago. Of course I never got to use it for real: http://pgsql.tapoueh.org/btree_fr_ops/ > Well, maybe we should step back a little and work out what sort of > feature we actually want, if any. The feature I'm thinking of is what > people might call "per-column locale", and the SQL standard defines > that. It would look like this: > > CREATE TABLE test ( > a varchar COLLATE de, > b varchar COLLATE fr > ); > > SELECT * FROM test WHERE a > 'baz' ORDER BY b; > > So while it's true that the collation is used by the operations (> and > ORDER BY), the information which collation to use comes with the data > values. It's basically saying, a is in language "de", so sort it like > that unless told otherwise. There is also an override syntax available, > like this: > > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; > > But here again the collation is attached to a data value, and only from > there it is passed to the operator. What is actually happening is > > SELECT * FROM test WHERE (a COLLATE en) > 'baz' ORDER BY (b COLLATE sv); > > > What you appear to be describing is a "per-operation locale", which also > sounds valid, but it would be a different thing. It might be thought of > as this: > > SELECT * FROM test WHERE a (> COLLATE en) 'baz' ORDER BY COLLATE sv b; > > with some suitable global default. > > > So which one of these should it be? My understanding is that what we do is per-operation locale. The locale information bears no semantic when not attached to some operation on strings, like sorting or comparing. So what you're showing here I think is how to attach a collation label to every string in the system, at the catalog level or dynamically at the query level. Now this collation label will only be used whenever you want to use a collation aware function or operator. Those functions need to get the labels for their implementation to have the expected meaning. So we need both to attach collations to all known strings (defaulting to the current database collation I guess), as you showed at the SQL level, and to pass this information down to the functions operating on those strings. A confusing example on this grounds would be the following, which I hope the standard disallow: SELECT * FROM test WHERE a COLLATE en > b COLLATE sv; Regards, -- dim
On Fri, May 28, 2010 at 3:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote: >> I think we need to think of the comparison operators as ternary, and >> the COLLATE syntax applied to columns or present in queries as various >> ways of setting defaults or explicit overrides for what the third >> argument will end up being. > > How could this extend to things like isalpha() or upper() that would > need access to ctype information? Good question. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
2010/5/28 alvherre <alvherre@commandprompt.com>: > Excerpts from Peter Eisentraut's message of vie may 28 12:27:52 -0400 2010: > >> Option 2, invent some new mechanism that accompanies a datum or a type >> whereever it goes. Kind of like typmod, but not really. Then the >> collation information would presumably be made available to functions >> through the fmgr interface. The binary representation of data values >> stays the same. > > Is the collation a property of the datum, or one of the comparison? > If the latter, should it be really be made a sidecar of a datum, or > would it make more sense to attach it to the operation being performed? > > I wonder if instead of trying to pass it down multiple layers till > bttextcmp and further down, it would make more sense to set a global > variable somewhere in the high levels, and only have it checked in > varstr_cmp. > Maybe collation is property of some operation: func call, sort, ... I prefer to put collation info to FunctionCallInfo structure. Usually you cannot change collation per row - collation is attached to expression. Regards Pavel > -- > Álvaro Herrera <alvherre@commandprompt.com> > The PostgreSQL Company - Command Prompt, Inc. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On 28/05/10 23:15, Robert Haas wrote: > On Fri, May 28, 2010 at 3:20 PM, Peter Eisentraut<peter_e@gmx.net> wrote: >> On fre, 2010-05-28 at 15:03 -0400, Robert Haas wrote: >>> I think we need to think of the comparison operators as ternary, and >>> the COLLATE syntax applied to columns or present in queries as various >>> ways of setting defaults or explicit overrides for what the third >>> argument will end up being. >> >> How could this extend to things like isalpha() or upper() that would >> need access to ctype information? > > Good question. :-( Strictly speaking, collation and ctype are two different things. Which is a convenient way to evade the question :-). But you could ask, how would we handle more fine-grained ctype in upper() then? Perhaps by adding a second argument for ctype. Similarly to to_tsvector([config, ] string), you could explicitly pass the ctype as an argument, or leave it out in which case a default is used. It wouldn't give you per-column ctype, though. What does the spec have to say about the ctype used for upper() et al BTW? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, May 28, 2010 at 10:32:34PM +0300, Peter Eisentraut wrote: > On fre, 2010-05-28 at 14:48 -0400, Tom Lane wrote: > > > SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv; > > > > That seems fairly bizarre. What does this mean: > > > > WHERE a COLLATE en > b COLLATE de > > > > ? If it's an error, why is this not an error > > > > WHERE a COLLATE en > b > > > > if b is marked as COLLATE de in its table? > > The way I understand it, a collation "derivation" can be explicit or > implicit. Explicit derivations override implicit derivations. If in > the argument set of an operation, explicit collation derivations exist, > they must all be the same. The SQL standard has an explicit set of rules for determining the collations of any particular operation (they apply to operators/functions not to the datums). The basic idea is that tables/columns/data types define an implicit collation, which can be overidden by explicit collations. If there is ambiguity you throw an error. I implemented this all several years ago, it's not all that complicated really. IIRC I added a field to the Node type and each level determined it's collection from the sublevels. I solved the problem for the OP by providing an extra function to user defined functions which would return the collation for that particular call. The more interesting question I found was that the standard only defined collation for strings, whereas it can be applied much more broadly. I described a possible solution several years back, it should in the archives somewhere. It worked pretty well as I recall. IIRC The idea was to let each type has its own set of collations and when using an operator/function you let the collection be determined by the argument that had the same type as the return type. It would be nice if COLLATE could finally be implemented, it'd be quite useful. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
On lör, 2010-05-29 at 00:18 +0300, Heikki Linnakangas wrote: > What does the spec have to say about the ctype used for upper() et al > BTW? It doesn't make any provisions for locale dependencies for that.
On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote: > The feature I'm thinking of is what > people might call "per-column locale", and the SQL standard defines > that. It would look like this: > > CREATE TABLE test ( > a varchar COLLATE de, > b varchar COLLATE fr > ); > > SELECT * FROM test WHERE a > 'baz' ORDER BY b; Perhaps it's also worth pointing out there could be use cases other than supporting multiple natural languages. For example, it is frequently requested to be able to sort in ways that doesn't ignore special characters, binary sort, or perhaps special file name sort that treats '/' special in some way. So it could be quite useful to be able to say CREATE TABLE something ( description text COLLATE en, code char(6) COLLATE binary, file text COLLATE filename_sort ); or even something like CREATE DOMAIN filename AS text COLLATE filename_sort; These are examples where having the collation attached to the column would appear to make more sense then having it attached only to operations.
On Wed, Jun 2, 2010 at 3:46 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On fre, 2010-05-28 at 20:59 +0300, Peter Eisentraut wrote: >> The feature I'm thinking of is what >> people might call "per-column locale", and the SQL standard defines >> that. It would look like this: >> >> CREATE TABLE test ( >> a varchar COLLATE de, >> b varchar COLLATE fr >> ); >> >> SELECT * FROM test WHERE a > 'baz' ORDER BY b; > > Perhaps it's also worth pointing out there could be use cases other than > supporting multiple natural languages. For example, it is frequently > requested to be able to sort in ways that doesn't ignore special > characters, binary sort, or perhaps special file name sort that treats > '/' special in some way. So it could be quite useful to be able to say > > CREATE TABLE something ( > description text COLLATE en, > code char(6) COLLATE binary, > file text COLLATE filename_sort > ); > > or even something like > > CREATE DOMAIN filename AS text COLLATE filename_sort; > > These are examples where having the collation attached to the column > would appear to make more sense then having it attached only to > operations. But in the end the only purpose of setting it on a column is to set which one will be used for operations on that column. And the user might still override it for a particular query. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote: > But in the end the only purpose of setting it on a column is to set > which one will be used for operations on that column. And the user > might still override it for a particular query. Of course. I'm just saying that it *can* be useful to attach a collation to a column definition, rather than only allowing it to be specified along with the sort operation.
On Jun 3, 2010, at 2:43 AM, Peter Eisentraut wrote: > On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote: >> But in the end the only purpose of setting it on a column is to set >> which one will be used for operations on that column. And the user >> might still override it for a particular query. > > Of course. I'm just saying that it *can* be useful to attach a > collation to a column definition, rather than only allowing it to be > specified along with the sort operation. How does collation relate to per-table/column encodings? For that matter, are per-table/column encodings spec, and/or somethingthat we're looking to implement down the line? Regards, David -- David Christensen End Point Corporation david@endpoint.com
On tor, 2010-06-03 at 07:30 -0500, David Christensen wrote: > How does collation relate to per-table/column encodings? There is some connection between collations and character sets or encodings, because a collation is tied to one of those, just as a necessity of implementation (depending on implementation details). You could have per-column (or per-some-other-small-unit) collation with a global encoding, but a per-column encoding with a global collation wouldn't work. > For that matter, are per-table/column encodings spec, Yes. > and/or something that we're looking to implement down the line? I don't think anyone is seriously planning that. But per-column collations would have to come first, anyway. Of course there is always the possibility that someone comes up with an alternative plan that invalidates the above, but the above represents the facts from the SQL standard and other implementations.
On Thu, Jun 3, 2010 at 3:43 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On ons, 2010-06-02 at 16:56 -0400, Robert Haas wrote: >> But in the end the only purpose of setting it on a column is to set >> which one will be used for operations on that column. And the user >> might still override it for a particular query. > > Of course. I'm just saying that it *can* be useful to attach a > collation to a column definition, rather than only allowing it to be > specified along with the sort operation. Oh, I agree with that, for sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company