Thread: Re: [19] Proposal: function markers to indicate collation/ctype sensitivity
Re: [19] Proposal: function markers to indicate collation/ctype sensitivity
From
Peter Eisentraut
Date:
On 04.06.25 05:22, Jeff Davis wrote: > This proposal would add that dependency information, and importantly, > would be careful about which dependency entries are required for > particular expressions and which are not. > Introduce three new options when creating or altering a function, > operator or index AM: COLLATE, CTYPE, or EQUALITY, representing the > operations that the object is sensitive to. Yes, this has been on my todo list since the day collations were added, but for a different reason: We should be able to detect a failed collation derivation at parse time. This is according to the SQL standard, and also because it's arguably a better user experience. But we don't do that, so we have to check it at run time, which is what all these errmsg("could not determine which collation to use for string comparison") checks are for. The reason we don't do it at parse time is that we don't have the information which functions care about collations, which is exactly what you are proposing here to add. In my mind, I had this project listed under "procollate", but feel free to use a different name. But I would consider making this one setting with multiple values instead of multiple boolean settings. I don't mean to say that you should implement the parse-time collation derivation check as well, but we should design the catalog metadata so that that is possible.
On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote: > The reason we don't do it at parse time is that we don't have the > information which functions care about collations, which is exactly > what > you are proposing here to add. Currently, we have: create table c(x text collate "C", y text collate "en_US"); insert into c values ('x', 'y'); select x < y from c; -- fails (runtime check) select x || y from c; -- succeeds Surely, "<" would be marked as ordering-sensitive, and we could move the error to parse-time. But what about UDFs? If we assume that all UDFs are ordering-sensitive unless marked otherwise, then a user-defined version of "||" that previously worked would now start failing, until they add the ordering- insensitive mark. We'd need some kind of migration path where we could retain the runtime checks and disable the parse time checks until people have a chance to add the right marks to their UDFs. Migration paths like that are not great because they take several releases to work out, and we're never quite sure when to finally remove the deprecated behavior. If we make the opposite assumption, that none are ordering-sensitive unless we mark them so, that would allow properly-marked functions to fail at parse time, and the rest to fail at runtime. But this assumption doesn't work as well for recording dependencies, because we'd miss the dependencies for UDFs that aren't properly marked. Thoughts? Regards, Jeff Davis
On Thu, 2025-06-05 at 10:12 +0200, Peter Eisentraut wrote: > But I would consider making this one setting > with multiple values instead of multiple boolean settings. While we're at it, CTYPE is not very descriptive for a user-facing name. And COLLATE has become overloaded (expression clause, pg_collation object, ordering, or the superset of behaviors that includes CTYPE). Let's consider more user-friendly naming for the markers: CASE: lower/upper/initcap/fold behavior CLASS: char classifications such as [[:punct:]] ORDER: comparisons Internally, at least for the foreseeable future, CASE and CLASS would be the same. They'd just be different markers to record the user's intent. Also, we could use keywords in the DDL syntax, or we could use a new options syntax, or a comma-separated list as a string literal to specify the markers. I don't have a strong opinion on which route to take, but I chose the above names from existing keywords so we wouldn't have to add any. Regards, Jeff Davis