Thread: Collation version tracking for macOS
During development, I have been using the attached patch to simulate libc collation versions on macOS. It just uses the internal major OS version number. I don't know to what the extend the libc locales on macOS are maintained or updated at all, so I don't know what practical effect this would have. Again, it's mainly for development. If there is interest from others, I think we could add this, maybe disabled by default, or we just keep it in the mailing list archives for interested parties.
Attachment
On Mon, Feb 14, 2022 at 10:00 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > During development, I have been using the attached patch to simulate > libc collation versions on macOS. It just uses the internal major OS > version number. I don't know to what the extend the libc locales on > macOS are maintained or updated at all, so I don't know what practical > effect this would have. Again, it's mainly for development. If there > is interest from others, I think we could add this, maybe disabled by > default, or we just keep it in the mailing list archives for interested > parties. Last time I looked into this it seemed like macOS's strcoll() gave sensible answers in the traditional single-byte encodings, but didn't understand UTF-8 at all so you get C/strcmp() order. In other words there was effectively nothing to version. I remember that other old Unixes used to be like that, and I suspect that they might be using old pre-UTF-8 FreeBSD code for locales based on a quick peek at [1] (though FreeBSD itself has since learned to do CLDR-based UTF-8 sorting with a completely new implementation shared with other OSes). This makes me wonder if Apple is hiding another collation implementation somewhere up its sleeve -- surely that libc support is not good enough for the world's shiny globalised macOS/iOS apps? Maybe UCCompareText() and friends (UnicodeUtilitiesCoreLib) and the various Obj-C NSString comparison stuff, all of which probably predates Unixoid macOS (google tells me that UnicodeUtilities.h was present in macOS 9). It wouldn't be surprising if it shares nothing with the modern OS's C runtime stuff that came via NeXT. Just mentioning this as a curiosity, because I was trying to figure out how that could be left non-working without anyone complaining... [1] https://github.com/apple-open-source-mirror/Libc/tree/master/locale
On 07.05.22 02:31, Thomas Munro wrote: >> During development, I have been using the attached patch to simulate >> libc collation versions on macOS. It just uses the internal major OS >> version number. I don't know to what the extend the libc locales on >> macOS are maintained or updated at all, so I don't know what practical >> effect this would have. Again, it's mainly for development. If there >> is interest from others, I think we could add this, maybe disabled by >> default, or we just keep it in the mailing list archives for interested >> parties. > Last time I looked into this it seemed like macOS's strcoll() gave > sensible answers in the traditional single-byte encodings, but didn't > understand UTF-8 at all so you get C/strcmp() order. In other words > there was effectively nothing to version. Someone recently told me that collations in macOS have actually changed recently and that this is a live problem. See explanation here: https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66 So I think we should reconsider this patch, even for PG15.
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > On 07.05.22 02:31, Thomas Munro wrote: >> Last time I looked into this it seemed like macOS's strcoll() gave >> sensible answers in the traditional single-byte encodings, but didn't >> understand UTF-8 at all so you get C/strcmp() order. In other words >> there was effectively nothing to version. > Someone recently told me that collations in macOS have actually changed > recently and that this is a live problem. See explanation here: > https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66 > So I think we should reconsider this patch, even for PG15. According to that document, they changed it in macOS 11, which came out a year and a half ago. Given the lack of complaints, it doesn't seem like this is urgent enough to mandate a post-beta change that would have lots of downside (namely, false-positive warnings for every other macOS update). regards, tom lane
On 6/3/22 9:21 AM, Tom Lane wrote: > > According to that document, they changed it in macOS 11, which came out > a year and a half ago. Given the lack of complaints, it doesn't seem > like this is urgent enough to mandate a post-beta change that would > have lots of downside (namely, false-positive warnings for every other > macOS update). Sorry, I'm going to rant for a minute... it is my very strong opinion that using language like "false positive" here is misguided and dangerous. If new version of sort order is released, for example when they recently updated backwards-secondary sorting in french [CLDR-2905] or matching of v and w in swedish and finnish [CLDR-7088], it is very dangerous to use language like “false positive” to describe a database where there just didn't happen to be any rows with accented french characters at the point in time where PostgreSQL magically changed which version of sort order it was using from the 2010 french version to the 2020 french version. No other piece of software that calls itself a database would do what PostgreSQL is doing: just give users a "warning" after suddenly changing the sort order algorithm (most users won't even read warnings in their logs). Oracle, DB2, SQL Server and even MySQL carefully version collation data, hardcode a pseudo-linguistic collation into the DB (like PG does for timezones), and if they provide updates to linguistic sort order (from Unicode CLDR) then they allow the user to explicitly specify which version of french or german ICU sorting they are want to use. Different versions are treated as different sort orders; they are not conflated. I have personally seen PostgreSQL databases where an update to an old version of glibc was applied (I'm not even talking 2.28 here) and it resulted in data loss b/c crash recovery couldn't replay WAL records and the user had to do a PITR. That's aside from the more common issues of segfaults or duplicate records that violate unique constraints or wrong query results like missing data. And it's not just updates - people can set up a hot standby on a different version and see many of these problems too. Collation versioning absolutely must be first class and directly controlled by users, and it's very dangerous to allow users - at all - to take an index and then use a different version than what the index was built with. Not to mention all the other places in the DB where collation is used... partitioning, constraints, and any other place where persisted data can make an assumption about any sort of string comparison. It feels to me like we're still not really thinking clearly about this within the PG community, and that the seriousness of this issue is not fully understood. -Jeremy Schneider -- http://about.me/jeremy_schneider
On Sat, Jun 4, 2022 at 12:17 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 07.05.22 02:31, Thomas Munro wrote: > > Last time I looked into this it seemed like macOS's strcoll() gave > > sensible answers in the traditional single-byte encodings, but didn't > > understand UTF-8 at all so you get C/strcmp() order. In other words > > there was effectively nothing to version. > > Someone recently told me that collations in macOS have actually changed > recently and that this is a live problem. See explanation here: > > https://github.com/PostgresApp/PostgresApp/blob/master/docs/documentation/reindex-warning.md?plain=1#L66 How can I see evidence of this? I'm comparing Debian, FreeBSD and macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort /usr/share/dict/words" I get upper and lower case mixed together on the other OSes, but on the Mac the upper case comes first, which is my usual smoke test for "am I looking at binary sort order?"
On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > No other piece of software that calls itself a database would do what > PostgreSQL is doing: just give users a "warning" after suddenly changing > the sort order algorithm (most users won't even read warnings in their > logs). Oracle, DB2, SQL Server and even MySQL carefully version > collation data, hardcode a pseudo-linguistic collation into the DB (like > PG does for timezones), and if they provide updates to linguistic sort > order (from Unicode CLDR) then they allow the user to explicitly specify > which version of french or german ICU sorting they are want to use. > Different versions are treated as different sort orders; they are not > conflated. I guess you know this but for the record, there have been discussions before about supporting multiple versions of ICU concurrently, like DB2. For example, one idea was that each ICU version could be a separate "provider" in PostgreSQL, so you can concurrently use multiple versions. One of several places this came up: https://www.postgresql.org/message-id/CADE5jYJTnYaTNXMFKOK-0p44%2BDm5LMcRcJ5kVi1MVHomb2QTkQ%40mail.gmail.com > It feels to me like we're still not really thinking clearly about this > within the PG community, and that the seriousness of this issue is not > fully understood. FWIW A couple of us tried quite hard to make smarter warnings, and that thread and others discussed a lot of those topics, like the relevance to constraints and so forth.
Thomas Munro <thomas.munro@gmail.com> writes: > On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider > <schneider@ardentperf.com> wrote: >> It feels to me like we're still not really thinking clearly about this >> within the PG community, and that the seriousness of this issue is not >> fully understood. > FWIW A couple of us tried quite hard to make smarter warnings, and > that thread and others discussed a lot of those topics, like the > relevance to constraints and so forth. I think the real problem here is that the underlying software mostly doesn't take this issue seriously. Unfortunately, that leads one to the conclusion that we need to maintain our own collation code and data (e.g., our own fork of ICU), and that isn't happening. Unlike say Oracle, we do not have the manpower; nor do we want to bloat our code base that much. Short of maintaining our own fork, ranting about the imperfections of the situation is a waste of time. regards, tom lane
On 6/3/22 3:58 PM, Tom Lane wrote > Thomas Munro <thomas.munro@gmail.com> writes: >> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider >> <schneider@ardentperf.com> wrote: >>> It feels to me like we're still not really thinking clearly about this >>> within the PG community, and that the seriousness of this issue is not >>> fully understood. >> FWIW A couple of us tried quite hard to make smarter warnings, and >> that thread and others discussed a lot of those topics, like the >> relevance to constraints and so forth. > I think the real problem here is that the underlying software mostly > doesn't take this issue seriously. Unfortunately, that leads one to > the conclusion that we need to maintain our own collation code and > data (e.g., our own fork of ICU), and that isn't happening. Unlike > say Oracle, we do not have the manpower; nor do we want to bloat our > code base that much. > > Short of maintaining our own fork, ranting about the imperfections > of the situation is a waste of time. The first step to a solution is admitting that the problem exists. Ignoring broken backups, segfaults and data corruption as a "rant" implies that we simply throw in the towel and tell users to suck it up or switch engines. There are other ways to address this short of the community doing all the work itself. One simple example would be to refuse to start if the collation provider has changed since initdb (which we'd need to allow users to override). A more sophisticated option would be to provide the machinery for supporting multiple collation libraries. Both of those at least ensure that users are aware any time there's a problem, which IMO is *enormously* better than letting core functionality silently stop working.
Jim Nasby <nasbyj@amazon.com> writes: >> I think the real problem here is that the underlying software mostly >> doesn't take this issue seriously. > The first step to a solution is admitting that the problem exists. > Ignoring broken backups, segfaults and data corruption as a "rant" > implies that we simply throw in the towel and tell users to suck it up > or switch engines. There are other ways to address this short of the > community doing all the work itself. One simple example would be to > refuse to start if the collation provider has changed since initdb > (which we'd need to allow users to override). You're conveniently skipping over the hard part, which is to tell whether the collation provider has changed behavior (which we'd better do with pretty darn high accuracy, if we're going to refuse to start on the basis of thinking it has). Unfortunately, giving a reliable indication of collation behavioral changes is *exactly* the thing that the providers aren't taking seriously. regards, tom lane
> On Jun 6, 2022, at 17:10, Jim Nasby <nasbyj@amazon.com> wrote: > Ignoring broken backups, segfaults and data corruption as a "rant" implies that we simply throw in the towel and tellusers to suck it up or switch engines. Well now, let’s be clear, I was the one who called my email a “rant”. 🙂 And I do apologize for that - it was grumpy and impulsive and Tom isn’t wrong that rants don’t usually help move things forward. Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material;I haven’t read the whole thread yet. If you have some others that would also be particularly good background, letme know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I waspulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years,so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in theconversation here. Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warn onversion mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is prettydamn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things tohappen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades oflarge & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I havea lot of reading to do, to really understand how that happened and where the dialogue is today. Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), but Istill feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough” for99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. I think glibcneeds to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQL wasn’t GPLthen I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though, it feels likemy idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybe I’m wrong?) Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for thenon-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>). -Jeremy Sent from my TI-83
On Tue, Jun 7, 2022 at 12:10 PM Jim Nasby <nasbyj@amazon.com> wrote: > On 6/3/22 3:58 PM, Tom Lane wrote > > Thomas Munro <thomas.munro@gmail.com> writes: > >> On Sat, Jun 4, 2022 at 7:13 AM Jeremy Schneider > >> <schneider@ardentperf.com> wrote: > >>> It feels to me like we're still not really thinking clearly about this > >>> within the PG community, and that the seriousness of this issue is not > >>> fully understood. > >> FWIW A couple of us tried quite hard to make smarter warnings, and > >> that thread and others discussed a lot of those topics, like the > >> relevance to constraints and so forth. > > I think the real problem here is that the underlying software mostly > > doesn't take this issue seriously. Unfortunately, that leads one to > > the conclusion that we need to maintain our own collation code and > > data (e.g., our own fork of ICU), and that isn't happening. Unlike > > say Oracle, we do not have the manpower; nor do we want to bloat our > > code base that much. > > > > Short of maintaining our own fork, ranting about the imperfections > > of the situation is a waste of time. > The first step to a solution is admitting that the problem exists. We've been discussing this topic for years and I don't think anyone thinks the case is closed... > Ignoring broken backups, segfaults and data corruption as a "rant" > implies that we simply throw in the towel and tell users to suck it up > or switch engines. There are other ways to address this short of the > community doing all the work itself. One simple example would be to > refuse to start if the collation provider has changed since initdb > (which we'd need to allow users to override). Yeah, it's been discussed, but never proposed. The problem is that you need to start up to fix the problem. Another option is not to use affected indexes, but that doesn't help with other forms of the problem (partition constraints, etc). > A more sophisticated > option would be to provide the machinery for supporting multiple > collation libraries. Earlier I mentioned distinct "providers" but I take that back, that's too complicated. Reprising an old idea that comes up each time we talk about this, this time with some more straw-man detail: what about teaching our ICU support to understand "libicu18n.so.71:en" to mean that it should dlopen() that library and use its functions? Or some cleverer, shorter notation. Then it's the user's problem to make sure the right libraries are installed, and it'll fail if they're not. For example, on Debian bookworm right now you can install libicu63, libicu67, libicu71, though only the "current" -dev package, but which I'm sure we can cope with. You're at the mercy of the distro or add-on package repos to keep a lot of versions around, but that seems OK. Maintaining our own fork(s) of ICU would seem like massive overkill and I don't think anyone has suggested that; the question on my mind is whether we could rely on existing packages. Then you'd be exposed only to changes that happen within (say) the ICU 63 package's lifetime... I recall looking into whether that can happen but ... I don't recall the answer.
On Mon, Jun 6, 2022 at 8:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <nasbyj@amazon.com> writes:
>> I think the real problem here is that the underlying software mostly
>> doesn't take this issue seriously.
> The first step to a solution is admitting that the problem exists.
> Ignoring broken backups, segfaults and data corruption as a "rant"
> implies that we simply throw in the towel and tell users to suck it up
> or switch engines. There are other ways to address this short of the
> community doing all the work itself. One simple example would be to
> refuse to start if the collation provider has changed since initdb
> (which we'd need to allow users to override).
You're conveniently skipping over the hard part, which is to tell
whether the collation provider has changed behavior (which we'd better
do with pretty darn high accuracy, if we're going to refuse to start
on the basis of thinking it has). Unfortunately, giving a reliable
indication of collation behavioral changes is *exactly* the thing
that the providers aren't taking seriously.
Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then running crc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databases use more than a couple different collations.
--
Rod Taylor
On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote: > Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databasesuse more than a couple different collations. Collation rules have multiple levels and all kinds of quirks, so that won't work.
On Fri, Jun 3, 2022 at 4:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think the real problem here is that the underlying software mostly > doesn't take this issue seriously. Unfortunately, that leads one to > the conclusion that we need to maintain our own collation code and > data (e.g., our own fork of ICU), and that isn't happening. Unlike > say Oracle, we do not have the manpower; nor do we want to bloat our > code base that much. You don't, but that opinion isn't universally held, or at least not with the same vigor that you hold it. See e.g. https://www.postgresql.org/message-id/a4019c5e570d4dbb5e3f816c080fb57c76ab604a.camel%40cybertec.at and subsequent discussion, for example. In fact, I'd go so far as to argue that you're basically sticking your head in the sand here. You wrote: "Given the lack of complaints, it doesn't seem like this is urgent enough to mandate a post-beta change that would have lots of downside (namely, false-positive warnings for every other macOS update)." But you wrote that to Peter, who was essentially complaining that we hadn't done anything, and linked to another source, which was also complaining about the problem, and then Jeremy Schneider replied to your email and complained some more. Complaining about "false positives" doesn't really make sense to me. It's true that we don't have any false positives right now, but we also have no true positives. Even a stopped clock is right twice a day, but not in a useful way. People want to be notified when a problem might exist, even if sometimes it doesn't actually. The alternative is having no idea at all that things might be broken, which is not better. -- Robert Haas EDB: http://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote: >> Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databasesuse more than a couple different collations. > Collation rules have multiple levels and all kinds of quirks, so that > won't work. Yeah, and it's exactly at the level of quirks that things are likely to change. Nobody's going to suddenly start sorting B before A. They might, say, change their minds about where the digram "cz" sorts relative to single letters, in languages where special rules for that are a thing. The idea of fingerprinting a collation's behavior is interesting, but I've got doubts about whether we can make a sufficiently thorough fingerprint. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > In fact, I'd go so far as to argue that you're basically sticking your > head in the sand here. You wrote: No, I quite agree that we have a problem. What I don't agree is that issuing a lot of false-positive warnings is a solution. That will just condition people to ignore the warnings, and then when their platform really does change behavior, they're still screwed. If we could *accurately* report collation behavioral changes, I'd be all for that. Rod's idea upthread is certainly way too simplistic, but could we build a set of test cases that do detect known changes in collation behaviors? We'd be shooting at a moving target; but even if we're late in noticing that platform X changed the behavior of collation Y, we could help users who run in the problem afterwards. regards, tom lane
On Tue, Jun 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > It's true that we don't have any false positives right now, but we > also have no true positives. Even a stopped clock is right twice a > day, but not in a useful way. People want to be notified when a > problem might exist, even if sometimes it doesn't actually. Collations by their very nature are unlikely to change all that much. Obviously they can and do change, but the details are presumably pretty insignificant to a native speaker. Stands to reason that the issue (which is fundamentally a problem for natural language experts) would have been resolved far sooner if there really was a significant controversy about something that tends to come up often. It's pretty clear that glibc as a project doesn't take the issue very seriously, because they see it as a problem of the GUI sorting a table in a way that seems slightly suboptimal to scholars of a natural language. Clearly that isn't actually a big deal. But the latent possibility of wrong answers to queries is a very big deal. Both are true. It's just a matter of priorities in each case. I agree that "false positive" is not a valid way of describing a breaking change in a Postgres collation that happens to not affect one index in particular, due to the current phase of the moon. It's probably very likely that most individual indexes that we warn about will be so-called false positives. I bet Postgres that there are many near-misses that we never get to hear about already. That's rather beside the point. The index must be assumed to be corrupt. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > I agree that "false positive" is not a valid way of describing a > breaking change in a Postgres collation that happens to not affect one > index in particular, due to the current phase of the moon. It's > probably very likely that most individual indexes that we warn about > will be so-called false positives. This is not the concern that I have. I agree that if we tell a user that collation X changed behavior and he'd better reindex his indexes that use collation X, but none of them actually contain any cases that changed behavior, that's not a "false positive" --- that's "it's cheaper to reindex than to try to identify whether there's a problem". What I mean by "false positive" is telling every macOS user that they'd better reindex everything every year, when in point of fact Apple changes those collations almost never. We will soon lose those users' attention --- see fable about boy crying wolf --- and then when Apple actually does change something, we've got a problem. So if we give collation-change warnings, they'd better have some measurable connection to reality. regards, tom lane
On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Earlier I mentioned distinct "providers" but I take that back, that's > too complicated. Reprising an old idea that comes up each time we > talk about this, this time with some more straw-man detail: what about > teaching our ICU support to understand "libicu18n.so.71:en" to mean > that it should dlopen() that library and use its functions? Or some > cleverer, shorter notation. Then it's the user's problem to make sure > the right libraries are installed, and it'll fail if they're not. For > example, on Debian bookworm right now you can install libicu63, > libicu67, libicu71, though only the "current" -dev package, but which > I'm sure we can cope with. You're at the mercy of the distro or > add-on package repos to keep a lot of versions around, but that seems > OK. Right. Postgres could link to multiple versions of ICU at the same time. Right now it doesn't, and right now the ICU C symbol names that we use are actually versioned (this isn't immediately apparent because the C preprocessor makes it appear that ICU symbol names are generic). We could perhaps invent a new indirection that knows about multiple ICU versions, each of which is an independent collation provider, or maybe a related collation provider that gets used by default on REINDEX. ICU is designed for this kind of thing. That approach more or less puts packagers on the hook for managing collation stability. But now long term collation stability is at least feasible -- we at least have a coherent strategy. In the worst case the community .deb and .rpm repos might continue to support an older ICU version, or lobby for its continued support by the distro (while actively discouraging its use in new databases). This isn't the same thing as forking ICU. It's a compromise between that extreme, and the current situation. -- Peter Geoghegan
On Wed, Jun 8, 2022 at 7:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > The idea of fingerprinting a collation's behavior is interesting, > but I've got doubts about whether we can make a sufficiently thorough > fingerprint. On one of the many threads about this I recall posting a thought experiment patch that added system_collation_version_command or some such, so you could train your computer to compute a hash for /usr/share/locale/XXX/LC_COLLATE (or whatever it's called on your system), but it all seemed a bit gross to me on various levels. Most people don't know or care about collations so they won't set it up, so to make it useful it'd have to have useful defaults, and it seems like a bad idea to teach PostgreSQL where all these systems keep their collation rules.
On 6/7/22 12:53 PM, Peter Geoghegan wrote: > > Collations by their very nature are unlikely to change all that much. > Obviously they can and do change, but the details are presumably > pretty insignificant to a native speaker. This idea does seem to persist. It's not as frequent as timezones, but collation rules reflect local dialects and customs, and there are changes quite regularly for a variety of reasons. A brief perusal of CLDR changelogs and CLDR jiras can give some insight here: https://github.com/unicode-org/cldr https://unicode-org.atlassian.net/jira/software/c/projects/CLDR/issues/?jql=project%20%3D%20%22CLDR%22%20AND%20text%20~%20%22collation%22%20ORDER%20BY%20created%20DESC The difference between the unicode consortium and the GNU C Library is that unicode is maintained by people who are specifically interested in working with language and internationalization challenges. I've spoken to a glibc maintainer who directly told me that they dislike working with the collation code, and try to avoid it. It's not even ISO 14651 anymore with so many custom glibc-specific changes layered on top. I looked at the first few commits in the glibc source that were responsible for the big 2.28 changes - there were a serious of quite a few commits and some were so large they wouldn't even load in the github API. Here's one such commit: https://github.com/bminor/glibc/commit/9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 It's reasonable to expect that Red Hat and Debian will keep things stable on one particular major, and to expect that every new major OS version will update to the latest collation algorithms and locale data for glibc. Another misunderstanding that seems to persist is that this only relates to exotic locales or that it's only the 2.28 version. My github repo is out-of-date (I know of more cases that I still need to publish) but the old data already demonstrates changes to the root/DUCET collation rules (evident in en_US without any tailoring) for glibc versions 2.13, 2.21 and 2.26 https://github.com/ardentperf/glibc-unicode-sorting/ If a PosgreSQL user is unlucky enough to have one of those unicode characters stored in a table, they can get broken indexes even if they only use the default US english locale, and without touching glibc 2.28 - and all you need is an index on a field where end users can type any string input. > It's pretty clear that glibc as a project doesn't take the issue very > seriously, because they see it as a problem of the GUI sorting a table > in a way that seems slightly suboptimal to scholars of a natural > language. I disagree that glibc maintainers are doing anything wrong. While the quality of glibc collations aren't great when compared with CLDR, I think the glibc maintainers have done versioning exactly right: they are clear about which patches are allowed to contain collation updates, and the OS distributions are able to ensure stability on major OS release. I haven't yet found a Red Hat minor release that changed glibc collation. -Jeremy -- http://about.me/jeremy_schneider
On Tue, Jun 7, 2022 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This is not the concern that I have. I agree that if we tell a user > that collation X changed behavior and he'd better reindex his indexes > that use collation X, but none of them actually contain any cases that > changed behavior, that's not a "false positive" --- that's "it's cheaper > to reindex than to try to identify whether there's a problem". What > I mean by "false positive" is telling every macOS user that they'd better > reindex everything every year, when in point of fact Apple changes those > collations almost never. That does seem like a meaningful distinction. I'm sorry if I misrepresented your position on this. We're talking about macOS here, which is hardly a paragon of lean software. I think that it's worth revisiting the assumption that the C standard library collations are the most useful set of collations, and we shouldn't presume to know better than the operating system. Couldn't individual packagers establish their own system for managing collations across multiple ICU versions, as I outlined up-thread? I think that it's okay (maybe unavoidable) that we keep "lib C collations are authoritative" as a generic assumption when Postgres is built from source. We can still have defacto standards that apply on all mainstream platforms when users install standard packages for production databases -- I don't see why we can't do both. Maybe the best place to solve this problem is at the level of each individual package ecosystem. There can be some outsourcing to package managers this way, without relying on the underlying OS, or lib C collations, or ICU in general. This scheme wouldn't technically be under our direct control, but would still be something that we could influence. We could have a back and forth conversation about what's not working in the field. -- Peter Geoghegan
On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > This idea does seem to persist. It's not as frequent as timezones, but > collation rules reflect local dialects and customs, and there are > changes quite regularly for a variety of reasons. A brief perusal of > CLDR changelogs and CLDR jiras can give some insight here: > Another misunderstanding that seems to persist is that this only relates > to exotic locales or that it's only the 2.28 version. I'm not defending the status quo, and I think that I'm better informed than most about the problems in this area. My point was that it hardly matters that we don't necessarily see outright corruption. This was based in part on a misunderstanding of Tom's point, though. > While the quality of glibc collations aren't great when compared with > CLDR, I think the glibc maintainers have done versioning exactly right: > they are clear about which patches are allowed to contain collation > updates, and the OS distributions are able to ensure stability on major > OS release. I haven't yet found a Red Hat minor release that changed > glibc collation. That might be true, but my impression from interacting with Carlos O'Donnell is that they pretty much don't take the concern about stability all that seriously. Which I think is reasonable, given his position! The fact that we are this sensitive to glibc collation versioning might be a wholly unique situation (unlike with ICU, which was built with that in mind). It might be that every other user of glibc collations sees this as fairly inconsequential, because they don't have to deal with persistent state that directly relies on the rules in various ways that are critically important. Even if glibc theoretically does a perfect job of versioning, I still think that their priorities are very much unlike our priorities, and that that should be a relevant consideration for us. -- Peter Geoghegan
On Tue, Jun 7, 2022 at 4:24 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > I haven't yet found a Red Hat minor release that changed > glibc collation. I feel like this is a thing that happens regularly enough that it's known to be a gotcha by many of my colleagues here at EDB. Perhaps that's all pure fiction, but I doubt it. People don't go around making up stories about things being broken so they can say bad things about Red Hat. They got told by customers that things are broken and then go try to figure out how that happened. -- Robert Haas EDB: http://www.enterprisedb.com
On 6/7/22 1:51 PM, Peter Geoghegan wrote: > On Tue, Jun 7, 2022 at 1:24 PM Jeremy Schneider > <schneider@ardentperf.com> wrote: >> This idea does seem to persist. It's not as frequent as timezones, but >> collation rules reflect local dialects and customs, and there are >> changes quite regularly for a variety of reasons. A brief perusal of >> CLDR changelogs and CLDR jiras can give some insight here: > >> Another misunderstanding that seems to persist is that this only relates >> to exotic locales or that it's only the 2.28 version. > > I'm not defending the status quo, and I think that I'm better informed > than most about the problems in this area. My point was that it hardly > matters that we don't necessarily see outright corruption. This was > based in part on a misunderstanding of Tom's point, though. I think I was guilty of the same misunderstanding - apologies Tom! Thanks Peter for calling that out explicitly. For my for my part, gut feeling is that MacOS major releases will be similar to any other OS major release, which may contain updates to collation algorithms and locales. ISTM like the same thing PG is looking for on other OS's to trigger the warning. But it might be good to get an official reference on MacOS, if someone knows where to find one? (I don't.) -Jeremy -- http://about.me/jeremy_schneider
On Tue, Jun 7, 2022 at 03:43:32PM -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Wed, Jun 8, 2022 at 3:58 AM Rod Taylor <rbt@rbt.ca> wrote: > >> Is this more involved than creating a list of all valid Unicode characters (~144 thousand), sorting them, then runningcrc32 over the sorted order to create the "version" for the library/collation pair? Far from free but few databasesuse more than a couple different collations. > > > Collation rules have multiple levels and all kinds of quirks, so that > > won't work. > > Yeah, and it's exactly at the level of quirks that things are likely > to change. Nobody's going to suddenly start sorting B before A. > They might, say, change their minds about where the digram "cz" > sorts relative to single letters, in languages where special rules > for that are a thing. > > The idea of fingerprinting a collation's behavior is interesting, > but I've got doubts about whether we can make a sufficiently thorough > fingerprint. Rather than trying to figure out if the collations changed, have we ever considered checking if index additions and lookups don't match the OS collation and reporting these errors somehow? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Tue, Jun 7, 2022 at 2:13 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > For my for my part, gut feeling is that MacOS major releases will be > similar to any other OS major release, which may contain updates to > collation algorithms and locales. ISTM like the same thing PG is looking > for on other OS's to trigger the warning. But it might be good to get an > official reference on MacOS, if someone knows where to find one? (I don't.) I just don't think that we should be relying on a huge entity like Apple or even glibc for this -- they don't share our priorities, and there is no reason for this to change. The advantage of ICU versioning is that it is just one library, that can coexist with others, including other versions of ICU. Imagine a world in which we support multiple ICU versions (for Debian packages, say), some of which are getting quite old. Maybe we can lobby for the platform to continue to support that old version of the library -- there ought to be options. Lobbying Debian to stick with an older version of glibc is another matter entirely. That has precisely zero chance of ever succeeding, for reasons that are quite understandable. Half the problem here is to detect breaking changes, but the other half is to not break anything in the first place. Or to give the user plenty of opportunity to transition incrementally, without needing to reindex everything at the same time. Obviously the only way that's possible is by supporting multiple versions of ICU at the same time, in the same database. This requires indirection that distinguishes between "physical and logical" collation versions, where the same nominal collation can have different implementations across multiple ICU versions. The rules for standards like BCP47 (the system that defines the name of an ICU/CLDR locale) are deliberately very tolerant of what they accept in order to ensure forwards and backwards compatibility in environments where there isn't just one ICU/CLDR version [1] (most environments in the world of distributed or web applications). So you can expect the BCP47 name of a collation to more or less work on any ICU version, perhaps with some loss of functionality (this is unavoidable when you downgrade ICU to a version that doesn't have whatever CLDR customization you might have relied on). It's very intentionally a "best effort" approach, because throwing a "locale not found" error message usually isn't helpful from the point of view of the end user. Note that this is a broader standard than ICU or CLDR or even Unicode. [1] https://www.ietf.org/rfc/rfc6067.txt -- Peter Geoghegan
On Wed, Jun 8, 2022 at 8:16 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jun 6, 2022 at 5:45 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Earlier I mentioned distinct "providers" but I take that back, that's > > too complicated. Reprising an old idea that comes up each time we > > talk about this, this time with some more straw-man detail: what about > > teaching our ICU support to understand "libicu18n.so.71:en" to mean > > that it should dlopen() that library and use its functions? Or some > > cleverer, shorter notation. Then it's the user's problem to make sure > > the right libraries are installed, and it'll fail if they're not. For > > example, on Debian bookworm right now you can install libicu63, > > libicu67, libicu71, though only the "current" -dev package, but which > > I'm sure we can cope with. You're at the mercy of the distro or > > add-on package repos to keep a lot of versions around, but that seems > > OK. > > Right. Postgres could link to multiple versions of ICU at the same > time. Right now it doesn't, and right now the ICU C symbol names that > we use are actually versioned (this isn't immediately apparent because > the C preprocessor makes it appear that ICU symbol names are generic). Yeah, it's possible to link against multiple versions in theory and that might be a way to do it if we were shipping our own N copies of ICU like DB2 does, but that's hard in practice for shared libraries on common distros (and vendoring or static linking of such libraries was said to be against many distros' rules, since it would be a nightmare if everyone did that, though I don't have a citation for that). I suspect it's better to use dlopen() to load them, because (1) I believe that the major distros only have -dev/-devel packages for the "current" version, even though they let you install the packages containing the .so files for multiple versions at the same time so that binaries linked against older versions keep working and (2) I think it'd be cool if users were free to find more ICU versions in add-on package repos and be able to use them to get a version that the packager of PostgreSQL didn't anticipate. > We could perhaps invent a new indirection that knows about > multiple ICU versions, each of which is an independent collation > provider, or maybe a related collation provider that gets used by > default on REINDEX. ICU is designed for this kind of thing. That > approach more or less puts packagers on the hook for managing > collation stability. But now long term collation stability is at least > feasible -- we at least have a coherent strategy. In the worst case > the community .deb and .rpm repos might continue to support an older > ICU version, or lobby for its continued support by the distro (while > actively discouraging its use in new databases). This isn't the same > thing as forking ICU. It's a compromise between that extreme, and > the current situation. Yeah, I've flip-flopped a couple of times on the question of whether ICU63 and ICU67 should be different collation providers, or individual collations should somehow specify the library they want to use (admittedly what I showed above with a raw library name is pretty ugly and some indirection scheme might be nice). It would be good to drill into the pros and cons of those two choices. As for getting sane defaults, I don't know if this is a good idea, but it's an idea: perhaps schemas and search paths could be used, you avoid having to include ugly version strings in the collation identifiers, and the search path effectively controls default when you don't want to be explicit (= most users)?
On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Yeah, it's possible to link against multiple versions in theory and > that might be a way to do it if we were shipping our own N copies of > ICU like DB2 does, but that's hard in practice for shared libraries on > common distros (and vendoring or static linking of such libraries was > said to be against many distros' rules, since it would be a nightmare > if everyone did that, though I don't have a citation for that). I'm not saying that it's going to be easy, but I can't see why it should be impossible. I use Debian unstable for most of my work. It supports multiple versions of LLVM/clang, not just one (though there is a virtual package with a default version, I believe). What's the difference, really? Packaging standards certainly matter, but they're not immutable laws of the universe. It seems reasonable to suppose that the people that define these standards would be willing to hear us out -- this is hardly a trifling matter, or something that only affects a small minority of *their* users. We don't need to support a huge number of versions on each OS -- just enough to make it feasible for everybody to avoid the need to ever reindex every index on a collatable type (maybe ICU versions that were the default for the last several major versions of the OS are available through special packages). We don't necessarily have to have a hard dependency on every supported version from the point of view of the package manager. And all of this would ultimately be the responsibility of each individual packager; they'd need to figure out how to make it work within the context of the platform that they're targeting. We'd facilitate that important work, but would defer to them on the final details. There could be a hands-off approach to the whole thing, so it wouldn't be a total departure from what we do today. > Yeah, I've flip-flopped a couple of times on the question of whether > ICU63 and ICU67 should be different collation providers, or > individual collations should somehow specify the library they want to > use (admittedly what I showed above with a raw library name is pretty > ugly and some indirection scheme might be nice). It would be good to > drill into the pros and cons of those two choices. I think that there are pretty good technical reasons why each ICU version is tied to a particular version of CLDR. Implementing CLDR correctly and efficiently is a rather difficult process, even if we ignore figuring out what natural language rules make sense. And so linking to multiple different ICU versions doesn't really seem like overkill to me. Or if it is then I can easily think of far better examples of software bloat. Defining "stable behavior for collations" as "uses exactly the same software artifact over time" is defensive (compared to always linking to one ICU version that does it all), but we have plenty that we need to defend against here. -- Peter Geoghegan
On Wed, Jun 8, 2022 at 10:59 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jun 7, 2022 at 3:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Yeah, it's possible to link against multiple versions in theory and > > that might be a way to do it if we were shipping our own N copies of > > ICU like DB2 does, but that's hard in practice for shared libraries on > > common distros (and vendoring or static linking of such libraries was > > said to be against many distros' rules, since it would be a nightmare > > if everyone did that, though I don't have a citation for that). > > I'm not saying that it's going to be easy, but I can't see why it > should be impossible. I use Debian unstable for most of my work. It > supports multiple versions of LLVM/clang, not just one (though there > is a virtual package with a default version, I believe). What's the > difference, really? The difference is that Debian has libllvm-{11,12,13,14}-dev packages, but it does *not* have multiple -dev packages for libicu, just a single libicu-dev which can be used to compile and link against their chosen current library version. They do have multiple packages for the actual .so and allow them to be installed concurrently. Therefore, you could install N .sos and dlopen() them, but you *can't* write a program that compiles and links against N versions at the same time using their packages (despite IBM's work to make that possible, perhaps for use in their own databases). > Packaging standards certainly matter, but they're not immutable laws > of the universe. It seems reasonable to suppose that the people that > define these standards would be willing to hear us out -- this is > hardly a trifling matter, or something that only affects a small > minority of *their* users. OK, yeah, I'm thinking within the confines of things we can do easily right now on existing systems as they are currently packaging software only by changing our code, not "tell Debian to change their packaging so we can compile and link against N versions". Supposing Debian maintainers (and all the others) agreed, there'd still something else in favour of dlopen(): wouldn't it be nice if the users were not limited by the versions that the packager of PostgreSQL decided to link against? What if someone has a good reason to want to use ICU versions that are older than Debian currently ships, that are easily available in add-on repos? > > Yeah, I've flip-flopped a couple of times on the question of whether > > ICU63 and ICU67 should be different collation providers, or > > individual collations should somehow specify the library they want to > > use (admittedly what I showed above with a raw library name is pretty > > ugly and some indirection scheme might be nice). It would be good to > > drill into the pros and cons of those two choices. > > I think that there are pretty good technical reasons why each ICU > version is tied to a particular version of CLDR. Implementing CLDR > correctly and efficiently is a rather difficult process, even if we > ignore figuring out what natural language rules make sense. And so > linking to multiple different ICU versions doesn't really seem like > overkill to me. Or if it is then I can easily think of far better > examples of software bloat. Defining "stable behavior for collations" > as "uses exactly the same software artifact over time" is defensive > (compared to always linking to one ICU version that does it all), but > we have plenty that we need to defend against here. I think we're not understanding each other here: I was talking about the technical choice of whether we'd model the multiple library versions in our catalogues as different "collprovider" values, or somehow encode them into the "collcollate" string, or something else. I'm with you, I'm already sold on the mult-library concept (and have been in several previous cycles of this recurring discussion), which is why I'm trying to move to discussing nuts and bolts and packaging and linking realities that apparently stopped any prototype from appearing last time around.
On Tue, Jun 7, 2022 at 4:29 PM Thomas Munro <thomas.munro@gmail.com> wrote: > The difference is that Debian has libllvm-{11,12,13,14}-dev packages, > but it does *not* have multiple -dev packages for libicu, just a > single libicu-dev which can be used to compile and link against their > chosen current library version. They do have multiple packages for > the actual .so and allow them to be installed concurrently. > Therefore, you could install N .sos and dlopen() them, but you *can't* > write a program that compiles and links against N versions at the same > time using their packages (despite IBM's work to make that possible, > perhaps for use in their own databases). I know that glibc has various facilities for versioning dynamic libraries, which includes ways to control symbol visibility. It's possible that IBM's work on ICU versioning didn't just build on a generic facility like that because that approach wasn't sufficiently portable, particularly with platforms like AIX. It's also possible that we won't have any of these same requirements, and can feasibly link against multiple ICU versions some other way, and ultimately achieve the same result -- multiple versions of ICU that can be used by Postgres at the same time, with long term stable collations across major OS and Postgres versions. I now understand that you agree with me on this basic and important point. Must have been a miscommunication. > > Packaging standards certainly matter, but they're not immutable laws > > of the universe. It seems reasonable to suppose that the people that > > define these standards would be willing to hear us out -- this is > > hardly a trifling matter, or something that only affects a small > > minority of *their* users. > > OK, yeah, I'm thinking within the confines of things we can do easily > right now on existing systems as they are currently packaging software > only by changing our code, not "tell Debian to change their packaging > so we can compile and link against N versions". There are lots of specifics here, and I'm certainly not an expert on packaging. IMV our approach doesn't necessarily need to use the same original canonical package, though. It just needs to provide a reasonably smooth experience for users that actually need to keep their old collations working on upgrade. Either way, the process needs to be something where all parties understand the concerns of each other. Of course Debian doesn't support linking against multiple versions of ICU right now; why would they? Is there any reason to think that even one person ever asked about it? Our interest in doing that will probably be totally unique from their point of view. Can we just ask somebody about it that has a deep understanding of these things? > Supposing Debian > maintainers (and all the others) agreed, there'd still something else > in favour of dlopen(): wouldn't it be nice if the users were not > limited by the versions that the packager of PostgreSQL decided to > link against? What if someone has a good reason to want to use ICU > versions that are older than Debian currently ships, that are easily > available in add-on repos? I don't consider the ability to support many versions of ICU for the sake of ICU features to be much of an advantage. I mostly just care about the simple, common case where a user upgrades and doesn't want to REINDEX immediately. You may well be right about dlopen(); I just don't know right now. > I think we're not understanding each other here: I was talking about > the technical choice of whether we'd model the multiple library > versions in our catalogues as different "collprovider" values, or > somehow encode them into the "collcollate" string, or something else. ISTM that there are two mostly-distinct questions here: 1. How do we link to multiple versions of ICU at the same time, in a way that is going to work smoothly on mainstream platforms? 2. What semantics around collations do we want for Postgres once we gain the ability to use multiple versions of ICU at the same time? For example, do we want to generalize the definition of a collation, so that it's associated with one particular ICU version and collation for the purposes of on-disk compatibility, but isn't necessarily tied to the same ICU version in other contexts, such as on a dump and restore? -- Peter Geoghegan
On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote: > ISTM that there are two mostly-distinct questions here: > > 1. How do we link to multiple versions of ICU at the same time, in a > way that is going to work smoothly on mainstream platforms? > > 2. What semantics around collations do we want for Postgres once we > gain the ability to use multiple versions of ICU at the same time? For > example, do we want to generalize the definition of a collation, so > that it's associated with one particular ICU version and collation for > the purposes of on-disk compatibility, but isn't necessarily tied to > the same ICU version in other contexts, such as on a dump and restore? Yeah. Well I couldn't resist doing some (very!) experimental hacking. See attached. The idea of putting a raw library name in there is just a straw-man, and I already found a major problem with it: I also need to get my hands on u_strToLower and friends for formatting.c, but those functions are in a different library that needs to be dlopen'd separately, so we need *two* names. That's not done in the attached patch, but at least this demonstrates some of the mechanics of a dlopen() based solution that can do the collating part... of course there are all kinds of problems apparent (security of loading arbitrary libraries, API stability, interaction with the "default" ICU that our binary is linked against, creation of initial set of collations in initdb, naming, upgrades, ...). Demo: $ sudo apt-get install libicu63 libicu67 postgres=# create schema icu63; CREATE SCHEMA postgres=# create schema icu67; CREATE SCHEMA postgres=# create collation icu63."en-US-x-icu" (provider = icu, locale = 'libicui18n.so.63:en-US'); CREATE COLLATION postgres=# create collation icu67."en-US-x-icu" (provider = icu, locale = 'libicui18n.so.67:en-US'); CREATE COLLATION postgres=# select collname, collnamespace::regnamespace, colliculocale, collversion from pg_collation where collname = 'en-US-x-icu'; collname | collnamespace | colliculocale | collversion -------------+---------------+------------------------+------------- en-US-x-icu | pg_catalog | en-US | 153.14 en-US-x-icu | icu63 | libicui18n.so.63:en-US | 153.88 en-US-x-icu | icu67 | libicui18n.so.67:en-US | 153.14 (3 rows) postgres=# select relname from pg_class order by relname collate icu63."en-US-x-icu" limit 2; relname --------------------------- _pg_foreign_data_wrappers _pg_foreign_servers (2 rows)
Attachment
On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > No, I quite agree that we have a problem. What I don't agree is that > issuing a lot of false-positive warnings is a solution. That will > just condition people to ignore the warnings, and then when their > platform really does change behavior, they're still screwed. If we > could *accurately* report collation behavioral changes, I'd be all > for that. I mean, how many false-positive warnings do you think we'll get? I would argue that if we put out something that's wrong half the time -- it tells you about all the real problems and an equal number of imaginary ones -- we'd be way ahead of where we are right now. If on the other hand we put out something that's wrong 99% of the time -- it tells you about all the real problems and ninety-nine times as many imaginary ones -- that's worse than useless. There can be some weasel wording in the language e.g. "WARNING: glibc has been updated, collation definitions may have changed". It's worth keeping in mind that the user doesn't necessarily have another source of information that is more accurate than what we're providing. If they REINDEX somewhat more often than is really necessary, that may be painful, but it can still be a lot better than having queries return wrong answers. If it's not, nobody's forcing them to issue that REINDEX command. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I mean by "false positive" is telling every macOS user that they'd better > reindex everything every year, when in point of fact Apple changes those > collations almost never. Do we actually know that to be true? Given how fast things seem to be getting added to Unicode, it wouldn't surprise me at all if they're updating their Unicode tables for new characters with some regularity, if nothing else, and that's a breaking change for us. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 7, 2022 at 3:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> No, I quite agree that we have a problem. What I don't agree is that >> issuing a lot of false-positive warnings is a solution. > I mean, how many false-positive warnings do you think we'll get? The proposed patch would result in a warning about every collation- sensitive index during every macOS major version upgrade, ie about once a year for most people. Seeing that Apple only actually touch their POSIX collations once a decade or so, that's way too far over on the crying-wolf end of the scale for me. We need something that has at least *some* connection to actual changes. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 7, 2022 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I mean by "false positive" is telling every macOS user that they'd better >> reindex everything every year, when in point of fact Apple changes those >> collations almost never. > Do we actually know that to be true? Given how fast things seem to be > getting added to Unicode, it wouldn't surprise me at all if they're > updating their Unicode tables for new characters with some regularity, > if nothing else, and that's a breaking change for us. Their POSIX collations seem to be legacy code that's entirely unrelated to any modern collation support; in particular the "UTF8" ones are that in name only. I'm sure that Apple are indeed updating the UTF8 data behind their proprietary i18n APIs, but the libc APIs are mostly getting benign neglect. Maybe the report that started this thread indicates that this is changing, but I'll believe that when I see it. regards, tom lane
Tom Lane wrote: > Yeah, and it's exactly at the level of quirks that things are likely > to change. Nobody's going to suddenly start sorting B before A. > They might, say, change their minds about where the digram "cz" > sorts relative to single letters, in languages where special rules > for that are a thing. Independently of these rules, all Unicode collations change frequently because each release of Unicode adds new characters. Any string that contains a code point that was previously unassigned is going to be sorted differently by all collations when that code point gets assigned to a character. Therefore the versions of all collations need to be bumped at every Unicode release. This is what ICU does. If the libc in macOS doesn't follow Unicode, that's not relevant to macOS, but let's assume an OS that tries to be up-to-date. If major OS upgrades happen every year or less frequently, each OS upgrade is likely to imply an upgrade of all the collations, since the interval between Unicode releases tends to be a year or less: https://www.unicode.org/history/publicationdates.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Wed, Jun 8, 2022 at 10:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Their POSIX collations seem to be legacy code that's entirely unrelated to > any modern collation support; in particular the "UTF8" ones are that in > name only. I'm sure that Apple are indeed updating the UTF8 data behind > their proprietary i18n APIs, but the libc APIs are mostly getting benign > neglect. I find that easy to believe. It's consistent with the overall picture of Apple not caring about the POSIX collations beyond the basic requirement for compatibility. ISTM that their totally inefficient approach to implementing strxfrm() is another example of the same thing. (The Apple strxfrm() produces huge low entropy binary strings, unlike the glibc version, which is pretty well optimized.) -- Peter Geoghegan
"Daniel Verite" <daniel@manitou-mail.org> writes: > Independently of these rules, all Unicode collations change frequently > because each release of Unicode adds new characters. Any string > that contains a code point that was previously unassigned is going > to be sorted differently by all collations when that code point gets > assigned to a character. > Therefore the versions of all collations need to be bumped at every > Unicode release. This is what ICU does. I'm very skeptical of this process as being a reason to push users to reindex everything in sight. If U+NNNN was not a thing last year, there's no reason to expect that it appears in anyone's existing data, and therefore the fact that it sorts differently this year is a poor excuse for sounding time-to-reindex alarm bells. I'm quite concerned that we are going to be training users to ignore collation-change warnings. They have got to be a lot better targeted than this, or we're just wasting everyone's time, including ours. regards, tom lane
On Wed, Jun 8, 2022 at 4:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm very skeptical of this process as being a reason to push users > to reindex everything in sight. If U+NNNN was not a thing last year, > there's no reason to expect that it appears in anyone's existing data, > and therefore the fact that it sorts differently this year is a poor > excuse for sounding time-to-reindex alarm bells. That seems completely wrong to me. It's not like a new character shows up and people wait to start using it until it makes its way into everyone's collation data. That is emphatically not what happens, I would say. What happens is that people upgrade their libc packages at one times and their postgres packages at another time, and it's unlikely that they have any idea which order they do or did those things. Meanwhile, people start using all the latest emojis. The idea that the average PostgreSQL user has any idea whether a certain emoji shows up in the data set for the first time before or after they install the libc version that knows about it seems absurd. We don't even know how to figure out which emojis the installed libc supports -- if we did, we could reject data that we don't know how to sort properly instead of ending up with corrupted indexes later. The user has no more ability to figure it out than we do, and even if they did, they probably wouldn't want to compare their stream of input data to their collate definitions using some process external to the database. -- Robert Haas EDB: http://www.enterprisedb.com
> On Jun 7, 2022, at 1:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > This is not the concern that I have. I agree that if we tell a user > that collation X changed behavior and he'd better reindex his indexes > that use collation X, but none of them actually contain any cases that > changed behavior, that's not a "false positive" --- that's "it's cheaper > to reindex than to try to identify whether there's a problem". I don't see this problem as limited to indexes, though I do understand why that might be the most common place for the problemto manifest itself. As a simple example, text[] constructed using array_agg over sorted data can be corrupted by a collation change, and reindexwon't fix it. If we extend the table-AM interface to allow query quals to be pushed down to the table-AM, we might develop table-AMs thatcare about sort order, too. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm sure that Apple are indeed updating the UTF8 data behind > their proprietary i18n APIs, but the libc APIs are mostly getting benign > neglect. As for how exactly they might be doing that, I don't know, but a bit of light googling tells me that a private, headerless, please-don't-call-me-directly copy of ICU arrived back in macOS 10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x started hiding system libraries completely (the linker is magic and pulls libraries from some parallel dimension, there is no /usr/lib/libSystem.B.dylib file on disk, and yet otool -L <your_favourite_executable> references it). It's a lovely client machine, but I don't know if anyone really runs meaningful database server stuff on macOS. I think if I did I'd be very keen to use ICU for everything directly, rather than trying to unpick any of that and talk to Apple's API... I think the how-to-support-multiple-ICUs subrant/subthread is a much more interesting topic. I have no idea if the dlopen() concept I mentioned is the right way forward, but FWIW the experimental patch I posted seems to work just fine on a Mac, using multiple ICU libraries installed by MacPorts, which might be useful to developers contemplating that stuff. [1] https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html
New emoji are getting added with some frequency, it’s a thing lately… New Unicode chars use existing but unassigned code points. All code points are able to be encoded, claimed or unclaimed. Someone on old glibc or ICU can still store the new characters. As long as there’s an input field. You wouldn’t believe somestuff I’ve seen people enter in the “name” field for web apps… 🙄 It’ll get some undefined or default sort behavior forunrecognized or unassigned code points. When the libs are updated, those new chars begin to sort correctly, which is a change and breaks indexes (and potentiallyother stuff). -Jeremy Sent from my TI-83 > On Jun 8, 2022, at 16:34, Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm sure that Apple are indeed updating the UTF8 data behind >> their proprietary i18n APIs, but the libc APIs are mostly getting benign >> neglect. > > As for how exactly they might be doing that, I don't know, but a bit > of light googling tells me that a private, headerless, > please-don't-call-me-directly copy of ICU arrived back in macOS > 10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x > started hiding system libraries completely (the linker is magic and > pulls libraries from some parallel dimension, there is no > /usr/lib/libSystem.B.dylib file on disk, and yet otool -L > <your_favourite_executable> references it). > > It's a lovely client machine, but I don't know if anyone really runs > meaningful database server stuff on macOS. I think if I did I'd be > very keen to use ICU for everything directly, rather than trying to > unpick any of that and talk to Apple's API... I think the > how-to-support-multiple-ICUs subrant/subthread is a much more > interesting topic. I have no idea if the dlopen() concept I mentioned > is the right way forward, but FWIW the experimental patch I posted > seems to work just fine on a Mac, using multiple ICU libraries > installed by MacPorts, which might be useful to developers > contemplating that stuff. > > [1] https://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html
> On Jun 8, 2022, at 03:19, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Jun 8, 2022 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote: >> ISTM that there are two mostly-distinct questions here: >> >> 1. How do we link to multiple versions of ICU at the same time, in a >> way that is going to work smoothly on mainstream platforms? >> > Yeah. Well I couldn't resist doing some (very!) experimental hacking. > See attached. Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old versionand replacing it with the new? Can it be done without downtime? Can it be done without modifying a running application?Avoiding “collate” clauses on SQL statements requires working behind the scenes with defaults and indexes andpartitions and constraints and everything else. I’m having a hard time coming up with a way this would be possible inpractice, with all the places collations can show up. Is the idea of “alter database” to change the default collation even realistic? I’m having a bit of trouble picturing what the end game is here -Jeremy Sent from my TI-83
On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old versionand replacing it with the new? They simply REINDEX, without changing anything. The details are still fuzzy, but at least that's what I was thinking of. This should be possible by generalizing the definition of a collation to recognize that different ICU versions can support the same collation. Of course we'd also have to remember which actual ICU version and specific "physical collation" was currently in use by each index. We'd also probably have to have some policy about which ICU version was the latest (or some suitably generalized version of that that applies to collation providers more generally). > Can it be done without downtime? Can it be done without modifying a running application? Clearly the only way that we can ever transition to a new "physical collation" is by reindexing using a newer ICU version. And clearly there is going to be a need to fully deprecate any legacy version of ICU on a long enough timeline. There is just no getting around that. The advantage of an approach along the lines that I've laid out is that everything can be done incrementally, possibly some time after an initial OS or Posgres upgrade, once everything has settled. Much much later, even. If the same new ICU version isn't available in your original/old environment (which is likely), you can avoid reindexing, and so reserve the option of backing out of a complex upgrade until very late in the process. You're going to have to do it eventually, but it can probably just be an afterthought. -- Peter Geoghegan
On Wed, Jun 8, 2022 at 10:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > They simply REINDEX, without changing anything. The details are still > fuzzy, but at least that's what I was thinking of. As I said before, BCP47 format tags are incredibly forgiving by design. So it should be reasonable to assume that anything that has worked in an earlier version of ICU will continue to work in a way that's at least as useful in a future version. See: https://www.postgresql.org/message-id/CAH2-Wz=ZrA5Yf55pKtdJb2pYCVN=2dh__VGR9arQqOHMqWgQPg@mail.gmail.com That's not strictly guaranteed, because sometimes countries cease to exist, and their ISO country codes eventually go away too. But that still tends to fail gracefully. It's mostly only relevant for things that are part of a locale, which is a broader concept than just collation. An application that did this and relied on ICU for localization might then find that the currency sign changed, but I'm not aware of any impact on locales. You can ask for total nonsense and mostly get reasonable behaviors, like Japanese as spoken in Iceland. Even some totally made up (or misspelled) country is accepted without complaint. -- Peter Geoghegan
> On Jun 8, 2022, at 22:40, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jun 8, 2022 at 10:24 PM Jeremy Schneider > <schneider@ardentperf.com> wrote: >> Even if PG supports two versions of ICU, how does someone actually go about removing every dependency on the old versionand replacing it with the new? > > They simply REINDEX, without changing anything. The details are still > fuzzy, but at least that's what I was thinking of. > >> Can it be done without downtime? Can it be done without modifying a running application? > > Clearly the only way that we can ever transition to a new "physical > collation" is by reindexing using a newer ICU version. And clearly > there is going to be a need to fully deprecate any legacy version of > ICU on a long enough timeline. There is just no getting around that. I’m probably just going to end up rehashing the old threads I haven’t read yet… One challenge with this approach is you have things like sort-merge joins that require the same collation across multipleobjects. So I think you’d need to keep all the old indexes around until you have new indexes available for all objectsin a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new versionson a query-by-query basis. May need an atomic database-wide cutover; running a DB with internally mixed collationversions doesn’t seem like a small challenge. It would require enough disk space for two copies of all indexes,and queries would change which indexes they use in a way that wouldn’t be immediately obvious to users or app dev.Suddenly switching to or from a differently-bloated index could result in confusing and sudden performance changes. Also there would still need to be a plan to address all the other non-index objects where collation is used, as has beenmentioned before. And given the current architecture, that final “alter database update default collation” command still seems awful risky,bug-prone and difficult to get correct. At least it seems that way to me. At a minimum, this is a very big project and it seems to me like it may be wise to get more end-to-end fleshing out of theplans before committing incremental pieces in core (which could end up being misguided if the plan doesn’t work as wellas assumed). Definitely doesn’t seem to me like anything that will happen in a year or two. And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a shorter timelinethan what’s realistic for inventing a new way for a relational database to offer transparent or online upgrades oflinguistic collation versions. Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution. Like 1% of usersactually need or care about having the latest technically correct local-language-based sorting, at a database level.MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough” collationhardcoded in the DB, same across all platforms, that never changes. The 1% of users who need true linguistic collation can probably deal with the trade-off of dump-and-load upgrades for theirICU indexes and databases for a few more years. -Jeremy Sent from my TI-83
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > I’m probably just going to end up rehashing the old threads I haven’t read yet… > > One challenge with this approach is you have things like sort-merge joins that require the same collation across multipleobjects. So I think you’d need to keep all the old indexes around until you have new indexes available for all objectsin a database, and somehow the planner would need to be smart enough to dynamically figure out old vs new versionson a query-by-query basis. I don't think that it would be fundamentally difficult to have the planner deal with collations at the level required to avoid incorrect query plans. I'm not suggesting that this is an easy project, or that the end result would be totally free of caveats, such as the issue with merge joins. I am only suggesting that something like this seems doable. There aren't that many distinct high level approaches that could possibly decouple upgrading Postgres/the OS from reindexing. This is one. > And my opinion is that the problems caused by depending on OS libraries for collation need to be addressed on a shortertimeline than what’s realistic for inventing a new way for a relational database to offer transparent or online upgradesof linguistic collation versions. But what does that really mean? You can use ICU collations as the default for the entire cluster now. Where do we still fall short? Do you mean that there is still a question of actively encouraging using ICU collations? I don't understand what you're arguing for. Literally everybody agrees that the current status quo is not good. That much seems settled to me. > Also I still think folks are overcomplicating this by focusing on linguistic collation as the solution. I don't think that's true; I think that everybody understands that being on the latest linguistic collation is only very rarely a compelling feature. The whole way that BCP47 tags are so forgiving is entirely consistent with that view of things. But what difference does it make? As long as you accept that any collation *might* need to be updated, or the default ICU version might change on OS upgrade, then you have to have some strategy for dealing with the transition. Not being on a very old obsolete version of ICU will eventually become a "compelling feature" in its own right. I believe that EDB adopted ICU many years ago, and stuck with one vendored version for quite a few years. And eventually being on a very old version of ICU became a real problem. -- Peter Geoghegan
On Thu, Jun 9, 2022 at 10:54 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > MySQL did the right thing here by doing what every other RDBMS did, and just making a simple “good-enough” collation hardcodedin the DB, same across all platforms, that never changes. That's not true. Both SQL Server and DB2 have some notion of collations that are versioned. Oracle may not, but then Oracle also handles collations by indexing strxfrm() blobs, with all of the obvious downsides that that entails (far larger indexes, issues with index-only scans). That seems like an excellent example of what not to do. -- Peter Geoghegan
Specifying the library name before the language-country code with a new separator (":") as you suggested below has somebenefits. Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel,etc.? For example, an alternate syntax might be: create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63'); Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible way (wemight work with the ICU developers to agree on what it should be), or that a version is just not the same kind of collationproperty as the other collation properties? (in the example above, I'm assuming that for provider = icu, we could translate '63' into 'libicui18n.so.63' automatically.) On 6/8/22, 6:22 AM, "Thomas Munro" <thomas.munro@gmail.com> wrote: <snip> postgres=# create collation icu63."en-US-x-icu" (provider = icu, locale = 'libicui18n.so.63:en-US'); CREATE COLLATION <snip>
On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote: > Specifying the library name before the language-country code with a new separator (":") as you suggested below has somebenefits. Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel,etc.? > For example, an alternate syntax might be: > > create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63'); Why would a user want to specify an ICU version in DDL? Wouldn't that break in the event of a dump and reload of the database, for example? It also strikes me as being inconsistent with the general philosophy for ICU and the broader BCP45 IETF standard, which is "interpret the locale string to the best of our ability, never throw an error". Your proposed syntax already "works" today! You just need to create a schema called icu63 -- then the command executes successfully (for certain values of successfully). I'm not arguing against the need for something like this. I'm just pointing out that there are good reasons to imagine that it would largely be an implementation detail, perhaps only used to unambiguously identify which specific ICU version and locale string relate to which on-disk relfilenode structure currently. -- Peter Geoghegan
On Fri, Jun 10, 2022 at 9:20 AM Finnerty, Jim <jfinnert@amazon.com> wrote: > Specifying the library name before the language-country code with a new separator (":") as you suggested below has somebenefits. One of the reasons for putting some representation of desired library into the colliculocale column (rather than, say, adding a new column pg_collation) is that I think we'd also want to be able to put that into daticulocale (for the database default collation, when using ICU). But really I just did that because it was easy... perhaps, both pg_collation and pg_database could gain a new column, and that would be a little more pleasing from a schema design point of view (1NF atomicity, and it's a sort of foreign key, or at least it would be if there were another catalog to list library versions...)? > Did you consider making the collation version just another collation attribute, such as colStrength, colCaseLevel, etc.? > For example, an alternate syntax might be: > > create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63'); Hmm, I hadn't considered that. (I wouldn't call it "col" version BTW, it's a library version, and we don't want to overload our terminology for collation version. We'd still be on the look out for collversion changes coming from a single library's minor version changing, for example an apt-get upgrade can replace the .63 files, which on most systems are symlinks to .63.1, .63.2 etc. ☠️) > Was the concern that ICU might redefine a new collation property with the same name in a different and incompatible way(we might work with the ICU developers to agree on what it should be), or that a version is just not the same kind ofcollation property as the other collation properties? Well my first impression is that we don't really own that namespace, and since we're using this to decide which library to route calls to, it seems nicer to put it at a "higher level" than those properties. So I'd prefer something like "63:en-US", or 63 in a new column. > (in the example above, I'm assuming that for provider = icu, we could translate '63' into 'libicui18n.so.63' automatically.) Yeah. My patch that jams a library name in there was just the fastest way I could think of to get something off the ground to test whether I could route calls to different libraries (yes!), though at one moment I thought it wasn't terrible. But aside from any aesthetic complaints about that way of doing it, it turns out not to be enough: we need to dlopen() two different libraries, because we also need some ctype-ish functions from this guy: $ nm -D -C /usr/lib/x86_64-linux-gnu/libicuuc.so.63.1 | grep u_strToUpper 00000000000d22c0 T u_strToUpper_63 I guess we probably want to just put "63" somewhere in pg_collation, as you say. But then, teaching PostgreSQL how to expand that to a name that is platform/packaging dependent seems bad. The variations would probably be minor; on a Mac it's .dylib, on AIX it may be .a, and the .63 convention may not be universal, I dunno, but some systems might need absolute paths (depending on ld.so.conf etc), but that's all stuff that I think an administrator should care about, not us. Perhaps there could be a new catalog table just for that. So far I have imagined there would still be one special ICU library linked at build time, which doesn't need to be dlopen'd, and works automatically without administrators having to declare it. So a system that has one linked-in library version 67, and then has two extras that have been added by an administrator running some new DDL commands might have: postgres=# select * from pg_icu_library order by version; version | libicuuc | libicui18n ---------+----------------+------------------ 58 | libicuuc.so.58 | libicui18n.so.58 63 | libicuuc.so.63 | libicui18n.so.63 67 | | (3 rows) Suppose you pg_upgrade to something that is linked against 71. Perhaps you'd need to tell it how to dlopen 67 before you can open any collations with that library, but once you've done that your collation-dependent partition constraints etc should all hold. I dunno, lots of problems to figure out here, including quite broad ones about various migration problems. I haven't understood what Peter G is suggesting about how upgrades might work, so I'll go and try to do that...
On Thu, Jun 9, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Suppose you pg_upgrade to something that is linked against 71. > Perhaps you'd need to tell it how to dlopen 67 before you can open any > collations with that library, but once you've done that your > collation-dependent partition constraints etc should all hold. I > dunno, lots of problems to figure out here, including quite broad ones > about various migration problems. I haven't understood what Peter G > is suggesting about how upgrades might work, so I'll go and try to do > that... I'm mostly just arguing for the idea that we should treat ICU versions as essentially interchangeable in terms of their high-level capabilities around collations and languages/scripts/whatever provided for by the underlying CLDR version -- tools like pg_dump shouldn't need to care about ICU versions per se. *ICU itself* should be versioned, rather than having multiple independent ICU collation providers. This should work as well as anything like this can ever be expected to work -- because internationalization is just hard. These remarks need to be interpreted in the context of how internationalization is *supposed* to work under standards like BCP47 (again, this is a broad RFC about internationalization, not really an ICU thing). Natural languages are inherently squishy, messy things. The "default ICU collations" that initdb puts in pg_collation are not really special to ICU -- we generate them through a quasi-arbitrary process that iterates through top-level locales, which results in a list that is a bit like what you get with libc collations. If you pg_upgrade, you might have leftover "default ICU collations" that wouldn't have been the default on a new initdb. It's inherently pretty chaotic (because humans aren't as predictable as computers), which is why BCP47 itself is so forgiving -- it literally has to be. Plus there really isn't much downside to being so lax; as Jeremy pretty much said already, the important thing is generally to have roughly the right idea -- which this fuzzy approach mostly manages to do. Let's not fight that. Let's leave the natural language stuff to the experts, by versioning a single collation provider (like ICU), and generalizing the definition of a collation along the same lines -- something that can be implemented using any available version of ICU (with a preference for the latest on REINDEX, perhaps). It might turn out that an older version does a slightly better job than a newer version (regressions cannot be ruled out), but ultimately that's not our problem. It can't be -- we're not the unicode consortium. It's theoretically up to the user to make sure they're happy with any behavioral changes under this scheme, perhaps by testing. They won't actually test very often, of course, but that shouldn't matter in practice. This is already what we advise for users that use advanced tailorings of custom ICU collations, such as a custom collation for "natural sorting", often used for things like alphanumeric invoice numbers. That might break if you downgrade ICU version, and maybe even if you upgrade ICU version. -- Peter Geoghegan
On Fri, Jun 10, 2022 at 10:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jun 9, 2022 at 2:20 PM Finnerty, Jim <jfinnert@amazon.com> wrote: > > For example, an alternate syntax might be: > > > > create collation icu63."en-US-x-icu" (provider = icu, locale = 'en-US@colVersion=63'); > > Why would a user want to specify an ICU version in DDL? Wouldn't that > break in the event of a dump and reload of the database, for example? > It also strikes me as being inconsistent with the general philosophy > for ICU and the broader BCP45 IETF standard, which is "interpret the > locale string to the best of our ability, never throw an error". > > Your proposed syntax already "works" today! You just need to create a > schema called icu63 -- then the command executes successfully (for > certain values of successfully). Jim was proposing the @colVersion=63 part, but the schema part came from my example upthread. That was from a real transcript, and I included that because the way I've been thinking of this so far has distinct collation OIDs for the "same" collation from different ICU libraries, and yet I want them to have the same collname. That is, I don't want (say) "en-US-x-icu63" and "en-US-x-icu71"... I thought it'd be nice to keep using "en-US-x-icu" as we do today, so if there are two of them they'd *have* to be in different schemas. That has the nice property that you can use the search_path to avoid mentioning it. But I'm not at all wedded to that idea, or any other ideas in this thread, just trying stuff out... However, since you mentioned that a simple REINDEX would get you from one library version to another, I think we're making some completely different assumptions somewhere along the line, and I don't get your idea yet. It sounds like you don't want two different collation OIDs in that case? The (vastly too) simplistic way I was thinking of it, if you have a column with an ICU 63 collation, to switch to ICU 67 you first do some DDL to add ICU 67 to your system and import 67's collations (creating new collation OIDs), and then eg ALTER TABLE foo ALTER COLUMN bar TYPE text COLLATE icu67."en-US-x-icu", which will rebuild your indexes. That's a big job, and doesn't address how you switch the database default collation. None of that is very satisfying, much more thought needed, but it falls out of the decision to have distinct icu63."en-US-x-icu" and icu67."en-US-x-icu". You seem to have some other idea in mind where the system only knows about one "en-US-x-icu", but somehow, somewhere else (where?), keeps track of which indexes were built with ICU 63 and which with ICU 67, which I don't yet grok. Or did I misunderstand?
On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > However, since you mentioned that a simple REINDEX would get you from > one library version to another, I think we're making some completely > different assumptions somewhere along the line, and I don't get your > idea yet. It sounds like you don't want two different collation OIDs > in that case? Not completely sure about the REINDEX behavior, but it's at least an example of the kind of thing that could be enabled. I'm proposing that pg_collation-wise collations have the most abstract possible definitions -- "logical collations", which are decoupled from "physical collations" that actually describe a particular ICU collator associated with a particular ICU version (all the information that keeps how the on-disk structure is organized for a given relfilenode straight). In other words, the definition of a collation is the user's own definition. To the user, it's pretty close to (maybe even exactly) a BCP47 string, now and forever. You can make arguments against the REINDEX behavior. And maybe those arguments will turn out to be good arguments. Assuming that they are, then the solution may just be to have a special option that will make the REINDEX use the most recent library. The important point is to make the abstraction as high level as possible from the point of view of users. > You seem to have some > other idea in mind where the system only knows about one > "en-US-x-icu", but somehow, somewhere else (where?), keeps track of > which indexes were built with ICU 63 and which with ICU 67, which I > don't yet grok. Or did I misunderstand? That's what I meant, yes -- you got it right. Another way to put it would be to go as far as we can in the direction of decoupling the concerns that we have as database people from the concerns of natural language experts. Let's not step on their toes, and let's avoid having our toes trampled on. -- Peter Geoghegan
Thanks for picking this up! > How can I see evidence of this? I'm comparing Debian, FreeBSD and > macOS 12.4 and when I run "LC_COLLATE=en_US.UTF-8 sort > /usr/share/dict/words" I get upper and lower case mixed together on > the other OSes, but on the Mac the upper case comes first, which is my > usual smoke test for "am I looking at binary sort order?" Perhaps I can shed some light on this matter: Apple's libc collations have always been a bit special in that concern, even for the non-UTF8 ones. Rooted in ancient FreeBSDthey "try to keep collating table backward compatible with ASCII" thus upper and lower cases characters are separated(There are exceptions like 'cs_CZ.ISO8859-2'). The latest public sources I can find are in adv_cmds-119 [1] whichbelongs to OSX 10.5 [2] - these correspond to the ones used in FreeBSD till v10 [3], whereby the timestamps rather pointits origin around FreeBSD 5. Further, there are only very few locales actually present on macOS (36 - none of it supportingUnicode) and these have not changed for a very long time (I verified that from OS X 10.6.8 till macOS 12.4 [4],exception is a 'de_DE-A.ISO8859-1' present only in macOS 10.15). What they do instead is symlinking [5] missing collations to similar ones even across encodings, often resulting in la_LN.US-ASCII('la_LN' seem to stand for a Latin meta language) being used which is exactly byte order [6]. These symlinkshave not changed [7] from OS X 10.6.8 till macOS 10.15.7. But in macOS 11 many of these symlinks changed their target.So did the popular 'en_US.UTF-8' from 'la_LN.US-ASCII' to 'la_LN.ISO8859-1' or 'de_DE.UTF-8' from 'la_LN.US-ASCII'to 'de_DE.ISO8859-1'. In effect, about half of the UTF-8 collations change from no collation to partial/brokencollation support. macOS 12 again shows no changes - tests for macOS 13 are outstanding. # tl:dr; With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and "en_US.UTF-8"but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8sort /usr/share/dict/words)" The upside is that we don't have to cope with the new characters added in every version of Unicode (although I have not examinedLC_CTYPE yet). best regards Tobias [1]: https://github.com/apple-oss-distributions/adv_cmds/tree/adv_cmds-119/usr-share-locale.tproj/colldef [2]: https://opensource.apple.com/releases/ [3]: https://github.com/freebsd/freebsd-src/tree/stable/10/share/colldef [4]: find /usr/share/locale/*/LC_COLLATE -type f -exec md5 {} \; [5]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/BSDmakefile [6]: https://github.com/apple-oss-distributions/adv_cmds/blob/adv_cmds-119/usr-share-locale.tproj/colldef/la_LN.US-ASCII.src [7]: find /usr/share/locale/*/LC_COLLATE -type l -exec stat -f "%N%SY" {} \;
On Fri, Jun 10, 2022 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jun 9, 2022 at 5:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > You seem to have some > > other idea in mind where the system only knows about one > > "en-US-x-icu", but somehow, somewhere else (where?), keeps track of > > which indexes were built with ICU 63 and which with ICU 67, which I > > don't yet grok. Or did I misunderstand? > > That's what I meant, yes -- you got it right. OK, I see now. I think if you design a system to record the library that each index (and constraint, ...) was built with, it'd surely finish up being at least conceptually something like the system Julien and I built and then reverted in ec483147. Except that it'd be a stronger form of that, because instead of just squawking when the version is not the latest/current version, it'd keep working but route collations to the older library for indexes that haven't been rebuilt yet. That sounds nice, but introduces subtle problems for the planner. For example, pathkeys that look compatible might not be, when merge-joining an ICU 63 index scan against an ICU 67 index scan. You could teach it about that, whereas with my distinct OID concept they would already be considered non-matching automatically.
On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote: > That sounds nice, but introduces subtle problems for the planner. For > example, pathkeys that look compatible might not be, when > merge-joining an ICU 63 index scan against an ICU 67 index scan. You > could teach it about that, whereas with my distinct OID concept they > would already be considered non-matching automatically. Right -- my proposal is likely to be more difficult to implement. Seems like it might be worth going to the trouble of teaching the planner about this difference, though. That exact issue seems like the true underlying problem to me: we have two sets of behaviors for a given collation, that are equivalent for some purposes (the user thinks of them as totally interchangeable), but not for other purposes (we can't expect old indexes to continue to work with a new physical collation for their logical collation). So directly tackling that seems natural to me. -- Peter Geoghegan
On Fri, Jun 10, 2022 at 1:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jun 9, 2022 at 5:59 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > That sounds nice, but introduces subtle problems for the planner. For > > example, pathkeys that look compatible might not be, when > > merge-joining an ICU 63 index scan against an ICU 67 index scan. You > > could teach it about that, whereas with my distinct OID concept they > > would already be considered non-matching automatically. > > Right -- my proposal is likely to be more difficult to implement. > Seems like it might be worth going to the trouble of teaching the > planner about this difference, though. Well I can report that the system from ec483147 was hellishly complicated, and not universally loved. Which isn't to say that there isn't a simple and loveable way to do it, waiting to be discovered, and I do think we could fix most of the problems with that work. It's just that I was rather thinking of this new line of attack as being a way to avoid the complications of identifying dependencies on moving things through complicated analysis of object graphs and AST, by instead attaching those slippery external things to the floor with a nail gun. That is, treating ICU 63 and ICU 67's collations as completely unrelated. I understand that that's not ideal from an end-user perspective, but maybe it's more realistically and robustly and simply implementable. Hmm.
Am 08.06.2022 um 16:16 schrieb Tom Lane <tgl@sss.pgh.pa.us>: > The proposed patch would result in a warning about every collation- > sensitive index during every macOS major version upgrade, ie about > once a year for most people. > We need something that has at least *some* connection to actual changes. In Postgres.app we introduced default collation versioning and warnings about possible mismatches from outside the actualserver. When the user runs initdb with the GUI wrapper, the OS version and a checksum of the LC_COLLATE file of theused default collation is stored as meta-data. This allows to display a reindex warning on startup if the hash changesor we hardcode a known incompatible OS change. Having collversion support on macOS within postgres would leverage the existing infrastructure for version change warningsand enables support for multiple collations. But I agree, we need something more specific than the major OS versionhere. Lacking any collation version information from the provider, a checksum on the binary LC_COLLATE file is thebest I can come up with. Best regards, Tobias
On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Well I can report that the system from ec483147 was hellishly > complicated, and not universally loved. Which isn't to say that there > isn't a simple and loveable way to do it, waiting to be discovered, > and I do think we could fix most of the problems with that work. I admit that I don't have much idea of how difficult it would be to make it all work. I'm definitely not claiming that it's easy. > I understand that that's not ideal from an > end-user perspective, but maybe it's more realistically and robustly > and simply implementable. Hmm. That may be a decisive reason to go with your proposal. I really don't know. -- Peter Geoghegan
On Fri, Jun 10, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jun 9, 2022 at 6:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Well I can report that the system from ec483147 was hellishly > > complicated, and not universally loved. Which isn't to say that there > > isn't a simple and loveable way to do it, waiting to be discovered, > > and I do think we could fix most of the problems with that work. > > I admit that I don't have much idea of how difficult it would be to > make it all work. I'm definitely not claiming that it's easy. Hrrm... perhaps my memory of ec483147 is confusing me. I think I'm starting to come around to your idea a bit more now. Let me sketch out some more details here and see where this goes. I *was* thinking that you'd have to find all references to collations through static analysis, as we did in that version tracking project. But perhaps for this you only need to record one ICU library version for the whole index at build time, without any analysis at all, and it would be used for any and all ICU collations that are reached while evaluating anything to do with that index (index navigation, but also eg WHERE clause for partial index, etc). That would change to the "current" value when you REINDEX. Perhaps that could be modeled with a pg_depend row pointing to a pg_icu_library row, which you'd probably need anyway, to prevent a registered ICU library that is needed for a live index from being dropped. (That's assuming that the pg_icu_library catalogue concept has legs... well if we're going with dlopen(), we'll need *somewhere* to store the shared object paths. Perhaps it's not a given that we really want paths in a table... I guess it might prevent certain cross-OS streaming rep scenarios, but mostly that'd be solvable with symlinks...) One problem is that to drop an old pg_icu_library row, you'd have to go and REINDEX everything, even indexes that don't really use collations! If you want to prove that an index doesn't use collations, you're back in ec483147 territory. Perhaps we don't care about that and we're happy to let useless dependencies on pg_icu_library rows accumulate, or to require useless work to be able to drop them. I'm not sure how we'd know what the "current" library version is. The highest numbered one currently in that pg_icu_library catalogue I sketched? So if I do whatever new DDL we invent to tell the system about a new ICU library, and it's got a higher number than any others, new indexes start using it but old ones keep using whatever they're using. Maybe with some way for users to override it, so users who really want to use an older one when creating a new index can say so. I suppose it would be the same for constraints. For those, considering that they need to be rechecked, the only way to change ICU version would be to drop the constraint and recreate it. Same goes for range partitioned tables, right? It'd keep using the old ICU library until you drop the p table and create a new one, at which point you're using the new current ICU library and it'll recheck all your partitions against the constraints when you add them. (Those constraints are much simpler constants, so for those we could prove no use of ICU without the general ec483147 beast.) I think these things would have to survive pg_upgrade, but would be lost on dump/restore. There's still the pathkey problem to solve, and maybe some more problems like that hiding somewhere. I'm not sold on any particular plan, but working through some examples helped me see your idea better... I may try to code that up in a minimal way so we can kick the tyres...
On Fri, Jun 10, 2022 at 12:48 PM Tobias Bussmann <t.bussmann@gmx.net> wrote: > Perhaps I can shed some light on this matter: Hi Tobias, Oh, thanks for your answers. Definitely a few bits of interesting archeology I was not aware of. > Apple's libc collations have always been a bit special in that concern, even for the non-UTF8 ones. Rooted in ancient FreeBSDthey "try to keep collating table backward compatible with ASCII" thus upper and lower cases characters are separated(There are exceptions like 'cs_CZ.ISO8859-2'). Wow. I see that I can sort the English dictionary the way most people expect by pretending it's Czech. What a mess! > With your smoke test "sort /usr/share/dict/words" on a modern macOS you won't see a difference between "C" and "en_US.UTF-8"but with "( echo '5£'; echo '£5' ) | LC_COLLATE=en_US.UTF-8 sort" you can produce a difference against "( echo'5£'; echo '£5' ) | LC_COLLATE=C sort". Or test with "diff -q <(LC_COLLATE=C sort /usr/share/dict/words) <(LC_COLLATE=es_ES.UTF-8sort /usr/share/dict/words)" I see, so it does *something*, just not what anybody wants.
On Fri, Jun 10, 2022 at 4:30 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm not sold on any particular plan, but working through some examples > helped me see your idea better... I may try to code that up in a > minimal way so we can kick the tyres... I did a bit of hacking on that idea. The goal was to stamp each index with an ICU major version (not sure where, not done in the attached), and if that doesn't match the library we're linked against, we'd try to dlopen() libraries via symlinks with known name formats under PGDATA/pg_icu_lib, which an administrator would have to create. That seemed a bit simpler than dealing with new catalogs for now... See attached unfinished patch, which implements some of that. It has a single collation for en-US-x-icu, and routes calls to different libraries depending on dynamic scope (which in cold hard reality translates into a nasty global variable "current_icu_library"). The idea was that it would normally point to the library we're linked against, but whenever computing anything related to an index stamped with ICU 63, we'd do pg_icu_activate_major_version(63), and afterwards undo that. Performance concerns aside, that now seems a bit too ugly and fragile to me, and I gave up. How could we convince ourselves that we'd set the active ICU library correctly in all the required dynamic scopes, but not leaked it into any other scopes? Does that even make sense? But if not done like that, how else could we do it? Better ideas/code welcome. Executive summary of experiments so far: the "distinct collations" concept is quite simple and robust, but exposes all the versions to users and probably makes it really hard to upgrade (details not worked out), while the "time travelling collations" concept is nice for users but hard to pin down and prove correctness for since it seems to require dynamic scoping/global state changes affecting code in far away places.
Attachment
On Fri, Jun 10, 2022 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Executive summary of experiments so far: the "distinct collations" > concept is quite simple and robust, but exposes all the versions to > users and probably makes it really hard to upgrade (details not worked > out), while the "time travelling collations" concept is nice for users > but hard to pin down and prove correctness for since it seems to > require dynamic scoping/global state changes affecting code in far > away places. It didn't really occur to me until now that the pg_dump problems that come with the approach you outlined ("distinct collations") are likely to be total blockers, and not just disadvantages. It's not just ICU that prefers approximately correct behavior over throwing an "unknown collation" error -- the same could be said for pg_dump itself. After all, pg_dump doesn't care about collation versions -- except when run in binary mode, for pg_upgrade, where it must satisfy the requirements of pg_upgrade. Even today we could be restoring to a server with an older ICU version, where in general we might not get exactly the behavior the user expects (though usually only when they've decided to use advanced features like custom tailorings). So pg_dump already deliberately disregards the ICU version, for essentially the same reasons that I argued were good reasons upthread (when talking about a multi-ICU-version Postgres via "time travelling collations"). Some more thoughts on "time travelling collations": Doing a version switch in one atomic operation (like a special REINDEX) isn't going to be practical. We need to be prepared for cases where a database has a mix of indexes with old and new physical collations. We certainly cannot allow queries to give wrong answers...but I tend to doubt that (say) making merge joins work with two indexes with different physical collations (though matching logical collations) actually makes much sense. Maybe we can cut scope in a pragmatic way instead. The special REINDEX (or whatever) won't work as an atomic operation...but that doesn't mean that the system as a whole will have a mix of old and new physical collations forever, or even for very long. So while everything still has to work correctly, mediocre performance with certain kinds of plan shapes might be okay. As you kind of said yourself yesterday, "time travelling collations" could naturally have an invariant that worked at the index/constraint level (not the collation level): any given index needs to use only one underlying ICU version at the same time, no matter what. The invariant could perhaps be leveraged in the planner -- perhaps we start out with a working assumption that *every* index is on the newer ICU version (when at least one index is on the lastest and "now current" version), and work backwards by excluding indexes that still have old physical collations. Defining the problem as a problem with old indexes/constraints only seems like it might make things a lot easier. -- Peter Geoghegan
On Thu, Jun 9, 2022 at 9:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Perhaps that could be modeled with a pg_depend row pointing to a > pg_icu_library row, which you'd probably need anyway, to prevent a > registered ICU library that is needed for a live index from being > dropped. (That's assuming that the pg_icu_library catalogue concept > has legs... well if we're going with dlopen(), we'll need *somewhere* > to store the shared object paths. Perhaps it's not a given that we > really want paths in a table... I guess it might prevent certain > cross-OS streaming rep scenarios, but mostly that'd be solvable with > symlinks...) Do we even need to store a version for indexes most of the time if we're versioning ICU itself, as part of the "time travelling collations" design? For that matter, do we even need to version collations directly anymore? I'm pretty sure that the value of pg_collation.collversion is always the same in practice, or has a lot of redundancy. Because mostly it's just an ICU version. This is what I see on my system, at least: pg@regression:5432 [53302]=# select count(*), collversion from pg_collation where collprovider = 'icu' group by 2; count │ collversion ───────┼───────────── 329 │ 153.112.41 471 │ 153.112 (2 rows) (Not sure why there are two different distinct collversion values offhand, but generally looks like collversion isn't terribly meaningful at the level of individual pg_collation entries.) If indexes and constraints with old physical collations are defined as being the exception to the general rule (the rule meaning "every index uses the current ICU version for the database as a whole"), and if those indexes/constraints are enumerated and stored (in a new system catalog) when a switchover of the database's ICU version is first initialized, then there might not be any meaningful dependency to speak of. Not for indexes, at least. The *database as a whole* is dependent on the current version of ICU -- it's not any one index. Very occasionally the database will also be dependent on a single older ICU version that we're still transitioning away from. There is a "switch-a-roo" going on, but not really at the level of indexes -- it's a very specialized thing, that works at the level of the whole database, and involves exactly 2 ICU versions. You should probably be able to back out of it once it begins, but mostly it's an inflexible process that just does what we need it to do. Does something like that seem sensible to you? -- Peter Geoghegan
On Sat, Jun 11, 2022 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > The special REINDEX (or whatever) won't work as an atomic > operation...but that doesn't mean that the system as a whole will have > a mix of old and new physical collations forever, or even for very > long. So while everything still has to work correctly, mediocre > performance with certain kinds of plan shapes might be okay. Yeah. And if you don't like the negative effects of a system in transition, you could also create new otherwise identical indexes, and then only drop the old ones once that's done, and add enough brains to keep everything working nicely on the old indexes until enough of the new indexes exist. Or something. I was thinking that could also be true for the "distinct collations" concept, only with different problems to solve... > As you kind of said yourself yesterday, "time travelling collations" > could naturally have an invariant that worked at the index/constraint > level (not the collation level): any given index needs to use only one > underlying ICU version at the same time, no matter what. The invariant > could perhaps be leveraged in the planner -- perhaps we start out with > a working assumption that *every* index is on the newer ICU version > (when at least one index is on the lastest and "now current" version), > and work backwards by excluding indexes that still have old physical > collations. Defining the problem as a problem with old > indexes/constraints only seems like it might make things a lot easier. Yes, that (posited) invariant was an enabling realisation for the (unfinished, but IMHO useful to grok) v2 patch. The disabling realisation that stopped me from finishing it was that I doubt my ability to find all the right places to wrap with "pg_icu_activate_major_version(X)" and "pg_icu_activate_major_version(-1)", and thus the whole approach. Do you know where to put all the ICU version switching regions, and how to verify that they cover exactly all the right code, but don't leak into any of the wrong code, and do you know where to store/retrieve X? It feels... flimsy to me, but how else could you make collations behave differently when being called on behalf of some index rather than some other thing, given no other context? Explicitly passing a pg_icu_library all over the place also sounds non-fun. I'm also suspicious that there are more subtle hazards like pathkeys lurking in the shadows. We go to great effort to recognise matching and non-matching collations by OID alone, which is why my first attempt was "distinct [OIDs]", so that'd keep working. I wondered if DB2's support looked more like "time travel" or "distinct". Based only on a quick glance at their manual[1], it looks a bit like they have "I don't care" collations which are subject to weirdness on upgrade when they change underneath your feet, and then "distinct" collations which have an explicit prefix to pin down the version (indirectly via CLDR version) and route to a specific library (N copies of ICU that ship with it), with a note recommending the latter for indexes and constraints. So I'd guess you'll stay on the old versions forever until you explicitly migrate data to a new collation. [1] https://www.ibm.com/docs/en/db2/11.1?topic=support-locale-names-sql-xquery
On Sat, Jun 11, 2022 at 3:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Do we even need to store a version for indexes most of the time if > we're versioning ICU itself, as part of the "time travelling > collations" design? For that matter, do we even need to version > collations directly anymore? They're still useful for non-ICU collations (for example FreeBSD and Windows can tell you about version changes based on open standards), and they're *maybe* still useful for ICU, considering that there are minor version upgrades, though I hope that would never actually detect a change if we built a multi-version system like what we are discussing here. Certainly they don't make sense in the current catalog layout with TT collations, though, there's only one attribute to cover N libraries (though the reverted version tracking thing would handle it just fine, because that moved it into a per-index location). I mention minor upgrade as a topic to poke at because the popular Linux distros only allow major ICU versions to be installed concurrently, but minor versions are also released from time to time and replace the libraries (well, the .68 library is a symlink to .68.1, and then changes to .68.2, following typical conventions, but the packages don't let you have .68.1 and .68.2 at the same time). To pick a random example, ICU upgraded 68.1 -> 68.2 at one point, which a bit of googling tells me included CLDR 38 -> CLDR 38.1. It looks like they tweaked a few super minor things. Could such a change affect the values that ucol_getVersion() reports? This came up in the last round of this stuff with Doole[1], but we didn't dig further and I still don't know what to think about it. [1] https://www.postgresql.org/message-id/CADE5jYJTnYaTNXMFKOK-0p44%2BDm5LMcRcJ5kVi1MVHomb2QTkQ%40mail.gmail.com
On Fri, Jun 10, 2022 at 8:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I'm also suspicious that there are more subtle hazards like pathkeys > lurking in the shadows. We go to great effort to recognise matching > and non-matching collations by OID alone, which is why my first > attempt was "distinct [OIDs]", so that'd keep working. It's definitely possible that we won't be able to find a workable solution that deals with "time travel collations" sensibly from the planner's perspective. It's certainly not a neat adjunct to what we have. I think that it *might* be possible to find a way to make it work that is suboptimal, but works. Without being overly clever. The DB2 docs say "use the CLDR version prefix to avoid unexpected changes in behavior when upgrading to future releases of the Db2 database". But if you don't do that, and get a change in behavior, then surely any related indexes must have been rebuilt too. The interesting part may be what that upgrade looks like in detail. -- Peter Geoghegan
On Fri, Jun 10, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote: > They're still useful for non-ICU collations (for example FreeBSD and > Windows can tell you about version changes based on open standards), > and they're *maybe* still useful for ICU, considering that there > are minor version upgrades, though I hope that would never actually > detect a change if we built a multi-version system like what we are > discussing here. Right. I was mostly just asking this as a rhetorical question. What about "time travel collations", but without the time travel part? That is, what about supporting multiple ICU versions per cluster, but not per database? So you could upgrade the OS and Postgres, using standard packages that typically just use the latest ICU version -- typically, but not always. If you happen to have been on an older version of ICU on upgrade, then that version of ICU will still work at the level of a whole database -- your database. Maybe you can create new databases with old and new ICU versions if you want to. That obviously runs into the problem of needing to eventually do a dump and reload -- but I suppose that "eventually" could be a very long time. At least the OS package doesn't declare one version of ICU the blessed version, now and forever, effectively vendoring ICU in a backdoor fashion. At least old databases have significant runway, while at the same time new databases that want to use the same standard Postgres package aren't forced to use the same ancient ICU version. -- Peter Geoghegan
On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > What about "time travel collations", but without the time travel part? > That is, what about supporting multiple ICU versions per cluster, but > not per database? So you could upgrade the OS and Postgres, using > standard packages that typically just use the latest ICU version -- > typically, but not always. If you happen to have been on an older > version of ICU on upgrade, then that version of ICU will still work at > the level of a whole database -- your database. Maybe you can create > new databases with old and new ICU versions if you want to. > > That obviously runs into the problem of needing to eventually do a > dump and reload -- but I suppose that "eventually" could be a very > long time. At least the OS package doesn't declare one version of ICU > the blessed version, now and forever, effectively vendoring ICU in a > backdoor fashion. At least old databases have significant runway, > while at the same time new databases that want to use the same > standard Postgres package aren't forced to use the same ancient ICU > version. Hmm. I think that's effectively what you'd get using my "distinct collation" patch (v1, or this much better v3, attached), if you put version prefixes in colliculocale, and updated them in the template database after an OS upgrade to affect new databases. I realise you probably mean something a little more automatic... I think "pinned forever" ICU versions would be useful, because I think there are very few expert users who want fine generalised control over version changes, but almost all other users don't care at all about any of this stuff -- as long as their indexes keep indexing and their constraints keep constraining. So I think you can make a lot of people happy by ignoring the complexities of upgrades and providing a way to nail the version down for the lifetime of the database. Also, it's not *impossible* to move to a later ICU, it's just a bit tricky; the key point is that it's under your control if you want to do that, independently of an OS upgrade, as you said. Based on my reading of that DB2 manual page, I reckon my v3 "distinct collation" patch is about as good as what they have. If you don't choose to use prefixes then later OS upgrades (ie upgrades that change the version of ICU that PostgreSQL is linked against) might corrupt your indexes and constraints -- I think that's what they're saying -- though at least we'll try to warn about that with our weak warning system. If you do choose to use prefixes you'll be stuck on that ICU version forever, even across updates that cause PostgreSQL to be linked to future releases of ICU, unless you're prepared to do a whole bunch of careful upgrading work (perhaps with some future tooling to help with that). Attached is a much more fleshed out version of the "distinct collation" patch (to be clear: v3 descends from v1, while v2 was at attempt at the timelord approach). Main changes: 1. I now also route strTo{Upper,Lower,Title} to the right version of ICU. That requires dlopen'ing a second library. 2. You create distinct collations with optional ICU major version prefixes, along the lines of what Jim was suggesting. For example (and I'm not actually proposing -x-icu67 suffixes, just avoiding a collision in my example): postgres=# create collation "en-x-icu67" (provider = icu , locale = '67:en'); CREATE COLLATION postgres=# create collation "en-x-icu63" (provider = icu , locale = '63:en'); CREATE COLLATION 3. For versions other than the one we are linked against, it tries to open libraries with typical names inside $libdir. An administrator could drop symlinks in there like so: $ ln -s /usr/lib/x86_64-linux-gnu/libicui18n.so.63 ~/install/lib/postgresql/ $ ln -s /usr/lib/x86_64-linux-gnu/libicuuc.so.63 ~/install/lib/postgresql/ What I like about this $libdir scheme is that I imagine that we could ask our friends in the packaging teams to create packages for that. Users would then think of them in much the same way as extensions. You'd just type: $ sudo apt-get install postgresql-16-icu71 Happy to keep trying to figure out the competing and rather more ambitious TT version too (that I sketched some bits of in v2), but I'm a whole lot fuzzier on how that can work and kinda stuck on the problems I raised.
Attachment
On Sun, Jun 12, 2022 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Jun 11, 2022 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > > What about "time travel collations", but without the time travel part? > > That is, what about supporting multiple ICU versions per cluster, but > > not per database? So you could upgrade the OS and Postgres, using > > standard packages that typically just use the latest ICU version -- > > typically, but not always. If you happen to have been on an older > > version of ICU on upgrade, then that version of ICU will still work at > > the level of a whole database -- your database. Maybe you can create > > new databases with old and new ICU versions if you want to. > > > > That obviously runs into the problem of needing to eventually do a > > dump and reload -- but I suppose that "eventually" could be a very > > long time. At least the OS package doesn't declare one version of ICU > > the blessed version, now and forever, effectively vendoring ICU in a > > backdoor fashion. At least old databases have significant runway, > > while at the same time new databases that want to use the same > > standard Postgres package aren't forced to use the same ancient ICU > > version. > > Hmm. I think that's effectively what you'd get using my "distinct > collation" patch (v1, or this much better v3, attached), if you put > version prefixes in colliculocale, and updated them in the template > database after an OS upgrade to affect new databases. I realise you > probably mean something a little more automatic... Thinking some more about what you said above: really, most people only care about the default collation. I'm not yet sure what I think initdb should put into pg_collation when importing the initial set of collation objects in the "distinct" world (perhaps an un-prefixed and a prefixed variant of each, with names ending -x-icu and -x-icu63?), but as for the default collation, I should point out that the "distinct" patch already gives you a nailed-to-the-ground database approximately as you described above if you just do something like this: postgres=# create database db2 locale_provider = icu icu_locale = '67:en' template = template0 ...; Small bugfix attached (v3 was accidentally calling uiter_setUTF8() and u_errorName() directly in a couple of places).
Attachment
Hey Jeremy, On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider <schneider@ardentperf.com> wrote: > Thomas - thanks for the link back to one of the threads. I spent some time reading through that and it’s a lot of material;I haven’t read the whole thread yet. If you have some others that would also be particularly good background, letme know. I’m doing a chunk of this in my spare time at the moment, but I do want to keep getting more up to speed. I waspulled into a bunch of various things related to PostgreSQL and ICU and collation and OS’s over the past couple years,so I learned a lot from on-the-ground experience and I am interested in trying to get a little more involved in theconversation here. There were more threads, but they mostly say the same things, hence my current attempt to move from bloviation to trying out the ideas with actual code :-D > Personally, I really do think there should at least be an *option* to tell the DB to fully error rather than just warnon version mismatch. Correctness matters to many users, and being able to *trust* string comparisons are correct is prettydamn fundamental all throughout a database. It really doesn’t get any more basic and the potential for bad things tohappen is pretty astronomical, if you can’t trust those. I understand the consternation about dealing with upgrades oflarge & busy databases, but I’m still surprised that the community consensus arrived at the present behavior, and I havea lot of reading to do, to really understand how that happened and where the dialogue is today. Given that the only thing you could do about it is REINDEX, and yet we don't even know which indexes needed to be REINDEXed (the problem Julien and I tried to address, but so far without success), it seemed highly premature to convert the warning to an error. I don't think the community consensus is that we have arrived somewhere, it's more like we're in transit, possibly without a map. For example 15 gains ICU support for the default collation (= how most people consume collations), which changes things, and will surely lead to more people thinking about this problem space. > Multiple versions of ICU sounds nice for users who need real linguistic collation (like what Oracle and DB2 offer), butI still feel like there needs to be a super simple basic “pseudo-linguistic” collation baked in, that’s “good enough”for 99% of users and that is guaranteed to be the same everywhere on every platform and just won’t ever change. Ithink glibc needs to be phased out somehow. At a minimum, not the default for new users… to stop the bleeding. If MySQLwasn’t GPL then I’d say to just copy their collations. I’d be reluctant to spend too much time on a POC now though,it feels like my idea is the outlier and the general PG hacker consensus would be to reject this idea. (But maybeI’m wrong?) Hmm. Well I personally wouldn't try to write a collator any more willingly than I'd try to write a new cryptographic algorithm, just not my bag. We don't want to handle complaints about our sort order (we already bat away complaints about glibc's, and I heard an account from an OS vendor about the non-stop contradictory crank complaints about sort order they used to receive before they gave up and adopted ICU). ICU really is quite authoritative here. If you mean that you don't even want to have to specify a language like "en", then note that you don't have to: ICU has a "root" collator which you can request with an empty string (all other collators apply cultural tweaks on top of that). Like everything else, the root collator has changed over time, though. With my "distinct" experimental patch (v4), you could set your database default collation to a specific ICU major version's root collator like so: create database ... locale_provider = icu icu_locale = '71:' template = template0 That'll keep working, even across pg_upgrades to some version of PostgreSQL far in the future that is linked against ICU 100, by dlopen'ing the .71 libraries, for as long as you can get your hands on a libicu71 package or otherwise build your own, and it'll error out if it can't open that library, which may be the hard error you were looking for. If there's an API change in ICU we'll have to make some changes, but that's already true. Review/testing/flames/rants/better ideas welcome. > Anyway, again, apologies for my pants-on-fire email last week. I hope I can enjoy a few beers someday - or coffee for thenon-drinkers - with a few other PG collation nerds (which I never set out to be, but it may have befallen me <g>). +1
On Thu, Jun 9, 2022 at 11:33 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Jun 9, 2022 at 5:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'm sure that Apple are indeed updating the UTF8 data behind > > their proprietary i18n APIs, but the libc APIs are mostly getting benign > > neglect. > > As for how exactly they might be doing that, I don't know, but a bit > of light googling tells me that a private, headerless, > please-don't-call-me-directly copy of ICU arrived back in macOS > 10.3[1]. I don't see it on my 12.4 system, but I also know that 12.x > started hiding system libraries completely (the linker is magic and > pulls libraries from some parallel dimension, there is no > /usr/lib/libSystem.B.dylib file on disk, and yet otool -L > <your_favourite_executable> references it). The other thread about a macOS linking problem nerd-sniped me back into here to find out how to see breadcrumbs between hidden libraries on this super weird UNIX™ and confirm that they are indeed still shipping a private ICU for use by their Core Foundation stuff that's used by fancy ObjC/Swift/... etc GUI apps. The following command was an interesting discovery for me because otool -L can't see any of the new kind of ghost libraries: % dyld_info -dependents /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation Though I can't get my hands on the hidden ICU library itself to disassemble (without installing weird extra tools, apparently [1]), that at least revealed its name, which I could then dlopen out of curiosity. It seems they jammed all the ICU sub-libraries into one, and configured it with --disable-renaming so it doesn't have major version suffixes on symbol names. It'd clearly be a terrible idea for us to try to use any of that, and Mac users should be very happy with the new support for ICU as DB default. [1] https://lapcatsoftware.com/articles/bigsur.html
On Mon, Jun 13, 2022 at 5:41 PM Thomas Munro <thomas.munro@gmail.com> wrote: > It'd clearly be a terrible idea for us to try to use any of that, and > Mac users should be very happy with the new support for ICU as DB > default. This suggests something that I already suspected: nobody particularly expects the system lib C to be authoritative for the OS as a whole, in the way that Postgres supposes. At least in the case of Mac OS, which is after all purely a desktop operating system. -- Peter Geoghegan
On 11.06.22 05:35, Peter Geoghegan wrote: > Do we even need to store a version for indexes most of the time if > we're versioning ICU itself, as part of the "time travelling > collations" design? For that matter, do we even need to version > collations directly anymore? Conversely, why are we looking at the ICU version instead of the collation version. If we have recorded the collation as being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones somehow) and pick the one that provides 1234. It doesn't matter whether it's the same ICU version that the collation was originally created with, as long as the collation version stays the same.
> On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > Conversely, why are we looking at the ICU version instead of the collation version. If we have recorded the collationas being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones somehow)and pick the one that provides 1234. It doesn't matter whether it's the same ICU version that the collation wasoriginally created with, as long as the collation version stays the same. Does Unicode CDLR provide (or even track) versioning of collation or other i18n functionality for individual locale settings?I’m thinking it might not even have that concept in the original source repo/data, but I might be remembering wrong. It would require not only watching for changes in the per-locale tailoring rules but also being cognizant of changes in root/DUCETbehavior and understanding the impact of changes there. (Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing therehave been changes to root behavior that didn’t involve any changes to local data files) -Jeremy
On Wed, Jun 15, 2022 at 7:10 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > > On Jun 14, 2022, at 14:10, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > Conversely, why are we looking at the ICU version instead of the collation version. If we have recorded the collationas being version 1234, we need to look through the available ICU versions (assuming we can load multiple ones somehow)and pick the one that provides 1234. It doesn't matter whether it's the same ICU version that the collation wasoriginally created with, as long as the collation version stays the same. One difference would be the effect if ICU ever ships a minor library version update that changes the reported collversion. 1. With the code I proposed in my v4 patch, our version mismatch warnings would kick in, but otherwise everything would continue to work (and corrupt indexes, if they really moved anything around). 2. With a system that (somehow) opens all available libraries and looks for match, it would fail to find one. That is assuming that you are using the typical major-versioned packages we can see in software distributions like Debian. I don't know if minor version changes actually do that, though have wondered out loud a few times in these threads. I might go and poke at some ancient packages to see if that's happened before. To defend against that, we could instead do major + minor versioning, but so far I worried about major only because that's they way they ship 'em in Debian and (AFAICS) RHEL etc, so if you can't easily install 68.0 and 68.1 at the same time. On the other hand, you could always "pin" (or similar concepts) the libicu68 package to a specific minor release, to fix the problem (whether you failed like 1 or like 2 above). > (Common mistake I’ve seen folks make when comparing OS glibc versions is only looking at locale data, not realizing therehave been changes to root behavior that didn’t involve any changes to local data files) Yeah, I've wondered idly before if libc projects and ICU couldn't just offer a way to ask for versions explicitly, and ship historical data. With some system of symlinks to make it all work with defaults for those who don't care, a libc could have /usr/share/locale/en_US@CLDR34.UTF-8 etc so you could setlocale(LC_COLLATE, "en_US@CLDR34"), or something. I suppose they don't want to promise to be able to interpret the old data in future releases, and, as you say, sometimes the changes are in C code, due to bugs or algorithm changes, not the data.
On Jun 14, 2022, at 19:06, Thomas Munro <thomas.munro@gmail.com> wrote:
One difference would be the effect if ICU ever ships a minor library
version update that changes the reported collversion.
If I’m reading it correctly, ICU would not change collation in major versions, as an explicit matter of policy around DUCET stability and versioning.
With some system of symlinks to make it all work with defaults for
those who don't care, a libc could have
/usr/share/locale/en_US@CLDR34.UTF-8 etc so you could
setlocale(LC_COLLATE, "en_US@CLDR34"), or something. I suppose they
don't want to promise to be able to interpret the old data in future
releases, and, as you say, sometimes the changes are in C code, due to
bugs or algorithm changes, not the data.
If I understand correctly, files in /usr/share/locale aren’t enough because those only have the tailoring rules, and core algorithm and data (before applying locale-specific tweaks) also change between versions. I’m pretty sure glibc works similar to UCA in this regard (albeit based on ISO 14651 and not CDLR), and the Unicode link above is a good illustration of default collation rules that underly the locale-specific tweaks.
-Jeremy
Sent from my TI-83
On 14.06.22 21:10, Jeremy Schneider wrote: > Does Unicode CDLR provide (or even track) versioning of collation or other i18n functionality for individual locale settings? Yes. You can see that in PostgreSQL as various pre-seeded ICU collations having different versions.
Hi, Here is a rebase of this experimental patch. I think the basic mechanics are promising, but we haven't agreed on a UX. I hope we can figure this out. Restating the choice made in this branch of the experiment: Here I try to be just like DB2 (if I understood its manual correctly). In DB2, you can use names like "en_US" if you don't care about changes, and names like "CLDR181_en_US" if you do. It's the user's choice to use the second kind to avoid "unexpected effects on applications or database objects" after upgrades. Translated to PostgreSQL concepts, you can use a database default ICU locale like "en-US" if you don't care and "67:en-US" if you do, and for COLLATION objects it's the same. The convention I tried in this patch is that you use either "en-US-x-icu" (which points to "en-US") or "en-US-x-icu67" (which points to "67:en-US") depending on whether you care about this problem. I recognise that this is a bit cheesy, it's all the user's problem to deal with or ignore. An alternative mentioned by Peter E was that the locale names shouldn't carry the prefix, but somehow we should have a list of ICU versions to search for a matching datcollversion/collversion. How would that look? Perhaps a GUC, icu_library_versions = '63, 67, 71'? There is a currently natural and smallish range of supported versions, probably something like 54 ... U_ICU_VERSION_MAJOR_NUM, but it seems a bit weird to try to dlopen ~25 libraries or whatever it might be... Do you think we should try to code this up? I haven't tried it, but the main usability problem I predict with that idea is this: It can cope with a scenario where you created a database with ICU 63 and started using a default of "en" and maybe some explicit fr-x-icu or whatever, and then you upgrade to a new postgres binary using ICU 71, and, as long as you still have ICU 63 installed it'll just magicaly keep using 63, now via dlopen(). But it doesn't provide a way for me to create a new database that uses 63 on purpose when I know what I'm doing. There are various reasons I might want to do that. Maybe the ideas could be combined? Perhaps "en" means "create using binary's linked ICU, open using search-by-collversion", while "67:en" explicitly says which to use? Changes since last version: * Now it just uses the default dlopen() search path, unless you set icu_library_path. Is that a security problem? It's pretty convenient, because it means you can just "apt-get install libicu63" (or local equivalent) and that's all, now 63 is available. * To try the idea out, I made it automatically create "*-x-icu67" alongside the regular "-x-icu" collation objects at initdb time.
Attachment
On Sat, Oct 22, 2022 at 10:24 AM Thomas Munro <thomas.munro@gmail.com> wrote: > ... But it > doesn't provide a way for me to create a new database that uses 63 on > purpose when I know what I'm doing. There are various reasons I might > want to do that. Thinking some more about this, I guess that could be addressed by having an explicit way to request either the library version or collversion-style version when creating a database or collation, but not actually storing it in daticulocale/colliculocale. That could be done either as part of the string that is trimmed off before storing it (so it's only used briefly during creation to find a non-default library)... Perhaps that'd look like initdb --icu-locale "67:en" (ICU library version) or "154.14:en" (individual collation version) or some new syntax in a few places. Thereafter, it would always be looked up by searching for the right library by [dat]collversion as Peter E suggested. Let me try harder to vocalise some more thoughts that have stopped me from trying to code the search-by-collversion design so far: Suppose your pgdata encounters a PostgreSQL linked against a later ICU library, most likely after an OS upgrade or migratoin, a pg_upgrade, or via streaming replication. You might get a new error "can't find ICU collation 'en' with version '153.14'; HINT: install missing ICU library version", and somehow you'll have to work out which one might contain 'en' v153.14 and install it with apt-get etc. Then it'll magically work: your postgres linked against (say) 71 will happily work with the dlopen'd 67. This is enough if you want to stay on 67 until the heat death of the universe. So far so good. Problem 1: Suppose you're ready to start using (say) v72. I guess you'd use the REFRESH command, which would open the main linked ICU's collversion and stamp that into the catalogue, at which point new sessions would start using that, and then you'd have to rebuild all your indexes (with no help from PG to tell you how to find everything that needs to be rebuilt, as belaboured in previous reverted work). Aside from the possibility of getting the rebuilding job wrong (as belaboured elsewhere), it's not great, because there is still a transitional period where you can be using the wrong version for your data. So this requires some careful planning and understanding from the administrator. I admit that the upgrade story is a tiny bit better than the v5 DB2-style patch, which starts using the new version immediately if you didn't use a prefix (and logs the usual warnings about collversion mismatch) instead of waiting for you to run REFRESH. But both of them have a phase where they might use the wrong library to access an index. That's dissatisfying, and leads me to prefer the simple DB2-style solution that at least admits up front that it's not very clever. The DB2-style patch could be improved a bit here with the addition of one more GUC: default_icu_library, so the administrator, rather than the packager, remains in control of which version we use for non-prefixed iculocale values (likely to be what almost everyone is interested in), defaulting to what the packager linked against. I've added that to the patch for illustration (though obviously the error messages produced by collversion mismatch could use some adjustment, ie to clarify that the warning might be cleared by installing and selecting a different library version). Problem 2: If ICU 67 ever decides to report a different version for a given collation (would it ever do that? I don't expect so, but ...), we'd be unable to open the collation with the search-by-collversion design, and potentially the database. What is a user supposed to do then? Presumably our error/hint for that would be "please insert the correct ICU library into drive A", but now there is no correct library; if you can even diagnose what's happened, I guess you might downgrade the ICU library using package tools or whatever if possible, but otherwise you'd be stuck, if you just can't get the right library. Is this a problem? Would you want to be able to say "I don't care, computer, please just press on"? So I think we need a way to turn off the search-by-collversion thing. How should it look? I'd love to hear others' thoughts on how we can turn this into a workable solution. Hopefully while staying simple...
Attachment
On 22.10.22 03:22, Thomas Munro wrote: > Suppose your pgdata encounters a PostgreSQL linked against a later ICU > library, most likely after an OS upgrade or migratoin, a pg_upgrade, > or via streaming replication. You might get a new error "can't find > ICU collation 'en' with version '153.14'; HINT: install missing ICU > library version", and somehow you'll have to work out which one might > contain 'en' v153.14 and install it with apt-get etc. Then it'll > magically work: your postgres linked against (say) 71 will happily > work with the dlopen'd 67. This is enough if you want to stay on 67 > until the heat death of the universe. So far so good. What I'm wondering is where those ICU installations are going to come from. In order for this project to be viable, we would need to convince some combination of ICU maintainers, OS packagers, and PGDG packagers to provide and maintain five year's worth of ICU packages (yearly releases AFAICT). Is that something we are willing to get into? (Even to test this I need to figure out where to get another ICU installation from. I'll try how easy manual installations are.)
On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > What I'm wondering is where those ICU installations are going to come > from. In order for this project to be viable, we would need to convince > some combination of ICU maintainers, OS packagers, and PGDG packagers to > provide and maintain five year's worth of ICU packages (yearly releases > AFAICT). Is that something we are willing to get into? I hacked on this on a Debian machine that has a couple of these installed and they work fine, but now I realise that might have to do with the major upgrade history of the machine. So yeah... probably. :-/ Not being involved in packaging I have no idea how plausible such a backports (erm, forwardports?) repo would be, and I have even less idea for other distros.
On Wed, Nov 2, 2022 at 1:42 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Nov 1, 2022 at 11:33 PM Peter Eisentraut > <peter.eisentraut@enterprisedb.com> wrote: > > What I'm wondering is where those ICU installations are going to come > > from. In order for this project to be viable, we would need to convince > > some combination of ICU maintainers, OS packagers, and PGDG packagers to > > provide and maintain five year's worth of ICU packages (yearly releases > > AFAICT). Is that something we are willing to get into? > > I hacked on this on a Debian machine that has a couple of these > installed and they work fine, but now I realise that might have to do > with the major upgrade history of the machine. So yeah... probably. > :-/ Not being involved in packaging I have no idea how plausible such > a backports (erm, forwardports?) repo would be, and I have even less > idea for other distros. After sleeping on it, I don't really agree that the project is not viable even if it requires hoop-jumping to set up right now. It's a chicken-and-egg problem, and the first step is to make it possible to do it at all, thereby creating the demand for convenient packages. I think we have several topics here: 1. Technical problems relating to dlopen'ing. Does it work? Is the default dlopen() secure enough? Is it building sensible library names, even on the freaky-library OSes (Windows, macOS, AIX)? Is it enough to have that GUC for non-default path, should it be a search path, should it share the existing dynamic_library_path? Are the indirect function calls fast enough? Is the way it handles API stability sound? Can we drop some unfinished complexity by dropping pre-53 ICU? Does it use too much memory? 2. User experience problems relating to upgrade paths and user interface. Is it enough to start with the basic DB2-style approach that I've prototyped here? How should we refer to library versions? Is your search-for-the-collversion idea better? My gut feeling is that the early version should be about giving people options, and not trying to be too clever/automatic with questionable semantics, and later improvements could follow, for example if we have another go at the per-object version tracking. 3. Library availability. This is a problem for downstream communities to solve. For example, the people who build Windows installers might want to start bundling the ICU versions from their earlier releases, the people involved with each Linux/BSD distro would hopefully figure out a good way to publish the packages from older OS releases in one repo, and the people running managed systems probably do their own packaging anyway, they'll figure it out. I realise that you are involved in packaging and I am not, so we probably have different perspectives: I get to say "and here, magic happens!" :-) FWIW at least 57, 63 and 67 (corresponding to deb9, 10, 11) from http://ftp.debian.org/debian/pool/main/i/icu/ can be installed with dpkg -i on my Debian 11 machine. 52 (deb8) too, probably, but it has dependencies I didn't look into. 71 and 72 are newer than the -dev version (what we link against), so I didn't try installing but the patch as posted wouldn't let me open them: the idea here is to allow only older stuff to be dlopen'd, so if a breaking API change comes down the pipe we'll be able to deal with it. Not being a packaging guy, I don't how how stupid it would be to build a package repo that literally just exposes these via an index and that's all, or whether it's better to rebuild the ICU versions from source against modern C/C++ runtimes etc.
On 02.11.22 00:57, Thomas Munro wrote: > 3. Library availability. This is a problem for downstream > communities to solve. For example, the people who build Windows > installers might want to start bundling the ICU versions from their > earlier releases, the people involved with each Linux/BSD distro would > hopefully figure out a good way to publish the packages from older OS > releases in one repo, and the people running managed systems probably > do their own packaging anyway, they'll figure it out. I realise that > you are involved in packaging and I am not, so we probably have > different perspectives: I get to say "and here, magic happens!" :-) I made a Homebrew repository for ICU versions 50 through 72: https://github.com/petere/homebrew-icu All of these packages build and pass their self-tests on my machine. So from that experience, I think maintaining a repository of ICU versions, and being able to install more than one for testing this feature, is feasible. Now I have started building PostgreSQL against these, to get some baseline of what is supported and actually works. The results are a bit mixed so far, more to come later. The installation instructions currently say that the minimum required version of ICU is 4.2. That was the one that shipped with RHEL 6. I think we have de-supported RHEL 6 and could increase that. The version in RHEL 7 is 50. (My repository happens to start at 50 because the new versioning system started at 49, but 49 doesn't appear to be tagged at the icu github site.) Note: Recent versions of libxml2 link against icu. This isn't a problem, thanks to the symbol versioning, but if you get libxml2 via pkg-config, you might get LDFLAGS from not the icu version you wanted.
On Tue, Nov 8, 2022 at 1:22 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > I made a Homebrew repository for ICU versions 50 through 72: > https://github.com/petere/homebrew-icu Nice! > All of these packages build and pass their self-tests on my machine. So > from that experience, I think maintaining a repository of ICU versions, > and being able to install more than one for testing this feature, is > feasible. I wonder what the situation with CVEs is in older releases. I heard a rumour that upstream might only patch current + previous, leaving it up to distros to back-patch to whatever they need to support, but I haven't tried to track down cold hard evidence of this or think about what it means for this project...
On 22.10.22 03:22, Thomas Munro wrote: > I'd love to hear others' thoughts on how we can turn this into a > workable solution. Hopefully while staying simple... I played with this patch a bit. It looks like a reasonable approach. Attached is a small patch to get the dynamic libicu* lookup working with the library naming on macOS. Instead of packing the ICU version into the locale field ('63:en'), I would make it a separate field in pg_collation and a separate argument in CREATE COLLATION. At this point, perhaps it would be good to start building some tests to demonstrate various upgrade scenarios and to ensure portability.
Attachment
I looked at v6. * We'll need some clearer instructions on how to build/install extra ICU versions that might not be provided by the distribution packaging. For instance, I got a cryptic error until I used --enable-rpath, which might not be obvious to all users. * Can we have a better error when the library was built with -- disable-renaming? We can just search for the plain (no suffix) symbol. * We should use dlerror() instead of %m to report dlopen() errors. * It seems like the collation version is just there to issue WARNINGs when a user is using the non-versioned locale syntax and the library changes underneath them (or if there is collation version change within a single ICU major version)? * How are you testing this? * In my tests (sort, hacked so abbreviate is always false), I see a ~3% regression for ICU+UTF8. That's fine with me. I assume it's due to the indirect function call, but that's not obvious to me from the profile. If it's a major problem we could have a special case of varstrfastcmp_locale() that works on the compile-time ICU version. I realize your patch is experimental, but when there is a better consensus on the approach, we should consider adding declarative syntax such as: CREATE COLLATION (or LOCALE?) PROVIDER icu67 TYPE icu VERSION '67' AS '/path/to/icui18n.so.67'; It will offer more opportunities to catch errors early and offer better error messages. It would also enable it to function if the library is built with --disable-renaming (though we'd have to trust the user). On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote: > Problem 1: Suppose you're ready to start using (say) v72. I guess > you'd use the REFRESH command, which would open the main linked ICU's > collversion and stamp that into the catalogue, at which point new > sessions would start using that, and then you'd have to rebuild all > your indexes (with no help from PG to tell you how to find everything > that needs to be rebuilt, as belaboured in previous reverted work). > Aside from the possibility of getting the rebuilding job wrong (as > belaboured elsewhere), it's not great, because there is still a > transitional period where you can be using the wrong version for your > data. So this requires some careful planning and understanding from > the administrator. How is this related to the search-by-collversion design? It seems like it's hard no matter what. -- Jeff Davis PostgreSQL Contributor Team - AWS
Replying to Peter and Jeff in one email. On Sat, Nov 12, 2022 at 3:57 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > On 22.10.22 03:22, Thomas Munro wrote: > > I'd love to hear others' thoughts on how we can turn this into a > > workable solution. Hopefully while staying simple... > > I played with this patch a bit. It looks like a reasonable approach. Great news. > Attached is a small patch to get the dynamic libicu* lookup working with > the library naming on macOS. Thanks, squashed. > Instead of packing the ICU version into the locale field ('63:en'), I > would make it a separate field in pg_collation and a separate argument > in CREATE COLLATION. I haven't tried this yet, as I focused on coming up with a way of testing in this iteration. I can try this next. I'm imagining that we'd have pg_collation.collicuversion and pg_database.daticuversion, and they'd default to 0 for "use the GUC", and perhaps you'd even be able to ALTER them. Perhaps we wouldn't even need the GUC then... 0 could mean "the linked version", and if you don't like it, you ALTER it. Thinking about this. > At this point, perhaps it would be good to start building some tests to > demonstrate various upgrade scenarios and to ensure portability. OK, here's what I came up with. You enable it in PG_TEST_EXTRA, and tell it about an alternative ICU version you have in the standard library search path that is not the same as the main/linked one: $ meson configure -DPG_TEST_EXTRA="icu=63" $ meson test icu/020_multiversion Another change from your feedback: you mentioned that RHEL7 shipped with ICU 50, so I removed my suggestion of dropping some extra code we carry for versions before 54 and set the minimum acceptable version to 50. It probably works further back than that, but that's a decent range, I think. On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote: > I looked at v6. Thanks for jumping in and testing! > * We'll need some clearer instructions on how to build/install extra > ICU versions that might not be provided by the distribution packaging. > For instance, I got a cryptic error until I used --enable-rpath, which > might not be obvious to all users. Suggestions welcome. No docs at all yet... > * Can we have a better error when the library was built with -- > disable-renaming? We can just search for the plain (no suffix) symbol. I threw out that symbol probing logic, and wrote something simpler that should now also work with --disable-renaming (though not tested). Now it does a cross-check with the library's self-reported major version, just to make sure there wasn't a badly named library file, which may be more likely with --disable-renaming. > * We should use dlerror() instead of %m to report dlopen() errors. Fixed. > * It seems like the collation version is just there to issue WARNINGs > when a user is using the non-versioned locale syntax and the library > changes underneath them (or if there is collation version change within > a single ICU major version)? Correct. I have now updated the warning messages you get when they don't match, to provide a hint about what to do about it. I am sure they need some more word-smithing, though. > * How are you testing this? Ad hoc noodling before now, but see attached. > I realize your patch is experimental, but when there is a better > consensus on the approach, we should consider adding declarative syntax > such as: > > CREATE COLLATION (or LOCALE?) PROVIDER icu67 > TYPE icu VERSION '67' AS '/path/to/icui18n.so.67'; > > It will offer more opportunities to catch errors early and offer better > error messages. It would also enable it to function if the library is > built with --disable-renaming (though we'd have to trust the user). Earlier in this and other threads, we wondered if each ICU major version should be a separate provider, which is what you're showing there, or should be an independent property of an individual COLLATION, which is what v6 did with '63:en' and what Peter suggested I make more formal with CREATE COLLATION foo (..., ICU_VERSION=63). I actually started out thinking we'd have multiple providers, but I couldn't really think of any advantage, and I think it makes some upgrade scenarios more painful. Can you elaborate on why you'd want that model? > On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote: > > Problem 1: Suppose you're ready to start using (say) v72. I guess > > you'd use the REFRESH command, which would open the main linked ICU's > > collversion and stamp that into the catalogue, at which point new > > sessions would start using that, and then you'd have to rebuild all > > your indexes (with no help from PG to tell you how to find everything > > that needs to be rebuilt, as belaboured in previous reverted work). > > Aside from the possibility of getting the rebuilding job wrong (as > > belaboured elsewhere), it's not great, because there is still a > > transitional period where you can be using the wrong version for your > > data. So this requires some careful planning and understanding from > > the administrator. > > How is this related to the search-by-collversion design? It seems like > it's hard no matter what. Yeah. I just don't like the way it *appears* to be doing something clever, but it doesn't solve any fundamental problem at all because the collversion information is under human control and so it's really doing something stupid. Hence desire to build something that at least admits that it's primitive and just gives you some controls, in a first version. We could always reconsider that in later work though, maybe even an optional policy or something?
Attachment
On Sat, Nov 19, 2022 at 7:38 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Nov 15, 2022 at 1:55 PM Jeff Davis <pgsql@j-davis.com> wrote: > > I realize your patch is experimental, but when there is a better > > consensus on the approach, we should consider adding declarative syntax > > such as: > > > > CREATE COLLATION (or LOCALE?) PROVIDER icu67 > > TYPE icu VERSION '67' AS '/path/to/icui18n.so.67'; > > > > It will offer more opportunities to catch errors early and offer better > > error messages. It would also enable it to function if the library is > > built with --disable-renaming (though we'd have to trust the user). > > Earlier in this and other threads, we wondered if each ICU major version should > be a separate provider, which is what you're showing there, or should be an > independent property of an individual COLLATION, which is what v6 did with > '63:en' and what Peter suggested I make more formal with CREATE COLLATION foo > (..., ICU_VERSION=63). I actually started out thinking we'd have multiple > providers, but I couldn't really think of any advantage, and I think it makes > some upgrade scenarios more painful. Can you elaborate on why you'd want > that model? Hmm, thinking some more about this... I said the above thinking that you couldn't change a provider after creating a database/collation. But what if you could? 1. CREATE DATABASE x LOCALE_PROVIDER=icu ...; 2. Some time later after an upgrade, my postgres binary is linked against a new ICU version and I start seeing warnings. 3. ALTER DATABASE x LOCALE_PROVIDER=icu63; I suppose you shouldn't be allowed to change libc -> icu, but you could change icu - > icuXXX, or I guess icuXXX -> icuXXX. What if you didn't have to manually manage the set of available providers with DDL like you showed, but we just automatically supported "icu" (= the linked ICU, whatever it might be), and icu50 up to icuXXX where XXX is the linked ICU's version? We can encode those values + libc as an int, to replace the existing char the represents providers in catalogues. That's basically just a different way of encoding the same information that Peter was suggesting I put in a new catalogue attribute. How do you like that bikeshed colour?
On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote: > Problem 2: If ICU 67 ever decides to report a different version for > a > given collation (would it ever do that? I don't expect so, but ...), > we'd be unable to open the collation with the search-by-collversion > design, and potentially the database. What is a user supposed to do > then? Presumably our error/hint for that would be "please insert the > correct ICU library into drive A", but now there is no correct > library Let's say that Postgres is compiled against version 67.X, and the sysadmin upgrades the ICU package to 67.Y, which reports a different collation version for some locale. Your current patch makes this impossible for the administrator to fix, because there's no way to have two different libraries loaded with the same major version number, so it will always pick the compiled-in ICU. The user will be forced to accept the new version of the collation, see WARNINGs in their logs, and possibly corrupt their indexes. Search-by-collversion would still be frustrating for the admin, but at least it would be possible to fix by compiling their own 67.X and asking Postgres to search that library, too. We could make it slightly more friendly by having an error that reports the libraries searched and the collation versions found, if none of the versions match. We can have a GUC that controls whether a failure to find the right version is a WARNING or an ERROR. On Sat, 2022-11-19 at 07:38 +1300, Thomas Munro wrote: > > * We'll need some clearer instructions on how to build/install > > extra > > ICU versions that might not be provided by the distribution > > packaging. > > For instance, I got a cryptic error until I used --enable-rpath, > > which > > might not be obvious to all users. > > Suggestions welcome. No docs at all yet... I tried to write up some docs. It's hard to explain why we are exposing to the user the collation version and the library version in these different ways, and what effects they have. The current patch feels like it hasn't decided whether the collation version is ucol_getVersion() (collversion) or u_getVersion() (library version). The collversion is more prominent in the UI (with its own syntax), yet it's just a cross-check for whether to issue a WARNING or not; while the library version is hidden in the locale field and it actually decides which symbol is called. > > > Yeah. I just don't like the way it *appears* to be doing something > clever, but > it doesn't solve any fundamental problem at all because the > collversion > information is under human control and so it's really doing something > stupid. I assume by "human control" you mean "ALTER COLLATION ... REFRESH VERSION". I agree that relying on the admin's declaration is dubious, especially when we provide no good advice on how to actually do that safely. But I don't see what using the library version instead buys us here, except that library version is part of the LOCALE, and there's no ALTER command for that. You could just as easily deprecate/eliminate the ALTER COLLATION REFRESH VERSION, and then say that the collversion is out of human control, too. By introducing multiple libraries, I think we need to change that syntax anyway, to be something like: ALTER COLLATION ... SET VERSION TO '...' or even: ALTER COLLATION ... FORCE VERSION TO '...' > Hence desire to build something that at least admits that it's > primitive and > just gives you some controls, in a first version. Using either the library version or the collation version seems reasonably simple to me. But from a documentation and usability standpoint, the way they are currently mixed seems confusing. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Tue, Nov 22, 2022 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Sat, 2022-10-22 at 14:22 +1300, Thomas Munro wrote: > > Problem 2: If ICU 67 ever decides to report a different version for > > a > > given collation (would it ever do that? I don't expect so, but ...), > > we'd be unable to open the collation with the search-by-collversion > > design, and potentially the database. What is a user supposed to do > > then? Presumably our error/hint for that would be "please insert the > > correct ICU library into drive A", but now there is no correct > > library > > Let's say that Postgres is compiled against version 67.X, and the > sysadmin upgrades the ICU package to 67.Y, which reports a different > collation version for some locale. > > Your current patch makes this impossible for the administrator to fix, > because there's no way to have two different libraries loaded with the > same major version number, so it will always pick the compiled-in ICU. > The user will be forced to accept the new version of the collation, see > WARNINGs in their logs, and possibly corrupt their indexes. They could probably also 'pin' the older minor version package using their package manager (= downgrade) until they're ready to upgrade and use REFRESH VERSION to certify that they've rebuilt everything relevant or are OK with risks. Not pretty I admit, but I think the end result is about the same for search-for-collversion, because I imagine that (1) the default behaviour on failure to search would likely be to use the linked library instead and WARN about [dat]collversion mismatch, so far the same, and (2) the set of people who would really be prepared to compile their own copy of 67.X instead of downgrading or REFRESHing (with or without rebuilding) is vanishingly small. Two questions I wondered about: 1. *Do* they change ucol_getVersion() values in minor releases? I tried to find a written policy on that. https://icu.unicode.org/processes is not encouraging: it gives the example of a "third digit in an official release number" [changing] because a CLDR change was incorporated. Hrmph. But that's clearly not even the modern ICU versioning system (it made a change a bit like ours in 49, making the first number only major, so maybe that "third" number is now the second number, AKA minor version), and also that's a CLDR minor version change; is CLDR minor even in the recipe for ucol_getVersion()? Even without data changes, I guess that bug fixes could apply to the UCA logic, and I assume that UCA logic is included in it. Hmm. A non-hypothetical example of a CLDR change within an ICU major version that I've been able to find is: https://cldr.unicode.org/index/downloads/cldr-38 Here we see that CLDR had a minor version bump 38 -> 38.1, "a very small number of incremental additions to version 38 to address the specific bugs listed in Δ38.1", and was included in ICU 68.2. Being a minor ICU release 68.1 -> 68.2, perhaps you could finish up running that just with a regular upgrade on typical distros (not a major OS upgrade), and since PostgreSQL would normally be linked against eg .68, not .68.1, it'd start using it at the next cluster start when that symlink is updated to point to .68.2. As it happens, if you follow the documentation links to see what actually changed in that particular pair of CLDR+ICU minor releases, it's timezones and locale stuff other than collations, so wouldn't affect us. Can we find a chapter and verse that says that ICU would only ever move to a new CLDR in a minor release, and CLDR would never change order of pre-existing code points in a minor release? It might be interesting to see if https://github.com/unicode-org/icu/tree/release-68-1 and https://github.com/unicode-org/icu/tree/release-68-2 report a different ucol_getVersion() for any locale, but not conclusive if it doesn't; it might be because something in the version pipeline knew that particular CLDR change didn't affect collators... This speculation feels pretty useless. Maybe we should go and read the code or ask an ICU expert, but I'm not against making it theoretically possible to access two different minor versions at once, just to cover all the bases for future-proofing. 2. Would package managers ever allow two minor versions to be installed at once? I highly doubt it; they're probably more interested in ABI stability so that dependent packages work when bugfixes are shipped, and that's certainly nailed down at the major version level. It'd probably be a case of having to compile it yourself, which seems unlikely to me in the real world. That's why I left minor version out of earlier patches, but I'm OK with changing that. As for how, I think that depends on our modelling decision (see below). > Search-by-collversion would still be frustrating for the admin, but at > least it would be possible to fix by compiling their own 67.X and > asking Postgres to search that library, too. We could make it slightly > more friendly by having an error that reports the libraries searched > and the collation versions found, if none of the versions match. We can > have a GUC that controls whether a failure to find the right version is > a WARNING or an ERROR. Good ideas. > I tried to write up some docs. It's hard to explain why we are exposing > to the user the collation version and the library version in these > different ways, and what effects they have. Always a good test: see how crazy it sounds when translated to user speak. > The current patch feels like it hasn't decided whether the collation > version is ucol_getVersion() (collversion) or u_getVersion() (library > version). The collversion is more prominent in the UI (with its own > syntax), yet it's just a cross-check for whether to issue a WARNING or > not; while the library version is hidden in the locale field and it > actually decides which symbol is called. Yeah. I agree that it sucks to have two kinds of versions flying around in the user's mind. > > Yeah. I just don't like the way it *appears* to be doing something > > clever, but > > it doesn't solve any fundamental problem at all because the > > collversion > > information is under human control and so it's really doing something > > stupid. > > I assume by "human control" you mean "ALTER COLLATION ... REFRESH > VERSION". I agree that relying on the admin's declaration is dubious, > especially when we provide no good advice on how to actually do that > safely. > > But I don't see what using the library version instead buys us here, > except that library version is part of the LOCALE, and there's no ALTER > command for that. You could just as easily deprecate/eliminate the > ALTER COLLATION REFRESH VERSION, and then say that the collversion is > out of human control, too. > > By introducing multiple libraries, I think we need to change that > syntax anyway, to be something like: > > ALTER COLLATION ... SET VERSION TO '...' > > or even: > > ALTER COLLATION ... FORCE VERSION TO '...' OK. Time for a new list of the various models we've discussed so far: 1. search-by-collversion: We introduce no new "library version" concept to COLLATION and DATABASE object and little or no new syntax. Whenever opening a collation or database, the system will search some candidate list of ICU libraries to try to find the one that agrees with [dat]collversion. When creating a new collation or database, the system will select one (probably the linked one unless you override somehow) and record ucol_getVersion() in [dat]collversion. When searching, it might fail to find a suitable library and ereport; to fix that, it is the admin's job to somehow expand the set of candidate libraries. In such a failure case, perhaps it would fall back to using some default library version (probably the one that is linked, overridable by GUC?), with a WARNING (unless you turned on ERRORs), and if you want to shut it up without supplying the right candidate library, you can still fall back to the REFRESH VERSION hammer (or maybe that should indeed called FORCE to make it clearer that it's not a harmless operation where the system holds your hand, you're actually certifying that you have rebuilt indexes and you know what you're doing). The set of candidate versions could perhaps be provided with extra_icu_library_versions=63,71 OR =63.1,63.2 strings, at least on Unix systems following the traditional symlink conventions. Remembering that a typical Unixoid system should have libraries and symlinks like: libicui18n.a libicui18n.so -> libicui18n.so.71.1 libicui18n.so.63 -> libicui18n.so.63.1 libicui18n.so.63.1 libicui18n.so.67 -> libicui18n.so.67.1 libicui18n.so.67.1 libicui18n.so.71 -> libicui18n.so.71.1 libicui18n.so.71.1 The reason I prefer major[.minor] strings over whole library names is that we need to dlopen two of them so it's a little easier to build them from those parts than have to supply both names. The reason I prefer to keep allowing major-only versions to be listed is that it's good to have the option to just follow minor upgrades automatically. Or I guess you could make something that can automatically search a whole directory (which directory?) to find all the suitably named libraries so you don't ever have to mention versions manually (if you want "apt-get install libicu72" to be enough with no GUC change needed) -- is that too weird? Perhaps we could write functions that can show you the available versions to demystify the searching mechanism slightly and show how various numbers relate, something like (warning: I made up numbers for illustration, they are wrong!): SELECT * FROM pg_available_icu_libraries() icu_version unicode_version uca_version cldr_version 67.1 14.0 3.1 38.0 71.1 15.0 4.0 42.0 SELECT * FROM pg_available_icu_collation_versions('en') icu_version collation_version 67.1 142.42 71.1 153.112 2. lib-version-in-providers: We introduce a separate provider value for each ICU version, for example ICU63, plus an unversioned ICU like today. The collversion column is used only for warnings. Warnings are expected when you used the unversioned ICU provider and upgrade to a binary linked to a later library. You can clear the warnings by doing ALTER COLLATION/DATABASE SET [LOCALE_]PROVIDER = ICU63, or with the REFRESH VERSION hammer. Not sure how you fit minor versions into that, if we want to support those. Maybe ICU means "whatever is linked", ICU63 means "whatever libicui18n.so.63 points to" and ICU63_1 means libicu18n.so.63.1, something like that, so the user can choose from three levels of specificity. 3. lib-version-in-attributes: We introduce daticuversion (alongside datcollversion) and collicuversion (alongside collversion). Similar to the above, but it's a separate property and the provider is always ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change ICU_VERSION. 4. lib-version-in-locale: "63:en" from earlier versions. That was mostly a strawman proposal to avoid getting bogged down in syntax/catalogue/model change discussions while trying to prove that dlopen would even work. It doesn't sound like anyone really likes this. 5. lib-version-in-collversion: We didn't explicitly discuss this before, but you hinted at it: we could just use u_getVersion() in [dat]collversion. I haven't analysed this much but I don't think it has a very nice upgrade path from PG15, and it forces you to decide whether to store just the major version and not even notice when the (unstored) minor version changes, or store major.minor and complain/break down when routine minor upgrades happen. It is a logical possibility though, once you decide you only want one kind of version in the system. I'm willing to update the patch to try one of these out so we can kick the tyres some more, but I'll wait to see if we can get some consensus on the way forward. Despite my initial reactions, I'm willing to try out the search-by-collversion concept if others are keen on it. The example I worked through in the first paragraph of this email helped me warm to it a little, and with the observability functions I showed you might have a chance of figuring out what's going on in some edge cases. Any other ideas, or votes for these ideas?
On Wed, 2022-11-23 at 18:08 +1300, Thomas Munro wrote: > (1) the default behaviour on failure to search would > likely be to use the linked library instead and WARN about > [dat]collversion mismatch, so far the same, and Agreed. > (2) the set of people > who would really be prepared to compile their own copy of 67.X > instead > of downgrading or REFRESHing (with or without rebuilding) is > vanishingly small. The set of people prepared to do so is probably small. But the set of people who will do it (prepared or not) when a problem comes up is significantly larger ;-) > 1. *Do* they change ucol_getVersion() values in minor releases? I > tried to find a written policy on that. It seems like a valid concern. The mere existence of a collation version separate from the library major version seems to suggest that it's possible. Perhaps they avoid it in most cases; but absent a specific policy against it, the separate collation version seems to allow them the freedom to do so. > This speculation feels pretty useless. Maybe we should go and read > the code or ask an ICU expert, but I'm not against making it > theoretically possible to access two different minor versions at > once, > just to cover all the bases for future-proofing. I don't think this should be an overriding concern that drives the whole design. It is a nudge in favor of search-by-collversion. > 2. Would package managers ever allow two minor versions to be > installed at once? I highly doubt it; Agreed. I'm sure this has been discussed, but which distros even support multiple major versions of ICU? > > 1. search-by-collversion: We introduce no new "library version" > concept to COLLATION and DATABASE object and little or no new syntax. > Whenever opening a collation or database, the system will search some > candidate list of ICU libraries to try to find the one that agrees > with [dat]collversion. [...] > The reason I prefer major[.minor] strings over whole library names is > that we need to dlopen two of them so it's a little easier to build > them from those parts than have to supply both names. It also makes it easier to know which version suffixes to look for. > The reason I > prefer to keep allowing major-only versions to be listed is that it's > good to have the option to just follow minor upgrades automatically. Makes sense. > Or I guess you could make something that can automatically search a > whole directory (which directory?) to find all the suitably named > libraries so you don't ever have to mention versions manually (if you > want "apt-get install libicu72" to be enough with no GUC change > needed) -- is that too weird? That seems to go a little too far. > SELECT * FROM pg_available_icu_libraries() > SELECT * FROM pg_available_icu_collation_versions('en') +1 > 2. lib-version-in-providers: We introduce a separate provider value > for each ICU version, for example ICU63, plus an unversioned ICU like > today. I expressed interest in this approach before, but when you allowed ICU compiled with --disable-renaming, that mitigated my concerns about when to throw that error. > 3. lib-version-in-attributes: We introduce daticuversion (alongside > datcollversion) and collicuversion (alongside collversion). I think this is the best among 2-4. > 4. lib-version-in-locale: "63:en" from earlier versions. That was > mostly a strawman proposal to avoid getting bogged down in > syntax/catalogue/model change discussions while trying to prove that > dlopen would even work. It doesn't sound like anyone really likes > this. I don't see any advantage of this over 3. > 5. lib-version-in-collversion: We didn't explicitly discuss this > before, but you hinted at it: we could just use u_getVersion() in > [dat]collversion. The advantage here is that it's very easy to tell the admin what library the collation is looking for, but the disadvantages you point out seem a lot worse: migration problems from v15, and the minor version question. I'd vote for 1 on the grounds that it's easier to document and understand a single collation version, which comes straight from ucol_getVersion(). This approach makes it a separate problem to find the collation version among whatever libraries the admin can provide; but adding some observability into the search should mitigate any confusion. Can you go over the advantages of approaches 2-4 again? Is it just a concern about burdening the admin with finding the right ICU library version for a given collation version? That's a valid concern, but I don't think that should be an overriding design point. It seems more important to model the collation versions properly. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > I'm sure this has been discussed, but which distros even support > multiple major versions of ICU? For Debian and friends, you can install any number of libicuNN packages (if you can find them eg from previous release repos), but there's only one libicu-dev. That means that one specific major version is blessed by each Debian release and has its headers and static libraries for you to use as a developer, but you can still install the dynamic libraries from older releases at the same time to satisfy the dependencies of packages or programs that were built on an earlier OS release. They don't declare conflicts on each other and they contain non-conflicting filenames. That's similar to the way standard libraries and various other things are treated, for backward compatibility. For RHEL and friends, I'm pretty sure it's the same concept, but I don't use those and haven't seen it with my own eyes. I don't know for other Linux distros/families, but I expect the above two cover a huge percentage of our users and I expect others to have made similar choices. For the BSDs, which tend to have a single binary package with both headers and libraries owing to their origins as source-based distributions (ports), the above way of thinking doesn't work; I couldn't develop this on my usual FreeBSD battlestation without building ICU myself (problem being that there's only one "pkg install icu") and I hope to talk to someone who knows what to do about that eventually. I want this to work there easily for end users. macOS and Windows have so many different ways of installing things that there isn't a single answer there; supposedly open source is like a bazaar and closed source like a cathedral, but as far as package management goes, it looks more like rubble to me.
On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > I'd vote for 1 on the grounds that it's easier to document and > understand a single collation version, which comes straight from > ucol_getVersion(). This approach makes it a separate problem to find > the collation version among whatever libraries the admin can provide; > but adding some observability into the search should mitigate any > confusion. OK, it sounds like I should code that up next. > Can you go over the advantages of approaches 2-4 again? Is it just a > concern about burdening the admin with finding the right ICU library > version for a given collation version? That's a valid concern, but I > don't think that should be an overriding design point. It seems more > important to model the collation versions properly. Yes, that's a good summary. The user has a problem, and the solution is to find some version of ICU and install it, so the problem space necessarily involves the other kind of version. My idea was that we should therefore make that part of the model. But the observability support does indeed make it a bit clearer what's going on.
On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > > I'd vote for 1 on the grounds that it's easier to document and > > understand a single collation version, which comes straight from > > ucol_getVersion(). This approach makes it a separate problem to find > > the collation version among whatever libraries the admin can provide; > > but adding some observability into the search should mitigate any > > confusion. > > OK, it sounds like I should code that up next. Here's the first iteration. The version rosetta stone functions look like this: postgres=# select * from pg_icu_library_versions(); icu_version | unicode_version | cldr_version -------------+-----------------+-------------- 67.1 | 13.0 | 37.0 63.1 | 11.0 | 34.0 57.1 | 8.0 | 29.0 (3 rows) postgres=# select * from pg_icu_collation_versions('zh'); icu_version | uca_version | collator_version -------------+-------------+------------------ 67.1 | 13.0 | 153.14.37 63.1 | 11.0 | 153.88.34 57.1 | 8.0 | 153.64.29 (3 rows) It's no longer necessary to put anything in PG_TEST_EXTRA to run "meson test irc/020_multiversion" usefully. It will find extra ICU versions all by itself in your system library search path and SKIP if it doesn't find a second major version. I have tried to cover the main scenarios that I expect users to encounter in the update TAP tests, with commentary that I hope will be helpful to assess the usability of this thing. Other changes: * now using RTLD_LOCAL instead of RTLD_GLOBAL (I guess the latter might cause trouble for someone using --disable-renaming, but I haven't tested that and am not an expert on linker/loader arcana) * fixed library names on Windows (based on reading the manual, but I haven't tested that) * fixed failure on non-ICU builds (the reason CI was failing in v7, some misplaced #ifdefs) * various cleanup * I've attached a throwaway patch to install a second ICU version on Debian/amd64 on CI, since otherwise the new test would SKIP on all systems This is just a first cut, but enough to try out and see if we like it, what needs to be improved, what edge cases we haven't thought about etc. Let me know what you think.
Attachment
On Sat, Nov 26, 2022 at 6:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > This is just a first cut, but enough to try out and see if we like it, > what needs to be improved, what edge cases we haven't thought about > etc. Let me know what you think. BTW one problem to highlight (mentioned but buried in the test comments), is that REFRESH VERSION doesn't affect other sessions or even the current session. You have to log out and back in again to pick up the new version. Obviously that's not good enough, but fixing that involves making it transactional, I think. If you abort, we have to go back to using the old version, if you commit you keep the new version and we might also consider telling other backends to start using the new version -- or something like that. I think that's just a Small Matter of Programming, but a little bit finickity and I need to take a break for a bit and go work on bugs elsewhere, hence v8 didn't address that yet.
On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: > OK. Time for a new list of the various models we've discussed so far: > > 1. search-by-collversion: We introduce no new "library version" > concept to COLLATION and DATABASE object and little or no new syntax. > > 2. lib-version-in-providers: We introduce a separate provider value > for each ICU version, for example ICU63, plus an unversioned ICU like > today. > > 3. lib-version-in-attributes: We introduce daticuversion (alongside > datcollversion) and collicuversion (alongside collversion). Similar > to the above, but it's a separate property and the provider is always > ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change > ICU_VERSION. > > 4. lib-version-in-locale: "63:en" from earlier versions. That was > mostly a strawman proposal to avoid getting bogged down in > syntax/catalogue/model change discussions while trying to prove that > dlopen would even work. It doesn't sound like anyone really likes > this. > > 5. lib-version-in-collversion: We didn't explicitly discuss this > before, but you hinted at it: we could just use u_getVersion() in > [dat]collversion. I'd like to vote against #3 at least in the form that's described here. If we had three more libraries providing collations, it's likely that they would need versioning, too. So if we add an explicit notion of provider version, then it ought not to be specific to libicu. I think it's OK to decide that different library versions are different providers (your option #2), or that they are the same provider but give rise to different collations (your option #4), or that there can be multiple version of each collation which are distinguished by some additional provider version field (your #3 made more generic). I don't really understand #1 or #5 well enough to have an educated opinion, but I do think that #1 seems a bit magical. It hopes that the combination of a collation name and a datcollversion will be sufficient to find exactly one matcing collation in a list of provided libraries. The advantage of that, as I understand it, is that if you do something to your system that causes the number of matches to go from one to zero, you can just throw another library on the pile and get the number back up to one. Woohoo! But there's a part of me that worries: what if the number goes up to two, and they're not all the same? Probably that's something that shouldn't happen, but if it does then I think there's kind of no way to fix it. With the other options, if there's some way to jigger the catalog state to match what you want to happen, you can always repair the situation somehow, because the library to be used for each collation is explicitly specified in some way, and you just have to get it to match what you want to have happen. I don't know too much about this, though, so I might have it all wrong. -- Robert Haas EDB: http://www.enterprisedb.com
On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote: > Here's the first iteration. I will send a full review shortly, but I encountered an ICU bug along the way, which caused me some confusion for a bit. I'll skip past the various levels of confusion I had (burned a couple hours), and get right to the repro: Install the latest release of all major versions 50-69, and compile postgres against 70. You'll get: =# select * from pg_icu_collation_versions('en_US') order by icu_version; icu_version | uca_version | collator_version -------------+-------------+------------------ 50.2 | 6.2 | 58.0.6.50 51.3 | 6.2 | 58.0.6.50 52.2 | 6.2 | 58.0.6.50 53.2 | 6.3 | 137.51 54.2 | 7.0 | 137.56 55.2 | 7.0 | 153.56 56.2 | 8.0 | 153.64 57.2 | 8.0 | 153.64 58.3 | 9.0 | 153.72 59.2 | 9.0 | 153.72 60.3 | 10.0 | 153.80 61.2 | 10.0 | 153.80 62.2 | 11.0 | 153.88 63.2 | 11.0 | 153.88 64.2 | 12.1 | 153.97 65.1 | 12.1 | 153.97 66.1 | 13.0 | 153.14 67.1 | 13.0 | 153.14 68.2 | 13.0 | 153.14 69.1 | 13.0 | 153.14 70.1 | 14.0 | 153.112 (21 rows) This is good information, because it tells us that major library versions change more often than collation versions, empirically- speaking. But did you notice that the version went backwards from 65.1 -> 66.1? Well, actually, it didn't. The version of that collation in 66.1 went from 153.97 -> 153.104. But there's a bug in versionToString() that does the decimal output incorrectly when there's a '0' digit between the hundreds and the ones place. I'll see about reporting that, but I thought I'd mention it here because it could have consequences, as we are storing the strings :-( The bug is still present in 70.1, but it's masked because it went to .112. Incidentally, this answers our other question about whether the collation version can change in a minor version update. Perhaps not, but if they fix this bug and backport it, then the version *string* will change in a minor update. Ugh. Regards, Jeff Davis
On Mon, Nov 28, 2022 at 9:55 PM Jeff Davis <pgsql@j-davis.com> wrote: > But did you notice that the version went backwards from 65.1 -> 66.1? > Well, actually, it didn't. The version of that collation in 66.1 went > from 153.97 -> 153.104. But there's a bug in versionToString() that > does the decimal output incorrectly when there's a '0' digit between > the hundreds and the ones place. I'll see about reporting that, but I > thought I'd mention it here because it could have consequences, as we > are storing the strings :-( > > The bug is still present in 70.1, but it's masked because it went to > .112. > > Incidentally, this answers our other question about whether the > collation version can change in a minor version update. Perhaps not, > but if they fix this bug and backport it, then the version *string* > will change in a minor update. Ugh. That is ... astonishingly bad. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote: > That is ... astonishingly bad. https://unicode-org.atlassian.net/browse/CLDR-16175 -- Jeff Davis PostgreSQL Contributor Team - AWS
On Tue, Nov 29, 2022 at 3:55 PM Jeff Davis <pgsql@j-davis.com> wrote: > =# select * from pg_icu_collation_versions('en_US') order by > icu_version; > icu_version | uca_version | collator_version > -------------+-------------+------------------ > 50.2 | 6.2 | 58.0.6.50 > 51.3 | 6.2 | 58.0.6.50 > 52.2 | 6.2 | 58.0.6.50 > 53.2 | 6.3 | 137.51 > 54.2 | 7.0 | 137.56 > 55.2 | 7.0 | 153.56 > 56.2 | 8.0 | 153.64 > 57.2 | 8.0 | 153.64 > 58.3 | 9.0 | 153.72 > 59.2 | 9.0 | 153.72 > 60.3 | 10.0 | 153.80 > 61.2 | 10.0 | 153.80 > 62.2 | 11.0 | 153.88 > 63.2 | 11.0 | 153.88 > 64.2 | 12.1 | 153.97 > 65.1 | 12.1 | 153.97 > 66.1 | 13.0 | 153.14 > 67.1 | 13.0 | 153.14 > 68.2 | 13.0 | 153.14 > 69.1 | 13.0 | 153.14 > 70.1 | 14.0 | 153.112 > (21 rows) > > This is good information, because it tells us that major library > versions change more often than collation versions, empirically- > speaking. Wow, nice discovery about 104 -> 14. Yeah, I imagine we'll want some kind of band-aid to tolerate that exact screwup and avoid spurious warnings. Bugs aside, that's quite a revealing table in other ways. We can see: * The version scheme changed completely in ICU 53. This corresponds to a major rewrite of the collation code, I see[1]. * The first component seems to be (UCOL_RUNTIME_VERSION << 4) + 9. UCOL_RUNTIME_VERSION is in their uvernum.h, currently 9, was 8, bumped between 54 and 55 (I see this in their commit log), corresponding to the two possible numbers 137 and 153 that we see there. I don't know where the final 9 term is coming from but it looks stable since the v2 collation rewrite landed. * The second component seems to be uca_version_major * 8 + uca_version_minor (that's the Unicode Collation Algorithm version, and so far always matches the Unicode version, visible in the output of the other function). * The values you showed for English don't have a third component, but if you try some other locales like 'zh' you'll see the CLDR major version in third position. So I guess some locales depend on CLDR data and others don't. TL;DR it *looks* like the set of ingredients for the version string is: * UCOL_RUNTIME_VERSION (rarely changes) * UCA/Unicode major.minor version * sometimes CLDR major version, not sure when * 9 [1] https://icu.unicode.org/design/collation/v2
On Mon, 2022-11-28 at 14:11 -0500, Robert Haas wrote: > I don't really understand #1 or #5 well enough to have an educated > opinion, but I do think that #1 seems a bit magical. It hopes that > the > combination of a collation name and a datcollversion will be > sufficient to find exactly one matcing collation in a list of > provided > libraries. The advantage of that, as I understand it, is that if you > do something to your system that causes the number of matches to go > from one to zero, you can just throw another library on the pile and > get the number back up to one. Woohoo! But there's a part of me that > worries: what if the number goes up to two, and they're not all the > same? Probably that's something that shouldn't happen, but if it does > then I think there's kind of no way to fix it. With the other > options, > if there's some way to jigger the catalog state to match what you > want > to happen, you can always repair the situation somehow, because the > library to be used for each collation is explicitly specified in some > way, and you just have to get it to match what you want to have > happen. Not necessarily, #2-4 (at least as implemented in v7) can only load one major version at a time, so can't specify minor versions: https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com With #1, you can provide control over the search order to find the symbol you want. Granted, if you want to specify that different collations look in different libraries for the same version, then it won't work, because the search order is global -- is that what you're worried about? If so, I think we need to compare it against the downsides of #2-4, which in my opinion are more serious. The first thing to sort out with options #2-4 is: what about minor versions? V7 took the approach that only the major version matters. That means that if you want to select a specific minor version, then you are out of luck, because only one major at a time can be loaded, globally. But paying attention to minor versions seems like a mess -- we'd need even more magical fallbacks that try later minor versions or something. Second, there is weirdness in the common case that a collation version doesn't change between versions. Let's say you have a collation "mycoll" with locale "en_US" and it's pointed at built-in library version 64, with collation version 153.97. GUC default_icu_library_version is set to 63. Then you upgrade the system and ICU gets updated from 64 -> 65. Now, it can't find version 64 to load, so it falls back to 63 (which has the wrong version 153.88), even though 65 is just fine because it still offers that locale with version 153.97. (A similar problem exists when you remove a version of ICU from icu_library_path, and another version suffices for all of your collations.) Thirdly, as I said earlier, it's just hard on the user to try to sort out two different versions modeled in the database. Understanding encodings and collations are hard enough, and then we introduce *two* versions on top of that. Fourth, I don't see what the point of ucol_getVersion() is in schemes #2-4. All it does is control a WARNING, because throwing an error (at least by default) would be too harsh, given that users have lived with these risks for so long. But if all it does is throw a warning, what's the point in modeling it in the catalog as though it's the most important version? Ultimately, I think collation version (as reported by ucol_getVersion()) is the most accurate and least-surprising way to match a library-provided collation with the collation in the catalog. And it seems like we'd be using it in exactly the way the ICU maintainers intend it to be used. Of course, I cast my vote for #1 before I discovered this ICU bug here: https://www.postgresql.org/message-id/0f7922d4f411376f420ec9139febeae4cdc748a6.camel@j-davis.com That injects some doubt, to be sure. If I were to try to solve the problems with #2-4, one approach might be to treat the built-in ICU version differently from the ones in icu_library_path. Not quite sure, I'd have to think more. But as of now, I'd still lean toward #1 until a better option is presented. Regards, Jeff Davis
On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote: > On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> > wrote: > > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> > > wrote: > > > I'd vote for 1 on the grounds that it's easier to document and > > > understand a single collation version, which comes straight from > > > ucol_getVersion(). This approach makes it a separate problem to > > > find > > > the collation version among whatever libraries the admin can > > > provide; > > > but adding some observability into the search should mitigate any > > > confusion. > > > > OK, it sounds like I should code that up next. > > Here's the first iteration. Thank you. Proposed changes: * I attached a first pass of some documentation. * Should be another GUC to turn WARNING into an ERROR. Useful at least for testing; perhaps too dangerous for production. * The libraries should be loaded in a more diliberate order. The "*" should be expanded in a descending fashion so that later versions are preferred. * GUCs should be validated. * Should validate that loaded library has expected version. * We need to revise or remove pg_collation_actual_version() and pg_database_collation_actual_version(). * The GUCs are PGC_SUSET, but don't take effect because icu_library_list_fully_loaded is never reset. * The extra collations you're adding at bootstrap time are named based on the library major version. I suppose it might be more "proper" to name them based on the collation version, but that would be more verbose, so I won't advocate for that. Just pointing it out. * It looks hard (or impossible) to mix multiple ICU libraries with the same major version and different minor versions. That's because, e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63, and when you install ICU 63.2, those dependencies get clobbered with the 63.2 versions. That fails the sanity check I proposed above about the library version number matching the requested library version number. And it also just seems wrong -- why would you have minor- version precision about an ICU library but then only major-version precision about the ICU dependencies of that library? Doesn't that defeat the whole purpose of this naming scheme? (Maybe another ICU bug?). Minor comments: * ICU_I18N is defined in make_icu_library_name() but used outside of it. One solution might be to have it return both library names to the caller and rename it as make_icu_library_names(). * get_icu_function() could use a clarifying comment or a better name. Something that communicates that you are looking for the function in the given library with the given major version number (which may or may not be needed depending on how the library was compiled). * typo in comment over make_icu_collator: s/u_getVersion/ucol_getVersion/ * The return value of make_icu_collator() seems backwards to me, stylistically. I typically see the false-is-good pattern with integer returns. * weird bracketing style in get_icu_collator for the "else" > The version rosetta stone functions look like this: > > postgres=# select * from pg_icu_library_versions(); > icu_version | unicode_version | cldr_version > -------------+-----------------+-------------- > 67.1 | 13.0 | 37.0 > 63.1 | 11.0 | 34.0 > 57.1 | 8.0 | 29.0 > (3 rows) > > postgres=# select * from pg_icu_collation_versions('zh'); > icu_version | uca_version | collator_version > -------------+-------------+------------------ > 67.1 | 13.0 | 153.14.37 > 63.1 | 11.0 | 153.88.34 > 57.1 | 8.0 | 153.64.29 > (3 rows) I like these functions. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 11/28/22 14:11, Robert Haas wrote: > On Wed, Nov 23, 2022 at 12:09 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> OK. Time for a new list of the various models we've discussed so far: >> >> 1. search-by-collversion: We introduce no new "library version" >> concept to COLLATION and DATABASE object and little or no new syntax. >> >> 2. lib-version-in-providers: We introduce a separate provider value >> for each ICU version, for example ICU63, plus an unversioned ICU like >> today. >> >> 3. lib-version-in-attributes: We introduce daticuversion (alongside >> datcollversion) and collicuversion (alongside collversion). Similar >> to the above, but it's a separate property and the provider is always >> ICU. New syntax for CREATE/ALTER COLLATION/DATABASE to set and change >> ICU_VERSION. >> >> 4. lib-version-in-locale: "63:en" from earlier versions. That was >> mostly a strawman proposal to avoid getting bogged down in >> syntax/catalogue/model change discussions while trying to prove that >> dlopen would even work. It doesn't sound like anyone really likes >> this. >> >> 5. lib-version-in-collversion: We didn't explicitly discuss this >> before, but you hinted at it: we could just use u_getVersion() in >> [dat]collversion. > > I'd like to vote against #3 at least in the form that's described > here. If we had three more libraries providing collations, it's likely > that they would need versioning, too. So if we add an explicit notion > of provider version, then it ought not to be specific to libicu. +many > I think it's OK to decide that different library versions are > different providers (your option #2), or that they are the same > provider but give rise to different collations (your option #4), or > that there can be multiple version of each collation which are > distinguished by some additional provider version field (your #3 made > more generic). I think provider and collation version are distinct concepts. The provider ('c' versus 'i' for example) determines a unique code path in the backend due to different APIs, whereas collation version is related to a specific ordering given a set of characters. > I don't really understand #1 or #5 well enough to have an educated > opinion, but I do think that #1 seems a bit magical. It hopes that the > combination of a collation name and a datcollversion will be > sufficient to find exactly one matcing collation in a list of provided > libraries. The advantage of that, as I understand it, is that if you > do something to your system that causes the number of matches to go > from one to zero, you can just throw another library on the pile and > get the number back up to one. Woohoo! But there's a part of me that > worries: what if the number goes up to two, and they're not all the > same? Probably that's something that shouldn't happen, but if it does > then I think there's kind of no way to fix it. With the other options, > if there's some way to jigger the catalog state to match what you want > to happen, you can always repair the situation somehow, because the > library to be used for each collation is explicitly specified in some > way, and you just have to get it to match what you want to have > happen. My vote is for something like #5. The collversion should indicate a specific immutable ordering behavior. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Nov 28, 2022 at 11:49 PM Jeff Davis <pgsql@j-davis.com> wrote: > Not necessarily, #2-4 (at least as implemented in v7) can only load one > major version at a time, so can't specify minor versions: > https://www.postgresql.org/message-id/9f8e9b5a3352478d4cf7d6c0a5dd7e82496be4b6.camel@j-davis.com > > With #1, you can provide control over the search order to find the > symbol you want. Granted, if you want to specify that different > collations look in different libraries for the same version, then it > won't work, because the search order is global -- is that what you're > worried about? If so, I think we need to compare it against the > downsides of #2-4, which in my opinion are more serious. You know more about this than I do, for sure, so don't let my vote back the project into a bad spot. But, yeah, the thing you mention here is what I'm worried about. Without a way to force a certain behavior for a certain particular collation, you don't have an escape valve if the global library ordering isn't doing what you want. Your argument seems to at least partly be that #1 will be more usable on the whole, and that does seem like an important consideration. People may have a lot of collations and adjusting them all individually could be difficult and unpleasant. However, I think it's also worth asking what options someone has if #1 can't be made to work due to a single ordering controlling every collation. It's entirely possible that the scenario I'm worried about is too remote in practice to be concerned about. I don't know how this stuff works well enough to be certain. It's just that, on the basis of previous experience, (1) it's not that uncommon for people to actually end up in situations that we thought shouldn't ever happen and (2) code that deals with collations is more untrustworthy than average. -- Robert Haas EDB: http://www.enterprisedb.com
On 11/28/22 6:54 PM, Jeff Davis wrote: > > =# select * from pg_icu_collation_versions('en_US') order by > icu_version; > icu_version | uca_version | collator_version > -------------+-------------+------------------ > ... > 67.1 | 13.0 | 153.14 > 68.2 | 13.0 | 153.14 > 69.1 | 13.0 | 153.14 > 70.1 | 14.0 | 153.112 > (21 rows) > > This is good information, because it tells us that major library > versions change more often than collation versions, empirically- > speaking. It seems to me that the collator_version field is not a good version identifier to use. Just taking a quick glance at the ICU home page right now, it shows that all of the last 5 versions of ICU have included "additions and corrections" to locale data itself, including 68 to 69 where the collator version did not change. Is it possible that this "collator_version" only reflects the code that processes collation data to do comparisons/sorts, but it does not reflect updates to the locale data itself? https://icu.unicode.org/ ICU v72 -> CLDR v42 ICU v71 -> CLDR v41 ICU v70 -> CLDR v40 ICU v69 -> CLDR v39 ICU v68 -> CLDR v38 -Jeremy -- http://about.me/jeremy_schneider
On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider <schneider@ardentperf.com> wrote: > It seems to me that the collator_version field is not a good version > identifier to use. > > Just taking a quick glance at the ICU home page right now, it shows that > all of the last 5 versions of ICU have included "additions and > corrections" to locale data itself, including 68 to 69 where the > collator version did not change. > > Is it possible that this "collator_version" only reflects the code that > processes collation data to do comparisons/sorts, but it does not > reflect updates to the locale data itself? I think it also includes the CLDR version for *some* locales. From a quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you mind sharing the same table for one of those? Perhaps 'en' really does depend only on the UCA?
On Tue, 2022-11-29 at 12:32 -0500, Robert Haas wrote: > You know more about this than I do, for sure, so don't let my vote > back the project into a bad spot. I'm going back and forth myself. I haven't found a great answer here yet. > But, yeah, the thing you mention > here is what I'm worried about. Without a way to force a certain > behavior for a certain particular collation, you don't have an escape > valve if the global library ordering isn't doing what you want. One bit of weirdness is that I may have found another ICU problem. First, install 63.1, and you get (editing for clarity): $ ls -l /path/to/libicui18n.so.63* /path/to/libicui18n.so.63 -> libicui18n.so.63.1 /path/to/libicui18n.so.63.1 $ ls -l /path/to/libicuuc.so.63* /path/to/libicuuc.so.63 -> libicuuc.so.63.1 /path/to/libicuuc.so.63.1 $ ls -l /path/to/libicudata.so.63* /path/to/libicudata.so.63 -> libicudata.so.63.1 /path/to/lib/libicudata.so.63.1 $ ldd /path/to/libicui18n.so.63.1 libicuuc.so.63 => /path/to/libicuuc.so.63 libicudata.so.63 => /path/to/libicudata.so.63 OK, now install 63.2. Then you get: $ ls -l /path/to/libicui18n.so.63* /path/to/libicui18n.so.63 -> libicui18n.so.63.2 /path/to/libicui18n.so.63.1 /path/to/libicui18n.so.63.2 $ ls -l /path/to/libicuuc.so.63* /path/to/libicuuc.so.63 -> libicuuc.so.63.2 /path/to/libicuuc.so.63.1 /path/to/libicuuc.so.63.2 $ ls -l /path/to/libicudata.so.63* /path/to/libicudata.so.63 -> libicudata.so.63.2 /path/to/libicudata.so.63.1 /path/to/libicudata.so.63.2 $ ldd /path/to/libicui18n.so.63.2 libicuuc.so.63 => /path/to/libicuuc.so.63 libicudata.so.63 => /path/to/libicudata.so.63 The problem is that the specific minor version 63.1 depends on only the major version of its ICU link dependencies. When loading libicui18n.so.63.1, you are actually pulling in libicuuc.so.63.2 and libicudata.so.63.2. When I tried this with Thomas's patch, it caused some confusing problems. I inserted a check that, when you open a library, that the requested and reported versions match, and the check failed when multiple minors are installed. In other words, opening libicui18n.so.63.1 reports a version of 63.2! (Note: I compiled ICU with --enable-rpath, but I don't think it matters.) Summary: even locking down to a minor version does not seem to identify a specific ICU library, because its shared library dependencies do not reference a specific minor version. > It's entirely possible that the scenario I'm worried about is too > remote in practice to be concerned about. I don't know how this stuff > works well enough to be certain. It's just that, on the basis of > previous experience, (1) it's not that uncommon for people to > actually > end up in situations that we thought shouldn't ever happen and (2) > code that deals with collations is more untrustworthy than average. Yeah... -- Jeff Davis PostgreSQL Contributor Team - AWS
On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote: > My vote is for something like #5. The collversion should indicate a > specific immutable ordering behavior. Easier said than done: https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com Even pointing at a specific minor version doesn't guarantee that specific ICU code is loaded. It could also be a mix of different minor versions that happen to be installed. But if we ignore that problem for a moment, and assume that major version is precise enough, let me make another proposal (not advocating for this, but wanted to put it out there): 6. Create a new concept of a "locked down collation" that points at some specific collation code (identified by some combination of library version and collation version or whatever else can be used to identify it). If a collation is locked down, it would never have a fallback or any other magic, it would either find the code it's looking for, or fail. If a collation is not locked down, it would look only in the built-in ICU library, and warn if it detects some kind of change (again, by whatever heuristic we think is reasonable). #6 doesn't answer all of the problems I pointed out earlier: https://www.postgresql.org/message-id/83faecb4a89dfb5794938e7b4d9f89daf4c5d631.camel@j-davis.com but could be a better starting place for answers. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote: > On Wed, Nov 30, 2022 at 7:03 AM Jeremy Schneider > <schneider@ardentperf.com> wrote: > > It seems to me that the collator_version field is not a good > > version > > identifier to use. > > > > Just taking a quick glance at the ICU home page right now, it shows > > that > > all of the last 5 versions of ICU have included "additions and > > corrections" to locale data itself, including 68 to 69 where the > > collator version did not change. > > > > Is it possible that this "collator_version" only reflects the code > > that > > processes collation data to do comparisons/sorts, but it does not > > reflect updates to the locale data itself? > > I think it also includes the CLDR version for *some* locales. From a > quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you > mind sharing the same table for one of those? Perhaps 'en' really > does depend only on the UCA? =# select * from pg_icu_collation_versions('ar') order by icu_version; icu_version | uca_version | collator_version -------------+-------------+------------------ 50.2 | 6.2 | 58.0.0.50 51.3 | 6.2 | 58.0.0.50 52.2 | 6.2 | 58.0.0.50 53.2 | 6.3 | 137.51.25 54.2 | 7.0 | 137.56.26 55.2 | 7.0 | 153.56.27.1 56.2 | 8.0 | 153.64.28 57.2 | 8.0 | 153.64.29 58.3 | 9.0 | 153.72.30.3 59.2 | 9.0 | 153.72.31.1 60.3 | 10.0 | 153.80.32.1 61.2 | 10.0 | 153.80.33 62.2 | 11.0 | 153.88.33.8 63.2 | 11.0 | 153.88.34 64.2 | 12.1 | 153.97.35.8 65.1 | 12.1 | 153.97.36 66.1 | 13.0 | 153.14.36.8 67.1 | 13.0 | 153.14.37 68.2 | 13.0 | 153.14.38.8 69.1 | 13.0 | 153.14.39 70.1 | 14.0 | 153.112.40 (21 rows) =# select * from pg_icu_collation_versions('zh') order by icu_version; icu_version | uca_version | collator_version -------------+-------------+------------------ 50.2 | 6.2 | 58.0.0.50 51.3 | 6.2 | 58.0.0.50 52.2 | 6.2 | 58.0.0.50 53.2 | 6.3 | 137.51.25 54.2 | 7.0 | 137.56.26 55.2 | 7.0 | 153.56.27.1 56.2 | 8.0 | 153.64.28 57.2 | 8.0 | 153.64.29 58.3 | 9.0 | 153.72.30.3 59.2 | 9.0 | 153.72.31.1 60.3 | 10.0 | 153.80.32.1 61.2 | 10.0 | 153.80.33 62.2 | 11.0 | 153.88.33.8 63.2 | 11.0 | 153.88.34 64.2 | 12.1 | 153.97.35.8 65.1 | 12.1 | 153.97.36 66.1 | 13.0 | 153.14.36.8 67.1 | 13.0 | 153.14.37 68.2 | 13.0 | 153.14.38.8 69.1 | 13.0 | 153.14.39 70.1 | 14.0 | 153.112.40 (21 rows) -- Jeff Davis PostgreSQL Contributor Team - AWS
On 11/29/22 13:59, Jeff Davis wrote: > On Tue, 2022-11-29 at 11:27 -0500, Joe Conway wrote: >> My vote is for something like #5. The collversion should indicate a >> specific immutable ordering behavior. > > Easier said than done: > https://www.postgresql.org/message-id/abddc35a7a447d93e2b8371a1a9052cb48866070.camel@j-davis.com > > Even pointing at a specific minor version doesn't guarantee that > specific ICU code is loaded. It could also be a mix of different minor > versions that happen to be installed. I understand that it is not easily done, but if the combination of collprovider + collversion does not represent specific immutable ordering behavior for a given locale, what value is there in tracking it? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote: > On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote: > > That is ... astonishingly bad. > > https://unicode-org.atlassian.net/browse/CLDR-16175 Oops, reported in CLDR instead of ICU. Moved to: https://unicode-org.atlassian.net/browse/ICU-22215 -- Jeff Davis PostgreSQL Contributor Team - AWS
On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote: > One bit of weirdness is that I may have found another ICU problem. Reported as: https://unicode-org.atlassian.net/browse/ICU-22216 -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Nov 30, 2022 at 8:03 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2022-11-30 at 07:18 +1300, Thomas Munro wrote: > > I think it also includes the CLDR version for *some* locales. From a > > quick look, that includes 'ar', 'ru', 'tr', 'zh'. Jeff, would you > > mind sharing the same table for one of those? Perhaps 'en' really > > does depend only on the UCA? > > =# select * from pg_icu_collation_versions('ar') order by icu_version; > icu_version | uca_version | collator_version > -------------+-------------+------------------ > 50.2 | 6.2 | 58.0.0.50 > 51.3 | 6.2 | 58.0.0.50 > 52.2 | 6.2 | 58.0.0.50 > 53.2 | 6.3 | 137.51.25 > 54.2 | 7.0 | 137.56.26 > 55.2 | 7.0 | 153.56.27.1 > 56.2 | 8.0 | 153.64.28 > 57.2 | 8.0 | 153.64.29 > 58.3 | 9.0 | 153.72.30.3 > 59.2 | 9.0 | 153.72.31.1 > 60.3 | 10.0 | 153.80.32.1 > 61.2 | 10.0 | 153.80.33 > 62.2 | 11.0 | 153.88.33.8 > 63.2 | 11.0 | 153.88.34 > 64.2 | 12.1 | 153.97.35.8 > 65.1 | 12.1 | 153.97.36 > 66.1 | 13.0 | 153.14.36.8 > 67.1 | 13.0 | 153.14.37 > 68.2 | 13.0 | 153.14.38.8 > 69.1 | 13.0 | 153.14.39 > 70.1 | 14.0 | 153.112.40 > (21 rows) Thanks. So now we can see that the CLDR minor version is there too. At a guess, in ICU 60 and before, it was the 4th component directly, and from ICU 61 on, it's shifted left 3 bits. I guess that means those CLDR-dependent locales have higher frequency collversion changes, including everyday "apt-get upgrade" (no major OS upgrade required), assuming that Debian et al take those minor upgrades, while others like 'en' should be stable for the whole ICU major version's lifetime, and even across some ICU major version upgrades, because the Unicode/UCA version changes more slowly. Those CLDR-dependent locales therefore present us with a problem: as discussed a while back, it's impossible to install two minor versions of the same ICU major version with packages, and as Jeff has pointed out in recent emails, even if you compile them yourself (which no one really expects users to do), it doesn't really work because the SONAMEs only have the major version, so the various libraries that make up ICU will not be able to open each other correctly (they'll follow symlinks to an arbitrary minor version). (These two things are not unrelated.) So I probably need to remove the code that claimed to support minor version addressing and go back to the previous thinking that major will have to be enough. In terms of user experience, I think that might mean that users of 'zh' who encounter warnings after a minor upgrade would therefore really only have the options of REFRESHing and rebuilding, or downgrading the package, because there's no way for us to access the older version. Users of 'en' probably only encounter collversion changes when moving between OS releases with an ICU major version change, and then the various schemes in this thread can help them avoid the need to rebuild, until they eventually want to, if ever.
On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote: > 6. Create a new concept of a "locked down collation" that points at > some specific collation code (identified by some combination of library > version and collation version or whatever else can be used to identify > it). If a collation is locked down, it would never have a fallback or > any other magic, it would either find the code it's looking for, or > fail. If a collation is not locked down, it would look only in the > built-in ICU library, and warn if it detects some kind of change > (again, by whatever heuristic we think is reasonable). It seems like it would be somewhat reasonable to allow varying levels of specificity in saying which what suffix to append when calling dlopen() on the ICU library. Like you could allow adding nothing, which would find the system-default ICU, or you could add 53 to find the default version of ICU 53, or you could 53.1 to pick a specific minor version. The idea is that the symlinks in the filesystem would be responsible for sorting out the meaning of the supplied string. The way that minor versions work may preclude having this work as well as one might hope, though. I continue to be confused about why collation maintainers think that it's OK to whack stuff around in minor versions. The thought that people might use collations to sort data that needs to stay sorted after upgrading the library seems to be an alien one, and it doesn't really seem like libicu is a whole lot better than libc, either. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 8:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 29, 2022 at 1:59 PM Jeff Davis <pgsql@j-davis.com> wrote: > > 6. Create a new concept of a "locked down collation" that points at > > some specific collation code (identified by some combination of library > > version and collation version or whatever else can be used to identify > > it). If a collation is locked down, it would never have a fallback or > > any other magic, it would either find the code it's looking for, or > > fail. If a collation is not locked down, it would look only in the > > built-in ICU library, and warn if it detects some kind of change > > (again, by whatever heuristic we think is reasonable). > > It seems like it would be somewhat reasonable to allow varying levels > of specificity in saying which what suffix to append when calling > dlopen() on the ICU library. Like you could allow adding nothing, > which would find the system-default ICU, or you could add 53 to find > the default version of ICU 53, or you could 53.1 to pick a specific > minor version. The idea is that the symlinks in the filesystem would > be responsible for sorting out the meaning of the supplied string. The > way that minor versions work may preclude having this work as well as > one might hope, though. I'm struggling to understand what's new about proposal #6. The earlier proposals except #1 already contemplated different levels of locked-down-ness. For example in the libversion-as-provider idea, we said you could use just provider = ICU (warn me if the collverison changes, but always use the "default" library and carry on, pretty much like today except perhaps "the default" can be changed with a GUC), or you could be more specific and say provider = ICU63. (We also mentioned ICU63_2 as a third level of specificity, but maybe that's practically impossible.) And it was the same for the other ideas, just encoded in different ways.
On Tue, 2022-11-29 at 14:34 -0500, Joe Conway wrote: > I understand that it is not easily done, but if the combination of > collprovider + collversion does not represent specific immutable > ordering behavior for a given locale Given the u_versionToString() bug, we know the version string could end up being the same between two different collation versions (e.g. 153.104 and 153.14). So that really undermines the credibility of ICU's collation versions (at least the strings, which is what we store in collversion). But if we ignore that bug, do we have evidence that the actual versions could be the same for collations that sort differently? It's worth exploring, to be sure, but right now I don't know of a case. > , what value is there in tracking [collation version]? Similarly, what is the value in tracking the library minor versions, if when you open libicui18n.63.1, you may end up with a mix of code between 63.1 and 63.2? That doesn't mean it's impossible. We could attach collations to a library major version, and tell administrators that once they install a major version in icu_library_path, they never touch that major version again (no updates or new minors, only new majors). #6 might be a good approach to facilitate this best practice. We'd then probably need to change collversion to be a library major version, and then come up with a migration path from 15 -> 16. Or we could store both library major version and collversion, and verify both. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, 2022-11-30 at 08:41 +1300, Thomas Munro wrote: > In terms of user experience, I think that might mean that users of > 'zh' who encounter warnings after a minor upgrade would therefore > really only have the options of REFRESHing and rebuilding, or > downgrading the package, because there's no way for us to access the > older version. Users of 'en' probably only encounter collversion > changes when moving between OS releases with an ICU major version > change, and then the various schemes in this thread can help them > avoid the need to rebuild, until they eventually want to, if ever. I installed the first minor release for each major, and got some new tables. I think we can all agree that it's a lot easier to work with information once it's in table form. Here's what I found for the 'ar' locale (firstminor/lastminor are the icu library versions, firstcollversion/lastcollversion are their respective collation versions for the given locale): firstminor | lastminor | firstcollversion | lastcollversion ------------+-----------+------------------+----------------- 60.1 | 60.3 | 153.80.32 | 153.80.32.1 64.1 | 64.2 | 153.96.35 | 153.97.35.8 68.1 | 68.2 | 153.14.38 | 153.14.38.8 (3 rows) For 'en': firstminor | lastminor | firstcollversion | lastcollversion ------------+-----------+------------------+----------------- 64.1 | 64.2 | 153.96 | 153.97 (1 row) And for 'zh': firstminor | lastminor | firstcollversion | lastcollversion ------------+-----------+------------------+----------------- 60.1 | 60.3 | 153.80.32 | 153.80.32.1 64.1 | 64.2 | 153.96.35 | 153.97.35.8 68.1 | 68.2 | 153.14.38 | 153.14.38.8 (3 rows) It looks like collation versions do change in minor releases. It looks like it's *not* safe to lock a collation to a major version *if* that major version could be updated to a new minor. And we can't lock to a minor, as I said earlier. Therefore, once we lock a collation down to a major release, we better keep that in the icu_library_path, and never touch it, and never install a new minor for that major. Then again, maybe some of these are just about how the version is reported... maybe 153.80.32 and 153.80.32.1 are really the same version? But 64.1 -> 64.2 looks like a real difference. I suppose the next step is to test with actual data and find differences? -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote: > Here's what I found for the 'ar' locale (firstminor/lastminor are the > icu library versions, firstcollversion/lastcollversion are their > respective collation versions for the given locale): > > firstminor | lastminor | firstcollversion | lastcollversion > ------------+-----------+------------------+----------------- > 60.1 | 60.3 | 153.80.32 | 153.80.32.1 > 64.1 | 64.2 | 153.96.35 | 153.97.35.8 > 68.1 | 68.2 | 153.14.38 | 153.14.38.8 > (3 rows) Right, this fits with what I said earlier: the third component is CLDR major, fourth component is CLDR minor except from ICU 61 on the CLDR minor is << 3'd (X.X.38.8 means CLDR 38.1). I wrote something about that particular CLDR upgrade that happened in ICU 68 back here, with a link to the CLDR change list: https://www.postgresql.org/message-id/CA+hUKGJxg6AbKC9RJ7r1ByVLtvVkThQV+RZO6BKVWYESPCp3Ug@mail.gmail.com TL;DR that particular CLDR change didn't actually affect collations, it affected other locale stuff we don't care about (timezones etc). We probably have to assume that any CLDR change *might* affect us, though, unless we can find a written policy somewhere that says CLDR minor changes never change sort order. But I wouldn't want to get into 2nd guessing their ucol_getVersion() format, and if they knew that minor changes didn't affect sort order they presumably wouldn't have included it in the recipe, so I think we simply have to treat it as opaque and assume that ucol_getVersion() change means what it says on the tin: sort order might have changed. > I suppose the next step is to test with actual data and find > differences? Easier to read the published CLDR deltas, but I'm not sure it'd tell us much about what *could* happen in future releases...
On Wed, 2022-11-30 at 09:00 +1300, Thomas Munro wrote: > I'm struggling to understand what's new about proposal #6. Perhaps it's just a slight variant; I'm not sure. It's not a complete proposal yet. The difference I had in mind is that it would treat the built-in ICU differently from what is found in icu_library_path. I think that could remove confusion over what happens when you upgrade the system's ICU library. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote: > > One bit of weirdness is that I may have found another ICU problem. > > Reported as: > > https://unicode-org.atlassian.net/browse/ICU-22216 I'm no expert on loader/linker arcana but I have a feeling this is a dead end. It's an ancient Unix or at least elf-era Unix convention that SONAMEs have major versions only, because major versions are the basis of ABI stability. As a workaround with an already built ICU, I think you could use elf editing tools like "patchelf" to change the SONAME and DT_NEEDED to include the minor version. Or you could convince the build/link scripts to set them that way in the first place, but no distro would want to do that as it would cause lots of executables to fail to load when the next ICU minor comes out.
On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote: > On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote: > > > One bit of weirdness is that I may have found another ICU > > > problem. > > > > Reported as: > > > > https://unicode-org.atlassian.net/browse/ICU-22216 > > I'm no expert on loader/linker arcana but I have a feeling this is a > dead end. It's an ancient Unix or at least elf-era Unix convention > that SONAMEs have major versions only, because major versions are the > basis of ABI stability. It's possible that it's more a problem of how they are doing it: the specific version is coming from a dependency rather than the library itself. The results are surprising, so I figured it's worth a report. Let's see what they say. Regardless, even if they did make a change, it's not going to help us anytime soon. We can't rely on any scheme that involves multiple minor versions for a single major version being installed at once. That means that, if you create a collation depending on ICU X.Y, and then it gets upgraded to X.(Y+1), and you create another collation depending on that library version, you are stuck. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote: > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote: > > Here's what I found for the 'ar' locale (firstminor/lastminor are > > the > > icu library versions, firstcollversion/lastcollversion are their > > respective collation versions for the given locale): > > > > firstminor | lastminor | firstcollversion | lastcollversion > > ------------+-----------+------------------+----------------- > > 60.1 | 60.3 | 153.80.32 | 153.80.32.1 > > 64.1 | 64.2 | 153.96.35 | 153.97.35.8 > > 68.1 | 68.2 | 153.14.38 | 153.14.38.8 > > (3 rows) > > Right, this fits with what I said earlier: the third component is > CLDR > major, fourth component is CLDR minor except from ICU 61 on the CLDR > minor is << 3'd (X.X.38.8 means CLDR 38.1). What about 64.1 -> 64.2? That changed the *second* component from 96 -> 97. Are we agreed that collations can materially change in minor ICU releases? -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Nov 30, 2022 at 1:32 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2022-11-30 at 10:29 +1300, Thomas Munro wrote: > > On Wed, Nov 30, 2022 at 9:59 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > Here's what I found for the 'ar' locale (firstminor/lastminor are > > > the > > > icu library versions, firstcollversion/lastcollversion are their > > > respective collation versions for the given locale): > > > > > > firstminor | lastminor | firstcollversion | lastcollversion > > > ------------+-----------+------------------+----------------- > > > 60.1 | 60.3 | 153.80.32 | 153.80.32.1 > > > 64.1 | 64.2 | 153.96.35 | 153.97.35.8 > > > 68.1 | 68.2 | 153.14.38 | 153.14.38.8 > > > (3 rows) > > > > Right, this fits with what I said earlier: the third component is > > CLDR > > major, fourth component is CLDR minor except from ICU 61 on the CLDR > > minor is << 3'd (X.X.38.8 means CLDR 38.1). > > What about 64.1 -> 64.2? That changed the *second* component from 96 -> > 97. Are we agreed that collations can materially change in minor ICU > releases? That means that the Unicode/UCA version switched from 12 to 12.1, so that's a confirmed sighting of a UCA minor version bump within one ICU major version. Let's see what the purpose of that Unicode minor release was[1]: "Unicode 12.1 adds exactly one character, for a total of 137,929 characters. The new character added to Version 12.1 is: U+32FF SQUARE ERA NAME REIWA Version 12.1 adds that single character to enable software to be rapidly updated to support the new Japanese era name in calendrical systems and date formatting. The new Japanese era name was officially announced on April 1, 2019, and is effective as of May 1, 2019." Wow! Wikipedia says[2] "the "rei" character 令 has never appeared before". The sort order of characters that didn't previously exist is a special topic. In theory they can't hurt you because you shouldn't have been using them, but PostgreSQL doesn't enforce that (other systems do), so you could be exposed to a change from whatever default ordering the non-existent codepoint had for random implementation reasons to some deliberate ordering which may or may not be the same. Are all Unicode/UCA minor versions of that type? I dunno. Something to research, but [3] is far too vague and [4] is about other problems. [1] https://unicode.org/versions/Unicode12.1.0/ [2] https://en.wikipedia.org/wiki/Reiwa [3] https://www.unicode.org/versions/#major_minor [4] https://www.unicode.org/policies/stability_policy.html
On Wed, Nov 30, 2022 at 1:25 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Wed, 2022-11-30 at 10:52 +1300, Thomas Munro wrote: > > On Wed, Nov 30, 2022 at 8:38 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > On Tue, 2022-11-29 at 10:46 -0800, Jeff Davis wrote: > > > https://unicode-org.atlassian.net/browse/ICU-22216 > > > > I'm no expert on loader/linker arcana but I have a feeling this is a > > dead end. It's an ancient Unix or at least elf-era Unix convention > > that SONAMEs have major versions only, because major versions are the > > basis of ABI stability. > > It's possible that it's more a problem of how they are doing it: the > specific version is coming from a dependency rather than the library > itself. The results are surprising, so I figured it's worth a report. > Let's see what they say. > > Regardless, even if they did make a change, it's not going to help us > anytime soon. We can't rely on any scheme that involves multiple minor > versions for a single major version being installed at once. That means > that, if you create a collation depending on ICU X.Y, and then it gets > upgraded to X.(Y+1), and you create another collation depending on that > library version, you are stuck. Mainstream package maintainers aren't going to let that happen anyway as discussed, so this would always be a fairly specialised concern. Maybe someone in our community would be motivated to publish a repo full of mutant packages that don't conflict with each other and that have specially modified DT_NEEDED, or are rolled into one single library so the DT_NEEDED problem goes away.
On Wed, Nov 30, 2022 at 01:50:51PM +1300, Thomas Munro wrote: > The new character added to Version 12.1 is: > > U+32FF SQUARE ERA NAME REIWA > > Version 12.1 adds that single character to enable software to be > rapidly updated to support the new Japanese era name in calendrical > systems and date formatting. The new Japanese era name was officially > announced on April 1, 2019, and is effective as of May 1, 2019." > > Wow! Wow++. I didn't know this one. > Wikipedia says[2] "the "rei" character 令 has never appeared before". At least there was some time ahead to prepare for the switch from "平 成" to "令和". Things were much "funnier" when the era has switched from "昭和" to "平成", as the sudden death of the emperor has required Japan to switch to a new calendar very suddenly back in the day.. I've heard this was quite a mess for folks in IT back then, especially for public agencies. -- Michael
Attachment
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2022-11-28 at 19:36 -0800, Jeff Davis wrote: >> On Mon, 2022-11-28 at 21:57 -0500, Robert Haas wrote: >> > That is ... astonishingly bad. >> >> https://unicode-org.atlassian.net/browse/CLDR-16175 > > Oops, reported in CLDR instead of ICU. Moved to: > > https://unicode-org.atlassian.net/browse/ICU-22215 Out of morbid curiosity I went source diving, and the culprit is this bit (which will also break if a version component ever goes above 999): /* write the decimal field value */ field=versionArray[part]; if(field>=100) { *versionString++=(char)('0'+field/100); field%=100; } if(field>=10) { *versionString++=(char)('0'+field/10); field%=10; } *versionString++=(char)('0'+field); (https://sources.debian.org/src/icu/72.1-3/source/common/putil.cpp#L2308) because apparently snprintf() is too hard? - ilmari
On Tue, Nov 29, 2022 at 7:51 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Sat, 2022-11-26 at 18:27 +1300, Thomas Munro wrote: > > On Thu, Nov 24, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> > > wrote: > > > On Thu, Nov 24, 2022 at 3:07 PM Jeff Davis <pgsql@j-davis.com> > > > wrote: > > > > I'd vote for 1 on the grounds that it's easier to document and > > > > understand a single collation version, which comes straight from > > > > ucol_getVersion(). This approach makes it a separate problem to > > > > find > > > > the collation version among whatever libraries the admin can > > > > provide; > > > > but adding some observability into the search should mitigate any > > > > confusion. > > > > > > OK, it sounds like I should code that up next. > > > > Here's the first iteration. > > Thank you. Thanks for the review. Responses further down. And thanks also for the really interesting discussion about how the version numbers work (or in some cases, don't work...), and practical packaging and linking problems. To have a hope of making something happen for PG16, which I think means we need a serious contender patch in the next few weeks, we really need to make some decisions. I enjoyed trying out search-by-collversion, but it's still not my favourite. On the ballot we have two main questions: 1. Should we commit to search-by-collversion, or one of the explicit library version ideas, and if the latter, which? 2. Should we try to support being specific about minor versions (in various different ways according to the choice made for #1)? My tentative votes are: 1. I think we should seriously consider provider = ICU63. I still think search-by-collversion is a little too magical, even though it clearly can be made to work. Of the non-magical systems, I think encoding the choice of library into the provider name would avoid the need to add a second confusing "X_version" concept alongside our existing "X_version" columns in catalogues and DDL syntax, while still making it super clear what is going on. This would include adding DDL commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63 to make warnings go way. 2. I think we should ignore minor versions for now (other than reporting them in the relevant introspection functions), but not make any choices that would prevent us from changing our mind about that in a later release. For example, having two levels of specificity ICU and ICU68 in the libver-in-provider-name design wouldn't preclude us from adding support for ICU68_2 later I haven't actually tried that design out in code yet, but I'm willing to try to code that up very soon. So no new patch from me yet. Does anyone else want to express a view? > Proposed changes: > > * I attached a first pass of some documentation. Thanks. Looks pretty good, and much of it would stay if we changed to one of the other models. > * Should be another GUC to turn WARNING into an ERROR. Useful at least > for testing; perhaps too dangerous for production. OK, will add that into the next version. > * The libraries should be loaded in a more diliberate order. The "*" > should be expanded in a descending fashion so that later versions are > preferred. Yeah, I agree. > * GUCs should be validated. Will do. > * Should validate that loaded library has expected version. Will do. > * We need to revise or remove pg_collation_actual_version() and > pg_database_collation_actual_version(). I never liked that use of the word "actual"... > * The GUCs are PGC_SUSET, but don't take effect because > icu_library_list_fully_loaded is never reset. True. Just rought edges because I was trying to prototype search-by-collversion fast. Will consider this for the next version. > * The extra collations you're adding at bootstrap time are named based > on the library major version. I suppose it might be more "proper" to > name them based on the collation version, but that would be more > verbose, so I won't advocate for that. Just pointing it out. Ah, yes, the ones with names like "en-US-x-icu68". I agree that made a little less sense in the search-by-collversion patch. Maybe we wouldn't want these at all in the search-by-collversion model. But I think they're perfect the way they are in the provider = ICU68 model. The other idea I considered ages ago was that we could use namespaces: you could "icu68.en-US", or just "en-US" in some contexts to get what your search path sees, but that all seemed a little too cute and not really like anything else we do with system-created catalogues, so I gave that idea up. > * It looks hard (or impossible) to mix multiple ICU libraries with the > same major version and different minor versions. That's because, > e.g., libicui18n.so.63.1 links against libicuuc.63 and libicudata.63, > and when you install ICU 63.2, those dependencies get clobbered with > the 63.2 versions. That fails the sanity check I proposed above about > the library version number matching the requested library version > number. And it also just seems wrong -- why would you have minor- > version precision about an ICU library but then only major-version > precision about the ICU dependencies of that library? Doesn't that > defeat the whole purpose of this naming scheme? (Maybe another ICU > bug?). I don't think it's a bug exactly. That scheme is designed to advertise ABI stability, and not intended to support parallel installation of minor versions. It does seem a little silly for libraries that are shipped together as one atomic unit not to use fully qualified dependency names, though. I think there would be various technical solutions, if you're prepared to give up existing ready-made packages and build stuff yourself. Install them into different directories with different DT_RPATH so they can't see each other (but then our icu_library_path needs to support a list of paths or it won't find these ones which will have to be not in the usual system path), or clobber the DT_NEEDED (but I guess not the DT_SONAME) to mention the minor version, and equivalent concepts for other non-elf systems (at a glance the same problem applies on macOS), or re-roll the libraries into a single .so. Or convince them to support a single library build mode (maybe there is one already? I couldn't find it). That's all a bit against the grain for now, and makes me want to abandon the notion of minor versions completely for now but leave the option open for later exploration. In the meantime, I think the feature is still pretty useful. For example, it helps you with the common case of a major OS upgrade or streaming replication across major OS versions: just find the right .deb/rpm/whatever for the older one, and install it, until you're ready to upgrade and REFRESH. The story is not quite as good for someone with an index full of Chinese or Turkish text who gets a surprise warning after a minor apt-get update, because the Japanese have decided to invent a new character. We can't offer a nice solution to that: they have to determine that it is safe to REFRESH to clear the warning, with or without rebuild, or downgrade/pin the ICU package until they are ready to REFRESH. But that is already the case today and this patch neither helps nor hinders. The only reason we didn't know about this pre-existing type of problem is because (approximately) nobody uses ICU yet, because it wasn't available as a database default yet. > Minor comments: > > * ICU_I18N is defined in make_icu_library_name() but used outside of > it. One solution might be to have it return both library names to the > caller and rename it as make_icu_library_names(). Good idea, will do. > * get_icu_function() could use a clarifying comment or a better name. > Something that communicates that you are looking for the function in > the given library with the given major version number (which may or may > not be needed depending on how the library was compiled). Agreed. > * typo in comment over make_icu_collator: > s/u_getVersion/ucol_getVersion/ Thanks. > * The return value of make_icu_collator() seems backwards to me, > stylistically. I typically see the false-is-good pattern with integer > returns. Agreed. > * weird bracketing style in get_icu_collator for the "else" Yep. > > The version rosetta stone functions look like this: > > > > postgres=# select * from pg_icu_library_versions(); > > icu_version | unicode_version | cldr_version > > -------------+-----------------+-------------- > > 67.1 | 13.0 | 37.0 > > 63.1 | 11.0 | 34.0 > > 57.1 | 8.0 | 29.0 > > (3 rows) > > > > postgres=# select * from pg_icu_collation_versions('zh'); > > icu_version | uca_version | collator_version > > -------------+-------------+------------------ > > 67.1 | 13.0 | 153.14.37 > > 63.1 | 11.0 | 153.88.34 > > 57.1 | 8.0 | 153.64.29 > > (3 rows) > > I like these functions. Yeah, they've been quite educational. Now I'm wondering what form these functions would take in the provider = ICU68 patch.
On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > My tentative votes are: > > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while still > making it super clear what is going on. This would include adding DDL > commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63 > to make warnings go way. +1. I wouldn't lose any sleep if we picked a different non-magical option, but I think this is probably my favorite of the explicit-library-version options (though it is close) and I like it better than search-by-collversion. (It's possible that I'm wrong to like it better, but I do.) > 2. I think we should ignore minor versions for now (other than > reporting them in the relevant introspection functions), but not make > any choices that would prevent us from changing our mind about that in > a later release. For example, having two levels of specificity ICU > and ICU68 in the libver-in-provider-name design wouldn't preclude us > from adding support for ICU68_2 later +1. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while > still > making it super clear what is going on. As I understand it, this is #2 in your previous list? Can we put the naming of the provider into the hands of the user, e.g.: CREATE COLLATION PROVIDER icu63 TYPE icu AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; In this model, icu would be a "provider kind" and icu63 would be the specific provider, which is named by the user. That seems like the least magical approach, to me. We need an ICU library; the administrator gives us one that looks like ICU; and we're happy. It avoids a lot of the annoyances we're discussing, and puts the power in the hands of the admin. If they want to allow minor version updates, they specify the library with .so.63, and let the symlinking handle it. Of course, we can still do some sanity checks (WARNINGs or ERRORs) when we think something is going wrong; like the version of ICU is too new, or the reported version (ucol_getVersion()) doesn't match what's in collversion. But we basically get out of the business of understanding ICU versioning and leave that up to the administrator. It's easier to document, and would require fewer GUCs (if any). And it avoids mixing version information from another project into our data model. -- Jeff Davis PostgreSQL Contributor Team - AWS
On 12/5/22 12:41, Jeff Davis wrote: > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: >> 1. I think we should seriously consider provider = ICU63. I still >> think search-by-collversion is a little too magical, even though it >> clearly can be made to work. Of the non-magical systems, I think >> encoding the choice of library into the provider name would avoid the >> need to add a second confusing "X_version" concept alongside our >> existing "X_version" columns in catalogues and DDL syntax, while >> still >> making it super clear what is going on. > > As I understand it, this is #2 in your previous list? > > Can we put the naming of the provider into the hands of the user, e.g.: > > CREATE COLLATION PROVIDER icu63 TYPE icu > AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; > > In this model, icu would be a "provider kind" and icu63 would be the > specific provider, which is named by the user. > > That seems like the least magical approach, to me. We need an ICU > library; the administrator gives us one that looks like ICU; and we're > happy. +1 I like this. The provider kind defines which path we take in our code, and the specific library unambiguously defines a specific collation behavior (I think, ignoring bugs?) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote: > On 12/5/22 12:41, Jeff Davis wrote: > > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > >> 1. I think we should seriously consider provider = ICU63. I still > >> think search-by-collversion is a little too magical, even though it > >> clearly can be made to work. Of the non-magical systems, I think > >> encoding the choice of library into the provider name would avoid the > >> need to add a second confusing "X_version" concept alongside our > >> existing "X_version" columns in catalogues and DDL syntax, while > >> still > >> making it super clear what is going on. > > > > As I understand it, this is #2 in your previous list? > > > > Can we put the naming of the provider into the hands of the user, e.g.: > > > > CREATE COLLATION PROVIDER icu63 TYPE icu > > AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; > > > > In this model, icu would be a "provider kind" and icu63 would be the > > specific provider, which is named by the user. > > > > That seems like the least magical approach, to me. We need an ICU > > library; the administrator gives us one that looks like ICU; and we're > > happy. > > +1 > > I like this. The provider kind defines which path we take in our code, > and the specific library unambiguously defines a specific collation > behavior (I think, ignoring bugs?) OK, I'm going to see what happens if I try to wrangle that stuff into a new catalogue table.
On Tue, 2022-12-06 at 10:33 +1300, Thomas Munro wrote: > OK, I'm going to see what happens if I try to wrangle that stuff into > a new catalogue table. I've been hacking on a major refactor of the locale-related code. I attached my progress and I think several patches are ready. The main motivation is that I was frustrated by the special cases everywhere. I wanted to make it easier to hack on this code going forward, now that we are adding even more complexity for multiple ICU libraries. I'm posting to this thread (rather than my previous refactoring thread[1]) because a lot of the people interested in working on this code are here. So, if you like (or don't like) the structure of these changes, please let me know. Changes: * Introduce pg_locale_internal.h to hide all USE_ICU code, including all callers of the ICU routines. The files that still need to include pg_locale_internal.h are: - pg_locale.c - regc_pg_locale.c - formatting.c - like.c - like_support.c - collationcmds.c * Other callers (in files that don't include pg_locale_internal.h) don't need to branch based on the provider, platform, database encoding, USE_ICU, HAVE_LOCALE_T, etc. * ICU and libc are treated the same way in more places. * I made it so pg_locale_t is constructed first, then moved to TopMemoryContext, so that it won't leak in TopMemoryContext if errors are encountered. * Introduce pg_strcoll, pg_strncoll, pg_strxfrm, and pg_strnxfrm so that varlena/hash/verchar code doesn't worry about the details. * Add method structure pg_icu_library, borrowed from Thomas's patch, that provides one convenient place to provide multiple-ICU-library support. * Add a hook that allows you to fill in the pg_icu_library structure however you want while a pg_locale_t is being constructed. This allows do-it-yourself ICU library lockdown. On the negative side, it increases the line count. Part of that is because adding indirection for the ICU library is just more lines of code, but a lot of it is just that I used a lot of smaller functions. Perhaps my style is a bit verbose? Even though we're close to consensus on how we should offer control over the ICU libraries, having the hook may be useful for experimentation, testing, or as a last resort. Right now the hook has limited information to use to find the right library -- just the ICU collation name and the version, because that's what we have in the catalog. But I assume the patch Thomas is working on will change that. Performance: I did brief performance sanity tests on several paths and the results are unremarkable (which is generally good for a refactor). On the path I was watching most closely, ICU/UTF8/en-US-x-icu, it came in about 2% faster, which was a pleasant surprise. This was true both when I disabled abbreviated keys (to stress localized comparison paths) and also with abbreviated keys enabled. My previous refactoring work[1] ended up a percent or two slower. My guess right now is that I moved some code around after I noticed that ICU accepts NUL-terminated strings (by specifying the lenght as -1), and that helped. But I'll need to profile and look more closely to be more certain of my results, these are preliminary. There are a few things that could be done differently: * I am still a bit confused about why someone would want a collation with a different lc_collate and lc_ctype in libc; and assuming there is a reason, why it can't be done with ICU. The way I did the refactoring tries to accommodate them as different concepts, but I can rip that out. * In theory, we could also support multilib libc, and an associated get_libc_library() and hook, but there are a couple big challenges. Firstly, if it's the default locale, it relies on setlocale(), so we'd have to figure out what to do about that. Second, having an a second version of glibc on your system is not as normal or trivial as having a second version of ICU. * I made the hook simple, but all it can do is replace the ICU library. It's possible that it would want to construct it's own entire pg_locale_t for some reason, and keep more complex state in a private pointer, or something like that. It seemed better to keep it simple, but maybe someone would want more flexibility there? * I used the library indirection for pretty much all ICU calls, including the ucnv_ and the uloc_ functions. I did this mainly because, if we are so paranoid about ICU changing in subtle ways, we might as well make it possible to lock down everything. I can rip this out, too, but it didn't add many lines. Loose ends: * I need to do something with get_collation_actual_version. I had an earlier iteration that went through pg_newlocale() and then queried the resulting pg_locale_t structure, but that changed the error paths in a way that failed a couple tests, so I left that out. * Error paths could be improved further to make sure that libc locale_t and UCollator structures are freed in error paths during construction. I was thinking about using resowner for this, and then if the pg_locale_t structure gets moved to TopMemoryContext, just doing a ResourceOwnerForget. Alternatively, I could just be careful about the error paths. * We'd need to adapt this and make sure it works with whatever scheme we decide is best for finding the right library. I suspect this would just be adding another parameter to get_icu_library (and the hook) to represent the new collation provider Oid (and get_icu_library could use that to look up the library names and load them). Comments welcome. [1] https://www.postgresql.org/message-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 05.12.22 22:33, Thomas Munro wrote: > On Tue, Dec 6, 2022 at 6:45 AM Joe Conway <mail@joeconway.com> wrote: >> On 12/5/22 12:41, Jeff Davis wrote: >>> On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: >>>> 1. I think we should seriously consider provider = ICU63. I still >>>> think search-by-collversion is a little too magical, even though it >>>> clearly can be made to work. Of the non-magical systems, I think >>>> encoding the choice of library into the provider name would avoid the >>>> need to add a second confusing "X_version" concept alongside our >>>> existing "X_version" columns in catalogues and DDL syntax, while >>>> still >>>> making it super clear what is going on. >>> >>> As I understand it, this is #2 in your previous list? >>> >>> Can we put the naming of the provider into the hands of the user, e.g.: >>> >>> CREATE COLLATION PROVIDER icu63 TYPE icu >>> AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; >>> >>> In this model, icu would be a "provider kind" and icu63 would be the >>> specific provider, which is named by the user. >>> >>> That seems like the least magical approach, to me. We need an ICU >>> library; the administrator gives us one that looks like ICU; and we're >>> happy. >> >> +1 >> >> I like this. The provider kind defines which path we take in our code, >> and the specific library unambiguously defines a specific collation >> behavior (I think, ignoring bugs?) > > OK, I'm going to see what happens if I try to wrangle that stuff into > a new catalogue table. I'm reviewing the commit fest entry https://commitfest.postgresql.org/41/3956/, which points to this thread. It appears that the above patch did not come about in time. The patch of record is now Jeff's refactoring patch, which is also tracked in another commit fest entry (https://commitfest.postgresql.org/41/4058/). So as a matter of procedure, we should probably close this commit fest entry for now. (Maybe we should also use a different thread subject in the future.) I have a few quick comments on the above syntax example: There is currently a bunch of locale-using code that selects different code paths by "collation provider", i.e., a libc-based code path and an ICU-based code path (and sometimes also a default provider path). The above proposal would shift the terminology and would probably require some churn at those sites, in that they would now have to select by "collation provider type". We could probably avoid that by shifting the terms a bit, so instead of the suggested provider type -> provider we could use provider -> version of that provider (or some other actual term), which would leave the meaning of "provider" unchanged as far as locale-using code is concerned. At least that's my expectation, since no code for this has been seen yet. We should keep this in mind in any case. Also, the above example exposes a lot of operating system level details. This creates issues with dump/restore, which some of the earlier patches avoided by using a path-based approach, and it would also require some thoughts about permissions. We probably want non-superusers to be able to interact with this system somehow, for upgrading (for some meaning of that action) indexes etc. without superuser access. The more stuff from the OS we expose, the more stuff we have to be able to lock down again in a usable manner. (The search-by-collversion approach can probably avoid those issues better.)
Attached are a new set of patches, including a major enhancement: the icu_multilib contrib module. The pure refactoring patches have gone into this thread: This patch series is dependent on those patches, and contains: 0001: Support multiple ICU libraries 0002: Add test
Attachment
On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote: > Attached are a new set of patches, including a major enhancement: the > icu_multilib contrib module. Attached rebased v8. [ It looks like my email client truncated the last email somehow, in case someone was wondering why it just stopped. ] The big change is the introduction of the icu_multilib contrib module which provides a lot of the functionality requested in this thread: * icu version stability, which allows you to "lock down" ICU to a specific major and minor version (or major version only) * multi-lib ICU, which (if a GUC is set) will enable the "search by collversion" behavior. Some doubts were raised about the wisdom of this approach, but it's the only multi-lib solution we have without doing some significant catalog work. I rendered the HTML docs for icu_multilib and attached to this email to make it easier to view. icu_multilib assumes that the various ICU library versions are already available in a single location, most likely installed with a package manager. That location can be the same as the built-in ICU, or a different location. Ideally, packagers would start to offer a few "stable" versions of ICU that would be available for a long time, but it will take a while for that to happen. So for now, it's up to the user to figure out how to get the right versions of ICU on their system and keep them there. Automated tests of icu_multilib are a problem unless the one running the tests is willing to compile the right versions of ICU (like I did). But I at least have automated tests for the hooks by using the test module test_collator_lib_hooks. The v7 patches in this thread are dependent on the pure refactoring patches in this CF entry: https://commitfest.postgresql.org/41/3935/ https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com The requested functionality _not_ offered by icu_multilib is tying a specific collation to a specific ICU version. A few variants were proposed, the latest is to tie a collation to the library file itself through the provider. That needs to be done with proper catalog support in core. But I believe the work I've done here has made a lot of progress in that direction, and also shows the versatility of the new hook to solve at least some problems. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Sat, 21 Jan 2023 at 02:24, Jeff Davis <pgsql@j-davis.com> wrote: > > On Thu, 2023-01-19 at 00:11 -0800, Jeff Davis wrote: > > Attached are a new set of patches, including a major enhancement: the > > icu_multilib contrib module. > > Attached rebased v8. > > [ It looks like my email client truncated the last email somehow, in > case someone was wondering why it just stopped. ] > > The big change is the introduction of the icu_multilib contrib module > which provides a lot of the functionality requested in this thread: > > * icu version stability, which allows you to "lock down" ICU to a > specific major and minor version (or major version only) > * multi-lib ICU, which (if a GUC is set) will enable the "search by > collversion" behavior. Some doubts were raised about the wisdom of this > approach, but it's the only multi-lib solution we have without doing > some significant catalog work. > > I rendered the HTML docs for icu_multilib and attached to this email to > make it easier to view. > > icu_multilib assumes that the various ICU library versions are already > available in a single location, most likely installed with a package > manager. That location can be the same as the built-in ICU, or a > different location. Ideally, packagers would start to offer a few > "stable" versions of ICU that would be available for a long time, but > it will take a while for that to happen. So for now, it's up to the > user to figure out how to get the right versions of ICU on their system > and keep them there. > > Automated tests of icu_multilib are a problem unless the one running > the tests is willing to compile the right versions of ICU (like I did). > But I at least have automated tests for the hooks by using the test > module test_collator_lib_hooks. > > The v7 patches in this thread are dependent on the pure refactoring > patches in this CF entry: > > https://commitfest.postgresql.org/41/3935/ > > https://postgr.es/m/052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com > > The requested functionality _not_ offered by icu_multilib is tying a > specific collation to a specific ICU version. A few variants were > proposed, the latest is to tie a collation to the library file itself > through the provider. That needs to be done with proper catalog support > in core. But I believe the work I've done here has made a lot of > progress in that direction, and also shows the versatility of the new > hook to solve at least some problems. This thread has been idle for a year now, It has stalled after a lot of discussion. @Jeff Davis: Do you want to try to restart the discussion by posting an updated version and see what happens? Regards, Vignesh
On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote: > This thread has been idle for a year now, It has stalled after a lot > of discussion. > @Jeff Davis: Do you want to try to restart the discussion by posting > an updated version and see what happens? Thank you for following up. Yes, I'd like to find a path forward here, but I need some validation from others on my approach. I rendered the docs I wrote as an HTML page and attached it to this thread, to make it easier for others to read and comment. It's basically a tool for experts who are willing to devote effort to managing their collations and ICU libraries. Is that what we want? At an implementation level, did I get the extension APIs right? I considered making the API simpler, but that would require the extension to do quite a bit more work (including a lot of redundant work) to use ICU properly. Regards, Jeff Davis
On Mon, 22 Jan 2024 at 00:28, Jeff Davis <pgsql@j-davis.com> wrote: > > On Sat, 2024-01-20 at 07:40 +0530, vignesh C wrote: > > This thread has been idle for a year now, It has stalled after a lot > > of discussion. > > @Jeff Davis: Do you want to try to restart the discussion by posting > > an updated version and see what happens? > > Thank you for following up. Yes, I'd like to find a path forward here, > but I need some validation from others on my approach. Let's start by posting a rebased version to fix the CFBot patch apply issue as in [1]: === Applying patches on top of PostgreSQL commit ID 402388946fb3ac54f0fd5944d7e177ef7737eab2 === === applying patch ./v8-0001-Support-multiple-ICU-collation-provider-libraries.patch patching file src/backend/commands/collationcmds.c Hunk #1 FAILED at 566. .... 1 out of 4 hunks FAILED -- saving rejects to file src/backend/commands/collationcmds.c.rej patching file src/backend/utils/adt/formatting.c Hunk #1 succeeded at 1575 (offset 9 lines). Hunk #2 succeeded at 1587 (offset 9 lines). Hunk #3 succeeded at 1605 (offset 9 lines). Hunk #4 succeeded at 1700 (offset 3 lines). Hunk #5 succeeded at 1819 (offset -1 lines). Hunk #6 succeeded at 1939 (offset -5 lines). patching file src/backend/utils/adt/pg_locale.c Hunk #1 FAILED at 70. ... Hunk #31 FAILED at 2886. Hunk #32 FAILED at 2902. 22 out of 32 hunks FAILED -- saving rejects to file src/backend/utils/adt/pg_locale.c.rej [1] - http://cfbot.cputube.org/patch_46_3956.log Regards, Vignesh
On Sun, Jan 21, 2024 at 1:58 PM Jeff Davis <pgsql@j-davis.com> wrote: > I rendered the docs I wrote as an HTML page and attached it to this > thread, to make it easier for others to read and comment. It's > basically a tool for experts who are willing to devote effort to > managing their collations and ICU libraries. Is that what we want? > > At an implementation level, did I get the extension APIs right? I > considered making the API simpler, but that would require the extension > to do quite a bit more work (including a lot of redundant work) to use > ICU properly. Not that I'm the most qualified person to have an opinion on this topic, but did you intend to attach this stuff to this email, or is it somewhere else? -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 2024-02-01 at 15:58 -0500, Robert Haas wrote: > Not that I'm the most qualified person to have an opinion on this > topic, but did you intend to attach this stuff to this email, or is > it > somewhere else? The previous patch is here: https://www.postgresql.org/message-id/6f4a8c01a5cb1edf3a07d204c371fbddaef252f9.camel%40j-davis.com And I attached the rendered HTML doc page, which conveniently renders in the archives (thanks to web team -- I didn't know if that would actually work until I tried it): https://www.postgresql.org/message-id/attachment/142818/icu-multilib.html For anyone interested in this work, the docs are the best place to start. I'm hesitant to put much more work into it (e.g. new patches, etc.) without more feedback. Your opinion would certainly be valuable -- for instance, when reading the docs, can you imagine yourself actually using this if you ran into a collation versioning/migration problem? Regards, Jeff Davis
On Sun, Feb 4, 2024 at 10:42 PM Jeff Davis <pgsql@j-davis.com> wrote: > I'm hesitant to put much more work into it (e.g. new patches, etc.) > without more feedback. Your opinion would certainly be valuable -- for > instance, when reading the docs, can you imagine yourself actually > using this if you ran into a collation versioning/migration problem? I'm having some difficulty understanding what the docs are trying to tell me. I think there are some issues with ordering and pacing. "The icu_multilib module provides control over the version (or versions) of the ICU provider library used by PostgreSQL, which can be different from the version of ICU with which it was built. Collations are a product of natural language, and natural language evolves over time; but PostgreSQL depends on stable ordering for structures such as indexes. Newer versions of ICU update the provided collators to adapt to changes in natural language, so it's important to control when and how those new versions of ICU are used to prevent problems such as index corruption." Check. So far, so good. "This module assumes that the necessary versions of ICU are already available, such as through the operating system's package manager; and already properly installed in a single location accessible to PostgreSQL. The configration variable icu_multilib.library_path should be set to the location where these ICU library versions are installed." Here I feel we've skipped a few steps. I suggest postponing all discussion of specific GUCs to a later point -- specifically the configuration parameters section, which I think should actually be F.19.1, with the use cases following that rather than preceding it. In this introductory section, I suggest elaborating a bit more on what problem we're trying to solve at a conceptual level. It feels like we've gone straight from the very general issue (collation definitions need to be stable but language isn't) to very specific (here's a GUC that you can set to a pathname). I feel like the need for this module should be more specifically motivated. Maybe something like: 1. Here's what we think your OS package manager is probably going to do. 2. That's going to interact with PostgreSQL in this way that I will now describe. 3. See, that sucks, because of the stuff I said above about needing stable collations! 4. But if you installed this module instead, then you could prevent the things I said under #2 from happening. 5. Instead, you'd get this other behavior, which would make you happy. I feel like I can almost piece together in my head how this is supposed to work -- I think it's like "we expect the OS package manager to drop all the ICU versions in the same directory via side by side installs, and that works well for other programs because ... for some mysterious reason they can latch onto the specific version they were linked against ... but we can't or don't do that because ... I guess we're dumber than those other pieces of software or something???? ... so this module lets you ask for more sensible behavior." But I think that could be spelled out a bit more clearly and directly than this document seems to me to do. I also wonder if we should be explaining why we don't get this right out of the box. Like, if the normal behavior categorically sucks, why do you have to install icu_multilib to get something else? Why not make the multilib treatment the default? And if the normal behavior is better for some cases and the icu_multilib behavior is better for other cases, then maybe we ought to explain which one to use in which scenario. "icu_multilib must be loaded via shared_preload_libraries. icu_multilib ignores any ICU library with a major version greater than that with which PostgreSQL was built." It's not clear from reading this whether the second sentence here is a regrettable implementation restriction or design behavior. If it's design behavior, what's the point of it? -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote: > 1. Here's what we think your OS package manager is probably going to > do. > 2. That's going to interact with PostgreSQL in this way that I will > now describe. > 3. See, that sucks, because of the stuff I said above about needing > stable collations! > 4. But if you installed this module instead, then you could prevent > the things I said under #2 from happening. > 5. Instead, you'd get this other behavior, which would make you > happy. I like that framing, thank you. I'll try to come up with something there. > I feel like I can almost piece together in my head how this is > supposed to work -- I think it's like "we expect the OS package > manager to drop all the ICU versions in the same directory via side > by > side installs, and that works well for other programs because ... for > some mysterious reason they can latch onto the specific version they > were linked against ... but we can't or don't do that because ... I > guess we're dumber than those other pieces of software or > something???? ... Postgres can and does latch on to the version of ICU it was compiled against. It's a normal shared library dependency. The problem is that databases -- and the file structures -- outlive a particular version of Postgres. So if Postgres 16 is compiled against ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade from 16 to 17? Postgres 17 will try to access the old file structures using ICU Y, and they'll be corrupt. What we want is the file structures that depend on ICU X to continue to find ICU X even after you upgrade to Postgres 17, yet allow new structures to be created using ICU Y. In other words, "multi-lib", meaning that the same Postgres binary is linking to multiple versions of ICU and the different versions for different structures. That would allow users to recreate one index at a time to use ICU Y, until nothing depends on ICU X any longer. I should say this is not an easy process even if something like icu_multilib is available. We don't have all of the information needed in the catalog to track which structures depend on which versions of a collation library, collation library version is itself not easy to define, and it still involves rebuilding (or at least re-validating) a lot of structures. This is a "make hard things possible" tool, and I suspect only a handful of users would use it successfully to migrate to new ICU versions. More simply, some users might just want to lock down the version of ICU to X, and just use that forever until they have a reason to change it. icu_multilib can also facilitate that, though it's still not trivial. > "icu_multilib must be loaded via shared_preload_libraries. > icu_multilib ignores any ICU library with a major version greater > than > that with which PostgreSQL was built." > > It's not clear from reading this whether the second sentence here is > a > regrettable implementation restriction or design behavior. If it's > design behavior, what's the point of it? That restriction came from Thomas's (uncommitted) work on the same problem. I believe the reasoning was that we don't know whether future versions of ICU might break something that we're doing, though perhaps there's a better way. Regards, Jeff Davis
On Tue, Feb 13, 2024 at 1:55 AM Jeff Davis <pgsql@j-davis.com> wrote: > Postgres can and does latch on to the version of ICU it was compiled > against. It's a normal shared library dependency. > > The problem is that databases -- and the file structures -- outlive a > particular version of Postgres. So if Postgres 16 is compiled against > ICU X and Postgres 17 is compiled against ICU Y, how do you upgrade > from 16 to 17? Postgres 17 will try to access the old file structures > using ICU Y, and they'll be corrupt. > > What we want is the file structures that depend on ICU X to continue to > find ICU X even after you upgrade to Postgres 17, yet allow new > structures to be created using ICU Y. In other words, "multi-lib", > meaning that the same Postgres binary is linking to multiple versions > of ICU and the different versions for different structures. That would > allow users to recreate one index at a time to use ICU Y, until nothing > depends on ICU X any longer. Ah, I see. At least, I think I do. I think some of this material could be very usefully included into the first section of the doc you're trying to write. What you say here makes it a lot easier to grasp the motivation and use case for this code, at least for me. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 13, 2024 at 9:25 AM Jeff Davis <pgsql@j-davis.com> wrote: > On Sun, 2024-02-11 at 22:04 +0530, Robert Haas wrote: > > "icu_multilib must be loaded via shared_preload_libraries. > > icu_multilib ignores any ICU library with a major version greater > > than > > that with which PostgreSQL was built." > > > > It's not clear from reading this whether the second sentence here is > > a > > regrettable implementation restriction or design behavior. If it's > > design behavior, what's the point of it? > > That restriction came from Thomas's (uncommitted) work on the same > problem. I believe the reasoning was that we don't know whether future > versions of ICU might break something that we're doing, though perhaps > there's a better way. Right, to spell that out more fully: We compile and link against one particular ICU library that is present at compile time, and there is a place in that multi-lib patch that assigns the function pointers from that version to variables of the function pointer type that we expect. Compilation would fail if ICU ever changed relevant function prototypes in a future release, and then we'd have to come up with some trampoline/wrapper scheme to wallpaper over differences. That's why I think it's safe to use dlsym() to access function pointers for versions up to and including the one whose headers we were compiled against, but not later ones which we haven't tested in that way. Sadly I won't be able to work on multi-lib ICU support again in this cycle. I think we managed to prove that dlopen works for this, and learn some really interesting stuff about Unicode and ICU evolution, but we still have to come up with the right model, catalogues and DDL etc, for a nice user experience. What I was most recently experimenting with based on earlier discussions was the idea of declaring separate providers: icu72 and icu68 could both exist and you could create extra indexes and then drop the originals as a no-downtime upgrade path. I have a pet theory that you could usefully support multi-version libc locales too if you're prepared to make certain assumptions (short version: take the collation definition files from any older version of your OS, compile with newer version's localedef, give it a name like "en_US@ubuntu18", and assume/pray they didn't change stuff that wasn't expressed in the definition file), so I was working on a generalisation slightly wider than just multi-version ICU.