Thread: Draft release notes for next week's releases
I've prepared a first cut at next week's release notes:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29b6123ecb4113e366325245cec5a5c221dae691

(As usual, I will make the notes for older branches by extracting
relevant items from this list, after it's been reviewed.) Please
review. If you prefer to read it on the web, it should be up at

http://www.postgresql.org/docs/devel/static/release-9-5-2.html

in an hour or so, after guaibasaurus's next buildfarm run.

Probably the most discussion-worthy item is whether we can say
anything more about the strxfrm mess. Should we make a wiki
page about that and have the release note item link to it?

regards, tom lane
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've prepared a first cut at next week's release notes:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29b6123ecb4113e366325245cec5a5c221dae691
>
> (As usual, I will make the notes for older branches by extracting
> relevant items from this list, after it's been reviewed.) Please
> review. If you prefer to read it on the web, it should be up at
>
> http://www.postgresql.org/docs/devel/static/release-9-5-2.html
>
> in an hour or so, after guaibasaurus's next buildfarm run.
>
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess. Should we make a wiki
> page about that and have the release note item link to it?
>
> regards, tom lane

Sorry for speaking up late, but:

+ <listitem>
+ <para>
+ Correctly handle wraparound cases in the <literal>pg_subtrans</>
+ startup logic for hot standby (Jeff Janes)
+ </para>
+ </listitem>

This applies to all recovery scenarios, whether they are hot standby
or just plain-old automatic crash recovery. (However, it does only
matter when prepared transactions are in use.)

Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> + Correctly handle wraparound cases in the <literal>pg_subtrans</>
> + startup logic for hot standby (Jeff Janes)

> This applies to all recovery scenarios, whether they are hot standby
> or just plain-old automatic crash recovery. (However, it does only
> matter when prepared transactions are in use.)

Thanks for the clarification, will fix!

regards, tom lane
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess. Should we make a wiki
> page about that and have the release note item link to it?

I think that there is an argument against doing so, which is that
right now, all we have to offer on that are weasel words. However, I'm
still in favor of a Wiki page, because I would not be at all surprised
if our understanding of this problem evolved, and we were able to
offer better answers in several weeks. Realistically, it will probably
take at least that long before affected users even start to think
about this.

--
Peter Geoghegan
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Probably the most discussion-worthy item is whether we can say
> > anything more about the strxfrm mess. Should we make a wiki
> > page about that and have the release note item link to it?
>
> I think that there is an argument against doing so, which is that
> right now, all we have to offer on that are weasel words. However, I'm
> still in favor of a Wiki page, because I would not be at all surprised
> if our understanding of this problem evolved, and we were able to
> offer better answers in several weeks. Realistically, it will probably
> take at least that long before affected users even start to think
> about this.
One question to debate is whether publishing a list of "known" failures (collated from the program runs that lots of people performed) would do more harm than good. Personally I'd rather see a list of known failures and evaluate my situation objectively (i.e., large index but no reported problem on my combination of locale and platform). I understand that a lack of evidence is not proof that I am unaffected at this stage in the game. Having something I can execute on my server to try and verify behavior - irrespective of the correctness of the indexes themselves - would be welcomed.
David J.
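As an illustration of the kind of check asked for above, here is a minimal sketch in Python (not the test program that was actually circulated on the list). It compares the ordering reported by `strcoll()` with the ordering of the `strxfrm()` results for every pair of sample strings. The function names `sign` and `find_inconsistencies` are made up for this example, and a clean result on a small sample proves nothing about a given platform and locale.

```python
import itertools
import locale


def sign(n):
    """Collapse any integer to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_inconsistencies(strings):
    """Return the pairs on which strcoll() and strxfrm() disagree.

    Uses whatever LC_COLLATE locale is currently active in the process.
    """
    # Transform each string once; comparing these keys should give the
    # same answer as calling strcoll() on the original strings.
    keys = {s: locale.strxfrm(s) for s in strings}
    mismatches = []
    for a, b in itertools.combinations(strings, 2):
        by_strcoll = sign(locale.strcoll(a, b))
        by_strxfrm = (keys[a] > keys[b]) - (keys[a] < keys[b])
        if by_strcoll != by_strxfrm:
            mismatches.append((a, b))
    return mismatches
```

To try it against a real locale, call `locale.setlocale(locale.LC_COLLATE, "")` first (the empty string picks up the environment's locale) and pass in strings representative of your own data.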
On Mar 28, 2016 09:44, "Peter Geoghegan" <pg@heroku.com> wrote:
>
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Probably the most discussion-worthy item is whether we can say
> > anything more about the strxfrm mess. Should we make a wiki
> > page about that and have the release note item link to it?
>
> I think that there is an argument against doing so, which is that
> right now, all we have to offer on that are weasel words. However, I'm
> still in favor of a Wiki page, because I would not be at all surprised
> if our understanding of this problem evolved, and we were able to
> offer better answers in several weeks. Realistically, it will probably
> take at least that long before affected users even start to think
> about this.

Should we start thinking about ICU ? I compare Postgres with ICU and without and found 27x improvement in btree index creation for russian strings. This includes effect of abbreviated keys and ICU itself. Also, we'll get system independent locale.

> --
> Peter Geoghegan
On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> Should we start thinking about ICU ? I compare Postgres with ICU and without
> and found 27x improvement in btree index creation for russian strings. This
> includes effect of abbreviated keys and ICU itself. Also, we'll get system
> independent locale.

I think we should. I want to develop a detailed proposal before
talking about it more, though, because the idea is controversial.

Did you use the FreeBSD ports patch? Do you have your own patch that
you could share?

I'm not surprised that ICU is so much faster, especially now that
UTF-8 is not a second class citizen (it's been possible to build ICU
to specialize all its routines to handle UTF-8 for years now). As you
may know, ICU supports partial sort keys, and sort key compression,
which may have also helped:

http://userguide.icu-project.org/collation/architecture

That page also describes how binary sort keys are versioned, which
allows them to be stored on disk. It says "A common example is the use
of keys to build indexes in databases". We'd be crazy to trust Glibc
strxfrm() to be stable *on disk*, but ICU already cares deeply about
the things we need to care about, because it's used by other database
systems like DB2, Firebird, and in some configurations SQLite [1].

Glibc strxfrm() is not great with codepoints from the Cyrillic
alphabet -- it seems to store 2 bytes per code-point in the primary
weight level. So ICU might also do better in your test case for that
reason.

[1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
--
Peter Geoghegan
On Mon, Mar 28, 2016 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> > Should we start thinking about ICU ? I compare Postgres with ICU and without
> > and found 27x improvement in btree index creation for russian strings. This
> > includes effect of abbreviated keys and ICU itself. Also, we'll get system
> > independent locale.
>
> I think we should. I want to develop a detailed proposal before
> talking about it more, though, because the idea is controversial.
>
> Did you use the FreeBSD ports patch? Do you have your own patch that
> you could share?
We'll post the patch. As I remember, Teodor made something to get abbreviated
keys to work. I should say that I got the 27x improvement on my MacBook. I will
check on Linux.
> I'm not surprised that ICU is so much faster, especially now that
> UTF-8 is not a second class citizen (it's been possible to build ICU
> to specialize all its routines to handle UTF-8 for years now). As you
> may know, ICU supports partial sort keys, and sort key compression,
> which may have also helped:
>
> http://userguide.icu-project.org/collation/architecture
>
> That page also describes how binary sort keys are versioned, which
> allows them to be stored on disk. It says "A common example is the use
> of keys to build indexes in databases". We'd be crazy to trust Glibc
> strxfrm() to be stable *on disk*, but ICU already cares deeply about
> the things we need to care about, because it's used by other database
> systems like DB2, Firebird, and in some configurations SQLite [1].
>
> Glibc strxfrm() is not great with codepoints from the Cyrillic
> alphabet -- it seems to store 2 bytes per code-point in the primary
> weight level. So ICU might also do better in your test case for that
> reason.
Yes, I see on this page, that ICU is ~3 times faster for russian text.
http://site.icu-project.org/charts/collation-icu4c48-glibc
> [1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
> --
> Peter Geoghegan
On Mon, Mar 28, 2016 at 12:55 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> We'll post the patch.

Cool.

> Teodor made something to get abbreviated keys work as
> I remember. I should say, that 27x improvement I got on my macbook. I will
> check on linux.

I think that Linux will be much faster. The strxfrm() blob produced by
Mac OSX will have a horribly low concentration of entropy. For an 8
byte Datum, you get only 2 distinguishing bytes. It's really, really
bad. Mac OSX probably makes very little use of strxfrm() in practice;
there are proprietary APIs that do something similar, but all using
UTF-16 only.

--
Peter Geoghegan
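To make the "distinguishing bytes" point concrete, here is a rough Python sketch of the abbreviated-key idea. It is not PostgreSQL's implementation: `abbreviate`, `distinguishing_bytes`, and `make_comparator` are hypothetical names, and the `transform` argument stands in for whatever produces the binary sort key (strxfrm() in the discussion above). The sketch keeps only the first 8 bytes of each key, compares those cheaply, and falls back to the full comparison on a tie.

```python
from functools import cmp_to_key

ABBREV_BYTES = 8  # the "8 byte Datum" mentioned above


def abbreviate(sort_key: bytes) -> bytes:
    """Keep the first ABBREV_BYTES bytes of a sort key, zero-padded."""
    return sort_key[:ABBREV_BYTES].ljust(ABBREV_BYTES, b"\x00")


def distinguishing_bytes(sort_keys):
    """Count byte positions of the abbreviated keys that vary across keys."""
    abbreviated = [abbreviate(k) for k in sort_keys]
    return sum(1 for column in zip(*abbreviated) if len(set(column)) > 1)


def make_comparator(transform, full_compare):
    """Build a sort key object: abbreviated comparison first, then tie-break.

    transform:    str -> bytes, the full binary sort key (assumed input)
    full_compare: (str, str) -> int, the authoritative comparison
    """
    def compare(a, b):
        ka, kb = abbreviate(transform(a)), abbreviate(transform(b))
        if ka != kb:
            return -1 if ka < kb else 1
        # Abbreviated keys tie, so the authoritative comparison decides.
        return full_compare(a, b)

    return cmp_to_key(compare)
```

If most of the 8 bytes are identical across keys, nearly every comparison ends in the tie-break and the abbreviation buys little. The sort is also only correct when `transform` and `full_compare` agree on ordering.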
On Mon, Mar 28, 2016 at 2:06 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 28, 2016 at 12:55 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> > We'll post the patch.
>
> Cool.
>
> > Teodor made something to get abbreviated keys work as
> > I remember. I should say, that 27x improvement I got on my macbook. I will
> > check on linux.
>
> I think that Linux will be much faster. The strxfrm() blob produced by
> Mac OSX will have a horribly low concentration of entropy. For an 8
> byte Datum, you get only 2 distinguishing bytes. It's really, really
> bad. Mac OSX probably makes very little use of strxfrm() in practice;
> there are proprietary APIs that do something similar, but all using
> UTF-16 only.
Yes, Linux is much, much faster; I see no difference in performance using the latest ICU 57_1.
I tested on Ubuntu 14.04.4. But still, ICU provides us abbreviated keys and collation stability,
so let's add --with-icu.
> --
> Peter Geoghegan
Oleg Bartunov-2 wrote
> But still, icu provides us abbreviated keys and collation stability,

Does including ICU mean that collation handling is identical across platforms?
E.g. a query on Linux involving string comparison would yield the same
result on MacOS and Windows?

If that is the case I'm all for it.

Currently the different behaviour in handling collation aware string
comparisons is a bug in my eyes from a user's perspective. I do understand
and can accept the technical reasons for that, but it still feels odd that a
query yields different results (with identical data) just because it runs on
a different platform.

--
View this message in context: http://postgresql.nabble.com/Draft-release-notes-for-next-week-s-releases-tp5895357p5895484.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Oleg Bartunov <obartunov@gmail.com> writes:
> Should we start thinking about ICU ?

Isn't it still true that ICU fails to meet our minimum requirements?
That would include (a) working with the full Unicode character range
(not only UTF16) and (b) working with non-Unicode encodings. No doubt
we could deal with (b) by inserting a conversion, but that would take
a lot of shine off the performance numbers you mention.

I'm also not exactly convinced by your implicit assumption that ICU is
bug-free.

regards, tom lane
All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Oleg Bartunov <obartunov@gmail.com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings. No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU. I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU

If we're going to talk about minimum requirements, I'd like to argue
that we require whatever system we're using to have versioning (which
glibc currently lacks, as I understand it...) to avoid the risk that
indexes will become corrupt when whatever we're using for collation
changes. I'm pretty sure that's already bitten us on at least some
RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
with strcoll vs. strxfrm.

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to identify what each index was built with to do so, however,
as they would need to be rebuilt if the choice changes, at least
until/unless they're made to reliably agree. Even using only one or the
other doesn't address the versioning problem though, which is a problem
for all currently released versions of PG and is just going to continue
to be an issue.

Thanks!

Stephen
On Mon, Mar 28, 2016 at 10:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

Noah spent some time looking at ICU back when he was at EnterpriseDB, and
his conclusion was that ICU collations weren't stable across releases,
which is pretty much the same problem we're running into with glibc
collations. Now it might still be true that they have the equivalent
of strxfrm() and strcoll() and that those things behave consistently
with each other, and that would be very good. Everybody seems to
agree it's faster, and that's good, too. But I wonder what we do
about the fact that, as with glibc, an ICU upgrade involves a REINDEX
of every potentially affected index. It seems like ICU has some
facilities built into it that might be useful for detecting and
handling such situations, but I don't understand them well enough to
know whether they'd solve our versioning problems or how effectively
they would do so, and I think there are packaging issues that tie into
it, too. http://userguide.icu-project.org/design mentions building
with specific configure flags if you need to link with multiple server
versions, and I don't know what operating system packagers typically
do about that stuff.

In any case, I agree that we'd be very unwise to think that ICU is
necessarily going to be bug-free without testing it carefully.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost@snowman.net> wrote:
> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes. I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

I totally agree that anything we should adopt should support
versioning. Glibc does have a non-standard versioning scheme, but we
don't use it. Other stdlibs may do versioning another way, or not at
all. In a world in which ICU is the de facto standard for Postgres (i.e.
the actual standard on all major platforms), we mostly just have one
thing to target, which seems like something to aim for.

Collations change from time to time, legitimately. Read from
"Collation order is not fixed", here:

http://unicode.org/reports/tr10/#Stability

The question is only how we deal with this when it happens. One thing
that's attractive about ICU is that it makes this explicit, both for
the logical behavior of a collation, as well as the stability of
binary sort keys (Glibc's versioning seemingly just does the former).
So the equivalent of strxfrm() output has license to change for
technical reasons that are orthogonal to the practical concerns of
end-users about how text sorts in their locale. ICU is clear on what
it takes to make binary sort keys in indexes work. And various major
database systems rely on this being right.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.

I think they just need to match, per the standard. After all,
abbreviation will sometimes require strcoll() tie-breakers.

Clearly it would be very naive to imagine that ICU is bug-free.
However, I surmise that there is a large difference in how ICU and glibc
think about things like strxfrm() or strcoll() stability and
consistency. Tom was able to demonstrate that strxfrm() and strcoll()
behaved inconsistently without too much effort, contrary to POSIX, and
in many common cases. I doubt that the Glibc maintainers are all that
concerned about it. Certainly, less concerned than they are about the
latest security bug. Whereas if this happened in ICU, it would be a
total failure of the project to fulfill its most basic goals. Our
disaster would also be a disaster for several other major database
systems. ICU carefully and explicitly considers multiple forms of
stability, "deterministic" sort ordering, etc. That *is* a big
difference, and it makes me optimistic that there'd be far fewer
problems.

I also think that ICU could be a reasonable basis for case-insensitive
collations, which would let us kill citext, a module that I consider
to be a total kludge. And, we might also be able to lock down WAL
compatibility, which would be generally useful.

--
Peter Geoghegan
* Peter Geoghegan (pg@heroku.com) wrote:
> On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > If we're going to talk about minimum requirements, I'd like to argue
> > that we require whatever system we're using to have versioning (which
> > glibc currently lacks, as I understand it...) to avoid the risk that
> > indexes will become corrupt when whatever we're using for collation
> > changes. I'm pretty sure that's already bitten us on at least some
> > RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> > with strcoll vs. strxfrm.
>
> I totally agree that anything we should adopt should support
> versioning. Glibc does have a non-standard versioning scheme, but we
> don't use it. Other stdlibs may do versioning another way, or not at
> all. A world in which ICU is the defacto standard for Postgres (i.e.
> the actual standard on all major platforms), we mostly just have one
> thing to target, which seems like something to aim for.

Having to figure out how each and every stdlib does versioning doesn't
sound fun, I certainly agree with you there, but it hardly seems
impossible. What we need, even if we look to move to ICU, is a place to
remember that version information and a way to do something when we
discover that we're now using a different version.

I'm not quite sure what the best way to do that is, but I imagine it
involves changes to existing catalogs or perhaps even a new one. I
don't have any particularly great ideas for existing releases (maybe
stash information in the index somewhere when it's rebuilt and then
check it and throw an ERROR if they don't match?)

> The question is only how we deal with this when it happens. One thing
> that's attractive about ICU is that it makes this explicit, both for
> the logical behavior of a collation, as well as the stability of
> binary sort keys (Glibc's versioning seemingly just does the former).
> So the equivalent of strxfrm() output has license to change for
> technical reasons that are orthogonal to the practical concerns of
> end-users about how text sorts in their locale. ICU is clear on what
> it takes to make binary sort keys in indexes work. And various major
> database systems rely on this being right.

There seems to be some disagreement about if ICU provides the
information we'd need to make a decision or not. It seems like it
would, given its usage in other database systems, but if so, we need to
very clearly understand exactly how it works and how we can depend on
it.

> > Regarding key abbreviation and performance, if we are confident that
> > strcoll and strxfrm are at least independently internally consistent
> > then we could consider offering an option to choose between them.
>
> I think they just need to match, per the standard. After all,
> abbreviation will sometimes require strcoll() tie-breakers.

Ok, I didn't see that in the man-pages. If that's the case then it
seems like there isn't much hope of just using strxfrm().

Thanks!

Stephen
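The "stash the version and throw an ERROR on mismatch" idea floated above can be sketched roughly as follows. This is only an illustration in Python, not a proposal for actual catalog changes: `IndexCollationRecord`, `check_index`, and `CollationVersionMismatch` are hypothetical names, and `current_versions` stands in for whatever the collation library would report at run time.

```python
from dataclasses import dataclass


class CollationVersionMismatch(Exception):
    """Raised when an index was built under a different collation version."""


@dataclass(frozen=True)
class IndexCollationRecord:
    # What would be remembered when the index is built or rebuilt.
    index_name: str
    collation: str
    version_at_build: str


def check_index(record, current_versions):
    """Compare the recorded version with what the library reports now.

    current_versions: dict mapping collation name -> version string
    (an assumed input; how to obtain it is the open question above).
    """
    current = current_versions.get(record.collation)
    if current != record.version_at_build:
        raise CollationVersionMismatch(
            f"index {record.index_name!r} was built with collation "
            f"{record.collation!r} version {record.version_at_build!r}, "
            f"but the library now reports {current!r}; REINDEX is needed"
        )
```

The check itself is trivial. The hard parts discussed in the thread are where the recorded version lives and whether the underlying library exposes a trustworthy version at all.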
On Mon, Mar 28, 2016 at 12:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Having to figure out how each and every stdlib does versioning doesn't
> sound fun, I certainly agree with you there, but it hardly seems
> impossible. What we need, even if we look to move to ICU, is a place to
> remember that version information and a way to do something when we
> discover that we're now using a different version.

I think that the versioning situation is all over the place. It isn't
in the C standard. And there are many different versions of many
different stdlibs to support. Most importantly, where support
nominally exists, a strong incentive to get it exactly right may not.
We've seen that already.

> I'm not quite sure what the best way to do that is, but I imagine it
> involves changes to existing catalogs or perhaps even a new one. I
> don't have any particularly great ideas for existing releases (maybe
> stash information in the index somewhere when it's rebuilt and then
> check it and throw an ERROR if they don't match?)

I think we'd need to introduce an abstraction like a "collation
provider", of which ICU would theoretically be just one. The OS would
be a baked-in collation provider. Everything that works today would
continue to work. We'd then largely just be grandfathering out systems
that rely on OS locales across major version upgrades, since the vast
majority of users are happy with Unicode, and have no cultural or
technical reason to prefer the OS locales that I can think of.

I am unconvinced by the idea that it especially matters that sort(1)
might not be in agreement with Postgres. Neither is any Java app, or
any .Net app, or the user's web browser in the case of Safari or
Google Chrome (maybe others). I want Postgres to be consistent with
Postgres, across different nodes on the network, in environments where
I may have little knowledge of the underlying OS. Think "sort pushdown
in postgres_fdw".

Users from certain East Asian user communities might prefer to stick
with regional encodings, perhaps due to specific concerns about the
Han Unification controversy. But I'm pretty sure that these users have
very low expectations about collations in Postgres today. I was
recently told that collating Japanese is starting to get a bit better,
due to various new initiatives, but that most experienced Japanese
Postgres DBAs tend to use the "C" collation.

I don't want to impose a Unicode monoculture on anyone. But I do think
there are clear benefits for the large majority of users that always
use Unicode. Nothing needs to break that works today to make this
happen. Abbreviated keys provide an immediate incentive for users to
adopt ICU; users that might otherwise be on the fence about it.

>> The question is only how we deal with this when it happens. One thing
>> that's attractive about ICU is that it makes this explicit, both for
>> the logical behavior of a collation, as well as the stability of
>> binary sort keys (Glibc's versioning seemingly just does the former).
>> So the equivalent of strxfrm() output has license to change for
>> technical reasons that are orthogonal to the practical concerns of
>> end-users about how text sorts in their locale. ICU is clear on what
>> it takes to make binary sort keys in indexes work. And various major
>> database systems rely on this being right.
>
> There seems to be some disagreement about if ICU provides the
> information we'd need to make a decision or not. It seems like it
> would, given its usage in other database systems, but if so, we need to
> very clearly understand exactly how it works and how we can depend on
> it.

It seems likely that it exposes the information required to make what
we need to do practical.
Certainly, adopting ICU is a big project that
we should proceed cautiously with, but there is a reason why every
other major database system uses either ICU, or a library based on UCA
[1] that allows the system to centrally control versioned collations
(SQLite just makes this optional).

I think that ICU *could* still tie us to the available collations on
an OS (those collations that are available with their ICU packages).
What I haven't figured out yet is if it's practical to install
versions that are available from some central location, like the CLDR
[2]. I don't think we'd want to have Postgres ship "supported
collations" in each major version, in roughly the style of the IANA
timezone stuff, but it's far too early to rule that out. It would have
upsides.

[1] https://en.wikipedia.org/wiki/Unicode_collation_algorithm
[2] http://cldr.unicode.org/
--
Peter Geoghegan
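A "collation provider" abstraction of the kind described above might look something like the following Python sketch. Everything here is hypothetical: the class names and the `version()` method are invented for illustration, and no ICU-backed provider is shown because that would need a binding this example does not assume. The OS provider reports its version as "unknown", reflecting the point that the C standard offers no collation version.

```python
import locale
from abc import ABC, abstractmethod
from functools import cmp_to_key


class CollationProvider(ABC):
    """Common interface that every source of collation rules would implement."""

    @abstractmethod
    def compare(self, a: str, b: str) -> int:
        """Return a negative, zero, or positive number, like strcoll()."""

    @abstractmethod
    def version(self) -> str:
        """Return an identifier that changes when sort order may change."""

    def sort(self, values):
        return sorted(values, key=cmp_to_key(self.compare))


class OSLocaleProvider(CollationProvider):
    """The baked-in provider: defer to the C library's current locale."""

    def compare(self, a, b):
        return locale.strcoll(a, b)

    def version(self):
        # Standard C exposes no collation version, so none is reported.
        return "unknown"


class CodepointProvider(CollationProvider):
    """Plain code point order, similar in spirit to the "C" collation."""

    def compare(self, a, b):
        return (a > b) - (a < b)

    def version(self):
        return "1"
```

Callers that only see `CollationProvider` could then record `version()` alongside an index and stay indifferent to which library produced the ordering.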
On Mon, Mar 28, 2016 at 5:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
> All,
>
> Changed the thread name (we're no longer talking about release
> notes...).
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Oleg Bartunov <obartunov@gmail.com> writes:
> > > Should we start thinking about ICU ?
> >
> > Isn't it still true that ICU fails to meet our minimum requirements?
> > That would include (a) working with the full Unicode character range
> > (not only UTF16) and (b) working with non-Unicode encodings. No doubt
> > we could deal with (b) by inserting a conversion, but that would take
> > a lot of shine off the performance numbers you mention.
> >
> > I'm also not exactly convinced by your implicit assumption that ICU is
> > bug-free.
>
> We have a wiki page about ICU. I'm not sure that it's current, but if
> it isn't and people are interested then perhaps we should update it:
> https://wiki.postgresql.org/wiki/Todo:ICU

Good point, I forgot about this page.

> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes. I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

Agree.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.
> We'd need to identify what each index was built with to do so, however,
> as they would need to be rebuilt if the choice changes, at least
> until/unless they're made to reliably agree. Even using only one or the
> other doesn't address the versioning problem though, which is a problem
> for all currently released versions of PG and is just going to continue
> to be an issue.

Ideally, we should benchmark all locales on all platforms for all kinds of indexes. But that's a big project.

> Thanks!
>
> Stephen
On Mon, Mar 28, 2016 at 1:36 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
> Oleg Bartunov-2 wrote
> > But still, icu provides us abbreviated keys and collation stability,
>
> Does including ICU mean that collation handling is identical across platforms?
> E.g. a query on Linux involving string comparison would yield the same
> result on MacOS and Windows?
Yes, it does and that's the most important issue for us.
> If that is the case I'm all for it.
>
> Currently the different behaviour in handling collation aware string
> comparisons is a bug in my eyes from a user's perspective. I do understand
> and can accept the technical reasons for that, but it still feels odd that a
> query yields different results (with identical data) just because it runs on
> a different platform.
On Mon, Mar 28, 2016 at 6:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 28, 2016 at 10:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm also not exactly convinced by your implicit assumption that ICU is
> > bug-free.
>
> Noah spent some time looking at ICU back when he was at EnterpriseDB, and
> his conclusion was that ICU collations weren't stable across releases,
> which is pretty much the same problem we're running into with glibc
> collations. Now it might still be true that they have the equivalent
> of strxfrm() and strcoll() and that those things behave consistently
> with each other, and that would be very good. Everybody seems to
> agree it's faster, and that's good, too. But I wonder what we do
> about the fact that, as with glibc, an ICU upgrade involves a REINDEX
> of every potentially affected index. It seems like ICU has some
> facilities built into it that might be useful for detecting and
> handling such situations, but I don't understand them well enough to
> know whether they'd solve our versioning problems or how effectively
> they would do so, and I think there are packaging issues that tie into
> it, too. http://userguide.icu-project.org/design mentions building
> with specific configure flags if you need to link with multiple server
> versions, and I don't know what operating system packagers typically
> do about that stuff.
>
> In any case, I agree that we'd be very unwise to think that ICU is
> necessarily going to be bug-free without testing it carefully.
Agree.
In the other thread I wrote:
"Ideally, we should benchmark all locales on all platforms for all kinds of indexes. But that's a big project."
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> Does including ICU mean that collation handling is identical across platforms?
> E.g. a query on Linux involving string comparison would yield the same
> result on MacOS and Windows?
>
> Yes, it does and that's the most important issue for us.

Yes, exactly.

The attached patch adds support for libicu with the configure flag --with-icu.
The patch is rebased to current HEAD; I hope it works. It's based on the work at
https://people.freebsd.org/~girgen/postgresql-icu/readme.html
and it was migrated to 9.5 with abbreviated keys support.

The patch in its current state is not ready to commit, of course.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess. Should we make a wiki
> page about that and have the release note item link to it?

I just noticed that the release notes mention char(n) as affected. That's not actually true, because char(n) SortSupport only came in 9.6. The Wiki page now shows this, which may be the most important place, but ideally we'd fix this in the release notes. I guess it's too late.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> I just noticed that the release notes mention char(n) as affected.
> That's not actually true, because char(n) SortSupport only came in
> 9.6. The Wiki page now shows this, which may be the most important
> place, but ideally we'd fix this in the release notes. I guess it's
> too late.

Well, too late for 9.5.2 anyway. It still makes sense to correct that text for future releases. I'm inclined to wait a little bit though and see what other improvements become apparent. For instance, I think the point about non-first index columns not being affected is of greater weight than you seem to place on it.

regards, tom lane
On Thu, Mar 31, 2016 at 2:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, too late for 9.5.2 anyway. It still makes sense to correct that
> text for future releases. I'm inclined to wait a little bit though and
> see what other improvements become apparent. For instance, I think the
> point about non-first index columns not being affected is of greater
> weight than you seem to place on it.

The SQL query on the Wiki page does the right thing there now, so users will have the benefit of not unnecessarily reindexing when text was not the leading/first pg_index attribute. We have that covered, I suppose, because everyone will look to the Wiki page for guidance.

I also noted quite a few non-obvious safe cases on the Wiki page, as pointed out already over on the other thread.

--
Peter Geoghegan
On Tue, Mar 29, 2016 at 5:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> It's based on https://people.freebsd.org/~girgen/postgresql-icu/readme.html
> work, and it was migrated to 9.5 with abbreviated keys support.
> Patch in current state is not ready to commit, of course.

Cool. Some quick observations on this:

* We need to have a strxfrm_l_icu(), not just a strxfrm_icu(). That seems easy.

* We should look into using the ucol_nextSortKeyPart() API:

http://userguide.icu-project.org/collation/architecture#TOC-Partial-sort-keys

I think that this could be a lot faster, because we only need a part of the collation tables in CPU cache during the generation of abbreviated keys. There is an optimization described at a low level here:

https://github.com/icu-project/icu4c/blob/bbd17a792336de5873550794f8304a4b548b0663/source/i18n/collationkeys.cpp#L337

I think this could make our special strxfrm() (which only actually needs 8 bytes for abbreviated keys) a lot faster. I'd be interested to see how your Russian text example does with this extra optimization.

We should not be surprised that this kind of support exists within ICU, because abbreviated keys are actually quite an old idea.

--
Peter Geoghegan
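
To make the "only actually needs 8 bytes" point concrete, here is a rough sketch of the abbreviated-key idea using plain libc strxfrm(). It is not PostgreSQL's actual code; make_abbrev_key(), abbrev_compare(), and ABBREV_LEN are hypothetical names chosen for illustration. It assumes strxfrm() and strcoll() agree with each other, which is exactly the assumption that the strxfrm problem in this thread calls into question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ABBREV_LEN 8u   /* bytes of the transformed blob kept per key */

/*
 * Stores the first ABBREV_LEN bytes of the strxfrm() blob for s in key,
 * zero-padded when the blob is shorter. Returns 0 on success and -1 if
 * memory could not be allocated.
 */
int make_abbrev_key(const char *s, unsigned char key[ABBREV_LEN])
{
    assert(s != NULL && key != NULL);

    size_t need = strxfrm(NULL, s, 0) + 1;  /* blob length plus NUL */
    char *blob = malloc(need);
    size_t n = need - 1;

    if (blob == NULL)
        return -1;
    strxfrm(blob, s, need);
    if (n > ABBREV_LEN)
        n = ABBREV_LEN;
    memset(key, 0, ABBREV_LEN);
    memcpy(key, blob, n);
    free(blob);
    return 0;
}

/*
 * Compares two strings using their abbreviated keys first, and falls
 * back to a full strcoll() only when the keys are identical.
 */
int abbrev_compare(const char *a, const unsigned char ka[ABBREV_LEN],
                   const char *b, const unsigned char kb[ABBREV_LEN])
{
    int c = memcmp(ka, kb, ABBREV_LEN);

    if (c != 0)
        return c;           /* resolved without touching the strings */
    return strcoll(a, b);   /* tie on the prefix: full comparison */
}
```

Note that this sketch transforms the entire string and then throws away everything past the first 8 bytes. That wasted work is what a partial sort key API such as ucol_nextSortKeyPart() would presumably avoid.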
On Thu, Apr 14, 2016 at 4:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * We should look into using the ucol_nextSortKeyPart() API:
>
> http://userguide.icu-project.org/collation/architecture#TOC-Partial-sort-keys

Another, richer API we could immediately put to good use is the ICU strcoll() variant that does not require NUL-terminated strings:

https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a3abc6779e6452106415918199308fab4

We do not use a NUL byte for terminating text data, and so must copy its contents into a temp buffer, or array on the stack, all rather inefficiently. Robert has expressed an interest in an API like this strcoll() variant in the past [1], to avoid this unnecessary overhead.

[1] http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

--
Peter Geoghegan
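
The copying overhead described above looks roughly like the following sketch. It is only loosely in the spirit of what a length-counted comparison has to do with libc and is not PostgreSQL's actual code; coll_counted() and STACK_BUF are hypothetical names, and the 1024-byte stack buffer size is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STACK_BUF 1024  /* inputs larger than this go to the heap */

/*
 * Compares two length-counted byte strings with strcoll(). Because
 * strcoll() requires NUL-terminated input, both strings are first
 * copied into a scratch buffer and terminated there.
 */
int coll_counted(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);

    char stackbuf[STACK_BUF];
    char *buf = stackbuf;
    size_t need = alen + blen + 2;  /* two copies plus two NUL bytes */
    int result;

    if (need > sizeof(stackbuf)) {
        buf = malloc(need);
        if (buf == NULL)
            abort();                /* a sketch: no graceful recovery */
    }

    /* The copies below are the overhead a length-aware API avoids. */
    memcpy(buf, a, alen);
    buf[alen] = '\0';
    memcpy(buf + alen + 1, b, blen);
    buf[alen + 1 + blen] = '\0';

    result = strcoll(buf, buf + alen + 1);

    if (buf != stackbuf)
        free(buf);
    return result;
}
```

With a comparison function that accepts explicit lengths, the two memcpy() calls and the possible malloc() would go away, which is the appeal of the ICU variant linked above.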