Thread: Draft release notes for next week's releases

Draft release notes for next week's releases

From
Tom Lane
Date:
I've prepared a first cut at next week's release notes:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29b6123ecb4113e366325245cec5a5c221dae691

(As usual, I will make the notes for older branches by extracting
relevant items from this list, after it's been reviewed.)  Please
review.  If you prefer to read it on the web, it should be up at

http://www.postgresql.org/docs/devel/static/release-9-5-2.html

in an hour or so, after guaibasaurus's next buildfarm run.

Probably the most discussion-worthy item is whether we can say
anything more about the strxfrm mess.  Should we make a wiki
page about that and have the release note item link to it?
        regards, tom lane



Re: Draft release notes for next week's releases

From
Jeff Janes
Date:
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've prepared a first cut at next week's release notes:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29b6123ecb4113e366325245cec5a5c221dae691
>
> (As usual, I will make the notes for older branches by extracting
> relevant items from this list, after it's been reviewed.)  Please
> review.  If you prefer to read it on the web, it should be up at
>
> http://www.postgresql.org/docs/devel/static/release-9-5-2.html
>
> in an hour or so, after guaibasaurus's next buildfarm run.
>
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess.  Should we make a wiki
> page about that and have the release note item link to it?
>
>                         regards, tom lane


Sorry for speaking up late, but:

+    <listitem>
+     <para>
+      Correctly handle wraparound cases in the <literal>pg_subtrans</>
+      startup logic for hot standby (Jeff Janes)
+     </para>
+    </listitem>

This applies to all recovery scenarios, whether they are hot standby
or just plain-old automatic crash recovery.  (However, it does only
matter when prepared transactions are in use.)

Cheers,

Jeff



Re: Draft release notes for next week's releases

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +      Correctly handle wraparound cases in the <literal>pg_subtrans</>
> +      startup logic for hot standby (Jeff Janes)

> This applies to all recovery scenarios, whether they are hot standby
> or just plain-old automatic crash recovery.  (However, it does only
> matter when prepared transactions are in use.)

Thanks for the clarification, will fix!
        regards, tom lane



Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess.  Should we make a wiki
> page about that and have the release note item link to it?

I think that there is an argument against doing so, which is that
right now, all we have to offer on that are weasel words. However, I'm
still in favor of a Wiki page, because I would not be at all surprised
if our understanding of this problem evolved, and we were able to
offer better answers in several weeks. Realistically, it will probably
take at least that long before affected users even start to think
about this.


-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
"David G. Johnston"
Date:
On Sun, Mar 27, 2016 at 8:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Probably the most discussion-worthy item is whether we can say
> > anything more about the strxfrm mess.  Should we make a wiki
> > page about that and have the release note item link to it?
>
> I think that there is an argument against doing so, which is that
> right now, all we have to offer on that are weasel words. However, I'm
> still in favor of a Wiki page, because I would not be at all surprised
> if our understanding of this problem evolved, and we were able to
> offer better answers in several weeks. Realistically, it will probably
> take at least that long before affected users even start to think
> about this.

One question to debate is whether publishing a list of "known" failures
(collated from the program runs lots of people performed) would do more
harm than good.  Personally I'd rather see a list of known failures and
evaluate my situation objectively (i.e., large index but no reported
problem on my combination of locale and platform).  I understand that a
lack of evidence is not proof that I am unaffected at this stage in the
game.  Having something I can execute on my server to try and verify
behavior - irrespective of the correctness of the indexes themselves -
would be welcomed.

David J.

Re: Draft release notes for next week's releases

From
Oleg Bartunov
Date:
On Mar 28, 2016 09:44, "Peter Geoghegan" <pg@heroku.com> wrote:
>
> On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Probably the most discussion-worthy item is whether we can say
> > anything more about the strxfrm mess.  Should we make a wiki
> > page about that and have the release note item link to it?
>
> I think that there is an argument against doing so, which is that
> right now, all we have to offer on that are weasel words. However, I'm
> still in favor of a Wiki page, because I would not be at all surprised
> if our understanding of this problem evolved, and we were able to
> offer better answers in several weeks. Realistically, it will probably
> take at least that long before affected users even start to think
> about this.

Should we start thinking about ICU ? I compare Postgres with ICU and without
and found 27x improvement in btree index creation for russian strings. This
includes effect of abbreviated keys and ICU itself. Also, we'll get system
independent locale.

>
> --
> Peter Geoghegan
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> Should we start thinking about ICU ? I compare Postgres with ICU and without
> and found 27x improvement in btree index creation for russian strings. This
> includes effect of abbreviated keys and ICU itself. Also, we'll get system
> independent locale.

I think we should. I want to develop a detailed proposal before
talking about it more, though, because the idea is controversial.

Did you use the FreeBSD ports patch? Do you have your own patch that
you could share?

I'm not surprised that ICU is so much faster, especially now that
UTF-8 is not a second class citizen (it's been possible to build ICU
to specialize all its routines to handle UTF-8 for years now). As you
may know, ICU supports partial sort keys, and sort key compression,
which may have also helped:
http://userguide.icu-project.org/collation/architecture

That page also describes how binary sort keys are versioned, which
allows them to be stored on disk. It says "A common example is the use
of keys to build indexes in databases". We'd be crazy to trust Glibc
strxfrm() to be stable *on disk*, but ICU already cares deeply about
the things we need to care about, because it's used by other database
systems like DB2, Firebird, and in some configurations SQLite [1].

Glibc strxfrm() is not great with codepoints from the Cyrillic
alphabet -- it seems to store 2 bytes per code-point in the primary
weight level. So ICU might also do better in your test case for that
reason.

[1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
Oleg Bartunov
Date:


On Mon, Mar 28, 2016 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> > Should we start thinking about ICU ? I compare Postgres with ICU and without
> > and found 27x improvement in btree index creation for russian strings. This
> > includes effect of abbreviated keys and ICU itself. Also, we'll get system
> > independent locale.
>
> I think we should. I want to develop a detailed proposal before
> talking about it more, though, because the idea is controversial.
>
> Did you use the FreeBSD ports patch? Do you have your own patch that
> you could share?

We'll post the patch. Teodor made something to get abbreviated keys work as
I remember. I should say, that 27x improvement I got on my macbook. I will
check on linux.

> I'm not surprised that ICU is so much faster, especially now that
> UTF-8 is not a second class citizen (it's been possible to build ICU
> to specialize all its routines to handle UTF-8 for years now). As you
> may know, ICU supports partial sort keys, and sort key compression,
> which may have also helped:
> http://userguide.icu-project.org/collation/architecture
>
> That page also describes how binary sort keys are versioned, which
> allows them to be stored on disk. It says "A common example is the use
> of keys to build indexes in databases". We'd be crazy to trust Glibc
> strxfrm() to be stable *on disk*, but ICU already cares deeply about
> the things we need to care about, because it's used by other database
> systems like DB2, Firebird, and in some configurations SQLite [1].
>
> Glibc strxfrm() is not great with codepoints from the Cyrillic
> alphabet -- it seems to store 2 bytes per code-point in the primary
> weight level. So ICU might also do better in your test case for that
> reason.

Yes, I see on this page, that ICU is ~3 times faster for russian text.
http://site.icu-project.org/charts/collation-icu4c48-glibc

> [1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
> --
> Peter Geoghegan

Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Mon, Mar 28, 2016 at 12:55 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
>  We'll post the patch.

Cool.

> Teodor made something to get abbreviated keys work as
> I remember. I should say, that 27x improvement I got on my macbook. I will
> check on linux.

I think that Linux will be much faster. The strxfrm() blob produced by
Mac OSX will have a horribly low concentration of entropy. For an 8
byte Datum, you get only 2 distinguishing bytes. It's really, really
bad. Mac OSX probably makes very little use of strxfrm() in practice;
there are proprietary APIs that do something similar, but all using
UTF-16 only.
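
To make the abbreviated-key point concrete, here is a minimal Python sketch (an illustration only, not PostgreSQL code). It assumes a hypothetical full_key() standing in for a strxfrm()-style binary sort key; plain UTF-8 bytes are used, which matches only the "C" collation. The low_entropy_key() transform is likewise invented, to mimic a blob whose leading bytes carry little information, as described above for Mac OSX.

```python
from functools import cmp_to_key

ABBREV_LEN = 8  # number of key bytes that fit in an 8-byte Datum


def full_key(s: str) -> bytes:
    # Hypothetical stand-in for a strxfrm()-style binary sort key.
    # Plain UTF-8 bytes correspond to the "C" collation only.
    return s.encode("utf-8")


def low_entropy_key(s: str) -> bytes:
    # Hypothetical transform that spends 6 of the 8 leading bytes on a
    # constant header, leaving only 2 distinguishing bytes.
    return b"\x01" * 6 + s.encode("utf-8")


def abbreviate(key: bytes) -> bytes:
    # Keep the first 8 bytes, zero-padded so short keys stay comparable.
    return key[:ABBREV_LEN].ljust(ABBREV_LEN, b"\x00")


def compare(a: str, b: str) -> int:
    ka, kb = abbreviate(full_key(a)), abbreviate(full_key(b))
    if ka != kb:
        return -1 if ka < kb else 1  # cheap path: abbreviated keys decide
    fa, fb = full_key(a), full_key(b)  # tie-breaker: full comparison
    return (fa > fb) - (fa < fb)


def distinct_abbreviations(values, key_fn=full_key) -> int:
    # Rough measure of how much information survives abbreviation.
    return len({abbreviate(key_fn(v)) for v in values})


words = ["banana", "apple", "applesauce_long", "applesauce_longer", "cherry"]
print(sorted(words, key=cmp_to_key(compare)))
print(distinct_abbreviations(words), distinct_abbreviations(words, low_entropy_key))
```

The fewer distinct abbreviated keys a transform yields, the more comparisons fall through to the expensive tie-breaker, which is why a low-entropy blob makes abbreviation nearly useless.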

-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
Oleg Bartunov
Date:


On Mon, Mar 28, 2016 at 2:06 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 28, 2016 at 12:55 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> >  We'll post the patch.
>
> Cool.
>
> > Teodor made something to get abbreviated keys work as
> > I remember. I should say, that 27x improvement I got on my macbook. I will
> > check on linux.
>
> I think that Linux will be much faster. The strxfrm() blob produced by
> Mac OSX will have a horribly low concentration of entropy. For an 8
> byte Datum, you get only 2 distinguishing bytes. It's really, really
> bad. Mac OSX probably makes very little use of strxfrm() in practice;
> there are proprietary APIs that do something similar, but all using
> UTF-16 only.

Yes, Linux is much-much faster, I see no difference in performance using latest icu 57_1.
I tested on Ubuntu 14.04.4.  But still, icu provides us abbreviated keys and collation stability,
so let's add --with-icu.

> --
> Peter Geoghegan

Re: Draft release notes for next week's releases

From
Thomas Kellerer
Date:
Oleg Bartunov-2 wrote
> But still, icu provides us abbreviated keys and collation stability,

Does including ICU mean that collation handling is identical across platforms?
E.g. a query on Linux involving string comparison would yield the same
result on MacOS and Windows? 

If that is the case I'm all for it. 

Currently the different behaviour in handling collation aware string
comparisons is a bug in my eyes from a user's perspective. I do understand
and can accept the technical reasons for that, but it still feels odd that a
query yields different results (with identical data) just because it runs on
a different platform.




--
View this message in context:
http://postgresql.nabble.com/Draft-release-notes-for-next-week-s-releases-tp5895357p5895484.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.



Re: Draft release notes for next week's releases

From
Tom Lane
Date:
Oleg Bartunov <obartunov@gmail.com> writes:
> Should we start thinking about ICU ?

Isn't it still true that ICU fails to meet our minimum requirements?
That would include (a) working with the full Unicode character range
(not only UTF16) and (b) working with non-Unicode encodings.  No doubt
we could deal with (b) by inserting a conversion, but that would take
a lot of shine off the performance numbers you mention.

I'm also not exactly convinced by your implicit assumption that ICU is
bug-free.
        regards, tom lane



Dealing with collation and strcoll/strxfrm/etc

From
Stephen Frost
Date:
All,

Changed the thread name (we're no longer talking about release
notes...).

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Oleg Bartunov <obartunov@gmail.com> writes:
> > Should we start thinking about ICU ?
>
> Isn't it still true that ICU fails to meet our minimum requirements?
> That would include (a) working with the full Unicode character range
> (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> we could deal with (b) by inserting a conversion, but that would take
> a lot of shine off the performance numbers you mention.
>
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

We have a wiki page about ICU.  I'm not sure that it's current, but if
it isn't and people are interested then perhaps we should update it:

https://wiki.postgresql.org/wiki/Todo:ICU

If we're going to talk about minimum requirements, I'd like to argue
that we require whatever system we're using to have versioning (which
glibc currently lacks, as I understand it...) to avoid the risk that
indexes will become corrupt when whatever we're using for collation
changes.  I'm pretty sure that's already bitten us on at least some
RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
with strcoll vs. strxfrm.

Regarding key abbreviation and performance, if we are confident that
strcoll and strxfrm are at least independently internally consistent
then we could consider offering an option to choose between them.
We'd need to identify what each index was built with to do so, however,
as they would need to be rebuilt if the choice changes, at least
until/unless they're made to reliably agree.  Even using only one or the
other doesn't address the versioning problem though, which is a problem
for all currently released versions of PG and is just going to continue
to be an issue.

Thanks!

Stephen

Re: Draft release notes for next week's releases

From
Robert Haas
Date:
On Mon, Mar 28, 2016 at 10:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.

Noah spent some time looking at ICU back when he was at EnterpriseDB, and
his conclusion was that ICU collations weren't stable across releases,
which is pretty much the same problem we're running into with glibc
collations.  Now it might still be true that they have the equivalent
of strxfrm() and strcoll() and that those things behave consistently
with each other, and that would be very good.  Everybody seems to
agree it's faster, and that's good, too.  But I wonder what we do
about the fact that, as with glibc, an ICU upgrade involves a REINDEX
of every potentially affected index.  It seems like ICU has some
facilities built into it that might be useful for detecting and
handling such situations, but I don't understand them well enough to
know whether they'd solve our versioning problems or how effectively
they would do so, and I think there are packaging issues that tie into
it, too.  http://userguide.icu-project.org/design mentions building
with specific configure flags if you need to link with multiple server
versions, and I don't know what operating system packagers typically
do about that stuff.

In any case, I agree that we'd be very unwise to think that ICU is
necessarily going to be bug-free without testing it carefully.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Dealing with collation and strcoll/strxfrm/etc

From
Peter Geoghegan
Date:
On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost@snowman.net> wrote:
> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes.  I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

I totally agree that anything we adopt should support
versioning. Glibc does have a non-standard versioning scheme, but we
don't use it. Other stdlibs may do versioning another way, or not at
all. In a world in which ICU is the de facto standard for Postgres (i.e.
the actual standard on all major platforms), we mostly just have one
thing to target, which seems like something to aim for.

Collations change from time to time, legitimately. Read from
"Collation order is not fixed", here:

http://unicode.org/reports/tr10/#Stability

The question is only how we deal with this when it happens. One thing
that's attractive about ICU is that it makes this explicit, both for
the logical behavior of a collation, as well as the stability of
binary sort keys (Glibc's versioning seemingly just does the former).
So the equivalent of strxfrm() output has license to change for
technical reasons that are orthogonal to the practical concerns of
end-users about how text sorts in their locale. ICU is clear on what
it takes to make binary sort keys in indexes work. And various major
database systems rely on this being right.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.

I think they just need to match, per the standard. After all,
abbreviation will sometimes require strcoll() tie-breakers.

Clearly it would be very naive to imagine that ICU is bug-free.
However, I surmise that there is a large difference in how ICU and glibc
think about things like strxfrm() or strcoll() stability and
consistency. Tom was able to demonstrate that strxfrm() and strcoll()
behaved inconsistently without too much effort, contrary to POSIX, and
in many common cases. I doubt that the Glibc maintainers are all that
concerned about it. Certainly, less concerned than they are about the
latest security bug. Whereas if this happened in ICU, it would be a
total failure of the project to fulfill its most basic goals. Our
disaster would also be a disaster for several other major database
systems. ICU carefully and explicitly considers multiple forms of
stability, "deterministic" sort ordering, etc. That *is* a big
difference, and it makes me optimistic that there'd be far fewer
problems.
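
For readers who want to probe the strcoll()/strxfrm() agreement on their own system, a brute-force check can be sketched as below. This is a minimal illustration, not the test program used in the thread: it relies on Python's locale.strcoll and locale.strxfrm wrappers (which call the platform's wide-character variants), and the helper names sign and find_inconsistencies are made up here.

```python
import itertools
import locale


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


def find_inconsistencies(strings, coll=locale.strcoll, xfrm=locale.strxfrm):
    """Return pairs where coll(a, b) disagrees with the order of the
    transformed keys xfrm(a) and xfrm(b)."""
    bad = []
    for a, b in itertools.combinations(strings, 2):
        by_coll = sign(coll(a, b))
        ka, kb = xfrm(a), xfrm(b)
        by_xfrm = (ka > kb) - (ka < kb)
        if by_coll != by_xfrm:
            bad.append((a, b))
    return bad


# Use the collation from the environment; fall back to "C" if unavailable.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")

sample = ["a", "B", "b", "ab", "a b", "A-b"]
print(find_inconsistencies(sample))
```

An empty result on a small sample proves nothing about a locale in general; a non-empty result means an index built using one function may be ordered wrongly according to the other.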

I also think that ICU could be a reasonable basis for case-insensitive
collations, which would let us kill citext, a module that I consider
to be a total kludge. And, we might also be able to lock down WAL
compatibility, which would be generally useful.

-- 
Peter Geoghegan



Re: Dealing with collation and strcoll/strxfrm/etc

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > If we're going to talk about minimum requirements, I'd like to argue
> > that we require whatever system we're using to have versioning (which
> > glibc currently lacks, as I understand it...) to avoid the risk that
> > indexes will become corrupt when whatever we're using for collation
> > changes.  I'm pretty sure that's already bitten us on at least some
> > RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> > with strcoll vs. strxfrm.
>
> I totally agree that anything we should adopt should support
> versioning. Glibc does have a non-standard versioning scheme, but we
> don't use it. Other stdlibs may do versioning another way, or not at
> all. A world in which ICU is the defacto standard for Postgres (i.e.
> the actual standard on all major platforms), we mostly just have one
> thing to target, which seems like something to aim for.

Having to figure out how each and every stdlib does versioning doesn't
sound fun, I certainly agree with you there, but it hardly seems
impossible.  What we need, even if we look to move to ICU, is a place to
remember that version information and a way to do something when we
discover that we're now using a different version.

I'm not quite sure what the best way to do that is, but I imagine it
involves changes to existing catalogs or perhaps even a new one.  I
don't have any particularly great ideas for existing releases (maybe
stash information in the index somewhere when it's rebuilt and then
check it and throw an ERROR if they don't match?)
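
The "stash the version and complain on mismatch" idea could look roughly like the following Python sketch. Everything here is hypothetical: IndexMeta, check_index, reindex, and the version strings are invented for illustration and do not correspond to any existing PostgreSQL catalog or API.

```python
from dataclasses import dataclass, replace


class CollationVersionMismatch(Exception):
    pass


@dataclass(frozen=True)
class IndexMeta:
    name: str
    collation: str
    collation_version: str  # recorded when the index was built or rebuilt


def check_index(meta: IndexMeta, current_versions: dict) -> None:
    # current_versions maps collation name -> version reported by the
    # collation library right now (hypothetical lookup).
    current = current_versions.get(meta.collation)
    if current != meta.collation_version:
        raise CollationVersionMismatch(
            f"index {meta.name!r} was built with {meta.collation} "
            f"version {meta.collation_version}, library now reports {current}"
        )


def reindex(meta: IndexMeta, current_versions: dict) -> IndexMeta:
    # Rebuilding the index re-stamps it with the current version.
    return replace(meta, collation_version=current_versions[meta.collation])


meta = IndexMeta("users_name_idx", "en_US", "2.12")
try:
    check_index(meta, {"en_US": "2.17"})
except CollationVersionMismatch as exc:
    print("ERROR:", exc)
    meta = reindex(meta, {"en_US": "2.17"})
```

The hard part, as noted above, is where the recorded version lives and how the current one is obtained; the sketch only shows the comparison step.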

> The question is only how we deal with this when it happens. One thing
> that's attractive about ICU is that it makes this explicit, both for
> the logical behavior of a collation, as well as the stability of
> binary sort keys (Glibc's versioning seemingly just does the former).
> So the equivalent of strxfrm() output has license to change for
> technical reasons that are orthogonal to the practical concerns of
> end-users about how text sorts in their locale. ICU is clear on what
> it takes to make binary sort keys in indexes work. And various major
> database systems rely on this being right.

There seems to be some disagreement about whether ICU provides the
information we'd need to make a decision or not.  It seems like it
would, given its usage in other database systems, but if so, we need to
very clearly understand exactly how it works and how we can depend on
it.

> > Regarding key abbreviation and performance, if we are confident that
> > strcoll and strxfrm are at least independently internally consistent
> > then we could consider offering an option to choose between them.
>
> I think they just need to match, per the standard. After all,
> abbreviation will sometimes require strcoll() tie-breakers.

Ok, I didn't see that in the man-pages.  If that's the case then it
seems like there isn't much hope of just using strxfrm().

Thanks!

Stephen

Re: Dealing with collation and strcoll/strxfrm/etc

From
Peter Geoghegan
Date:
On Mon, Mar 28, 2016 at 12:36 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Having to figure out how each and every stdlib does versioning doesn't
> sound fun, I certainly agree with you there, but it hardly seems
> impossible.  What we need, even if we look to move to ICU, is a place to
> remember that version information and a way to do something when we
> discover that we're now using a different version.

I think that the versioning situation is all over the place. It isn't
in the C standard. And there are many different versions of many
different stdlibs to support. Most importantly, where support
nominally exists, a strong incentive to get it exactly right may not.
We've seen that already.

> I'm not quite sure what the best way to do that is, but I imagine it
> involves changes to existing catalogs or perhaps even a new one.  I
> don't have any particularly great ideas for existing releases (maybe
> stash information in the index somewhere when it's rebuilt and then
> check it and throw an ERROR if they don't match?)

I think we'd need to introduce an abstraction like a "collation
provider", of which ICU would theoretically be just one. The OS would
be a baked-in collation provider. Everything that works today would
continue to work. We'd then largely just be grandfathering out systems
that rely on OS locales across major version upgrades, since the vast
majority of users are happy with Unicode, and have no cultural or
technical reason to prefer the OS locales that I can think of.
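
One way to picture such a "collation provider" abstraction is the Python sketch below. It is purely illustrative: the class and function names are invented, and the only concrete provider shown is a trivial code-point-order one rather than an OS or ICU binding.

```python
from abc import ABC, abstractmethod


class CollationProvider(ABC):
    name: str

    @abstractmethod
    def version(self, collation: str) -> str:
        """Version of the collation rules, to be stored with each index."""

    @abstractmethod
    def sort_key(self, collation: str, s: str) -> bytes:
        """Binary sort key, the analogue of strxfrm() output."""

    def compare(self, collation: str, a: str, b: str) -> int:
        # Deriving the comparison from the sort key keeps the two
        # consistent by construction.
        ka, kb = self.sort_key(collation, a), self.sort_key(collation, b)
        return (ka > kb) - (ka < kb)


class CodepointProvider(CollationProvider):
    # Hypothetical built-in provider: UTF-8 byte order, i.e. code point order.
    name = "builtin"

    def version(self, collation: str) -> str:
        return "1"

    def sort_key(self, collation: str, s: str) -> bytes:
        return s.encode("utf-8")


PROVIDERS = {}


def register(provider: CollationProvider) -> None:
    PROVIDERS[provider.name] = provider


def lookup(name: str) -> CollationProvider:
    return PROVIDERS[name]


register(CodepointProvider())
p = lookup("builtin")
print(p.compare("C", "a", "b"), p.version("C"))
```

An OS-backed or ICU-backed provider would implement the same three operations, and the stored version() value is what a later upgrade check would compare against.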

I am unconvinced by the idea that it especially matters that sort(1)
might not be in agreement with Postgres. Neither is any Java app, or
any .Net app, or the user's web browser in the case of Safari or
Google Chrome (maybe others). I want Postgres to be consistent with
Postgres, across different nodes on the network, in environments where
I may have little knowledge of the underlying OS. Think "sort pushdown
in postgres_fdw".

Users from certain East Asian user communities might prefer to stick
with regional encodings, perhaps due to specific concerns about the
Han Unification controversy. But I'm pretty sure that these users have
very low expectations about collations in Postgres today. I was
recently told that collating Japanese is starting to get a bit better,
due to various new initiatives, but that most experienced Japanese
Postgres DBAs tend to use the "C" collation.

I don't want to impose a Unicode monoculture on anyone. But I do think
there are clear benefits for the large majority of users that always
use Unicode. Nothing needs to break that works today to make this
happen. Abbreviated keys provide an immediate incentive for users to
adopt ICU; users that might otherwise be on the fence about it.

>> The question is only how we deal with this when it happens. One thing
>> that's attractive about ICU is that it makes this explicit, both for
>> the logical behavior of a collation, as well as the stability of
>> binary sort keys (Glibc's versioning seemingly just does the former).
>> So the equivalent of strxfrm() output has license to change for
>> technical reasons that are orthogonal to the practical concerns of
>> end-users about how text sorts in their locale. ICU is clear on what
>> it takes to make binary sort keys in indexes work. And various major
>> database systems rely on this being right.
>
> There seems to be some disagreement about if ICU provides the
> information we'd need to make a decision or not.  It seems like it
> would, given its usage in other database systems, but if so, we need to
> very clearly understand exactly how it works and how we can depend on
> it.

It seems likely that it exposes the information required to make what
we need to do practical.

Certainly, adopting ICU is a big project that we should proceed
cautiously with, but there is a reason why every other major database
system uses either ICU, or a library based on UCA [1] that allows the
system to centrally control versioned collations (SQLite just makes
this optional).

I think that ICU *could* still tie us to the available collations on
an OS (those collations that are available with their ICU packages).
What I haven't figured out yet is if it's practical to install
versions that are available from some central location, like the CLDR
[2]. I don't think we'd want to have Postgres ship "supported
collations" in each major version, in roughly the style of the IANA
timezone stuff, but it's far too early to rule that out. It would have
upsides.

[1] https://en.wikipedia.org/wiki/Unicode_collation_algorithm
[2] http://cldr.unicode.org/
-- 
Peter Geoghegan



Re: Dealing with collation and strcoll/strxfrm/etc

From
Oleg Bartunov
Date:


On Mon, Mar 28, 2016 at 5:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
> All,
>
> Changed the thread name (we're no longer talking about release
> notes...).
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Oleg Bartunov <obartunov@gmail.com> writes:
> > > Should we start thinking about ICU ?
> >
> > Isn't it still true that ICU fails to meet our minimum requirements?
> > That would include (a) working with the full Unicode character range
> > (not only UTF16) and (b) working with non-Unicode encodings.  No doubt
> > we could deal with (b) by inserting a conversion, but that would take
> > a lot of shine off the performance numbers you mention.
> >
> > I'm also not exactly convinced by your implicit assumption that ICU is
> > bug-free.
>
> We have a wiki page about ICU.  I'm not sure that it's current, but if
> it isn't and people are interested then perhaps we should update it:
>
> https://wiki.postgresql.org/wiki/Todo:ICU

Good point, I forgot about this page.

> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes.  I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.

Agreed.

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.
> We'd need to identify what each index was built with to do so, however,
> as they would need to be rebuilt if the choice changes, at least
> until/unless they're made to reliably agree.  Even using only one or the
> other doesn't address the versioning problem though, which is a problem
> for all currently released versions of PG and is just going to continue
> to be an issue.

Ideally, we should benchmark all locales on all platforms for all kinds of indexes. But that's a big project.

> Thanks!
>
> Stephen

Re: Draft release notes for next week's releases

From
Oleg Bartunov
Date:


On Mon, Mar 28, 2016 at 1:36 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
> Oleg Bartunov-2 wrote
> > But still, icu provides us abbreviated keys and collation stability,
>
> Does including ICU mean that collation handling is identical across platforms?
> E.g. a query on Linux involving string comparison would yield the same
> result on MacOS and Windows?

Yes, it does and that's the most important issue for us.

> If that is the case I'm all for it.
>
> Currently the different behaviour in handling collation aware string
> comparisons is a bug in my eyes from a user's perspective. I do understand
> and can accept the technical reasons for that, but it still feels odd that a
> query yields different results (with identical data) just because it runs on
> a different platform.

Re: Draft release notes for next week's releases

From
Oleg Bartunov
Date:


On Mon, Mar 28, 2016 at 6:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 28, 2016 at 10:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also not exactly convinced by your implicit assumption that ICU is
> bug-free.
 

Noah spent some time looking at ICU back when he was EnterpriseDB, and
his conclusion was that ICU collations weren't stable across releases,
which is pretty much the same problem we're running into with glibc
collations.  Now it might still be true that they have the equivalent
of strxfrm() and strcoll() and that those things behave consistently
with each other, and that would be very good.  Everybody seems to
agree it's faster, and that's good, too.  But I wonder what we do
about the fact that, as with glibc, an ICU upgrade involves a REINDEX
of every potentially affected index.  It seems like ICU has some
facilities built into it that might be useful for detecting and
handling such situations, but I don't understand them well enough to
know whether they'd solve our versioning problems or how effectively
they would do so, and I think there are packaging issues that tie into
it, too.  http://userguide.icu-project.org/design mentions building
with specific configure flags if you need to link with multiple server
versions, and I don't know what operating system packagers typically
do about that stuff.
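
To make the versioning concern concrete, here is a minimal sketch of one way a library upgrade could be detected: record a collation version string with each index when it is built, then compare it with the version the library reports now. Everything here is an assumption for illustration only. The function name, the index names, and the version strings are made up, and how the current version would be obtained from ICU or glibc is outside the sketch.

```python
from typing import Dict, List


def stale_indexes(recorded: Dict[str, str], current_version: str) -> List[str]:
    """Return the names of indexes built under a different collation version."""
    return sorted(
        name for name, version in recorded.items() if version != current_version
    )


# Hypothetical catalog: index name -> collation version recorded at build time.
recorded_versions = {
    "users_name_idx": "153.14",
    "orders_note_idx": "153.80",
}

# Hypothetical version string reported by the collation library today.
to_reindex = stale_indexes(recorded_versions, "153.80")
print(to_reindex)
```

Any index whose recorded version differs from the current one would be a REINDEX candidate. The sketch says nothing about whether a given version bump actually changed the sort order.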

In any case, I agree that we'd be very unwise to think that ICU is
necessarily going to be bug-free without testing it carefully.

Agreed.

In another thread I wrote:
"Ideally, we should benchmark all locales on all platforms for all kinds of indexes. But that's a big project."
 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Draft release notes for next week's releases

From
Teodor Sigaev
Date:
>     Does including ICU mean that collation handling is identical across platforms?
>     E.g. a query on Linux involving string comparison would yield the same
>     result on MacOS and Windows?
> Yes, it does and that's the most important issue for us.

Yes, exactly. The attached patch adds support for libicu via the configure
flag --with-icu. The patch is rebased to current HEAD; I hope it works.

It's based on the work at https://people.freebsd.org/~girgen/postgresql-icu/readme.html,
and it was migrated to 9.5 with abbreviated keys support.
In its current state the patch is not ready to commit, of course.


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Sat, Mar 26, 2016 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Probably the most discussion-worthy item is whether we can say
> anything more about the strxfrm mess.  Should we make a wiki
> page about that and have the release note item link to it?

I just noticed that the release notes mention char(n) as affected.
That's not actually true, because char(n) SortSupport only came in
9.6. The Wiki page now shows this, which may be the most important
place, but ideally we'd fix this in the release notes. I guess it's
too late.

-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I just noticed that the release notes mention char(n) as affected.
> That's not actually true, because char(n) SortSupport only came in
> 9.6. The Wiki page now shows this, which may be the most important
> place, but ideally we'd fix this in the release notes. I guess it's
> too late.

Well, too late for 9.5.2 anyway.  It still makes sense to correct that
text for future releases.  I'm inclined to wait a little bit though and
see what other improvements become apparent.  For instance, I think the
point about non-first index columns not being affected is of greater
weight than you seem to place on it.
        regards, tom lane



Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Thu, Mar 31, 2016 at 2:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, too late for 9.5.2 anyway.  It still makes sense to correct that
> text for future releases.  I'm inclined to wait a little bit though and
> see what other improvements become apparent.  For instance, I think the
> point about non-first index columns not being affected is of greater
> weight than you seem to place on it.

The SQL query on the Wiki page does the right thing there now, so
users will have the benefit of not unnecessarily reindexing when text
was not the leading/first pg_index attribute. We have that covered, I
suppose, because everyone will look to the Wiki page for guidance.
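
The filtering rule described above can be sketched as follows. This is not the Wiki query, only an illustration of the logic under stated assumptions: IndexInfo is a hypothetical stand-in for a row derived from pg_index, the affected types are assumed to be text and varchar (char(n) is excluded, per the correction earlier in this thread), and indexes whose leading column uses the C or POSIX collation are assumed to be safe.

```python
from dataclasses import dataclass
from typing import List

# Assumption: types whose 9.5 sort support relied on strxfrm()-based
# abbreviated keys. char(n) ("bpchar") is deliberately absent.
AFFECTED_TYPES = {"text", "varchar"}

# Assumption: these collations never go through strxfrm(), so they are safe.
SAFE_COLLATIONS = {"C", "POSIX"}


@dataclass
class IndexInfo:
    """Hypothetical stand-in for one index as described by pg_index."""
    name: str
    column_types: List[str]  # in index column order
    leading_collation: str   # collation of the first column


def needs_reindex(idx: IndexInfo) -> bool:
    """Only the leading column matters; later columns are ignored."""
    leading_type = idx.column_types[0]
    return (
        leading_type in AFFECTED_TYPES
        and idx.leading_collation not in SAFE_COLLATIONS
    )


indexes = [
    IndexInfo("users_name_idx", ["text", "int4"], "en_US"),
    IndexInfo("orders_id_note_idx", ["int4", "text"], "en_US"),
    IndexInfo("codes_idx", ["bpchar"], "en_US"),
    IndexInfo("tags_c_idx", ["text"], "C"),
]
candidates = [idx.name for idx in indexes if needs_reindex(idx)]
print(candidates)
```

Only the first index qualifies here: the second has text in a non-leading position, the third is char(n), and the fourth uses the C collation.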

I also noted quite a few non-obvious safe cases on the Wiki page, as
pointed out already over on the other thread.

-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Tue, Mar 29, 2016 at 5:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> It's based on https://people.freebsd.org/~girgen/postgresql-icu/readme.html
> work, and it was migrated to 9.5 with abbreviated keys support.
> Patch in current state is not ready to commit, of course.

Cool.

Some quick observations on this:

* We need to have a strxfrm_l_icu(), not just a strxfrm_icu(). That seems easy.

* We should look into using the ucol_nextSortKeyPart() API:

http://userguide.icu-project.org/collation/architecture#TOC-Partial-sort-keys

I think that this could be a lot faster, because we only need a part
of the collation tables in CPU cache during the generation of
abbreviated keys. There is an optimization described at a low level
here:

https://github.com/icu-project/icu4c/blob/bbd17a792336de5873550794f8304a4b548b0663/source/i18n/collationkeys.cpp#L337

I think this could make our special strxfrm() (which only actually
needs 8 bytes for abbreviated keys) a lot faster. I'd be interested to
see how your Russian text example does with this extra optimization.
We should not be surprised that this kind of support exists within
ICU, because abbreviated keys are actually quite an old idea.
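
For readers unfamiliar with the idea, here is a minimal sketch of abbreviated-key comparison. It is not PostgreSQL's implementation and does not use ICU. It uses Python's locale.strxfrm() and locale.strcoll() as stand-ins, pins the collation to "C" so the result is deterministic, and treats the UTF-8 encoding of the transformed string as the key blob (an assumption made only for this illustration). The names abbreviated_key and compare are hypothetical.

```python
import locale
from functools import cmp_to_key

# Pin the collation so the example is deterministic.
locale.setlocale(locale.LC_COLLATE, "C")

ABBREV_LEN = 8  # the thread notes that only 8 bytes of the key are needed


def abbreviated_key(s: str) -> bytes:
    """Return the first ABBREV_LEN bytes of the strxfrm() key, zero-padded."""
    full_key = locale.strxfrm(s).encode("utf-8", "surrogatepass")
    return full_key[:ABBREV_LEN].ljust(ABBREV_LEN, b"\0")


def compare(a: str, b: str) -> int:
    """Compare abbreviated keys first; resolve ties with a full strcoll()."""
    ka, kb = abbreviated_key(a), abbreviated_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    result = locale.strcoll(a, b)
    return (result > 0) - (result < 0)


words = ["postgresql-hackers", "postgres", "postgresql", "icu", "glibc"]
by_abbrev = sorted(words, key=cmp_to_key(compare))
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))
print(by_abbrev == by_strcoll)
```

The two orderings agree only because strxfrm() and strcoll() are consistent with each other in this locale. When they disagree, the cheap key comparison and the full comparison give different answers, which is the failure this thread is about.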

-- 
Peter Geoghegan



Re: Draft release notes for next week's releases

From
Peter Geoghegan
Date:
On Thu, Apr 14, 2016 at 4:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * We should look into using the ucol_nextSortKeyPart() API:
>
> http://userguide.icu-project.org/collation/architecture#TOC-Partial-sort-keys

Another, richer API we could immediately put to good use is the ICU
strcoll() variant that does not require NUL-terminated strings:

https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a3abc6779e6452106415918199308fab4

We do not use a NUL byte for terminating text data, and so must copy
its contents into a temp buffer, or array on the stack, all rather
inefficiently. Robert has expressed an interest in an API like this
strcoll() variant in the past [1], to avoid this unnecessary overhead.

[1] http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html
-- 
Peter Geoghegan