Thread: Losing my latin on Ordering...

Losing my latin on Ordering...

From

Dominique Devienne

Date:

14 February 2023, 09:31:00

Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.

We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get

does not make sense to me. The same prefix can be sorted differently based on the suffix apprently, which doesn't make any sense to me.

Surely sorting should be "constant left-to-right", no? What are we missing?

I'm already surprised (star) comes before (space), when the latter "comes before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo " prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

For now we can work-around this by explicitly adding the `collate "C"` on the queries underlying that particular test, but that would be wrong in the general case of international strings to sort, so I'd really like to understand what's going on.

Thanks, --DD

PS: if I try "en_US.UTF-8" or "en_US"."UTF-8" for the collate, it fails.

How come what pg_database.datcollate displays is not a valid value for collate?

PPS: We tried on v12 and v13 I believe. Somehow my v14.2 on Windows doesn't have en_US as a collation...

ddevienne=> with t(v) as (values ('Foo All'), ('Foo*'), ('Foo Brief'), ('Foo*All')) select '<'||v||'>', v::bytea from t order by v collate "en_US";
?column? | v
-------------+----------------------
<Foo*> | \x466f6f2a
<Foo All> | \x466f6f20416c6c
<Foo*All> | \x466f6f2a416c6c
<Foo Brief> | \x466f6f204272696566
(4 rows)

ddevienne=> with t(v) as (values ('Foo All'), ('Foo*'), ('Foo Brief'), ('Foo*All')) select '<'||v||'>', v::bytea from t order by v collate "C";
?column? | v
-------------+----------------------
<Foo All> | \x466f6f20416c6c
<Foo Brief> | \x466f6f204272696566
<Foo*> | \x466f6f2a
<Foo*All> | \x466f6f2a416c6c
(4 rows)

Re: Losing my latin on Ordering...

From

Laurenz Albe

Date:

14 February 2023, 10:23:19

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.
>
> We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get
> does not make sense to me. The same prefix can be sorted differently based on
> the suffix apprently, which doesn't make any sense to me.
>
> Surely sorting should be "constant left-to-right", no? What are we missing?

No, it isn't.  That's not how natural language collations work.

They typically use different levels of comparison: first, strings are sorted
according to base character, ignoring accents, case and punctuation.
Wherever that comparison is equal, the next level is used (typically accents),
then the next (case), and so on.

> I'm already surprised (star) comes before (space), when the latter "comes
> before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

Because they compare identical on the first three levels.  Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

> For now we can work-around this by explicitly adding the `collate "C"` on
> the queries underlying that particular test, but that would be wrong in the
> general case of international strings to sort, so I'd really like to understand
> what's going on.

Yes, it soulds like the "C" collation may be best for you.  That is, if you don't
mind that "Z" < "a".

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Losing my latin on Ordering...

From

Dominique Devienne

Date:

14 February 2023, 11:17:47

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.
>
> We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get
> does not make sense to me. The same prefix can be sorted differently based on
> the suffix apprently, which doesn't make any sense to me.
>
> Surely sorting should be "constant left-to-right", no? What are we missing?

No, it isn't. That's not how natural language collations work.

Honestly, who expects the same prefix to sort differently based on what comes after, in left-to-right languages?

How does one even find out what the (capricious?) rules for sorting in a given collation are?

I'm aware of sorting taking numerical numbers in text influencing sort, so "Foo10" comes after "Foo9",

but that's not what we are discussing here. "Foo*" and "Foo " have no logical relatioship, like 9 and 10 do.

> I'm already surprised (star) comes before (space), when the latter "comes
> before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

Because they compare identical on the first three levels. Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

That's completely unintuitive...

> For now we can work-around this by explicitly adding the `collate "C"` on
> the queries underlying that particular test, but that would be wrong in the
> general case of international strings to sort, so I'd really like to understand
> what's going on.

Yes, it soulds like the "C" collation may be best for you. That is, if you don't
mind that "Z" < "a".

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?

I.e. it will be codepoint ordered OK?

Re: Losing my latin on Ordering...

From

Alvaro Herrera

Date:

14 February 2023, 11:35:16

On 2023-Feb-14, Dominique Devienne wrote:

> Honestly, who expects the same prefix to sort differently based on what
> comes after, in left-to-right languages?

Look, we don't define the collation rules.  We just grab the collation
rules defined by experts in collations.  In this case the experts have
advised the glibc developers to write collations this way; but even if
you went further and looked at the ICU libraries, you would find that
they have pretty much the same definition.

> How does one even find out what the (capricious?) rules for sorting in a
> given collation are?

I suggest to look at a telephone book carefully sometime (provided you
can find one ... apparently nobody wants them anymore.)

> So the "C" collation is fine with general UTF-8 encoding?
> I.e. it will be codepoint ordered OK?

Sure, just make sure to use the definition of C that uses UTF-8 encoding
(I think it's typically called C.UTF-8).

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Losing my latin on Ordering...

From

Laurenz Albe

Date:

14 February 2023, 11:40:46

On Tue, 2023-02-14 at 12:17 +0100, Dominique Devienne wrote:
> On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > > Surely sorting should be "constant left-to-right", no? What are we missing?
> >
> > No, it isn't.  That's not how natural language collations work.
>
> Honestly, who expects the same prefix to sort differently based on what comes
> after, in left-to-right languages?
> How does one even find out what the (capricious?) rules for sorting in a given
> collation are?

Look at the documentation / implementation.

As far as ICU is concerned, here: https://unicode.org/reports/tr10/

> > > I'm already surprised (star) comes before (space), when the latter "comes
> > > before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
> > > prefixed pairs are not clustered after sorting is just mistifying to me. So how come?
> >
> > Because they compare identical on the first three levels.  Any difference in
> > letters, accents or case weighs stronger, even if it occurs to the right
> > of these substrings.
>
> That's completely unintuitive...

Well, you can complain to GNU and the Unicode consortium, but that's pretty
much the way it is.

> > Yes, it soulds like the "C" collation may be best for you.  That is, if you don't
> > mind that "Z" < "a".
>
> I would mind if I asked for case-insensitive comparisons.
>
> So the "C" collation is fine with general UTF-8 encoding?
> I.e. it will be codepoint ordered OK?

Yes, exactly.

Yours,
Laurenz Albe

Re: Losing my latin on Ordering...

From

Dominique Devienne

Date:

14 February 2023, 12:06:18

On Tue, Feb 14, 2023 at 12:35 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Feb-14, Dominique Devienne wrote:
> Honestly, who expects the same prefix to sort differently based on what
> comes after, in left-to-right languages?

Look, we don't define the collation rules.

Ok, ok, sorry. To you, Laurenz, and everyone.

I obviously disagree with these rules, but I'm a nobody, so who cares :)

> So the "C" collation is fine with general UTF-8 encoding?
> I.e. it will be codepoint ordered OK?

Sure, just make sure to use the definition of C that uses UTF-8 encoding
(I think it's typically called C.UTF-8).

OK, so for new DBs, sounds like we need to

CREATE DATABASE ... WITH LOCALE 'C.UTF-8' ENCODING UTF8

Correct?

But what about existing DBs? Can the collation be changed a posteriori?

ALTER DATABASE does not seem to support the same options.

We don't want to have to sprinkle COLLATE "C" all over the place in the code.

And there are quite a few DBs out there already. What to do about them?

Re: Losing my latin on Ordering...

From

Laurenz Albe

Date:

14 February 2023, 12:27:53

On Tue, 2023-02-14 at 13:06 +0100, Dominique Devienne wrote:
> > Sure, just make sure to use the definition of C that uses UTF-8 encoding
> > (I think it's typically called C.UTF-8).
>  
> OK, so for new DBs, sounds like we need to
>
> CREATE DATABASE ... WITH LOCALE 'C.UTF-8' ENCODING UTF8 
>
> Correct?

Collations are identifiers, so it has to be double quotes.
The name depends on the operating system; if that is Unix-like,
you can run "locale -a" to get all available locales.

On my system it would be

CREATE DATABASE x TEMPLATE template0 LOCALE "C.utf8" ENCODING UTF8;

> But what about existing DBs? Can the collation be changed a posteriori?
> ALTER DATABASE does not seem to support the same options.
>
> We don't want to have to sprinkle COLLATE "C" all over the place in the code.
> And there are quite a few DBs out there already. What to do about them?

The only option is dump/restore.

A changed collation means changed indexes, so there is no better option.

Yours,
Laurenz Albe