Re: String comparison problem in select - too many results - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: String comparison problem in select - too many results
Date
Msg-id 20180114174735.7zgnvuftlpnn5hyu@hjp.at
Whole thread Raw
In response to Re: String comparison problem in select - too many results  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-general
On 2018-01-14 13:20:05 +0100, Francisco Olarte wrote:
> On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> >> C collation is like sorting raw bytes, it doesn't event sort
> >> upper/lower case correctly
> >
> > Now you are falling into the same trap as Durumdara, calling an
> > unintended sort order "not correct" ;-).
>
> Well, had you quoted / read a little farther:

Sorry. I had hoped that adding a winking smiley would make it clear that
this was just a friendly jab, not a serious criticism.

> >>
> C collation is like sorting raw bytes, it doesn't event sort
> upper/lower case correctly ( Do not know how you do it in HU, but in
> ES we sort aA before bB, while C locale normally sorts AB..ab.. It's
> what non-locale aware programs use because it's dirt cheap to
> implement, just sort the data as unsigned byte arrays
> lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that
> this works too ( You can sort by raw bytes and you get the data sorted
> lexicographically sorted by code points ).
> <<
>
> You'll see I was trying to define "correctly" somehow. English is not
> my native language and it seems I obviously failed at it, I'll try to
> do it better next time.

English isn't my native language either, so the failure may be on my
side (actually, I don't think a shared native language is a guarantee
for successful communication either).

I did read that and I didn't assume that you thought that there is one
and only one correct way to sort for each language, but I did think it
was slightly amusing that you used the word "correct" after berating
Durumdara for using the word "wrong".

Anyway, what I wanted to communicate is that the correct sort order
depends on the application. I agree that case should almost never be a
primary criterium, but even there might be some exceptions (I prefer
C collation for filenames, because it puts Changes, README and TODO at
the front, but I'm aware that this is mostly because I started to use
Unix in the 80's). But punctuation, spaces, ... those might have to be
treated very differently.

And so I'm not very happy with opaque collation identifiers like
"de_AT.UTF-8". What does that mean? Who decides what the correct sort
order is in Austria, and is this even the same on Linux and Windows?

Often the details don't matter. Whether digits are sorted before or
after letters, whether punctuation is ignored or considered (and if the
latter, how), as long as the order is internally consistent and not too
far off the users' expectations, the users can deal with it (and
probably won't even notice that the order is slightly different in
say the company directory and the city's phonebook). But sometimes such
details do matter and then you have to explicitely order items.


> > C collation is certainly not what a "normal user" expects. It is
> > therefore wrong for many applications (e.g., you couldn't use it to
> > sort a telephone book), but it might be correct for others (e.g., it you
> > need a stable, unambiguous sort order but don't care much about the
> > order itself).
>
> stable / unambiguous is shared by a la lot of collation methods, but I
> see what you try to say.

I'm worried that something like "de_AT.UTF-8" is not stable. Somebody
might decide that ignoring whitespace wasn't such a good idea after all
and "fix" it.

Unicode TR#10 actually warns about this:

| Collation order is not fixed.
|
|   Over time, collation order will vary: there may be fixes needed as
|   more information becomes available about languages; there may be new
|   government or industry standards for the language that require
|   changes; and finally, new characters added to the Unicode Standard
|   will interleave with the previously-defined ones. This means that
|   collations must be carefully versioned.

... and if I remember correctly there have been cases where PostgreSQL
indexes where unusable after an upgrade because the collation had
changed.

> I've found sort more and more surprissing since I started ( with the
> electromechanical IBM card sorters, those )

I fully agree with this.

> > Which brings us back to Durumdara's example: I don't know his
> > application, so I don't know what "normal users" of his application
> > would expect, but to me those values look like two numbers separated by
> > a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> > '180/2010', but of course the C collation doesn't do that:
>
> He does not seem to want this.

He didn't mention it. We don't know yet whether he doesn't want it or
just didn't think of it yet :-).

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

pgsql-general by date:

Previous
From: Andreas Joseph Krogh
Date:
Subject: Sv: Re: Is ORDER BY in sub-query preserved when outer query is only projection?
Next
From: Vick Khera
Date:
Subject: Re: psql format result as markdown tables