Re: String comparison problem in select - too many results - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: String comparison problem in select - too many results
Date
Msg-id 20180114111407.dxtp2d3ip45oc2ju@hjp.at
Whole thread Raw
In response to Re: String comparison problem in select - too many results  (Francisco Olarte <folarte@peoplecall.com>)
Responses Re: String comparison problem in select - too many results  (Francisco Olarte <folarte@peoplecall.com>)
List pgsql-general
On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> C collation is like sorting raw bytes, it doesn't event sort
> upper/lower case correctly

Now you are falling into the same trap as Durumdara, calling an
unintended sort order "not correct" ;-).

C collation is certainly not what a "normal user" expects. It is
therefore wrong for many applications (e.g., you couldn't use it to
sort a telephone book), but it might be correct for others (e.g., it you
need a stable, unambiguous sort order but don't care much about the
order itself).

(By coincidence, I also stumbled over the fact that the en_US.UTF-8
collation ignores punctuation characters and spaces at least in the
first pass - took me a minute or so to figure out why I got an
"inconsistent" (read: unexpected and not obvious for me) sort order.)

Lexicographers are interested in sorting single words. Therefore they
aren't interested in punctuation. They may also not be very interested
in accents, because most languages have few words which differ only by
an accent (and it the dictionary is printed on paper, the user will
easily be able to scan up and down a few centimeters and find the right
entry without knowing whether "à" is sorted before or after "á").

Somebody preparing an address list should care about punctuation: You
would probably not expect to find "Smith-Jones, Amanda" between "Smith,
John" and "Smith, Julia". And you probably want to sort "23 Main Street"
before "180 Main Street".

Which brings us back to Durumdara's example: I don't know his
application, so I don't know what "normal users" of his application
would expect, but to me those values look like two numbers separated by
a slash. So I would expect '23/4567' to be sorted between '18/0212' and
'180/2010', but of course the C collation doesn't do that:

=> select * from foo order by t collate "C";
╔══════════╗
║    t     ║
╟──────────╢
║ 18/0113  ║
║ 18/0212  ║
║ 180/2010 ║
║ 23/4567  ║
╚══════════╝
(4 rows)

It might be possible to define a custom collation for that, but in a
case like this I would either split this field into two integer fields
or use a function-based index.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

pgsql-general by date:

Previous
From: "Peter J. Holzer"
Date:
Subject: Re: Sv: Re: Sv: Re: Sv: Re: Sv: Re: data-checksums
Next
From: Andreas Joseph Krogh
Date:
Subject: Is ORDER BY in sub-query preserved when outer query is only projection?