Re: String comparison problem in select - too many results - Mailing list pgsql-general
From | Peter J. Holzer |
---|---|
Subject | Re: String comparison problem in select - too many results |
Date | |
Msg-id | 20180114174735.7zgnvuftlpnn5hyu@hjp.at Whole thread Raw |
In response to | Re: String comparison problem in select - too many results (Francisco Olarte <folarte@peoplecall.com>) |
List | pgsql-general |
On 2018-01-14 13:20:05 +0100, Francisco Olarte wrote: > On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote: > > On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote: > >> C collation is like sorting raw bytes, it doesn't event sort > >> upper/lower case correctly > > > > Now you are falling into the same trap as Durumdara, calling an > > unintended sort order "not correct" ;-). > > Well, had you quoted / read a little farther: Sorry. I had hoped that adding a winking smiley would make it clear that this was just a friendly jab, not a serious criticism. > >> > C collation is like sorting raw bytes, it doesn't event sort > upper/lower case correctly ( Do not know how you do it in HU, but in > ES we sort aA before bB, while C locale normally sorts AB..ab.. It's > what non-locale aware programs use because it's dirt cheap to > implement, just sort the data as unsigned byte arrays > lexicographically. And in fact, IIRC, utf-8 is dessigned in a way that > this works too ( You can sort by raw bytes and you get the data sorted > lexicographically sorted by code points ). > << > > You'll see I was trying to define "correctly" somehow. English is not > my native language and it seems I obviously failed at it, I'll try to > do it better next time. English isn't my native language either, so the failure may be on my side (actually, I don't think a shared native language is a guarantee for successful communication either). I did read that and I didn't assume that you thought that there is one and only one correct way to sort for each language, but I did think it was slightly amusing that you used the word "correct" after berating Durumdara for using the word "wrong". Anyway, what I wanted to communicate is that the correct sort order depends on the application. I agree that case should almost never be a primary criterium, but even there might be some exceptions (I prefer C collation for filenames, because it puts Changes, README and TODO at the front, but I'm aware that this is mostly because I started to use Unix in the 80's). But punctuation, spaces, ... those might have to be treated very differently. And so I'm not very happy with opaque collation identifiers like "de_AT.UTF-8". What does that mean? Who decides what the correct sort order is in Austria, and is this even the same on Linux and Windows? Often the details don't matter. Whether digits are sorted before or after letters, whether punctuation is ignored or considered (and if the latter, how), as long as the order is internally consistent and not too far off the users' expectations, the users can deal with it (and probably won't even notice that the order is slightly different in say the company directory and the city's phonebook). But sometimes such details do matter and then you have to explicitely order items. > > C collation is certainly not what a "normal user" expects. It is > > therefore wrong for many applications (e.g., you couldn't use it to > > sort a telephone book), but it might be correct for others (e.g., it you > > need a stable, unambiguous sort order but don't care much about the > > order itself). > > stable / unambiguous is shared by a la lot of collation methods, but I > see what you try to say. I'm worried that something like "de_AT.UTF-8" is not stable. Somebody might decide that ignoring whitespace wasn't such a good idea after all and "fix" it. Unicode TR#10 actually warns about this: | Collation order is not fixed. | | Over time, collation order will vary: there may be fixes needed as | more information becomes available about languages; there may be new | government or industry standards for the language that require | changes; and finally, new characters added to the Unicode Standard | will interleave with the previously-defined ones. This means that | collations must be carefully versioned. ... and if I remember correctly there have been cases where PostgreSQL indexes where unusable after an upgrade because the collation had changed. > I've found sort more and more surprissing since I started ( with the > electromechanical IBM card sorters, those ) I fully agree with this. > > Which brings us back to Durumdara's example: I don't know his > > application, so I don't know what "normal users" of his application > > would expect, but to me those values look like two numbers separated by > > a slash. So I would expect '23/4567' to be sorted between '18/0212' and > > '180/2010', but of course the C collation doesn't do that: > > He does not seem to want this. He didn't mention it. We don't know yet whether he doesn't want it or just didn't think of it yet :-). hp -- _ | Peter J. Holzer | we build much bigger, better disasters now |_|_) | | because we have much more sophisticated | | | hjp@hjp.at | management tools. __/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Attachment
pgsql-general by date: