Thread: String comparison problem in select - too many results
Dear Members!
Today one of our clients reported an error.
She ran a report filtered by a string range, and she got the wrong documents back.
I checked it.
She filtered by document nr (a varchar field - character varying(15)).
The range was '18/0113' to '18/0212'.
Q.Close;
Q.SQL.Text := 'select * from idoc where nr >= :pfrom and nr <= :pto';
Q.ParamByName('pfrom').AsString := PFrom;
Q.ParamByName('pto').AsString := PTo;
Q.Open;
This Devart TPgQuery uses parameters; the query is prepared before the call.
She said that she got the document with nr = '180/2010' too (and many more).
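With the parameters inlined, the query that ran is effectively this sketch:

select * from idoc where nr >= '18/0113' and nr <= '18/0212';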
First I tried to check this in LibreOffice Calc, because it seemed impossible.
Lesser   | Greater   | Lesser < Greater | Lesser > Greater
'18/0113 | '180/2010 | True             | False
'18/0212 | '180/2010 | True             | False
It's ok.
Then I ran the select in the database.
I also got 180/2010 in the list!!!
Then I tried a simple check with SQL:
select
'18/0113' > '180/2010',
'18/0113' < '180/2010',
'18/0212' > '180/2010',
'18/0212' < '180/2010'
---
f;t;t;f
Whaaaaat????
It's impossible.
First I assumed that "/" is a special character, like the escape "\". But every input string starts with the same characters ("1", "8", "0" and "/", only in slightly different order)!!!
Next I supposed that PostgreSQL converts the expression and the numeric values could differ.
Without doing the calculation, I changed the select to make sure the strings stay strings; I also use replace at the end to filter out the "/" sign:
select
cast('18/0113' as text) > cast('180/2010' as text),
cast('18/0113' as text) < cast('180/2010' as text),
cast('18/0212' as text) > cast('180/2010' as text),
cast('18/0212' as text) < cast('180/2010' as text),
replace(cast('18/0113' as text), '/', '_') > replace(cast('180/2010' as text), '/', '_'),
replace(cast('18/0212' as text), '/', '_') > replace(cast('180/2010' as text), '/', '_')
The result is also wrong:
f;t;t;f;f;t
Please help me with this!
What can cause these differences between similar strings?
How can I force the correct range?
What is the root of the problem?
The PG is 9.4 on Linux; the database encoding and locale are:
ENCODING = 'UTF8'
LC_COLLATE = 'hu_HU.UTF-8'
LC_CTYPE = 'hu_HU.UTF-8'
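These per-database settings can be double-checked with a standard catalog query:

select datname, datcollate, datctype
from pg_database
where datname = current_database();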
Thank you for your help!
Best regards
dd
On Wednesday, January 10, 2018, Durumdara <durumdara@gmail.com> wrote:
> The PG is 9.4 on Linux; the database encoding and locale are:
> ENCODING = 'UTF8'
> LC_COLLATE = 'hu_HU.UTF-8'
> LC_CTYPE = 'hu_HU.UTF-8'
The collation rules for hu_HU.UTF-8 probably pretend symbols don't exist; this is not uncommon. You probably need to use the 'C' collation.
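For example, forcing it per comparison on the original query might look like this (an untested sketch against the idoc table from the first message):

select * from idoc
where nr >= '18/0113' collate "C"
  and nr <= '18/0212' collate "C";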
David J.
Dear David!
I tried it on different servers and different databases.
1.) Windows local PG: LC_COLLATE = 'Hungarian_Hungary.1250' - ok.
2.) Linux remote PG: LC_CTYPE = 'en_US.UTF-8' - also wrong!!! - UTF problem???
3.) Forcing C collation: - ok
4.) Replace '/' to 'A': - ok
select
  replace('18/0113', '/', 'A') > replace('180/2010', '/', 'A'),
  replace('18/0113', '/', 'A') < replace('180/2010', '/', 'A'),
  replace('18/0212', '/', 'A') > replace('180/2010', '/', 'A'),
  replace('18/0212', '/', 'A') < replace('180/2010', '/', 'A')
5.) Replace '/' to space: - wrong
6.) Replace '/' to empty string: wrong
select
  replace('18/0113', '/', '') > replace('180/2010', '/', ''),
  replace('18/0113', '/', '') < replace('180/2010', '/', ''),
  replace('18/0212', '/', '') > replace('180/2010', '/', ''),
  replace('18/0212', '/', '') < replace('180/2010', '/', '')
I tried to simulate this in LibreOffice:

Normal   | Removed | As Space | As 0
18/0113  | 180113  | 18 0113  | 1800113
18/0212  | 180212  | 18 0212  | 1800212
180/2010 | 1802010 | 180 2010 | 18002010
This could happen if the '/' sign is not interpreted (i.e. treated as removed), or is replaced by '0' in the background.
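A quick cross-check of this hypothesis (a sketch; the C collation makes the comparison bytewise): stripping the '/' and comparing the leftovers gives the same f/t pattern that the database gives on the original strings:

select
  (replace('18/0113', '/', '') collate "C") > (replace('180/2010', '/', '') collate "C"),
  (replace('18/0212', '/', '') collate "C") > (replace('180/2010', '/', '') collate "C");
---
f;t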
The main problem is that we have many searches in the programme where we assume this evaluation works, and we have several sites with different servers (and collations).
Hmmm...
Thanks
dd
2018-01-10 16:25 GMT+01:00 David G. Johnston <david.g.johnston@gmail.com>:
> On Wednesday, January 10, 2018, Durumdara <durumdara@gmail.com> wrote:
>> The PG is 9.4 on Linux; the database encoding and locale are:
>> ENCODING = 'UTF8'
>> LC_COLLATE = 'hu_HU.UTF-8'
>> LC_CTYPE = 'hu_HU.UTF-8'
>
> The collation rules for hu_HU.UTF-8 probably pretend symbols don't exist;
> this is not uncommon. You probably need to use the 'C' collation.
>
> David J.
On Thu, Jan 11, 2018 at 9:57 AM, Durumdara <durumdara@gmail.com> wrote:
> I tried it on different servers and different databases.
> 1.) Windows local PG: LC_COLLATE = 'Hungarian_Hungary.1250' - ok.
> 2.) Linux remote PG: LC_CTYPE = 'en_US.UTF-8' - also wrong!!! - UTF
> problem???

Your problem seems to be you consider wrong anything that doesn't
match your expectation.

> 3.) Forcing C collation: - ok
> 4.) Replace '/' to 'A': - ok

... More examples zapped, as they do not prove anything.

> The main problem is that we have many searches in the programme where we
> assume this evaluation works, and we have several sites with different
> servers (and collations).

You must define "good evaluation". In your case it seems you consider
good evaluation to be lexicographical comparison of ASCII char values.
This is called 'C' collation, and you have been pointed to it.

If your "programme" is doing string comparison in the server collation
and you need good comparison, defined as C collation, that is a bug.
Fix it.

I would recommend reading
https://www.postgresql.org/docs/9.6/static/collation.html and related
docs, but someone who so boldly states postgres collations are
good/bad surely knows all about it.

Try to build from this:

with xx(x) as (values ('18/0113'),('18/0212'),('180/2010'))
select x collate "C" from xx order by 1;

Francisco Olarte.
Dear Francesco!
My "bug" is that I commonly used Windows environment where the default collation is ANSI, based on Windows language which is Hungarian here (Windows1250).
But because of special characters we used UTF8 to store data in database.
I supposed that UTF8.hu_HU is working like local natural order here, and the common ASCII chars are (like '/') in same position.
Python/Delphi/LibreOffice can sort these numbers correctly (based on local ANSI sort).
I supposed that UTF8.hu_HU is using mostly same order which is valid here and it contains all ASCII + ANSI characters we are using here in daily work, and they are in very similar order.
I never thought that it can't handle normal characters in 7 bit range...
For these numbers I can use the C collation; it's OK.
Thank you!
dd
2018-01-11 11:11 GMT+01:00 Francisco Olarte <folarte@peoplecall.com>:
On Thu, Jan 11, 2018 at 9:57 AM, Durumdara <durumdara@gmail.com> wrote:
> I tried it on different servers and different databases.
> 1.) Windows local PG: LC_COLLATE = 'Hungarian_Hungary.1250' - ok.
> 2.) Linux remote PG: LC_CTYPE = 'en_US.UTF-8' - also wrong!!! - UTF
> problem???
Your problem seems to be you consider wrong anything that doesn't
match your expectation.
> 3.) Forcing C collation: - ok
> 4.) Replace '/' to 'A': - ok
... More examples zapped, as they do not prove anything.
> The main problem is that we have many searches in the programme where we
> assume this evaluation works, and we have several sites with different
> servers (and collations).
You must define "good evaluation". In your case it seems you consider
good evaluation to be lexicographical comparison of ASCII char values.
This is called 'C' collation, and you have been pointed to it.
If your "programme" is doing string comparison in server collation and
you need good comparison, defined as C collation, that is a bug. Fix
it.
I would recommend reading
https://www.postgresql.org/docs/9.6/static/collation.html and related
docs, but someone who so boldly states postgres collations are
good/bad surely knows all about it.
Try to build from this:
with xx(x) as (values ('18/0113'),('18/0212'),('180/2010'))
select x collate "C" from xx order by 1;
Francisco Olarte.
Durumdara:

On Thu, Jan 11, 2018 at 1:23 PM, Durumdara <durumdara@gmail.com> wrote:
> Dear Francesco!

FrancIsco, with an I, Spanish, not Italian.

> My "bug" is that I have commonly used a Windows environment, where the
> default collation is ANSI, based on the Windows language, which is
> Hungarian here (Windows-1250).

I'm not sure ANSI is a collation in Windows. You should read a bit
about locales, as they are not the same as collations. In fact I
suspect the order issues you've got are not UTF8-derived, as all your
data is ASCII (7 bits), which is represented the same in ANSI and UTF8
or ISO-8859-whatever (IIRC win1250 is a bastard superset of ISO8859-1).

> But because of special characters we used UTF8 to store data in the
> database.
> I supposed that UTF8.hu_HU works like the local natural order here, and
> that the common ASCII chars (like '/') are in the same position.

You must define "local natural order". But your problem is in the
locale, not in the encoding. If you managed to use win1250.hu_HU your
sorting would be the same; what you need is to use UTF8.C.

> Python/Delphi/LibreOffice can sort these numbers correctly (based on
> the local ANSI sort).

ANSI does not define a sort order. And those three use, IIRC, a
C-locale-like sort. You are incorrectly assuming this is the correct
one.

> I supposed that UTF8.hu_HU uses mostly the same order that is valid
> here, that it contains all the ASCII + ANSI characters we use here in
> daily work, and that they are in a very similar order.
> I never thought that it couldn't handle normal characters in the 7-bit
> range...

It can. Your assumptions are BAD. You are not used to working with
collation-aware systems like postgres, and you assume they must use the
same order as non-locale-aware programs by default. This is normally
never true. Also, Windows is notoriously difficult to configure for
locales. IIRC (I haven't used it in 15 years) you had to go to keyboard
preferences to change it.

> For these numbers I can use the C collation; it's OK.

C collation is like sorting raw bytes; it doesn't even sort upper/lower
case correctly. (I don't know how you do it in HU, but in ES we sort aA
before bB, while the C locale normally sorts AB..ab..) It's what
non-locale-aware programs use because it's dirt cheap to implement:
just sort the data as unsigned byte arrays lexicographically. And in
fact, IIRC, UTF-8 is designed in a way that this works too (you can
sort by raw bytes and you get the data lexicographically sorted by code
points).

Another thing: if you have a column which you want sorted in the C
locale, it can be defined as such, so it does it by default (but if you
do it with an ALTER COLUMN or similar you'll probably need a reindex).

As a final note, please DO NOT TOP POST and TRIM YOUR QUOTES.
Especially, I do not need my signature quoted.

Francisco Olarte.
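A minimal sketch of the column-level approach Francisco mentions, assuming the idoc.nr definition from the first message (and, per his note, dependent indexes may need rebuilding):

alter table idoc
  alter column nr type character varying(15) collate "C";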
On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> C collation is like sorting raw bytes; it doesn't even sort
> upper/lower case correctly

Now you are falling into the same trap as Durumdara, calling an
unintended sort order "not correct" ;-).

C collation is certainly not what a "normal user" expects. It is
therefore wrong for many applications (e.g., you couldn't use it to
sort a telephone book), but it might be correct for others (e.g., if
you need a stable, unambiguous sort order but don't care much about the
order itself).

(By coincidence, I also stumbled over the fact that the en_US.UTF-8
collation ignores punctuation characters and spaces, at least in the
first pass - it took me a minute or so to figure out why I got an
"inconsistent" (read: unexpected and not obvious to me) sort order.)

Lexicographers are interested in sorting single words. Therefore they
aren't interested in punctuation. They may also not be very interested
in accents, because most languages have few words which differ only by
an accent (and if the dictionary is printed on paper, the user will
easily be able to scan up and down a few centimeters and find the right
entry without knowing whether "à" is sorted before or after "á").

Somebody preparing an address list should care about punctuation: you
would probably not expect to find "Smith-Jones, Amanda" between "Smith,
John" and "Smith, Julia". And you probably want to sort "23 Main
Street" before "180 Main Street".

Which brings us back to Durumdara's example: I don't know his
application, so I don't know what "normal users" of his application
would expect, but to me those values look like two numbers separated by
a slash. So I would expect '23/4567' to be sorted between '18/0212' and
'180/2010', but of course the C collation doesn't do that:

=> select * from foo order by t collate "C";
╔══════════╗
║    t     ║
╟──────────╢
║ 18/0113  ║
║ 18/0212  ║
║ 180/2010 ║
║ 23/4567  ║
╚══════════╝
(4 rows)

It might be possible to define a custom collation for that, but in a
case like this I would either split this field into two integer fields
or use a function-based index.

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
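A sketch of the function-based-index idea from the last paragraph. It is hypothetical (the index name is made up) and assumes every nr really has the number-slash-number shape, otherwise the int casts fail:

-- index (and order) by the two numeric components of nr
create index idoc_nr_split_idx on idoc (
  (split_part(nr, '/', 1)::int),
  (split_part(nr, '/', 2)::int)
);

select * from idoc
order by split_part(nr, '/', 1)::int,
         split_part(nr, '/', 2)::int;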
On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
>> C collation is like sorting raw bytes; it doesn't even sort
>> upper/lower case correctly
>
> Now you are falling into the same trap as Durumdara, calling an
> unintended sort order "not correct" ;-).

Well, had you quoted / read a little farther:

>>
C collation is like sorting raw bytes; it doesn't even sort upper/lower
case correctly. (I don't know how you do it in HU, but in ES we sort aA
before bB, while the C locale normally sorts AB..ab..) It's what
non-locale-aware programs use because it's dirt cheap to implement:
just sort the data as unsigned byte arrays lexicographically. And in
fact, IIRC, UTF-8 is designed in a way that this works too (you can
sort by raw bytes and you get the data lexicographically sorted by code
points).
<<

You'll see I was trying to define "correctly" somehow. English is not
my native language and it seems I obviously failed at it; I'll try to
do better next time.

> C collation is certainly not what a "normal user" expects. It is
> therefore wrong for many applications (e.g., you couldn't use it to
> sort a telephone book), but it might be correct for others (e.g., if
> you need a stable, unambiguous sort order but don't care much about
> the order itself).

Stable / unambiguous is shared by a lot of collation methods, but I see
what you are trying to say.

> (By coincidence, I also stumbled over the fact that the en_US.UTF-8
> collation ignores punctuation characters and spaces, at least in the
> first pass - it took me a minute or so to figure out why I got an
> "inconsistent" (read: unexpected and not obvious to me) sort order.)

Nearly all the locale-aware sorts do funny things with
punctuation/spaces. I've found sort more and more surprising since I
started (with the electromechanical IBM card sorters, those).

> Lexicographers are interested in sorting single words. Therefore they
> aren't interested in punctuation. They may also not be very interested
> in accents, because most languages have few words which differ only by
> an accent

We have loads of them in Spanish, but they are normally easy, and many
of them come from verb conjugations, which do not go into the
dictionary...

> (and if the dictionary is printed on paper, the user will
> easily be able to scan up and down a few centimeters and find the right
> entry without knowing whether "à" is sorted before or after "á").

And we have none of this; I only know of French doing it.

> Somebody preparing an address list should care about punctuation: you
> would probably not expect to find "Smith-Jones, Amanda" between "Smith,
> John" and "Smith, Julia". And you probably want to sort "23 Main
> Street" before "180 Main Street".
>
> Which brings us back to Durumdara's example: I don't know his
> application, so I don't know what "normal users" of his application
> would expect, but to me those values look like two numbers separated by
> a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> '180/2010', but of course the C collation doesn't do that:

He does not seem to want this. As all his examples use the same prefix,
I think he just wants to extract a small range of keys with a common
prefix. I've had this problem when "augmenting" a part-number code to
be product-part, and using things like "between xxx- and xxx-zz" to get
all parts for product xxx (it was changed to dual fields at the next
iteration, but sometimes you need these things for the interim). I
mean, the fact that they are both numbers doesn't mean he wants / needs
numerical ordering on them; for many purposes just collapsing prefixes
is enough.

> => select * from foo order by t collate "C";
> ╔══════════╗
> ║    t     ║
> ╟──────────╢
> ║ 18/0113  ║
> ║ 18/0212  ║
> ║ 180/2010 ║
> ║ 23/4567  ║
> ╚══════════╝
> (4 rows)
>
> It might be possible to define a custom collation for that, but in a
> case like this I would either split this field into two integer fields
> or use a function-based index.

Yep, but he may have a temporary problem. C collation puts all the
"prefixes" together, which normally is good enough.

Francisco Olarte.
On 2018-01-14 13:20:05 +0100, Francisco Olarte wrote:
> On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> >> C collation is like sorting raw bytes; it doesn't even sort
> >> upper/lower case correctly
> >
> > Now you are falling into the same trap as Durumdara, calling an
> > unintended sort order "not correct" ;-).
>
> Well, had you quoted / read a little farther:

Sorry. I had hoped that adding a winking smiley would make it clear
that this was just a friendly jab, not a serious criticism.

> >>
> C collation is like sorting raw bytes; it doesn't even sort upper/lower
> case correctly. (I don't know how you do it in HU, but in ES we sort aA
> before bB, while the C locale normally sorts AB..ab..) It's what
> non-locale-aware programs use because it's dirt cheap to implement:
> just sort the data as unsigned byte arrays lexicographically. And in
> fact, IIRC, UTF-8 is designed in a way that this works too (you can
> sort by raw bytes and you get the data lexicographically sorted by code
> points).
> <<
>
> You'll see I was trying to define "correctly" somehow. English is not
> my native language and it seems I obviously failed at it; I'll try to
> do better next time.

English isn't my native language either, so the failure may be on my
side (actually, I don't think a shared native language is a guarantee
for successful communication either). I did read that, and I didn't
assume that you thought that there is one and only one correct way to
sort for each language, but I did think it was slightly amusing that
you used the word "correct" after berating Durumdara for using the word
"wrong".

Anyway, what I wanted to communicate is that the correct sort order
depends on the application. I agree that case should almost never be a
primary criterion, but even there, there might be some exceptions (I
prefer C collation for filenames, because it puts Changes, README and
TODO at the front, but I'm aware that this is mostly because I started
to use Unix in the 80's). But punctuation, spaces, ... those might have
to be treated very differently.

And so I'm not very happy with opaque collation identifiers like
"de_AT.UTF-8". What does that mean? Who decides what the correct sort
order is in Austria, and is this even the same on Linux and Windows?

Often the details don't matter. Whether digits are sorted before or
after letters, whether punctuation is ignored or considered (and if the
latter, how) - as long as the order is internally consistent and not
too far off the users' expectations, the users can deal with it (and
probably won't even notice that the order is slightly different in,
say, the company directory and the city's phonebook). But sometimes
such details do matter, and then you have to explicitly order items.

> > C collation is certainly not what a "normal user" expects. It is
> > therefore wrong for many applications (e.g., you couldn't use it to
> > sort a telephone book), but it might be correct for others (e.g., if
> > you need a stable, unambiguous sort order but don't care much about
> > the order itself).
>
> Stable / unambiguous is shared by a lot of collation methods, but I see
> what you are trying to say.

I'm worried that something like "de_AT.UTF-8" is not stable. Somebody
might decide that ignoring whitespace wasn't such a good idea after all
and "fix" it. Unicode TR#10 actually warns about this:

| Collation order is not fixed.
|
| Over time, collation order will vary: there may be fixes needed as
| more information becomes available about languages; there may be new
| government or industry standards for the language that require
| changes; and finally, new characters added to the Unicode Standard
| will interleave with the previously-defined ones. This means that
| collations must be carefully versioned.

... and if I remember correctly there have been cases where PostgreSQL
indexes were unusable after an upgrade because the collation had
changed.

> I've found sort more and more surprising since I started (with the
> electromechanical IBM card sorters, those).

I fully agree with this.

> > Which brings us back to Durumdara's example: I don't know his
> > application, so I don't know what "normal users" of his application
> > would expect, but to me those values look like two numbers separated by
> > a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> > '180/2010', but of course the C collation doesn't do that:
>
> He does not seem to want this.

He didn't mention it. We don't know yet whether he doesn't want it or
just didn't think of it yet :-).

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Dear Peter and Francisco!
Firstly: sorry for the name change. I may have read too much about Totti, or listened to too much "Vamos a la playa" Spanish-Italian disco... :-) :-) :-)
To help you understand my problem better, I'll describe the background.
We used a different database engine "before" PG, and much of our code is based on that SQL.
Because only 70-80% has been migrated so far, we need to preserve compatibility with the older database engine.
That engine runs on Windows, and it uses the local collation order (ANSI: basic ASCII extended with Hungarian characters, like ISO-8859-2 or Windows-1250).
It uses Windows comparison, so it doesn't matter whether the client sorts or the server - on Hungarian machines it's the same (AnsiCompareText/AnsiCompareStr).
We don't need to worry about filters/orders.
All ASCII characters keep their place in the sort; the Hungarian ones are inserted between them...
English: A, B, C, D, E, F,
Hungarian: A, Á, B, C, D, E, É, F
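As a sketch, the same order can be checked from SQL if the server has a Hungarian collation (the exact collation name is an assumption here; it varies by platform: "hu_HU", "hu_HU.utf8", ...):

select ch
from (values ('A'), ('Á'), ('B'), ('E'), ('É'), ('F')) as t(ch)
order by ch collate "hu_HU.utf8";
-- expected order, per the alphabet above: A, Á, B, E, É, F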
The data export works the same in MS Excel or LibreOffice. The filters are the same.
I supposed that the hu_HU collation is mostly similar.
The whole thing matters for row filtering, not for ordering. For sorting data we can use DevExpress grids.
But the filters must be the same. We use many range filters on text fields.
With the older engine we can safely search ranges with values like "Á16-00023" or "18343843/BÉR"; the resulting row count is the same on the programme side and, for example, on the Excel side.
I didn't understand what the hell caused this problem on the PG side, or what I did wrong.
This is why I showed so many demonstrations: I wanted to understand. The simulations point to "symbol removal".
Yes, we can handle these problems, but we need to construct "compatible" SQL, or make a different solution on the PG side.
Otherwise this would cause "bugs" in the working applications, and make a difference between the two products.
The C collation is good only for these numbered fields.
Splitting into substrings and ordering on them is also good in this example.
But for special accents ("ű", "ó", "ő", etc.) we need a different solution...
And the text fields are free-form, so "/" is only one of the symbols users can enter.
Thank you for your instructions, help, and "flame"... :-)
Have a nice day!
dd