Thread: String comparison problem in select - too many results


From
Durumdara
Date:
Dear Members!


Today one of our clients reported an error.

She ran a report filtered by a string range, and she got the wrong documents.

I checked it.
She filtered by a document number (a varchar field: character varying(15)).

The range was '18/0113' to '18/0212'.

    Q.Close;
    Q.SQL.Text := 'select * from idoc where nr >= :pfrom and nr <= :pto';
    Q.ParamByName('pfrom').AsString := PFrom;
    Q.ParamByName('pto').AsString := PTo;
    Q.Open;

This Devart TPGQuery uses parameters; the query is prepared before the call.

She said she also got the document with nr = '180/2010' (and many more).

Because it seemed impossible, I first tried to check this in LibreOffice Calc.



                          Lesser   Greater
'18/0113'   '180/2010'    True     False
'18/0212'   '180/2010'    True     False

It's ok.

Then I ran the select in the database.
I also got 180/2010 in the list!!!

Then I tried a simple check with SQL:

select 
'18/0113' > '180/2010',
'18/0113' < '180/2010',
'18/0212' > '180/2010',
'18/0212' < '180/2010'
---
f;t;t;f

Whaaaaat????

It seemed impossible.
First I assumed that '/' is a special character, like the escape '\'. But the '18/0' characters appear at the start of every input string!!!

Next I supposed that PostgreSQL converts this expression and that the numeric values could differ.
Without further analysis I changed the select to:

select 
cast('18/0113' as text) > cast('180/2010' as text),
cast('18/0113' as text) < cast('180/2010' as text),
cast('18/0212' as text) > cast('180/2010' as text),
cast('18/0212' as text) < cast('180/2010' as text),
replace(cast('18/0113' as text), '/', '_') > replace(cast('180/2010' as text), '/', '_'),
replace(cast('18/0212' as text), '/', '_') > replace(cast('180/2010' as text), '/', '_')

To make sure the strings are treated as strings. I used replace at the end to filter out the '/' sign.
The result is still wrong:
f;t;t;f;f;t

Please help me with this!

What can cause differences between similar strings?
How can I force the correct range?
What is the root of the problem?

The PG is 9.4 on Linux; the database encoding is:
       ENCODING = 'UTF8'
       LC_COLLATE = 'hu_HU.UTF-8'
       LC_CTYPE = 'hu_HU.UTF-8'


Thank you for your help!

Best regards
   dd



Re: String comparison problem in select - too many results

From
"David G. Johnston"
Date:
On Wednesday, January 10, 2018, Durumdara <durumdara@gmail.com> wrote:

The PG is 9.4 on Linux, the DataBase encoding is:
       ENCODING = 'UTF8'
       LC_COLLATE = 'hu_HU.UTF-8'
       LC_CTYPE = 'hu_HU.UTF-8'


The collation rules for hu_HU.UTF-8 probably pretend symbols don't exist; this is not uncommon.  You probably need to use the 'C' collation.

David J. 
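A quick way to test that theory is to force the byte-wise "C" collation on the comparison itself, for example (a sketch reusing the literals from the original post):

select
'18/0113' collate "C" > '180/2010',  -- f: '/' (0x2F) sorts before '0' (0x30)
'18/0212' collate "C" < '180/2010'   -- t: both endpoints sort before '180/2010'

If hu_HU.UTF-8 is indeed skipping the '/' in its first pass, the second result will differ from the f you got with the default collation.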

Re: String comparison problem in select - too many results

From
Durumdara
Date:
Dear David!

I tried it on different servers and in different databases.

1.) Windows local PG:  LC_COLLATE = 'Hungarian_Hungary.1250' - ok.
2.) Linux remote PG: LC_CTYPE = 'en_US.UTF-8' - also wrong!!! - a UTF problem???
3.) Forcing the C collation: - ok
4.) Replacing '/' with 'A': - ok
select
replace('18/0113', '/', 'A') > replace('180/2010', '/', 'A'),
replace('18/0113', '/', 'A') < replace('180/2010', '/', 'A'),
replace('18/0212', '/', 'A') > replace('180/2010', '/', 'A'),
replace('18/0212', '/', 'A') < replace('180/2010', '/', 'A')
5.) Replacing '/' with a space: - wrong
6.) Replacing '/' with an empty string: - wrong
select
replace('18/0113', '/', '') > replace('180/2010', '/', ''),
replace('18/0113', '/', '') < replace('180/2010', '/', ''),
replace('18/0212', '/', '') > replace('180/2010', '/', ''),
replace('18/0212', '/', '') < replace('180/2010', '/', '')

I tried to simulate this in LibreOffice:

Normal      Removed    As space   As '0'
18/0113     180113     18 0113    1800113
18/0212     180212     18 0212    1800212
180/2010    1802010    180 2010   18002010


This could happen if the '/' sign is not interpreted (i.e. treated as removed), or is replaced by '0' in the background.

The main problem is that we have many searches in the programme where we assume this evaluation, and we have several sites with different servers (and collations).

Hmmm...

Thanks
  dd



Re: String comparison problem in select - too many results

From
Francisco Olarte
Date:
On Thu, Jan 11, 2018 at 9:57 AM, Durumdara <durumdara@gmail.com> wrote:
> I tried in in different servers, different databases.
> 1.) Windows local PG:  LC_COLLATE = 'Hungarian_Hungary.1250' - ok.
> 2.) Linux remote PG: LC_CTYPE = 'en_US.UTF-8' - also wrong!!! - UTF
> problem???

Your problem seems to be that you consider wrong anything that doesn't
match your expectations.

> 3.) Forcing C collation:  - ok
> 4.) Replace '/' to 'A': - ok

...  More examples zapped, as they do not prove anything.


> The main problem that we have many searches in programme where we suppose
> good evaluation, and we have more sites with different servers (and
> collation).

You must define "good evaluation". In your case it seems you consider
good evaluation to be lexicographical comparison of ASCII char values.
This is called the 'C' collation, and you have been told to use it.

If your "programme" is doing string comparison in the server collation and
you need good comparison, defined as C collation, that is a bug. Fix
it.

I would recommend reading
https://www.postgresql.org/docs/9.6/static/collation.html and related
docs, but someone who so boldly states postgres collations are
good/bad surely knows all about it.


Try to build from this:


with xx(x) as (values ('18/0113'),('18/0212'),('180/2010'))
select x collate "C"  from xx order by 1;
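Applied to the original range filter, the same idea would look roughly like this (a sketch, reusing the idoc table and nr column from the first post):

select * from idoc
where nr collate "C" >= '18/0113'
  and nr collate "C" <= '18/0212';

Note that a btree index on nr built with the database's default collation cannot serve this predicate; an index declared with COLLATE "C" would be needed for it to be fast.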

Francisco Olarte.


Re: String comparison problem in select - too many results

From
Durumdara
Date:
Dear Francesco!

My "bug" is that I have mostly used Windows environments, where the default collation is ANSI, based on the Windows language, which is Hungarian here (Windows-1250).
But because of special characters we use UTF8 to store data in the database.
I supposed that UTF8/hu_HU works like the local natural order here, and that common ASCII characters (like '/') are in the same positions.
Python/Delphi/LibreOffice sort these numbers correctly (based on the local ANSI sort).

I supposed that UTF8/hu_HU uses mostly the same order that is valid here, that it contains all the ASCII + ANSI characters we use here in daily work, and that they are in a very similar order.
I never thought that it couldn't handle normal characters in the 7-bit range...

For these numbers I can use the C collation; it's OK.

Thank you!

dd




Re: String comparison problem in select - too many results

From
Francisco Olarte
Date:
Durumdara:

On Thu, Jan 11, 2018 at 1:23 PM, Durumdara <durumdara@gmail.com> wrote:
> Dear Francesco!

FrancIsco, with an I, Spanish, not Italian.


> My "bug" is that I have mostly used Windows environments, where the default
> collation is ANSI, based on the Windows language, which is Hungarian here
> (Windows-1250).

I'm not sure ANSI is a collation in Windows. You should read a bit
about locales, as they are not the same as collations. In fact I
suspect the order issues you've got are not UTF8-derived, as all your
data is ASCII (7-bit), which is represented the same in ANSI, UTF8 and
ISO-8859-whatever (IIRC win1250 is a bastard superset of ISO8859-1).

> But because of special characters we use UTF8 to store data in the database.
> I supposed that UTF8/hu_HU works like the local natural order here, and that
> common ASCII characters (like '/') are in the same positions.

You must define "local natural order". But your problem is in the
locale, not in the encoding. If you had managed to use win1250.hu_HU your
sorting would be the same; what you need is UTF8 with the C collation.


> Python/Delphi/LibreOffice sort these numbers correctly (based on the local
> ANSI sort).

ANSI does not define a sort order. And those three use, IIRC, a
C-locale-like sort. You are incorrectly assuming this is the correct one.

> I supposed that UTF8/hu_HU uses mostly the same order that is valid here,
> that it contains all the ASCII + ANSI characters we use here in daily work,
> and that they are in a very similar order.
> I never thought that it couldn't handle normal characters in the 7-bit range...

It can. Your assumptions are BAD. You are not used to working with
collation-aware systems like Postgres, and you assume they must by default
use the same order as non-locale-aware programs. This is normally
never true.

Also, Windows is notoriously difficult to configure for locales. IIRC
(haven't used it in 15 years) you had to go to the keyboard preferences
to change it.

> For these numbers I can use C collation, it's ok.

C collation is like sorting raw bytes; it doesn't even sort
upper/lower case correctly (I don't know how you do it in HU, but in
ES we sort aA before bB, while the C locale normally sorts AB..ab). It's
what non-locale-aware programs use because it's dirt cheap to
implement: just sort the data as unsigned byte arrays,
lexicographically. And in fact, IIRC, UTF-8 is designed in a way that
this works too (you can sort the raw bytes and you get the data
sorted lexicographically by code points).

Another thing: if you have a column which you want sorted in the C locale,
it can be defined as such, so it does it by default (but if you do it
with an ALTER COLUMN or similar, you'll probably need a reindex).
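That could look roughly like this (hypothetical names, reusing the idoc table and nr column from the first post; on current PostgreSQL, ALTER COLUMN ... TYPE with a new collation rebuilds the dependent indexes itself):

alter table idoc
  alter column nr type varchar(15) collate "C";

After that, plain comparisons and ORDER BY on nr use byte-wise ordering by default, with no COLLATE clauses scattered through the queries.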

As a final note, please DO NOT TOP POST and TRIM YOUR QUOTES.
Especially, I do not need my own signature quoted.

Francisco Olarte.


Re: String comparison problem in select - too many results

From
"Peter J. Holzer"
Date:
On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> C collation is like sorting raw bytes, it doesn't event sort
> upper/lower case correctly

Now you are falling into the same trap as Durumdara, calling an
unintended sort order "not correct" ;-).

C collation is certainly not what a "normal user" expects. It is
therefore wrong for many applications (e.g., you couldn't use it to
sort a telephone book), but it might be correct for others (e.g., if you
need a stable, unambiguous sort order but don't care much about the
order itself).

(By coincidence, I also stumbled over the fact that the en_US.UTF-8
collation ignores punctuation characters and spaces at least in the
first pass - took me a minute or so to figure out why I got an
"inconsistent" (read: unexpected and not obvious for me) sort order.)

Lexicographers are interested in sorting single words. Therefore they
aren't interested in punctuation. They may also not be very interested
in accents, because most languages have few words which differ only by
an accent (and if the dictionary is printed on paper, the user will
easily be able to scan up and down a few centimeters and find the right
entry without knowing whether "à" is sorted before or after "á").

Somebody preparing an address list should care about punctuation: You
would probably not expect to find "Smith-Jones, Amanda" between "Smith,
John" and "Smith, Julia". And you probably want to sort "23 Main Street"
before "180 Main Street".

Which brings us back to Durumdara's example: I don't know his
application, so I don't know what "normal users" of his application
would expect, but to me those values look like two numbers separated by
a slash. So I would expect '23/4567' to be sorted between '18/0212' and
'180/2010', but of course the C collation doesn't do that:

=> select * from foo order by t collate "C";
╔══════════╗
║    t     ║
╟──────────╢
║ 18/0113  ║
║ 18/0212  ║
║ 180/2010 ║
║ 23/4567  ║
╚══════════╝
(4 rows)

It might be possible to define a custom collation for that, but in a
case like this I would either split this field into two integer fields
or use a function-based index.
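A sketch of the function-based-index idea (hypothetical index name, reusing the idoc/nr names from the thread), assuming every value really has the form '<number>/<number>' (the cast will fail on rows that don't):

create index idoc_nr_parts_idx on idoc (
  (split_part(nr, '/', 1)::int),
  (split_part(nr, '/', 2)::int)
);

select * from idoc
order by split_part(nr, '/', 1)::int,
         split_part(nr, '/', 2)::int;

Queries must use the same expressions as the index definition to benefit from it.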

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: String comparison problem in select - too many results

From
Francisco Olarte
Date:
On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
>> C collation is like sorting raw bytes, it doesn't even sort
>> upper/lower case correctly
>
> Now you are falling into the same trap as Durumdara, calling an
> unintended sort order "not correct" ;-).

Well, had you quoted / read a little farther:
>>
C collation is like sorting raw bytes; it doesn't even sort
upper/lower case correctly (I don't know how you do it in HU, but in
ES we sort aA before bB, while the C locale normally sorts AB..ab). It's
what non-locale-aware programs use because it's dirt cheap to
implement: just sort the data as unsigned byte arrays,
lexicographically. And in fact, IIRC, UTF-8 is designed in a way that
this works too (you can sort the raw bytes and you get the data
sorted lexicographically by code points).
<<

You'll see I was trying to define "correctly" somehow. English is not
my native language and it seems I failed at it; I'll try to
do better next time.

> C collation is certainly not what a "normal user" expects. It is
> therefore wrong for many applications (e.g., you couldn't use it to
> sort a telephone book), but it might be correct for others (e.g., if you
> need a stable, unambiguous sort order but don't care much about the
> order itself).

Stable/unambiguous is shared by a lot of collation methods, but I
see what you are trying to say.


> (By coincidence, I also stumbled over the fact that the en_US.UTF-8
> collation ignores punctuation characters and spaces at least in the
> first pass - took me a minute or so to figure out why I got an
> "inconsistent" (read: unexpected and not obvious for me) sort order.)

Nearly all locale-aware sorts do funny things with
punctuation/spaces. I've found sorting more and more surprising since I
started (with the electromechanical IBM card sorters, those).

> Lexicographers are interested in sorting single words. Therefore they
> aren't interested in punctuation. They may also not be very interested
> in accents, because most languages have few words which differ only by
> an accent

We have loads of them in Spanish, but they are normally easy, and many
of them come from verb conjugations which do not go into the
dictionary...

> (and if the dictionary is printed on paper, the user will
> easily be able to scan up and down a few centimeters and find the right
> entry without knowing whether "à" is sorted before or after "á").

And we have none of this; I only know of French doing it.

> Somebody preparing an address list should care about punctuation: You
> would probably not expect to find "Smith-Jones, Amanda" between "Smith,
> John" and "Smith, Julia". And you probably want to sort "23 Main Street"
> before "180 Main Street".


> Which brings us back to Durumdara's example: I don't know his
> application, so I don't know what "normal users" of his application
> would expect, but to me those values look like two numbers separated by
> a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> '180/2010', but of course the C collation doesn't do that:

He does not seem to want this. As all his examples use the same prefix,
I think he just wants to extract a small range of keys with a common
prefix. I've had these problems when "augmenting" a part-number code
to product-part, and using things like "between xxx- and xxx-zz" to
get all parts for product xxx (it was changed to dual fields at the
next iteration, but sometimes you need these things in the interim).

I mean, the fact that they are both numbers doesn't mean he wants/needs
numerical ordering on them; for many purposes just collapsing
prefixes is enough.

>
> => select * from foo order by t collate "C";
> ╔══════════╗
> ║    t     ║
> ╟──────────╢
> ║ 18/0113  ║
> ║ 18/0212  ║
> ║ 180/2010 ║
> ║ 23/4567  ║
> ╚══════════╝
> (4 rows)
>
> It might be possible to define a custom collation for that, but in a
> case like this I would either split this field into two integer fields
> or use a function-based index.

Yep, but he may have a temporary problem. C collation puts all the
"prefixes" together, which normally is good enough.

Francisco Olarte.


Re: String comparison problem in select - too many results

From
"Peter J. Holzer"
Date:
On 2018-01-14 13:20:05 +0100, Francisco Olarte wrote:
> On Sun, Jan 14, 2018 at 12:14 PM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2018-01-12 11:08:39 +0100, Francisco Olarte wrote:
> >> C collation is like sorting raw bytes, it doesn't even sort
> >> upper/lower case correctly
> >
> > Now you are falling into the same trap as Durumdara, calling an
> > unintended sort order "not correct" ;-).
>
> Well, had you quoted / read a little farther:

Sorry. I had hoped that adding a winking smiley would make it clear that
this was just a friendly jab, not a serious criticism.

> >>
> C collation is like sorting raw bytes; it doesn't even sort
> upper/lower case correctly (I don't know how you do it in HU, but in
> ES we sort aA before bB, while the C locale normally sorts AB..ab). It's
> what non-locale-aware programs use because it's dirt cheap to
> implement: just sort the data as unsigned byte arrays,
> lexicographically. And in fact, IIRC, UTF-8 is designed in a way that
> this works too (you can sort the raw bytes and you get the data
> sorted lexicographically by code points).
> <<
>
> You'll see I was trying to define "correctly" somehow. English is not
> my native language and it seems I failed at it; I'll try to
> do better next time.

English isn't my native language either, so the failure may be on my
side (actually, I don't think a shared native language is a guarantee
for successful communication either).

I did read that and I didn't assume that you thought that there is one
and only one correct way to sort for each language, but I did think it
was slightly amusing that you used the word "correct" after berating
Durumdara for using the word "wrong".

Anyway, what I wanted to communicate is that the correct sort order
depends on the application. I agree that case should almost never be a
primary criterion, but there might be some exceptions even there (I prefer
the C collation for filenames, because it puts Changes, README and TODO at
the front, but I'm aware that this is mostly because I started to use
Unix in the 80s). But punctuation, spaces, ... those might have to be
treated very differently.

And so I'm not very happy with opaque collation identifiers like
"de_AT.UTF-8". What does that mean? Who decides what the correct sort
order is in Austria, and is this even the same on Linux and Windows?

Often the details don't matter. Whether digits are sorted before or
after letters, whether punctuation is ignored or considered (and if the
latter, how) - as long as the order is internally consistent and not too
far off the users' expectations, the users can deal with it (and
probably won't even notice that the order is slightly different in,
say, the company directory and the city's phonebook). But sometimes such
details do matter, and then you have to explicitly order items.


> > C collation is certainly not what a "normal user" expects. It is
> > therefore wrong for many applications (e.g., you couldn't use it to
> > sort a telephone book), but it might be correct for others (e.g., if you
> > need a stable, unambiguous sort order but don't care much about the
> > order itself).
>
> stable/unambiguous is shared by a lot of collation methods, but I
> see what you try to say.

I'm worried that something like "de_AT.UTF-8" is not stable. Somebody
might decide that ignoring whitespace wasn't such a good idea after all
and "fix" it.

Unicode TR#10 actually warns about this:

| Collation order is not fixed.
|
|   Over time, collation order will vary: there may be fixes needed as
|   more information becomes available about languages; there may be new
|   government or industry standards for the language that require
|   changes; and finally, new characters added to the Unicode Standard
|   will interleave with the previously-defined ones. This means that
|   collations must be carefully versioned.

... and if I remember correctly there have been cases where PostgreSQL
indexes were unusable after an upgrade because the collation had
changed.

> I've found sorting more and more surprising since I started ( with the
> electromechanical IBM card sorters, those )

I fully agree with this.

> > Which brings us back to Durumdara's example: I don't know his
> > application, so I don't know what "normal users" of his application
> > would expect, but to me those values look like two numbers separated by
> > a slash. So I would expect '23/4567' to be sorted between '18/0212' and
> > '180/2010', but of course the C collation doesn't do that:
>
> He does not seem to want this.

He didn't mention it. We don't know yet whether he doesn't want it or
just didn't think of it yet :-).

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: String comparison problem in select - too many results

From
Durumdara
Date:
Dear Peter and Francisco!


Firstly: sorry for the name change. I may have read too much about Totti, or listened to too much "Vamos a la playa" spanitaliano disco... :-) :-) :-)

To better understand my problem, let me describe the background.
We used a different database engine "before" PG, and much of our code is based on that SQL.
Because only 70-80% is migrated so far, we need to preserve compatibility with the older database engine.
That one runs on Windows, and it uses the local collation order (ANSI: basic ASCII extended with Hungarian characters, like ISO-8859-2 or Windows-1250).
It uses Windows comparison, so it doesn't matter whether the client sorts or the server - on Hungarian machines it's the same (AnsiCompareText/Str).
We don't need to worry about filters/orders there.

All ASCII characters keep their order in the sort, but the Hungarian letters are inserted between them...
English: A, B, C, D, E, F,
Hungarian: A, Á, B, C, D, E, É, F

The data export works the same in MS Excel or LibreOffice. The filters are the same.

I supposed that the hu_HU collation was mostly similar.

The whole thing is important for row filtering, not ordering. For sorting data we can use DevExpress grids.

But the filters must be the same. We used many range filters on text fields.
With the older engine we can safely search ranges with values like "Á16-00023" or "18343843/BÉR"; the resulting row count is the same on the programme side and, e.g., on the Excel side.

I didn't understand what the hell caused this problem on the PG side, or what I did wrong.
This is why I showed so many demonstrations: I wanted to understand. The simulations point to "symbol removal".

Yes, we can handle these problems, but we need to construct "compatible" SQL, or we need a different solution on the PG side.
Otherwise this would cause "bugs" in the working applications, and differences between the two products.

The C collation is good only for numbered fields.
Splitting into substrings and ordering on them is also good in this example.
But for special accented characters ("ű", "ó", "ő", etc.) we need a different solution...
And the text fields are free-form, so '/' is only one of the symbols users can enter.
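Since the collation can be chosen per expression, one possible compromise is to use "C" only for the document-number filter and keep the locale collation where accents matter. A sketch, reusing the idoc/nr names from the first post; the exact collation name (e.g. "hu_HU.utf8" on Linux) varies by system:

select * from idoc
where nr collate "C" between '18/0113' and '18/0212'
order by nr collate "hu_HU.utf8";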

Thank you for your instructions, help, and "flame"... :-)

Have a nice day!

dd