Re: Problem in 'ORDER BY' of a column using a created collation? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Problem in 'ORDER BY' of a column using a created collation?
Date
Msg-id CA+TgmobWgZC-JkS4OKbxQsjuBh7mpKMPw5kxtcHJgN-7+vGgCg@mail.gmail.com
Whole thread Raw
In response to Problem in 'ORDER BY' of a column using a created collation?  (Nishant Sharma <nishant.sharma@enterprisedb.com>)
List pgsql-hackers
On Mon, Aug 25, 2025 at 3:52 AM Nishant Sharma
<nishant.sharma@enterprisedb.com> wrote:
> We did debug 'Experiment 1' and we find that:-
> Whatever is the Column 1 in 'ORDER BY' gets correctly ordered,
> because it uses abbreviated sort optimization due to which its data
> datum gets converted to abbreviated datum using
> "varstr_abbrev_convert()" function, and then the comparator
> function selected is
> "ssup->comparator = ssup_datum_unsigned_cmp()"
> for sorting operation. But in case of column 2 in 'ORDER BY' (which
> is showing incorrect result for 'Experiment 1') does not use
> abbreviated sort optimization and here comparator function selected
> is "ssup->comparator = varlenafastcmp_locale" -->
> "strncoll_icu_utf8()", which appears, uses the third party ICU
> library function for comparison and does not work as expected.

Generally, there are two methods for performing collation-aware string
comparisons, and they are expected to produce equivalent results. For
libc collations, the first method is to use strcoll() or strcoll_l()
or similar to directly compared the strings, and the other is to use
strxfrm() or similar to convert the string to a normalize
representation which can then be compared using memcmp(). Equivalent
facilities exist for ICU; see collate_methods_icu and
collate_methods_icu_utf8 in pg_local_icu.c; one or the other of those
should be getting used here.

Even though these two methods are expected to produce equivalent
results, if there's a bug, they might not. That bug could either exist
in our code or it could exist in ICU; I suspect the latter is more
likely, but the former is certainly possible. What I think you want to
do is try to track down two specific strings where transforming them
via strnxfrm_icu() produces results that compare in one order using
memcmp(), but passing the same strings to strncoll_icu() or
strncoll_icu_utf8() -- whichever is appropriate -- produces a
different result. If you're able to find such strings, then you can
probably rephrase those examples in terms of whatever underlying
functions strncoll_icu() and strxfrm_icu() are calling and we can
maybe report the problem to the ICU maintainers. If no such strings
seem to exist, then there's probably a problem in the PostgreSQL code
someplace, and you should try to figure out why we're getting the
wrong answer despite ICU apparently doing the right thing.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Report bytes and transactions actually sent downtream
Next
From: Álvaro Herrera
Date:
Subject: Re: refactor backend type lists