
Re: SQL/JSON path: collation for comparisons, minor typos in docs - Mailing list pgsql-hackers

From	Markus Winand
Subject	Re: SQL/JSON path: collation for comparisons, minor typos in docs
Date	August 8, 2019 08:53:20
Msg-id	A6A0BD39-E43F-4790-AE4C-338C7CBB0291@winand.at
In response to	Re: SQL/JSON path: collation for comparisons, minor typos in docs (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses	Re: SQL/JSON path: collation for comparisons, minor typos in docs
List	pgsql-hackers


Hi!

The patch makes my tests pass.

I wonder about a few things:

- Isn’t there any code that could be re-used for that (the one triggered by 'a' < 'A' COLLATE ucs_basic)?

- For object key members, the standard also refers to Unicode code point collation (SQL-2:2016 4.46.3, last paragraph).

- I guess it also applies to the “starts with” predicate, but I cannot find this explicitly stated in the standard.

My tests check whether those cases do case-sensitive comparisons. With my default collation "en_US.UTF-8" I cannot discover potential issues there. I haven’t played around with nondeterministic ICU collations yet :(

-markus

ps.: for me, testing the regular expression dialect of like_regex is out of scope

On 8 Aug 2019, at 02:27, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Aug 8, 2019 at 3:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Thu, Aug 8, 2019 at 12:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Wed, Aug 7, 2019 at 4:11 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
On Wed, Aug 7, 2019 at 2:25 PM Markus Winand <markus.winand@winand.at> wrote:
I was playing around with JSON path quite a bit and might have found one case where the current implementation doesn’t follow the standard.

The functionality in question is the set of comparison operators other than ==. They use the database default collation rather than the standard-mandated "Unicode codepoint collation" (SQL-2:2016 9.39 General Rule 12 c iii 2 D, last sentence in first paragraph).
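
To illustrate the difference, here is a minimal Python sketch (not PostgreSQL code). Python compares str values by code point, which matches the ordering the standard asks for. The function linguistic_like_less is a hypothetical, rough stand-in for a linguistic collation such as en_US.UTF-8; it is an assumption made for illustration and does not reproduce any real locale.

```python
def codepoint_less(a: str, b: str) -> bool:
    """Return a < b under Unicode code point ordering.

    Python's built-in str comparison already works code point by code point.
    """
    return a < b


def linguistic_like_less(a: str, b: str) -> bool:
    """Hypothetical stand-in for a linguistic collation (NOT a real locale).

    Letters are compared case-insensitively first; on a tie, the lowercase
    form sorts before the uppercase form.
    """
    def key(s: str):
        # swapcase() makes lowercase letters compare lower on ties.
        return (s.casefold(), s.swapcase())

    return key(a) < key(b)


if __name__ == "__main__":
    for left, right in [("a", "A"), ("B", "a")]:
        print(left, "<", right,
              "| code point:", codepoint_less(left, right),
              "| stand-in collation:", linguistic_like_less(left, right))
```

Under code point ordering all of A-Z sort before a-z, so a comparison such as "a" < "A" can give a different answer than under the database default collation.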

Thank you for pointing this out! Nikita is about to write a patch fixing that.

Please, see the attached patch.

Our idea is not to sacrifice "==" operator performance for standard
conformance. So "==" remains a per-byte comparison. For consistency,
the other operators compare code points first, then fall back to a
per-byte comparison. In some edge cases, when the same Unicode code
points have different binary representations in the database encoding,
this behavior diverges from the standard. In the future we can implement
strict standard conformance by normalizing the input JSON strings.

The previous version of the patch had a buggy implementation of
compareStrings(). A revised version is attached.

Nikita pointed out to me that for UTF-8 strings the result of a per-byte
comparison matches the result of a code point comparison. That allows us
to simplify the patch a lot.
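
The property can be checked with a short Python sketch (an illustration only, not part of the patch). The sample strings and the helper names cmp_codepoints, cmp_bytes and orders_agree are assumptions chosen for this example. UTF-16 is included purely as a contrast, since its byte order does not follow code point order for characters outside the Basic Multilingual Plane.

```python
import itertools

# Hand-picked samples covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
SAMPLES = ["", "A", "a", "ab", "z", "\u00e4", "\u20ac",
           "\uffff", "\U00010000", "\U0001f600"]


def cmp_codepoints(a: str, b: str) -> int:
    """Three-way comparison of two strings by their code point sequences."""
    ka, kb = [ord(c) for c in a], [ord(c) for c in b]
    return (ka > kb) - (ka < kb)


def cmp_bytes(a: str, b: str, encoding: str) -> int:
    """Three-way comparison of the encoded byte strings (memcmp-like)."""
    ba, bb = a.encode(encoding), b.encode(encoding)
    return (ba > bb) - (ba < bb)


def orders_agree(encoding: str) -> bool:
    """True if byte order equals code point order for every sample pair."""
    return all(cmp_codepoints(a, b) == cmp_bytes(a, b, encoding)
               for a, b in itertools.product(SAMPLES, repeat=2))


if __name__ == "__main__":
    print("utf-8 agrees:", orders_agree("utf-8"))
    print("utf-16-be agrees:", orders_agree("utf-16-be"))
```

For these samples the UTF-8 check succeeds, while the UTF-16 check fails on the pair U+FFFF and U+10000, whose surrogate-pair encoding starts with a smaller byte (0xD8) than 0xFF.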

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
<0001-Use-Unicode-codepoint-collation-in-jsonpath-4.patch>
