Thread: pgsql: Adjust string comparison in jsonpath

pgsql: Adjust string comparison in jsonpath

From: Alexander Korotkov
Date:

Adjust string comparison in jsonpath

We have implemented jsonpath string comparison using the default database
locale.  However, the standard requires us to compare Unicode codepoints.  This
commit implements that, but for performance reasons we still use per-byte
comparison for the "==" operator.  Thus, for consistency, the other comparison
operators do a per-byte comparison if the Unicode codepoints appear to be equal.

In some edge cases, when the same Unicode codepoints have different binary
representations in the database encoding, we diverge from the standard to
achieve better performance of the "==" operator.  In the future, to implement
strict standard conformance, we can normalize input JSON strings.

The original patch was written by Nikita Glukhov and rewritten by me.

Reported-by: Markus Winand
Discussion: https://postgr.es/m/8B7FA3B4-328D-43D7-95A8-37B8891B8C78%40winand.at
Author: Nikita Glukhov, Alexander Korotkov
Backpatch-through: 12

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/d54ceb9e176152f930e60709e07c636e8e5414f5

Modified Files
--------------
src/backend/utils/adt/jsonpath_exec.c        |  72 +++++++++++-
src/test/regress/expected/jsonb_jsonpath.out | 163 +++++++++++++++++++++++++++
src/test/regress/sql/jsonb_jsonpath.sql      |  16 +++
3 files changed, 248 insertions(+), 3 deletions(-)
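
The comparison described in the commit message can be illustrated with a minimal sketch. UTF-8 byte order matches Unicode codepoint order, so once both strings are in UTF-8, a length-aware byte comparison yields codepoint ordering. The name binary_compare below is hypothetical and chosen for illustration; this is not the code in jsonpath_exec.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): byte-wise comparison of two
 * length-counted strings.  The inputs need not be null-terminated.
 * Returns a negative value, 0, or a positive value, like memcmp().
 */
int
binary_compare(const char *s1, int len1, const char *s2, int len2)
{
    int min_len;
    int cmp;

    assert(len1 >= 0 && len2 >= 0);

    /* Compare the common prefix as unsigned bytes. */
    min_len = len1 < len2 ? len1 : len2;
    cmp = memcmp(s1, s2, (size_t) min_len);
    if (cmp != 0)
        return cmp;

    /* Common prefix is identical: the shorter string sorts first. */
    if (len1 == len2)
        return 0;
    return len1 < len2 ? -1 : 1;
}
```

For UTF-8 input, "a" (U+0061) sorts before "é" (U+00E9, bytes C3 A9), which sorts before "€" (U+20AC, bytes E2 82 AC), regardless of the database locale.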


Re: pgsql: Adjust string comparison in jsonpath

From: Andrew Dunstan
Date:

On 8/11/19 4:10 PM, Alexander Korotkov wrote:
> Adjust string comparison in jsonpath


This appears to have upset prion when testing on en_US.iso885915.


cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Adjust string comparison in jsonpath

From: Thomas Munro
Date:

On Mon, Aug 12, 2019 at 9:04 AM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 8/11/19 4:10 PM, Alexander Korotkov wrote:
> > Adjust string comparison in jsonpath
>
>
> This appears to have upset prion when testing on en_US.iso885915.

Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when
running JSON queries, on HEAD and REL_12_STABLE:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2021%3A02%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2020%3A40%3A07

-- 
Thomas Munro
https://enterprisedb.com



Re: pgsql: Adjust string comparison in jsonpath

From: Alexander Korotkov
Date:

On Mon, Aug 12, 2019 at 1:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Aug 12, 2019 at 9:04 AM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
> > On 8/11/19 4:10 PM, Alexander Korotkov wrote:
> > > Adjust string comparison in jsonpath
> >
> >
> > This appears to have upset prion when testing on en_US.iso885915.
>
> Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when
> running JSON queries, on HEAD and REL_12_STABLE:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2021%3A02%3A32
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2020%3A40%3A07

Thank you for pointing this out!  I hope I can investigate this shortly.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: pgsql: Adjust string comparison in jsonpath

From: Thomas Munro
Date:

On Mon, Aug 12, 2019 at 10:30 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> > > This appears to have upset prion when testing on en_US.iso885915.
> >
> > Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when
> > running JSON queries, on HEAD and REL_12_STABLE:

> Thank you for pointing this out!  I hope I can investigate this shortly.

Hi Alexander,

I can reproduce this by running LANG="fr_FR.ISO8859-1" initdb, then
running installcheck (on some other OSes that might be called just
"fr_FR").  See this comment in mbutils.c:

 * The functions return a palloc'd, null-terminated string if conversion
 * is required.  However, if no conversion is performed, the given source
 * string pointer is returned as-is.

You call pfree() on the result of pg_server_to_any() without checking
whether it just returned the input pointer (for example, it does that if
you give it an empty string).  That triggers an assertion failure
somewhere inside pfree().  The following fixes that for me, and is
based on code I found elsewhere in the tree.

--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -2028,8 +2028,10 @@ compareStrings(const char *mbstr1, int mblen1,
                cmp = binaryCompareStrings(utf8str1, strlen(utf8str1),
                                           utf8str2, strlen(utf8str2));

-               pfree(utf8str1);
-               pfree(utf8str2);
+               if (mbstr1 != utf8str1)
+                       pfree(utf8str1);
+               if (mbstr2 != utf8str2)
+                       pfree(utf8str2);

With that fixed it no longer crashes, but then the regression test
fails due to differences in the output, which look like locale
ordering differences.
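
The ownership rule behind the patch above can be sketched in standalone C. This is an illustration under stated assumptions, not PostgreSQL code: convert_or_passthrough is a hypothetical stand-in that follows the contract quoted from mbutils.c (allocate on conversion, return the source pointer otherwise), with malloc()/free() standing in for palloc()/pfree(), and release_converted is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a converter with the contract quoted from
 * mbutils.c: return a freshly allocated, null-terminated string when a
 * "conversion" happens, or the source pointer itself when it does not.
 * In this sketch an empty input is the only pass-through case.
 */
char *
convert_or_passthrough(const char *src, int len)
{
    char *copy;

    assert(len >= 0);
    if (len == 0)
        return (char *) src;    /* no conversion: caller must not free */

    copy = malloc((size_t) len + 1);
    if (copy == NULL)
        abort();
    memcpy(copy, src, (size_t) len);
    copy[len] = '\0';
    return copy;
}

/*
 * Free the result only if the converter actually allocated it.
 * Returns 1 if memory was released, 0 for a pass-through result.
 */
int
release_converted(const char *src, char *converted)
{
    if (converted == src)
        return 0;               /* pass-through: nothing to free */
    free(converted);
    return 1;
}
```

The pointer comparison is the whole check: the caller never owns memory it passed in, so only a result that differs from the source may be freed.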

-- 
Thomas Munro
https://enterprisedb.com



Re: pgsql: Adjust string comparison in jsonpath

From: Alexander Korotkov
Date:

On Mon, Aug 12, 2019 at 3:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Aug 12, 2019 at 10:30 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > > > This appears to have upset prion when testing on en_US.iso885915.
> > >
> > > Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when
> > > running JSON queries, on HEAD and REL_12_STABLE:
>
> > Thank you for pointing this out!  I hope I can investigate this shortly.
>
> Hi Alexander,
>
> I can reproduce this by running LANG="fr_FR.ISO8859-1" initdb, then
> running installcheck (on some other OSes that might be called just
> "fr_FR").  See this comment in mbutils.c:
>
>  * The functions return a palloc'd, null-terminated string if conversion
>  * is required.  However, if no conversion is performed, the given source
>  * string pointer is returned as-is.
>
> You call pfree() on the result of pg_server_to_any() without checking
> whether it just returned the input pointer (for example, it does that if
> you give it an empty string).  That triggers an assertion failure
> somewhere inside pfree().  The following fixes that for me, and is
> based on code I found elsewhere in the tree.
>
> --- a/src/backend/utils/adt/jsonpath_exec.c
> +++ b/src/backend/utils/adt/jsonpath_exec.c
> @@ -2028,8 +2028,10 @@ compareStrings(const char *mbstr1, int mblen1,
>                 cmp = binaryCompareStrings(utf8str1, strlen(utf8str1),
>                                            utf8str2, strlen(utf8str2));
>
> -               pfree(utf8str1);
> -               pfree(utf8str2);
> +               if (mbstr1 != utf8str1)
> +                       pfree(utf8str1);
> +               if (mbstr2 != utf8str2)
> +                       pfree(utf8str2);
>
> With that fixed it no longer crashes, but then the regression test
> fails due to differences in the output, which look like locale
> ordering differences.

Thank you for the diagnostics.  Should be fixed by 251c8e39.

BTW, the test failures appear to be caused not by locale differences, but by calling strlen() on the original strings, which are not null-terminated.
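
The strlen() problem can be sketched as follows, assuming a converter that hands back its input pointer when no conversion is needed. In that case the result is the original length-counted buffer, which may lack a terminator, so its length has to come from the caller. The helper name converted_length is hypothetical, and this is not claimed to be the code in 251c8e39.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): pick the length to use for a
 * string returned by a converter that may return its input unchanged.
 * src/src_len is the original length-counted, possibly unterminated input;
 * converted is what the converter returned.
 */
int
converted_length(const char *src, int src_len, const char *converted)
{
    assert(src_len >= 0);
    if (converted == src)
        return src_len;             /* pass-through: trust the given length */
    return (int) strlen(converted); /* real conversion: null-terminated */
}
```

Calling strlen() unconditionally would read past the end of a passed-through buffer and could compare trailing bytes that are not part of the string, which is consistent with output differences that look like ordering changes.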

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company