Thread: pgsql: Adjust string comparison in jsonpath
Adjust string comparison in jsonpath We have implemented jsonpath string comparison using default database locale. However, standard requires us to compare Unicode codepoints. This commit implements that, but for performance reasons we still use per-byte comparison for "==" operator. Thus, for consistency other comparison operators do per-byte comparison if Unicode codepoints appear to be equal. In some edge cases, when same Unicode codepoints have different binary representations in database encoding, we diverge standard to achieve better performance of "==" operator. In future to implement strict standard conformance, we can do normalization of input JSON strings. Original patch was written by Nikita Glukhov, rewritten by me. Reported-by: Markus Winand Discussion: https://postgr.es/m/8B7FA3B4-328D-43D7-95A8-37B8891B8C78%40winand.at Author: Nikita Glukhov, Alexander Korotkov Backpatch-through: 12 Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/d54ceb9e176152f930e60709e07c636e8e5414f5 Modified Files -------------- src/backend/utils/adt/jsonpath_exec.c | 72 +++++++++++- src/test/regress/expected/jsonb_jsonpath.out | 163 +++++++++++++++++++++++++++ src/test/regress/sql/jsonb_jsonpath.sql | 16 +++ 3 files changed, 248 insertions(+), 3 deletions(-)
On 8/11/19 4:10 PM, Alexander Korotkov wrote: > Adjust string comparison in jsonpath > > We have implemented jsonpath string comparison using default database locale. > However, standard requires us to compare Unicode codepoints. This commit > implements that, but for performance reasons we still use per-byte comparison > for "==" operator. Thus, for consistency other comparison operators do per-byte > comparison if Unicode codepoints appear to be equal. > > In some edge cases, when same Unicode codepoints have different binary > representations in database encoding, we diverge standard to achieve better > performance of "==" operator. In future to implement strict standard > conformance, we can do normalization of input JSON strings. > This appears to have upset prion when testing on en_US.iso885915. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 12, 2019 at 9:04 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > On 8/11/19 4:10 PM, Alexander Korotkov wrote: > > Adjust string comparison in jsonpath > > > > We have implemented jsonpath string comparison using default database locale. > > However, standard requires us to compare Unicode codepoints. This commit > > implements that, but for performance reasons we still use per-byte comparison > > for "==" operator. Thus, for consistency other comparison operators do per-byte > > comparison if Unicode codepoints appear to be equal. > > > > In some edge cases, when same Unicode codepoints have different binary > > representations in database encoding, we diverge standard to achieve better > > performance of "==" operator. In future to implement strict standard > > conformance, we can do normalization of input JSON strings. > > > > > This appears to have upset prion when testing on en_US.iso885915. Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when running JSON queries, on HEAD and REL_12_STABLE: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2021%3A02%3A32 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2020%3A40%3A07 -- Thomas Munro https://enterprisedb.com
On Mon, Aug 12, 2019 at 1:25 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Aug 12, 2019 at 9:04 AM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: > > On 8/11/19 4:10 PM, Alexander Korotkov wrote: > > > Adjust string comparison in jsonpath > > > > > > We have implemented jsonpath string comparison using default database locale. > > > However, standard requires us to compare Unicode codepoints. This commit > > > implements that, but for performance reasons we still use per-byte comparison > > > for "==" operator. Thus, for consistency other comparison operators do per-byte > > > comparison if Unicode codepoints appear to be equal. > > > > > > In some edge cases, when same Unicode codepoints have different binary > > > representations in database encoding, we diverge standard to achieve better > > > performance of "==" operator. In future to implement strict standard > > > conformance, we can do normalization of input JSON strings. > > > > > > > > > This appears to have upset prion when testing on en_US.iso885915. > > Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when > running JSON queries, on HEAD and REL_12_STABLE: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2021%3A02%3A32 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-08-11%2020%3A40%3A07 Thank you for pointing! I hope I can investigate this shortly. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Aug 12, 2019 at 10:30 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > > This appears to have upset prion when testing on en_US.iso885915. > > > > Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when > > running JSON queries, on HEAD and REL_12_STABLE: > Thank you for pointing! I hope I can investigate this shortly. Hi Alexander, I can reproduce this by running LANG="fr_FR.ISO8859-1" initdb, then running installcheck (on some other OSes that might be called just "fr_FR"). See this comment in mbutils.c: * The functions return a palloc'd, null-terminated string if conversion * is required. However, if no conversion is performed, the given source * string pointer is returned as-is. You call pfree() on the result of pg_server_to_any() without checking if it just returned in the input pointer (for example, it does that if you give it an empty string). That triggers an assertion failure somewhere inside pfree(). The following fixes that for me, and is based on code I found elsewhere in the tree. --- a/src/backend/utils/adt/jsonpath_exec.c +++ b/src/backend/utils/adt/jsonpath_exec.c @@ -2028,8 +2028,10 @@ compareStrings(const char *mbstr1, int mblen1, cmp = binaryCompareStrings(utf8str1, strlen(utf8str1), utf8str2, strlen(utf8str2)); - pfree(utf8str1); - pfree(utf8str2); + if (mbstr1 != utf8str1) + pfree(utf8str1); + if (mbstr2 != utf8str2) + pfree(utf8str2); With that fixed it no longer crashes, but then the regression test fails due to differences in the output, which look like locale ordering differences. -- Thomas Munro https://enterprisedb.com
пн, 12 авг. 2019 г., 3:25 Thomas Munro <thomas.munro@gmail.com>:
On Mon, Aug 12, 2019 at 10:30 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> > > This appears to have upset prion when testing on en_US.iso885915.
> >
> > Also lapwing's "InstallCheck-fr_FR" stage crashed on this commit, when
> > running JSON queries, on HEAD and REL_12_STABLE:
> Thank you for pointing! I hope I can investigate this shortly.
Hi Alexander,
I can reproduce this by running LANG="fr_FR.ISO8859-1" initdb, then
running installcheck (on some other OSes that might be called just
"fr_FR"). See this comment in mbutils.c:
* The functions return a palloc'd, null-terminated string if conversion
* is required. However, if no conversion is performed, the given source
* string pointer is returned as-is.
You call pfree() on the result of pg_server_to_any() without checking
if it just returned in the input pointer (for example, it does that if
you give it an empty string). That triggers an assertion failure
somewhere inside pfree(). The following fixes that for me, and is
based on code I found elsewhere in the tree.
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -2028,8 +2028,10 @@ compareStrings(const char *mbstr1, int mblen1,
cmp = binaryCompareStrings(utf8str1, strlen(utf8str1),
utf8str2, strlen(utf8str2));
- pfree(utf8str1);
- pfree(utf8str2);
+ if (mbstr1 != utf8str1)
+ pfree(utf8str1);
+ if (mbstr2 != utf8str2)
+ pfree(utf8str2);
With that fixed it no longer crashes, but then the regression test
fails due to differences in the output, which look like locale
ordering differences.
Thank you for the diagnostics. Should be fixed by 251c8e39.
BTW, test failures appears to be caused not by locale differences, but by using strlen() on non null-terminated original strings.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company