Re: jsonpath - Mailing list pgsql-hackers

From Nikita Glukhov
Subject Re: jsonpath
Date
Msg-id 38018365-5793-2827-e3c1-5a3b259fb6ae@postgrespro.ru
Whole thread Raw
In response to Re: jsonpath  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: jsonpath
List pgsql-hackers
Attached 21-st version of the patches rebased onto the current master.

Added new patch 0004 with documentation written by Liudmila Mantrova, and
optional patch 0003 that was separated from 0002.


On 24.11.2018 23:03, Tomas Vondra wrote:

Hi,

I've done another round of reviews on v20, assuming the patch is almost
ready to commit, but unfortunately I ran into a bunch of issues that
need to be resolved. None of this is a huge issue, but it's also above
the threshold of what could be tweaked by a committer IMHO.
Thank your for your review.
(Which brings the question who plans to commit this. The patch does not
have a committer in the CF app, but I see both Teodor and Alexander are
listed as it's authors, so I'd expect it to be one of those. Or I might
do that, of course.)
0001
----

1) to_timestamp() does this:
 do_to_timestamp(date_txt, VARDATA(fmt), VARSIZE_ANY_EXHDR(fmt), false,                 &tm, &fsec, &fprec, NULL);

Shouldn't it really do VARDATA_ANY() instead of VARDATA()? It's what the
function did before (well, it called text_to_cstring, but that does
VARDATA_ANY). The same thing applies to to_date(), BTW.

I also find it a bit inconvenient that we expand the fmt like this in
all do_to_timestamp() calls, although it's just to_datetime() that needs
to do it this way. I realize we can't change to_datetime() because it's
external API, but maybe we should make it construct a varlena and pass
it to do_to_timestamp().
Of course, there must be VARDATA_ANY() instead of of VARDATA().  But I decided
to construct varlena and pass it to do_to_timestamp() and to to_datetime().

2) We define both DCH_FF# and DCH_ff#, but we never ever use the
lower-case version. Heck, it's not mentioned even in DCH_keywords, which
does this:
 ... {"FF1", 3, DCH_FF1, false, FROM_CHAR_DATE_NONE},  /* F */ ... {"ff1", 3, DCH_FF1, false, FROM_CHAR_DATE_NONE},  /* F */ ...

Compare that to DCH_DAY, DCH_Day and DCH_day, mapped to "DAY", "Day" and
"day". 
Yes, "ff#" are mapped to DCH_FF# like "mi" is mapped DCH_MI.

"Day", "day" are not mapped to DCH_DAY because they determine letter case in the
output, but "ff1" and "FF#" output contains only digits.

Also, the comment for "ff1" is wrong, it should be "f" I guess.
Fixed.
And of course, there are regression tests for FF# variants, but not the
lower-case ones.

Tests were added

3) Of course, these new formatting patterns should be added to SGML
docs, for example to "Template Patterns for Date/Time Formatting" in
func.sgml (and maybe to other places, I haven't looked for them very
thoroughly).

Fixed.

4) The last block in DCH_from_char() does this
   while (*s == ' ')       s++;

but I suppose it should use isspace() just like the code immediately
before it.

Fixed.

5) I might be missing something, but why is there the "D" at the end of
the return flags from DCH_from_char?
   /* Return flags for DCH_from_char() */   #define DCH_DATED    0x01   #define DCH_TIMED    0x02   #define DCH_ZONED    0x04
These terms "dated", "timed", and "zoned" are taken from the standard:

9.39.10) If <JSON datetime template> JDT is specified, ... : a) If JDT contains <datetime template year>, ... then JDT is dated. b) If JDT contains <datetime template 12-hour>, ... then JDT is timed. c) If JDT contains <datetime template time zone hour>, ... then JDT is zoned.


0002
----

1) There are some unnecessary changes to to_datetime signature (date_txt
renamed to vs. datetxt), which is mostly minor but unnecessary churn.

2) There are some extra changes to to_datetime (extra parameters, etc.).
I wonder why those are not included in 0001, as part of the supporting
datetime infrastructure.
Extra changes to to_datetime() were moved into patch 0001.

3) I'm not sure whether the _safe functions are a good or bad idea, but
at the very least the comments should be updated to explain what it does
(as the API has changed, obviously).
 
This functions were introduced for elimination of PG_TRY/PG_CATCH in jsonpath
code.  But this error handling approach has a lot of issues (see below), so I
decided separate again them into optional patch 0004.

4) the json.c changes are under-documented, e.g. there are no comments
for lex_peek_value, JsonEncodeDateTime doesn't say what tzp is for and
whether it needs to be specified all the time, half of the functions at
the end don't have comments (some of them are really simple, but then
other simple functions do have comments).

I don't know what the right balance here is (I'm certainly not asking
for bogus comments just to have comments) and I agree that the existing
code is not exactly abundant in comments. But perhaps having at least
some would be nice.
Some new comments were added to json.c.
The same thing applies to jsonpath.c and jsonpath_exec.c, I think. There
are pretty much no comments whatsoever (at least at the function level,
explaining what the functions do). It would be good to have a file-level
comment too, explaining how jsonpath works, probably.

I hope to add detailed comments for jsonpath in the next version of the patches.

5) I see uniqueifyJsonbObject now does this:
if (!object->val.object.uniquified)	return;

That seems somewhat strange, considering the function says it'll
uniqueify objects, but then exits when noticing the passed object is not
uniquified?

'uniquified' field was rename to 'uniquify'.

6) Why do we make make_result inline? (numeric.c) If this needs to be
marked with "inline" then perhaps all the _internal functions should be
marked like that too? I have my doubts about the need for this.


7) The other places add _safe to functions that don't throw errors
directly and instead update edata. Why are set_var_from_str, div_var,
mod_var excluded from this convention?
I added "_safe" postfix only to functions having the both variants: safe (error
info is placed into *edata) and unsafe (errors are always thrown).

One-line make_result() is called from many unsafe places, so I decided to leave
it as it was.  It can also be defined as a simple macro.

New "_internal" functions are called from jsonpath_exec.c, so they can't be
"static inline", but "extern inline" seems to be non-portable.

8) I wonder if the changes in numeric can have negative impact on
performance. Has anyone done any performance tests of this part?
I've tried to measure this impact by evaluation of expression with dozens of '+'
or even by adding a loop into numeric_add(), but failed to get any measurable
difference.

0003
----

1) jsonb_gin.c should probably add comments briefly explaining what
JsonPathNode, GinEntries, ExtractedPathEntry, ExtractedJsonPath and
JsonPathExtractionContext are for.
A lot of comments were added. Also major refactoring of jsonb_gin.c was done.

2) I find it a bit suspicious that there are no asserts in json_gin.c
(well, there are 3 in the existing code, but nothing in the new code,
and the patch essentially doubles the amount of code here).
I have managed to find only 4 similar places suitable for asserts.

No comments for 0004 at this point.

regards

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: don't mark indexes invalid unnecessarily
Next
From: Andres Freund
Date:
Subject: GIN predicate locking slows down valgrind isolationtests tremendously