Re: jsonpath - Mailing list pgsql-hackers
From | Nikita Glukhov |
---|---|
Subject | Re: jsonpath |
Date | |
Msg-id | 220821fb-52c0-c277-0959-322c8eef76b2@postgrespro.ru Whole thread Raw |
In response to | Re: jsonpath (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: jsonpath
|
List | pgsql-hackers |
Attached 36th version of the patches.
On 03.03.2019 21:08, Tomas Vondra wrote:
Hi, Here are some initial comments from a review of the 0001 part. I plan to do more testing on a large data set and additional round of review over the next week. FWIW I've passed this through valgrind and the usual battery of regression tests, and there were no issues.
Thanks again for your review.
I haven't looked at 0002 yet, but it seems fairly small (especially compared to 0001). func.sgml ========= 1) I see the changes removed <indexterm zone="functions-json"> for some reason. Is that intentional?
Fixed.
2) <command>WHERE</command> We generally tag <literal>WHERE</literal>, not <command>.
Fixed.
3) Filter expressions are applied from left to right Perhaps s/applied/evaluated/ in reference to expressions?
Fixed.
4) The result of the filter expression may be true, false, or unknown. It's not entirely clear to me what "unknown" means here. NULL, or something else? There's a section in the SQL/JSON standard explaining this (page 83), but perhaps we should explain it a bit here too? The standard says "In the SQL/JSON data model, there are no SQL nulls, so Unknown is not part of the SQL/JSON data model." so I'm a bit confused what "unknown" references to. Maybe some example? Also, what happens when the result is unknown?
"unknown" refers here to ordinary three-valued logical Unknown, which is represented in SQL by NULL. JSON path expressions return sequences of SQL/JSON items, which are defined by SQL/JSON data model. But JSON path predicates (logical expressions), which are used in filters, return three-valued logical values: False, True, or Unknown. Filters accept only items for which the filter predicate returned True. False and Unknown results are skipped. Unknown can be checked with IS UNKNOWN predicate: SELECT jsonb_path_query_array('[1, "1", true, null]', '$[*] ? ((@ < 2) is unknown)');jsonb_path_query_array ------------------------["1", true] (1 row) Comparison of JSON nulls to non-nulls returns always False, not Unknown (see SQL/JSON data model). Comparison of non-comparable items return Unknown.
5) There's an example showing how to apply filter at a certain level, using the @ variable, but what if we want to apply multiple filters at different levels? Would it make sense to add such example?
Examples were added. Filters can be nested, but by the standard it is impossible to reference item of the outer filter, because @ always references current item of the innermost filter. We have additional patch with a simple jsonpath syntax extension for this case -- @N:@0, like @, references innermost item@1 references item one level upper@2 references item two levels upper For example, selecting all objects from array '[{"vals": [1,2,3], "val": 2}, {"vals": [4,5], "val": 6}]' having in their .vals[] element greater than their .val field: '$[*] ? (@.vals[*] ? (@ > @1.val))' It is impossible to do this by the standard in the single jsonpath expression. Also there is idea to use lambda expressions with ECMAScript 6 syntax in filters and user method (see below): '$[*] ? (obj => obj.vals[*] ? (val => val > obj.val))' I already have a patch implementing lambda expressions, which were necessary for implementaion of built-in/user-written functions and methods like .map(), reduce(), max(). I can post it again if it is interesting.
6) ... extensions of the SQL/JSON standard I'm not sure what "extension" is here. Is that an extension defined in the SQL standard, or an additional PostgreSQL functionality not described in the standard? (I assume the latter, just checking.)
Yes, this is additional functionality. "Writing the path as an expression is also valid" I have moved this into SQL/JSON patch, because it is related only path specification in SQL/JSON functions.
7) There are references to "SQL/JSON sequences" without any explanation what it means. Maybe I'm missing something obvious, though.
SQL/JSON sequence is a sequence of SQL/JSON items. The corresponding definition was added.
8) Implicit unwrapping in the lax mode is not performed in the following cases: I suggest to reword it like this: In the lax mode, implicit unwrapping is not performed when:
Fixed. I also decided to merge this paragraph with paragraph describing auto-unwrapping.
9) We're not adding the datetime() method for now, due to the issues with timezones. I wonder if we should add a note why it's missing to the docs ...
The corresponding note was added.
10) I'm a bit puzzled though, because the standard says this in the description of type() function on page 77 If I is a datetime, then “date”, “time without time zone”, “time with time zone”, “timestamp without time zone”, or “timestamp with time zone”, as appropriate. But I see our type() function does not return anything like that (which I presume is independent of timezone stuff). But I see jsonb.h has no concept of datetime values, and the standard actually says this in the datetime() section JSON has no datetime types. Datetime values are most likely stored in character strings. Considering this, the type() section makes little sense, no?
According to the SQL/JSON data model, SQL/JSON items can be null, string, boolean, numeric, and datetime. Datetime items exists only at execution time, they are serialized into JSON strings when the resulting SQL/JSON item is converted into jsonb. After removal of .datetime() method, support of datetime SQL/JSON items was removed from jsonpath executor too. Numeric items can be of any numeric type. In PostgreSQL we have the following numeric datatypes: integers, floats and numeric. But our jsonpath executor supports now only numeric-typed items, because this is only type we can get directly from jsonb. Support for other numeric datatypes (float8 is most necessary, because it is produced by .double() item method) can be added by extending JsonbValue or by introducing struct JsonItem in executor, having JsonbValue as a part (see patch #4 in v34).
jsonb_util.c ============ I see we're now handling NaN values in convertJsonbScalar(). Isn't it actually a bug that we don't do this already? Or is it not needed for some reason?
Numeric JsonbValues created outside of jsonpath executor cannot be NaN, because this case in explicitly handled in JsonbValue-producing functions. For example, datum_to_jsonb() which is used in to_jsonb(), jsonb_build_array() and others converts Inf and NaN into JSON strings. But jsonb_plperl and jsonb_plpython transforms do not allow NaNs (I think it is needed only for consistency of "PL => jsonb => PL" roundtrip). In our jsonpath executor we can produce NaNs and we should not touch them before conversion to the resulting jsonb. Moreover, in SQL/JSON functions numeric SQL/JSON item can be directly converted into numeric RETURNING type. So, I decided to add a check for NaN to the low-level convertJsonbScalar() instead of checking items before every call to JsonbValueToJsonb() or pushJsonbValue(). But after introduction of struct JsonItem (see patch #4) introduction there will appropriate place for this check -- JsonItemToJsonbValue().
========== I suppose this should say "jsonpath version number" instead? elog(ERROR, "unsupported jsonb version number %d", version);
Fixed.
On 05.03.2019 2:27, Tomas Vondra wrote:
A bunch of additional comments, after looking at the patch a bit today. All are mostly minor, and sometime perhaps a matter of preference. 1) There's a mismatch between the comment and actual function name for jsonb_path_match_opr and jsonb_path_exists_opr(). The comments say "_novars" instead.
Fixed.
2) In a couple of switches the "default" case does a return with a value, following elog(ERROR). So it's practically unreachable, AFAICS it's fine without it, and we don't do this elsewhere. And I don't get any compiler warnings if I remove it either. Examples: JsonbTypeName default: elog(ERROR, "unrecognized jsonb value type: %d", jbv->type); return "unknown"; jspOperationName default: elog(ERROR, "unrecognized jsonpath item type: %d", type); return NULL; compareItems default: elog(ERROR, "unrecognized jsonpath operation: %d", op); return jpbUnknown;
It seems to be a standard practice in jsonb code, so we followed it.
3) jsonpath_send is using makeStringInfo() for a value that is not returned - IMHO it should use regular stack-allocated variable and use initStringInfo() instead
Fixed.
4) the version number should be defined/used as a constant, not as a magic constant somewhere in the code
Fixed.
5) Why does jsonPathToCstring do this? appendBinaryStringInfo(out, "strict ", 7); Why not to use regular appendStringInfoString()? What am I missing?
appendStringInfoString() is a bit slower than appendBinaryStringInfo() because to strlen() call inside it.
6) comment typo: "Aling StringInfo"
Fixed.
7) alignStringInfoInt() should explain why we need this and why INTALIGN is the right alignment.
Comment was added.
8) I'm a bit puzzled by what flattenJsonPathParseItem does with 'next' I don't quite understand what it's doing with 'next' value? /* * Actual value will be recorded later, after next and children * processing. */ appendBinaryStringInfo(buf, (char *) &next, /* fake value */ sizeof(next)); Perhaps a comment explaining it (why we need a fake value at all?) would be a good idea here.
No fake value is needed here, zero placeholder is enough. I have refactored this using new function reserveSpaceForItemPointer().
9) I see printJsonPathItem is only calling check_stack_depth while flattenJsonPathParseItem also calls CHECK_INTERRUPTS. Why the difference, considering they seem about equally expensive?
CHECK_INTERRUPT() was added to printJsonPathItem() too.
10) executeNumericItemMethod is missing a comment (unlike the other executeXXX functions)
Comment was added.
11) Wording of some of the error messages in the execute methods seems a bit odd. For example executeNumericItemMethod may complain that it ... is applied to not a numeric value but perhaps a more natural wording would be ... is applied to a non-numeric value And similarly for the other execute methods. But I'm not a native speaker, so perhaps the original wording is just fine.
Error messages were changed on the advice of Robert Haas to: "... can only be applied to a numeric value"
-- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: