Thread: Q: Escapes in jsonpath Idents
Hackers, The jsonpath doc[1] has an excellent description of the format of strings, but for unquoted path keys, it simply says: > Member accessor that returns an object member with the specified key. If the key name matches some named variable startingwith $ or does not meet the JavaScript rules for an identifier, it must be enclosed in double quotes to make it astring literal. I went looking for the JavaScript rules for an identifier and found this in the MDN docs[2]: > In JavaScript, identifiers can contain Unicode letters, $, _, and digits (0-9), but may not start with a digit. An identifierdiffers from a string in that a string is data, while an identifier is part of the code. In JavaScript, there isno way to convert identifiers to strings, but sometimes it is possible to parse strings into identifiers. However, the Postgres parsing of jsonpath keys appears to follow the same rules as strings, allowing backslash escapes: david=# select '$.fo\u00f8 == $x'::jsonpath; jsonpath ------------------- ($."foø" == $"x") This would seem to contradict the documentation. Is this behavior required by the SQL standard? Do the docs need updating?Or should the code actually follow the JSON identifier behavior? Thanks, David PS: Those excellent docs on strings mentions support for \v, but the grammar in the right nav of https://www.json.org/json-en.htmldoes not. Another bonus feature? [1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH [2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier
On Mar 16, 2024, at 14:39, David E. Wheeler <david@justatheory.com> wrote: > I went looking for the JavaScript rules for an identifier and found this in the MDN docs[2]: > >> In JavaScript, identifiers can contain Unicode letters, $, _, and digits (0-9), but may not start with a digit. An identifierdiffers from a string in that a string is data, while an identifier is part of the code. In JavaScript, there isno way to convert identifiers to strings, but sometimes it is possible to parse strings into identifiers. Coda: Dollar signs don’t work at all outside double-quoted string identifiers: david=# select '$.$foo'::jsonpath; ERROR: syntax error at or near "$foo" of jsonpath input LINE 1: select '$.$foo'::jsonpath; ^ david=# select '$.f$oo'::jsonpath; ERROR: syntax error at or near "$oo" of jsonpath input LINE 1: select '$.f$oo'::jsonpath; ^ david=# select '$."$foo"'::jsonpath; jsonpath ---------- $."$foo" This, too, contradicts the MDM definition an identifier (and some quick browser tests). Best, David
Hi David, On 2024-03-16 19:39 +0100, David E. Wheeler wrote: > The jsonpath doc[1] has an excellent description of the format of > strings, but for unquoted path keys, it simply says: > > > Member accessor that returns an object member with the specified > > key. If the key name matches some named variable starting with $ or > > does not meet the JavaScript rules for an identifier, it must be > > enclosed in double quotes to make it a string literal. > > I went looking for the JavaScript rules for an identifier and found > this in the MDN docs[2]: > > > In JavaScript, identifiers can contain Unicode letters, $, _, and > > digits (0-9), but may not start with a digit. An identifier differs > > from a string in that a string is data, while an identifier is part > > of the code. In JavaScript, there is no way to convert identifiers > > to strings, but sometimes it is possible to parse strings into > > identifiers. > > > However, the Postgres parsing of jsonpath keys appears to follow the > same rules as strings, allowing backslash escapes: > > david=# select '$.fo\u00f8 == $x'::jsonpath; > jsonpath ------------------- > ($."foø" == $"x") > > This would seem to contradict the documentation. Is this behavior > required by the SQL standard? Do the docs need updating? Or should the > code actually follow the JSON identifier behavior? That quoted MDN page does not give the whole picture. ECMAScript and JS do allow Unicode escape sequences in identifier names: https://262.ecma-international.org/#sec-identifier-names https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers > PS: Those excellent docs on strings mentions support for \v, but the > grammar in the right nav of https://www.json.org/json-en.html does > not. Another bonus feature? You refer to that sentence: "Other special backslash sequences include those recognized in JSON strings: \b, \f, \n, \r, \t, \v for various ASCII control characters, and \uNNNN for a Unicode character identified by its 4-hex-digit code point." Mentioning JSON and \v in the same sentence is wrong: JavaScript allows that escape in strings but JSON doesn't. I think the easiest is to just replace "JSON" with "JavaScript" in that sentence to make it right. The paragraph also already says "embedded string literals follow JavaScript/ ECMAScript conventions", so mentioning JSON seems unnecessary to me. The last sentence also mentions backslash escapes \xNN and \u{N...} as deviations from JSON when in fact those are valid escape sequences from ECMA-262: https://262.ecma-international.org/#prod-HexEscapeSequence So I think it makes sense to reword the entire backslash part of the paragraph and remove references to JSON entirely. The attached patch does that and also formats the backslash escapes as a bulleted list for readability. > [1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH > [2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier On 2024-03-16 21:33 +0100, David E. Wheeler wrote: > On Mar 16, 2024, at 14:39, David E. Wheeler <david@justatheory.com> > wrote: > > > I went looking for the JavaScript rules for an identifier and found > > this in the MDN docs[2]: > > > >> In JavaScript, identifiers can contain Unicode letters, $, _, and > >> digits (0-9), but may not start with a digit. An identifier differs > >> from a string in that a string is data, while an identifier is part > >> of the code. In JavaScript, there is no way to convert identifiers > >> to strings, but sometimes it is possible to parse strings into > >> identifiers. > > Coda: Dollar signs don’t work at all outside double-quoted string > identifiers: > > david=# select '$.$foo'::jsonpath; > ERROR: syntax error at or near "$foo" of jsonpath input > LINE 1: select '$.$foo'::jsonpath; > ^ > > david=# select '$.f$oo'::jsonpath; > ERROR: syntax error at or near "$oo" of jsonpath input > LINE 1: select '$.f$oo'::jsonpath; > ^ > > david=# select '$."$foo"'::jsonpath; > jsonpath > ---------- > $."$foo" > > This, too, contradicts the MDM definition an identifier (and some > quick browser tests). The first case ($.$foo) is in line with the restriction on member accessors that you quoted first. The error message 'syntax error at or near "$oo" of jsonpath input' for the second case ($.f$oo), however, looks as if the scanner identifies '$oo' as a variable instead of contiuing the scan of identifier (f$oo) for the member accessor. Looks like a bug to me because a variable doesn't even make sense in that place. What works though, besides double quoting, is escaping the dollar sign: regress=# select '$.\u0024foo'::jsonpath; jsonpath ---------- $."$foo" (1 row) And we've come full circle :) -- Erik
Attachment
On Mar 17, 2024, at 15:12, Erik Wienhold <ewie@ewie.name> wrote: > Hi David, Hey Erik. Thanks for the detailed reply and patch! > So I think it makes sense to reword the entire backslash part of the > paragraph and remove references to JSON entirely. The attached patch > does that and also formats the backslash escapes as a bulleted list for > readability. Ah, it’s JavaScript format, not JSON! This does clarify things quite nicely, thank you. Happy to add my review once it’sin a commit fest. > The first case ($.$foo) is in line with the restriction on member > accessors that you quoted first. Huh, that’s now how I read it. Here it is again: >> Member accessor that returns an object member with the specified >> key. If the key name matches some named variable starting with $ or >> does not meet the JavaScript rules for an identifier, it must be >> enclosed in double quotes to make it a string literal. Note that in my example `$foo` does not match a variable. I mean it looks like a variable, but none is used here. I guessit’s being conservative because it might be used in one of the functions, like jsonb_path_exists(), to which variablesmight be passed. > The error message 'syntax error at or near "$oo" of jsonpath input' for > the second case ($.f$oo), however, looks as if the scanner identifies > '$oo' as a variable instead of contiuing the scan of identifier (f$oo) > for the member accessor. Looks like a bug to me because a variable > doesn't even make sense in that place. Right. Maybe the docs should be updated to say that a literal dollar sign isn’t supported in identifiers, unlike in JavaScript,except through escapes like this: > What works though, besides double quoting, is escaping the dollar sign: > > regress=# select '$.\u0024foo'::jsonpath; > jsonpath > ---------- > $."$foo" > (1 row) > > And we've come full circle :) 🎉 Best, David
On 2024-03-17 20:50 +0100, David E. Wheeler wrote: > On Mar 17, 2024, at 15:12, Erik Wienhold <ewie@ewie.name> wrote: > > So I think it makes sense to reword the entire backslash part of the > > paragraph and remove references to JSON entirely. The attached patch > > does that and also formats the backslash escapes as a bulleted list for > > readability. > > Ah, it’s JavaScript format, not JSON! This does clarify things quite > nicely, thank you. Happy to add my review once it’s in a commit fest. Thanks. https://commitfest.postgresql.org/48/4899/ > > The first case ($.$foo) is in line with the restriction on member > > accessors that you quoted first. > > Huh, that’s now how I read it. Here it is again: > > >> Member accessor that returns an object member with the specified > >> key. If the key name matches some named variable starting with $ or > >> does not meet the JavaScript rules for an identifier, it must be > >> enclosed in double quotes to make it a string literal. > > > Note that in my example `$foo` does not match a variable. I mean it > looks like a variable, but none is used here. I guess it’s being > conservative because it might be used in one of the functions, like > jsonb_path_exists(), to which variables might be passed. I had the same reasoning while writing my first reply but scrapped that part because I found it obvious: That jsonpath is parsed before calling jsonb_path_exists() and therefore the parser has no context about any variables, which might not even be hardcoded but may result from a query. > > The error message 'syntax error at or near "$oo" of jsonpath input' for > > the second case ($.f$oo), however, looks as if the scanner identifies > > '$oo' as a variable instead of contiuing the scan of identifier (f$oo) > > for the member accessor. Looks like a bug to me because a variable > > doesn't even make sense in that place. > > Right. Maybe the docs should be updated to say that a literal dollar > sign isn’t supported in identifiers, unlike in JavaScript, except > through escapes like this: Unfortunately, I don't have access to that part of the SQL spec. So I don't know how the jsonpath grammar is specified. I had a look into Oracle, MySQL, and SQLite docs to see what they implement: * Oracle requires the unquoted field names to match [A-Za-z][A-Za-z0-9]* (see "object steps"). It also supports variables. https://docs.oracle.com/en/database/oracle/oracle-database/23/adjsn/json-path-expressions.html * MySQL refers to ECMAScript identifiers but does not say anything about variables: https://dev.mysql.com/doc/refman/8.3/en/json.html#json-path-syntax * SQLite skimps on details and does not document a grammar: https://sqlite.org/json1.html#path_arguments But it looks as if it strives for compatibility with MySQL and our dear Postgres: https://sqlite.org/src/doc/json-in-core/doc/json-enhancements.md Also checked git log src/backend/utils/adt/jsonpath_scan.l for some insights but haven't found any yet. -- Erik
On Mar 17, 2024, at 20:09, Erik Wienhold <ewie@ewie.name> wrote: > > On 2024-03-17 20:50 +0100, David E. Wheeler wrote: >> On Mar 17, 2024, at 15:12, Erik Wienhold <ewie@ewie.name> wrote: >>> So I think it makes sense to reword the entire backslash part of the >>> paragraph and remove references to JSON entirely. The attached patch >>> does that and also formats the backslash escapes as a bulleted list for >>> readability. >> >> Ah, it’s JavaScript format, not JSON! This does clarify things quite >> nicely, thank you. Happy to add my review once it’s in a commit fest. > > Thanks. https://commitfest.postgresql.org/48/4899/ Applies cleanly, `make -C doc/src/sgml check` runs without error. Doc improvement welcome and much clearer than before. > I had the same reasoning while writing my first reply but scrapped that > part because I found it obvious: That jsonpath is parsed before calling > jsonb_path_exists() and therefore the parser has no context about any > variables, which might not even be hardcoded but may result from a > query. Right, there’s a chicken/egg problem. > Unfortunately, I don't have access to that part of the SQL spec. So I > don't know how the jsonpath grammar is specified. Seems quite logical; I think it should be documented, but I’d also be interested to know what the 2016 and 2023 standardssay, exactly. > Also checked git log src/backend/utils/adt/jsonpath_scan.l for some > insights but haven't found any yet. Everybody’s taking shortcuts relative to the standard, AFAICT. For example, jsonpath_scan.l matches unqouted identifierswith these two regular expressions: <xnq>{other}+ <xnq>\/\* <xnq,xq,xvq>\\. Plus the backslash escapes. {other} is defined as: /* "other" means anything that's not special, blank, or '\' or '"' */ other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f] Which is waaaay more liberal than the ECMA standard[1], by my reading, but the MSDN[2] description is quite succinct (thanksfor the links!): > In JavaScript, identifiers are commonly made of alphanumeric characters, underscores (_), and dollar signs ($). Identifiersare not allowed to start with numbers. However, JavaScript identifiers are not only limited to ASCII — many Unicodecode points are allowed as well. Namely, any character in the ID_Start category can start an identifier, while anycharacter in the ID_Continue category can appear after the first character. ID_Start[3] and ID_Continue[4] point to the unicode standard codes lister, nether of which reference Emoji. Sure enough,in Safari: > x = {"🎉": true} < {🎉: true} > x.🎉 < SyntaxError: Invalid character '\ud83c’ But in Postgres jsonpath: david=# select '$.🎉'::jsonpath; jsonpath ---------- $."🎉" If the MSDN references to ID_Start and ID_Continue are correct, then the Postgres path parser is being overly-liberal. Maybethat’s totally fine? Not sure what should be documented and what’s not worth it. Aside: I’m only digging into these details because I’m busy porting the path parser, so trying to figure out where to becompatible and where not to. So far I’m rejecting '$' (but allowing '\$' and '\u0024') but taking advantage of the unicodesupport in Go to specifically validate against ID_Start and ID_Continue. Best, David [1] https://262.ecma-international.org/#sec-identifier-names [2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers [3] https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Start%7D [4] https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Continue%7D
On 17.03.24 20:12, Erik Wienhold wrote: > Mentioning JSON and \v in the same sentence is wrong: JavaScript allows > that escape in strings but JSON doesn't. I think the easiest is to just > replace "JSON" with "JavaScript" in that sentence to make it right. The > paragraph also already says "embedded string literals follow JavaScript/ > ECMAScript conventions", so mentioning JSON seems unnecessary to me. > > The last sentence also mentions backslash escapes \xNN and \u{N...} as > deviations from JSON when in fact those are valid escape sequences from > ECMA-262:https://262.ecma-international.org/#prod-HexEscapeSequence > So I think it makes sense to reword the entire backslash part of the > paragraph and remove references to JSON entirely. The attached patch > does that and also formats the backslash escapes as a bulleted list for > readability. I have committed this patch, and backpatched it, as a bug fix, because the existing description was wrong. To keep the patch minimal for backpatching, I didn't do the conversion to a list. I'm not sure I like that anyway, because it tends to draw more attention to that part over the surrounding parts, which didn't seem appropriate in this case. But anyway, if you have any more non-bug-fix editing in this area, which would then target PG18, please send more patches.
On 18.03.24 01:09, Erik Wienhold wrote: >>> The error message 'syntax error at or near "$oo" of jsonpath input' for >>> the second case ($.f$oo), however, looks as if the scanner identifies >>> '$oo' as a variable instead of contiuing the scan of identifier (f$oo) >>> for the member accessor. Looks like a bug to me because a variable >>> doesn't even make sense in that place. >> Right. Maybe the docs should be updated to say that a literal dollar >> sign isn’t supported in identifiers, unlike in JavaScript, except >> through escapes like this: > Unfortunately, I don't have access to that part of the SQL spec. So I > don't know how the jsonpath grammar is specified. The SQL spec says that <JSON path identifier> corresponds to Identifier in ECMAScript. But it also says, A <JSON path identifier> is classified as follows. Case: a) A <JSON path identifier> that is a <dollar sign> is a <JSON path context variable>. b) A <JSON path identifier> that begins with <dollar sign> is a <JSON path named variable>. c) Otherwise, a <JSON path identifier> is a <JSON path key name>. Does this help? I wasn't following all the discussion to see if there is anything wrong with the implementation.
On Apr 24, 2024, at 05:51, Peter Eisentraut <peter@eisentraut.org> wrote: > A <JSON path identifier> is classified as follows. > > Case: > > a) A <JSON path identifier> that is a <dollar sign> is a <JSON path > context variable>. > > b) A <JSON path identifier> that begins with <dollar sign> is a > <JSON path named variable>. > > c) Otherwise, a <JSON path identifier> is a <JSON path key name>. > > Does this help? I wasn't following all the discussion to see if there is anything wrong with the implementation. Yes, it does, as it ties the special meaning of the dollar sign to the *beginning* of an expression. So it makes sense thatthis would be an error: david=# select '$.$foo'::jsonpath; ERROR: syntax error at or near "$foo" of jsonpath input LINE 1: select '$.$foo'::jsonpath; ^ But I’m less sure when a dollar sign is used in the *middle* (or end) of a json path identifier: david=# select '$.xx$foo'::jsonpath; ERROR: syntax error at or near "$foo" of jsonpath input LINE 1: select '$.xx$foo'::jsonpath; ^ Perhaps that should be valid? Best, David
On Apr 24, 2024, at 05:46, Peter Eisentraut <peter@eisentraut.org> wrote: > I have committed this patch, and backpatched it, as a bug fix, because the existing description was wrong. To keep thepatch minimal for backpatching, I didn't do the conversion to a list. I'm not sure I like that anyway, because it tendsto draw more attention to that part over the surrounding parts, which didn't seem appropriate in this case. But anyway,if you have any more non-bug-fix editing in this area, which would then target PG18, please send more patches. Makes sense, that level of detail gets into the weeks so maybe doesn’t need to be quite so prominent as a list. Thank you! David
On 2024-04-24 13:52 +0200, David E. Wheeler wrote: > On Apr 24, 2024, at 05:51, Peter Eisentraut <peter@eisentraut.org> wrote: > > > A <JSON path identifier> is classified as follows. > > > > Case: > > > > a) A <JSON path identifier> that is a <dollar sign> is a <JSON > > path context variable>. > > > > b) A <JSON path identifier> that begins with <dollar sign> is a > > <JSON path named variable>. > > > > c) Otherwise, a <JSON path identifier> is a <JSON path key name>. > > > > Does this help? I wasn't following all the discussion to see if > > there is anything wrong with the implementation. Thanks Peter! But what is the definition of the entire path expression? Perhaps something like: <JSON path> ::= <JSON path identifier> { "." <JSON path identifier> } That would imply that "$.$foo" is a valid path that accesses a variable member (but I guess the path evaluation is also specified somewhere). Does it say anything about double-quoted accessors? In table 8.25[1] we allow member accessor ."$varname" and it says "If the key name matches some named variable starting with $ or does not meet the JavaScript rules for an identifier, it must be enclosed in double quotes to make it a string literal." What bugs me about this description, after reading it a couple of times, is that it's not clear what is meant by ."$varname". It could mean two things: (1) the double-quoting masks $varname in order to not interpret those characters as a variable or (2) an interpolated string that resolves $varname and yields a dynamic member accessor. The current implementation supports (1), i.e., ."$foo" does not refer to variable foo but the actual property "$foo": => select jsonb_path_query('{"$foo":123,"bar":456}', '$."$foo"', '{"foo":"bar"}'); jsonb_path_query ------------------ 123 (1 row) Under case (2) I'd expect that query to return 456 (because $foo resolves to "bar"). (Similar to how psql would resolve :'foo' to 'bar'.) Variables already work in array accessors and table 8.25 says that "The specified index can be an integer, as well as an expression returning a single numeric value [...]". A variable is such an expression. => select jsonb_path_query('[2,3,5]', '$[$i]', '{"i":1}'); jsonb_path_query ------------------ 3 (1 row) So I'd expect a similar behavior for member accessors as well when seeing ."$varname" in the same table. > Yes, it does, as it ties the special meaning of the dollar sign to the > *beginning* of an expression. So it makes sense that this would be an > error: > > david=# select '$.$foo'::jsonpath; > ERROR: syntax error at or near "$foo" of jsonpath input > LINE 1: select '$.$foo'::jsonpath; > ^ > But I’m less sure when a dollar sign is used in the *middle* (or end) > of a json path identifier: > > david=# select '$.xx$foo'::jsonpath; > ERROR: syntax error at or near "$foo" of jsonpath input > LINE 1: select '$.xx$foo'::jsonpath; > ^ > Perhaps that should be valid? Yes, I think so. That would be case C in the spec excerpt provided by Peter. So it's just a key name that happens to contain (but not start with) the dollar sign. [1] https://www.postgresql.org/docs/current/datatype-json.html#TYPE-JSONPATH-ACCESSORS -- Erik
On Apr 24, 2024, at 3:22 PM, Erik Wienhold <ewie@ewie.name> wrote: > Thanks Peter! But what is the definition of the entire path expression? > Perhaps something like: > > <JSON path> ::= <JSON path identifier> { "." <JSON path identifier> } > > That would imply that "$.$foo" is a valid path that accesses a variable > member (but I guess the path evaluation is also specified somewhere). I read it as “if it starts with a dollar sign, it’s a variable and not a path identifier”, and I assume any `.foo` expressionis a path identifier. > What bugs me about this description, after reading it a couple of times, > is that it's not clear what is meant by ."$varname". It could mean two > things: (1) the double-quoting masks $varname in order to not interpret > those characters as a variable or (2) an interpolated string that > resolves $varname and yields a dynamic member accessor. My understanding is that if it’s in double quotes it’s never anything other than a string (whether a string literal or apath identifier string literal). IOW, variables don’t interpolate inside strings. > Under case (2) I'd expect that query to return 456 (because $foo > resolves to "bar"). (Similar to how psql would resolve :'foo' to > 'bar'.) Yes, I suspect this is the correct interpretation, but agree the wording could use some massaging, especially since pathidentifiers cannot start with a dollar sign anyway. Perhaps: "If the key name starts with $ or does not meet the JavaScript rules for an identifier, it must be enclosed in double quotesto make it a string literal." > Variables already work in array accessors and table 8.25 says that "The > specified index can be an integer, as well as an expression returning a > single numeric value [...]". A variable is such an expression. > > => select jsonb_path_query('[2,3,5]', '$[$i]', '{"i":1}'); > jsonb_path_query > ------------------ > 3 > (1 row) > > So I'd expect a similar behavior for member accessors as well when > seeing ."$varname" in the same table. Oh, interesting point! Now I wonder if the standard has this inconsistency (and is aware of it). > Yes, I think so. That would be case C in the spec excerpt provided by > Peter. So it's just a key name that happens to contain (but not start > with) the dollar sign. Exactly. It also matches the doc you quote above. Something would have to change in src/backend/utils/adt/jsonpath_scan.lto fix that, but that file makes my eyes water, so I’m not gonna take a stab at it.:-) D