jsonb subscripting vs SQL/JSON array accessor semantics (SQL:2023) - Mailing list pgsql-hackers

From Alexandra Wang
Subject jsonb subscripting vs SQL/JSON array accessor semantics (SQL:2023)
Date
Msg-id CAK98qZ1P_edO9hZJVMuUpnXjN-9H=dz2zUj-06G5qW9CMjLqzA@mail.gmail.com
Whole thread Raw
List pgsql-hackers
Hi hackers,

While working on a patch [1] to add SQL/JSON simplified accessor
support (dot notation, wildcards, etc.) to jsonb, I ran into a
semantic mismatch between PostgreSQL’s existing jsonb subscripting [2]
syntax (value[...]), which predates the SQL/JSON standard, and the
SQL:2023 specification of the JSON array accessor (also value[...]) in
the SQL/JSON path language.

Because both use bracket notation for array access, this creates
several edge cases where the existing behavior in PostgreSQL differs
from what the standard specifies, and from what json_query() already
implemented today.

I’d like to summarize the differences, discuss whether we should
reconcile them, and get guidance on how we want to proceed.

---

## Context

PostgreSQL currently allows jsonb subscripts that are coercible to
int4 or text:

-- object member access
SELECT ('{"a": 1}'::jsonb)['a']; -- returns 1 (a jsonb number)

-- array element access
SELECT ('[1, "2", null]'::jsonb)[1]; -- returns "2" (a jsonb string)

In contrast, SQL:2023 defines array access as part of the SQL/JSON
path language:

<JSON array accessor> ::=
  <left bracket> <JSON subscript list> <right bracket>
<JSON subscript list> ::=
  <JSON subscript> [ { <comma> <JSON subscript> }... ]

where each subscript expression must evaluate to a singleton numeric
value (except for the special *last* keyword). Object member access is
performed using dot notation, not ['key'].

The standard further specifies that a simplified accessor expression
is semantically equivalent to:

    JSON_QUERY (VEP, 'lax $JC' WITH CONDITIONAL ARRAY WRAPPER NULL ON EMPTY NULL ON ERROR)

    where:
      VEP = <value expression primary>
      JC = <JSON simplified accessor op chain>

For example:

select ('[1, "2", null]'::jsonb)[1];
     
by the standard's specification is semantically equivalent to:

select json_query('[1, "2", null]'::jsonb, 'lax $[1]' with conditional array wrapper);

However, as Peter Eisentraut pointed out in [3], these two forms
currently diverge in several cases.

---

## Observed differences
     
Below are the main cases where jsonb subscripting and SQL/JSON path
semantics differ.
     
### Case 1a: Array access on a jsonb primitive value
select ('123'::jsonb)[0];
  jsonb
-------

(1 row)

select json_query('123'::jsonb, 'lax $[0]' with conditional array wrapper);
  json_query
------------
  123
(1 row)
     
The json_query() behavior is correct and comply with the SQL standard:
     
> In lax mode:
> If an operation requires an SQL/JSON array but the operand is not an
> SQL/JSON array, then the operand is first "wrapped" in an SQL/JSON
> array prior to performing the operation.

So, whenever there is a 'lax $[0]' jsonpath expression applying on a
non-json array, the return value should always be the same as the
original json value expression, in this case, the same number 123.

The jsonb subscripting result is not only different, but also
contradicts its current documentation [2], which suggests raising
error:
> However, if any val itself or one of the intermediary values is
> defined as a non-object such as a string, number, or jsonb null,
> traversal cannot proceed so an error is raised and the transaction
> aborted.

---

### Case 1b: Array access on a jsonb object
select ('{"a": 1}'::jsonb)[0];
  jsonb
-------

(1 row)

select json_query('{"a": 1}'::jsonb, 'lax $[0]' with conditional array
wrapper);
  json_query
------------
  {"a": 1}
(1 row)

Same issue as Case 1a, but with an jsonb object instead of a scalar.
json_query() returns the same input json object as expected, whereas
the jsonb subscripting returns null. The jsonb subscripting behavior
here is undocumented.

### Case 1c: Chained array access on a one-dimensional json array

select ('[1, "2", null]'::jsonb)[1][0];
  jsonb
-------

(1 row)

select json_query('[1, "2", null]'::jsonb, 'lax $[1][0]' with
conditional array wrapper);
  json_query
------------
  "2"
(1 row)

This reduces to Case 1a: the intermediate result of [1] is a scalar
("2"), which SQL/JSON treats as a single-element array in 'lax' mode.

---

### Case 2: Non-integer array index

select ('[1, "2", null]'::jsonb)[1.0];
ERROR:  subscript type numeric is not supported
LINE 1: select ('[1, "2", null]'::jsonb)[1.0];
                                         ^
HINT:  jsonb subscript must be coercible to either integer or text.

select json_query('[1, "2", null]'::jsonb, 'lax $[1.0]' with conditional
array wrapper);
  json_query
------------
  "2"
(1 row)

select json_query('[1, "2", null]'::jsonb, 'lax $[1.9]' with conditional
array wrapper);
  json_query
------------
  "2"
(1 row)

The standard allows non-integer numeric subscripts, with
implementation-defined rounding or truncation (IA073). PostgreSQL’s
json_query() uses truncation, which is compliant.
     
jsonb subscripting does not support this at all, which I think is
reasonable, but again diverges from the standard behavior.
     
For reference, since DuckDB and Oracle both support JSON simplified
accessor, I tried similar queries on them as well:
     
-- DuckDB
duckdb> select ('[1, "2", null]'::json)[0.4]; -- returns 1
duckdb> select ('[1, "2", null]'::json)[0.5]; -- returns 2

-- Oracle
WITH data AS (
  SELECT JSON('[1, "2", null]') AS jcol
)
SELECT d.jcol[1.5] FROM data d; -- error
     
So DuckDB does rounding, and Orcale errors due to "an invalid array
index".

---

### Case 3: Negative array index
select ('[1, "2", null]'::jsonb)[-2];
  jsonb
-------
  "2"
(1 row)

select json_query('[1, "2", null]'::jsonb, 'lax $[-2]' with conditional
array wrapper);
  json_query
------------

(1 row)
     
select json_query('[1, "2", null]'::jsonb, 'lax $[-0.9]' with conditional         array wrapper);
 json_query
------------
 1
(1 row)

The json_query() behavior is correct according to the standard for out
of range index in 'lax' mode; and the jsonb subscripting is a custom
extension (-1 means last element, etc.).

Interestingly, DuckDB's JSON array accessor made the same custom
choice as PostgreSQL for negative index.

---

## Summary
     
In all cases above, json_query() follows the SQL/JSON standard
specification. jsonb subscripting, which predates the standard,
differs in several ways:

1. Array access on non-arrays (scalars or objects) does not use 'lax'
mode wrapping. As a result, "[0]" does not return the original value.
2. Non-integer subscripts are not supported.
3. Negative subscripts use a PostgreSQL-specific extension.
     
These are all edge cases and likely low-impact in practice. However,
if we were to add broader support for SQL/JSON simplified accessors
(dot notation, wildcards, item methods, etc.), which can be chained
together and include array access using the existing bracket syntax,
it seems useful to clarify what semantics we would want for array
access in those situations.

For example, with expressions such as

    select (jb)[0].a
    select (jb).a[0].b
    select (jb).a[-1].b

(where jb is a jsonb value), it would be potentially confusing if
their behavior differed from the equivalent json_query() calls using
the same JSON path.

Given this, it is unclear to me whether the expectation should be to
move closer to the SQL/JSON path semantics, or to preserve the
existing jsonb subscripting behavior and document it as
PostgreSQL-specific.

This question comes up in the context of the dot-notation work I
mentioned at the beginning, but more generally it seems like something
we may want to be explicit about before extending the syntax further.

I would very much appreciate any thoughts or guidance on this.
     
Best,
Alex
     
[1] https://commitfest.postgresql.org/patch/5214/
[2] https://www.postgresql.org/docs/current/datatype-json.html#JSONB-SUBSCRIPTING
[3] https://www.postgresql.org/message-id/a8843d0a-8adb-4fdc-9ac8-8efd22f7d29c%40eisentraut.org
[4]https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-STRICT-AND-LAX-MODES

--
Alexandra Wang

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Next
From: Tomas Vondra
Date:
Subject: Re: failed NUMA pages inquiry status: Operation not permitted