Thread: '{"x": 42, "y": null}'::jsonb != '{"x": 42}'::jsonb ... Really?
Here’s the minimal testcase:
do $body$
declare
j1 constant jsonb not null := '{"x": 42, "y": null}';
j2 constant jsonb not null := '{"x": 42 }';
predicate_1 constant boolean not null := (j1->>'y' is null) AND (j2->>'y' is null);
predicate_2 constant boolean not null := j1 = j2;
begin
assert predicate_1;
assert not predicate_2;
end;
$body$;
The block finishes silently.
I certainly expect "predicate_1" to be true. This reflects the defined JSON semantics that, within an object, the omission of a key-value pair is the same as its presence with a value equal to the (bizarre) JSON null.
As I reason it, the truth of "predicate_1" implies the truth of "predicate_2" because "jsonb" claims to represent the underlying semantics of a JSON document using its own secret post-parse representation.
- Am I missing a use case where an object with a key-value pair with a JSON null value is meaningfully different from one where the key is simply absent?
- If not, would you consider what I showed to be evidence of a bug?
My discovery led me to design other tests.
This block confirms the basic idea that the meaning of a "jsonb" value is independent of the formatting of the incoming document that defined it.
do $body$
declare
t1 constant text not null := '
{
"x": 42,
"y": 17
}
';
j1 constant jsonb not null := t1::jsonb;
t2 constant text not null := j1::text;
j2 constant jsonb not null := t2::jsonb;
predicate_3 constant boolean not null := t2 = t1;
predicate_4 constant boolean not null := j2 = j1;
begin
assert not predicate_3;
assert predicate_4;
end;
$body$;
Another test (I won't bore you with its code) confirms that the order in which the incoming document lists key-value pairs has no consequence for its meaning.
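(For the record, a minimal version of that test could look like this:)

```sql
do $body$
declare
  j1 constant jsonb not null := '{"a": 1, "b": 2}';
  j2 constant jsonb not null := '{"b": 2, "a": 1}';
begin
  assert j1 = j2;             -- key order in the incoming text is irrelevant
  assert j1::text = j2::text; -- the canonical output form is identical, too
end;
$body$;
```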
Here's another twist on the same basic issue:
create type t_ as (a int, b int);
create function f()
returns table(z text)
language plpgsql
as $body$
declare
j1_in constant jsonb not null := '{"a": 17, "b": null}';
j2_in constant jsonb not null := '{"a": 17 }';
r1 constant t_ not null := jsonb_populate_record(null::t_, j1_in);
r2 constant t_ not null := jsonb_populate_record(null::t_, j2_in);
j1_out constant jsonb not null := to_jsonb(r1);
j2_out constant jsonb not null := to_jsonb(r2);
begin
z := j1_out::text; return next;
z := j2_out::text; return next;
end;
$body$;
select f();
This is the result:
{"a": 17, "b": null}
{"a": 17, "b": null}
I have a little demo where I shred a set of "books" incoming JSON documents (where a book has a genre and many authors) into the classic Codd-and-Date four tables: books, authors, genres, and book_author_intersections. Then I scrunch each book back to a single JSON doc. I want to prove that I get back what I started with. So I have to clutter what should be a simple test:
differ constant boolean not null :=
(
  with
    a as (select * from j_in except select * from j_out),
    b as (select * from j_out except select * from j_in)
  select (exists(select 1 from a) or exists(select 1 from b))
);
with no end of ad-hoc-ery.
On Fri, Jun 3, 2022 at 6:41 PM Bryn Llewellyn <bryn@yugabyte.com> wrote:
declare
j1 constant jsonb not null := '{"x": 42, "y": null}';
j2 constant jsonb not null := '{"x": 42 }';
(j1->>'y' is null)
This produces a JSON null which, when asked for as a text data type, results in SQL NULL due to implicit type coercion.
(j2->>'y' is null)
This produces "key y not present in JSON" but someone decided that was too unfriendly and so we instead produce SQL NULL. This SQL NULL exists for a different reason than the SQL NULL in the previous case. Unfortunately, there is only a single concept of NULL in the SQL Language.
predicate_2 constant boolean not null := j1 = j2;
This is trivially false, j1 has a key of "y" while j2 does not. If there is a complaint to be had, this returning false isn't it.
Another test (I won't bore you with its code) confirms that the order in which the incoming document lists key-value pairs has no consequence for its meaning.
Yes, the equality test is semantic, do all the same keys and corresponding values exist? If so, return true.
j1_in constant jsonb not null := '{"a": 17, "b": null}';
j2_in constant jsonb not null := '{"a": 17 }';

This is the result:

{"a": 17, "b": null}
{"a": 17, "b": null}
Yep, for basically the same reason as the first point. The nulls are consequences of different situations (lack of key, value of key being null) being largely indistinguishable at the SQL level. We provide a "does the key exist" operator if you do need to make that determination.
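A quick illustration of that operator:

```sql
-- The ? operator reports physical key presence, which distinguishes the
-- two situations that both surface as SQL NULL through ->>.
select
  '{"x": 42, "y": null}'::jsonb ? 'y' as key_present_with_null_value,  -- true
  '{"x": 42}'::jsonb ? 'y'            as key_absent;                   -- false
```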
I have a little demo where I shred a set of "books" incoming JSON documents (where a book has a genre and many authors) into the classic Codd-and-Date four tables: books, authors, genres, and book_author_intersections. Then I scrunch each book back to a single JSON doc. I want to prove that I get back what I started with.
Yeah, the lack of any goal of round-tripping conversions from JSON through SQL and back into JSON makes proving that the system does such a thing problematic. You'll get back something meaningfully equivalent, by your own argument, but not identical on a key-by-key basis.
David J.
david.g.johnston@gmail.com wrote:
Thanks for attempting to satisfy me, David—but I'm sad to say that I remain somewhat unsatisfied. This is doubtless my fault. I suppose that I knew, before I asked, that the ultimate answer would be "It is what it is and it will never change."
Of course, I can predict the outcomes of countless tests on the basis that I understand "what it is". Here's just one such (just as you suggested):
do $body$
declare
j1 constant jsonb not null := '{"a": 1, "b": null}';
j2 constant jsonb not null := '{"a": 1 }';
n1 constant int := (select count(*) from jsonb_object_keys(j1));
n2 constant int := (select count(*) from jsonb_object_keys(j2));
begin
assert (j1 ? 'b');
assert not (j2 ? 'b');
assert (n1 = 2) and (n2 = 1);
end;
$body$;
create type t_ as (a int, b int);
The idea of "null" is a strange beast just within PostgreSQL—let alone all SQL systems. The classic understanding of its semantics is "There is simply no information available about the value". And this explains the treacherous three-by-three outcome table for operators like OR and AND. However, this understanding can't cope with the outcome here:
do $body$
declare
b1 constant boolean := null;
b2 constant boolean := null;
r1 constant t_ not null := (b1, b2);
r2 constant t_ not null := (b1, b2);
begin
assert (b1 = b2) is null;
assert (b1 is not distinct from b2);
assert (r1 = r2);
end;
$body$;
I'd have thought that I'd need to weigh in with "is not distinct from" to get "r1" and "r2" to compare as the same, just as I do with "b1" and "b2". So maybe it's not surprising that "null" in JSON is quirky too.
I'm going to try to think like this:
The number of possible spellings of the names of keys in a JSON object is some flavor of infinite. So including this in an object:
"k": null
really is saying something. It says that I do know about "k" but that I simply have no information available about its value.
The quirkiness that my first example showed can be seen differently from how I saw it first. When I now consider this expression:
('{"x": 42}'::jsonb)->>'y'
it seems that its evaluation should simply raise an exception. But you said:
This produces "key y not present in JSON" but someone decided that was too unfriendly and so we instead produce SQL NULL.
Oh well, I know how to program the cases that I care about to get the outcomes that I want. It just means lots of typing. But that's anyway what one signs up for who decides to work with JSON…
Bryn Llewellyn <bryn@yugabyte.com> writes:
> I'm going to try to think like this:
> The number of possible spellings of the names of keys in a JSON object is some flavor of infinite. So including this in an object:
> "k": null
> really is saying something. It says that I do know about "k" and that yet I have simply no information available about its value.

I'd read it as asserting that key "k" is meaningful for this object, but the correct value for that key is not known.

I have a hard time with your assertion that {"x": 42, "y": null} should be considered equivalent to {"x": 42}, because it would render key-exists predicates useless. Either you have to say that key "y" is claimed to exist in both of these objects and indeed every object, or you have to make it fail if the key's value is null (so that it'd say "false" in both of these cases). Either of those options seems both weird and useless.

> The quirkiness that my first example showed can be seen differently from how I saw it first. When I now consider this expression:
> ('{"x": 42}'::jsonb)->>'y'
> it seems that its evaluation should simply raise an exception. But you said:
>> This produces "key y not present in JSON" but someone decided that was too unfriendly and so we instead produce SQL NULL.

Right. This is hard to justify from a purist semantic point of view, but having the operator throw an error in such cases would make it close to unusable on not-uniformly-structured data. And really the point of using JSON inside a SQL database is to cope with irregularly-structured data, so fuzziness seems like what we want.

regards, tom lane
On Sat, Jun 4, 2022 at 12:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bryn Llewellyn <bryn@yugabyte.com> writes:
> I'm going to try to think like this:
> The number of possible spellings of the names of keys in a JSON object is some flavor of infinite. So including this in an object:
> "k": null
> really is saying something. It says that I do know about "k" and that yet I have simply no information available about its value.
> I'd read it as asserting that key "k" is meaningful for this object,
> but the correct value for that key is not known.
> I have a hard time with your assertion that {"x": 42, "y": null}
> should be considered equivalent to {"x": 42}, because it would
> render key-exists predicates useless. Either you have to say that
> key "y" is claimed to exist in both of these objects and indeed every
> object, or you have to make it fail if the key's value is null (so that
> it'd say "false" in both of these cases). Either of those options
> seems both weird and useless.
yeah. I would expect for json or jsonb, two values, a, b,
a is distinct from b
should give the same answer as
a::text is distinct from b::text
merlin
Thank you very much for this, Tom. And thanks again to you, David, for your earlier input. I hope that it's clear that the purpose of my questions is to discover what I'm missing—both w.r.t. actual semantics and w.r.t. the use cases that motivate PostgreSQL's functionality. Sorry if, contrary to my intention, my questions (here and on other topics) might seem to challenge established wisdom and precedent.
I delayed my reply until I'd had time to think, to do some study, and (most importantly) to implement a complete, self-contained proof-of-concept to substantiate my conclusion. I'm ready, now, to report back.
Summary: I can meet my goal by using PG's native functionality appropriately.
So, w.r.t. this email’s subject, this:
select strip_null_keys('{"x": 42, "y": null}'::jsonb) = '{"x": 42}'::jsonb;
returns "true".
The point at issue is whether the presence of « "some key": null » is different from the absence of "some key". And you (all) have argued that the two locutions for what I have wanted to see as one notion are indeed different. Well, yes, of course they are. This is tautologically true if you think only of the Unicode text of a JSON document that's so far waiting to be ingested by who knows what system. And you pointed out that, in PostgreSQL
my_doc ? 'some key'
detects the difference. You've convinced me that some use cases will care about this—in other words, it's a distinction *with* a difference. In my use case, however, the two locutions bring no semantic difference. Yet the distinction causes me a problem when I try to prove that this pair of transformations is idempotent:
JSON → relational → JSON
But that problem is behind me now.
So (if you've read this far at all) you can stop now unless you're interested in reading further.
Before getting to my use case, I noted that Tom said this about the fact that using ->> to read a non-existent key returns a SQL null:
[This] is hard to justify from a purist semantic point of view, but having the operator throw an error in such cases would make it close to unusable on not-uniformly-structured data.
It's easy to fix this, when the app requires strictness, by implementing a user-defined operator pair, say +>> and its partner. I did this (to respect what my use case needs) —and it passed all my tests. However, it turned out, for reasons that I explain below, that I didn't need it. The operator's implementation function reads the value with the appropriate native operator and only if it returns SQL null (or JSON null) does it do more testing. First it checks if the key is absent with the ? operator—raising an error if it is so. Then it reads the value (again) with the -> native operator and raises an error if it gets a JSON null.
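In sketch form (the names and error texts here are illustrative, not my exact code), the idea is this:

```sql
-- Sketch only: a strict variant of ->> that refuses both an absent key
-- and a key whose value is JSON null.
create function strict_text_value(j in jsonb, key in text)
  returns text
  immutable
  language plpgsql
as $body$
declare
  v constant text := j->>key;
begin
  if v is null then
    -- Distinguish the two reasons for the SQL null.
    if not (j ? key) then
      raise exception 'key "%" is absent', key;
    elsif (j->key) = 'null'::jsonb then
      raise exception 'key "%" has a JSON null value', key;
    end if;
  end if;
  return v;
end;
$body$;

create operator +>> (leftarg = jsonb, rightarg = text, function = strict_text_value);
```

With this in place, « j +>> 'isbn' » either returns a non-null text value or raises an error.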
————————————————————
My use case
I should stress that what I have, so far, is simply a demo implementation on my laptop. All the tests that I've managed to invent (including the idempotency test) work as I expect them to. However, any aspect of my code can be changed in a heartbeat if I realize that it's suspect.
Here's an example of the documents that I have in my table's "jsonb" column.
{
"isbn" : "978-0-14-303809-2",
"title" : "Joy Luck Club",
"year" : 2006,
"authors" : [
{"given name": "Amy", "family name" : "Tan"}
],
"genre" : "Novel"
}
The documents are very definitely supposed to adhere to a JSON Schema.
But I don't need the formality of JSON Schema's notation for such a simple case. Prose will do fine.
/* ————— START OF SPEC —————————————————————————————— */
The document's top-level object may use only these keys:
"isbn" — string
values must be unique across the entire set of documents (in other words, it defines the unique business key); values must have this pattern:
« ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »
"title" — string
"year" — number
must be a positive integral value
"authors" — array of objects;
must be at least one object
"genre" — string
Each object in the "authors" array object may use only these keys:
"family name" — string
"given name" — string
String values other than for "isbn" are unconstrained.
Any key other than the seven listed here is illegal. The "genre" and "given name" keys are not required. All the other keys are required.
The meaning of *required* is that no extracted value must bring a SQL null (so a required key must not have a JSON null value).
And the meaning of *not required* is simply "no information is available for this key" (with no nuances). The spec author goes further by adding a rule: this meaning must be expressed by the absence of such a key.
/* ————— END OF SPEC ———————————————————————————————— */
The rule that « "some key": null » is not allowed brings the benefit that a document can be maximally terse. Moreover, it helps document authors by removing the need to decide which locution to use; and it helps programmers to write code to check that incoming documents adhere to the spec and then to extract their meaning (for example to a classic relational representation).
It's easy to see that the information content implies these business rules:
—Each book must have at least one author. Each author may be among the authors of one or several books.
—Each book may be of exactly one (known) genre. Each genre may classify one or several books.
My code does this:
—Ingests the input JSON documents into a "source(k... primary key, book_info jsonb)" table.
—Shreds the book facts into a classic Codd-and-Date relational representation (with the obvious tables "books", "genres", "authors", and "books_authors", and the usual PK and FK constraints).
—Transforms the aggregated set of facts for each book back to a set of JSON documents for transport to a different system. Critically, these must adhere to the same JSON Schema (and rules) that govern the incoming documents.
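In outline, the "scrunch back" step leans on the native jsonb constructors. (The table and column names in this sketch are hypothetical, not my demo's actual names; "strip_null_keys()" is the function I describe below.)

```sql
-- Hypothetical sketch: rebuild one document per book from the four tables.
select strip_null_keys(jsonb_build_object(
         'isbn',    b.isbn,
         'title',   b.title,
         'year',    b.year,
         'genre',   g.name,   -- null when the genre is unknown; stripped below
         'authors', (select jsonb_agg(strip_null_keys(jsonb_build_object(
                              'given name',  a.given_name,
                              'family name', a.family_name)))
                     from books_authors ba
                     join authors a on a.k = ba.author_k
                     where ba.book_k = b.k)))
from books b
left join genres g on g.k = b.genre_k;
```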
This is where the idempotency requirement that I mentioned above comes from:
JSON → relational → JSON
It's been suggested that this is an ignoble and unattainable goal. I disagree—on both counts.
I implemented two complementary functions:
—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some key": null »
—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" value
The code checks with "no_null_keys()" that, as expected, no ingested JSON document has an occurrence of « "some key": null ».
And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as appropriate, any other built-in JSON function that produces a "jsonb" value.
It was straightforward to implement these two functions by using REGEXP built-in functionality on the canonically formatted "text" value produced by the "jsonb::text" typecast.
The check on the incoming documents is included in the "j_books_book_info_is_conformant(jsonb)" function that is the basis of a constraint that's created on the "source" table's "book_info" column.
More code is needed to implement other constraints: for example, the value of the "isbn" (string) key must satisfy a particular regular expression, and the value of the "year" (number) key must convert to a positive integer. (You've heard about that test already.)
These tests, too, (and other tests) are included in the "j_books_book_info_is_conformant(jsonb)" function.
Critically, one test uses "jsonb_object_keys()" to scan the top-level object to ensure that all the required keys are present, that every key has the specified JSON data type, and that no keys that the JSON Schema doesn't mention are present. A similar test does the same for the "authors" array. This is why I can be sure that the native ->> and -> operators are sufficient for my purpose.
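The shape of that key-set test, taken in isolation, is roughly this (a sketch, not the actual function):

```sql
-- Sketch: the top-level object may use only the five allowed keys,
-- and the four required ones must be present.
create function top_level_keys_ok(j in jsonb)
  returns boolean
  immutable
  language sql
as $body$
  select coalesce(
           (select bool_and(k in ('isbn', 'title', 'year', 'authors', 'genre'))
            from jsonb_object_keys(j) as keys(k)),
           true)
         and (j ? 'isbn') and (j ? 'title') and (j ? 'year') and (j ? 'authors');
$body$;
```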
I considered using the "j_books_book_info_is_conformant(jsonb)" function in a constraint for a domain based on "jsonb" but decided against that.
ISBN (was: '{"x": 42, "y": null}'::jsonb != '{"x": 42}'::jsonb ... Really?)
From: "Peter J. Holzer"
On 2022-06-17 17:14:14 -0700, Bryn Llewellyn wrote:
> "isbn" — string
> values must be unique across the entire set of documents (in other words, it
> defines the unique business key); values must have this pattern:
>
> « ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

Completely off-topic, but this regexp doesn't describe ISBNs. In ISBNs the three middle subfields are all variable length. The first is a language code (there are more than 10 languages in the world), the second identifies the publisher (there are more than 100 publishers) and the third the book. For example, "Mastering PostgreSQL 9.6" has the ISBN 978-1-78355-535-2.

hp
> On 18 Jun 2022, at 2:14, Bryn Llewellyn <bryn@yugabyte.com> wrote:
>
> I implemented two complementary functions:
>
> —"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some key": null »
>
> —"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" value
>
> The code checks with "no_null_keys()" that, as expected, no ingested JSON document has an occurrence of « "some key": null ».
>
> And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as appropriate, any other built-in JSON function that produces a "jsonb" value.
>
> It was straightforward to implement these two functions by using REGEXP built-in functionality on the canonically formatted "text" value produced by the "jsonb::text" typecast.

In my experience, using regular expressions applied to document formats tends to get you false positives. I'd be worried about books with titles similar to 'How we wrote a regular expression to detect occurrences of "some key": null in our JSON documents', for example.

For stripping those null occurrences, you are aware of the json_strip_nulls(json) and jsonb_strip_nulls(jsonb) functions, right?

For detecting them on a recent PG, the @? operator or jsonb_path_exists(jsonb, jsonpath) function would probably do the trick.

I am not too familiar with JSONPATH expressions, but I expect (it passed some preliminary testing) this would detect your nulls just fine, while taking JSON semantics into account:

jsonb_path_exists(document, '$.** ? (@ == null)'::jsonpath)

For PG-specifics on JSONPATH, see section 9.16.2 on:

https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-OP-TABLE

A recursive query is another possible solution. It would probably perform far worse, but I find them more rewarding to write. Some people prefer Sudoku.

Regards,
Alban Hertroys
Re: ISBN (was: '{"x": 42, "y": null}'::jsonb != '{"x": 42}'::jsonb ... Really?)
From: Bryn Llewellyn
hjp-pgsql@hjp.at wrote:

Bryn wrote:
"isbn" — string
values must be unique across the entire set of documents (in other words, it
defines the unique business key); values must have this pattern:
« ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »
Completely off-topic, but this regexp doesn't describe ISBNs. In ISBNs the three middle subfields are all variable length. The first is a language code (there are more than 10 languages in the world), the second identifies the publisher (there are more than 100 publishers) and the third the book. For example, "Mastering PostgreSQL 9.6" has the ISBN 978-1-78355-535-2.
Yes, I know. Sorry. I should have said that I simply wanted to illustrate a proof of concept for the notion, viable (only?) when you have incoming JSON documents with a well-defined schema, that this is idempotent:
JSON → relational → JSON
And it's of note that PostgreSQL has had what you need to do the xform, in each direction, for a long time.
I simply typed up my "corpus" by hand. It didn't matter that the ISBN format was a fairy tale. A book does have an ISBN and (as far as my P.o.C. is concerned) it's the unique business key. I should have cut out the REGEXP in my email and said something like "must conform to the specified format". But I was careless in my proofreading.
Thanks for not mentioning that books have publishers—so that "authors" isn't the only key that implies a Codd-and-Date many-to-many!
haramrae@gmail.com wrote:

bryn@yugabyte.com wrote:
I implemented two complementary functions:
—"no_null_keys()" checks that a "jsonb" value has no occurrences of « "some key": null »
—"strip_null_keys()" removes « "some key": null » occurrences from a "jsonb" value
The code checks with "no_null_keys()" that, as expected, no ingested JSON document has an occurrence of « "some key": null ».
And it uses "strip_null_keys()" on the output of "to_jsonb()" — and, as appropriate, any other built-in JSON function that produces a "jsonb" value.
It was straightforward to implement these two functions by using REGEXP built-in functionality on the canonically formatted "text" value produced by the "jsonb::text" typecast.
In my experience, using regular expressions applied to document formats tends to get you false positives. I'd be worried about books with titles similar to:

« How we wrote a regular expression to detect occurrences of "some key": null in our JSON documents »
For stripping those null occurrences, you are aware of the json_strip_nulls(json) and jsonb_strip_nulls(jsonb) functions, right?
For detecting them on a recent PG, the @? operator or jsonb_path_exists(jsonb, jsonpath) function would probably do the trick.
I am not too familiar with JSONPATH expressions, but I expect (it passed some preliminary testing) this would detect your nulls just fine, while taking JSON semantics into account:
jsonb_path_exists(document, '$.** ? (@ == null)'::jsonpath)
For PG-specifics on JSONPATH, see section 9.16.2:
https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-OP-TABLE
A recursive query is another possible solution. It would probably perform far worse...
The "G", "E", "L", and "O" keys on my keyboard were broken. They're fixed now, and so "postgres strip keys with null value from jsonb object" got me to this (but, strangely, not to the PG docs):
This is embarrassing. I wrote that doc, along with the rest of the content in the "JSON data types and functionality" section just before COVID hit us. Apparently, life under lockdown has damaged my corpus callosum. I typed up my proof-of-concept code and emails with one half of my brain—and it was the other half that wrote that documentation.
So thanks for the memory jog. My excuse is that (as was the case with my malformed ISBN that Peter Holzer pointed out in a separate thread) I wanted just to show myself, as a proof-of-concept, that stripping nulls was feasible—so I gave it no more thought once I'd done that. But, I suppose, that's no excuse...
Anyway, my "strip_null_keys()" is already on the scrapheap. And the body of my "no_null_keys()" reduced to a single line:
create function no_null_keys(j in jsonb)
returns boolean
immutable
language sql
as $body$
select j = jsonb_strip_nulls(j);
$body$;
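A quick sanity check of its behavior (these calls assume the function above has been created):

```sql
select
  no_null_keys('{"a": 1, "b": 2}'::jsonb)           as clean,       -- true
  no_null_keys('{"a": 1, "b": null}'::jsonb)        as top_null,    -- false
  no_null_keys('{"a": 1, "b": {"c": null}}'::jsonb) as nested_null, -- false
  no_null_keys('[1, null, 2]'::jsonb)               as array_null;  -- true:
  -- jsonb_strip_nulls leaves array elements alone, since they aren't keys
```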
You might argue that I don't need to bother with the encapsulation. But it makes testing easier—and I'm trusting that inlining works as advertised.
Your point about false positives is well taken. So, just for sport:
create type t1 as (k int, v text);
create type t2 as (a int, b int, c t1, d t1, e text[]);
create function j()
returns jsonb
language plpgsql
as $body$
declare
t constant text not null := 'How we wrote a regular expression to detect occurrences of « "some key": null » in our JSON documents!';
c1 constant t1 not null := (17, t);
c2 constant t1 not null := (29, null);
arr constant text[] not null := array['x', null::text, t];
r constant t2 not null := (42, null, c1, c2, arr);
begin
return to_jsonb(r);
end;
$body$;
select jsonb_pretty(j());
The output includes two occurrences of this:
"How we wrote a regular expression to detect occurrences of « \"some key\": null » in our JSON documents!"
I believe that the "jsonb" to "text" conversion never produces an isolated double-quote within the representation of an object key's value. I checked that my "strip_null_keys()" handled your example before consigning it to the scrapheap—and it didn't let me down. But it would be foolish to argue that there isn't some way to provoke a false positive.
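For what it's worth, that escaping behaviour is easy to demonstrate with any conforming JSON serializer; here Python's json module stands in for the "jsonb" to "text" conversion, and the double-quotes inside the string value come out escaped:

```python
import json

# A string value that itself contains « "some key": null ».
t = 'occurrences of « "some key": null » in our JSON documents!'
s = json.dumps({"msg": t}, ensure_ascii=False)

# The embedded double-quotes are escaped as \" in the serialized text,
# so a naive regex hunting for a bare  "some key": null  key-value pair
# would (correctly) not be fooled by this document.
print(s)
```

Round-tripping the text back through the parser recovers the original value, escapes and all.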
On Sat, Jun 18, 2022 at 5:44 AM Bryn Llewellyn <bryn@yugabyte.com> wrote:
/* ————— START OF SPEC —————————————————————————————— */
The document's top-level object may use only these keys:

"isbn" — string
values must be unique across the entire set of documents (in other words, it defines the unique business key); values must have this pattern:
« ^[0-9]{3}-[0-9]{1}-[0-9]{2}-[0-9]{6}-[0-9]{1}$ »

"title" — string

"year" — number
must be a positive integral value

"authors" — array of objects
must be at least one object

"genre" — string

Each object in the "authors" array may use only these keys:

"family name" — string
"given name" — string

String values other than for "isbn" are unconstrained.

Any key other than the seven listed here is illegal. The "genre" and "given name" keys are not required. All the other keys are required.

The meaning of *required* is that no extracted value must bring a SQL null (so a required key must not have a JSON null value). And the meaning of *not required* is simply "no information is available for this key" (with no nuances). The spec author goes further by adding a rule: this meaning must be expressed by the absence of such a key.
/* ————— END OF SPEC ———————————————————————————————— */
create temp table source(
isbn text primary key,
book_info_text text,
book_info jsonb generated always as ( book_info_text::jsonb ) stored
CONSTRAINT
test_jsonb_constraints1 check (book_info_text is json)
CONSTRAINT test_jsonb_constraints2 check (JSON_EXISTS(book_info_text::jsonb,'$.title') )
CONSTRAINT test_jsonb_constraints3 check (JSON_VALUE(book_info_text::jsonb,'$.year' returning int) > 0)
CONSTRAINT test_jsonb_constraints4 check (JSON_EXISTS(book_info_text::jsonb,'$.genre'))
CONSTRAINT test_jsonb_constraints5 check (not JSON_EXISTS(book_info_text::jsonb,'$.not_as_toplevel_key'))
CONSTRAINT test_jsonb_constraints6 check (
(JSON_VALUE(book_info_text::jsonb,'$.authors[*]."family name"') is not null)
OR
(JSON_VALUE(book_info_text::jsonb,'$.authors[*]."given name"' ) is not null)
)
);
Some of the problems I don't know how to solve. My intuition is that keeping the isbn attribute inside the JSON document column and then enforcing the unique constraint on it would be an anti-pattern. So I put the isbn outside as a separate column.
Another constraint is that only certain keys are allowed in the jsonb. I don't know how to implement that, but I feel like it's doable.
Also, with two columns, one text and one stored generated jsonb column, there is a duplication issue...
So there is another alternative way to do it:
normal relational tables, where insert is done via the json_table construct to convert the JSON into rows, and output can easily be produced with row_to_json.
For example:
select * from json_table('{
"title" : "Joy Luck Club",
"year" : 2006,
"authors" : [{"given name": "Amy", "family name" : "Tan"}],
"genre" : "Novel"
}'::jsonb,
'$'
COLUMNS(
id for ordinality,
title text path '$.title',
year int path '$.year',
genre text path '$.genre',
nested path '$.authors[*]'
columns(
"given name" text path '$."given name"'
,"family name" text path '$."family name"'
)
)
);
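(In case it helps to see the intended flattening independently of the new SQL/JSON syntax, here's a rough Python rendering of what that json_table call produces: one row per element of "authors", with the ordinality column and the top-level scalars repeated. This is my sketch of the semantics, not output from Postgres.)

```python
import json

doc = json.loads('''{
  "title":   "Joy Luck Club",
  "year":    2006,
  "authors": [{"given name": "Amy", "family name": "Tan"}],
  "genre":   "Novel"
}''')

# One output row per author, mirroring the NESTED PATH clause above;
# the first tuple element plays the role of the FOR ORDINALITY column.
rows = [
    (i, doc.get("title"), doc.get("year"), doc.get("genre"),
     a.get("given name"), a.get("family name"))
    for i, a in enumerate(doc.get("authors", []), start=1)
]
print(rows)
```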
I recommend David Deutsch's <<The Beginning of Infinity>>
Jian
jian.universality@gmail.com wrote:
> bryn@yugabyte.com wrote:
/* ————— START OF SPEC —————————————————————————————— */
The document's top-level object may use only these keys:
...
And the meaning of *not required* is simply "no information is available for this key" (with no nuances). The spec author goes further by adding a rule: this meaning must be expressed by the absence of such a key.
/* ————— END OF SPEC ———————————————————————————————— */
create temp table source(
...
CONSTRAINT test_jsonb_constraints1 check (book_info_text is json)
CONSTRAINT test_jsonb_constraints2 check (JSON_EXISTS(book_info_text::jsonb,'$.title') )
...
)
);
Some of the problems I don't know how to solve. My intuition is that keeping the isbn attribute inside the JSON document column and then enforcing the unique constraint on it would be an anti-pattern. So I put the isbn outside as a separate column. Another constraint is that only certain keys are allowed in the jsonb. I don't know how to implement that, but I feel like it's doable.
Also, with two columns, one text and one stored generated jsonb column, there is a duplication issue...
So there is another alternative way to do it...
Thank you very much for your suggestions, Jian. I'll try them and think carefully about how everything then looks over the next week or so.
Meanwhile, I hope that it's clear to all that I have nothing more than prototype code on my own laptop. I can change any aspect of it in no time when I come to see better alternatives. (In particular, I can easily heave out my use of "domains with not null constraints".)
My code, as it stands, does meet the goals that I set for it (esp. by meeting the "JSON → relational → JSON" idempotency requirement). I also handle the "required", or "not required", status of the attributes of the objects (and only specified keys present) quite easily. But I don't see any sign of "only specified keys present" in your code.
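For concreteness, this is the shape of the "only specified keys present" check that I have in mind (a Python sketch with hypothetical names; the key lists come straight from the spec, and the eventual implementation would of course be SQL against the "jsonb" value):

```python
import json

# The seven keys the spec allows, split by level.
TOP_LEVEL_KEYS = {"isbn", "title", "year", "authors", "genre"}
AUTHOR_KEYS    = {"family name", "given name"}

def only_specified_keys(doc):
    # True when the document uses no keys beyond those the spec allows,
    # both at the top level and inside each object in "authors".
    if not set(doc) <= TOP_LEVEL_KEYS:
        return False
    authors = doc.get("authors") or []
    return all(isinstance(a, dict) and set(a) <= AUTHOR_KEYS
               for a in authors)

good = json.loads(
    '{"isbn": "x", "title": "t", "year": 2006,'
    ' "authors": [{"family name": "Tan"}]}')
bad = json.loads('{"title": "t", "publisher": "oops", "authors": []}')
```

Here "only_specified_keys(good)" is true and "only_specified_keys(bad)" is false, because "publisher" isn't among the allowed top-level keys.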
I haven't tried my prototype with large volumes of synthetic data—and nor have I considered performance at all. That can come later. (But with my current implementation, my tiny end-to-end test kit completes crazily quickly.)
I probably didn't say out loud that the kinds of updates and queries that the four-table Codd-and-Date representation suggests can be supported with fairly ordinary SQL and not too much thought. However (at least as it feels to me), the corresponding operations on the native "jsonb" representation would be harder to design and write. Moreover, *any* change implies updating all of the indexes and re-checking all of the constraints.
In other words, my aim here is to treat JSON in the way that first motivated it (as a data transport format) and *not* in the way that it's often used (as a bucket for a set of noticeably heterogeneous documents). This is where the "JSON → relational → JSON" idempotency requirement comes from.
Sorry if I didn't make this clear.
Bryn Llewellyn schrieb am 04.06.2022 um 03:41:
> Am I missing a use case where an object with a key-value pair with a
> JSON null value is meaningfully different from one where the key is
> simply absent?

It seems the JSON specification doesn't actually define equality. But the JSON Patch RFC 6902 [1] defines the equality of two objects as:

« objects: are considered equal if they contain the same number of members, and if each member can be considered equal to a member in the other object, by comparing their keys (as strings) and their values (using this list of type-specific rules). »

As {"x": 42, "y": null} and {"x": 42} do not contain the same number of members, I think Postgres' behaviour is correct.

Thomas

[1] https://www.rfc-editor.org/rfc/rfc6902#section-4.6
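As an aside, that member-count rule is easy to see outside the database too: Python's json module parses JSON null as None, and dict equality then agrees with the RFC 6902 definition:

```python
import json

j1 = json.loads('{"x": 42, "y": null}')
j2 = json.loads('{"x": 42}')

# Two members versus one member, so the objects are not equal,
# even though j1["y"] carries no information beyond "null".
print(j1 == j2)  # → False
```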
> shammat@gmx.net wrote:
>> bryn@yugabyte.com wrote:
>>
>> Am I missing a use case where an object with a key-value pair with a JSON null value is meaningfully different from one where the key is simply absent?
>
> It seems the JSON specification doesn't actually define equality. But the JSON Patch RFC 6902:
>
> www.rfc-editor.org/rfc/rfc6902#section-4.6
>
> defines the equality of two objects as:
>
> « objects: are considered equal if they contain the same number of members, and if each member can be considered equal to a member in the other object, by comparing their keys (as strings) and their values (using this list of type-specific rules) »
>
> As {"x": 42, "y": null} and {"x": 42} do not contain the same number of members, I think Postgres' behaviour is correct.

Thanks. I believe that I addressed this in a separate branch in this thread. The distinction (between « "some key": null » and "some key" absent), which by all means matters in some contexts, does not matter w.r.t. the requirements of my (demo) app. And I've now proved (to myself, at least) that I can meet my goal easily with existing Postgres features—especially "jsonb_strip_nulls()". I hope that this means that we're all happy.