Thread: JSON for PG 9.2
Where are we with adding JSON for Postgres 9.2? We got bogged down in the data representation last time we discussed this. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian <bruce@momjian.us> wrote: > Where are we with adding JSON for Postgres 9.2? We got bogged down in > the data representation last time we discussed this. We're waiting for you to send a patch that resolves all previously-raised issues. :-) In all seriousness, I think the right long-term answer here is to have two data types - one that simply validates JSON and stores it as text, and the other of which uses some binary encoding. The first would be similar to our existing xml datatype and would be suitable for cases when all or nearly all of your storage and retrieval operations will be full-column operations, and the json type is basically just providing validation. The second would be optimized for pulling out (or, perhaps, replacing) pieces of arrays or hashes, but would have additional serialization/deserialization overhead when working with the entire value. As far as I can see, these could be implemented independently of each other and in either order, but no one seems to have yet found the round tuits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
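For illustration, a rough sketch of how the two proposed types might divide the work (nothing below exists at this point; the type names follow the thread's shorthand and the accessor function is invented):

-- Hypothetical text-backed type: validates on input, stores the original text.
-- Good when storage and retrieval are whole-document operations.
create table events_text (body json);
select body from events_text;                        -- cheap: stored text returned as-is

-- Hypothetical binary-encoded type: good for pulling out or replacing pieces,
-- at the cost of (de)serialization when handling the whole value.
create table events_bin (body jsonb);
select json_get(body, 'user_id') from events_bin;    -- accessor name made up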
On Mon, Dec 12, 2011 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian <bruce@momjian.us> wrote: >> Where are we with adding JSON for Postgres 9.2? We got bogged down in >> the data representation last time we discussed this. > > We're waiting for you to send a patch that resolves all > previously-raised issues. :-) > > In all seriousness, I think the right long-term answer here is to have > two data types - one that simply validates JSON and stores it as text, > and the other of which uses some binary encoding. The first would be > similar to our existing xml datatype and would be suitable for cases > when all or nearly all of your storage and retrieval operations will > be full-column operations, and the json types is basically just > providing validation. The second would be optimized for pulling out > (or, perhaps, replacing) pieces of arrays or hashes, but would have > additional serialization/deserialization overhead when working with > the entire value. As far as I can see, these could be implemented > independently of each other and in either order, but no one seems to > have yet found the round tuits. Rather than fuss with specific data formats, why not implement something a little more useful? At present we can have typmods passed as a cstring, so it should be possible to add typmods onto the TEXT data type. e.g. TEXT('JSON'), TEXT('JSONB') We then invent a new catalog table called pg_text_format which has oid PRIMARY KEY textformatname UNIQUE textformatvalidationproc textformatstorageproc The typmod must reduce to a single integer, so we just store the integer. If no typmod, we store 0, so we have a fastpath for normal TEXT datatypes. This would then allow people to have variations of the TEXT type that supports conversions, casts, indexing etc without additional fuss and without everything else outside the database breaking because it doesn't know that datatype name. We could then support JSON (both kinds), YAML, etc as well as providing a way to add validation into the datatype itself. We can replace citext with TEXT('CASE_INSENSITIVE') Think of this as using the object-relational capabilities of Postgres to extend the TEXT data type. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
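A sketch of what the proposal might look like from the user's side, with the catalog roughly as described above (all of this is hypothetical; no such typmod or catalog exists):

-- Hypothetical typmod-based variants of TEXT:
create table docs (
    body   text('JSON'),             -- validated as JSON on input
    config text('YAML'),             -- validated as YAML
    email  text('CASE_INSENSITIVE')  -- citext-like comparison semantics
);

-- Proposed catalog, roughly as sketched above:
--   pg_text_format(oid, textformatname, textformatvalidationproc, textformatstorageproc)
-- The typmod reduces to the single integer identifying the format,
-- with 0 meaning plain TEXT (the fast path).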
On Mon, Dec 12, 2011 at 3:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Rather than fuss with specific data formats, why not implement > something a little more useful? > > At present we can have typmods passed as a cstring, so it should be > possible to add typmods onto the TEXT data type. > > e.g. TEXT('JSON'), TEXT('JSONB') > > We then invent a new catalog table called pg_text_format which has > oid PRIMARY KEY > textformatname UNIQUE > textformatvalidationproc > textformatstorageproc > > The typmod must reduce to a single integer, so we just store the > integer. If no typmod, we store 0, so we have a fastpath for normal > TEXT datatypes. > > This would then allow people to have variations of the TEXT type that > supports conversions, casts, indexing etc without additional fuss and > without everything else outside the database breaking because it > doesn't know that datatype name. > > We could then support JSON (both kinds), YAML, etc > as well as providing a way to add validation into the datatype itself. > We can replace citext with TEXT('CASE_INSENSITIVE') > > Think of this as using the object-relational capabilities of Postgres > to extend the TEXT data type. Well, it's arguable that text-format JSON and YAML and our existing XML datatype ought to share some structure with text, but binary-format JSON is a different animal altogether; you might as well propose having text('int8'). In any case, I doubt that trying to make the typmod provide subclassing behavior is going to work out very well. There are way too many places that assume that the typmod can just be discarded. I don't think that's going to fly, because =(text,text) probably has different semantics from =(json,json). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > There are way too many places that assume that the typmod can > just be discarded. If true, that probably ought to be documented cos it sounds fairly important. Where and when is it true? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 4:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> There are way too many places that assume that the typmod can >> just be discarded. > > If true, that probably ought to be documented cos it sounds fairly important. > > Where and when is it true? I'm not going to go compile an exhaustive list, since that would take a week and I don't have any particular desire to invest that much time in it, but just to take a couple of simple examples:

rhaas=# create or replace function wuzzle(numeric(5,2)) returns int as $$select 1$$ language sql;
CREATE FUNCTION
rhaas=# \df wuzzle
                       List of functions
 Schema |  Name  | Result data type | Argument data types |  Type
--------+--------+------------------+---------------------+--------
 public | wuzzle | numeric          |                     | normal
 public | wuzzle | integer          | numeric             | normal
(2 rows)

rhaas=# select pg_typeof(1.23::numeric(5,2));
 pg_typeof
-----------
 numeric
(1 row)

There are a very large number of others. Possibly grepping for places where we do getBaseType() rather than getBaseTypeAndTypmod() would be a way to find some of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On mån, 2011-12-12 at 21:08 +0000, Simon Riggs wrote: > On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > There are way too many places that assume that the typmod can > > just be discarded. > > If true, that probably ought to be documented cos it sounds fairly important. > > Where and when is it true? Function arguments and return values, for example.
On Dec 12, 2011, at 12:54 PM, Robert Haas wrote: > I don't think that's going to fly, because > =(text,text) probably has different semantics from =(json,json). No question:

david=# select '{"foo": 1, "bar": 2}'::json = '{"bar": 2, "foo": 1}'::json;
 ?column?
----------
 t
(1 row)

Best, David
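For contrast, the same comparison on plain text comes out false, since ordinary string equality compares the characters rather than the parsed structure:

select '{"foo": 1, "bar": 2}'::text = '{"bar": 2, "foo": 1}'::text;
 ?column?
----------
 f
(1 row)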
On 12/12/2011 03:54 PM, Robert Haas wrote: > On Mon, Dec 12, 2011 at 3:38 PM, Simon Riggs<simon@2ndquadrant.com> wrote: >> Rather than fuss with specific data formats, why not implement >> something a little more useful? >> >> At present we can have typmods passed as a cstring, so it should be >> possible to add typmods onto the TEXT data type. >> >> e.g. TEXT('JSON'), TEXT('JSONB') >> [...] > There are way too many places that assume that the typmod can > just be discarded. I don't think that's going to fly, because > =(text,text) probably has different semantics from =(json,json). > And certain places where they are not allowed at all, I think (unless I am misremembering the early debates about enum types and output functions). cheers andrew
Bruce, I thought that Joseph Adams was still working on this, sponsored by Heroku. Joseph? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
We reached out to Joseph to see if we could help sponsor the project, but never really heard back from him. Because we haven't heard from him in a while we've been using PL/V8 to validate a JSON datatype simulated by a DOMAIN with a simple acceptance function. (See below.) This is not ideally performant but thanks to V8's JIT the JSON parser is actually reasonably good. I think releasing something simple and non-performant with reasonable semantics would be the best next step. If it were up to me, I'd probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the crash bugs and deal breakers can be sifted out. PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or GIST operators out of the box, maybe those could be motivated by its inclusion. Andrew, you've been down in the guts here, what do you think? -pvh

(Code sample.)

create or replace function valid_json(json text) returns bool as $$
  try { JSON.parse(json); return true } catch(e) { return false }
$$ LANGUAGE plv8 IMMUTABLE STRICT;

select valid_json('{"key": "value"}'), valid_json('lol');
valid_json | t
valid_json | f
Time: 0.283 ms

create domain json as text check(valid_json(VALUE));
create table jsononly(data json);

insert into jsononly values 'lol';
ERROR:  syntax error at or near "'lol'"
LINE 1: insert into jsononly values 'lol';

insert into jsononly values ('{"ok": true}');
INSERT 0 1

-p

On Mon, Dec 12, 2011 at 1:34 PM, Josh Berkus <josh@agliodbs.com> wrote: > Bruce, > > I thought that Joseph Adams was still working on this, sponsored by > Heroku. Joseph? > > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com -- Peter van Hardenberg San Francisco, California "Everything was beautiful, and nothing hurt." -- Kurt Vonnegut
On Dec 12, 2011, at 4:51 PM, Peter van Hardenberg wrote: > Because we haven't heard from him in a while we've been using PL/V8 to > validate a JSON datatype simulated by a DOMAIN with a simple > acceptance function. (See below.) This is not ideally performant but > thanks to V8's JIT the JSON parser is actually reasonably good. > > I think releasing something simple and non-performant with reasonable > semantics would be the best next step. If it were up to me, I'd > probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the > crash bugs and deal breakers can be sifted out. Note that Claes Jakobsson has been working on a JSON data type using the Jansson JSON library. http://pgxn.org/dist/pg-json/ I've submitted a pull request renaming it to jansson-json (though the data type is still "json"): https://github.com/theory/pg-json/tree/pgxn Anyway, it seems like a decent start to an extensible type implemented entirely as an extension. Claes tells me he plans to add index support soonish, so it could get to be pretty robust before long. Just another stab at the problem to alert folks to. Best, David
On 12/12/2011 07:51 PM, Peter van Hardenberg wrote: > We reached out to Joseph to see if we could help sponsor the project, > but never really heard back from him. > > Because we haven't heard from him in a while we've been using PL/V8 to > validate a JSON datatype simulated by a DOMAIN with a simple > acceptance function. (See below.) This is not ideally performant but > thanks to V8's JIT the JSON parser is actually reasonably good. > > I think releasing something simple and non-performant with reasonable > semantics would be the best next step. If it were up to me, I'd > probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the > crash bugs and deal breakers can be sifted out. > > PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or > GIST operators out of the box, maybe those could be motivated by its > inclusion. > > Andrew, you've been down in the guts here, what do you think? The trouble with using JSON.parse() as a validator is that it's probably doing way too much work. PLV8 is cool, and I keep trying to get enough time to work on it more, but I don't think it's a substitute for a JSON type with a purpose built validator and some native operations. I think these efforts can continue in parallel. cheers andrew
On Mon, Dec 12, 2011 at 4:51 PM, Peter van Hardenberg <pvh@pvh.ca> wrote: > PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or > GIST operators out of the box, maybe those could be motivated by its > inclusion. I also feel that a big problem with JSON as a data type is that there is not a powerful, common navigation method. JSON path is basically pretty obscure by comparison to XPath. As a result, the common approach to navigation in a JSON structure is basically "write procedures". And that is only perfectly supported by a full-blown interpreter. So that's why I'm personally more inclined to lend my attention to embedding JavaScript entirely. Not to say there aren't areas ripe for improvement:

* It'd be nice to pass intermediate in-memory representations rather than calling JSON.parse all the time, similar to OPAQUE except sound (so bogus pointers cannot be passed). Basically, an "ephemeral type". It could save a lot when composing operators. I've needed this for other projects, but for much the same reason.

* It'd be nice to be able to safely define indexes in a trusted language somehow, writing the penalty and split functions, et al. Right now it's my recollection that defining GiST operators in a naive port to Javascript would give you the power to return garbage that is not merely wrong, but could also crash Postgres if it uses a bogus index. Ready and willing to be corrected*

* Some kind of partial toasting of large datums (I think Simon Riggs briefly glossed over such an idea when we were talking about this general use case)

But nothing I can quickly identify in Postgres as-is is opposed to any of these improvements at a design level, so they can be chipped off into incremental work in the future. -- fdr

* Madness, you say? http://bellard.org/jslinux/, if your browser is new enough. The relevant spec: https://www.khronos.org/registry/typedarray/specs/latest/
On Mon, Dec 12, 2011 at 5:36 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > The trouble with using JSON.parse() as a validator is that it's probably > doing way too much work. PLV8 is cool, and I keep trying to get enough time > to work on it more, but I don't think it's a substitute for a JSON type with > a purpose built validator and some native operations. I think these efforts > can continue in parallel. Hmm. Maybe? While I'm sure things could be faster, we've had results that are fast enough to be usable even with constant reparsing. Here are some microbenchmarks I did some time ago where I tried to find the overhead of calling JSON.parse and doing some really simple stuff in V8 that I thought would maximize the amount of constant-time overhead: https://gist.github.com/1150804 On my workstation, one core was able to do 130,000 JSON.parses + other stuff necessary to create an index per second. One could maybe try to improve the speed and memory footprint on large documents by having validators that don't actually build the V8 representation and possibly defining a space of operators that are known to build, by induction, valid JSON without rechecks. But in the end, I think there's already a class of problem where the performance plv8 provides is already quite sufficient, and provides a much more complete and familiar approach to the problem of how people choose to navigate, project, and manipulate JSON documents. I also haven't tried this for larger documents, as I was trying to get a sense of how much time was spent in a few primitive operations, and not testing performance with regard to document length. -- fdr
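As a concrete illustration of the kind of index Daniel is describing, here is a minimal sketch built on the json domain and jsononly table from Peter's earlier message (assumes PL/V8 is installed; the extraction function is invented for the example):

create or replace function json_get_text(doc text, key text) returns text as $$
  var obj = JSON.parse(doc);              // reparsed on every call, as discussed
  var val = obj[key];
  return (val === undefined || val === null) ? null : String(val);
$$ language plv8 immutable strict;

-- Expression index over the extracted key; queries using the same
-- expression can then use it.
create index jsononly_name_idx on jsononly (json_get_text(data, 'name'));
select * from jsononly where json_get_text(data, 'name') = 'alice';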
On 12/12/2011 08:46 PM, Daniel Farina wrote: > On Mon, Dec 12, 2011 at 5:36 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> The trouble with using JSON.parse() as a validator is that it's probably >> doing way too much work. PLV8 is cool, and I keep trying to get enough time >> to work on it more, but I don't think it's a substitute for a JSON type with >> a purpose built validator and some native operations. I think these efforts >> can continue in parallel. > Hmm. Maybe? While I'm sure things could be faster, we've had results > that are fast enough to be usable even with constant reparsing. Here > are some microbenchmarks I did some time ago where I tried to find the > overhead of calling JSON.parse and doing some really simple stuff in > V8 that I thought would maximize the amount of constant-time overhead: > > https://gist.github.com/1150804 > > On my workstation, one core was able to do 130,000 JSON.parses + other > stuff necessary to create an index per second. One could maybe try to > improve the speed and memory footprint on large documents by having > validators that don't actually build the V8 representation and > possibly defining a space of operators that are known to build, by > induction, valid JSON without rechecks. > > But in the end, I think there's already a class of problem where the > performance plv8 provides is already quite sufficient, and provides a > much more complete and familiar approach to the problem of how people > choose to navigate, project, and manipulate JSON documents. > > I also haven't tried this for larger documents, as I was trying to get > a sense of how much time was spent in a few primitive operations, and > not testing performance with regard to document length. > Yes, I didn't mean to say it's not fast. For many cases I agree it is probably fast enough. But in the end PLV8 is likely to remain an addon - nice as it can be I doubt the core team will want to add another PL to the core code, especially one written in C++. If we want a JSON type built in, which many people seem to want, we'll need to do it without the support of PLV8, I think. cheers andrew
Excerpts from Daniel Farina's message of lun dic 12 22:37:13 -0300 2011: > * It'd be nice to pass intermediate in-memory representations rather > than calling JSON.parse all the time, similar to OPAQUE except sound > (so bogus pointers cannot be passed). Basically, an "ephemeral type". > It could save a lot of when composing operators. I've needed this > for other projects, but for much the same reason. I remember there was the idea of doing something like this for regexes -- have a specialized data type that saves the trouble of parsing it. I imagine this is pretty much the same. Nobody got around to doing anything about it though. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Andrew Dunstan <andrew@dunslane.net> writes: > On 12/12/2011 03:54 PM, Robert Haas wrote: >> There are way too many places that assume that the typmod can >> just be discarded. I don't think that's going to fly, because >> =(text,text) probably has different semantics from =(json,json). > And certain places where they are not allowed at all, I think (unless I > am misremembering the early debates about enum types and output functions). Yeah. The current system design assumes that typmod specifies a constraint of some sort. It is not possible to use it to change the semantics of the datatype. The most obvious way in which this is true is that selection of which operators and functions to apply to values does not consider typmod of the values. This is not something we should lightly revisit. We don't even have a handle on how to make domains behave differently from their underlying datatypes, and those *do* have their own type OIDs. Injecting typmod into the algorithm seems like a disaster from here. regards, tom lane
On mån, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote: > Because we haven't heard from him in a while we've been using PL/V8 to > validate a JSON datatype simulated by a DOMAIN with a simple > acceptance function. (See below.) This is not ideally performant but > thanks to V8's JIT the JSON parser is actually reasonably good. > > I think releasing something simple and non-performant with reasonable > semantics would be the best next step. If it were up to me, I'd > probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the > crash bugs and deal breakers can be sifted out. You don't need a new PL to do that. The existing PLs can also parse JSON. So that's not nearly enough of a reason to consider adding this new PL.
On Mon, Dec 12, 2011 at 7:37 PM, Daniel Farina <daniel@heroku.com> wrote: > On Mon, Dec 12, 2011 at 4:51 PM, Peter van Hardenberg <pvh@pvh.ca> wrote: >> PL/V8 is fast, it's sandboxed, and while it doesn't provide GIN or >> GIST operators out of the box, maybe those could be motivated by its >> inclusion. > > I also feel that a big problem with JSON as a data type is that there > is not a powerful, common navigation method. JSON path is basically > pretty obscure by comparison to XPath. As a result, the common > approach to navigation in a JSON structure is basically "write > procedures". And that is only perfectly supported by a full-blown > interpreter. This. For me, the postgres xml extensions are 'a whole bunch of extra stuff that comes with the xpath function'. How you get data into and out of json is much more interesting than how the type is set up internally or how it's parsed. There must be some way to avoid iterative set up and tear down of json objects (maybe as a cast?) -- postgres arrays of composites can set up data in a way that feels very much like json in its construction. One big reason why people might go to server side json is to try and avoid tedious marshaling of data between client and server. The xpath function has had its warts, but it offers very tight coupling between your database and your documents. In the case of json, I think you can go even further. merlin
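A small example of the arrays-of-composites construction Merlin is referring to (the tables and type here are invented for illustration):

-- Nest detail rows inside each parent row, roughly the shape a JSON
-- document would take on the client.
create type line_item as (product text, qty int);

select o.id,
       array_agg(row(l.product, l.qty)::line_item) as items
  from orders o
  join order_lines l on l.order_id = o.id
 group by o.id;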
----- Original message ----- > On Dec 12, 2011, at 4:51 PM, Peter van Hardenberg wrote: > > > Because we haven't heard from him in a while we've been using PL/V8 to > > validate a JSON datatype simulated by a DOMAIN with a simple > > acceptance function. (See below.) This is not ideally performant but > > thanks to V8's JIT the JSON parser is actually reasonably good. > > Note that Claes Jakobsson has been working on a JSON data type using the > Jansson JSON library. > > http://pgxn.org/dist/pg-json/ We recently needed to store/validate JSON data and be able to do some trivial extraction of values from it and went with pg-json. The great benefit of having JSON as an extension type is being able to use an external library (Jansson is very small, MIT licensed, looks really well written and has been actively maintained for years) and not being tied to a yearly release cycle. Postgres jumps through a lot of hoops to be extensible and I think a JSON type is a kind of thing that fits the bill of an extension perfectly. Cheers, Jan
On Tue, Dec 13, 2011 at 5:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 12/12/2011 03:54 PM, Robert Haas wrote: >>> There are way too many places that assume that the typmod can >>> just be discarded. I don't think that's going to fly, because >>> =(text,text) probably has different semantics from =(json,json). > >> And certain places where they are not allowed at all, I think (unless I >> am misremembering the early debates about enum types and output functions). > > Yeah. The current system design assumes that typmod specifies a > constraint of some sort. It is not possible to use it to change the > semantics of the datatype. The most obvious way in which this is true > is that selection of which operators and functions to apply to values > does not consider typmod of the values. This is not something we should > lightly revisit. We don't even have a handle on how to make domains > behave differently from their underlying datatypes, and those *do* have > their own type OIDs. Injecting typmod into the algorithm seems like a > disaster from here. I'm glad I didn't think of doing that before then. Can we agree some wording to put into the docs? Sounds like some clear warnings are needed. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 9:25 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On mån, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote: > You don't need a new PL to do that. The existing PLs can also parse > JSON. So that's not nearly enough of a reason to consider adding this > new PL. PL/V8 is interesting because it is very fast, sandboxed, and well embedded with little overhead. My experience with PL/Python and PL/Perl has not been thus, and although they are handy if you want to break out and run system work, they're not the kind of thing I'd consider for defining performant operators with. I feel PL/V8 has promise in that area. -- Peter van Hardenberg San Francisco, California "Everything was beautiful, and nothing hurt." -- Kurt Vonnegut
On Mon, Dec 5, 2011 at 3:12 PM, Bruce Momjian <bruce@momjian.us> wrote: > Where are we with adding JSON for Postgres 9.2? We got bogged down in > the data representation last time we discussed this. We should probably have a wiki page titled "JSON datatype status" to help break the cycle we're in:

* Someone asks about the status of JSON
* Various ideas are suggested
* Patches are posted (maybe)
* More discussion about fundamental issues ensues
* Nothing is accomplished (as far as adding JSON to Postgres core)

There are several JSON implementations for Postgres floating around, including:

* http://pgxn.org/dist/pg-json/ : Mentioned in previous posts; a JSON library based on Jansson supporting path subscript and equality testing
* http://git.postgresql.org/gitweb/?p=json-datatype.git;a=summary : The JSON datatype I implemented for Google Summer of Code 2010. It has the most features of any implementation I'm aware of, but:
  * Is in the form of a contrib module
  * Preserves input text verbatim, a guarantee that will be broken by more efficient implementations
* http://git.postgresql.org/gitweb/?p=json-datatype.git;a=shortlog;h=refs/heads/json2 : My rewrite of the JSON module that condenses input (but still stores it as text) and addresses the issue of JSON when either the server or client encoding is not UTF-8. Needs more features and documentation, but like my other implementation, may not be quite what we want.

Issues we've encountered include:

* Should JSON be stored as binary or as text?
* How do we deal with Unicode escapes and characters if the server or client encoding is not UTF-8? Some (common!) character encodings have code points that don't map to Unicode. Also, the charset conversion modules do not provide fast entry points for converting individual characters; each conversion involves a funcapi call.

---

In an application I'm working on, I store JSON-encoded objects in a PostgreSQL database (using TEXT). I do so because it allows me to store non-relational data that is easy for my JavaScript code to work with. However, I fail to see much benefit of a JSON type. When I need to work with the data in PHP, C, or Haskell, I use JSON parsing libraries available in each programming language. Although being able to transform or convert JSON data within SQL might be convenient, I can't think of any compelling reason to do it in my case. Can someone clarify why a JSON type would be useful, beyond storage and validation? What is a real-world, *concrete* example of a problem where JSON manipulation in the database would be much better than:

* Using the application's programming language to manipulate the data (which it does a lot already)?
* Using CouchDB or similar instead of PostgreSQL?

- Joey
On 12/13/2011 03:15 AM, Joey Adams wrote: > We should probably have a wiki page titled "JSON datatype status" to > help break the cycle we're in > I was about to point you to http://wiki.postgresql.org/wiki/JSON_API_Brainstorm , only to realize you created that thing in the first place. There's http://wiki.postgresql.org/wiki/JSON_datatype_GSoC_2010 too. I don't think it's completely stuck in a cycle yet--every pass around seems to accumulate some better informed ideas than the last still. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Tue, Dec 13, 2011 at 12:25 AM, Peter Eisentraut <peter_e@gmx.net> wrote: > On mån, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote: >> Because we haven't heard from him in a while we've been using PL/V8 to >> validate a JSON datatype simulated by a DOMAIN with a simple >> acceptance function. (See below.) This is not ideally performant but >> thanks to V8's JIT the JSON parser is actually reasonably good. >> >> I think releasing something simple and non-performant with reasonable >> semantics would be the best next step. If it were up to me, I'd >> probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the >> crash bugs and deal breakers can be sifted out. > > You don't need a new PL to do that. The existing PLs can also parse > JSON. So that's not nearly enough of a reason to consider adding this > new PL. Just because all our languages are Turing-complete doesn't mean they are all equally well-suited to every task. Of course, that doesn't mean we'd add a whole new language just to get a JSON parser, but I don't think that's really what Peter was saying. Rather, I think the point is that embedded Javascript is *extremely* popular, lots and lots of people are supporting it, and we ought to seriously consider doing the same. It's hard to think of another PL that we could add that would give us anywhere near the bang for the buck that Javascript would. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/12/13 Robert Haas <robertmhaas@gmail.com>: > On Tue, Dec 13, 2011 at 12:25 AM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On mån, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote: >>> Because we haven't heard from him in a while we've been using PL/V8 to >>> validate a JSON datatype simulated by a DOMAIN with a simple >>> acceptance function. (See below.) This is not ideally performant but >>> thanks to V8's JIT the JSON parser is actually reasonably good. >>> >>> I think releasing something simple and non-performant with reasonable >>> semantics would be the best next step. If it were up to me, I'd >>> probably even try to just land PL/V8 as PL/JavaScript for 9.2 if the >>> crash bugs and deal breakers can be sifted out. >> >> You don't need a new PL to do that. The existing PLs can also parse >> JSON. So that's not nearly enough of a reason to consider adding this >> new PL. > > Just because all our languages are Turing-complete doesn't mean they > are all equally well-suited to every task. Of course, that doesn't > mean we'd add a whole new language just to get a JSON parser, but I > don't think that's really what Peter was saying. Rather, I think the > point is that embedded Javascript is *extremely* popular, lots and > lots of people are supporting it, and we ought to seriously consider > doing the same. It's hard to think of another PL that we could add > that would give us anywhere near the bang for the buck that Javascript > would. It is true - but there are a few questions:

* Will JSON be supported from SQL?
* Which Javascript engine will be supported?
* Will the integrated JSON be supported from PL/Perl?

I'd like to see Javascript in pg, but I don't like Javascript just for JSON. JSON should be independent of javascript. Regards Pavel > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On 12/13/2011 08:44 AM, Robert Haas wrote: > Rather, I think the point is that embedded Javascript is *extremely* > popular, lots and > lots of people are supporting it, and we ought to seriously consider > doing the same. It's hard to think of another PL that we could add > that would give us anywhere near the bang for the buck that Javascript > would. > Quite. I hate Javascript with a passion, wish it would just go away and stop meddling with my life. And even with that context, I think in-core PL/V8 would be a huge advocacy win. PostgreSQL has this great developer-oriented PL interface, it just doesn't work out of the box with any of the "pop" languages right now. Personal story on this. When my book came out, I was trying to take the #1 spot on Packt's bestseller list, even if it was just for a day. Never made it higher than #2. The #1 spot the whole time was "jQuery 1.4 Reference Guide", discussing the most popular JavaScript library out there. And you know what? Over a year later, it's *still there*. At no point did it ever drop out of that top spot. The number of people who would consider server-side programming suddenly feasible if PL/V8 were easy to do is orders of magnitude larger than the current PostgreSQL community. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Tue, Dec 13, 2011 at 8:11 AM, Greg Smith <greg@2ndquadrant.com> wrote: > On 12/13/2011 08:44 AM, Robert Haas wrote: >> >> Rather, I think the point is that embedded Javascript is *extremely* >> popular, lots and >> lots of people are supporting it, and we ought to seriously consider >> doing the same. It's hard to think of another PL that we could add >> that would give us anywhere near the bang for the buck that Javascript >> would. >> > > > Quite. I hate Javascript with a passion, wish it would just go away and > stop meddling with my life. And even with that context, I think in-core > PL/V8 would be a huge advocacy win. PostgreSQL has this great > developer-oriented PL interface, it just doesn't work out of the box with > any of the "pop" languages right now. > > Personal story on this. When my book came out, I was trying to take the #1 > spot on Packt's bestseller list, even if it was just for a day. Never made > it higher than #2. The #1 spot the whole time was "jQuery 1.4 Reference > Guide", discussing the most popular JavaScript library out there. And you > know what? Over a year later, it's *still there*. At no point did it over > drop out of that top spot. The number of people who would consider > server-side programming suddenly feasible if PL/V8 were easy to do is orders > of magnitude larger than the current PostgreSQL community. Yeah -- javascript is making strides server-side with technologies like node.js. Like you I have really mixed feelings about javascript -- there's a lot of nastiness but the asynchronous style of coding javascript developers tend to like is a great fit for postgres both inside the backend and in database clients. This is on top of the already nifty type system synergy I mentioned upthread. Postgres would in fact make a wonderful 'nosql' backend with some fancy json support -- document style transmission to/from the backend without sacrificing relational integrity in storage. Properly done this would be a fabulous public relations coup (PostgreSQL = better nosql). merlin
On 12/13/2011 09:11 AM, Greg Smith wrote: > On 12/13/2011 08:44 AM, Robert Haas wrote: >> Rather, I think the point is that embedded Javascript is *extremely* >> popular, lots and >> lots of people are supporting it, and we ought to seriously consider >> doing the same. It's hard to think of another PL that we could add >> that would give us anywhere near the bang for the buck that Javascript >> would. > > Quite. I hate Javascript with a passion, wish it would just go away > and stop meddling with my life. And even with that context, I think > in-core PL/V8 would be a huge advocacy win. PostgreSQL has this great > developer-oriented PL interface, it just doesn't work out of the box > with any of the "pop" languages right now. > > Personal story on this. When my book came out, I was trying to take > the #1 spot on Packt's bestseller list, even if it was just for a > day. Never made it higher than #2. The #1 spot the whole time was > "jQuery 1.4 Reference Guide", discussing the most popular JavaScript > library out there. And you know what? Over a year later, it's *still > there*. At no point did it over drop out of that top spot. The > number of people who would consider server-side programming suddenly > feasible if PL/V8 were easy to do is orders of magnitude larger than > the current PostgreSQL community. I think your passion is probably somewhat misdirected. I've long thought JS would be a good fit for Postgres. It's naturally sandboxed and its type system fits ours quite well. And, as you say, it's massively popular and getting a major second wind thanks to things like JQuery, Ext-JS and node.js. This last one has certainly convinced lots of people that JS is not just for browsers any more. Having said that, don't underestimate the complexity of trying to build in PLV8. In its current incarnation the interface is written in C++ (and of course so is the underlying V8 engine). I have been doing some development work for it, even though my C++ is rather rusty (that's an understatement, folks), which is why I haven't got a lot more done - I'm just having to go slowly and with reference books by my side. So either we'd need to rewrite the glue code entirely in C (and it's littered with C++isms) and handle the difficulties of embedding a C++ library, or we'd have a major new build infrastructure dependency which could well give us a major case of developmental indigestion. cheers andrew
Merlin Moncure wrote: > Postgres would in fact make a wonderful 'nosql' backend with some > fancy json support -- document style transmission to/from the > backend without sacrificing relational integrity in storage. > Properly done this would be a fabulous public relations coup > (PostgreSQL = better nosql). PostSQL? ;-) -Kevin
On Tue, Dec 13, 2011 at 8:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Rather, I think the > point is that embedded Javascript is *extremely* popular, lots and > lots of people are supporting it, and we ought to seriously consider > doing the same. It's hard to think of another PL that we could add > that would give us anywhere near the bang for the buck that Javascript > would. +1 to that. I'm not a huge fan of JS; wish that one of the Scheme variations had "made it" instead. But it's clear that a LOT of fairly successful work has gone into making JS implementations performant, and it's clearly heavily used. JS+hstore would probably draw in a bunch of users, and tempt them to the "SQL dark side" :-). Wanting a JSON processor isn't quite a good enough reason to add C++ support in order to draw in a JS interpreter. But I don't imagine things are restricted to just 1 JS implementation, and JSON isn't the only reason to do so. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Dec 13, 2011, at 9:15 AM, Kevin Grittner wrote: > Merlin Moncure wrote: > >> Postgres would in fact make a wonderful 'nosql' backend with some >> fancy json support -- document style transmission to/from the >> backend without sacrificing relational integrity in storage. >> Properly done this would be a fabulous public relations coup >> (PostgreSQL = better nosql). > > PostSQL? ;-) I think you meant to say "Postgre"... ;P -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Dec 12, 2011, at 7:42 PM, Alvaro Herrera wrote: > I remember there was the idea of doing something like this for regexes > -- have a specialized data type that saves the trouble of parsing it. > I imagine this is pretty much the same. > > Nobody got around to doing anything about it though. (regex data type)++ David
On tis, 2011-12-13 at 00:06 -0800, Peter van Hardenberg wrote: > On Mon, Dec 12, 2011 at 9:25 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > > On mån, 2011-12-12 at 16:51 -0800, Peter van Hardenberg wrote: > > You don't need a new PL to do that. The existing PLs can also parse > > JSON. So that's not nearly enough of a reason to consider adding this > > new PL. > > PL/V8 is interesting because it is very fast, sandboxed, and well > embedded with little overhead. > > My experience with PL/Python and PL/Perl has not been thus, and > although they are handy if you want to break out and run system work, > they're not the kind of thing I'd consider for defining performant > operators with. Some performance numbers comparing valid_json() functions implemented in different ways would clarify this. I wouldn't be surprised if PL/V8 won, but we need to work with some facts.
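A crude way to get such numbers, assuming each candidate implementation is exposed as a valid_json() function (the PL/V8 version appears earlier in the thread):

\timing

select count(*)
  from (select valid_json('{"k": ' || g || ', "s": "some text"}')
          from generate_series(1, 100000) g) s;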
On tis, 2011-12-13 at 09:11 -0500, Greg Smith wrote: > Personal story on this. When my book came out, I was trying to take > the #1 spot on Packt's bestseller list, even if it was just for a day. > Never made it higher than #2. The #1 spot the whole time was "jQuery > 1.4 Reference Guide", discussing the most popular JavaScript library > out there. And you know what? Over a year later, it's *still > there*. I would guess that that's largely because there are a lot more people developing web sites than people tuning databases, and also because the on-board documentation of javascript and jquery is poor, at least for their audience.
On Tue, Dec 13, 2011 at 2:33 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On tis, 2011-12-13 at 09:11 -0500, Greg Smith wrote: >> Personal story on this. When my book came out, I was trying to take >> the #1 spot on Packt's bestseller list, even if it was just for a day. >> Never made it higher than #2. The #1 spot the whole time was "jQuery >> 1.4 Reference Guide", discussing the most popular JavaScript library >> out there. And you know what? Over a year later, it's *still >> there*. > > I would guess that that's largely because there are a lot more people > developing web sites than people tuning databases, and also because the > on-board documentation of javascript and jquery is poor, at least for > their audience. jquery being used in as much as 40%+ of all websites by some estimates is surely a contributing factor. merlin
On tis, 2011-12-13 at 08:44 -0500, Robert Haas wrote: > Just because all our languages are Turing-complete doesn't mean they > are all equally well-suited to every task. Of course, that doesn't > mean we'd add a whole new language just to get a JSON parser, but I > don't think that's really what Peter was saying. That was in fact what I was saying. > Rather, I think the > point is that embedded Javascript is *extremely* popular, lots and > lots of people are supporting it, and we ought to seriously consider > doing the same. It's hard to think of another PL that we could add > that would give us anywhere near the bang for the buck that Javascript > would. If JavaScript (trademark of Oracle, btw.; be careful about calling anything PL/JavaScript) had a near-canonical implementation with a stable shared library and a C API, then this might be a no-brainer. But instead we have lots of implementations, and the one being favored here is written in C++ and changes the soname every 3 months. I don't think that's the sort of thing we want to carry around. The way forward here is to maintain this as an extension, provide debs and rpms, and show that that is maintainable. I can see numerous advantages in maintaining a PL outside the core; especially if you are still starting up and want to iterate quickly.
Robert Haas wrote: > On Mon, Dec 12, 2011 at 4:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, Dec 12, 2011 at 8:54 PM, Robert Haas <robertmhaas@gmail.com> wrote: > >> There are way too many places that assume that the typmod can > >> just be discarded. > > > > If true, that probably ought to be documented cos it sounds fairly important. > > > > Where and when is it true? > > I'm not going to go compile an exhaustive list, since that would take > a week and I don't have any particular desire to invest that much time > in it, but just to take a couple of simple examples: > > rhaas=# create or replace function wuzzle(numeric(5,2)) returns int as > $$select 1$$ language sql; > CREATE FUNCTION > rhaas=# \df wuzzle > List of functions > Schema | Name | Result data type | Argument data types | Type > --------+--------+------------------+---------------------+-------- > public | wuzzle | numeric | | normal > public | wuzzle | integer | numeric | normal > (2 rows) > > rhaas=# select pg_typeof(1.23::numeric(5,2)); > pg_typeof > ----------- > numeric > (1 row) > > There are a very large number of others. Possibly grepping for places > where we do getBaseType() rather than getBaseTypeAndTypmod() would be > a way to find some of them. I think the most common one I see is with concatenation:

test=> select 'abc'::varchar(3) || 'def'::varchar(3);
 ?column?
----------
 abcdef
(1 row)

It is not really clear how the typmod length should be passed in this example, but passing it unchanged seems wrong:

test=> select ('abc'::varchar(3) || 'def'::varchar(3))::varchar(3);
 varchar
---------
 abc
(1 row)

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Dec 13, 2011 at 2:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On tis, 2011-12-13 at 08:44 -0500, Robert Haas wrote: >> Just because all our languages are Turing-complete doesn't mean they >> are all equally well-suited to every task. Of course, that doesn't >> mean we'd add a whole new language just to get a JSON parser, but I >> don't think that's really what Peter was saying. > > That was in fact what I was saying. > >> Rather, I think the >> point is that embedded Javascript is *extremely* popular, lots and >> lots of people are supporting it, and we ought to seriously consider >> doing the same. It's hard to think of another PL that we could add >> that would give us anywhere near the bang for the buck that Javascript >> would. > > If JavaScript (trademark of Oracle, btw.; be careful about calling > anything PL/JavaScript) had a near-canonical implementation with a > stable shared library and a C API, then this might be a no-brainer. But > instead we have lots of implementations, and the one being favored here > is written in C++ and changes the soname every 3 months. I don't think > that's the sort of thing we want to carry around. Mozilla SpiderMonkey seems like a good fit: it compiles to a dependency free .so, has excellent platform support, has a stable C API, and while it's C++ internally makes no use of exceptions (in fact, it turns them off in the c++ compiler). ISTM to be a suitable foundation for an external module, 'in core' parser, pl, or anything really. merlin
On Tue, Dec 13, 2011 at 1:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Mozilla SpiderMonkey seems like a good fit: it compiles to a > dependency free .so, has excellent platform support, has a stable C > API, and while it's C++ internally makes no use of exceptions (in > fact, it turns them off in the c++ compiler). ISTM to be a suitable > foundation for an external module, 'in core' parser, pl, or anything > really. When I started to think about PL/js, I compared three engines: SpiderMonkey, SquirrelFish, and V8. SpiderMonkey at that time (around 2009) was neither fast nor small in memory, although the points you raise, as well as its advanced features like JS1.7 (pure yield!), were attractive. Also, SpiderMonkey was a little harder to build on arbitrary platforms (including Windows) than V8. SquirrelFish was the fastest of the three, but it's tied closely to WebKit and also hard to build by itself. Dunno how they've changed since then. Thanks, -- Hitoshi Harada
On Tue, Dec 13, 2011 at 8:15 AM, Joey Adams <joeyadams3.14159@gmail.com> wrote:
> Issues we've encountered include:
>
> * Should JSON be stored as binary or as text?

Text

> * How do we deal with Unicode escapes and characters if the server or > client encoding is not UTF-8? Some (common!) character encodings have > code points that don't map to Unicode. Also, the charset conversion > modules do not provide fast entry points for converting individual > characters; each conversion involves a funcapi call.

Make JSON datatypes only selectable if client encoding is UTF-8. Let's JFDI

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 12/15/2011 01:34 PM, Simon Riggs wrote: > On Tue, Dec 13, 2011 at 8:15 AM, Joey Adams<joeyadams3.14159@gmail.com> wrote: > >> Issues we've encountered include: >> >> * Should JSON be stored as binary or as text? > Text Works for me. >> * How do we deal with Unicode escapes and characters if the server or >> client encoding is not UTF-8? Some (common!) character encodings have >> code points that don't map to Unicode. Also, the charset conversion >> modules do not provide fast entry points for converting individual >> characters; each conversion involves a funcapi call. > Make JSON datatypes only selectable if client encoding is UTF-8. Yuck. Do we have this sort of restriction for any other data type? ISTM that the encoding problem is at least as likely to be the reverse of what's above - i.e. that there's a code point in the stored JSON that's not represented in the client encoding. cheers andrew
On Thu, Dec 15, 2011 at 4:47 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >> Make JSON datatypes only selectable if client encoding is UTF-8. > > Yuck. Do we have this sort of restriction for any other data type? No, and I don't think it's necessary to do it here, either. Nor would it be a good idea, because then the return value of EXPLAIN (FORMAT JSON) couldn't unconditionally be json. But I think the important point is that this is an obscure corner case. Let me say that one more time: obscure corner case! The only reason JSON needs to care about this at all is that it allows \u1234 to mean Unicode code point 0x1234. But for that detail, JSON would be encoding-agnostic. So I think it's sufficient for us to simply decide that that particular feature may not work (or even, will not work) for non-ASCII characters if you use a non-UTF8 encoding. There's still plenty of useful things that can be done with JSON even if that particular feature is not available; and that way we don't have to completely disable the data type just because someone wants to use EUC-JP or something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 16, 2011 at 1:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Dec 15, 2011 at 4:47 PM, Andrew Dunstan <andrew@dunslane.net> wrote: >>> Make JSON datatypes only selectable if client encoding is UTF-8. >> >> Yuck. Do we have this sort of restriction for any other data type? > > No, and I don't think it's necessary to do it here, either. Nor would > it be a good idea, because then the return value of EXPLAIN (FORMAT > JSON) couldn't unconditionally be json. But I think the important > point is that this is an obscure corner case. Let me say that one > more time: obscure corner case! > > The only reason JSON needs to care about this at all is that it allows > \u1234 to mean Unicode code point 0x1234. But for that detail, JSON > would be encoding-agnostic. So I think it's sufficient for us to > simply decide that that particular feature may not work (or even, will > not work) for non-ASCII characters if you use a non-UTF8 encoding. > There's still plenty of useful things that can be done with JSON even > if that particular feature is not available; and that way we don't > have to completely disable the data type just because someone wants to > use EUC-JP or something. Completely agree. I was going to write almost exactly this in reply. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 16, 2011 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote: > But I think the important point is that this is an obscure corner case. Let me say that one more time: obscure corner case! +1 > The only reason JSON needs to care about this at all is that it allows > \u1234 to mean Unicode code point 0x1234. But for that detail, JSON > would be encoding-agnostic. So I think it's sufficient for us to > simply decide that that particular feature may not work (or even, will > not work) for non-ASCII characters if you use a non-UTF8 encoding. > There's still plenty of useful things that can be done with JSON even > if that particular feature is not available; and that way we don't > have to completely disable the data type just because someone wants to > use EUC-JP or something. So, if the server encoding is not UTF-8, should we ban Unicode escapes: "\u00FCber" or non-ASCII characters? "über" Also: * What if the server encoding is SQL_ASCII? * What if the server encoding is UTF-8, but the client encoding is something else (e.g. SQL_ASCII)? - Joey
On Tue, Dec 13, 2011 at 1:13 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Dec 13, 2011 at 2:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On tis, 2011-12-13 at 08:44 -0500, Robert Haas wrote: >>> Just because all our languages are Turing-complete doesn't mean they >>> are all equally well-suited to every task. Of course, that doesn't >>> mean we'd add a whole new language just to get a JSON parser, but I >>> don't think that's really what Peter was saying. >> >> That was in fact what I was saying. >> >>> Rather, I think the >>> point is that embedded Javascript is *extremely* popular, lots and >>> lots of people are supporting it, and we ought to seriously consider >>> doing the same. It's hard to think of another PL that we could add >>> that would give us anywhere near the bang for the buck that Javascript >>> would. >> >> If JavaScript (trademark of Oracle, btw.; be careful about calling >> anything PL/JavaScript) had a near-canonical implementation with a >> stable shared library and a C API, then this might be a no-brainer. But >> instead we have lots of implementations, and the one being favored here >> is written in C++ and changes the soname every 3 months. I don't think >> that's the sort of thing we want to carry around. > > Mozilla SpiderMonkey seems like a good fit: it compiles to a > dependency free .so, has excellent platform support, has a stable C > API, and while it's C++ internally makes no use of exceptions (in > fact, it turns them off in the c++ compiler). ISTM to be a suitable > foundation for an external module, 'in core' parser, pl, or anything > really. To the best of my knowledge: libv8 is also exception-free, and compiled with exceptions off. plv8 does make use of exceptions, though, something that gave me pause when reading it. At first I thought it was to integrate with libv8, but that did not seem to be the case, so it probably could learn to use return codes instead. libv8 also has a light dependency list:

ldd /usr/lib/libv8.so   (/lib/ entries and linux omitted)
    libicuuc.so.44 => /usr/lib/libicuuc.so.44 (0x00007fc838459000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc838151000)
    libicudata.so.44 => /usr/lib/libicudata.so.44 (0x00007fc836aed000)

So ICU and C++. In addition, more projects have been successful in embedding libv8; right now it has the entrenchment advantage over libmozjs in applications that are not closely tied to XUL/Mozilla, although that could change in a few years. Institutionally Mozilla has not historically been quick to prioritize anything not essential to shipping Firefox, and I would imagine V8 is in a similar situation, even though they occasionally make concessions for non-browsing use cases (ex: multi-gigabyte heap sizes). I would regard either choice as at least equally risky in this way, given what I know (refinements welcome). Both libv8 and libmozjs are maintained in Debian, and are parts of at least one stable release. In spite of the hazard posed by the aggressive releases and non-general-purpose focus of the maintainers of both of these runtimes at this time, I am still in favor of having a binding to at least one of them into mainline, with the ability to get new or alternative versions via extensions. If extensions were already pervasive and everyone was installing them everywhere I'd think otherwise (just leave it as an extension), but the cost of not being able to index and manipulate JSON efficiently and with a trusted language is just too huge to let slide.
Right now the perception of Postgres...actually, databases in general, including virtually all of the newcomers -- is that they are monolithic systems, and for most people either "9.3" will "have" javascript and indexing of JSON documents, or it won't. In most cases I would say "meh, let them eat cake until extensions become so apparently dominant that we can wave someone aside to extension-land", but in this case I think that would be a strategic mistake. -- fdr
On 12/16/2011 05:39 PM, Daniel Farina wrote: > To the best of my knowledge: > > libv8 is also exception-free, and compiled with exceptions off. plv8 > does make use of exceptions, though, something that gave me pause when > reading it. At first I thought it was to integrate with libv8, but > that did not seem to be the case, so it probably could learn to use > return codes instead. Yeah. We should look at that. cheers andrew
On Fri, Dec 16, 2011 at 12:13 PM, Joey Adams <joeyadams3.14159@gmail.com> wrote: > So, if the server encoding is not UTF-8, should we ban Unicode escapes: > > "\u00FCber" > > or non-ASCII characters? > > "über" The former. Refusing the escapes makes sense, because it's totally unclear how we ought to interpret them. Refusing the characters would be just breaking something for no particular reason. Right now, for example, EXPLAIN (FORMAT JSON) could easily end up returning non-ASCII characters in whatever the database encoding happens to be. That command would be unusable if we arbitrarily chucked an error every time a non-ASCII character showed up and the database encoding wasn't UTF-8. > Also: > > * What if the server encoding is SQL_ASCII? > > * What if the server encoding is UTF-8, but the client encoding is > something else (e.g. SQL_ASCII)? It's not clear to me why these cases would require any special handling. In the spirit of Simon's suggestion that we JFDI, I cooked up a patch today that JFDI. See attached. This lacks any form of canonicalization and therefore doesn't support comparison operators. It also lacks documentation, regression testing, and probably an almost uncountable number of other bells and whistles that people would like to have. This is more or less a deliberate decision on my part: I feel that the biggest problem with this project is that we've spent far too much time dithering over what the exactly perfect set of functionality set would be, and not enough time getting good basic functionality committed. So this is as basic as it gets. It does exactly one thing: validation. If people are happy with it, we can extend from here incrementally. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
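For illustration, a rough sketch of how the validation-only type described above might be used; the table, values, and error wording here are invented, and the patch provides validation only (no canonicalization, no comparison operators):

CREATE TABLE events (id serial PRIMARY KEY, payload json);

-- well-formed JSON is accepted and stored as the text that was given
INSERT INTO events (payload) VALUES ('{"user": "alice", "action": "login"}');

-- malformed JSON is rejected by the type's input function
INSERT INTO events (payload) VALUES ('{"user": alice}');   -- error: invalid input for type json

-- output returns the stored text unchanged
SELECT payload FROM events;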
On Sat, Dec 17, 2011 at 2:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > In the spirit of Simon's suggestion that we JFDI, I cooked up a patch > today that JFDI. See attached. Which looks very good. Comments * Comment for IDENTIFICATION of json.c says contrib/json/json.c * json.c contains a duplicate of a line from header file "extern Datum json_in(PG_FUNCTION_ARGS);" And additionally, a quote from our fine manual... "Caution: Some XML-related functions may not work at all on non-ASCII data when the server encoding is not UTF-8. This is known to be an issue for xpath() in particular." .... so I think this approach works for JSON too. Adding tests and docs is a must, nothing else is right now. Once we have this, others can add the bells and whistles, possibly in 9.2 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, Peter Eisentraut <peter_e@gmx.net> writes: > The way forward here is to maintain this as an extension, provide debs > and rpms, and show that that is maintainable. I can see numerous > advantages in maintaining a PL outside the core; especially if you are > still starting up and want to iterate quickly. I'd like to add some confusion on the implementation choice, because it looks damn too easy now… Guile 2.0 offers an implementation of the ECMAscript language and plscheme already exists as a PostgreSQL PL extension for integrating with Guile. http://plscheme.projects.postgresql.org/ http://wingolog.org/archives/2009/02/22/ecmascript-for-guile http://packages.debian.org/sid/guile-2.0 http://www.gnu.org/software/guile/ Guile is an extension language platform. Guile is an efficient virtual machine that executes a portable instruction set generated by its optimizing compiler, and integrates very easily with C and C++ application code. In addition to Scheme, Guile includes compiler front-ends for ECMAScript and Emacs Lisp (support for Lua is underway), which means your application can be extended in the language (or languages) most appropriate for your user base. And Guile's tools for parsing and compiling are exposed as part of its standard module set, so support for additional languages can be added without writing a single line of C. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sat, Dec 17, 2011 at 5:02 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > I'd like to add some confusion on the implementation choice, because it > looks damn too easy now… Guile 2.0 offers an implementation of the > ECMAscript language and plscheme already exists as a PostgreSQL PL > extension for integrating with Guile. It seems like the licensing there could potentially be problematic. It's GPL with a linking exception. Not sure we want to go there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec 17, 2011, at 3:53 AM, Simon Riggs wrote: > Which looks very good. Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension. He’s a bit less supportive. He gave me permission to quote him here: > Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather than in, as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? > > There's little reason to reimplement JSON parsing, comparison and other routines when there's a multitude of already good libraries. Personally, I think that there really should be a core key/value-type data type, and json is probably the best possible choice, absent a SQL-standard-mandated type (which would probably suck anyway). But I think it worthwhile to hear alternate points of view, and this isn't far from what Jan said last week. Best, David
On Sat, Dec 17, 2011 at 7:50 PM, David E. Wheeler <david@kineticode.com> wrote: > Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension.He’s a bit less supportive. He gave me permission to quote him here: > >> Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather than in,as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? >> >> There's little reason to reimplement JSON parsing, comparision and other routines when there's a multitude of alreadygood libraries. That's fair enough, but we've had *many* requests for this functionality in core, I don't see what we lose by having at least some basic functionality built in. There is always room for people to provide extensions and add-ons that build on whatever core support we provide. There must be an order of magnitude more demand for this data type than there is for any other potential new in-core data type.Off the top of my head, I can't think of anything elsethat's even in the same league; can you? As for whether to use somebody else's implementation or roll our own, I'm not convinced there's any value in reusing somebody else's implementation. Consider a library like json-c, just the first thing I happened to download. The license is compatible, so that's good. But the coding style is completely different from ours, the memory management is not compatible with ours (it uses calloc and it's not pluggable), the error messages don't follow our style guidelines and won't work with our translation infrastructure, it has its own printfbuf which is redundant with our StringInfoData, it has its own hash table implementation which is also redundant with code we already have, and of course it won't contain anything like CHECK_FOR_INTERRUPTS() any place that might be needed. Now, sure, all of those problems are fixable. But by the time you get done it's not any less work than writing your own, and probably not as well adapted to our particular needs. The jansson library has a pluggable memory allocator, but most of the other complaints above still apply; and I see, for example, that it contains its own UTF-8 validator and locale conversion routines, which is undoubtedly not what we want. jansson also interprets numeric values as either a native integer or floating point values, which limits the amount of precision available and means that values may be output in a manner quite different from how they were put in. Although this is probably legal, since RFC4627 states that JSON parsers may set limits on the range of numbers, I think it's an unnecessary and undesirable limitation. I have always enjoyed the ability to -- for example -- SELECT 400! in PostgreSQL and get an exact answer, and I think it would be nice if I could store the result in a JSON object. I am also quite certain that someone will propose (or, perhaps, actually write) a function to convert a record to a JSON object, and I think it would be mighty nice if numeric, int4, and int8 could be transformed into JSON numbers rather than JSON strings, which isn't going to work - at least for numeric - if there are significant range or precision limitations. For XML, it makes a lot of sense for us to integrate with an external library. 
Consider libxml2, the library we actually do integrate with. The root directory has over 275,000 lines of code in .c and .h files. Obviously, it's worth suffering through some pain (and we definitely have suffered through some pain) to avoid having to rewrite some substantial portion of that code. Providing some basic JSON support figures to require about two orders of magnitude less code, which IMHO makes the calculation completely different. The reason there are so many JSON libraries out there is because it only takes a day to write one. If you look at a couple of JSON parsers written by other people and don't find exactly what you're looking for, you just go write one of your own. You then have the option to hang out a shingle and critique the next three people who come along and do the same thing, but is that really justified? Odds are good that their reasons for rolling their own were just as good as yours. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
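To illustrate the precision point: the factorial query below is the one Robert mentions; row_to_json is hypothetical here and not part of the posted patch:

-- numeric gives an exact, arbitrarily long integer result
SELECT 400!;
-- a JSON library that parses numbers into a native double would round this;
-- a hypothetical row_to_json() would ideally emit the exact digits as a
-- JSON number rather than a lossy float or a quoted string.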
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Dec 17, 2011 at 7:50 PM, David E. Wheeler <david@kineticode.com> wrote: >> Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension. He’s a bit less supportive. He gave me permission to quote him here: >>> Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather than in, as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? >>> There's little reason to reimplement JSON parsing, comparison and other routines when there's a multitude of already good libraries. > That's fair enough, but we've had *many* requests for this > functionality in core, I don't see what we lose by having at least > some basic functionality built in. There is always room for people to > provide extensions and add-ons that build on whatever core support we > provide. Well, I think that that's exactly the question here: if we do something in core, will it foreclose options for people who want to do add-ons? The main thing that's troubling me there is the issue of plain text representation versus something precompiled (and if the latter, what exactly). I don't see how you "build on" a core datatype that makes a decision different from what you wanted for that. I'm also +1 to Claes' opinion that this can be perfectly well managed as an add-on. We've sweated blood over many years to make PG extensions work nicely, and they now work better than they ever have. It's not clear to me why the push to make something core when it can obviously be an extension. regards, tom lane
On Dec 17, 2011, at 7:40 PM, Tom Lane wrote: > Well, I think that that's exactly the question here: if we do something > in core, will it foreclose options for people who want to do add-ons? Why would it? They would just have to use a different name. Best, David
On Sat, Dec 17, 2011 at 10:42 PM, David E. Wheeler <david@kineticode.com> wrote: > On Dec 17, 2011, at 7:40 PM, Tom Lane wrote: >> Well, I think that that's exactly the question here: if we do something >> in core, will it foreclose options for people who want to do add-ons? > > Why would it? They would just have to use a different name. Yeah, exactly. Or for that matter, the same name in a different schema. And as for the question of text vs. binary, that's going to be two separate data types whether it gets done in core or elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 18/12/11 04:21, Robert Haas wrote: > On Sat, Dec 17, 2011 at 7:50 PM, David E. Wheeler <david@kineticode.com> wrote: >> Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension.He’s a bit less supportive. He gave me permission to quote him here: >> >>> Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather thanin, as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? >>> >>> There's little reason to reimplement JSON parsing, comparision and other routines when there's a multitude of alreadygood libraries. > > That's fair enough, but we've had *many* requests for this > functionality in core, I don't see what we lose by having at least > some basic functionality built in. I think having a JSON data type in core would drastically limit the exposure third-party JSON extensions would get and that's bad. There are tons of interesting features a JSON type could have and tying its development to a one year release cycle might be a disservice both for people who are willing to provide these features earlier, the users which are faced with a choice between a fast-moving third-party addon and a blessed core type and would cause overall confusion. How about we try the tsearch way and let JSON extensions live outside core for some time and perhaps if one emerges dominant and would benefit from inclusion then consider it? If we keep treating extensions as second-class citizens, they'll never get the mindshare and importance we seem to want for them (or otherwise why go through all the trouble to provide an infrastructure for them). Cheers, Jan
2011/12/18 Jan Urbański <wulczer@wulczer.org>: > On 18/12/11 04:21, Robert Haas wrote: >> On Sat, Dec 17, 2011 at 7:50 PM, David E. Wheeler <david@kineticode.com> wrote: >>> Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension.He’s a bit less supportive. He gave me permission to quote him here: >>> >>>> Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather thanin, as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? >>>> >>>> There's little reason to reimplement JSON parsing, comparision and other routines when there's a multitude of alreadygood libraries. >> >> That's fair enough, but we've had *many* requests for this >> functionality in core, I don't see what we lose by having at least >> some basic functionality built in. > > I think having a JSON data type in core would drastically limit the > exposure third-party JSON extensions would get and that's bad. There are > tons of interesting features a JSON type could have and tying its > development to a one year release cycle might be a disservice both for > people who are willing to provide these features earlier, the users > which are faced with a choice between a fast-moving third-party addon > and a blessed core type and would cause overall confusion. > > How about we try the tsearch way and let JSON extensions live outside > core for some time and perhaps if one emerges dominant and would benefit > from inclusion then consider it? it should be contrib modules Pavel > > If we keep treating extensions as second-class citizens, they'll never > get the mindshare and importance we seem to want for them (or otherwise > why go through all the trouble to provide an infrastructure for them). > > Cheers, > Jan > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Dec 17, 2011 at 5:02 PM, Dimitri Fontaine > <dimitri@2ndquadrant.fr> wrote: >> I'd like to add some confusion on the implementation choice, because it >> looks damn too easy now… Guile 2.0 offers an implementation of the >> ECMAscript language and plscheme already exists as a PostgreSQL PL >> extension for integrating with Guile. > > It seems like the licensing there could potentially be problematic. > It's GPL with a linking exception. Not sure we want to go there. It's LGPL so it's compatible (only the readline part is subject to GPL, we're familiar enough with that though). http://www.gnu.org/software/guile/docs/docs-2.0/guile-ref/Guile-License.html The Guile library (libguile) and supporting files are published under the terms of the GNU Lesser General Public License version 3 or later. See the files COPYING.LESSER and COPYING. C code linking to the Guile library is subject to terms of that library. Basically such code may be published on any terms, provided users can re-link against a new or modified version of Guile. Scheme level code written to be run by Guile (but not derived from Guile itself) is not restricted in any way, and may be published on any terms. We encourage authors to publish on Free terms. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sun, Dec 18, 2011 at 10:49, Jan Urbański <wulczer@wulczer.org> wrote: > On 18/12/11 04:21, Robert Haas wrote: >> On Sat, Dec 17, 2011 at 7:50 PM, David E. Wheeler <david@kineticode.com> wrote: >>> Love having the start here. I forwarded this message to Claes Jakobsson, creator of the jansson-using pg-json extension.He’s a bit less supportive. He gave me permission to quote him here: >>> >>>> Frankly I see the inclusion of a JSON datatype in core as unnecessary. Stuff should be moved out of core rather thanin, as we do in Perl. Also, does this patch mean that the 'json' type is forever claimed and can't be replaced by extensions? >>>> >>>> There's little reason to reimplement JSON parsing, comparision and other routines when there's a multitude of alreadygood libraries. >> >> That's fair enough, but we've had *many* requests for this >> functionality in core, I don't see what we lose by having at least >> some basic functionality built in. > > I think having a JSON data type in core would drastically limit the > exposure third-party JSON extensions would get and that's bad. There are The same way that having replication in core is bad for the rest of the replication engines? While it has certainly decreased the usage of for example Slony, I don't think anybody can say it's a bad thing that we have this in core... And of course, *not* having it in core, we didn't have people claiming for many years that "postgres has no replication" or anything like that... The fact is that a *lot* of our users, particularly in large companies, will never install an extension that's not part of core. Just look at other discussions about it even being a problem with it being in *contrib*, which is still maintained and distributed by the same developers. We can hopefully get around this for the extensions in contrib (and reasonably well has already), but few large companies are going to be happy to go to pgxn and download an extension that has a single maintainer (not "the team", and in most cases not even "a team"), usually no defined lifecycle, no support, etc. (I'm pretty sure you won't get support included for random pgxn modules when you buy a contract from EDB, or CMD, or us, or PGX, or anybody really - wheras if it the datatype is in core, you *will* get this) So I'm not sure it would really lessen the exposure much at all - those that are willing to install such extensions already, are surely capable of finding it themselves (using pgxn for example - or even google) > tons of interesting features a JSON type could have and tying its > development to a one year release cycle might be a disservice both for > people who are willing to provide these features earlier, the users > which are faced with a choice between a fast-moving third-party addon > and a blessed core type and would cause overall confusion. And the other option would be to *only* have a fast-moving third-party addon, which simply disqualifies it completely in many environments. Keeping it as a third party addon is better for the developer. Keeping it in core is better for the user (if the user is a large company - not a hacker). If we can find a way to have a stable part in core and then have addons that can provide these "tons of interesting features" (which I agree there are) until such time that they can be considered stable enough for core, I think that's the best compromise. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Sat, Dec 17, 2011 at 4:02 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > I'd like to add some confusion on the implementation choice, because it > looks damn too easy now… Guile 2.0 offers an implementation of the > ECMAscript language and plscheme already exists as a PostgreSQL PL > extension for integrating with Guile. TBH, I think that's PFC (pretty cool). On Sun, Dec 18, 2011 at 6:41 AM, Magnus Hagander <magnus@hagander.net> wrote: > We can hopefully get around this for the extensions in contrib (and > reasonably well has already), but few large companies are going to be > happy to go to pgxn and download an extension that has a single > maintainer (not "the team", and in most cases not even "a team"), > usually no defined lifecycle, no support, etc. (I'm pretty sure you > won't get support included for random pgxn modules when you buy a > contract from EDB, or CMD, or us, or PGX, or anybody really - wheras > if it the datatype is in core, you *will* get this) 100% agree on all points. with the new extension system, contrib modules that are packaged with the core system can be considered to be in core because they are: *) documented in standard docs *) supported and bugfixed with postgresql releases *) ready to be used without compiler support or even shell access through most binary distributions One small note about the json type being an extension -- this probably means the json type oid won't be fixed -- not a huge deal but it could affect some corner cases with binary format consumers. merlin
On 12/18/2011 12:17 PM, Merlin Moncure wrote: > One small note about the json type being an extension -- this probably > means the json type oid won't be fixed -- not a huge deal but it could > affect some corner cases with binary format consumers. > > Why would that matter more for JSON than for any other non-core type? cheers andrew
On Sun, Dec 18, 2011 at 11:21 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 12/18/2011 12:17 PM, Merlin Moncure wrote: >> >> One small note about the json type being an extension -- this probably >> means the json type oid won't be fixed -- not a huge deal but it could >> affect some corner cases with binary format consumers. > > Why would that matter more for JSON than for any other non-core type? well, it's a minor headache for all the oid-isn't-in-pgtypes.h types, and only then for high traffic types (which presumably json will be). a while back we coded up a reworked dblink that was variadic and could optionally transfer data between databases with the binary wire format. any container of a user defined (by oid) type had to be sent strictly as text which is a big performance hit for certain types. recent postgres has an undocumented facility to force type oids to a particular value, but the type definition being inside the create extension script makes this problematic. this is a pretty far out objection though, and I could certainly work around the problem if necessary, but there is some dependency on pg_types.h in the wild. merlin
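One workaround sketch for binary-format consumers when a type's OID is not fixed in pg_type.h: resolve it at connection time with an ordinary catalog query, for example:

-- look up the json type's OID at runtime instead of hard-coding it
SELECT oid FROM pg_type WHERE typname = 'json';
-- or, respecting the search path:
SELECT 'json'::regtype::oid;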
Merlin Moncure <mmoncure@gmail.com> writes: >> Why would that matter more for JSON than for any other non-core type? > > well, it's a minor headache for all the oid-isn't-in-pgtypes.h types, > and only then for high traffic types (which presumably json will be). Extensions are going to be more and more used and “pervasive” in next years, and binary wire transfers is a good goal. What about creating something like the PostgreSQL types IANA? New type authors would register their OID and as a benefit would get listed on some public reference sheet, and we could add some mechanism so that default CREATE TYPE calls will not use reserved OID numbers. Then it would be all cooperative only, so not a security thing, just a way to ease binary and extension co-existence. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sun, Dec 18, 2011 at 12:26 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >>> Why would that matter more for JSON than for any other non-core type? >> >> well, it's a minor headache for all the oid-isn't-in-pgtypes.h types, >> and only then for high traffic types (which presumably json will be). > > Extensions are going to be more and more used and “pervasive” in next > years, and binary wire transfers is a good goal. What about creating > something like the PostgreSQL types IANA? > > New type authors would register their OID and as a benefit would get > listed on some public reference sheet, and we could add some mechanism > so that default CREATE TYPE calls will not use reserved OID numbers. > > Then it would be all cooperative only, so not a security thing, just a > way to ease binary and extension co-existence. I think that's a fabulous idea, although we're drifting off the stated topic here. Getting back on point, I'm curious about your statement: "without writing a single line of C". I took a look at the pl/scheme docs and was pretty impressed -- what exactly would be involved to get a guile-based ECMAscript working over the pl/scheme implementation? How would that interact exactly with the stated topic -- JSON support? Do you even need a json type if you have strong library based parsing and composition features? merlin
Merlin Moncure <mmoncure@gmail.com> writes: > Getting back on point, I'm curious about your statement: "without > writing a single line of C". I took a look at the pl/scheme docs and > was pretty impressed -- what exactly would be involved to get a > guile-based ECMAscript working over the pl/scheme implementation? How > would that interact exactly with the stated topic -- JSON support? Do > you even need a json type if you have strong library based parsing and > composition features? My understanding is that JSON is a subset of ECMAscript, so if you get the latter you already have the former. Now, someone would have to check if plscheme still build with guile-2.0, and given that, how exactly you get pl/ecmascript (or pl/js) out of that. I won't be in a position to spend time on that this year… Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Dec 18, 2011, at 4:41 AM, Magnus Hagander wrote: > We can hopefully get around this for the extensions in contrib (and > reasonably well has already), but few large companies are going to be > happy to go to pgxn and download an extension that has a single > maintainer (not "the team", and in most cases not even "a team"), > usually no defined lifecycle, no support, etc. (I'm pretty sure you > won't get support included for random pgxn modules when you buy a > contract from EDB, or CMD, or us, or PGX, or anybody really - wheras > if it the datatype is in core, you *will* get this) I support having a JSON type in core, but question the assertions here. *Some* organizations won’t use PGXN, usually because they require things through a different ecosystem (RPMs, .debs, StackBuilder, etc.). But many others will. There are a *lot* of companies out there that use CPAN, easy_install, and Gem. The same sorts of places will use PGXN. Oh, and at PGX, we’ll happily provide support for random modules, so long as you pay for our time. We’re not picky (and happy to send improvements back upstream), though we might recommend you switch to something better. But such evaluations are based on quality, not simply on what ecosystem it came from. > If we can find a way to have a stable part in core and then have > addons that can provide these "tons of interesting features" (which I > agree there are) until such time that they can be considered stable > enough for core, I think that's the best compromise. +1, though I think the core type will at least need some basic operators and indexing support. Best, David
On Dec 19, 2011, at 3:39 PM, David E. Wheeler wrote: > Well, no, JSON is formally “a lightweight data-interchange format.” It’s derived from JavaScript syntax, but it is not a programming language, so I wouldn’t say it was accurate to describe it as a subset of JS or ECMAScript. Bah, it says “It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999.” But my point still holds: it is not a programming language, and one does not need a PL to have a JSON data type. Best, David
On Dec 19, 2011, at 2:49 AM, Dimitri Fontaine wrote: > My understanding is that JSON is a subset of ECMAscript Well, no, JSON is formally “a lightweight data-interchange format.” It’s derived from JavaScript syntax, but it is not a programming language, so I wouldn’t say it was accurate to describe it as a subset of JS or ECMAScript. http://json.org/ IOW, one does not need a new PL to get this type. Best, David
On Mon, Dec 19, 2011 at 6:26 PM, David E. Wheeler <david@kineticode.com> wrote: > +1, though I think the core type will at least need some basic operators and indexing support. And I'm willing to do that, but I thought it best to submit a bare bones patch first, in the hopes of minimizing the number of objectionable things therein. For example, if you want to be able to index a JSON column, you have to decide on some collation order that is consistent with JSON's notion of equality, and it's not obvious what is most logical. Heck, equality itself isn't 100% obvious. If there's adequate support for including JSON in core, and nobody objects to my implementation, then I'll throw some ideas for those things up against the wall and see what sticks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 20, 2011 at 00:26, David E. Wheeler <david@kineticode.com> wrote: > On Dec 18, 2011, at 4:41 AM, Magnus Hagander wrote: > >> We can hopefully get around this for the extensions in contrib (and >> reasonably well has already), but few large companies are going to be >> happy to go to pgxn and download an extension that has a single >> maintainer (not "the team", and in most cases not even "a team"), >> usually no defined lifecycle, no support, etc. (I'm pretty sure you >> won't get support included for random pgxn modules when you buy a >> contract from EDB, or CMD, or us, or PGX, or anybody really - wheras >> if it the datatype is in core, you *will* get this) > > I support having a JSON type in core, but question the assertions here. *Some* organizations won’t use PGXN, usually becausethey require things through a different ecosystem (RPMs, .debs, StackBuilder, etc.). But many others will. There area *lot* of companies out there that use CPAN, easy_install, and Gem. The same sorts of places will use PGXN. Yes, that's why I said "few" not "none". Though in my experience, most companies are a lot more restrictive about addons to their database than addons to their development environments. And note that it's not PGXN that's the problem I'm pointing at, neither is it CPAN or easy_install or gem. The problem is the vulnerability of the addon, and the maintenance. Meaning if it has a single maintainer, that's a whole different thing from being maintained by the PGDG. > Oh, and at PGX, we’ll happily provide support for random modules, so long as you pay for our time. We’re not picky (andhappy to send improvements back upstream), though we might recommend you switch to something better. But such evaluationsare based on quality, not simply on what ecosystem it came from. I think we're talking about different things here. While we can certainly provide support on specific modules, after that is entered into the agreement with the customer, we won't support a customer who just calls up and says "hey, I'm using module xyz which you've never heard of, and it crashes my database, please come fix it now". Are you saying you do that - providing SLAs, 24/7 and similar things, on modules you didn't even know the customer was using? And FWIW, I'm talking about the quality, and not the ecosystem as well. I'm just saying it takes a lot more work to verify the quality and maintenance of an external module - if it's part of postgresql, you have *already* got a quality stamp and a maintenance promise from that. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Tue, Dec 20, 2011 at 06:00, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Dec 19, 2011 at 6:26 PM, David E. Wheeler <david@kineticode.com> wrote: >> +1, though I think the core type will at least need some basic operators and indexing support. > > And I'm willing to do that, but I thought it best to submit a bare > bones patch first, in the hopes of minimizing the number of > objectionable things therein. For example, if you want to be able to > index a JSON column, you have to decide on some collation order that > is consistent with JSON's notion of equality, and it's not obvious > what is most logical. Heck, equality itself isn't 100% obvious. If > there's adequate support for including JSON in core, and nobody > objects to my implementation, then I'll throw some ideas for those > things up against the wall and see what sticks. +1 for getting the basics in first, and then adding more to it later. There's still a fair amount of time to do that for 9.2, but not if we get stuck bikeshedding again... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
"David E. Wheeler" <david@justatheory.com> writes: > holds: it is not a programming language, and one does not need a PL to have > a JSON data type. Exactly. That does not contradict the fact that if you have pl/ecmascript you already have JSON. And that we might as well have had the ecmascript PL for some time now, we just need to check about that. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Tuesday, December 20, 2011 07:23:43 PM Dimitri Fontaine wrote: > "David E. Wheeler" <david@justatheory.com> writes: > > holds: it is not a programming language, and one does not need a PL to > > have a JSON data type. > Exactly. That does not contradict the fact that if you have > pl/ecmascript you already have JSON. And that we might as well have had > the ecmascript PL for some time now, we just need to check about that. Not really. You need to be able to "evaluate" json without it possibly executing code. Many js implementations are likely to have such a feature though. Andres
On Dec 20, 2011, at 12:39 AM, David E. Wheeler wrote: > On Dec 19, 2011, at 2:49 AM, Dimitri Fontaine wrote: > >> My understanding is that JSON is a subset of ECMAscript > > Well, no, JSON is formally “a lightweight data-interchange format.” It’s derived from JavaScript syntax, but it is not a programming language, so I wouldn’t say it was accurate to describe it as a subset of JS or ECMAScript. > > http://json.org/ Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untyped data structures? if b) then this might get a lot more interesting Cheers, Claes
On Mon, Dec 19, 2011 at 5:49 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote: > My understanding is that JSON is a subset of ECMAscript, so if you get > the latter you already have the former. Now, someone would have to > check if plscheme still build with guile-2.0, and given that, how > exactly you get pl/ecmascript (or pl/js) out of that. I don't think so. I checked it out (still on pgfoundry, still on CVS, and code hasn't been touched since 2008), and run into some issues. - It looks for libguile.h #include "libguile.h" which, on 2.0, has shifted around from /usr/include/libguile.h (1.8) to /usr/include/guile/2.0/libguile.h It's not doing enough indirections internally; there is a guile-config that is analogous to pg_config postgres@cbbrowne [03:48:43] [~/PostgreSQL/plscheme] -> % guile-config compile -pthread -I/usr/include/guile/2.0 postgres@cbbrowne [03:48:45] [~/PostgreSQL/plscheme] -> % guile-config link -lguile-2.0 -lgc It looks like there's something PG-related as a next issue: -> % ./install.sh pg_config : /var/lib/postgresql/dbs/postgresql-HEAD/bin/pg_config module-dir : /var/lib/postgresql/dbs/postgresql-HEAD/lib max-cache-size: 64 dbname : postgres safe-r5rs : NO dbacreate : NO PSQL : /var/lib/postgresql/dbs/postgresql-HEAD/bin/psql postgres CPPFLAGS : -g -Wall -fpic -c -I/var/lib/postgresql/dbs/postgresql-HEAD/include/server -I/usr/include/guile/2.0 LDFLAGS : -shared -lguile Compiling... failed! plscheme.c: In function '_PG_init': plscheme.c:647:2: warning: implicit declaration of function 'DefineCustomStringVariable' [-Wimplicit-function-declaration] plscheme.c:650:30: error: 'PGC_BACKEND' undeclared (first use in this function) plscheme.c:650:30: note: each undeclared identifier is reported only once for each function it appears in plscheme.c:652:2: warning: implicit declaration of function 'DefineCustomIntVariable' [-Wimplicit-function-declaration] plscheme.c: In function 'plscheme_func_handler': plscheme.c:742:2: warning: implicit declaration of function 'GetTopTransactionId' [-Wimplicit-function-declaration] plscheme.c: In function 'parse_trig_args': plscheme.c:1623:44: error: dereferencing pointer to incomplete type plscheme.c:1628:38: error: dereferencing pointer to incomplete type ... (See error.log for details.) I'm not sure to what degree this is bitrot relating to: a) Postgres changes b) Guile changes but there's doubtless a bit of both. I'd think it interesting to get this back to working order, whether it's useful for JavaScript or not. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
On Dec 19, 2011, at 9:00 PM, Robert Haas wrote: >> +1, though I think the core type will at least need some basic operators and indexing support. > > And I'm willing to do that, but I thought it best to submit a bare > bones patch first, in the hopes of minimizing the number of > objectionable things therein. For example, if you want to be able to > index a JSON column, you have to decide on some collation order that > is consistent with JSON's notion of equality, and it's not obvious > what is most logical. Heck, equality itself isn't 100% obvious. If > there's adequate support for including JSON in core, and nobody > objects to my implementation, then I'll throw some ideas for those > things up against the wall and see what sticks. +1 Sounds good to me. David
On Dec 20, 2011, at 2:13 AM, Magnus Hagander wrote: > Yes, that's why I said "few" not "none". > > Though in my experience, most companies are a lot more restrictive > about addons to their database than addons to their development > environments. Yeah, we’re getting off-topic here, so I’ll just say something we can agree on: We’ll see. I do still want to see some processes for getting PGXN distributions into RPM/.deb/StackBuilder, though. Best, David
On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote: > Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untyped data structures? Yes. > if b) then this might get a lot more interesting JSON is the most popular/likely way to represent that, I think. David
Let me mention another lightweight data-interchange format. At http://www.janestreet.com we have developed a small c module to deal with S-expressions (sexp) as a way to store arbitrary data. As we write most of our code in OCaml sexps are a natural way for us to store data. http://hg.ocaml.info/release/sexplib/ provides automatic ways to convert "any" ocaml value into a sexp). The extension is still pretty new but we use it successfully on a daily basis. After we have upgraded to 9.x we will pack it as an extension and releast it opensource. API wise the module at the moment offers the following: sexp_validate(text) returns boolean Validate that the passed in text is a valid s expression. create domain sexp as text check (sexp_validate(value)); BTW: It is a PITA that arrays of domains are not valid types. And several functions to manipulate take apart sexp's or modify sexp's using a path into the sexp (similar to what xpath does for xml). Such as: sexp_get(sexp, text) returns sexp Get the sub sexp of sexp identified by the path. Returns NULL if path is not a valid path in sexp. Example: path=.a space.b.[1].x ((ignore this) ("a space" ((b (0 ((also ignored) (x "The Value")) )) ))) -> "The Value" And sexp_get_atom(sexp, text) returns text Get the sub atom of sexp identified by the path. Returns NULL if path is not a valid path in sexp or does not identify an atom. Example: path=.a space.b.[1].x ((ignore this) ("a space" ((b (0 ((also ignored) (x "The Value")) )) ))) ^^^^^^^^^ -> The Value Cheers, Bene On 20/12/11 19:39, Claes Jakobsson wrote: > On Dec 20, 2011, at 12:39 AM, David E. Wheeler wrote: > > On Dec 19, 2011, at 2:49 AM, Dimitri Fontaine wrote: > > > >> My understanding is that JSON is a subset of ECMAscript > > > > Well, no, JSON is formally “a lightweight data-interchange format.” It’s derived from JavaScript syntax, but it is nota programming language, so I wouldn’t say it was accurate to describe it as a subset of JS or ECMAScript. > > > > http://json.org/ > > Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untyped datastructures? > > if b) then this might get a lot more interesting > > Cheers, > Claes > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler <david@kineticode.com> wrote: > On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote: >> Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untyped datastructures? > > Yes. > >> if b) then this might get a lot more interesting > > JSON is the most popular/likely way to represent that, I think. On that note, here's an updated version of the patch I posted upthread, with some regression tests and minimal documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler <david@kineticode.com> wrote: >> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote: >>> Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untyped data structures? >> >> Yes. >> >>> if b) then this might get a lot more interesting >> >> JSON is the most popular/likely way to represent that, I think. > > On that note, here's an updated version of the patch I posted > upthread, with some regression tests and minimal documentation. I like this patch and this feature. I see only one issue - there is no functionality that helps generate JSON in pg. What do you think about functions: array_to_json(anyarray), row_to_json(any) and format_json(text, text, ...) Regards Pavel > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
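One possible shape for the proposed conversion functions, sketched for discussion only; none of these exist in the posted patch, and the output shown is a guess:

SELECT array_to_json(ARRAY[1, 2, 3]);       -- might yield: [1,2,3]
SELECT row_to_json(ROW(1, 'a', true));      -- might yield: {"f1":1,"f2":"a","f3":true}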
On 01/11/2012 01:18 AM, Pavel Stehule wrote: > > I like this patch and this feature. I'm about to read the patch in detail - I certainly like the feature. > > I see only one issue - there is not functionality that helps generate > JSON in pg. > > What do you think about functions: array_to_json(anyarray), > row_to_json(any) and format_json(text, text, ...) > Actually, more than these, I (and at least one very interested client) want query_to_json, which would do something like: # select q2json('select $$a$$ || x as b, y as c from generate_series(1,3) x, generate_series(4,5) y'); q2json --------------------------------------------------------------------------------------------------------- [{"b":"a1","c":4},{"b":"a1","c":5},{"b":"a2","c":4},{"b":"a2","c":5},{"b":"a3","c":4},{"b":"a3","c":5}] No doubt several variants are possible such as returning a setof json, one per row, instead of a single json, and allowing query parameters as separate arguments (maybe just using variadic functions), but probably for a first go just something as simple as this would meet the case. Given the short time span available before patches must be in, I am prepared to work on this ASAP. cheers andrew
2012/1/11 Andrew Dunstan <andrew@dunslane.net>: > > > On 01/11/2012 01:18 AM, Pavel Stehule wrote: >> >> >> I like this patch and this feature. > > > I'm about to read the patch in detail - I certainly like the feature. > > >> >> I see only one issue - there is not functionality that helps generate >> JSON in pg. >> >> What do you think about functions: array_to_json(anyarray), >> row_to_json(any) and format_json(text, text, ...) >> > > Actually, more than these, I (and at least one very interested client) want > query_to_json, which would do something like: > > # select q2json('select $$a$$ || x as b, y as c from generate_series(1,3) > x, generate_series(4,5) y'); > q2json > > --------------------------------------------------------------------------------------------------------- > > [{"b":"a1","c":4},{"b":"a1","c":5},{"b":"a2","c":4},{"b":"a2","c":5},{"b":"a3","c":4},{"b":"a3","c":5}] > we have a query_to_xml - so there should similar query_to_json. But this is not enough for usage from SP. What about two rich functions * query_to_json - by your proposal * array_to_json - with possibility to serialize array of records This can be a basic set Regards Pavel > > No doubt several variants are possible such as returning a setof json, one > per row, instead of a single json, and allowing query parameters as separate > arguments (maybe just using variadic functions), but probably for a first go > just something as simple as this would meet the case. > > Given the short time span available before patches must be in, I am prepared > to work on this ASAP. > > cheers > > andrew >
On Wed, Jan 11, 2012 at 1:18 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > 2012/1/11 Robert Haas <robertmhaas@gmail.com>: >> On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler <david@kineticode.com> wrote: >>> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote: >>>> Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untypeddata structures? >>> >>> Yes. >>> >>>> if b) then this might get a lot more interesting >>> >>> JSON is the most popular/likely way to represent that, I think. >> >> On that note, here's an updated version of the patch I posted >> upthread, with some regression tests and minimal documentation. > > I like this patch and this feature. > > I see only one issue - there is not functionality that helps generate > JSON in pg. > > What do you think about functions: array_to_json(anyarray), > row_to_json(any) and format_json(text, text, ...) I think we might want all of that stuff, but I doubt there is time to do it for 9.2. Actually, I think the next logical step would be to define equality (is there an official definition of that for JSON?) and build a btree opclass. I believe the code I've already written could be extended to construct an abstract syntax tree for those operations that need it. But we need to make some decisions first. A btree opclass requires a total ordering, so we have to arbitrarily define whether 1 < true, 1 < [1], 1 < "1", etc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 11, 2012 at 1:18 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> 2012/1/11 Robert Haas <robertmhaas@gmail.com>: >>> On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler <david@kineticode.com> wrote: >>>> On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote: >>>>> Are people explicitly asking for a) *JSON* datatype or b) a type that lets you store arbitrary complex semi-untypeddata structures? >>>> >>>> Yes. >>>> >>>>> if b) then this might get a lot more interesting >>>> >>>> JSON is the most popular/likely way to represent that, I think. >>> >>> On that note, here's an updated version of the patch I posted >>> upthread, with some regression tests and minimal documentation. >> >> I like this patch and this feature. >> >> I see only one issue - there is not functionality that helps generate >> JSON in pg. >> >> What do you think about functions: array_to_json(anyarray), >> row_to_json(any) and format_json(text, text, ...) > > I think we might want all of that stuff, but I doubt there is time to > do it for 9.2. > > Actually, I think the next logical step would be to define equality > (is there an official definition of that for JSON?) and build a btree > opclass. I believe the code I've already written could be extended to > construct an abstract syntax tree for those operations that need it. > But we need to make some decisions first. A btree opclass requires a > total ordering, so we have to arbitrarily define whether 1 < true, 1 < > [1], 1 < "1", etc. > I don't understand why we have to do it? We don't support similar functionality for XML, so why for JSON? Pavel > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
On Wed, Jan 11, 2012 at 8:38 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > I don't understand why we have to do it? > > We don't support similar functionality for XML, so why for JSON? Hrm. Well, that's an interesting point. Maybe we don't. I assumed that people would eventually want to optimize queries of the form SELECT whatever FROM tab WHERE jsoncol = 'constant'. If that's a sufficiently marginal use case that we don't care, then fine. One difference between JSON and XML is that XML really has no well-defined comparison semantics. For example, consider: <foo><bar>1.0</bar></foo> <foo><bar>1.0</bar> </foo> If the XML is being used as a transport mechanism, then the extra space is semantically insignificant, but if this is markup, then it might matter a lot. Also, consider: <foo><bar>1.00</bar></foo> That one might be equal if we think 1.0 is intended to be a number, but if it's intended as a string then it's not. We could perhaps do comparisons in XML relative to some DTD or schema if those provide details about what the values mean, but in a vacuum it's not well-defined. On the other hand, in JSON, it's pretty clear that { 1, 2, 3 } is the same value as {1,2,3} but "1,2,3" is different from "1, 2, 3". There are some borderline cases that might need some sweat, like whether 1 = 1.0 = 1.00 = 1e0, but in general the level of ambiguity seems to me to be much less, making it more feasible here than it would be for XML. That having been said, uncertainties about whether we want this at all (and if so in what form) are exactly why I didn't include this kind of stuff in the patch to begin with, and I think that if we get this much committed for 9.2 we'll be doing pretty well. If we can agree on and do more, great; if not, we'll at least have this much, which IMHO would be an improvement over what we have now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
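To make the equality question concrete: with the validation-only patch, the examples below are simply stored as distinct text, and a future equality operator or btree opclass would have to decide which of them compare equal:

SELECT '{"a": 1}'::json;        -- whitespace kept exactly as given
SELECT '{"a":1}'::json;         -- same data, different text
SELECT '[1, 1.0, 1e0]'::json;   -- three spellings of (arguably) the same number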
2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 11, 2012 at 8:38 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> I don't understand why we have to do it? >> >> We don't support similar functionality for XML, so why for JSON? > > Hrm. Well, that's an interesting point. Maybe we don't. I assumed > that people would eventually want to optimize queries of the form > SELECT whatever FROM tab WHERE jsoncol = 'constant'. If that's a > sufficiently marginal use case that we don't care, then fine. > > One difference between JSON and XML is that XML really has no > well-defined comparison semantics. For example, consider: > > <foo><bar>1.0</bar></foo> > <foo><bar>1.0</bar> </foo> > > If the XML is being used as a transport mechanism, then the extra > space is semantically insignificant, but if this is markup, then it > might matter a lot. Also, consider: > > <foo><bar>1.00</bar></foo> > > That one might be equal if we think 1.0 is intended to be a number, > but if it's intended as a string then it's not. We could perhaps do > comparisons in XML relative to some DTD or schema if those provide > details about what the values mean, but in a vacuum it's not > well-defined. On the other hand, in JSON, it's pretty clear that { 1, > 2, 3 } is the same value as {1,2,3} but "1,2,3" is different from "1, > 2, 3". There are some borderline cases that might need some sweat, > like whether 1 = 1.0 = 1.00 = 1e0, but in general the level of > ambiguity seems to me to be much less, making it more feasible here > than it would be for XML. > > That having been said, uncertainties about whether we want this at all > (and if so in what form) are exactly why I didn't include this kind of > stuff in the patch to begin with, and I think that if we get this much > committed for 9.2 we'll be doing pretty well. If we can agree on and > do more, great; if not, we'll at least have this much, which IMHO > would be an improvement over what we have now. > I understand it now. My opinion is so some operators and index search can be in 9.2 - so use a JSON just as communication format now. * we need to build JSON * we need to check if some is valid JSON * we need to store JSON other steps should be (9.2) * basic operators eq, neeq * some tool like XQuery - simple query on JSON document available from SQL that can be used for functional indexes. > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
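A sketch of the functional-index idea in the last point, using a hypothetical extraction function json_get(json, text) returning text; no such function exists in the patch, and the names are illustrative only:

-- hypothetical json_get(); a real extraction function would have to be written first
CREATE TABLE docs (doc json);
CREATE INDEX docs_name_idx ON docs ((json_get(doc, 'name')));
SELECT * FROM docs WHERE json_get(doc, 'name') = 'alice';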
On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > I understand it now. My opinion is so some operators and index search > can be in 9.2 - so use a JSON just as communication format now. > > * we need to build JSON > * we need to check if some is valid JSON > * we need to store JSON > > other steps should be (9.2) > * basic operators eq, neeq > * some tool like XQuery - simple query on JSON document available from > SQL that can be used for functional indexes. That would be nice, but let's not let the perfect be the enemy of the good. We don't have a lot of time here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> I understand it now. My opinion is so some operators and index search >> can be in 9.2 - so use a JSON just as communication format now. >> >> * we need to build JSON >> * we need to check if some is valid JSON >> * we need to store JSON >> >> other steps should be (9.2) >> * basic operators eq, neeq >> * some tool like XQuery - simple query on JSON document available from >> SQL that can be used for functional indexes. > > That would be nice, but let's not let the perfect be the enemy of the > good. We don't have a lot of time here. > sorry - replace 9.2 by 9.3 - I am sorry I am able to write array_to_json fce and Andrew can write query_to_json Pavel > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
" I am able to write array_to_json fce and Andrew can write query_to_json" +1 Thanks guys... We are using a lot of JSON as communication protocol... having core support for JSON, And those functions, will be a real life saver... Many thanks, Misa Sent from my Windows Phone From: Pavel Stehule Sent: 11/01/2012 16:22 To: Robert Haas Cc: David E. Wheeler; Claes Jakobsson; Dimitri Fontaine; Merlin Moncure; Andrew Dunstan; Magnus Hagander; Jan Urba=C5=84ski; Simon Riggs; Joey Adams; Bruce Momjian; PostgreSQL-development Hackers; Jan Wieck Subject: Re: [HACKERS] JSON for PG 9.2 2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule <pavel.stehule@gmail.com> = wrote: >> I understand it now. My opinion is so some operators and index search >> can be in 9.2 - so use a JSON just as communication format now. >> >> * we need to build JSON >> * we need to check if some is valid JSON >> * we need to store JSON >> >> other steps should be (9.2) >> * basic operators eq, neeq >> * some tool like XQuery - simple query on JSON document available from >> SQL that can be used for functional indexes. > > That would be nice, but let's not let the perfect be the enemy of the > good. =C2=A0We don't have a lot of time here. > sorry - replace 9.2 by 9.3 - I am sorry I am able to write array_to_json fce and Andrew can write query_to_json Pavel > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company --=20 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 01/11/2012 10:21 AM, Pavel Stehule wrote: > 2012/1/11 Robert Haas<robertmhaas@gmail.com>: >> On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule<pavel.stehule@gmail.com> wrote: >>> I understand it now. My opinion is so some operators and index search >>> can be in 9.2 - so use a JSON just as communication format now. >>> >>> * we need to build JSON >>> * we need to check if some is valid JSON >>> * we need to store JSON >>> >>> other steps should be (9.2) >>> * basic operators eq, neeq >>> * some tool like XQuery - simple query on JSON document available from >>> SQL that can be used for functional indexes. >> That would be nice, but let's not let the perfect be the enemy of the >> good. We don't have a lot of time here. >> > sorry - replace 9.2 by 9.3 - I am sorry > > I am able to write array_to_json fce and Andrew can write query_to_json For those who want to play along, see <https://bitbucket.org/adunstan/pgdevel> which has Robert's patch and my additions to it. I'm actually halfway through writing an array_to_json function, since it is necessary anyway for query_to_json. I hope to have a fairly complete working function in about 24 hours. cheers andrew
" I am able to write array_to_json fce and Andrew can write query_to_json" +1 Thanks guys... We are using a lot of JSON as communication protocol... having core support for JSON, And those functions, will be a real life saver... Many thanks, Misa Sent from my Windows Phone From: Pavel Stehule Sent: 11/01/2012 16:22 To: Robert Haas Cc: David E. Wheeler; Claes Jakobsson; Dimitri Fontaine; Merlin Moncure; Andrew Dunstan; Magnus Hagander; Jan Urbański; Simon Riggs; Joey Adams; Bruce Momjian; PostgreSQL-development Hackers; Jan Wieck Subject: Re: [HACKERS] JSON for PG 9.2 2012/1/11 Robert Haas <robertmhaas@gmail.com>: > On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> I understand it now. My opinion is so some operators and index search >> can be in 9.2 - so use a JSON just as communication format now. >> >> * we need to build JSON >> * we need to check if some is valid JSON >> * we need to store JSON >> >> other steps should be (9.2) >> * basic operators eq, neeq >> * some tool like XQuery - simple query on JSON document available from >> SQL that can be used for functional indexes. > > That would be nice, but let's not let the perfect be the enemy of the > good. We don't have a lot of time here. > sorry - replace 9.2 by 9.3 - I am sorry I am able to write array_to_json fce and Andrew can write query_to_json Pavel > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
I wrote an array_to_json function during GSoC 2010: http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 It's not exposed as a procedure called array_to_json: it's part of the to_json function, which decides what to do based on the argument type. - Joey
On 01/12/2012 09:00 AM, Joey Adams wrote: > I wrote an array_to_json function during GSoC 2010: > > http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 > > It's not exposed as a procedure called array_to_json: it's part of the > to_json function, which decides what to do based on the argument type. > Excellent, this is just the point at which I stopped work last night, so with your permission I'll steal this and it will save me a good chunk of time. cheers andrew
2012/1/12 Andrew Dunstan <andrew@dunslane.net>: > > > On 01/12/2012 09:00 AM, Joey Adams wrote: >> >> I wrote an array_to_json function during GSoC 2010: >> >> >> http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 >> >> It's not exposed as a procedure called array_to_json: it's part of the >> to_json function, which decides what to do based on the argument type. >> > > > Excellent, this is just the point at which I stopped work last night, so > with your permission I'll steal this and it will save me a good chunk of > time. > this should be little bit more enhanced to support a row arrays - it can be merged with some routines from pst tool http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html Regards Pavel > cheers > > andrew >
On 01/12/2012 10:44 AM, Pavel Stehule wrote: > 2012/1/12 Andrew Dunstan<andrew@dunslane.net>: >> >> On 01/12/2012 09:00 AM, Joey Adams wrote: >>> I wrote an array_to_json function during GSoC 2010: >>> >>> >>> http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 >>> >>> It's not exposed as a procedure called array_to_json: it's part of the >>> to_json function, which decides what to do based on the argument type. >>> >> >> Excellent, this is just the point at which I stopped work last night, so >> with your permission I'll steal this and it will save me a good chunk of >> time. >> > this should be little bit more enhanced to support a row arrays - it > can be merged with some routines from pst tool > http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html > > I will be covering composites. cheers andrew
On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > On 01/12/2012 10:44 AM, Pavel Stehule wrote: >> this should be little bit more enhanced to support a row arrays - it >> can be merged with some routines from pst tool >> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html > > I will be covering composites. curious, will your function work on unregistered composites? What would this do? select array_to_json(array[row('a', 1), row('b', 2)]); merlin
2012/1/12 Merlin Moncure <mmoncure@gmail.com>: > On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >> On 01/12/2012 10:44 AM, Pavel Stehule wrote: >>> this should be little bit more enhanced to support a row arrays - it >>> can be merged with some routines from pst tool >>> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html >> >> I will be covering composites. > > curious, will your function work on unregistered composites? What > would this do? > > select array_to_json(array[row('a', 1), row('b', 2)]); it can do it - but it has to use some defaults for names. Pavel > > merlin
On 01/12/2012 11:21 AM, Merlin Moncure wrote: > On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan<andrew@dunslane.net> wrote: >> On 01/12/2012 10:44 AM, Pavel Stehule wrote: >>> this should be little bit more enhanced to support a row arrays - it >>> can be merged with some routines from pst tool >>> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html >> I will be covering composites. > curious, will your function work on unregistered composites? What > would this do? > > select array_to_json(array[row('a', 1), row('b', 2)]); > Expected behaviour is something like this: andrew=# select q2json(' select $$a$$ || x as b, y as c, array[row(x.*,array[1,2,3]), row(y.*,array[4,5,6])] as z from generate_series(1,2) x, generate_series(4,5) y'); [{"b":"a1","c":4,"z":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}, {"b":"a1","c":5,"z":[{"f1":1,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}, {"b":"a2","c":4,"z":[{"f1":2,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}, {"b":"a2","c":5,"z":[{"f1":2,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}] cheers andrew
On 01/12/2012 10:51 AM, Andrew Dunstan wrote: > > > On 01/12/2012 10:44 AM, Pavel Stehule wrote: >> 2012/1/12 Andrew Dunstan<andrew@dunslane.net>: >>> >>> On 01/12/2012 09:00 AM, Joey Adams wrote: >>>> I wrote an array_to_json function during GSoC 2010: >>>> >>>> >>>> http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 >>>> >>>> >>>> It's not exposed as a procedure called array_to_json: it's part of the >>>> to_json function, which decides what to do based on the argument type. >>>> >>> >>> Excellent, this is just the point at which I stopped work last >>> night, so >>> with your permission I'll steal this and it will save me a good >>> chunk of >>> time. >>> >> this should be little bit more enhanced to support a row arrays - it >> can be merged with some routines from pst tool >> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html >> >> > > I will be covering composites. > > OK, here's a patch that does both query_to_json and array_to_json, along with docs and regression tests. It include Robert's original patch, although I can produce a differential patch if required. It can also be pulled from <https://bitbucket.org/adunstan/pgdevel> A couple of things to note. First, the problem about us losing column names that I noted a couple of months ago and Tom did a bit of work on is exercised by this. We really need to fix it. Example: andrew=# select array_to_json(array_agg(row(z.*))) from (select $$a$$ || x as b, y as c, array[row(x.*,array[1,2,3]), row(y.*,array[4,5,6])] as z from generate_series(1,1) x, generate_series(4,4) y) z; array_to_json ------------------------------------------------------------------------- [{"f1":"a1","f2":4,"f3":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}] (1 row) Here we've lost b, c and z as column names. Second, what should be do when the database encoding isn't UTF8? I'm inclined to emit a \unnnn escape for any non-ASCII character (assuming it has a unicode code point - are there any code points in the non-unicode encodings that don't have unicode equivalents?). The alternative would be to fail on non-ASCII characters, which might be ugly. Of course, anyone wanting to deal with JSON should be using UTF8 anyway, but we still have to deal with these things. What about SQL_ASCII? If there's a non-ASCII sequence there we really have no way of telling what it should be. There at least I think we should probably error out. cheers andrew
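One possible workaround for the lost column names in the example above, sketched on the assumption that the conversion takes its field names from a row's declared composite type; the type my_rec is created here purely for illustration:

    CREATE TYPE my_rec AS (b text, c int);
    SELECT array_to_json(array_agg(row(z.b, z.c)::my_rec))
      from (select $$a$$ || x as b, y as c
              from generate_series(1,1) x, generate_series(4,4) y) z;
    -- expected under that assumption: [{"b":"a1","c":4}]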
2012/1/14 Andrew Dunstan <andrew@dunslane.net>: > > > On 01/12/2012 10:51 AM, Andrew Dunstan wrote: >> >> >> >> On 01/12/2012 10:44 AM, Pavel Stehule wrote: >>> >>> 2012/1/12 Andrew Dunstan<andrew@dunslane.net>: >>>> >>>> >>>> On 01/12/2012 09:00 AM, Joey Adams wrote: >>>>> >>>>> I wrote an array_to_json function during GSoC 2010: >>>>> >>>>> >>>>> >>>>> http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289 >>>>> >>>>> It's not exposed as a procedure called array_to_json: it's part of the >>>>> to_json function, which decides what to do based on the argument type. >>>>> >>>> >>>> Excellent, this is just the point at which I stopped work last night, so >>>> with your permission I'll steal this and it will save me a good chunk of >>>> time. >>>> >>> this should be little bit more enhanced to support a row arrays - it >>> can be merged with some routines from pst tool >>> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html >>> >>> >> >> I will be covering composites. >> >> > > OK, here's a patch that does both query_to_json and array_to_json, along > with docs and regression tests. It include Robert's original patch, although > I can produce a differential patch if required. It can also be pulled from > <https://bitbucket.org/adunstan/pgdevel> > > A couple of things to note. First, the problem about us losing column names > that I noted a couple of months ago and Tom did a bit of work on is > exercised by this. We really need to fix it. Example: > support SELECT ROW (x AS "real name", y AS "real name") is good idea and should be used more time than only here. Regards Pavel > andrew=# select array_to_json(array_agg(row(z.*))) > from (select $$a$$ || x as b, > > y as c, > array[row(x.*,array[1,2,3]), > row(y.*,array[4,5,6])] as z > from generate_series(1,1) x, > generate_series(4,4) y) z; > array_to_json > ------------------------------------------------------------------------- > [{"f1":"a1","f2":4,"f3":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}] > (1 row) > > > Here we've lost b, c and z as column names. > > Second, what should be do when the database encoding isn't UTF8? I'm > inclined to emit a \unnnn escape for any non-ASCII character (assuming it > has a unicode code point - are there any code points in the non-unicode > encodings that don't have unicode equivalents?). The alternative would be to > fail on non-ASCII characters, which might be ugly. Of course, anyone wanting > to deal with JSON should be using UTF8 anyway, but we still have to deal > with these things. What about SQL_ASCII? If there's a non-ASCII sequence > there we really have no way of telling what it should be. There at least I > think we should probably error out. > > cheers > > andrew > > >
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > Second, what should be do when the database encoding isn't UTF8? I'm > inclined to emit a \unnnn escape for any non-ASCII character (assuming it > has a unicode code point - are there any code points in the non-unicode > encodings that don't have unicode equivalents?). The alternative would be to > fail on non-ASCII characters, which might be ugly. Of course, anyone wanting > to deal with JSON should be using UTF8 anyway, but we still have to deal > with these things. What about SQL_ASCII? If there's a non-ASCII sequence > there we really have no way of telling what it should be. There at least I > think we should probably error out. I don't think there is a satisfying solution to this problem. Things working against us: * Some server encodings support characters that don't map to Unicode characters (e.g. unused slots in Windows-1252). Thus, converting to UTF-8 and back is lossy in general. * We want a normalized representation for comparison. This will involve a mixture of server and Unicode characters, unless the encoding is UTF-8. * We can't efficiently convert individual characters to and from Unicode with the current API. * What do we do about \u0000 ? TEXT datums cannot contain NUL characters. I'd say just ban Unicode escapes and non-ASCII characters unless the server encoding is UTF-8, and ban all \u0000 escapes. It's easy, and whatever we support later will be a superset of this. Strategies for handling this situation have been discussed in prior emails. This is where things got stuck last time. - Joey
On 01/14/2012 06:11 PM, Joey Adams wrote: > On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> Second, what should be do when the database encoding isn't UTF8? I'm >> inclined to emit a \unnnn escape for any non-ASCII character (assuming it >> has a unicode code point - are there any code points in the non-unicode >> encodings that don't have unicode equivalents?). The alternative would be to >> fail on non-ASCII characters, which might be ugly. Of course, anyone wanting >> to deal with JSON should be using UTF8 anyway, but we still have to deal >> with these things. What about SQL_ASCII? If there's a non-ASCII sequence >> there we really have no way of telling what it should be. There at least I >> think we should probably error out. > I don't think there is a satisfying solution to this problem. Things > working against us: > > * Some server encodings support characters that don't map to Unicode > characters (e.g. unused slots in Windows-1252). Thus, converting to > UTF-8 and back is lossy in general. > > * We want a normalized representation for comparison. This will > involve a mixture of server and Unicode characters, unless the > encoding is UTF-8. > > * We can't efficiently convert individual characters to and from > Unicode with the current API. > > * What do we do about \u0000 ? TEXT datums cannot contain NUL characters. > > I'd say just ban Unicode escapes and non-ASCII characters unless the > server encoding is UTF-8, and ban all \u0000 escapes. It's easy, and > whatever we support later will be a superset of this. > > Strategies for handling this situation have been discussed in prior > emails. This is where things got stuck last time. > Well, from where I'm coming from, nuls are not a problem. But escape_json() is currently totally encoding-unaware. It produces \unnnn escapes for low ascii characters, and just passes through characters with the high bit set. That's possibly OK for EXPLAIN output - we really don't want EXPLAIN failing. But maybe we should ban JSON output for EXPLAIN if the encoding isn't UTF8. Another question in my mind is what to do when the client encoding isn't UTF8. None of these is an insurmountable problem, ISTM - we just need to make some decisions. cheers andrew
I am very interested in experimenting with functional indexes into JSON structures. I think this could be very powerful combined with full text search as well as constraints. It would allow for using postgres as an unstructured data store without sacrificing the powerful indexing features, durability, and transactional semantics that come with using postgres or RDBMSes in general.
One use case in particular I have been trying to solve for lately is persisting and synchronizing client-side state (in a mobile application) with a server. It would be nice to have a flat, schemaless table (maybe a table that's like (id, type, owner, data) where data would be a JSON blob). I could do this now without JSON support, but I think indexing inside that JSON blob and having validation on the database side is valuable as well. And as I mentioned before, I'd rather not throw out the baby with the bathwater by using a different type of database, because ACID, replication, and constraints are also very important to this. As is being consistent with the rest of our technology stack. (I'd essentially be using a relational database to persist an object database)
I'm also not too concerned about storage consumption with this (even though columnar compression would help a lot in the future) since it's easily partitionable by user ID.
For my case the equivalent of postgres's XPath would work. Also having it as a maintained contrib module would be sufficient, although it being part of core as XPath is would be even better.
Just my $0.02... even if I'm a bit late to the conversation.
Thanks!
Mike
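To make the preceding use case concrete, here is a sketch of the kind of flat table and functional index being described. The patch under discussion provides only the json type and the *_to_json output functions; json_get below is a hypothetical extraction function that does not exist at this point in the thread:

    CREATE TABLE client_state (
        id    bigserial PRIMARY KEY,
        type  text,
        owner integer,
        data  json    -- validated by the json type's input function
    );

    -- hypothetical: index an attribute pulled out of the JSON blob
    CREATE INDEX client_state_name_idx ON client_state ((json_get(data, 'name')));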
On 01/14/2012 03:06 PM, Andrew Dunstan wrote: > > > > OK, here's a patch that does both query_to_json and array_to_json, > along with docs and regression tests. It include Robert's original > patch, although I can produce a differential patch if required. It can > also be pulled from <https://bitbucket.org/adunstan/pgdevel> > > Here's an update that adds row_to_json, plus a bit more cleanup. Example: andrew=# SELECT row_to_json(q) FROM (SELECT $$a$$ || x AS b, y AS c, ARRAY[ROW(x.*,ARRAY[1,2,3]), ROW(y.*,ARRAY[4,5,6])] AS z FROM generate_series(1,2) x, generate_series(4,5) y) q; row_to_json -------------------------------------------------------------------- {"b":"a1","c":4,"z":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]} {"b":"a1","c":5,"z":[{"f1":1,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]} {"b":"a2","c":4,"z":[{"f1":2,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]} {"b":"a2","c":5,"z":[{"f1":2,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]} (4 rows) (This might be more to Robert's taste than query_to_json() :-) ) cheers andrew
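As a usage note, the same function applies directly to ordinary table rows as well; the table name here is assumed for illustration:

    SELECT row_to_json(t) FROM my_table t;

This should emit one JSON object per row, keyed by the table's column names.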
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > Second, what should be do when the database encoding isn't UTF8? I'm > inclined to emit a \unnnn escape for any non-ASCII character (assuming it > has a unicode code point - are there any code points in the non-unicode > encodings that don't have unicode equivalents?). The alternative would be to > fail on non-ASCII characters, which might be ugly. Of course, anyone wanting > to deal with JSON should be using UTF8 anyway, but we still have to deal > with these things. What about SQL_ASCII? If there's a non-ASCII sequence > there we really have no way of telling what it should be. There at least I > think we should probably error out. I don't see any reason to escape anything more than the minimum required by the spec, which only requires it for control characters. If somebody's got a non-ASCII character in there, we can simply allow it to be represented by itself. That's almost certainly more compact (and very possibly more readable) than emitting \uXXXX for each such instance, and it also matches what the current EXPLAIN (FORMAT JSON) output does. In other words, let's decree that when the database encoding isn't UTF-8, *escaping* of non-ASCII characters doesn't work. But *unescaped* non-ASCII characters should still work just fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01/19/2012 03:49 PM, Robert Haas wrote: > > In other words, let's decree that when the database encoding isn't > UTF-8, *escaping* of non-ASCII characters doesn't work. But > *unescaped* non-ASCII characters should still work just fine. The spec only allows unescaped Unicode chars (and for our purposes that means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will result in something that's not legal JSON. See <http://www.ietf.org/rfc/rfc4627.txt?number=4627> section 3. cheers andrew
On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > On 01/19/2012 03:49 PM, Robert Haas wrote: >> In other words, let's decree that when the database encoding isn't >> UTF-8, *escaping* of non-ASCII characters doesn't work. But >> *unescaped* non-ASCII characters should still work just fine. > > The spec only allows unescaped Unicode chars (and for our purposes that > means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will > result in something that's not legal JSON. See > <http://www.ietf.org/rfc/rfc4627.txt?number=4627> section 3. I understand. I'm proposing that we not care. In other words, if the server encoding is UTF-8, it'll really be JSON. But if the server encoding is something else, it'll be almost-JSON. And specifically, the \uXXXX syntax won't work, and there might be some non-Unicode characters in there. If that's not the behavior you want, then use UTF-8. It seems pretty clear that we're going to have to make some trade-off to handle non-UTF8 encodings, and I think what I'm suggesting is a lot less painful than disabling high-bit characters altogether. If we do that, then what happens if a user runs EXPLAIN (FORMAT JSON) and his column label has a non-Unicode character in there? Should we say, oh, sorry, you can't explain that in JSON format? That is mighty unfriendly, and probably mighty complicated and expensive to figure out, too. We *do not support* mixing encodings in the same database, and if we make it the job of this patch to fix that problem, we're going to be in the same place for 9.2 that we have been for the last several releases: nowhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01/19/2012 04:12 PM, Robert Haas wrote: > On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >> On 01/19/2012 03:49 PM, Robert Haas wrote: >>> In other words, let's decree that when the database encoding isn't >>> UTF-8, *escaping* of non-ASCII characters doesn't work. But >>> *unescaped* non-ASCII characters should still work just fine. >> The spec only allows unescaped Unicode chars (and for our purposes that >> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will >> result in something that's not legal JSON. See >> <http://www.ietf.org/rfc/rfc4627.txt?number=4627> section 3. > I understand. I'm proposing that we not care. In other words, if the > server encoding is UTF-8, it'll really be JSON. But if the server > encoding is something else, it'll be almost-JSON. And specifically, > the \uXXXX syntax won't work, and there might be some non-Unicode > characters in there. If that's not the behavior you want, then use > UTF-8. > > It seems pretty clear that we're going to have to make some trade-off > to handle non-UTF8 encodings, and I think what I'm suggesting is a lot > less painful than disabling high-bit characters altogether. If we do > that, then what happens if a user runs EXPLAIN (FORMAT JSON) and his > column label has a non-Unicode character in there? Should we say, oh, > sorry, you can't explain that in JSON format? That is mighty > unfriendly, and probably mighty complicated and expensive to figure > out, too. We *do not support* mixing encodings in the same database, > and if we make it the job of this patch to fix that problem, we're > going to be in the same place for 9.2 that we have been for the last > several releases: nowhere. OK, then we need to say that very clearly and up front (including in the EXPLAIN docs.) Of course, for data going to the client, if the client encoding is UTF8, they should get legal JSON, regardless of what the database encoding is, and conversely too, no? cheers andrew >
On Thu, Jan 19, 2012 at 5:59 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > OK, then we need to say that very clearly and up front (including in the > EXPLAIN docs.) Can do. > Of course, for data going to the client, if the client encoding is UTF8, > they should get legal JSON, regardless of what the database encoding is, and > conversely too, no? Well, that would be nice, but I don't think it's practical. It will certainly be the case, under the scheme I'm proposing, or probably any other sensible scheme also, that if a client whose encoding is UTF-8 gets a value of type json back from the database, it's strictly valid JSON. But it won't be possible to store every legal JSON value in the database if the database encoding is anything other than UTF-8, even if the client encoding is UTF-8. The backend will get the client's UTF-8 bytes and transcode them to the server encoding before calling the type-input function, so if there are characters in there that can't be represented in the server encoding, then we'll error out before the JSON data type ever gets control. In theory, it would be possible to accept such strings if the client chooses to represent them using a \uXXXX sequence, but I'm unexcited about doing the work required to make that happen, because it will still be pretty half-baked: we'll be able to accept some representations of the same JSON constant but not others. I think the real fix for this problem is to introduce an infrastructure inside the database that allows us to have different columns stored in different encodings. People use bytea for that right now, but that's pretty unfriendly: it would be nice to have a better system. However, I expect that to take a lot of work and break a lot of things, and until we do it I don't feel that compelled to provide buggy and incomplete support for it under the guise of implementing a JSON datatype. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 19, 2012, at 4:27 PM, Robert Haas wrote: > I think the real fix for this problem is to introduce an > infrastructure inside the database that allows us to have different > columns stored in different encodings. People use bytea for that > right now, but that's pretty unfriendly: it would be nice to have a > better system. However, I expect that to take a lot of work and break > a lot of things, and until we do it I don't feel that compelled to > provide buggy and incomplete support for it under the guise of > implementing a JSON datatype. +1 This seems like a reasonable compromise and course of action, especially if someone is interested in taking on column-level encodings at some point in the next year or two. Best, David
Andrew Dunstan <andrew@dunslane.net> writes: > On 01/19/2012 04:12 PM, Robert Haas wrote: >> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >>> The spec only allows unescaped Unicode chars (and for our purposes that >>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will >>> result in something that's not legal JSON. >> I understand. I'm proposing that we not care. In other words, if the >> server encoding is UTF-8, it'll really be JSON. But if the server >> encoding is something else, it'll be almost-JSON. > Of course, for data going to the client, if the client encoding is UTF8, > they should get legal JSON, regardless of what the database encoding is, > and conversely too, no? Yes. I think this argument has been mostly theologizing, along the lines of how many JSON characters can dance on the head of a pin. From a user's perspective, the database encoding is only a constraint on which characters he can store. He does not know or care what the bit representation is inside the server. As such, if we store a non-ASCII character in a JSON string, it's valid JSON as far as the user is concerned, so long as that character exists in the Unicode standard. If his client encoding is UTF8, the value will be letter-perfect JSON when it gets to him; and if his client encoding is not UTF8, then he's already pretty much decided that he doesn't give a fig about the Unicode-centricity of the JSON spec, no? So I'm with Robert: we should just plain not care. I would further suggest that maybe what we should do with incoming JSON escape sequences is convert them to Unicode code points and then to the equivalent character in the database encoding (or throw error if there is none). regards, tom lane
On Fri, Jan 20, 2012 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 01/19/2012 04:12 PM, Robert Haas wrote: >>> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >>>> The spec only allows unescaped Unicode chars (and for our purposes that >>>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will >>>> result in something that's not legal JSON. > >>> I understand. I'm proposing that we not care. In other words, if the >>> server encoding is UTF-8, it'll really be JSON. But if the server >>> encoding is something else, it'll be almost-JSON. > >> Of course, for data going to the client, if the client encoding is UTF8, >> they should get legal JSON, regardless of what the database encoding is, >> and conversely too, no? > > Yes. I think this argument has been mostly theologizing, along the > lines of how many JSON characters can dance on the head of a pin. > From a user's perspective, the database encoding is only a constraint on > which characters he can store. Bingo. > He does not know or care what the bit > representation is inside the server. As such, if we store a non-ASCII > character in a JSON string, it's valid JSON as far as the user is > concerned, so long as that character exists in the Unicode standard. > If his client encoding is UTF8, the value will be letter-perfect JSON > when it gets to him; and if his client encoding is not UTF8, then he's > already pretty much decided that he doesn't give a fig about the > Unicode-centricity of the JSON spec, no? Also agreed. Personally, I think it may not have been a great idea to tie the JSON spec so closely to Unicode, but I understand that it would have been difficult to define an encoding-agnostic equivalent of \uXXXX, since it's hard to know for sure whether an arbitrary encoding even has a (sensible?) definition of code points, and they probably wanted to avoid ambiguity. But, it's bound to cause problems for any system that runs in some other encoding, which, when so requested, we do. Even if we had the ability to support multiple encodings in the same database, I'm not sure I'd be very excited about insisting that JSON data always be stored in UTF-8, because that would introduce a lot of unnecessary transcoding for people using other encodings and basically unnecessarily handicap the functionality provided by the datatype. But at least if we had that, people would have the *option* to use JSON with UTF-8 and get the fully spec-compliant behavior. As it is, they don't; the system we have forces the database encoding on all datatypes whether they like it or not, and that ain't changing for 9.2. > So I'm with Robert: we should just plain not care. I would further > suggest that maybe what we should do with incoming JSON escape sequences > is convert them to Unicode code points and then to the equivalent > character in the database encoding (or throw error if there is none). The code I've written so far does no canonicalization of the input value of any kind, just as we do for XML. I'm inclined to leave it that way. Eventually, we might want to make the JSON datatype support equality comparisons and so on, and that will require the system to knowing that the letter r can be encoded as some \uXXXX sequence and that the escape \r is equivalent to some other escape \uXXXX, but right now all the code does is try to validate that the JSON is legal, NOT second-guess the user's choice about how to spell things or where to insert whitespace. 
I think that's a good choice because (1) AFAIK, there's no official canonicalization method for JSON, so whatever we pick will be something we think is best, not an official method sanction by the spec, (2) users might prefer the way they chose to represent a given value over the way we choose to represent it, and (3) by simply validating and storing the JSON object, rather than doing any canonicalization, the input function avoids the need to do any data copying, hopefully maximizing speed. Canonicalization can be added on top of what I've done here and people who want or need it can use it; I have some ideas around how to make that leverage the existing code that I intend to pursue for 9.3, but right now I'd rather not go there. So, given that framework, what the patch does is this: if you're using UTF-8, then \uXXXX is accepted, provided that XXXX is something that equates to a legal Unicode code point. It isn't converted to the corresponding character: it's just validated. If you're NOT using UTF-8, then it allows \uXXXX for code points up through 127 (which we assume are the same in all encodings) and anything higher than that is rejected. If someone knows an easy way to check whether a \uXXXX sequence for XXXX > 007F is a legal Unicode code point that has an equivalent in the current server encoding, then we can add logic to allow that case also, but personally I'm not that excited about it. Anyone who is using \uXXXX escapes with a non-Unicode coding is probably hoping that we can store arbitrary code points, not just the ones that happen to exist in the server encoding, so they're probably going to be disappointed whether or not we spend time eating away at the edges of the problem. It's awfully interesting to think about how we could actually make cross-encoding stuff work for real. Would we make it a separate property that can be associated with each column, like we did for collations? Or would we handle it more like range types - give me an encoding, and I'll give you a new type OID that represents text stored in that encoding? But I guess that's a question for another day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
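To make the rule just described concrete, a few illustrative inputs and the behavior the rule implies (a sketch of the described validation, not output captured from the patch):

    SELECT '"\u0041"'::json;   -- accepted in any server encoding (code point <= 007F)
    SELECT '"\u00e9"'::json;   -- accepted only when the server encoding is UTF-8
    SELECT '"\uGGGG"'::json;   -- rejected everywhere: not four hex digits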
Robert Haas <robertmhaas@gmail.com> writes: > The code I've written so far does no canonicalization of the input > value of any kind, just as we do for XML. Fair enough. > So, given that framework, what the patch does is this: if you're using > UTF-8, then \uXXXX is accepted, provided that XXXX is something that > equates to a legal Unicode code point. It isn't converted to the > corresponding character: it's just validated. If you're NOT using > UTF-8, then it allows \uXXXX for code points up through 127 (which we > assume are the same in all encodings) and anything higher than that is > rejected. This seems a bit silly. If you're going to leave the escape sequence as ASCII, then why not just validate that it names a legal Unicode code point and be done? There is no reason whatever that that behavior needs to depend on the database encoding. regards, tom lane
On Fri, Jan 20, 2012 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The code I've written so far does no canonicalization of the input >> value of any kind, just as we do for XML. > > Fair enough. > >> So, given that framework, what the patch does is this: if you're using >> UTF-8, then \uXXXX is accepted, provided that XXXX is something that >> equates to a legal Unicode code point. It isn't converted to the >> corresponding character: it's just validated. If you're NOT using >> UTF-8, then it allows \uXXXX for code points up through 127 (which we >> assume are the same in all encodings) and anything higher than that is >> rejected. > > This seems a bit silly. If you're going to leave the escape sequence as > ASCII, then why not just validate that it names a legal Unicode code > point and be done? There is no reason whatever that that behavior needs > to depend on the database encoding. Mostly because that would prevent us from adding canonicalization in the future, AFAICS, and I don't want to back myself into a corner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 01/20/2012 09:19 AM, Robert Haas wrote: > On Fri, Jan 20, 2012 at 12:07 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> Andrew Dunstan<andrew@dunslane.net> writes: >>> On 01/19/2012 04:12 PM, Robert Haas wrote: >>>> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >>>>> The spec only allows unescaped Unicode chars (and for our purposes that >>>>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will >>>>> result in something that's not legal JSON. >>>> I understand. I'm proposing that we not care. In other words, if the >>>> server encoding is UTF-8, it'll really be JSON. But if the server >>>> encoding is something else, it'll be almost-JSON. >>> Of course, for data going to the client, if the client encoding is UTF8, >>> they should get legal JSON, regardless of what the database encoding is, >>> and conversely too, no? >> Yes. I think this argument has been mostly theologizing, along the >> lines of how many JSON characters can dance on the head of a pin. >> From a user's perspective, the database encoding is only a constraint on >> which characters he can store. > Bingo. > >> He does not know or care what the bit >> representation is inside the server. As such, if we store a non-ASCII >> character in a JSON string, it's valid JSON as far as the user is >> concerned, so long as that character exists in the Unicode standard. >> If his client encoding is UTF8, the value will be letter-perfect JSON >> when it gets to him; and if his client encoding is not UTF8, then he's >> already pretty much decided that he doesn't give a fig about the >> Unicode-centricity of the JSON spec, no? > Also agreed. Personally, I think it may not have been a great idea to > tie the JSON spec so closely to Unicode, but I understand that it > would have been difficult to define an encoding-agnostic equivalent of > \uXXXX, since it's hard to know for sure whether an arbitrary encoding > even has a (sensible?) definition of code points, and they probably > wanted to avoid ambiguity. But, it's bound to cause problems for any > system that runs in some other encoding, which, when so requested, we > do. Even if we had the ability to support multiple encodings in the > same database, I'm not sure I'd be very excited about insisting that > JSON data always be stored in UTF-8, because that would introduce a > lot of unnecessary transcoding for people using other encodings and > basically unnecessarily handicap the functionality provided by the > datatype. But at least if we had that, people would have the *option* > to use JSON with UTF-8 and get the fully spec-compliant behavior. As > it is, they don't; the system we have forces the database encoding on > all datatypes whether they like it or not, and that ain't changing for > 9.2. > >> So I'm with Robert: we should just plain not care. I would further >> suggest that maybe what we should do with incoming JSON escape sequences >> is convert them to Unicode code points and then to the equivalent >> character in the database encoding (or throw error if there is none). > The code I've written so far does no canonicalization of the input > value of any kind, just as we do for XML. I'm inclined to leave it > that way. 
Eventually, we might want to make the JSON datatype support > equality comparisons and so on, and that will require the system to > knowing that the letter r can be encoded as some \uXXXX sequence and > that the escape \r is equivalent to some other escape \uXXXX, but > right now all the code does is try to validate that the JSON is legal, > NOT second-guess the user's choice about how to spell things or where > to insert whitespace. I think that's a good choice because (1) AFAIK, > there's no official canonicalization method for JSON, so whatever we > pick will be something we think is best, not an official method > sanction by the spec, (2) users might prefer the way they chose to > represent a given value over the way we choose to represent it, and > (3) by simply validating and storing the JSON object, rather than > doing any canonicalization, the input function avoids the need to do > any data copying, hopefully maximizing speed. Canonicalization can be > added on top of what I've done here and people who want or need it can > use it; I have some ideas around how to make that leverage the > existing code that I intend to pursue for 9.3, but right now I'd > rather not go there. > > So, given that framework, what the patch does is this: if you're using > UTF-8, then \uXXXX is accepted, provided that XXXX is something that > equates to a legal Unicode code point. It isn't converted to the > corresponding character: it's just validated. If you're NOT using > UTF-8, then it allows \uXXXX for code points up through 127 (which we > assume are the same in all encodings) and anything higher than that is > rejected. If someone knows an easy way to check whether a \uXXXX > sequence for XXXX> 007F is a legal Unicode code point that has an > equivalent in the current server encoding, then we can add logic to > allow that case also, but personally I'm not that excited about it. > Anyone who is using \uXXXX escapes with a non-Unicode coding is > probably hoping that we can store arbitrary code points, not just the > ones that happen to exist in the server encoding, so they're probably > going to be disappointed whether or not we spend time eating away at > the edges of the problem. > > It's awfully interesting to think about how we could actually make > cross-encoding stuff work for real. Would we make it a separate > property that can be associated with each column, like we did for > collations? Or would we handle it more like range types - give me an > encoding, and I'll give you a new type OID that represents text stored > in that encoding? But I guess that's a question for another day. (thinks out loud) XML's nnnn; escape mechanism is more or less the equivalent of JSON's \unnnn. But XML documents can be encoded in a variety of encodings, including non-unicode encodings such as Latin-1. However, no matter what the document encoding, nnnn; designates the character with Unicode code point nnnn, whether or not that is part of the document encoding's charset. Given that precedent, I'm wondering if we do need to enforce anything other than that it is a valid unicode code point. Equivalence comparison is going to be difficult anyway if you're not resolving all \unnnn escapes. Possibly we need some sort of canonicalization function to apply for comparison purposes. But we're not providing any comparison ops today anyway, so I don't think we need to make that decision now. As you say, there doesn't seem to be any defined canonical form - the spec is a bit light on in this respect. cheers andrew
On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > XML's nnnn; escape mechanism is more or less the equivalent of JSON's > \unnnn. But XML documents can be encoded in a variety of encodings, > including non-unicode encodings such as Latin-1. However, no matter what the > document encoding, nnnn; designates the character with Unicode code point > nnnn, whether or not that is part of the document encoding's charset. OK. > Given that precedent, I'm wondering if we do need to enforce anything other > than that it is a valid unicode code point. > > Equivalence comparison is going to be difficult anyway if you're not > resolving all \unnnn escapes. Possibly we need some sort of canonicalization > function to apply for comparison purposes. But we're not providing any > comparison ops today anyway, so I don't think we need to make that decision > now. As you say, there doesn't seem to be any defined canonical form - the > spec is a bit light on in this respect. Well, we clearly have to resolve all \uXXXX to do either comparison or canonicalization. The current patch does neither, but presumably we want to leave the door open to such things. If we're using UTF-8 and comparing two strings, and we get to a position where one of them has a character and the other has \uXXXX, it's pretty simple to do the comparison: we just turn XXXX into a wchar_t and test for equality. That should be trivial, unless I'm misunderstanding. If, however, we're not using UTF-8, we have to first turn \uXXXX into a Unicode code point, then covert that to a character in the database encoding, and then test for equality with the other character after that. I'm not sure whether that's possible in general, how to do it, or how efficient it is. Can you or anyone shed any light on that topic? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 19, 2012, at 9:07 PM, Tom Lane wrote: > If his client encoding is UTF8, the value will be letter-perfect JSON > when it gets to him; and if his client encoding is not UTF8, then he's > already pretty much decided that he doesn't give a fig about the > Unicode-centricity of the JSON spec, no? Don’t entirely agree with this. Some folks are stuck with other encodings and cannot change them for one reason or another. That said, they can convert JSON from their required encoding into UTF-8 on the client side, so there is a workaround. Not that this changes anything, and I agree with the overall direction of the discussion here. I just want to make sure we keep in mind folks who don’t necessarily have the freedom to switch to UTF-8. (And I say this as someone who *always* uses UTF-8!) David
On Jan 20, 2012, at 8:58 AM, Robert Haas wrote: > If, however, > we're not using UTF-8, we have to first turn \uXXXX into a Unicode > code point, then covert that to a character in the database encoding, > and then test for equality with the other character after that. I'm > not sure whether that's possible in general, how to do it, or how > efficient it is. Can you or anyone shed any light on that topic? If it’s like the XML example, it should always represent a Unicode code point, and *not* be converted to the other character set, no? At any rate, since the JSON standard requires UTF-8, such distinctions having to do with alternate encodings are not likely to be covered, so I suspect we can do whatever we want here. It’s outside the spec. Best, David
On Fri, Jan 20, 2012 at 09:12:13AM -0800, David E. Wheeler wrote: > On Jan 19, 2012, at 9:07 PM, Tom Lane wrote: > > > If his client encoding is UTF8, the value will be letter-perfect JSON > > when it gets to him; and if his client encoding is not UTF8, then he's > > already pretty much decided that he doesn't give a fig about the > > Unicode-centricity of the JSON spec, no? > > Don’t entirely agree with this. Some folks are stuck with other encodings and > cannot change them for one reason or another. That said, they can convert > JSON from their required encoding into UTF-8 on the client side, so there is > a workaround. Perhaps in addition to trying to just 'do the right thing by default', it makes sense to have two canonicalization functions? Say: json_utf8() and json_ascii(). They could give the same output no matter what encoding was set? json_utf8 would give nice output where characters were canonicalized to native utf8 characters and json_ascii() would output only non-control ascii characters literally and escape everything else or something like that? Garick
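To illustrate the proposal, the intended behavior might look something like this; both functions are hypothetical and nothing like them exists in the patch being discussed:

    SELECT json_ascii('{"name":"José"}'::json);      -- {"name":"Jos\u00e9"}
    SELECT json_utf8('{"name":"Jos\u00e9"}'::json);  -- {"name":"José"}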
On 01/20/2012 11:58 AM, Robert Haas wrote: > On Fri, Jan 20, 2012 at 10:45 AM, Andrew Dunstan<andrew@dunslane.net> wrote: >> XML'snnnn; escape mechanism is more or less the equivalent of JSON's >> \unnnn. But XML documents can be encoded in a variety of encodings, >> including non-unicode encodings such as Latin-1. However, no matter what the >> document encoding,nnnn; designates the character with Unicode code point >> nnnn, whether or not that is part of the document encoding's charset. > OK. > >> Given that precedent, I'm wondering if we do need to enforce anything other >> than that it is a valid unicode code point. >> >> Equivalence comparison is going to be difficult anyway if you're not >> resolving all \unnnn escapes. Possibly we need some sort of canonicalization >> function to apply for comparison purposes. But we're not providing any >> comparison ops today anyway, so I don't think we need to make that decision >> now. As you say, there doesn't seem to be any defined canonical form - the >> spec is a bit light on in this respect. > Well, we clearly have to resolve all \uXXXX to do either comparison or > canonicalization. The current patch does neither, but presumably we > want to leave the door open to such things. If we're using UTF-8 and > comparing two strings, and we get to a position where one of them has > a character and the other has \uXXXX, it's pretty simple to do the > comparison: we just turn XXXX into a wchar_t and test for equality. > That should be trivial, unless I'm misunderstanding. If, however, > we're not using UTF-8, we have to first turn \uXXXX into a Unicode > code point, then covert that to a character in the database encoding, > and then test for equality with the other character after that. I'm > not sure whether that's possible in general, how to do it, or how > efficient it is. Can you or anyone shed any light on that topic? We know perfectly well how to turn two strings from encoding x to utf8 (see mb_utils.c::pg_do_encoding_conversion() ). Once we've done that ISTM we have reduced this to the previous problem, as the mathematicians like to say. cheers andrew
On Fri, Jan 20, 2012 at 12:14 PM, David E. Wheeler <david@kineticode.com> wrote: > On Jan 20, 2012, at 8:58 AM, Robert Haas wrote: > >> If, however, >> we're not using UTF-8, we have to first turn \uXXXX into a Unicode >> code point, then covert that to a character in the database encoding, >> and then test for equality with the other character after that. I'm >> not sure whether that's possible in general, how to do it, or how >> efficient it is. Can you or anyone shed any light on that topic? > > If it’s like the XML example, it should always represent a Unicode code point, and *not* be converted to the other character set, no? Well, you can pick which way you want to do the conversion. If the database encoding is SJIS, and there's an SJIS character in a string that gets passed to json_in(), and there's another string which also gets passed to json_in() which contains \uXXXX, then any sort of canonicalization or equality testing is going to need to convert the SJIS character to a Unicode code point, or the Unicode code point to an SJIS character, to see whether they match. Err, actually, now that I think about it, that might be a problem: what happens if we're trying to test two characters for equality and the encoding conversion fails? We really just want to return false - the strings are clearly not equal if either contains even one character that can't be converted to the other encoding - so it's not good if an error gets thrown in there anywhere. > At any rate, since the JSON standard requires UTF-8, such distinctions having to do with alternate encodings are not likely to be covered, so I suspect we can do whatever we want here. It’s outside the spec. I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Err, actually, now that I think about it, that might be a problem: > what happens if we're trying to test two characters for equality and > the encoding conversion fails? This is surely all entirely doable given the encoding infrastructure we already have. We might need some minor refactoring, eg to have a way of not throwing an error, but it's not going to be that hard to achieve if somebody wants to do it. So I still see little reason for making the JSON type behave visibly differently in non-UTF8 database encodings. regards, tom lane
On Fri, Jan 20, 2012 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Err, actually, now that I think about it, that might be a problem: >> what happens if we're trying to test two characters for equality and >> the encoding conversion fails? > > This is surely all entirely doable given the encoding infrastructure > we already have. We might need some minor refactoring, eg to have > a way of not throwing an error, but it's not going to be that hard > to achieve if somebody wants to do it. So I still see little reason > for making the JSON type behave visibly differently in non-UTF8 database > encodings. OK. It feels a little grotty to me, but I'll go with the flow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 15, 2012 at 8:08 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 01/14/2012 03:06 PM, Andrew Dunstan wrote: >> >> >> >> >> OK, here's a patch that does both query_to_json and array_to_json, along >> with docs and regression tests. It include Robert's original patch, although >> I can produce a differential patch if required. It can also be pulled from >> <https://bitbucket.org/adunstan/pgdevel> >> >> > > > Here's an update that adds row_to_json, plus a bit more cleanup. This is bit-rotted such that initdb fails creating template1 database in /tmp/bar/src/test/regress/./tmp_check/data/base/1 ... FATAL: could not create unique index "pg_proc_oid_index" DETAIL: Key (oid)=(3145) is duplicated. I bumped up those oids in the patch, and it passes make check once I figure out how to get the test run under UTF-8. Is it supposed to pass under other encodings? I can't tell from the rest of thread whether it supposed to pass in other encodings or not. Cheers, Jeff
On 01/21/2012 11:40 PM, Jeff Janes wrote: > On Sun, Jan 15, 2012 at 8:08 AM, Andrew Dunstan<andrew@dunslane.net> wrote: >> >> On 01/14/2012 03:06 PM, Andrew Dunstan wrote: >>> >>> >>> >>> OK, here's a patch that does both query_to_json and array_to_json, along >>> with docs and regression tests. It include Robert's original patch, although >>> I can produce a differential patch if required. It can also be pulled from >>> <https://bitbucket.org/adunstan/pgdevel> >>> >>> >> >> Here's an update that adds row_to_json, plus a bit more cleanup. > This is bit-rotted such that initdb fails > > creating template1 database in > /tmp/bar/src/test/regress/./tmp_check/data/base/1 ... FATAL: could > not create unique index "pg_proc_oid_index" > DETAIL: Key (oid)=(3145) is duplicated. > > I bumped up those oids in the patch, and it passes make check once I > figure out how to get the test run under UTF-8. Is it supposed to > pass under other encodings? I can't tell from the rest of thread > whether it supposed to pass in other encodings or not. > Yeah, regression tests generally are supposed to run in all encodings. Either we could knock out the offending test, or we could supply an alternative result file. If we do the latter, maybe we should modify the query slightly, so it reads SELECT getdatabaseencoding() = 'UTF8' as is_utf8, '"\uaBcD"'::json; cheers andrew
On 01/22/2012 04:28 AM, Andrew Dunstan wrote: > > > On 01/21/2012 11:40 PM, Jeff Janes wrote: >> On Sun, Jan 15, 2012 at 8:08 AM, Andrew Dunstan<andrew@dunslane.net> >> wrote: >>> >>> On 01/14/2012 03:06 PM, Andrew Dunstan wrote: >>>> >>>> >>>> >>>> OK, here's a patch that does both query_to_json and array_to_json, >>>> along >>>> with docs and regression tests. It include Robert's original patch, >>>> although >>>> I can produce a differential patch if required. It can also be >>>> pulled from >>>> <https://bitbucket.org/adunstan/pgdevel> >>>> >>>> >>> >>> Here's an update that adds row_to_json, plus a bit more cleanup. >> This is bit-rotted such that initdb fails >> >> creating template1 database in >> /tmp/bar/src/test/regress/./tmp_check/data/base/1 ... FATAL: could >> not create unique index "pg_proc_oid_index" >> DETAIL: Key (oid)=(3145) is duplicated. >> >> I bumped up those oids in the patch, and it passes make check once I >> figure out how to get the test run under UTF-8. Is it supposed to >> pass under other encodings? I can't tell from the rest of thread >> whether it supposed to pass in other encodings or not. >> > > Yeah, regression tests generally are supposed to run in all encodings. > Either we could knock out the offending test, or we could supply an > alternative result file. If we do the latter, maybe we should modify > the query slightly, so it reads > > SELECT 'getdatabaseencoding() = 'UTF8' as is_utf8, "\uaBcD"'::json; > > Actually, given recent discussion I think that test should just be removed from json.c. We don't actually have any test that the code point is valid (e.g. that it doesn't refer to an unallocated code point). We don't do that elsewhere either - the unicode_to_utf8() function the scanner uses to turn \unnnn escapes into utf8 doesn't look for unallocated code points. I'm not sure how much other validation we should do - for example on correct use of surrogate pairs. I'd rather get this as right as possible now - every time we tighten encoding rules to make sure incorrectly encoded data doesn't get into the database it causes someone real pain. cheers andrew
On sön, 2012-01-22 at 11:43 -0500, Andrew Dunstan wrote: > Actually, given recent discussion I think that test should just be > removed from json.c. We don't actually have any test that the code > point is valid (e.g. that it doesn't refer to an unallocated code > point). We don't do that elsewhere either - the unicode_to_utf8() > function the scanner uses to turn \unnnn escapes into utf8 doesn't > look for unallocated code points. I'm not sure how much other > validation we should do - for example on correct use of surrogate > pairs. We do check the correctness of surrogate pairs elsewhere. Search for "surrogate" in scan.l; should be easy to copy.
On Sun, Jan 15, 2012 at 10:08 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > Here's an update that adds row_to_json, plus a bit more cleanup. why not call all these functions 'to_json' and overload them? merlin
On 01/23/2012 05:21 PM, Merlin Moncure wrote: > On Sun, Jan 15, 2012 at 10:08 AM, Andrew Dunstan<andrew@dunslane.net> wrote: >> Here's an update that adds row_to_json, plus a bit more cleanup. > why not call all these functions 'to_json' and overload them? > I don't honestly feel that advances clarity much. And we might want to overload each at some stage with options that are specific to the datum type. We have various foo_to_xml() functions now. cheers andrew
2012/1/23 Merlin Moncure <mmoncure@gmail.com>: > On Sun, Jan 15, 2012 at 10:08 AM, Andrew Dunstan <andrew@dunslane.net> wrote: >> Here's an update that adds row_to_json, plus a bit more cleanup. > > why not call all these functions 'to_json' and overload them? -1 older proposal is more consistent with xml functions Pavel > > merlin
At 2012-01-15 11:08:05 -0500, andrew@dunslane.net wrote: > > Here's an update that adds row_to_json, plus a bit more cleanup. I've started reviewing this patch, but it'll take me a bit longer to go through json.c properly. Here are a few preliminary notes: 1. The patch has a lot of whitespace errors (primarily lines ending in whitespace), but applying with git apply --whitespace=fix isn't safe, because the test results need some (but not all) of those spaces. I applied the patch, backed out the changes to expected/json.out, and created the file from the patch, then removed the superfluous whitespace. 2. I bumped some function OIDs to avoid conflicts. 3. One documentation typo. Everything other than json.c (which I haven't read yet) looks fine (builds, passes tests). I've attached a patch covering the changes I made. More later. -- ams
At 2012-01-27 09:47:05 +0530, ams@toroid.org wrote: > > I've started reviewing this patch, but it'll take me a bit longer to go > through json.c properly. OK, I finished reading json.c. I don't have an answer to the detoasting question in the XXX comment, but the code looks fine. Aside: is query_to_json really necessary? It seems rather ugly and easily avoidable using row_to_json. -- ams
On 01/30/2012 09:54 AM, Abhijit Menon-Sen wrote: > At 2012-01-27 09:47:05 +0530, ams@toroid.org wrote: >> I've started reviewing this patch, but it'll take me a bit longer to go >> through json.c properly. > OK, I finished reading json.c. I don't have an answer to the detoasting > question in the XXX comment, but the code looks fine. Looking at somewhat analogous code in xml.c, it doesn't seem to be done there. So maybe we don't need to worry about it. > > Aside: is query_to_json really necessary? It seems rather ugly and > easily avoidable using row_to_json. > I started with this, again by analogy with query_to_xml(). But I agree it's a bit ugly. If we're not going to do it, then we definitely need to look at caching the output funcs in the function info. A closer approximation is actually: SELECT array_to_json(array_agg(q)) FROM ( your query here ) q; But then I'd want the ability to break that up a bit with line feeds, so we'd need to adjust the interface slightly. (Hint: don't try the above with "select * from pg_class".) I'll wait on further comments, but I can probably turn these changes around very quickly once we're agreed. Cheers andrew
On Mon, Jan 30, 2012 at 9:37 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > On 01/30/2012 09:54 AM, Abhijit Menon-Sen wrote: >> >> At 2012-01-27 09:47:05 +0530, ams@toroid.org wrote: >>> >>> I've started reviewing this patch, but it'll take me a bit longer to go >>> through json.c properly. >> >> OK, I finished reading json.c. I don't have an answer to the detoasting >> question in the XXX comment, but the code looks fine. > > > > Looking at somewhat analogous code in xml.c, it doesn't seem to be done > there. SO maybe we don't need to worry about it. > > >> >> Aside: is query_to_json really necessary? It seems rather ugly and >> easily avoidable using row_to_json. >> > > I started with this, again by analogy with query_to_xml(). But I agree it's > a bit ugly. If we're not going to do it, then we definitely need to look at > caching the output funcs in the function info. A closer approximation is > actually: > > SELECT array_to_json(array_agg(q)) > FROM ( your query here ) q; yup -- although I'd probably write it like this most of the time: select array_to_json(array( <query> )); if we did have a 'query_to_json', the array() constructor would be a lot more pleasant to deal with than a textual query for obvious reasons even though it's highly irregular syntax. however, since arrays can already handle it, I wouldn't miss it at all. merlin
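To make the two spellings concrete, here is a minimal sketch (the generate_series query is illustrative only, not from the patch, and the exact whitespace emitted may differ):

    -- Andrew's form: aggregate the subquery's rows, then convert the array.
    SELECT array_to_json(array_agg(q))
      FROM (SELECT x, x * 2 AS y FROM generate_series(1, 2) x) q;
    -- [{"x":1,"y":2},{"x":2,"y":4}]

    -- Merlin's form: build the array with an ARRAY() subquery instead of array_agg().
    SELECT array_to_json(array(SELECT q FROM (SELECT x, x * 2 AS y FROM generate_series(1, 2) x) q));
    -- [{"x":1,"y":2},{"x":2,"y":4}]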
On Mon, Jan 23, 2012 at 3:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On sön, 2012-01-22 at 11:43 -0500, Andrew Dunstan wrote: >> Actually, given recent discussion I think that test should just be >> removed from json.c. We don't actually have any test that the code >> point is valid (e.g. that it doesn't refer to an unallocated code >> point). We don't do that elsewhere either - the unicode_to_utf8() >> function the scanner uses to turn \unnnn escapes into utf8 doesn't >> look for unallocated code points. I'm not sure how much other >> validation we should do - for example on correct use of surrogate >> pairs. > > We do check the correctness of surrogate pairs elsewhere. Search for > "surrogate" in scan.l; should be easy to copy. I've committed a version of this that does NOT do surrogate pair validation. Per discussion elsewhere, I also removed the check for \uXXXX with XXXX > 007F and database encoding != UTF8. This will complicate things somewhat when we get around to doing canonicalization and comparison, but Tom seems confident that those issues are manageable. I did not commit Andrew's further changes, either; I'm assuming he'll do that himself. With respect to the issue of whether we ought to check surrogate pairs, the JSON spec is not a whole lot of help. RFC4627 says: To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". That fails to answer the question of what we ought to do if we get an invalid sequence there. You could make an argument that we ought to just allow it; it doesn't particularly hinder our ability to canonicalize or compare strings, because our notion of sort-ordering for characters that may span multiple encodings is going to be pretty funky anyway. We can just leave those bits as \uXXXX sequences and call it good. However, it would hinder our ability to convert a JSON string to a string in the database encoding: we could find an invalid surrogate pair that was allowable as JSON but unrepresentable in the database encoding. On the flip side, given our decision to allow all \uXXXX sequences even when not using UTF-8, we could also run across a perfectly valid UTF-8 sequence that's not representable as a character in the server encoding, so it seems we have that problem anyway, so maybe it's not much worse to have two reasons why it can happen rather than one. On the third hand, most people are probably using UTF-8, and those people aren't going to have any transcoding issues, so the invalid surrogate pair case may be the only one they can hit (unless invalid code points are also an issue?), so maybe it's worth avoiding on that basis. Anyway, I defer to the wisdom of the collective on this one: how should we handle this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
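For reference, the RFC's example looks like this as input to the committed type. This is only a sketch of behavior as committed: surrogate pairs are not validated, so the second statement is accepted as-is and would start failing if the checks from scan.l were copied over.

    -- A well-formed surrogate pair: the G clef character U+1D11E, escaped as \uD834\uDD1E.
    SELECT '"\uD834\uDD1E"'::json;

    -- A lone high surrogate: accepted as committed (no pair validation),
    -- arguably something that should be rejected.
    SELECT '"\uD834"'::json;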
On 01/30/2012 10:37 AM, Andrew Dunstan wrote: > > >> Aside: is query_to_json really necessary? It seems rather ugly and >> easily avoidable using row_to_json. >> > > I started with this, again by analogy with query_to_xml(). But I agree > it's a bit ugly. If we're not going to do it, then we definitely need > to look at caching the output funcs in the function info. A closer > approximation is actually: > > SELECT array_to_json(array_agg(q)) > FROM ( your query here ) q; > > > But then I'd want the ability to break that up a bit with line feeds, > so we'd need to adjust the interface slightly. (Hint: don't try the > above with "select * from pg_class".) > > > I'll wait on further comments, but I can probably turn these changes > around very quickly once we're agreed. > > > based on Abhijit's feeling and some discussion offline, the consensus seems to be to remove query_to_json. cheers andrew
Andrew, > based on Abhijit's feeling and some discussion offline, the consensus > seems to be to remove query_to_json. If we do that, what would getting complete query results back from a query look like? It's important to make this as simple for developers as possible. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
At 2012-01-31 12:04:31 -0500, robertmhaas@gmail.com wrote: > > That fails to answer the question of what we ought to do if we get an > invalid sequence there. I think it's best to categorically reject invalid surrogates as early as possible, considering the number of bugs that are related to them (not in Postgres, just in general). I can't see anything good coming from letting them in and leaving them to surprise someone in future. -- ams
On Tue, Jan 31, 2012 at 12:15 PM, Josh Berkus <josh@agliodbs.com> wrote: > Andrew, > >> based on Abhijit's feeling and some discussion offline, the consensus >> seems to be to remove query_to_json. > > If we do that, what would getting complete query results back from a > query look like? It's important to make this as simple for developers > as possible. two options: 1. row_to_json(rowvar) SELECT row_to_json(foo) from foo; SELECT row_to_json(row(a,b,c)) from foo; 2. array_to_json(array_agg()/array()) SELECT array_to_json(array(select foo from foo)); SELECT array_to_json(array[1,2,3]); #1 I expect will be the more used version -- most json handling client side api (for example node.js drivers) are optimized for row by row processing, but via #2 you can stuff a whole query into single json object if you're so inclined. merlin
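A small illustration of the two options, using a made-up table (the table and data are invented for the example; output whitespace may differ slightly):

    CREATE TABLE t (a int, b text);
    INSERT INTO t VALUES (1, 'x'), (2, 'y');

    -- Option 1: one JSON object per row.
    SELECT row_to_json(t) FROM t;
    --  {"a":1,"b":"x"}
    --  {"a":2,"b":"y"}

    -- Option 2: the whole result as a single JSON array.
    SELECT array_to_json(array(SELECT t FROM t));
    --  [{"a":1,"b":"x"},{"a":2,"b":"y"}]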
On 01/31/2012 01:32 PM, Merlin Moncure wrote: > On Tue, Jan 31, 2012 at 12:15 PM, Josh Berkus<josh@agliodbs.com> wrote: >> Andrew, >> >>> based on Abhijit's feeling and some discussion offline, the consensus >>> seems to be to remove query_to_json. >> If we do that, what would getting complete query results back from a >> query look like? It's important to make this as simple for developers >> as possible. > two options: > 1. row_to_json(rowvar) > SELECT row_to_json(foo) from foo; > SELECT row_to_json(row(a,b,c)) from foo; > > 2. array_to_json(array_agg()/array()) > SELECT array_to_json(array(select foo from foo)); > SELECT array_to_json(array[1,2,3]); > > #1 I expect will be the more used version -- most json handling client > side api (for example node.js drivers) are optimized for row by row > processing, but via #2 you can stuff a whole query into single json > object if you're so inclined. > You could also write a wrapper something like this: create function query_to_json(qtext text) returns json language plpgsql as $$ declare result json; begin execute 'select array_to_json(array(' || qtext || '))' into result; return result; end; $$; cheers andrew
On Tue, Jan 31, 2012 at 12:48 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > On 01/31/2012 01:32 PM, Merlin Moncure wrote: >> On Tue, Jan 31, 2012 at 12:15 PM, Josh Berkus<josh@agliodbs.com> wrote: >>> >>> Andrew, >>> >>>> based on Abhijit's feeling and some discussion offline, the consensus >>>> seems to be to remove query_to_json. >>> >>> If we do that, what would getting complete query results back from a >>> query look like? It's important to make this as simple for developers >>> as possible. >> >> two options: >> 1. row_to_json(rowvar) >> SELECT row_to_json(foo) from foo; >> SELECT row_to_json(row(a,b,c)) from foo; >> >> 2. array_to_json(array_agg()/array()) >> SELECT array_to_json(array(select foo from foo)); >> SELECT array_to_json(array[1,2,3]); >> >> #1 I expect will be the more used version -- most json handling client >> side api (for example node.js drivers) are optimized for row by row >> processing, but via #2 you can stuff a whole query into single json >> object if you're so inclined. >> > > You could also write a wrapper something like this: > > create function query_to_json(qtext text) returns json language > plpgsql as > $$ > begin > return query execute 'select array_to_json(array(' || qtext || > '))'; > end; > $$; right -- then you can leverage execute/using parameterization etc. and/or rig a variadic version. The major hole in functionality I see for heavy json users is the reverse; how do you get json back into the database? With xml, at least you could (ab)use xpath for that...with json you have to rely on add-on support and/or ad hoc string parsing (that is, unless I'm missing something -- I just noted Robert's commit of the JSON type). since we can do: select array_to_json(array(select foo from foo)); it seems natural to want to be able to do something like: WITH foos AS (SELECT a_json_var::foo[] AS f) ( INSERT INTO foo SELECT (f).* FROM foos ); Of course, you'd have to have non-anonymous (that is, defined with CREATE TYPE AS) types defined to receive all the data, but that's not so bad. Also, could xxx_to_json be hypothetically executed via casts? e.g. select array(select foo from foo)::json; merlin
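On the parameterization point, a hedged sketch of what such a variant could look like (the two-argument function and its name are hypothetical, not part of the patch):

    create function query_to_json(qtext text, param int) returns json
    language plpgsql as $$
    declare
        result json;
    begin
        -- $1 in the dynamic query is bound from the USING clause, so the value is
        -- passed as a parameter rather than spliced into the query text.
        execute 'select array_to_json(array(' || qtext || '))' into result using param;
        return result;
    end;
    $$;

    -- e.g.  select query_to_json('select x from generate_series(1, $1) x', 3);  -- [1,2,3]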
On 01/31/2012 02:49 PM, Merlin Moncure wrote: > The major hole in functionality I see for heavy json users is the > reverse; how do you get json back into the database? With xml, at > least you could (ab)use xpath for that...with json you have to rely on > add-on support and/or ad hoc string parsing (that is, unless I'm > missing something -- I just noted Robert's commit of the JSON type). > since we can do: select array_to_json(array(select foo from foo)); it > seems natural to be able to want do do something like: WITH foos AS > (SELECT a_json_var::foo[] AS f) ( INSERT INTO foo SELECT (f).* FROM > foos ); Of course, you'd have to have non-anonymous (that is, defined > with CREATE TYPE AS) types defined to receive all the data, but that's > not so bad. Also, could xxx_to_json be hypothetically executed via > casts? e.g. select array(select foo from foo)::json; At some stage there will possibly be some json-processing (as opposed to json-producing) functions, but not in 9.2 - it's too late for that. Until then there is PLV8: see <http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html> for example. Or someone could write an extension. cheers andrew
On Tue, Jan 31, 2012 at 1:29 PM, Abhijit Menon-Sen <ams@toroid.org> wrote: > At 2012-01-31 12:04:31 -0500, robertmhaas@gmail.com wrote: >> >> That fails to answer the question of what we ought to do if we get an >> invalid sequence there. > > I think it's best to categorically reject invalid surrogates as early as > possible, considering the number of bugs that are related to them (not > in Postgres, just in general). I can't see anything good coming from > letting them in and leaving them to surprise someone in future. > > -- ams +1 Another sequence to beware of is \u0000. While escaped NUL characters are perfectly valid in JSON, NUL characters aren't allowed in TEXT values. This means not all JSON strings can be converted to TEXT, even in UTF-8. This may also complicate collation, if comparison functions demand null-terminated strings. I'm mostly in favor of allowing \u0000. Banning \u0000 means users can't use JSON strings to marshal binary blobs, e.g. by escaping non-printable characters and only using U+0000..U+00FF. Instead, they have to use base64 or similar. Banning \u0000 doesn't quite violate the RFC: An implementation may set limits on the length and character contents of strings. -Joey
On 01/31/2012 01:48 PM, Andrew Dunstan wrote: > > > On 01/31/2012 01:32 PM, Merlin Moncure wrote: >> On Tue, Jan 31, 2012 at 12:15 PM, Josh Berkus<josh@agliodbs.com> wrote: >>> Andrew, >>> >>>> based on Abhijit's feeling and some discussion offline, the consensus >>>> seems to be to remove query_to_json. >>> If we do that, what would getting complete query results back from a >>> query look like? It's important to make this as simple for developers >>> as possible. >> two options: >> 1. row_to_json(rowvar) >> SELECT row_to_json(foo) from foo; >> SELECT row_to_json(row(a,b,c)) from foo; >> >> 2. array_to_json(array_agg()/array()) >> SELECT array_to_json(array(select foo from foo)); >> SELECT array_to_json(array[1,2,3]); >> >> #1 I expect will be the more used version -- most json handling client >> side api (for example node.js drivers) are optimized for row by row >> processing, but via #2 you can stuff a whole query into single json >> object if you're so inclined. >> > > You could also write a wrapper something like this: > > create function query_to_json(qtext text) returns json language > plpgsql as > $$ > begin > return query execute 'select array_to_json(array(' || qtext || > '))'; > end; > $$; The array(select...) locution turns out to have less flexibility than the array_agg(record-ref) locution. Anyway, for those playing along, I have removed query_to_json, and added optional pretty printing to array_to_json and row_to_json. WIP can be seen at <https://bitbucket.org/adunstan/pgdevel>. I still have docs and output function caching to do, and should post a revised patch within the next 48 hours. cheers andrew
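For those following along, the optional pretty-printing argument looks roughly like this against the WIP branch (a sketch only; the exact output layout could still change before commit):

    -- The second argument requests line feeds between array elements.
    SELECT array_to_json(array(SELECT q FROM (SELECT x FROM generate_series(1, 2) x) q), true);
    -- [{"x":1},
    --  {"x":2}]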
On Tue, Jan 31, 2012 at 11:46 PM, Andrew Dunstan <andrew@dunslane.net> wrote: > The array(select...) locution turns out to have less flexibility than the > array_agg(record-ref) locution. Less flexible maybe, but it can be cleaner for exactly the type of queries that will tend to come up in exactly the type of functionality people are looking for with JSON output. libpqtypes does exactly the same stuff but for C clients -- so I've done tons of this kind of programming and am maybe a bit ahead of the curve here. Note: while the following contrived example may seem a bit complex it has a certain elegance and shows how the postgres type system can whip out document style 'nosql' objects to clients who can handle them. Perhaps there is more simplification through syntax possible, but as it stands things are pretty functional. The equivalent production through array_agg I find to be pretty awful looking although it can produce a better plan since it doesn't force everything through flattened subqueries: create table foo ( foo_id serial primary key, a int ); create table bar ( bar_id serial primary key, foo_id int references foo, b int ); create table baz ( baz_id serial primary key, bar_id int references bar, c int ); create type baz_t as ( c int ); create type bar_t as ( bazs baz_t[], b int ); create type foo_t as ( bars bar_t[], a int ); INSERT INTO foo(a) VALUES (1); INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 100); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 1000); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 2000); INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 200); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 3000); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 4000); INSERT INTO foo(a) VALUES (2); INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 300); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 5000); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 6000); INSERT INTO bar(foo_id, b) VALUES (currval('foo_foo_id_seq'), 400); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 7000); INSERT INTO baz(bar_id, c) VALUES (currval('bar_bar_id_seq'), 8000); -- nosql! select array( select row( array( select row( array( select row( c )::baz_t from baz where baz.bar_id = bar.bar_id )::baz_t[], b )::bar_t from bar where bar.foo_id = foo.foo_id )::bar_t[], a )::foo_t from foo )::foo_t[]; foo_t -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------{"(\"{\"\"(\\\\\"\"{(1000),(2000)}\\\\\"\",100)\"\",\"\"(\\\\\"\"{(3000),(4000)}\\\\\"\",200)\"\"}\",1)","(\"{\"\"(\\\\\"\"{(5000),(6000)}\\\\\"\",300)\"\",\"\"(\\\\\"\"{(7000),(8000)}\\\\\"\",400)\"\"}\",2)"} as you can see, the postgres default escaping format sucks for sending nested data -- throw even one quote or backslash in there and your data can explode in size 10+ times -- this is why we insisted on binary. json, of course, is much better suited for this type of communication. despite the complicated look of the above, this type of code is in fact very easy to write once you get the knack. This type of coding also leads to much simpler coding on the client since relationships are directly built into the structure and don't have to be inferred or duplicated. merlin
On Tue, Jan 31, 2012 at 3:47 PM, Joey Adams <joeyadams3.14159@gmail.com> wrote: > I'm mostly in favor of allowing \u0000. Banning \u0000 means users > can't use JSON strings to marshal binary blobs, e.g. by escaping > non-printable characters and only using U+0000..U+00FF. Instead, they > have to use base64 or similar. I agree. I mean, representing data using six bytes per source byte is a bit unattractive from an efficiency point of view, but I'm sure someone is going to want to do it. It's also pretty clear that JSON string -> PG text data type is going to admit of a number of error conditions (transcoding errors and perhaps invalid surrogate pairs) so throwing one more on the pile doesn't cost much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
At 2012-02-01 11:28:50 -0500, robertmhaas@gmail.com wrote: > > It's also pretty clear that JSON > string -> PG text data type is going to admit of a number of error > conditions (transcoding errors and perhaps invalid surrogate pairs) so > throwing one more on the pile doesn't cost much. Hi Robert. I'm sorry for being slow, but I don't understand what you're proposing to do here (if anything). Could I ask you to explain, please? Are you talking about allowing the six literal bytes "\u0000" to be present in the JSON? If so, I agree, there seems to be no reason to disallow it. Are you also saying we should allow any "\uNNNN" sequence, without checking for errors (e.g. invalid surrogate pairs or parts thereof)? And what transcoding errors are you referring to? -- ams
On Thu, Feb 2, 2012 at 4:54 AM, Abhijit Menon-Sen <ams@toroid.org> wrote: > At 2012-02-01 11:28:50 -0500, robertmhaas@gmail.com wrote: >> It's also pretty clear that JSON >> string -> PG text data type is going to admit of a number of error >> conditions (transcoding errors and perhaps invalid surrogate pairs) so >> throwing one more on the pile doesn't cost much. > > I'm sorry for being slow, but I don't understand what you're proposing > to do here (if anything). Could I ask you to explain, please? > > Are you talking about allowing the six literal bytes "\u0000" to be > present in the JSON? If so, I agree, there seems to be no reason to > disallow it. > > Are you also saying we should allow any "\uNNNN" sequence, without > checking for errors (e.g. invalid surrogate pairs or parts thereof)? > > And what transcoding errors are you referring to? Consider the following JSON object: "abc" This is a JSON string. Someone is eventually going to propose a function with a name like json_to_string() which, when given this JSON object, returns a three-character string with the PostgreSQL text type. That's useful and I support it. But now suppose we pass this JSON object to that same function: "a\u0062c" The user will quite rightly expect that since this JSON object represents the same value as the first JSON object, they're going to get the same answer back from json_to_string(), namely "abc". So far, so good. But now suppose we pass this JSON object to that same function: "a\u0000c" This is going to have to be an error condition, because PostgreSQL does not allow values of type text to contain embedded NUL characters. Now consider this: "a\uABCDc" Suppose that \uABCD represents a character that exists in Unicode, but the server-encoding is SHIFT-JIS or EUC-JP or some other system which has no equivalent for the character represented by \uABCD. Again, when json_to_string() is applied to a value of this type, it must fail. In other words, we're knowingly allowing JSON strings to contain characters which might not be representable as PostgreSQL strings, because JSON allows any Unicode character, and the server encoding might not be Unicode, and the server doesn't allow embedded NULs in any encoding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
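To restate those cases compactly: json_to_string() below is Robert's hypothetical function and does not exist; the convert_to() call merely demonstrates the same class of transcoding failure with an existing function, assuming a UTF8 database.

    -- json_to_string('"abc"')       => 'abc'
    -- json_to_string('"a\u0062c"')  => 'abc'    (\u0062 is simply 'b')
    -- json_to_string('"a\u0000c"')  => error    (text cannot contain an embedded NUL)
    -- json_to_string('"a\uABCDc"')  => error if the server encoding has no equivalent character

    -- The transcoding failure class already exists today, e.g. in a UTF8 database:
    SELECT convert_to('G clef: ' || chr(119070), 'LATIN1');
    -- ERROR: character with byte sequence 0xf0 0x9d 0x84 0x9e in encoding "UTF8"
    --        has no equivalent in encoding "LATIN1"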
At 2012-02-02 08:54:32 -0500, robertmhaas@gmail.com wrote: > > Someone is eventually going to propose a function with a name like > json_to_string() which, when given this JSON object, returns a > three-character string with the PostgreSQL text type. Ah, that's the bit I was missing. I thought you were talking about an immediate error condition. > That's useful and I support it. Agreed. Also, now I understand that you are saying that json_to_string() (json_string_to_text?) would fail if the result couldn't be represented as a text in the current encoding, and that's sensible as well. I had misunderstood "is going to admit of a number of error…" in your mail. As for surrogate pairs, just to be clear, what I was proposing earlier in the thread was to change json.c:json_lex_string() to detect errors (e.g. only one half of a surrogate pair, which is the commonest error I've encountered in the wild) and reject such strings. Thanks for the explanation. -- ams
On Tue, 2012-01-31 at 12:58 -0500, Andrew Dunstan wrote: > > On 01/30/2012 10:37 AM, Andrew Dunstan wrote: > > > > > >> Aside: is query_to_json really necessary? It seems rather ugly and > >> easily avoidable using row_to_json. > >> > > > > I started with this, again by analogy with query_to_xml(). But I agree > > it's a bit ugly. If we're not going to do it, then we definitely need > > to look at caching the output funcs in the function info. A closer > > approximation is actually: > > > > SELECT array_to_json(array_agg(q)) > > FROM ( your query here ) q; > > > > > > But then I'd want the ability to break that up a bit with line feeds, > > so we'd need to adjust the interface slightly. (Hint: don't try the > > above with "select * from pg_class".) > > > > > > I'll wait on further comments, but I can probably turn these changes > > around very quickly once we're agreed. > > > > > > > > based on Abhijit's feeling and some discussion offline, the consensus > seems to be to remove query_to_json. The only comment I have here is that query_to_json could have been replaced with json_agg, so that you don't need to do double-buffering for the results of the array(<yourquery>) call in SELECT array_to_json(array(<yourquery>)); Or is there some other way to avoid it except to wrap row_to_json() calls in your own aggregate function which adds enclosing brackets and comma separator (like this: '['<row1>[,<rowN>]']')? > cheers > > andrew >
On 04/16/2012 09:34 AM, Hannu Krosing wrote: >> based on Abhijit's feeling and some discussion offline, the consensus >> seems to be to remove query_to_json. > The only comment I have here is that query_to_json could have been > replaced with json_agg, so thet you don't need to do double-buffering > for the results of array(<yourquery>) call in > > SELECT array_to_json(array(<yourquery>)); > > Or is there some other way to avoid it except to wrap row_to_json() > calls in own aggregate function which adds enclosing brackets and comma > separator ( like this : '['<row1>[,<rowN>]']' ? > > The way I usually write this is: select array_to_json(array_agg(q)) from (<yourquery>) q; It's a pity you didn't make this comment back in January when we were talking about this. I think it's too late now in this release cycle to be talking about adding the aggregate function. cheers andrew
On Mon, Apr 16, 2012 at 9:10 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 04/16/2012 09:34 AM, Hannu Krosing wrote: >>> >>> based on Abhijit's feeling and some discussion offline, the consensus >>> seems to be to remove query_to_json. >> >> The only comment I have here is that query_to_json could have been >> replaced with json_agg, so thet you don't need to do double-buffering >> for the results of array(<yourquery>) call in >> >> SELECT array_to_json(array(<yourquery>)); >> >> Or is there some other way to avoid it except to wrap row_to_json() >> calls in own aggregate function which adds enclosing brackets and comma >> separator ( like this : '['<row1>[,<rowN>]']' ? >> >> > > The way I usually write this is: > > select array_to_json(array_agg(q)) > from (<yourquery>) q; > > It's a pity you didn't make this comment back in January when we were > talking about this. I think it's too late now in this release cycle to be > talking about adding the aggregate function. I find array_agg to be pretty consistently slower than array()...although not much, say around 5-10%. I use array_agg only when grouping. try timing select array_to_json(array_agg(v)) from (select v from generate_series(1,1000000) v) q; vs select array_to_json(array(select v from generate_series(1,1000000) v)); I agree with Hannu but as things stand if I'm trying to avoid the extra buffer I've found myself doing the final aggregation on the client -- it's easy enough. BTW, I'm using the json stuff heavily and it's just absolutely fantastic. Finally I can write web applications without wondering exactly where it was that computer science went off the rails. I've already demoed a prototype app that integrates pg directly with the many high quality js libraries out there and it makes things very easy and quick by making writing data services trivial. Data pushes are still quite a pain but I figure something can be worked out. merlin
On Mon, 2012-04-16 at 10:10 -0400, Andrew Dunstan wrote: > > On 04/16/2012 09:34 AM, Hannu Krosing wrote: > >> based on Abhijit's feeling and some discussion offline, the consensus > >> seems to be to remove query_to_json. > > The only comment I have here is that query_to_json could have been > > replaced with json_agg, so thet you don't need to do double-buffering > > for the results of array(<yourquery>) call in > > > > SELECT array_to_json(array(<yourquery>)); > > > > Or is there some other way to avoid it except to wrap row_to_json() > > calls in own aggregate function which adds enclosing brackets and comma > > separator ( like this : '['<row1>[,<rowN>]']' ? > > > > > > The way I usually write this is: > > select array_to_json(array_agg(q)) > from (<yourquery>) q; > > It's a pity you didn't make this comment back in January when we were > talking about this. I think it's too late now in this release cycle to > be talking about adding the aggregate function. My comment is not meant to propose changing anything in 9.2. I think what we have here is absolutely fantastic :) If doing something in 9.3 then what I would like is some way to express multiple queries. Basically a variant of query_to_json(query text[]) where queries would be evaluated in order and then all the results aggregated into one json object. But "aggregation on client" as suggested by Merlin may be a better way to do it for larger result(set)s. Especially as it could enable streaming of the resultsets without having to first buffer everything on the server. If we can add something, then perhaps a "deeper" pretty_print feature. Samples:

hannu=# \d test
                            Table "public.test"
 Column |            Type             |                     Modifiers
--------+-----------------------------+----------------------------------------------------
 id     | integer                     | not null default nextval('test_id_seq'::regclass)
 data   | text                        |
 tstamp | timestamp without time zone | default now()
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

hannu=# select array_to_json(array(select test from test),true);
-[ RECORD 1 ]-+----------------------------------------------------------------------------
array_to_json | [{"id":1,"data":"0.262814193032682","tstamp":"2012-04-05 13:21:03.235204"},
              |  {"id":2,"data":"0.157406373415142","tstamp":"2012-04-05 13:21:05.2033"}]

This is OK

hannu=# \d test2
                            Table "public.test2"
 Column |            Type             |                      Modifiers
--------+-----------------------------+-----------------------------------------------------
 id     | integer                     | not null default nextval('test2_id_seq'::regclass)
 data2  | test                        |
 tstamp | timestamp without time zone | default now()
Indexes:
    "test2_pkey" PRIMARY KEY, btree (id)

hannu=# select array_to_json(array(select test2 from test2),true);
-[ RECORD 1 ]-+-----------------------------------------------------------------------------------------------------------------------------------
array_to_json | [{"id":1,"data2":{"id":1,"data":"0.262814193032682","tstamp":"2012-04-05 13:21:03.235204"},"tstamp":"2012-04-05 13:25:03.644497"},
              |  {"id":2,"data2":{"id":2,"data":"0.157406373415142","tstamp":"2012-04-05 13:21:05.2033"},"tstamp":"2012-04-05 13:25:03.644497"}]

This is "kind of OK"

hannu=# \d test3
                            Table "public.test3"
 Column |            Type             |                      Modifiers
--------+-----------------------------+-----------------------------------------------------
 id     | integer                     | not null default nextval('test3_id_seq'::regclass)
 data3  | test2[]                     |
 tstamp | timestamp without time zone | default now()
Indexes:
    "test3_pkey" PRIMARY KEY, btree (id)

hannu=# select array_to_json(array(select test3 from test3),true);
-[ RECORD 1 ]-+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
array_to_json | [{"id":1,"data3":[{"id":1,"data2":{"id":1,"data":"0.262814193032682","tstamp":"2012-04-05 13:21:03.235204"},"tstamp":"2012-04-05 13:25:03.644497"},{"id":2,"data2":{"id":2,"data":"0.157406373415142","tstamp":"2012-04-05 13:21:05.2033"},"tstamp":"2012-04-05 13:25:03.644497"}],"tstamp":"2012-04-16 14:40:15.795947"}]

but this would be nicer if printed like python's pprint:

>>> pprint.pprint(d)
[{'data3': [{'data2': {'data': '0.262814193032682',
                       'id': 1,
                       'tstamp': '2012-04-05 13:21:03.235204'},
             'id': 1,
             'tstamp': '2012-04-05 13:25:03.644497'},
            {'data2': {'data': '0.157406373415142',
                       'id': 2,
                       'tstamp': '2012-04-05 13:21:05.2033'},
             'id': 2,
             'tstamp': '2012-04-05 13:25:03.644497'}],
  'id': 1,
  'tstamp': '2012-04-16 14:40:15.795947'}]

:D Again, I don't expect it anytime soon. What we will get in 9.2 is wonderful already. Cheers, Hannu -- ------- Hannu Krosing PostgreSQL Unlimited Scalability and Performance Consultant 2ndQuadrant Nordic PG Admin Book: http://www.2ndQuadrant.com/books/
On Mon, Apr 16, 2012 at 11:19 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > If doing something in 9.3 then what I would like is some way to express > multiple queries. Basically a variant of > > query_to_json(query text[]) > > where queries would be evaluated in order and then all the results > aggregated into on json object. I personally don't like variants of to_json that push the query in as text. They defeat parameterization and have other issues. Another point for client side processing is the new row level processing in libpq, so I'd argue that if the result is big enough to warrant worrying about buffering (and it'd have to be a mighty big json doc), the best bet is to extract it as rows. I'm playing around with node.js for the json serving and the sending code looks like this:

var first = true;
query.on('row', function(row) {
    if (first) {
        first = false;
        response.write('[');
    } else
        response.write(',');
    response.write(row.jsondata);
});
query.on('end', function() {
    response.write(']');
    response.end();
});

-- not too bad merlin