Thread: Re: Emitting JSON to file using COPY TO
On 11/29/23 10:32, Davin Shearer wrote:
> Thanks for the responses everyone.
>
> I worked around the issue using the `psql -tc` method as Filip described.
>
> I think it would be great to support writing JSON using COPY TO at
> some point so I can emit JSON to files using a PostgreSQL function directly.
>
> -Davin
>
> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org
> <mailto:filip@sedlakovi.org>> wrote:
>
>     This would be a very special case for COPY. It applies only to a single
>     column of JSON values. The original problem can be solved with psql
>     --tuples-only as David wrote earlier.
>
>     $ psql -tc 'select json_agg(row_to_json(t))
>                   from (select * from public.tbl_json_test) t;'
>
>     [{"id":1,"t_test":"here's a \"string\""}]
>
>     Special-casing any encoding/escaping scheme leads to bugs and harder
>     parsing.

(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard
objections I would like to get it up to speed for the January commitfest.

Currently the patch lacks documentation and regression test support.

Questions:
----------
1. Is supporting JSON array format sufficient, or does it need to support
some other options? How flexible does the support scheme need to be?

2. This only supports COPY TO and we would undoubtedly want to support
COPY FROM for JSON as well, but is that required from the start?

Thanks for any feedback.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
> I did a quick PoC patch (attached) -- if there interest and no hard
> objections I would like to get it up to speed for the January commitfest.
Cool. I would expect there to be interest, given all the other JSON
support that has been added thus far.
I noticed that, with the PoC patch, "json" is the only format that must be
quoted. Without quotes, I see a syntax error. I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.
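
Roughly, the difference looks like this (illustrative only -- the table
name and exact error text here are hypothetical, not from the patch):

8<-------------------------------------------------
-- unquoted, the keyword trips over the grammar
copy foo to stdout (format json);
ERROR:  syntax error at or near "json"

-- quoted, the PoC patch works as expected
copy foo to stdout (format 'json');
8<-------------------------------------------------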
> 1. Is supporting JSON array format sufficient, or does it need to support
> some other options? How flexible does the support scheme need to be?
I don't presently have a strong opinion on this one. My instinct would be
start with something simple, though. I don't think we offer any special
options for log_destination...
> 2. This only supports COPY TO and we would undoubtedly want to support COPY
> FROM for JSON as well, but is that required from the start?
I would vote for including COPY FROM support from the start.
> ! if (!cstate->opts.json_mode)
I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 12/1/23 22:00, Davin Shearer wrote:
> I'm really glad to see this taken up as a possible new feature and will
> definitely use it if it gets released. I'm impressed with how clean,
> understandable, and approachable the postgres codebase is in general and
> how easy it is to read and understand this patch.
>
> I reviewed the patch (though I didn't build and test the code) and have
> a concern with adding the '[' at the beginning and ']' at the end of the
> json output. Those are already added by `json_agg`
> (https://www.postgresql.org/docs/current/functions-aggregate.html
> <https://www.postgresql.org/docs/current/functions-aggregate.html>) as
> you can see in my initial email. Adding them in the COPY TO may be
> redundant (e.g., [[{"key":"value"...}....]]).

With this patch in place you don't use json_agg() at all. See the example
output (this is real output with the patch applied):

(oops -- I meant to send this with the same email as the patch)

8<-------------------------------------------------
create table foo(id int8, f1 text, f2 timestamptz);

insert into foo
  select g.i, 'line: ' || g.i::text, clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format 'json');
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-------------------------------------------------

> I think COPY TO makes good sense to support, though COPY FROM maybe not
> so much as JSON isn't necessarily flat and rectangular like CSV.

Yeah -- definitely not as straight forward but possibly we just support
the array-of-jsonobj-rows as input as well?

> For my use-case, I'm emitting JSON files to Apache NiFi for processing,
> and NiFi has superior handling of JSON (via JOLT parsers) versus CSV
> where parsing is generally done with regex. I want to be able to emit
> JSON using a postgres function and thus COPY TO.
>
> Definitely +1 for COPY TO.
>
> I don't think COPY FROM will work out well unless the JSON is required
> to be flat and rectangular. I would vote -1 to leave it out due to the
> necessary restrictions making it not generally useful.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/1/23 18:09, Nathan Bossart wrote:
> On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:
>> I did a quick PoC patch (attached) -- if there interest and no hard
>> objections I would like to get it up to speed for the January commitfest.
>
> Cool.  I would expect there to be interest, given all the other JSON
> support that has been added thus far.

Thanks for the review

> I noticed that, with the PoC patch, "json" is the only format that must be
> quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
> conflict with another json-related rule somewhere in gram.y, but I haven't
> tracked down exactly which one is causing it.

It seems to be because 'json' is also a type name
($$ = SystemTypeName("json")).

What do you think about using 'json_array' instead? It is more specific
and accurate, and avoids the need to quote.

test=# copy foo to stdout (format json_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

>> 1. Is supporting JSON array format sufficient, or does it need to support
>> some other options? How flexible does the support scheme need to be?
>
> I don't presently have a strong opinion on this one.  My instinct would be
> start with something simple, though.  I don't think we offer any special
> options for log_destination...

WFM

>> 2. This only supports COPY TO and we would undoubtedly want to support COPY
>> FROM for JSON as well, but is that required from the start?
>
> I would vote for including COPY FROM support from the start.

Check. My thought is to only accept the same format we emit -- i.e. only
take a json array.

>> ! if (!cstate->opts.json_mode)
>
> I think it's unfortunate that this further complicates the branching in
> CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
> worth refactoring a bunch of stuff to make this nicer.

Yeah that was my conclusion.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes:
>> I noticed that, with the PoC patch, "json" is the only format that must be
>> quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
>> conflict with another json-related rule somewhere in gram.y, but I haven't
>> tracked down exactly which one is causing it.

While I've not looked too closely, I suspect this might be due to
the FORMAT_LA hack in base_yylex:

    /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
    switch (next_token)
    {
        case JSON:
            cur_token = FORMAT_LA;
            break;
    }

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.

(I spent a little bit of time last week trying to get rid of FORMAT_LA,
thinking that it didn't look necessary.  Did not succeed yet.)

        regards, tom lane
On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
> 1. Is supporting JSON array format sufficient, or does it need to
> support some other options? How flexible does the support scheme need to be?

"JSON Lines" is a semi-standard format [1] that's basically just
newline-separated JSON values. (In fact, this is what
log_destination=jsonlog gives you for Postgres logs, no?) It might be
worthwhile to support that, too.

[1]: https://jsonlines.org/
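
For reference, newline-delimited output for the foo example shown earlier
would look something like this (illustrative only -- the PoC as posted
emits just the array form at this point):

8<-------------------------------------------------
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
8<-------------------------------------------------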
On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
> So if you are writing a production that might need to match
> FORMAT followed by JSON, you need to match FORMAT_LA too.

Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
 				{
 					$$ = makeDefElem($1, $2, @1);
 				}
+			| FORMAT_LA copy_generic_opt_arg
+				{
+					$$ = makeDefElem("format", $2, @1);
+				}
 		;
 
 copy_generic_opt_arg:

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 12/2/23 16:53, Nathan Bossart wrote:
> On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
>> So if you are writing a production that might need to match
>> FORMAT followed by JSON, you need to match FORMAT_LA too.
>
> Thanks for the pointer.  That does seem to be the culprit.
>
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index d631ac89a9..048494dd07 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
>  				{
>  					$$ = makeDefElem($1, $2, @1);
>  				}
> +			| FORMAT_LA copy_generic_opt_arg
> +				{
> +					$$ = makeDefElem("format", $2, @1);
> +				}
>  		;
>
>  copy_generic_opt_arg:

Yep -- I concluded the same. Thanks Tom!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/2/23 13:50, Maciek Sakrejda wrote:
> On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
>> 1. Is supporting JSON array format sufficient, or does it need to
>> support some other options? How flexible does the support scheme need to be?
>
> "JSON Lines" is a semi-standard format [1] that's basically just
> newline-separated JSON values. (In fact, this is what
> log_destination=jsonlog gives you for Postgres logs, no?) It might be
> worthwhile to support that, too.
>
> [1]: https://jsonlines.org/

Yes, I have seen examples of that associated with other databases
(MSSQL and Duckdb at least) as well. It probably makes sense to
support that format too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-02 Sa 17:43, Joe Conway wrote:
> On 12/2/23 13:50, Maciek Sakrejda wrote:
>> On Fri, Dec 1, 2023 at 11:32 AM Joe Conway <mail@joeconway.com> wrote:
>>> 1. Is supporting JSON array format sufficient, or does it need to
>>> support some other options? How flexible does the support scheme
>>> need to be?
>>
>> "JSON Lines" is a semi-standard format [1] that's basically just
>> newline-separated JSON values. (In fact, this is what
>> log_destination=jsonlog gives you for Postgres logs, no?) It might be
>> worthwhile to support that, too.
>>
>> [1]: https://jsonlines.org/
>
> Yes, I have seen examples of that associated with other databases
> (MSSQL and Duckdb at least) as well. It probably makes sense to
> support that format too.

You can do that today, e.g.

    copy (select to_json(q) from table_or_query q) to stdout

You can also do it as a single document as proposed here, like this:

    copy (select json_agg(q) from table_or_query q) to stdout

The only downside to that is that it has to construct the aggregate,
which could be ugly for large datasets, and that's why I'm not opposed
to this patch.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/2/23 17:37, Joe Conway wrote:
> On 12/2/23 16:53, Nathan Bossart wrote:
>> On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:
>>> So if you are writing a production that might need to match
>>> FORMAT followed by JSON, you need to match FORMAT_LA too.
>>
>> Thanks for the pointer.  That does seem to be the culprit.
>>
>> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
>> index d631ac89a9..048494dd07 100644
>> --- a/src/backend/parser/gram.y
>> +++ b/src/backend/parser/gram.y
>> @@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
>>  				{
>>  					$$ = makeDefElem($1, $2, @1);
>>  				}
>> +			| FORMAT_LA copy_generic_opt_arg
>> +				{
>> +					$$ = makeDefElem("format", $2, @1);
>> +				}
>>  		;
>>
>>  copy_generic_opt_arg:
>
> Yep -- I concluded the same. Thanks Tom!

The attached implements the above repair, as well as adding support for
array decoration (or not) and/or comma row delimiters when not an array.
This covers the three variations of json import/export formats that I
have found after light searching (SQL Server and DuckDB).

Still lacks documentation, tests, and COPY FROM support, but here is what
it looks like in a nutshell:

8<-----------------------------------------------
create table foo(id int8, f1 text, f2 timestamptz);

insert into foo
  select g.i, 'line: ' || g.i::text, clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode

copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<-----------------------------------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-01 Fr 14:28, Joe Conway wrote:
> On 11/29/23 10:32, Davin Shearer wrote:
>> Thanks for the responses everyone.
>>
>> I worked around the issue using the `psql -tc` method as Filip
>> described.
>>
>> I think it would be great to support writing JSON using COPY TO at
>> some point so I can emit JSON to files using a PostgreSQL function
>> directly.
>>
>> -Davin
>>
>> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org
>> <mailto:filip@sedlakovi.org>> wrote:
>>
>> This would be a very special case for COPY. It applies only to a
>> single
>> column of JSON values. The original problem can be solved with psql
>> --tuples-only as David wrote earlier.
>>
>>
>> $ psql -tc 'select json_agg(row_to_json(t))
>> from (select * from public.tbl_json_test) t;'
>>
>> [{"id":1,"t_test":"here's a \"string\""}]
>>
>>
>> Special-casing any encoding/escaping scheme leads to bugs and harder
>> parsing.
>
> (moved to hackers)
>
> I did a quick PoC patch (attached) -- if there interest and no hard
> objections I would like to get it up to speed for the January commitfest.
>
> Currently the patch lacks documentation and regression test support.
>
> Questions:
> ----------
> 1. Is supporting JSON array format sufficient, or does it need to
> support some other options? How flexible does the support scheme need
> to be?
>
> 2. This only supports COPY TO and we would undoubtedly want to support
> COPY FROM for JSON as well, but is that required from the start?
>
> Thanks for any feedback.
I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.
I think JSON array format is sufficient.
I can see both sides of the COPY FROM argument, but I think insisting on
that makes this less doable for release 17. On balance I would stick to
COPY TO for now.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/3/23 10:10, Andrew Dunstan wrote:
>
> On 2023-12-01 Fr 14:28, Joe Conway wrote:
>> On 11/29/23 10:32, Davin Shearer wrote:
>>> Thanks for the responses everyone.
>>>
>>> I worked around the issue using the `psql -tc` method as Filip
>>> described.
>>>
>>> I think it would be great to support writing JSON using COPY TO at
>>> some point so I can emit JSON to files using a PostgreSQL function
>>> directly.
>>>
>>> -Davin
>>>
>>> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <filip@sedlakovi.org
>>> <mailto:filip@sedlakovi.org>> wrote:
>>>
>>>     This would be a very special case for COPY. It applies only to a
>>>     single
>>>     column of JSON values. The original problem can be solved with psql
>>>     --tuples-only as David wrote earlier.
>>>
>>>     $ psql -tc 'select json_agg(row_to_json(t))
>>>                   from (select * from public.tbl_json_test) t;'
>>>
>>>     [{"id":1,"t_test":"here's a \"string\""}]
>>>
>>>     Special-casing any encoding/escaping scheme leads to bugs and harder
>>>     parsing.
>>
>> (moved to hackers)
>>
>> I did a quick PoC patch (attached) -- if there interest and no hard
>> objections I would like to get it up to speed for the January commitfest.
>>
>> Currently the patch lacks documentation and regression test support.
>>
>> Questions:
>> ----------
>> 1. Is supporting JSON array format sufficient, or does it need to
>> support some other options? How flexible does the support scheme need
>> to be?
>>
>> 2. This only supports COPY TO and we would undoubtedly want to support
>> COPY FROM for JSON as well, but is that required from the start?
>>
>> Thanks for any feedback.
>
> I realize this is just a POC, but I'd prefer to see composite_to_json()
> not exposed. You could use the already public datum_to_json() instead,
> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
> arguments.

Ok, thanks, will do

> I think JSON array format is sufficient.

The other formats make sense from a completeness standpoint (versus other
databases) and the latest patch already includes them, so I still lean
toward supporting all three formats.

> I can see both sides of the COPY FROM argument, but I think insisting on
> that makes this less doable for release 17. On balance I would stick to
> COPY TO for now.

WFM.

From your earlier post, regarding constructing the aggregate -- not
extensive testing but one data point:

8<--------------------------
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--------------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/3/23 11:03, Joe Conway wrote:
> From your earlier post, regarding constructing the aggregate -- not
> extensive testing but one data point:
> 8<--------------------------
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 36353.153 ms (00:36.353)
> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
> COPY 1
> Time: 46835.238 ms (00:46.835)
> 8<--------------------------

Also if the table is large enough, the aggregate method is not even
feasible whereas the COPY TO method works:

8<--------------------------
test=# select count(*) from foo;
  count
----------
 20000000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 more bytes.

test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 20000000
8<--------------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/3/23 11:03, Joe Conway wrote:
> On 12/3/23 10:10, Andrew Dunstan wrote:
>> I realize this is just a POC, but I'd prefer to see composite_to_json()
>> not exposed. You could use the already public datum_to_json() instead,
>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>> arguments.
>
> Ok, thanks, will do

Just FYI, this change does lose some performance in my not massively
scientific A/B/A test:

8<---------------------------
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 10000000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---------------------------

That is somewhere in the 3% range.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-03 Su 12:11, Joe Conway wrote:
> On 12/3/23 11:03, Joe Conway wrote:
>> From your earlier post, regarding constructing the aggregate -- not
>> extensive testing but one data point:
>> 8<--------------------------
>> test=# copy foo to '/tmp/buf' (format json, force_array);
>> COPY 10000000
>> Time: 36353.153 ms (00:36.353)
>> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
>> COPY 1
>> Time: 46835.238 ms (00:46.835)
>> 8<--------------------------
>
> Also if the table is large enough, the aggregate method is not even
> feasible whereas the COPY TO method works:
> 8<--------------------------
> test=# select count(*) from foo;
>   count
> ----------
>  20000000
> (1 row)
>
> test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
> ERROR:  out of memory
> DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1
> more bytes.
>
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 20000000
> 8<--------------------------

None of this is surprising. As I mentioned, limitations with json_agg()
are why I support the idea of this patch.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2023-12-03 Su 14:24, Joe Conway wrote:
> On 12/3/23 11:03, Joe Conway wrote:
>> On 12/3/23 10:10, Andrew Dunstan wrote:
>>> I realize this is just a POC, but I'd prefer to see
>>> composite_to_json()
>>> not exposed. You could use the already public datum_to_json() instead,
>>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>>> arguments.
>>
>> Ok, thanks, will do
>
> Just FYI, this change does loose some performance in my not massively
> scientific A/B/A test:
>
> 8<---------------------------
> -- with datum_to_json()
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 37196.898 ms (00:37.197)
> Time: 37408.161 ms (00:37.408)
> Time: 38393.309 ms (00:38.393)
> Time: 36855.438 ms (00:36.855)
> Time: 37806.280 ms (00:37.806)
>
> Avg = 37532
>
> -- original patch
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 37426.207 ms (00:37.426)
> Time: 36068.187 ms (00:36.068)
> Time: 38285.252 ms (00:38.285)
> Time: 36971.042 ms (00:36.971)
> Time: 35690.822 ms (00:35.691)
>
> Avg = 36888
>
> -- with datum_to_json()
> test=# \timing
> Timing is on.
> test=# copy foo to '/tmp/buf' (format json, force_array);
> COPY 10000000
> Time: 39083.467 ms (00:39.083)
> Time: 37249.326 ms (00:37.249)
> Time: 38529.721 ms (00:38.530)
> Time: 38704.920 ms (00:38.705)
> Time: 39001.326 ms (00:39.001)
>
> Avg = 38513
> 8<---------------------------
>
> That is somewhere in the 3% range.

I assume it's because datum_to_json() constructs a text value from which
you then need to extract the cstring, whereas composite_to_json() just
gives you back the stringinfo. I guess that's a good enough reason to go
with exposing composite_to_json().

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/3/23 14:52, Andrew Dunstan wrote:
>
> On 2023-12-03 Su 14:24, Joe Conway wrote:
>> On 12/3/23 11:03, Joe Conway wrote:
>>> On 12/3/23 10:10, Andrew Dunstan wrote:
>>>> I realize this is just a POC, but I'd prefer to see
>>>> composite_to_json()
>>>> not exposed. You could use the already public datum_to_json() instead,
>>>> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
>>>> arguments.
>>>
>>> Ok, thanks, will do
>>
>> Just FYI, this change does loose some performance in my not massively
>> scientific A/B/A test:
>>
>> 8<---------------------------
<snip>
>> 8<---------------------------
>>
>> That is somewhere in the 3% range.
>
> I assume it's because datum_to_json() constructs a text value from which
> you then need to extract the cstring, whereas composite_to_json(), just
> gives you back the stringinfo. I guess that's a good enough reason to go
> with exposing composite_to_json().

Yeah, that was why I went that route in the first place. If you are good
with it I will go back to that. The code is a bit simpler too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/3/23 10:31, Davin Shearer wrote:
> Please be sure to include single and double quotes in the test values
> since that was the original problem (double quoting in COPY TO breaking
> the JSON syntax).
test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it:
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it:
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it:
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it:
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
> " being quoted as \\" breaks the JSON. It needs to be \". This has been
> my whole problem with COPY TO for JSON.
>
> Please validate that the output is in proper format with correct quoting
> for special characters. I use `jq` on the command line to validate and
> format the output.

I just hooked existing "row-to-json machinery" up to the "COPY TO"
statement. If the output is wrong (just for this use case?), that would
be a missing feature (or possibly a bug?).

Davin -- how did you work around the issue with the way the built in
functions output JSON?

Andrew -- comments/thoughts?

Joe

> On Sun, Dec 3, 2023, 10:51 Joe Conway <mail@joeconway.com
> <mailto:mail@joeconway.com>> wrote:
>
>     On 12/3/23 10:31, Davin Shearer wrote:
>      > Please be sure to include single and double quotes in the test
>     values
>      > since that was the original problem (double quoting in COPY TO
>     breaking
>      > the JSON syntax).
>
>     test=# copy (select * from foo limit 4) to stdout (format json);
>     {"id":2456092,"f1":"line with ' in it:
>     2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
>     {"id":2456093,"f1":"line with \\" in it:
>     2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
>     {"id":2456094,"f1":"line with ' in it:
>     2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
>     {"id":2456095,"f1":"line with \\" in it:
>     2456095","f2":"2023-12-03T10:44:40.971228-05:00"}
>
>     --
>     Joe Conway
>     PostgreSQL Contributors Team
>     RDS Open Source Databases
>     Amazon Web Services: https://aws.amazon.com <https://aws.amazon.com>

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-03 Su 20:14, Joe Conway wrote:
> (please don't top quote on the Postgres lists)
>
> On 12/3/23 17:38, Davin Shearer wrote:
>> " being quoted as \\" breaks the JSON. It needs to be \". This has
>> been my whole problem with COPY TO for JSON.
>>
>> Please validate that the output is in proper format with correct
>> quoting for special characters. I use `jq` on the command line to
>> validate and format the output.
>
> I just hooked existing "row-to-json machinery" up to the "COPY TO"
> statement. If the output is wrong (just for for this use case?), that
> would be a missing feature (or possibly a bug?).
>
> Davin -- how did you work around the issue with the way the built in
> functions output JSON?
>
> Andrew -- comments/thoughts?

I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/4/23 07:41, Andrew Dunstan wrote:
>
> On 2023-12-03 Su 20:14, Joe Conway wrote:
>> (please don't top quote on the Postgres lists)
>>
>> On 12/3/23 17:38, Davin Shearer wrote:
>>> " being quoted as \\" breaks the JSON. It needs to be \". This has
>>> been my whole problem with COPY TO for JSON.
>>>
>>> Please validate that the output is in proper format with correct
>>> quoting for special characters. I use `jq` on the command line to
>>> validate and format the output.
>>
>> I just hooked existing "row-to-json machinery" up to the "COPY TO"
>> statement. If the output is wrong (just for for this use case?), that
>> would be a missing feature (or possibly a bug?).
>>
>> Davin -- how did you work around the issue with the way the built in
>> functions output JSON?
>>
>> Andrew -- comments/thoughts?
>
> I meant to mention this when I was making comments yesterday.
>
> The patch should not be using CopyAttributeOutText - it will try to
> escape characters such as \, which produces the effect complained of
> here, or else we need to change its setup so we have a way to inhibit
> that escaping.

Interesting.

I am surprised this has never been raised as a problem with COPY TO
before.

Should the JSON output, as produced by composite_to_json(), be sent as-is
with no escaping at all? If yes, is JSON somehow unique in this regard?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-04 Mo 08:37, Joe Conway wrote:
> On 12/4/23 07:41, Andrew Dunstan wrote:
>>
>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>>> (please don't top quote on the Postgres lists)
>>>
>>> On 12/3/23 17:38, Davin Shearer wrote:
>>>> " being quoted as \\" breaks the JSON. It needs to be \". This has
>>>> been my whole problem with COPY TO for JSON.
>>>>
>>>> Please validate that the output is in proper format with correct
>>>> quoting for special characters. I use `jq` on the command line to
>>>> validate and format the output.
>>>
>>> I just hooked existing "row-to-json machinery" up to the "COPY TO"
>>> statement. If the output is wrong (just for for this use case?),
>>> that would be a missing feature (or possibly a bug?).
>>>
>>> Davin -- how did you work around the issue with the way the built in
>>> functions output JSON?
>>>
>>> Andrew -- comments/thoughts?
>>
>> I meant to mention this when I was making comments yesterday.
>>
>> The patch should not be using CopyAttributeOutText - it will try to
>> escape characters such as \, which produces the effect complained of
>> here, or else we need to change its setup so we have a way to inhibit
>> that escaping.
>
> Interesting.
>
> I am surprised this has never been raised as a problem with COPY TO
> before.
>
> Should the JSON output, as produced by composite_to_json(), be sent
> as-is with no escaping at all? If yes, is JSON somehow unique in this
> regard?

Text mode output is in such a form that it can be read back in using
text mode input. There's nothing special about JSON in this respect -
any text field will be escaped too. But output suitable for text mode
input is not what you're trying to produce here; you're trying to
produce valid JSON.

So, yes, the result of composite_to_json, which is already suitably
escaped, should not be further escaped in this case.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
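
A minimal way to see that interaction, independent of the patch (the query
here is hypothetical, but the doubled backslash matches the broken output
posted upthread):

8<-------------------------------------------------
-- row_to_json() emits valid JSON: {"f1":"a\"b"}
-- but text-format COPY escapes the backslash again on the way out:
test=# copy (select row_to_json(t) from (select 'a"b' as f1) t) to stdout;
{"f1":"a\\"b"}
8<-------------------------------------------------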
On 12/4/23 09:25, Andrew Dunstan wrote:
>
> On 2023-12-04 Mo 08:37, Joe Conway wrote:
>> On 12/4/23 07:41, Andrew Dunstan wrote:
>>>
>>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>>>> (please don't top quote on the Postgres lists)
>>>>
>>>> On 12/3/23 17:38, Davin Shearer wrote:
>>>>> " being quoted as \\" breaks the JSON. It needs to be \". This has
>>>>> been my whole problem with COPY TO for JSON.
>>>>>
>>>>> Please validate that the output is in proper format with correct
>>>>> quoting for special characters. I use `jq` on the command line to
>>>>> validate and format the output.
>>>>
>>>> I just hooked existing "row-to-json machinery" up to the "COPY TO"
>>>> statement. If the output is wrong (just for for this use case?),
>>>> that would be a missing feature (or possibly a bug?).
>>>>
>>>> Davin -- how did you work around the issue with the way the built in
>>>> functions output JSON?
>>>>
>>>> Andrew -- comments/thoughts?
>>>
>>> I meant to mention this when I was making comments yesterday.
>>>
>>> The patch should not be using CopyAttributeOutText - it will try to
>>> escape characters such as \, which produces the effect complained of
>>> here, or else we need to change its setup so we have a way to inhibit
>>> that escaping.
>>
>>
>> Interesting.
>>
>> I am surprised this has never been raised as a problem with COPY TO
>> before.
>>
>> Should the JSON output, as produced by composite_to_json(), be sent
>> as-is with no escaping at all? If yes, is JSON somehow unique in this
>> regard?
>
>
> Text mode output is in such a form that it can be read back in using
> text mode input. There's nothing special about JSON in this respect -
> any text field will be escaped too. But output suitable for text mode
> input is not what you're trying to produce here; you're trying to
> produce valid JSON.
>
> So, yes, the result of composite_to_json, which is already suitably
> escaped, should not be further escaped in this case.
Gotcha.
This patch version uses CopySendData() instead and includes
documentation changes. Still lacks regression tests.
Hopefully this looks better. Any other particular strings I ought to
test with?
8<------------------
test=# copy (select * from foo limit 4) to stdout (format json,
force_array true);
[
{"id":1,"f1":"line with \" in it:
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it:
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it:
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it:
4","f2":"2023-12-03T12:26:41.596182-05:00"}
]
8<------------------
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Looking great!

For testing, in addition to the quotes, include DOS and Unix EOL, \ and /,
Byte Order Markers, and multibyte characters like UTF-8. Essentially
anything considered textual is fair game to be a value.
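
A hypothetical set of test values along those lines (table name and values
are purely illustrative):

8<-------------------------------------------------
create temp table copytest_extra (f1 text);

insert into copytest_extra (f1) values
  (E'dos line\r\nend'),     -- DOS EOL
  (E'unix line\nend'),      -- Unix EOL
  ('back \ slash'),         -- literal backslash
  ('forward / slash'),      -- forward slash
  (E'caf\u00e9 \u043c\u0438\u0440 \u4e16\u754c');   -- multibyte UTF-8
8<-------------------------------------------------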
Joe already asked you to avoid top-posting on PostgreSQL lists. See <http://idallen.com/topposting.html> for an explanation.
We don't process BOMs elsewhere, and probably should not here either. They are in fact neither required nor recommended for use with UTF8 data, AIUI. See a recent discussion on this list on that topic: <https://www.postgresql.org/message-id/flat/81ca2b25-6b3a-499a-9a09-2dd21253c2cb%40unitrunker.net>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
"
(double quote)\
(backslash)/
(forward slash)\b
(backspace)\f
(form feed)\n
(new line)\r
(carriage return)\t
(horizontal tab)
These characters should be represented in the test cases to see how the escaping behaves and to ensure that the escaping is done properly per JSON requirements. Forward slash comes as a bit of a surprise to me, but `jq` handles it either way:
➜ echo '{"key": "this / is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}
➜ echo '{"key": "this \/ is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}
On 12/4/23 17:55, Davin Shearer wrote:
> Sorry about the top posting / top quoting... the link you sent me gives
> me a 404. I'm not exactly sure what top quoting / posting means and
> Googling those terms wasn't helpful for me, but I've removed the quoting
> that my mail client is automatically "helpfully" adding to my emails. I
> mean no offense.

No offense taken. But it is worthwhile to conform to the very long
established norms of the mailing lists on which you participate.

See: https://en.wikipedia.org/wiki/Posting_style

I would describe the Postgres list style (based on that link) as "inline
replying, in which the different parts of the reply follow the relevant
parts of the original post...[with]...trimming of the original text"

> There are however a few characters that need to be escaped
>
>  1. " (double quote)
>  2. \ (backslash)
>  3. / (forward slash)
>  4. \b (backspace)
>  5. \f (form feed)
>  6. \n (new line)
>  7. \r (carriage return)
>  8. \t (horizontal tab)
>
> These characters should be represented in the test cases to see how the
> escaping behaves and to ensure that the escaping is done properly per
> JSON requirements.

I can look at adding these as test cases.

The latest version of the patch (attached) includes some of that already.
For reference, the tests so far include this:

8<-------------------------------
test=# select * from copytest;
  style  |   test   | filler
---------+----------+--------
 DOS     | abc\r   +|      1
         | def      |
 Unix    | abc     +|      2
         | def      |
 Mac     | abc\rdef |      3
 esc\ape | a\r\\r\ +|      4
         | \nb      |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<-------------------------------

At this point "COPY TO" should be sending exactly the unaltered output of
the postgres JSON processing functions.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-04 Mo 17:55, Davin Shearer wrote:
> Sorry about the top posting / top quoting... the link you sent me
> gives me a 404. I'm not exactly sure what top quoting / posting means
> and Googling those terms wasn't helpful for me, but I've removed the
> quoting that my mail client is automatically "helpfully" adding to my
> emails. I mean no offense.

Hmm. Luckily the Wayback Machine has a copy:

<http://web.archive.org/web/20230608210806/idallen.com/topposting.html>

Maybe I'll put a copy in the developer wiki.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/4/23 21:54, Joe Conway wrote:
> On 12/4/23 17:55, Davin Shearer wrote:
>> There are however a few characters that need to be escaped
>>
>>  1. " (double quote)
>>  2. \ (backslash)
>>  3. / (forward slash)
>>  4. \b (backspace)
>>  5. \f (form feed)
>>  6. \n (new line)
>>  7. \r (carriage return)
>>  8. \t (horizontal tab)
>>
>> These characters should be represented in the test cases to see how the
>> escaping behaves and to ensure that the escaping is done properly per
>> JSON requirements.
>
> I can look at adding these as test cases.

So I did a quick check:

8<--------------------------
with t(f1) as
(
  values
   (E'aaa\"bbb'::text),
   (E'aaa\\bbb'::text),
   (E'aaa\/bbb'::text),
   (E'aaa\bbbb'::text),
   (E'aaa\fbbb'::text),
   (E'aaa\nbbb'::text),
   (E'aaa\rbbb'::text),
   (E'aaa\tbbb'::text)
)
select length(t.f1), t.f1, row_to_json(t) from t;

 length |     f1      |    row_to_json
--------+-------------+-------------------
      7 | aaa"bbb     | {"f1":"aaa\"bbb"}
      7 | aaa\bbb     | {"f1":"aaa\\bbb"}
      7 | aaa/bbb     | {"f1":"aaa/bbb"}
      7 | aaa\x08bbb  | {"f1":"aaa\bbbb"}
      7 | aaa\x0Cbbb  | {"f1":"aaa\fbbb"}
      7 | aaa        +| {"f1":"aaa\nbbb"}
        | bbb         |
      7 | aaa\rbbb    | {"f1":"aaa\rbbb"}
      7 | aaa     bbb | {"f1":"aaa\tbbb"}
(8 rows)
8<--------------------------

This is all independent of my patch for COPY TO.

If I am reading that correctly, everything matches Davin's table *except*
the forward slash ("/"). I defer to the experts on the thread to debate
that...

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/5/23 12:43, Davin Shearer wrote:
> Joe, those test cases look great and the outputs are the same as `jq`.

<link to info regarding escaping of forward slashes>

> Forward slash escaping is optional, so not escaping them in Postgres is
> okay. The important thing is that the software _reading_ JSON
> interprets both '\/' and '/' as '/'.

Thanks for the review and info. I modified the existing regression test
thus:

8<--------------------------
create temp table copyjsontest (
  id bigserial,
  f1 text,
  f2 timestamptz);

insert into copyjsontest
  select g.i,
         CASE WHEN g.i % 2 = 0 THEN
           'line with '' in it: ' || g.i::text
         ELSE
           'line with " in it: ' || g.i::text
         END,
         'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);

copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\bbbb","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--------------------------

I think the code, documentation, and tests are in pretty good shape at
this point. Latest version attached.

Any other comments or complaints out there?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-05 Tu 14:50, Davin Shearer wrote:
> Hi Joe,
>
> In reviewing the 005 patch, I think that when used with FORCE ARRAY,
> we should also _imply_ FORCE ROW DELIMITER. I can't envision a use
> case where someone would want to use FORCE ARRAY without also using
> FORCE ROW DELIMITER. I can, however, envision a use case where
> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe
> including into a larger array. I definitely appreciate these options
> and the flexibility that they afford from a user perspective.
>
> In the test output, will you also show the different variations with
> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false),
> (false, true), (true, true)}? Technically you've already shown me the
> (false, false) case as those are the defaults.

I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/5/23 15:55, Andrew Dunstan wrote:
>
> On 2023-12-05 Tu 14:50, Davin Shearer wrote:
>> Hi Joe,
>>
>> In reviewing the 005 patch, I think that when used with FORCE ARRAY,
>> we should also _imply_ FORCE ROW DELIMITER. I can't envision a use
>> case where someone would want to use FORCE ARRAY without also using
>> FORCE ROW DELIMITER. I can, however, envision a use case where
>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe
>> including into a larger array. I definitely appreciate these options
>> and the flexibility that they afford from a user perspective.
>>
>> In the test output, will you also show the different variations with
>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false),
>> (false, true), (true, true)}? Technically you've already shown me the
>> (false, false) case as those are the defaults.
>
> I don't understand the point of FORCE_ROW_DELIMITER at all. There is
> only one legal delimiter of array items in JSON, and that's a comma.
> There's no alternative and it's not optional. So in the array case you
> MUST have commas and in any other case (e.g. LINES) I can't see why you
> would have them.

The current patch already *does* imply row delimiters in the array case.

It says so here:
8<---------------------------
+   <varlistentry>
+    <term><literal>FORCE_ARRAY</literal></term>
+    <listitem>
+     <para>
+      Force output of array decorations at the beginning and end of output.
+      This option implies the <literal>FORCE_ROW_DELIMITER</literal>
+      option. It is allowed only in <command>COPY TO</command>, and only
+      when using <literal>JSON</literal> format.
+      The default is <literal>false</literal>.
+     </para>
8<---------------------------

and it does so here:
8<---------------------------
+   if (opts_out->force_array)
+       opts_out->force_row_delimiter = true;
8<---------------------------

and it shows that here:
8<---------------------------
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---------------------------

It also does not allow explicitly setting row delimiters false while
force_array is true here:
8<---------------------------
+   if (opts_out->force_array &&
+       force_row_delimiter_specified &&
+       !opts_out->force_row_delimiter)
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("cannot specify FORCE_ROW_DELIMITER false with FORCE_ARRAY true")));
8<---------------------------

Am I understanding something incorrectly?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/5/23 16:02, Joe Conway wrote:
> On 12/5/23 15:55, Andrew Dunstan wrote:
>> and in any other case (e.g. LINES) I can't see why you
>> would have them.

Oh I didn't address this -- I saw examples in the interwebs of MSSQL
server I think [1] which had the non-array with commas import and export
style. It was not that tough to support and the code as written already
does it, so why not?

[1] https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 2023-12-05 Tu 16:02, Joe Conway wrote:
> On 12/5/23 15:55, Andrew Dunstan wrote:
>>
>> On 2023-12-05 Tu 14:50, Davin Shearer wrote:
>>> Hi Joe,
>>>
>>> In reviewing the 005 patch, I think that when used with FORCE ARRAY,
>>> we should also _imply_ FORCE ROW DELIMITER. I can't envision a use
>>> case where someone would want to use FORCE ARRAY without also using
>>> FORCE ROW DELIMITER. I can, however, envision a use case where
>>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like
>>> maybe including into a larger array. I definitely appreciate these
>>> options and the flexibility that they afford from a user perspective.
>>>
>>> In the test output, will you also show the different variations with
>>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true,
>>> false), (false, true), (true, true)}? Technically you've already
>>> shown me the (false, false) case as those are the defaults.
>>
>> I don't understand the point of FORCE_ROW_DELIMITER at all. There is
>> only one legal delimiter of array items in JSON, and that's a comma.
>> There's no alternative and it's not optional. So in the array case you
>> MUST have commas and in any other case (e.g. LINES) I can't see why you
>> would have them.
>
> The current patch already *does* imply row delimiters in the array
> case. It says so here:
> 8<---------------------------
> +   <varlistentry>
> +    <term><literal>FORCE_ARRAY</literal></term>
> +    <listitem>
> +     <para>
> +      Force output of array decorations at the beginning and end of output.
> +      This option implies the <literal>FORCE_ROW_DELIMITER</literal>
> +      option. It is allowed only in <command>COPY TO</command>, and only
> +      when using <literal>JSON</literal> format.
> +      The default is <literal>false</literal>.
> +     </para>
> 8<---------------------------
>
> and it does so here:
> 8<---------------------------
> +   if (opts_out->force_array)
> +       opts_out->force_row_delimiter = true;
> 8<---------------------------
>
> and it shows that here:
> 8<---------------------------
> + copy copytest to stdout (format json, force_array);
> + [
> +  {"style":"DOS","test":"abc\r\ndef","filler":1}
> + ,{"style":"Unix","test":"abc\ndef","filler":2}
> + ,{"style":"Mac","test":"abc\rdef","filler":3}
> + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
> + ]
> 8<---------------------------
>
> It also does not allow explicitly setting row delimiters false while
> force_array is true here:
> 8<---------------------------
> +   if (opts_out->force_array &&
> +       force_row_delimiter_specified &&
> +       !opts_out->force_row_delimiter)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("cannot specify FORCE_ROW_DELIMITER false with FORCE_ARRAY true")));
> 8<---------------------------
>
> Am I understanding something incorrectly?

But what's the point of having it if you're not using FORCE_ARRAY?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 12/5/23 16:12, Andrew Dunstan wrote: > > On 2023-12-05 Tu 16:02, Joe Conway wrote: >> On 12/5/23 15:55, Andrew Dunstan wrote: >>> >>> On 2023-12-05 Tu 14:50, Davin Shearer wrote: >>>> Hi Joe, >>>> >>>> In reviewing the 005 patch, I think that when used with FORCE ARRAY, >>>> we should also _imply_ FORCE ROW DELIMITER. I can't envision a use >>>> case where someone would want to use FORCE ARRAY without also using >>>> FORCE ROW DELIMITER. I can, however, envision a use case where >>>> someone would want FORCE ROW DELIMITER without FORCE ARRAY, like >>>> maybe including into a larger array. I definitely appreciate these >>>> options and the flexibility that they afford from a user perspective. >>>> >>>> In the test output, will you also show the different variations with >>>> FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, >>>> false), (false, true), (true, true)}? Technically you've already >>>> shown me the (false, false) case as those are the defaults. >>>> >>>> >>> >>> I don't understand the point of FORCE_ROW_DELIMITER at all. There is >>> only one legal delimiter of array items in JSON, and that's a comma. >>> There's no alternative and it's not optional. So in the array case you >>> MUST have commas and in any other case (e.g. LINES) I can't see why you >>> would have them. >> >> The current patch already *does* imply row delimiters in the array >> case. It says so here: >> 8<--------------------------- >> + <varlistentry> >> + <term><literal>FORCE_ARRAY</literal></term> >> + <listitem> >> + <para> >> + Force output of array decorations at the beginning and end of >> output. >> + This option implies the <literal>FORCE_ROW_DELIMITER</literal> >> + option. It is allowed only in <command>COPY TO</command>, and >> only >> + when using <literal>JSON</literal> format. >> + The default is <literal>false</literal>. >> + </para> >> 8<--------------------------- >> >> and it does so here: >> 8<--------------------------- >> + if (opts_out->force_array) >> + opts_out->force_row_delimiter = true; >> 8<--------------------------- >> >> and it shows that here: >> 8<--------------------------- >> + copy copytest to stdout (format json, force_array); >> + [ >> + {"style":"DOS","test":"abc\r\ndef","filler":1} >> + ,{"style":"Unix","test":"abc\ndef","filler":2} >> + ,{"style":"Mac","test":"abc\rdef","filler":3} >> + ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4} >> + ] >> 8<--------------------------- >> >> It also does not allow explicitly setting row delimiters false while >> force_array is true here: >> 8<--------------------------- >> >> + if (opts_out->force_array && >> + force_row_delimiter_specified && >> + !opts_out->force_row_delimiter) >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("cannot specify FORCE_ROW_DELIMITER >> false with FORCE_ARRAY true"))); >> 8<--------------------------- >> >> Am I understanding something incorrectly? > > > But what's the point of having it if you're not using FORCE_ARRAY? See the follow up email -- other databases support it so why not? It seems to be a thing... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 2023-12-05 Tu 16:09, Joe Conway wrote: > On 12/5/23 16:02, Joe Conway wrote: >> On 12/5/23 15:55, Andrew Dunstan wrote: >>> and in any other case (e.g. LINES) I can't see why you >>> would have them. > > Oh I didn't address this -- I saw examples in the interwebs of MSSQL > server I think [1] which had the non-array with commas import and > export style. It was not that tough to support and the code as written > already does it, so why not? > > [1] > https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result > > That seems quite absurd, TBH. I know we've catered for some absurdity in the CSV code (much of it down to me), so maybe we need to be liberal in what we accept here too. IMNSHO, we should produce either a single JSON document (the ARRAY case) or a series of JSON documents, one per row (the LINES case). cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 12/5/23 16:20, Andrew Dunstan wrote: > On 2023-12-05 Tu 16:09, Joe Conway wrote: >> On 12/5/23 16:02, Joe Conway wrote: >>> On 12/5/23 15:55, Andrew Dunstan wrote: >>>> and in any other case (e.g. LINES) I can't see why you >>>> would have them. >> >> Oh I didn't address this -- I saw examples in the interwebs of MSSQL >> server I think [1] which had the non-array with commas import and >> export style. It was not that tough to support and the code as written >> already does it, so why not? > > That seems quite absurd, TBH. I know we've catered for some absurdity in > the CSV code (much of it down to me), so maybe we need to be liberal in > what we accept here too. IMNSHO, we should produce either a single JSON > document (the ARRAY case) or a series of JSON documents, one per row > (the LINES case). So your preference would be to not allow the non-array-with-commas case but if/when we implement COPY FROM we would accept that format? As in Postel'a law ("be conservative in what you do, be liberal in what you accept from others")? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 2023-12-05 Tu 16:46, Joe Conway wrote: > On 12/5/23 16:20, Andrew Dunstan wrote: >> On 2023-12-05 Tu 16:09, Joe Conway wrote: >>> On 12/5/23 16:02, Joe Conway wrote: >>>> On 12/5/23 15:55, Andrew Dunstan wrote: >>>>> and in any other case (e.g. LINES) I can't see why you >>>>> would have them. >>> >>> Oh I didn't address this -- I saw examples in the interwebs of MSSQL >>> server I think [1] which had the non-array with commas import and >>> export style. It was not that tough to support and the code as >>> written already does it, so why not? >> >> That seems quite absurd, TBH. I know we've catered for some absurdity in >> the CSV code (much of it down to me), so maybe we need to be liberal in >> what we accept here too. IMNSHO, we should produce either a single JSON >> document (the ARRAY case) or a series of JSON documents, one per row >> (the LINES case). > > > So your preference would be to not allow the non-array-with-commas > case but if/when we implement COPY FROM we would accept that format? > As in Postel'a law ("be conservative in what you do, be liberal in > what you accept from others")? Yes, I think so. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 12/6/23 07:36, Andrew Dunstan wrote: > > On 2023-12-05 Tu 16:46, Joe Conway wrote: >> On 12/5/23 16:20, Andrew Dunstan wrote: >>> On 2023-12-05 Tu 16:09, Joe Conway wrote: >>>> On 12/5/23 16:02, Joe Conway wrote: >>>>> On 12/5/23 15:55, Andrew Dunstan wrote: >>>>>> and in any other case (e.g. LINES) I can't see why you >>>>>> would have them. >>>> >>>> Oh I didn't address this -- I saw examples in the interwebs of MSSQL >>>> server I think [1] which had the non-array with commas import and >>>> export style. It was not that tough to support and the code as >>>> written already does it, so why not? >>> >>> That seems quite absurd, TBH. I know we've catered for some absurdity in >>> the CSV code (much of it down to me), so maybe we need to be liberal in >>> what we accept here too. IMNSHO, we should produce either a single JSON >>> document (the ARRAY case) or a series of JSON documents, one per row >>> (the LINES case). >> >> So your preference would be to not allow the non-array-with-commas >> case but if/when we implement COPY FROM we would accept that format? >> As in Postel'a law ("be conservative in what you do, be liberal in >> what you accept from others")? > > > Yes, I think so. Awesome. The attached does it that way. I also ran pgindent. I believe this is ready to commit unless there are further comments or objections. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On 2023-12-06 We 08:49, Joe Conway wrote: > On 12/6/23 07:36, Andrew Dunstan wrote: >> >> On 2023-12-05 Tu 16:46, Joe Conway wrote: >>> On 12/5/23 16:20, Andrew Dunstan wrote: >>>> On 2023-12-05 Tu 16:09, Joe Conway wrote: >>>>> On 12/5/23 16:02, Joe Conway wrote: >>>>>> On 12/5/23 15:55, Andrew Dunstan wrote: >>>>>>> and in any other case (e.g. LINES) I can't see why you >>>>>>> would have them. >>>>> >>>>> Oh I didn't address this -- I saw examples in the interwebs of >>>>> MSSQL server I think [1] which had the non-array with commas >>>>> import and export style. It was not that tough to support and the >>>>> code as written already does it, so why not? >>>> >>>> That seems quite absurd, TBH. I know we've catered for some >>>> absurdity in >>>> the CSV code (much of it down to me), so maybe we need to be >>>> liberal in >>>> what we accept here too. IMNSHO, we should produce either a single >>>> JSON >>>> document (the ARRAY case) or a series of JSON documents, one per row >>>> (the LINES case). >>> >>> So your preference would be to not allow the non-array-with-commas >>> case but if/when we implement COPY FROM we would accept that format? >>> As in Postel'a law ("be conservative in what you do, be liberal in >>> what you accept from others")? >> >> >> Yes, I think so. > > Awesome. The attached does it that way. I also ran pgindent. > > I believe this is ready to commit unless there are further comments or > objections. Sorry to bikeshed a little more, I'm a bit late looking at this. I suspect that most users will actually want the table as a single JSON document, so it should probably be the default. In any case FORCE_ARRAY as an option has a slightly wrong feel to it. I'm having trouble coming up with a good name for the reverse of that, off the top of my head. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Joe Conway <mail@joeconway.com> writes: > I believe this is ready to commit unless there are further comments or > objections. I thought we were still mostly at proof-of-concept stage? In particular, has anyone done any performance testing? I'm concerned about that because composite_to_json() has zero capability to cache any metadata across calls, meaning there is going to be a large amount of duplicated work per row. regards, tom lane
On 12/6/23 10:32, Andrew Dunstan wrote: > > On 2023-12-06 We 08:49, Joe Conway wrote: >> On 12/6/23 07:36, Andrew Dunstan wrote: >>> >>> On 2023-12-05 Tu 16:46, Joe Conway wrote: >>>> On 12/5/23 16:20, Andrew Dunstan wrote: >>>>> On 2023-12-05 Tu 16:09, Joe Conway wrote: >>>>>> On 12/5/23 16:02, Joe Conway wrote: >>>>>>> On 12/5/23 15:55, Andrew Dunstan wrote: >>>>>>>> and in any other case (e.g. LINES) I can't see why you >>>>>>>> would have them. >>>>>> >>>>>> Oh I didn't address this -- I saw examples in the interwebs of >>>>>> MSSQL server I think [1] which had the non-array with commas >>>>>> import and export style. It was not that tough to support and the >>>>>> code as written already does it, so why not? >>>>> >>>>> That seems quite absurd, TBH. I know we've catered for some >>>>> absurdity in >>>>> the CSV code (much of it down to me), so maybe we need to be >>>>> liberal in >>>>> what we accept here too. IMNSHO, we should produce either a single >>>>> JSON >>>>> document (the ARRAY case) or a series of JSON documents, one per row >>>>> (the LINES case). >>>> >>>> So your preference would be to not allow the non-array-with-commas >>>> case but if/when we implement COPY FROM we would accept that format? >>>> As in Postel'a law ("be conservative in what you do, be liberal in >>>> what you accept from others")? >>> >>> >>> Yes, I think so. >> >> Awesome. The attached does it that way. I also ran pgindent. >> >> I believe this is ready to commit unless there are further comments or >> objections. > > Sorry to bikeshed a little more, I'm a bit late looking at this. > > I suspect that most users will actually want the table as a single JSON > document, so it should probably be the default. In any case FORCE_ARRAY > as an option has a slightly wrong feel to it. Sure, I can make that happen, although I figured that for the many-rows-scenario the single array size might be an issue for whatever you are importing into. > I'm having trouble coming up with a good name for the reverse of > that, off the top of my head. Will think about it and propose something with the next patch revision. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 12/6/23 10:44, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> I believe this is ready to commit unless there are further comments or >> objections. > > I thought we were still mostly at proof-of-concept stage? The concept is narrowly scoped enough that I think we are homing in on the final patch. > In particular, has anyone done any performance testing? > I'm concerned about that because composite_to_json() has > zero capability to cache any metadata across calls, meaning > there is going to be a large amount of duplicated work > per row. I will devise some kind of test and report back. I suppose something with many rows and many narrow columns comparing time to COPY text/csv/json modes would do the trick? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 2023-12-06 We 10:44, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> I believe this is ready to commit unless there are further comments or >> objections. > I thought we were still mostly at proof-of-concept stage? > > In particular, has anyone done any performance testing? > I'm concerned about that because composite_to_json() has > zero capability to cache any metadata across calls, meaning > there is going to be a large amount of duplicated work > per row. > > Yeah, that's hard to deal with, too, as it can be called recursively. OTOH I'd rather have a version of this that worked slowly than none at all. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Joe Conway <mail@joeconway.com> writes: > On 12/6/23 10:44, Tom Lane wrote: >> In particular, has anyone done any performance testing? > I will devise some kind of test and report back. I suppose something > with many rows and many narrow columns comparing time to COPY > text/csv/json modes would do the trick? Yeah. If it's at least in the same ballpark as the existing text/csv formats then I'm okay with it. I'm worried that it might be 10x worse, in which case I think we'd need to do something. regards, tom lane
This is something I've wanted for a long time as well. While it's possible to use a COPY with text output for a trivial case, the double escaping falls apart quickly for arbitrary data. It's really only usable when you know exactly what you are querying and know it will not be a problem.
Regarding the defaults for the output, I think JSON lines (rather than a JSON array of objects) would be preferred. It's more natural to combine them and generate that type of data on the fly rather than forcing aggregation into a single object.
Couple more features / use cases come to mind as well. Even if they're not part of a first round of this feature I think it'd be helpful to document them now as it might give some ideas for what does make that first cut:
1. Outputting a top level JSON object without the additional column keys. IIUC, the top level keys are always the column names. A common use case would be a single json/jsonb column that is already formatted exactly as the user would like for output. Rather than enveloping it in an object with a dedicated key, it would be nice to be able to output it directly. This would allow non-object results to be outputted as well (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is structured, I think this would play nice with the JSON lines v.s. array concept.
COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE)
{"foo":1}
{"foo":2}
{"foo":3}
2. An option to ignore null fields so they are excluded from the output. This would not be a default but would allow shrinking the total size of the output data in many situations. This would be recursive to allow nested objects to be shrunk down (not just the top level). This might be worthwhile as a standalone JSON function though handling it during output would be more efficient as it'd only be read once.
COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
{}
{"foo":2}
{"foo":3}
3. Reverse of #2 when copying data in to allow defaulting missing fields to NULL.
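(A hypothetical sketch of #3, in the spirit of the placeholder option names above -- the target table t(a int, b text) and the option name are invented here:)

8<---------------------------
COPY t FROM STDIN WITH (FORMAT JSON, JSON_MISSING_FIELDS_AS_NULL);
{"a":1,"b":"x"}
{"a":2}
\.
8<---------------------------

The second row would then load with b set to NULL.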
Andrew Dunstan <andrew@dunslane.net> writes: > On 2023-12-06 We 10:44, Tom Lane wrote: >> In particular, has anyone done any performance testing? >> I'm concerned about that because composite_to_json() has >> zero capability to cache any metadata across calls, meaning >> there is going to be a large amount of duplicated work >> per row. > Yeah, that's hard to deal with, too, as it can be called recursively. Right. On the plus side, if we did improve this it would presumably also benefit other callers of composite_to_json[b]. > OTOH I'd rather have a version of this that worked slowly than none at all. It might be acceptable to plan on improving the performance later, depending on just how bad it is now. regards, tom lane
On Wed, Dec 06, 2023 at 11:28:59AM -0500, Tom Lane wrote: > It might be acceptable to plan on improving the performance later, > depending on just how bad it is now. On 10M rows with 11 integers each, I'm seeing the following: (format text) Time: 10056.311 ms (00:10.056) Time: 8789.331 ms (00:08.789) Time: 8755.070 ms (00:08.755) (format csv) Time: 12295.480 ms (00:12.295) Time: 12311.059 ms (00:12.311) Time: 12305.469 ms (00:12.305) (format json) Time: 24568.621 ms (00:24.569) Time: 23756.234 ms (00:23.756) Time: 24265.730 ms (00:24.266) 'perf top' tends to look a bit like this: 13.31% postgres [.] appendStringInfoString 7.57% postgres [.] datum_to_json_internal 6.82% postgres [.] SearchCatCache1 5.35% [kernel] [k] intel_gpio_irq 3.57% postgres [.] composite_to_json 3.31% postgres [.] IsValidJsonNumber -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
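(A sketch of the kind of setup that could produce comparable numbers -- the table shape, target files, and use of psql's \timing are assumptions here, not the actual script used:)

8<---------------------------
\timing on
CREATE TABLE copyperf (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int,
                       c7 int, c8 int, c9 int, c10 int, c11 int);
INSERT INTO copyperf
  SELECT g, g, g, g, g, g, g, g, g, g, g
  FROM generate_series(1, 10000000) g;
COPY copyperf TO '/tmp/copyperf.txt' (FORMAT text);
COPY copyperf TO '/tmp/copyperf.csv' (FORMAT csv);
-- requires the patch under discussion
COPY copyperf TO '/tmp/copyperf.json' (FORMAT json);
8<---------------------------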
On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote: > (format csv) > Time: 12295.480 ms (00:12.295) > Time: 12311.059 ms (00:12.311) > Time: 12305.469 ms (00:12.305) > > (format json) > Time: 24568.621 ms (00:24.569) > Time: 23756.234 ms (00:23.756) > Time: 24265.730 ms (00:24.266) I should also note that the json output is 85% larger than the csv output. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Andrew Dunstan wrote: > IMNSHO, we should produce either a single JSON > document (the ARRAY case) or a series of JSON documents, one per row > (the LINES case). "COPY Operations" in the doc says: " The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone". In the ARRAY case, the first messages with the copyjsontest regression test look like this (tshark output): PostgreSQL Type: CopyOut response Length: 13 Format: Text (0) Columns: 3 Format: Text (0) PostgreSQL Type: Copy data Length: 6 Copy data: 5b0a PostgreSQL Type: Copy data Length: 76 Copy data: 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… The first Copy data message with contents "5b0a" does not qualify as a row of data with 3 columns as advertised in the CopyOut message. Isn't that a problem? At least the json non-ARRAY case ("json lines") doesn't have this issue, since every CopyData message corresponds effectively to a row in the table. [1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 12/6/23 13:59, Daniel Verite wrote: > Andrew Dunstan wrote: > >> IMNSHO, we should produce either a single JSON >> document (the ARRAY case) or a series of JSON documents, one per row >> (the LINES case). > > "COPY Operations" in the doc says: > > " The backend sends a CopyOutResponse message to the frontend, followed > by zero or more CopyData messages (always one per row), followed by > CopyDone". > > In the ARRAY case, the first messages with the copyjsontest > regression test look like this (tshark output): > > PostgreSQL > Type: CopyOut response > Length: 13 > Format: Text (0) > Columns: 3 > Format: Text (0) > PostgreSQL > Type: Copy data > Length: 6 > Copy data: 5b0a > PostgreSQL > Type: Copy data > Length: 76 > Copy data: > 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… > > The first Copy data message with contents "5b0a" does not qualify > as a row of data with 3 columns as advertised in the CopyOut > message. Isn't that a problem? Is it a real problem, or just a bit of documentation change that I missed? Anything receiving this and looking for a json array should know how to assemble the data correctly despite the extra CopyData messages. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 12/6/23 11:44, Nathan Bossart wrote: > On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote: >> (format csv) >> Time: 12295.480 ms (00:12.295) >> Time: 12311.059 ms (00:12.311) >> Time: 12305.469 ms (00:12.305) >> >> (format json) >> Time: 24568.621 ms (00:24.569) >> Time: 23756.234 ms (00:23.756) >> Time: 24265.730 ms (00:24.266) > > I should also note that the json output is 85% larger than the csv output. I'll see if I can add some caching to composite_to_json(), but based on the relative data size it does not sound like there is much performance left on the table to go after, no? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Joe Conway <mail@joeconway.com> writes: > I'll see if I can add some caching to composite_to_json(), but based on > the relative data size it does not sound like there is much performance > left on the table to go after, no? If Nathan's perf results hold up elsewhere, it seems like some micro-optimization around the text-pushing (appendStringInfoString) might be more useful than caching. The 7% spent in cache lookups could be worth going after later, but it's not the top of the list. The output size difference does say that maybe we should pay some attention to the nearby request to not always label every field. Perhaps there should be an option for each row to transform to a JSON array rather than an object? regards, tom lane
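(For illustration only -- the patch emits one JSON object per row; the array-per-row idea floated here is not implemented, but it would drop the repeated keys, e.g.:)

8<---------------------------
{"i":1,"v":"val1"}   -- object per row (current patch)
[1,"val1"]           -- hypothetical array per row
8<---------------------------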
On 2023-12-06 We 15:20, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> I'll see if I can add some caching to composite_to_json(), but based on >> the relative data size it does not sound like there is much performance >> left on the table to go after, no? > If Nathan's perf results hold up elsewhere, it seems like some > micro-optimization around the text-pushing (appendStringInfoString) > might be more useful than caching. The 7% spent in cache lookups > could be worth going after later, but it's not the top of the list. > > The output size difference does say that maybe we should pay some > attention to the nearby request to not always label every field. > Perhaps there should be an option for each row to transform to > a JSON array rather than an object? > > I doubt it. People who want this are likely to want pretty much what this patch is providing, not something they would have to transform in order to get it. If they want space-efficient data they won't really be wanting JSON. Maybe they want Protocol Buffers or something in that vein. I see there's nearby proposal to make this area pluggable at <https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com> cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 12/6/23 11:28, Sehrope Sarkuni wrote: > Big +1 to this overall feature. cool! > Regarding the defaults for the output, I think JSON lines (rather than a > JSON array of objects) would be preferred. It's more natural to combine > them and generate that type of data on the fly rather than forcing > aggregation into a single object. So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 (Andrew and Davin) for defaulting to json arrays. Anyone else want to weigh in on that issue? > Couple more features / use cases come to mind as well. Even if they're > not part of a first round of this feature I think it'd be helpful to > document them now as it might give some ideas for what does make that > first cut: > > 1. Outputting a top level JSON object without the additional column > keys. IIUC, the top level keys are always the column names. A common use > case would be a single json/jsonb column that is already formatted > exactly as the user would like for output. Rather than enveloping it in > an object with a dedicated key, it would be nice to be able to output it > directly. This would allow non-object results to be outputted as well > (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is > structured, I think this would play nice with the JSON lines v.s. array > concept. > > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, > SOME_OPTION_TO_NOT_ENVELOPE) > {"foo":1} > {"foo":2} > {"foo":3} Your example does not match what you describe, or do I misunderstand? I thought your goal was to eliminate the repeated "foo" from each row... > 2. An option to ignore null fields so they are excluded from the output. > This would not be a default but would allow shrinking the total size of > the output data in many situations. This would be recursive to allow > nested objects to be shrunk down (not just the top level). This might be > worthwhile as a standalone JSON function though handling it during > output would be more efficient as it'd only be read once. > > COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, > SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS) > {} > {"foo":2} > {"foo":3} clear enough I think > 3. Reverse of #2 when copying data in to allow defaulting missing fields > to NULL. good to record the ask, but applies to a different feature (COPY FROM instead of COPY TO). -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> The output size difference does say that maybe we should pay some
> attention to the nearby request to not always label every field.
> Perhaps there should be an option for each row to transform to
> a JSON array rather than an object?
> I doubt it. People who want this are likely to want pretty much what
> this patch is providing, not something they would have to transform in
> order to get it. If they want space-efficient data they won't really be
> wanting JSON. Maybe they want Protocol Buffers or something in that vein.
Each one has its place and a default of the row_to_json(...) representation of the row still makes sense. But if the user has the option of outputting a single json/jsonb field for each row without an object or array wrapper, then it's possible to support all of these use cases as the user can explicitly pick whatever envelope makes sense:
-- Lines of JSON arrays:
COPY (SELECT json_build_array('test-' || a, b) FROM generate_series(1, 3) a, generate_series(5,6) b) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
["test-1", 5]
["test-2", 5]
["test-3", 5]
["test-1", 6]
["test-2", 6]
["test-3", 6]
-- Lines of JSON strings:
COPY (SELECT to_json('test-' || x) FROM generate_series(1, 5) x) TO STDOUT WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
"test-1"
"test-2"
"test-3"
"test-4"
"test-5"
row_to_json
---------------
{"a":1,"a":2}
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote: > If Nathan's perf results hold up elsewhere, it seems like some > micro-optimization around the text-pushing (appendStringInfoString) > might be more useful than caching. The 7% spent in cache lookups > could be worth going after later, but it's not the top of the list. Agreed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 12/6/23 16:42, Sehrope Sarkuni wrote: > On Wed, Dec 6, 2023 at 4:29 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > > 1. Outputting a top level JSON object without the additional column > > keys. IIUC, the top level keys are always the column names. A > common use > > case would be a single json/jsonb column that is already formatted > > exactly as the user would like for output. Rather than enveloping > it in > > an object with a dedicated key, it would be nice to be able to > output it > > directly. This would allow non-object results to be outputted as > well > > (e.g., lines of JSON arrays, numbers, or strings). Due to how > JSON is > > structured, I think this would play nice with the JSON lines v.s. > array > > concept. > > > > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM > > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, > > SOME_OPTION_TO_NOT_ENVELOPE) > > {"foo":1} > > {"foo":2} > > {"foo":3} > > Your example does not match what you describe, or do I misunderstand? I > thought your goal was to eliminate the repeated "foo" from each row... > > > The "foo" in this case is explicit as I'm adding it when building the > object. What I was trying to show was not adding an additional object > wrapper / envelope. > > So each row is: > > {"foo":1} > > Rather than: > > "{"json_build_object":{"foo":1}} I am still getting confused ;-) Let's focus on the current proposed patch with a "minimum required feature set". Right now the default behavior is "JSON lines": 8<------------------------------- COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i)) TO STDOUT WITH (FORMAT JSON); {"i":1,"v":"val1"} {"i":2,"v":"val2"} {"i":3,"v":"val3"} 8<------------------------------- and the other, non-default option is "JSON array": 8<------------------------------- COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i)) TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY); [ {"i":1,"v":"val1"} ,{"i":2,"v":"val2"} ,{"i":3,"v":"val3"} ] 8<------------------------------- So the questions are: 1. Do those two formats work for the initial implementation? 2. Is the default correct or should it be switched e.g. rather than specifying FORCE_ARRAY to get an array, something like FORCE_NO_ARRAY to get JSON lines and the JSON array is default? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 12/6/23 14:47, Joe Conway wrote: > On 12/6/23 13:59, Daniel Verite wrote: >> Andrew Dunstan wrote: >> >>> IMNSHO, we should produce either a single JSON >>> document (the ARRAY case) or a series of JSON documents, one per row >>> (the LINES case). >> >> "COPY Operations" in the doc says: >> >> " The backend sends a CopyOutResponse message to the frontend, followed >> by zero or more CopyData messages (always one per row), followed by >> CopyDone". >> >> In the ARRAY case, the first messages with the copyjsontest >> regression test look like this (tshark output): >> >> PostgreSQL >> Type: CopyOut response >> Length: 13 >> Format: Text (0) >> Columns: 3 >> Format: Text (0) >> PostgreSQL >> Type: Copy data >> Length: 6 >> Copy data: 5b0a >> PostgreSQL >> Type: Copy data >> Length: 76 >> Copy data: >> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… >> >> The first Copy data message with contents "5b0a" does not qualify >> as a row of data with 3 columns as advertised in the CopyOut >> message. Isn't that a problem? > > > Is it a real problem, or just a bit of documentation change that I missed? > > Anything receiving this and looking for a json array should know how to > assemble the data correctly despite the extra CopyData messages. Hmm, maybe the real problem here is that Columns do not equal "3" for the json mode case -- that should really say "1" I think, because the row is not represented as 3 columns but rather 1 json object. Does that sound correct? Assuming yes, there is still maybe an issue that there are two more "rows" that actual output rows (the "[" and the "]"), but maybe those are less likely to cause some hazard? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com> wrote:
> On 12/6/23 14:47, Joe Conway wrote:
> On 12/6/23 13:59, Daniel Verite wrote:
>> Andrew Dunstan wrote:
>>
>>> IMNSHO, we should produce either a single JSON
>>> document (the ARRAY case) or a series of JSON documents, one per row
>>> (the LINES case).
>>
>> "COPY Operations" in the doc says:
>>
>> " The backend sends a CopyOutResponse message to the frontend, followed
>> by zero or more CopyData messages (always one per row), followed by
>> CopyDone".
>>
>> In the ARRAY case, the first messages with the copyjsontest
>> regression test look like this (tshark output):
>>
>> PostgreSQL
>> Type: CopyOut response
>> Length: 13
>> Format: Text (0)
>> Columns: 3
>> Format: Text (0)
> Anything receiving this and looking for a json array should know how to
> assemble the data correctly despite the extra CopyData messages.
> Hmm, maybe the real problem here is that Columns do not equal "3" for
> the json mode case -- that should really say "1" I think, because the
> row is not represented as 3 columns but rather 1 json object.
>
> Does that sound correct?
>
> Assuming yes, there is still maybe an issue that there are two more
> "rows" that actual output rows (the "[" and the "]"), but maybe those
> are less likely to cause some hazard?

What is the limitation, if any, of introducing new type codes for these. n = 2..N for the different variants? Or even -1 for "raw text"? And document that columns and structural rows need to be determined out-of-band. Continuing to use 1 (text) for this non-csv data seems like a hack even if we can technically make it function. The semantics, especially for the array case, are completely discarded or wrong.
On 12/6/23 18:28, David G. Johnston wrote: > On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > On 12/6/23 14:47, Joe Conway wrote: > > On 12/6/23 13:59, Daniel Verite wrote: > >> Andrew Dunstan wrote: > >> > >>> IMNSHO, we should produce either a single JSON > >>> document (the ARRAY case) or a series of JSON documents, one > per row > >>> (the LINES case). > >> > >> "COPY Operations" in the doc says: > >> > >> " The backend sends a CopyOutResponse message to the frontend, > followed > >> by zero or more CopyData messages (always one per row), > followed by > >> CopyDone". > >> > >> In the ARRAY case, the first messages with the copyjsontest > >> regression test look like this (tshark output): > >> > >> PostgreSQL > >> Type: CopyOut response > >> Length: 13 > >> Format: Text (0) > >> Columns: 3 > >> Format: Text (0) > > > Anything receiving this and looking for a json array should know > how to > > assemble the data correctly despite the extra CopyData messages. > > Hmm, maybe the real problem here is that Columns do not equal "3" for > the json mode case -- that should really say "1" I think, because the > row is not represented as 3 columns but rather 1 json object. > > Does that sound correct? > > Assuming yes, there is still maybe an issue that there are two more > "rows" that actual output rows (the "[" and the "]"), but maybe those > are less likely to cause some hazard? > > > What is the limitation, if any, of introducing new type codes for > these. n = 2..N for the different variants? Or even -1 for "raw > text"? And document that columns and structural rows need to be > determined out-of-band. Continuing to use 1 (text) for this non-csv > data seems like a hack even if we can technically make it function. The > semantics, especially for the array case, are completely discarded or wrong. I am not following you here. SendCopyBegin looks like this currently: 8<-------------------------------- SendCopyBegin(CopyToState cstate) { StringInfoData buf; int natts = list_length(cstate->attnumlist); int16 format = (cstate->opts.binary ? 1 : 0); int i; pq_beginmessage(&buf, PqMsg_CopyOutResponse); pq_sendbyte(&buf, format); /* overall format */ pq_sendint16(&buf, natts); for (i = 0; i < natts; i++) pq_sendint16(&buf, format); /* per-column formats */ pq_endmessage(&buf); cstate->copy_dest = COPY_FRONTEND; } 8<-------------------------------- The "1" is saying are we binary mode or not. JSON mode will never be sending in binary in the current implementation at least. And it always aggregates all the columns as one json object. So the correct answer is (I think): 8<-------------------------------- *************** SendCopyBegin(CopyToState cstate) *** 146,154 **** pq_beginmessage(&buf, PqMsg_CopyOutResponse); pq_sendbyte(&buf, format); /* overall format */ ! pq_sendint16(&buf, natts); ! for (i = 0; i < natts; i++) ! pq_sendint16(&buf, format); /* per-column formats */ pq_endmessage(&buf); cstate->copy_dest = COPY_FRONTEND; } --- 150,169 ---- pq_beginmessage(&buf, PqMsg_CopyOutResponse); pq_sendbyte(&buf, format); /* overall format */ ! if (!cstate->opts.json_mode) ! { ! pq_sendint16(&buf, natts); ! for (i = 0; i < natts; i++) ! pq_sendint16(&buf, format); /* per-column formats */ ! } ! else ! { ! /* ! * JSON mode is always one non-binary column ! */ ! pq_sendint16(&buf, 1); ! pq_sendint16(&buf, 0); ! 
} pq_endmessage(&buf); cstate->copy_dest = COPY_FRONTEND; } 8<-------------------------------- That still leaves the need to fix the documentation: " The backend sends a CopyOutResponse message to the frontend, followed by zero or more CopyData messages (always one per row), followed by CopyDone" probably "always one per row" would be changed to note that json array format outputs two extra rows for the start/end bracket. In fact, as written the patch does this: 8<-------------------------------- COPY (SELECT x.i, 'val' || x.i as v FROM generate_series(1, 3) x(i) WHERE false) TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY); [ ] 8<-------------------------------- Not sure if that is a problem or not. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 12/6/23 18:38, David G. Johnston wrote: > On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston > <david.g.johnston@gmail.com <mailto:david.g.johnston@gmail.com>> wrote: > > On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > On 12/6/23 14:47, Joe Conway wrote: > > On 12/6/23 13:59, Daniel Verite wrote: > >> Andrew Dunstan wrote: > >> > >>> IMNSHO, we should produce either a single JSON > >>> document (the ARRAY case) or a series of JSON documents, > one per row > >>> (the LINES case). > >> > >> "COPY Operations" in the doc says: > >> > >> " The backend sends a CopyOutResponse message to the > frontend, followed > >> by zero or more CopyData messages (always one per row), > followed by > >> CopyDone". > >> > >> In the ARRAY case, the first messages with the copyjsontest > >> regression test look like this (tshark output): > >> > >> PostgreSQL > >> Type: CopyOut response > >> Length: 13 > >> Format: Text (0) > >> Columns: 3 > >> Format: Text (0) > > > Anything receiving this and looking for a json array should > know how to > > assemble the data correctly despite the extra CopyData messages. > > Hmm, maybe the real problem here is that Columns do not equal > "3" for > the json mode case -- that should really say "1" I think, > because the > row is not represented as 3 columns but rather 1 json object. > > Does that sound correct? > > Assuming yes, there is still maybe an issue that there are two more > "rows" that actual output rows (the "[" and the "]"), but maybe > those > are less likely to cause some hazard? > > > What is the limitation, if any, of introducing new type codes for > these. n = 2..N for the different variants? Or even -1 for "raw > text"? And document that columns and structural rows need to be > determined out-of-band. Continuing to use 1 (text) for this non-csv > data seems like a hack even if we can technically make it function. > The semantics, especially for the array case, are completely > discarded or wrong. > > Also, it seems like this answer would be easier to make if we implement > COPY FROM now since how is the server supposed to deal with decomposing > this data into tables without accurate type information? I don't see > implementing only half of the feature being a good idea. I've had much > more desire for FROM compared to TO personally. Several people have weighed in on the side of getting COPY TO done by itself first. Given how long this discussion has already become for a relatively small and simple feature, I am a big fan of not expanding the scope now. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone"
probably "always one per row" would be changed to note that json array
format outputs two extra rows for the start/end bracket.
On 12/6/23 19:39, David G. Johnston wrote: > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > > " The backend sends a CopyOutResponse message to the frontend, followed > by zero or more CopyData messages (always one per row), followed by > CopyDone" > > probably "always one per row" would be changed to note that json array > format outputs two extra rows for the start/end bracket. > > > Fair, I was ascribing much more semantic meaning to this than it wants. > > I don't see any real requirement, given the lack of semantics, to > mention JSON at all. It is one CopyData per row, regardless of the > contents. We don't delineate between the header and non-header data in > CSV. It isn't a protocol concern. good point > But I still cannot shake the belief that using a format code of 1 - > which really could be interpreted as meaning "textual csv" in practice - > for this JSON output is unwise and we should introduce a new integer > value for the new fundamental output format. No, I am pretty sure you still have that wrong. The "1" means binary mode. As in 8<---------------------- FORMAT Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text. 8<---------------------- That is completely separate from text and csv. It literally means to use the binary output functions instead of the usual ones: 8<---------------------- if (cstate->opts.binary) getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena); else getTypeOutputInfo(attr->atttypid, &out_func_oid, &isvarlena); 8<---------------------- Both "text" and "csv" mode use are non-binary output formats. I believe the JSON output format is also non-binary. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 12/6/23 18:09, Joe Conway wrote: > On 12/6/23 14:47, Joe Conway wrote: >> On 12/6/23 13:59, Daniel Verite wrote: >>> Andrew Dunstan wrote: >>> >>>> IMNSHO, we should produce either a single JSON >>>> document (the ARRAY case) or a series of JSON documents, one per row >>>> (the LINES case). >>> >>> "COPY Operations" in the doc says: >>> >>> " The backend sends a CopyOutResponse message to the frontend, followed >>> by zero or more CopyData messages (always one per row), followed by >>> CopyDone". >>> >>> In the ARRAY case, the first messages with the copyjsontest >>> regression test look like this (tshark output): >>> >>> PostgreSQL >>> Type: CopyOut response >>> Length: 13 >>> Format: Text (0) >>> Columns: 3 >>> Format: Text (0) >>> PostgreSQL >>> Type: Copy data >>> Length: 6 >>> Copy data: 5b0a >>> PostgreSQL >>> Type: Copy data >>> Length: 76 >>> Copy data: >>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… >>> >>> The first Copy data message with contents "5b0a" does not qualify >>> as a row of data with 3 columns as advertised in the CopyOut >>> message. Isn't that a problem? >> >> >> Is it a real problem, or just a bit of documentation change that I missed? >> >> Anything receiving this and looking for a json array should know how to >> assemble the data correctly despite the extra CopyData messages. > > Hmm, maybe the real problem here is that Columns do not equal "3" for > the json mode case -- that should really say "1" I think, because the > row is not represented as 3 columns but rather 1 json object. > > Does that sound correct? > > Assuming yes, there is still maybe an issue that there are two more > "rows" that actual output rows (the "[" and the "]"), but maybe those > are less likely to cause some hazard? The attached should fix the CopyOut response to say one column. I.e. it ought to look something like: PostgreSQL Type: CopyOut response Length: 13 Format: Text (0) Columns: 1 Format: Text (0) PostgreSQL Type: Copy data Length: 6 Copy data: 5b0a PostgreSQL Type: Copy data Length: 76 Copy data: [...] -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On 12/6/23 20:09, David G. Johnston wrote: > On Wed, Dec 6, 2023 at 5:57 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> wrote: > > On 12/6/23 19:39, David G. Johnston wrote: > > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com> > > <mailto:mail@joeconway.com <mailto:mail@joeconway.com>>> wrote: > > > But I still cannot shake the belief that using a format code of 1 - > > which really could be interpreted as meaning "textual csv" in > practice - > > for this JSON output is unwise and we should introduce a new integer > > value for the new fundamental output format. > > No, I am pretty sure you still have that wrong. The "1" means binary > mode > > > Ok. I made the same typo twice, I did mean to write 0 instead of 1. Fair enough. > But the point that we should introduce a 2 still stands. The new code > would mean: use text output functions but that there is no inherent > tabular structure in the underlying contents. Instead the copy format > was JSON and the output layout is dependent upon the json options in the > copy command and that there really shouldn't be any attempt to turn the > contents directly into a tabular data structure like you presently do > with the CSV data under format 0. Ignore the column count and column > formats as they are fixed or non-existent. I think that amounts to a protocol change, which we tend to avoid at all costs. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote: > If Nathan's perf results hold up elsewhere, it seems like some > micro-optimization around the text-pushing (appendStringInfoString) > might be more useful than caching. The 7% spent in cache lookups > could be worth going after later, but it's not the top of the list. Hah, it turns out my benchmark of 110M integers really stresses the JSONTYPE_NUMERIC path in datum_to_json_internal(). That particular path calls strlen() twice: once for IsValidJsonNumber(), and once in appendStringInfoString(). If I save the result from IsValidJsonNumber() and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster. It's probably worth giving datum_to_json_internal() a closer look in a new thread. diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c index 71ae53ff97..1951e93d9d 100644 --- a/src/backend/utils/adt/json.c +++ b/src/backend/utils/adt/json.c @@ -180,6 +180,7 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result, { char *outputstr; text *jsontext; + int len; check_stack_depth(); @@ -223,8 +224,8 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result, * Don't call escape_json for a non-key if it's a valid JSON * number. */ - if (!key_scalar && IsValidJsonNumber(outputstr, strlen(outputstr))) - appendStringInfoString(result, outputstr); + if (!key_scalar && IsValidJsonNumber(outputstr, (len = strlen(outputstr)))) + appendBinaryStringInfo(result, outputstr, len); else escape_json(result, outputstr); pfree(outputstr); -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 12/6/23 21:56, Nathan Bossart wrote: > On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote: >> If Nathan's perf results hold up elsewhere, it seems like some >> micro-optimization around the text-pushing (appendStringInfoString) >> might be more useful than caching. The 7% spent in cache lookups >> could be worth going after later, but it's not the top of the list. > > Hah, it turns out my benchmark of 110M integers really stresses the > JSONTYPE_NUMERIC path in datum_to_json_internal(). That particular path > calls strlen() twice: once for IsValidJsonNumber(), and once in > appendStringInfoString(). If I save the result from IsValidJsonNumber() > and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster. > It's probably worth giving datum_to_json_internal() a closer look in a new > thread. Yep, after looking through that code I was going to make the point that your 11 integer test was over indexing on that one type. I am sure there are other micro-optimizations to be made here, but I also think that it is outside the scope of the COPY TO JSON patch. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Dec 6, 2023 at 3:38 PM Joe Conway <mail@joeconway.com> wrote:
> So the questions are:
> 1. Do those two formats work for the initial implementation?

Yes. We provide a stream-oriented format and one atomic-import format.

> 2. Is the default correct or should it be switched
> e.g. rather than specifying FORCE_ARRAY to get an
> array, something like FORCE_NO_ARRAY to get JSON lines
> and the JSON array is default?

No default? Require explicit specification of a sub-format when the main format is JSON:

JSON_OBJECT_ROWS
JSON_ARRAY_OF_OBJECTS

For a future compact array-structured-composites sub-format:

JSON_ARRAY_OF_ARRAYS
JSON_ARRAY_ROWS
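(Purely illustrative of the four proposed sub-formats -- none of these option names exist in the current patch; reusing a two-column example from earlier in the thread:)

8<---------------------------
-- JSON_OBJECT_ROWS (stream of objects, one per row)
{"i":1,"v":"val1"}
{"i":2,"v":"val2"}

-- JSON_ARRAY_OF_OBJECTS (single document)
[{"i":1,"v":"val1"},{"i":2,"v":"val2"}]

-- JSON_ARRAY_ROWS (stream of arrays, one per row)
[1,"val1"]
[2,"val2"]

-- JSON_ARRAY_OF_ARRAYS (single document)
[[1,"val1"],[2,"val2"]]
8<---------------------------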
No default seems unlike the way we treat other COPY options. I'm not terribly fussed about which format to have as the default, but I think we should have one.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Joe Conway wrote: > The attached should fix the CopyOut response to say one column. I.e. it > ought to look something like: Spending more time with the doc I came to the opinion that in this bit of the protocol, in CopyOutResponse (B) ... Int16 The number of columns in the data to be copied (denoted N below). ... this number must be the number of columns in the source. That is for COPY table(a,b,c) the number is 3, independently on whether the result is formatted in text, cvs, json or binary. I think that changing it for json can reasonably be interpreted as a protocol break and we should not do it. The fact that this value does not help parsing the CopyData messages that come next is not a new issue. A reader that doesn't know the field separator and whether it's text or csv cannot parse these messages into fields anyway. But just knowing how much columns there are in the original data might be useful by itself and we don't want to break that. The other question for me is, in the CopyData message, this bit: " Messages sent from the backend will always correspond to single data rows" ISTM that considering that the "[" starting the json array is a "data row" is a stretch. That might be interpreted as a protocol break, depending on how strict the interpretation is. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 12/7/23 08:35, Daniel Verite wrote:
> Joe Conway wrote:
>
>> The attached should fix the CopyOut response to say one column. I.e. it
>> ought to look something like:
>
> Spending more time with the doc I came to the opinion that in this bit
> of the protocol, in CopyOutResponse (B)
> ...
> Int16
> The number of columns in the data to be copied (denoted N below).
> ...
>
> this number must be the number of columns in the source.
> That is, for COPY table(a,b,c) the number is 3, independently
> of whether the result is formatted in text, csv, json or binary.
>
> I think that changing it for json can reasonably be interpreted
> as a protocol break and we should not do it.
>
> The fact that this value does not help parsing the CopyData
> messages that come next is not a new issue. A reader that
> doesn't know the field separator and whether it's text or csv
> cannot parse these messages into fields anyway.
> But just knowing how many columns there are in the original
> data might be useful by itself and we don't want to break that.

Ok, that sounds reasonable to me -- I will revert that change.

> The other question for me is, in the CopyData message, this bit:
> " Messages sent from the backend will always correspond to single data rows"
>
> ISTM that considering that the "[" starting the json array is a
> "data row" is a stretch.
> That might be interpreted as a protocol break, depending
> on how strict the interpretation is.

If we really think that is a problem I can see about changing it to this
format for json array:

8<------------------
copy
(
  with ss(f1, f2) as
  (
    select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<------------------

Is this acceptable to everyone?

Or maybe this is preferred?

8<------------------
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<------------------

Or as long as we are painting the shed, maybe this?

8<------------------
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/7/23 08:52, Joe Conway wrote:
> Or maybe this is preferred?
> 8<------------------
> [{"ss":{"f1":1,"f2":1}},
> {"ss":{"f1":1,"f2":2}},
> {"ss":{"f1":1,"f2":3}}]
> 8<------------------

I don't know why my mail client keeps adding extra spaces, but the
intention here is a single space in front of row 2 and 3 in order to
line the json objects up at column 2.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 12/7/23 09:11, David G. Johnston wrote:
> Those are all the same breakage though - if truly interpreted as data
> rows the protocol is basically written such that the array format is not
> supportable and only the lines format can be used. Hence my "format 0
> doesn't work" comment for array output and we should explicitly add
> format 2 where we explicitly decouple lines of output from rows of
> data. That said, it would seem in practice format 0 already decouples
> them and so the current choice of the brackets on their own lines is
> acceptable.
>
> I'd prefer to keep them on their own line.

WFM ¯\_(ツ)_/¯

I am merely responding with options to the many people opining on the thread.

> I also don't know why you introduced another level of object nesting
> here. That seems quite undesirable.

I didn't add anything. It is an artifact of the particular query I wrote
in the copy to statement (I did "select ss from ss" instead of
"select * from ss"), mea culpa.

This is what the latest patch, as written today, outputs:

8<----------------------
copy (select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
 {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<----------------------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thursday, December 7, 2023, Daniel Verite <daniel@manitou-mail.org> wrote:Joe Conway wrote:
> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:
Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...
this number must be the number of columns in the source.
That is for COPY table(a,b,c) the number is 3, independently
on whether the result is formatted in text, cvs, json or binary.
I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.
The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how much columns there are in the original
data might be useful by itself and we don't want to break that.This argument for leaving 3 as the column count makes sense to me. I agree this content is not meant to facilitate interpreting the contents at a protocol level.
The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"
ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.
Joe Conway wrote:

> copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible with
rule #2 at https://jsonlines.org ("2. Each Line is a Valid JSON Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Dave Cramer wrote:

> > This argument for leaving 3 as the column count makes sense to me. I
> > agree this content is not meant to facilitate interpreting the contents at
> > a protocol level.
>
> I'd disagree. From my POV if the data comes back as a JSON Array this is
> one object and this should be reflected in the column count.

The doc says this:

"Int16
The number of columns in the data to be copied (denoted N below)."

and this formulation is repeated in PQnfields() for libpq:

"PQnfields
Returns the number of columns (fields) to be copied."

How to interpret that sentence? "to be copied" from what, into what,
and by what way?

A plausible interpretation is "to be copied from the source data into
the COPY stream, by the backend". So the number of columns to be copied
likely refers to the columns of the dataset, not the "in-transit form"
that is text or csv or json.

The interpretation you're proposing also makes sense, that it's just one
json column per row, or even a single-row single-column for the entire
dataset in the force_array case, but then the question is why isn't that
number of columns always 1 for the original "text" format, since each row
is represented in the stream as a single long piece of text?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
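As a concrete illustration of where that count surfaces on the client side, here is a minimal libpq sketch (not from the thread; the connection string and query are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* placeholder connection string */
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;
    char       *buf;

    res = PQexec(conn, "COPY (SELECT 1 AS a, 2 AS b, 3 AS c) TO STDOUT (FORMAT text)");
    if (PQresultStatus(res) == PGRES_COPY_OUT)
        printf("columns to be copied: %d\n", PQnfields(res));  /* 3 for the text format today */
    PQclear(res);

    /* drain the CopyData messages and the final command status */
    while (PQgetCopyData(conn, &buf, 0) > 0)
        PQfreemem(buf);
    PQclear(PQgetResult(conn));
    PQfinish(conn);
    return 0;
}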
On 12/8/23 14:45, Daniel Verite wrote:
> Joe Conway wrote:
>
>> copyto_json.007.diff
>
> When the source has json fields with non-significant line feeds, the COPY
> output has these line feeds too, which makes the output incompatible
> with rule #2 at https://jsonlines.org ("2. Each Line is a Valid JSON
> Value").
>
> create table j(f json);
>
> insert into j values('{"a":1,
> "b":2
> }');
>
> copy j to stdout (format json);
>
> Result:
> {"f":{"a":1,
> "b":2
> }}
>
> Is that expected? copy.sgml in 007 doesn't describe the output
> in terms of lines so it's hard to tell from the doc.

The patch as-is just does the equivalent of row_to_json():

8<----------------------
select row_to_json(j) from j;
 row_to_json
--------------
 {"f":{"a":1,+
 "b":2       +
 }}
(1 row)
8<----------------------

So yeah, that is how it works today. I will take a look at what it would
take to fix it.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
> On Sat, Dec 2, 2023 at 4:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Joe Conway <mail@joeconway.com> writes: > >> I noticed that, with the PoC patch, "json" is the only format that must be > >> quoted. Without quotes, I see a syntax error. In longer term we should move any specific COPY flag names and values out of grammar and their checking into the parts that actually implement whatever the flag is influencing Similar to what we do with OPTIONS in all levels of FDW definitions (WRAPPER itself, SERVER, USER MAPPING, FOREIGN TABLE) [*] https://www.postgresql.org/docs/current/sql-createforeigndatawrapper.html > >> I'm assuming there's a > >> conflict with another json-related rule somewhere in gram.y, but I haven't > >> tracked down exactly which one is causing it. > > While I've not looked too closely, I suspect this might be due to the > FORMAT_LA hack in base_yylex: > > /* Replace FORMAT by FORMAT_LA if it's followed by JSON */ > switch (next_token) > { > case JSON: > cur_token = FORMAT_LA; > break; > } My hope is that turning the WITH into a fully independent part with no grammar-defined keys or values would also solve the issue of quoting "json". For backwards compatibility we may even go the route of keeping the WITH as is but add the OPTIONS which can take any values at grammar level. I shared my "Pluggable Copy " talk slides from Berlin '22 in another thread -- Hannu
On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote:
>
> The attached should fix the CopyOut response to say one column.
>

Playing around with this, I found a couple of cases that generate an error:

COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);

COPY (VALUES (1), (2)) TO stdout WITH (format json);

both of those generate the following:

ERROR:  record type has not been registered

Regards,
Dean
On 1/8/24 14:36, Dean Rasheed wrote:
> On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote:
>>
>> The attached should fix the CopyOut response to say one column.
>>
>
> Playing around with this, I found a couple of cases that generate an error:
>
> COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
>
> COPY (VALUES (1), (2)) TO stdout WITH (format json);
>
> both of those generate the following:
>
> ERROR:  record type has not been registered

Thanks -- will have a look

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 9, 2024 at 4:40 AM Joe Conway <mail@joeconway.com> wrote: > > On 1/8/24 14:36, Dean Rasheed wrote: > > On Thu, 7 Dec 2023 at 01:10, Joe Conway <mail@joeconway.com> wrote: > >> > >> The attached should fix the CopyOut response to say one column. > >> > > > > Playing around with this, I found a couple of cases that generate an error: > > > > COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json); > > > > COPY (VALUES (1), (2)) TO stdout WITH (format json); > > > > both of those generate the following: > > > > ERROR: record type has not been registered > > > Thanks -- will have a look > > -- > Joe Conway > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com > > > In the function CopyOneRowTo, I try to call the function BlessTupleDesc again. +BlessTupleDesc(slot->tts_tupleDescriptor); rowdata = ExecFetchSlotHeapTupleDatum(slot); Please check the attachment. (one test.sql file, one patch, one bless twice). Now the error cases are gone, less cases return error. but the new result is not the expected. `COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);` returns {"":1} The expected result would be `{"g":1}`. I think the reason is maybe related to the function copy_dest_startup.
Attachment
On Tue, Jan 16, 2024 at 11:46 AM jian he <jian.universality@gmail.com> wrote:
>
> I think the reason is maybe related to the function copy_dest_startup.

I was wrong about this sentence.

In the function CopyOneRowTo, change the `if (!cstate->opts.json_mode)`
else branch to the following:

else
{
    Datum       rowdata;
    StringInfo  result;

    if (slot->tts_tupleDescriptor->natts == 1)
    {
        /* Flat-copy the attribute array */
        memcpy(TupleDescAttr(slot->tts_tupleDescriptor, 0),
               TupleDescAttr(cstate->queryDesc->tupDesc, 0),
               1 * sizeof(FormData_pg_attribute));
    }
    BlessTupleDesc(slot->tts_tupleDescriptor);
    rowdata = ExecFetchSlotHeapTupleDatum(slot);
    result = makeStringInfo();
    composite_to_json(rowdata, result, false);

    if (json_row_delim_needed && cstate->opts.force_array)
    {
        CopySendChar(cstate, ',');
    }
    else if (cstate->opts.force_array)
    {
        /* first row needs no delimiter */
        CopySendChar(cstate, ' ');
        json_row_delim_needed = true;
    }
    CopySendData(cstate, result->data, result->len);
}

All the cases work, but it is more like a hack, because I cannot fully
explain why it works.

-------------------------------------------------------------------------------
demo

drop function if exists execute_into_test cascade;
NOTICE:  function execute_into_test() does not exist, skipping
DROP FUNCTION
drop type if exists execute_into_test cascade;
NOTICE:  type "execute_into_test" does not exist, skipping
DROP TYPE
create type eitype as (i integer, y integer);
CREATE TYPE
create or replace function execute_into_test() returns eitype as $$
declare
    _v eitype;
begin
    execute 'select 1,2' into _v;
    return _v;
end;
$$ language plpgsql;
CREATE FUNCTION

COPY (SELECT 1 from generate_series(1,1) g) TO stdout WITH (format json);
{"?column?":1}
COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1}
COPY (SELECT g,1 from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1,"?column?":1}
COPY (select * from execute_into_test()) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select * from execute_into_test() sub) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select sub from execute_into_test() sub) TO stdout WITH (format json);
{"sub":{"i":1,"y":2}}
COPY (select sub.i from execute_into_test() sub) TO stdout WITH (format json);
{"i":1}
COPY (select sub.y from execute_into_test() sub) TO stdout WITH (format json);
{"y":2}
COPY (VALUES (1), (2)) TO stdout WITH (format json);
{"column1":1}
{"column1":2}
COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
{"?column?":1}
{"?column?":2}
On Thu, Dec 7, 2023 at 10:10 AM Joe Conway <mail@joeconway.com> wrote: > > On 12/6/23 18:09, Joe Conway wrote: > > On 12/6/23 14:47, Joe Conway wrote: > >> On 12/6/23 13:59, Daniel Verite wrote: > >>> Andrew Dunstan wrote: > >>> > >>>> IMNSHO, we should produce either a single JSON > >>>> document (the ARRAY case) or a series of JSON documents, one per row > >>>> (the LINES case). > >>> > >>> "COPY Operations" in the doc says: > >>> > >>> " The backend sends a CopyOutResponse message to the frontend, followed > >>> by zero or more CopyData messages (always one per row), followed by > >>> CopyDone". > >>> > >>> In the ARRAY case, the first messages with the copyjsontest > >>> regression test look like this (tshark output): > >>> > >>> PostgreSQL > >>> Type: CopyOut response > >>> Length: 13 > >>> Format: Text (0) > >>> Columns: 3 > >>> Format: Text (0) > >>> PostgreSQL > >>> Type: Copy data > >>> Length: 6 > >>> Copy data: 5b0a > >>> PostgreSQL > >>> Type: Copy data > >>> Length: 76 > >>> Copy data: > >>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031… > >>> > >>> The first Copy data message with contents "5b0a" does not qualify > >>> as a row of data with 3 columns as advertised in the CopyOut > >>> message. Isn't that a problem? > >> > >> > >> Is it a real problem, or just a bit of documentation change that I missed? > >> > >> Anything receiving this and looking for a json array should know how to > >> assemble the data correctly despite the extra CopyData messages. > > > > Hmm, maybe the real problem here is that Columns do not equal "3" for > > the json mode case -- that should really say "1" I think, because the > > row is not represented as 3 columns but rather 1 json object. > > > > Does that sound correct? > > > > Assuming yes, there is still maybe an issue that there are two more > > "rows" that actual output rows (the "[" and the "]"), but maybe those > > are less likely to cause some hazard? > > > The attached should fix the CopyOut response to say one column. I.e. it > ought to look something like: > > PostgreSQL > Type: CopyOut response > Length: 13 > Format: Text (0) > Columns: 1 > Format: Text (0) > PostgreSQL > Type: Copy data > Length: 6 > Copy data: 5b0a > PostgreSQL > Type: Copy data > Length: 76 > Copy data: [...] > If I'm not missing, copyto_json.007.diff is the latest patch but it needs to be rebased to the current HEAD. Here are random comments: --- if (opts_out->json_mode) + { + if (is_from) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot use JSON mode in COPY FROM"))); + } + else if (opts_out->force_array) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("COPY FORCE_ARRAY requires JSON mode"))); I think that flatting these two condition make the code more readable: if (opts_out->json_mode && is_from) ereport(ERROR, ...); if (!opts_out->json_mode && opts_out->force_array) ereport(ERROR, ...); Also these checks can be moved close to other checks at the end of ProcessCopyOptions(). --- @@ -3395,6 +3395,10 @@ copy_opt_item: { $$ = makeDefElem("format", (Node *) makeString("csv"), @1); } + | JSON + { + $$ = makeDefElem("format", (Node *) makeString("json"), @1); + } | HEADER_P { $$ = makeDefElem("header", (Node *) makeBoolean(true), @1); @@ -3427,6 +3431,10 @@ copy_opt_item: { $$ = makeDefElem("encoding", (Node *) makeString($2), @1); } + | FORCE ARRAY + { + $$ = makeDefElem("force_array", (Node *) makeBoolean(true), @1); + } ; I believe we don't need to support new options in old-style syntax. 
--- @@ -3469,6 +3477,10 @@ copy_generic_opt_elem: { $$ = makeDefElem($1, $2, @1); } + | FORMAT_LA copy_generic_opt_arg + { + $$ = makeDefElem("format", $2, @1); + } ; I think it's not necessary. "format" option is already handled in copy_generic_opt_elem. --- +/* need delimiter to start next json array element */ +static bool json_row_delim_needed = false; I think it's cleaner to include json_row_delim_needed into CopyToStateData. --- Splitting the patch into two patches: add json format and add force_array option would make reviews easy. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
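On the last review point above (moving json_row_delim_needed into CopyToStateData), the change would look roughly like this; a sketch only, with the existing fields of CopyToStateData in copyto.c abbreviated:

typedef struct CopyToStateData
{
    /* ... existing fields elided ... */

    /*
     * Sketch: track the JSON array delimiter state per COPY operation
     * instead of in a file-level static variable; set to true after the
     * first array element has been sent.
     */
    bool        json_row_delim_needed;
} CopyToStateData;

The checks in CopyOneRowTo() would then read cstate->json_row_delim_needed rather than the static flag.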
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > If I'm not missing, copyto_json.007.diff is the latest patch but it > needs to be rebased to the current HEAD. Here are random comments: > please check the latest version. > if (opts_out->json_mode) > + { > + if (is_from) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot use JSON mode in COPY FROM"))); > + } > + else if (opts_out->force_array) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("COPY FORCE_ARRAY requires JSON mode"))); > > I think that flatting these two condition make the code more readable: I make it two condition check > if (opts_out->json_mode && is_from) > ereport(ERROR, ...); > > if (!opts_out->json_mode && opts_out->force_array) > ereport(ERROR, ...); > > Also these checks can be moved close to other checks at the end of > ProcessCopyOptions(). > Yes. I did it, please check it. > @@ -3395,6 +3395,10 @@ copy_opt_item: > { > $$ = makeDefElem("format", (Node *) makeString("csv"), @1); > } > + | JSON > + { > + $$ = makeDefElem("format", (Node *) makeString("json"), @1); > + } > | HEADER_P > { > $$ = makeDefElem("header", (Node *) makeBoolean(true), @1); > @@ -3427,6 +3431,10 @@ copy_opt_item: > { > $$ = makeDefElem("encoding", (Node *) makeString($2), @1); > } > + | FORCE ARRAY > + { > + $$ = makeDefElem("force_array", (Node *) > makeBoolean(true), @1); > + } > ; > > I believe we don't need to support new options in old-style syntax. > > --- > @@ -3469,6 +3477,10 @@ copy_generic_opt_elem: > { > $$ = makeDefElem($1, $2, @1); > } > + | FORMAT_LA copy_generic_opt_arg > + { > + $$ = makeDefElem("format", $2, @1); > + } > ; > > I think it's not necessary. "format" option is already handled in > copy_generic_opt_elem. > test it, I found out this part is necessary. because a query with WITH like `copy (select 1) to stdout with (format json, force_array false); ` will fail. > --- > +/* need delimiter to start next json array element */ > +static bool json_row_delim_needed = false; > > I think it's cleaner to include json_row_delim_needed into CopyToStateData. yes. I agree. So I did it. > --- > Splitting the patch into two patches: add json format and add > force_array option would make reviews easy. > done. one patch for json format, another one for force_array option. I also made the following cases fail. copy copytest to stdout (format csv, force_array false); ERROR: specify COPY FORCE_ARRAY is only allowed in JSON mode. If copy to table then call table_scan_getnextslot no need to worry about the Tupdesc. however if we copy a query output as format json, we may need to consider it. cstate->queryDesc->tupDesc is the output of Tupdesc, we can rely on this. for copy a query result to json, I memcpy( cstate->queryDesc->tupDesc) to the the slot's slot->tts_tupleDescriptor so composite_to_json can use cstate->queryDesc->tupDesc to do the work. I guess this will make it more bullet-proof.
Attachment
Hi hackers,

Kou-san (CCed) has been working on *Make COPY format extendable* [1], so I
think making *copy to json* based on that work might be the right direction.

I wrote an extension for that purpose, and here is the patch set together
with Kou-san's *extendable copy format* implementation:

0001-0009 is the implementation of the extendable copy format
0010 is the pg_copy_json extension

I also created a PR [2] if anybody likes the github review style.

The *extendable copy format* feature is still being developed; I post this
email in case the patch set in this thread is committed without knowing
about the *extendable copy format* feature.

I'd like to hear your opinions.

[1]: https://www.postgresql.org/message-id/20240124.144936.67229716500876806.kou%40clear-code.com
[2]: https://github.com/zhjwpku/postgres/pull/2/files

--
Regards
Junwang Zhao
Attachment
- v8-0004-Add-support-for-implementing-custom-COPY-TO-forma.patch
- v8-0003-Export-CopyToStateData.patch
- v8-0001-Extract-COPY-TO-format-implementations.patch
- v8-0002-Add-support-for-adding-custom-COPY-TO-format.patch
- v8-0005-Extract-COPY-FROM-format-implementations.patch
- v8-0008-Add-support-for-implementing-custom-COPY-FROM-for.patch
- v8-0007-Export-CopyFromStateData.patch
- v8-0006-Add-support-for-adding-custom-COPY-FROM-format.patch
- v8-0009-change-CopyToGetFormat-to-CopyToSendCopyBegin-and.patch
- v8-0010-introduce-contrib-pg_copy_json.patch
On Sat, 27 Jan 2024 at 11:25, Junwang Zhao <zhjwpku@gmail.com> wrote: > > Hi hackers, > > Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so > I think making *copy to json* based on that work might be the right direction. > > I write an extension for that purpose, and here is the patch set together > with Kou-san's *extendable copy format* implementation: > > 0001-0009 is the implementation of extendable copy format > 00010 is the pg_copy_json extension > > I also created a PR[2] if anybody likes the github review style. > > The *extendable copy format* feature is still being developed, I post this > email in case the patch set in this thread is committed without knowing > the *extendable copy format* feature. > > I'd like to hear your opinions. CFBot shows that one of the test is failing as in [1]: [05:46:41.678] /bin/sh: 1: cannot open /tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No such file [05:46:41.678] diff: /tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out: No such file or directory [05:46:41.678] diff: /tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out: No such file or directory [05:46:41.678] # diff command failed with status 512: diff "/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out" "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out" > "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff" [05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454: check] Error 2 [05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2 [05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2 Please post an updated version for the same. [1] - https://cirrus-ci.com/task/5322439115145216 Regards, Vignesh
Hi Vignesh, On Wed, Jan 31, 2024 at 5:50 PM vignesh C <vignesh21@gmail.com> wrote: > > On Sat, 27 Jan 2024 at 11:25, Junwang Zhao <zhjwpku@gmail.com> wrote: > > > > Hi hackers, > > > > Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so > > I think making *copy to json* based on that work might be the right direction. > > > > I write an extension for that purpose, and here is the patch set together > > with Kou-san's *extendable copy format* implementation: > > > > 0001-0009 is the implementation of extendable copy format > > 00010 is the pg_copy_json extension > > > > I also created a PR[2] if anybody likes the github review style. > > > > The *extendable copy format* feature is still being developed, I post this > > email in case the patch set in this thread is committed without knowing > > the *extendable copy format* feature. > > > > I'd like to hear your opinions. > > CFBot shows that one of the test is failing as in [1]: > [05:46:41.678] /bin/sh: 1: cannot open > /tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No > such file > [05:46:41.678] diff: > /tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out: > No such file or directory > [05:46:41.678] diff: > /tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out: > No such file or directory > [05:46:41.678] # diff command failed with status 512: diff > "/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out" > "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out" > > "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff" > [05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454: > check] Error 2 > [05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2 > [05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2 > > Please post an updated version for the same. Thanks for the reminder, the patch set I posted is not for commit but for further discussion. I will post more information about the *extendable copy* feature when it's about to be committed. > > [1] - https://cirrus-ci.com/task/5322439115145216 > > Regards, > Vignesh -- Regards Junwang Zhao
On 2024-Jan-23, jian he wrote:

> > + | FORMAT_LA copy_generic_opt_arg
> > + {
> > + $$ = makeDefElem("format", $2, @1);
> > + }
> > ;
> >
> > I think it's not necessary. "format" option is already handled in
> > copy_generic_opt_elem.
>
> test it, I found out this part is necessary.
> because a query with WITH like `copy (select 1) to stdout with
> (format json, force_array false); ` will fail.

Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
(see base_yylex there). I'm not really sure but I think it might be
better to make it "| FORMAT_LA JSON" instead of invoking the whole
copy_generic_opt_arg syntax. Not because of performance, but just
because it's much clearer what's going on.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
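For readers not familiar with the grammar, the alternative Álvaro is suggesting would look roughly like the fragment below, modeled on the copy_opt_item hunks already quoted in this thread (a sketch only, not a tested patch):

| FORMAT_LA JSON
    {
        $$ = makeDefElem("format", (Node *) makeString("json"), @1);
    }

That is, copy_generic_opt_elem would accept the FORMAT_LA JSON token pair directly instead of routing the value through copy_generic_opt_arg.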
On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > + | FORMAT_LA copy_generic_opt_arg
> > > + {
> > > + $$ = makeDefElem("format", $2, @1);
> > > + }
> > > ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > test it, I found out this part is necessary.
> > because a query with WITH like `copy (select 1) to stdout with
> > (format json, force_array false); ` will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there). I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax. Not because of performance, but just
> because it's much clearer what's going on.
>

Sorry to bother you. Right now I have not applied any patch; I am just on
master, and I don't know much about gram.y.

copy (select 1) to stdout with (format json1);
ERROR:  COPY format "json1" not recognized
LINE 1: copy (select 1) to stdout with (format json1);
                                        ^
copy (select 1) to stdout with (format json);
ERROR:  syntax error at or near "format"
LINE 1: copy (select 1) to stdout with (format json);
                                        ^

json is a keyword. Is it possible to escape it, so that
`copy (select 1) to stdout with (format json)` produces the same error
message as `copy (select 1) to stdout with (format json1)`?
On 2024-Feb-02, jian he wrote:

> copy (select 1) to stdout with (format json);
> ERROR:  syntax error at or near "format"
> LINE 1: copy (select 1) to stdout with (format json);
>                                         ^
>
> json is a keyword. Is it possible to escape it?
> make `copy (select 1) to stdout with (format json)` error message the same as
> `copy (select 1) to stdout with (format json1)`

Sure, you can use

copy (select 1) to stdout with (format "json");

and then you get

ERROR:  COPY format "json" not recognized

is that what you meant?

If you want the server to send this message when the JSON word is not in
quotes, I'm afraid that's not possible, due to the funny nature of the
FORMAT keyword when the JSON keyword appears after it. But why do you
care? If you use the patch, then you no longer need to have the "not
recognized" error messages anymore, because the JSON format is indeed
a recognized one.

Maybe I didn't understand your question.

--
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
On Fri, Feb 2, 2024 at 5:48 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> If you want the server to send this message when the JSON word is not in
> quotes, I'm afraid that's not possible, due to the funny nature of the
> FORMAT keyword when the JSON keyword appears after it. But why do you
> care? If you use the patch, then you no longer need to have the "not
> recognized" error messages anymore, because the JSON format is indeed
> a recognized one.
>

"JSON word is not in quotes" is my intention.
Now it seems when people implement any custom format for COPY, if the
format_name is a keyword then we need single quotes.

Thanks for clarifying!
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > if (opts_out->json_mode && is_from) > ereport(ERROR, ...); > > if (!opts_out->json_mode && opts_out->force_array) > ereport(ERROR, ...); > > Also these checks can be moved close to other checks at the end of > ProcessCopyOptions(). > > --- > @@ -3395,6 +3395,10 @@ copy_opt_item: > { > $$ = makeDefElem("format", (Node *) makeString("csv"), @1); > } > + | JSON > + { > + $$ = makeDefElem("format", (Node *) makeString("json"), @1); > + } > | HEADER_P > { > $$ = makeDefElem("header", (Node *) makeBoolean(true), @1); > @@ -3427,6 +3431,10 @@ copy_opt_item: > { > $$ = makeDefElem("encoding", (Node *) makeString($2), @1); > } > + | FORCE ARRAY > + { > + $$ = makeDefElem("force_array", (Node *) > makeBoolean(true), @1); > + } > ; > > I believe we don't need to support new options in old-style syntax. > you are right about the force_array case. we don't need to add force_array related changes in gram.y. On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Jan-23, jian he wrote: > > > > + | FORMAT_LA copy_generic_opt_arg > > > + { > > > + $$ = makeDefElem("format", $2, @1); > > > + } > > > ; > > > > > > I think it's not necessary. "format" option is already handled in > > > copy_generic_opt_elem. > > > > test it, I found out this part is necessary. > > because a query with WITH like `copy (select 1) to stdout with > > (format json, force_array false); ` will fail. > > Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c > (see base_yylex there). I'm not really sure but I think it might be > better to make it "| FORMAT_LA JSON" instead of invoking the whole > copy_generic_opt_arg syntax. Not because of performance, but just > because it's much clearer what's going on. > I am not sure what alternative you are referring to. I've rebased the patch, made some cosmetic changes. Now I think it's pretty neat. you can, based on it, make your change, then I may understand the alternative you are referring to.
Attachment
Hello everyone!

Thanks for working on this, really nice feature!

> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
>
> Thanks -- will have a look

Joe, recently folks have proposed a lot of patches in this thread that seem
to have diverged from the original implementation approach.
As the author of CF entry [0], can you please comment on which patch version
needs review?

Thanks!

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4716/
On 3/8/24 12:28, Andrey M. Borodin wrote:
> Hello everyone!
>
> Thanks for working on this, really nice feature!
>
>> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote:
>>
>> Thanks -- will have a look
>
> Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation.
> As an author of CF entry [0] can you please comment on which patch version needs review?

I don't know if I agree with the proposed changes, but I have also been
waiting to see how the parallel discussion regarding COPY extensibility
shakes out.

And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have
not quite given up hope yet either.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 9, 2024 at 2:03 AM Joe Conway <mail@joeconway.com> wrote: > > On 3/8/24 12:28, Andrey M. Borodin wrote: > > Hello everyone! > > > > Thanks for working on this, really nice feature! > > > >> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote: > >> > >> Thanks -- will have a look > > > > Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation. > > As an author of CF entry [0] can you please comment on which patch version needs review? > > > I don't know if I agree with the proposed changes, but I have also been > waiting to see how the parallel discussion regarding COPY extensibility > shakes out. > > And there were a couple of issues found that need to be tracked down. > > Additionally I have had time/availability challenges recently. > > Overall, chances seem slim that this will make it into 17, but I have > not quite given up hope yet either. Hi. summary changes I've made in v9 patches at [0] meta: rebased. Now you need to use `git apply` or `git am`, previously copyto_json.007.diff, you need to use GNU patch. at [1], Dean Rasheed found some corner cases when the returned slot's tts_tupleDescriptor from ` ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true); processed = ((DR_copy *) cstate->queryDesc->dest)->processed; ` cannot be used for composite_to_json. generally DestReceiver->rStartup is to send the TupleDesc to the DestReceiver, The COPY TO DestReceiver's rStartup function is copy_dest_startup, however copy_dest_startup is a no-op. That means to make the final TupleDesc of COPY TO (FORMAT JSON) operation bullet proof, we need to copy the tupDesc from CopyToState's queryDesc. This only applies to when the COPY TO source is a query (example: copy (select 1) to stdout), not a table. The above is my interpretation. at [2], Masahiko Sawada made several points. Mainly split the patch to two, one for format json, second is for options force_array. Splitting into two is easier to review, I think. My changes also addressed all the points Masahiko Sawada had mentioned. [0] https://postgr.es/m/CACJufxHd6ZRmJJBsDOGpovaVAekMS-u6AOrcw0Ja-Wyi-0kGtA@mail.gmail.com [1] https://postgr.es/m/CAEZATCWh29787xf=4NgkoixeqRHrqi0Qd33Z6_-F8t2dZ0yLCQ@mail.gmail.com [2] https://postgr.es/m/CAD21AoCb02zhZM3vXb8HSw8fwOsL+iRdEFb--Kmunv8PjPAWjw@mail.gmail.com
On Sat, Mar 9, 2024 at 9:13 AM jian he <jian.universality@gmail.com> wrote: > > On Sat, Mar 9, 2024 at 2:03 AM Joe Conway <mail@joeconway.com> wrote: > > > > On 3/8/24 12:28, Andrey M. Borodin wrote: > > > Hello everyone! > > > > > > Thanks for working on this, really nice feature! > > > > > >> On 9 Jan 2024, at 01:40, Joe Conway <mail@joeconway.com> wrote: > > >> > > >> Thanks -- will have a look > > > > > > Joe, recently folks proposed a lot of patches in this thread that seem like diverted from original way of implementation. > > > As an author of CF entry [0] can you please comment on which patch version needs review? > > > > > > I don't know if I agree with the proposed changes, but I have also been > > waiting to see how the parallel discussion regarding COPY extensibility > > shakes out. > > > > And there were a couple of issues found that need to be tracked down. > > > > Additionally I have had time/availability challenges recently. > > > > Overall, chances seem slim that this will make it into 17, but I have > > not quite given up hope yet either. > > Hi. > summary changes I've made in v9 patches at [0] > > meta: rebased. Now you need to use `git apply` or `git am`, previously > copyto_json.007.diff, you need to use GNU patch. > > > at [1], Dean Rasheed found some corner cases when the returned slot's > tts_tupleDescriptor > from > ` > ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true); > processed = ((DR_copy *) cstate->queryDesc->dest)->processed; > ` > cannot be used for composite_to_json. > generally DestReceiver->rStartup is to send the TupleDesc to the DestReceiver, > The COPY TO DestReceiver's rStartup function is copy_dest_startup, > however copy_dest_startup is a no-op. > That means to make the final TupleDesc of COPY TO (FORMAT JSON) > operation bullet proof, > we need to copy the tupDesc from CopyToState's queryDesc. > This only applies to when the COPY TO source is a query (example: > copy (select 1) to stdout), not a table. > The above is my interpretation. > trying to simplify the explanation. first refer to the struct DestReceiver. COPY TO (FORMAT JSON), we didn't send the preliminary Tupdesc to the DestReceiver via the rStartup function pointer within struct _DestReceiver. `CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)` the slot is the final slot returned via query execution. but we cannot use the tupdesc (slot->tts_tupleDescriptor) to do composite_to_json. because the final return slot Tupdesc may change during the query execution. so we need to copy the tupDesc from CopyToState's queryDesc. aslo rebased, now we can apply it cleanly.