Thread: WIP Incremental JSON Parser
Quite a long time ago Robert asked me about the possibility of an incremental JSON parser. I wrote one, and I've tweaked it a bit, but the performance is significantly worse than that of the current Recursive Descent parser. Nevertheless, I'm attaching my current WIP state for it, and I'll add it to the next CF to keep the conversation going.

One possible use would be in parsing large manifest files for incremental backup. However, it struck me a few days ago that this might not work all that well. The current parser and the new parser both palloc() space for each field name and scalar token in the JSON (unless they aren't used, which is normally not the case), and they don't free it, so that, particularly if done in frontend code, this amounts to a possible memory leak unless the semantic routines do the freeing themselves. So while we can save some memory by not having to slurp in the whole JSON in one hit, we aren't saving any of that other allocation of memory, which amounts to almost as much space as the raw JSON.

In any case, I've had fun, so it's not a total loss come what may :-)

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse that that of the current Recursive
> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
> and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than the current parser, although that probably necessitates keeping both, which isn't great, but I don't see a better alternative.

> One possible use would be in parsing large manifest files for
> incremental backup. However, it struck me a few days ago that this might
> not work all that well. The current parser and the new parser both
> palloc() space for each field name and scalar token in the JSON (unless
> they aren't used, which is normally not the case), and they don't free
> it, so that particularly if done in frontend code this amounts to a
> possible memory leak, unless the semantic routines do the freeing
> themselves. So while we can save some memory by not having to slurp in
> the whole JSON in one hit, we aren't saving any of that other allocation
> of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the backup_manifest file is 2GB, and instead of creating a 2GB buffer, you create a 1MB buffer and feed the data to the parser in 1MB chunks. Well, that saves 2GB less 1MB, full stop. Now if we address the issue you raise here in some way, we can potentially save even more memory, which is great, but even if we don't, we still saved a bunch of memory that could not have been saved in any other way.

As far as addressing that other issue, we could address the issue either by having the semantic routines free the memory if they don't need it, or alternatively by having the parser itself free the memory after invoking any callbacks to which it might be passed. The latter approach feels more conceptually pure, but the former might be the more practical approach. I think what really matters here is that we document who must or may do which things. When a callback gets passed a pointer, we can document either that (1) it's a palloc'd chunk that the callback can free if they want or (2) that it's a palloc'd chunk that the caller must not free or (3) that it's not a palloc'd chunk. We can further document the memory context in which the chunk will be allocated, if applicable, and when/if the parser will free it.

-- Robert Haas EDB: http://www.enterprisedb.com
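For illustration, a minimal sketch of the chunked-feed loop described above. Only pg_parse_json_incremental() and JSON_INCOMPLETE come from the WIP patch; the helper name, the buffer size, and the error handling here are assumptions.

#define READ_CHUNK_SIZE (1024 * 1024)

/*
 * Sketch only: feed a large file to the incremental parser in fixed-size
 * chunks instead of slurping the whole thing into one buffer.
 * JSON_INCOMPLETE is taken to mean "the parser wants more input".
 */
static JsonParseErrorType
parse_in_chunks(FILE *fp, JsonLexContext *lex, JsonSemAction *sem)
{
    char       *buf = palloc(READ_CHUNK_SIZE);
    JsonParseErrorType res;

    for (;;)
    {
        size_t      nread = fread(buf, 1, READ_CHUNK_SIZE, fp);
        bool        is_last = (nread < READ_CHUNK_SIZE);    /* EOF (or read error) */

        res = pg_parse_json_incremental(lex, sem, buf, nread, is_last);

        if (res != JSON_INCOMPLETE || is_last)
            break;
    }

    pfree(buf);
    return res;
}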
On 2024-01-02 Tu 10:14, Robert Haas wrote:
> On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Quite a long time ago Robert asked me about the possibility of an
>> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
>> performance is significantly worse that that of the current Recursive
>> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
>> and I'll add it to the next CF to keep the conversation going.
> Thanks for doing this. I think it's useful even if it's slower than
> the current parser, although that probably necessitates keeping both,
> which isn't great, but I don't see a better alternative.
>
>> One possible use would be in parsing large manifest files for
>> incremental backup. However, it struck me a few days ago that this might
>> not work all that well. The current parser and the new parser both
>> palloc() space for each field name and scalar token in the JSON (unless
>> they aren't used, which is normally not the case), and they don't free
>> it, so that particularly if done in frontend code this amounts to a
>> possible memory leak, unless the semantic routines do the freeing
>> themselves. So while we can save some memory by not having to slurp in
>> the whole JSON in one hit, we aren't saving any of that other allocation
>> of memory, which amounts to almost as much space as the raw JSON.
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create an 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.
>
> As far as addressing that other issue, we could address the issue
> either by having the semantic routines free the memory if they don't
> need it, or alternatively by having the parser itself free the memory
> after invoking any callbacks to which it might be passed. The latter
> approach feels more conceptually pure, but the former might be the
> more practical approach. I think what really matters here is that we
> document who must or may do which things. When a callback gets passed
> a pointer, we can document either that (1) it's a palloc'd chunk that
> the calllback can free if they want or (2) that it's a palloc'd chunk
> that the caller must not free or (3) that it's not a palloc'd chunk.
> We can further document the memory context in which the chunk will be
> allocated, if applicable, and when/if the parser will free it.

Yeah. One idea I had yesterday was to stash the field names, which in large JSON docs tend to be pretty repetitive, in a hash table instead of pstrduping each instance. The name would be valid until the end of the parse, and would only need to be duplicated by the callback function if it were needed beyond that. That's not the case currently with the parse_manifest code. I'll work on using a hash table.

The parse_manifest code does seem to pfree the scalar values it no longer needs fairly well, so maybe we don't need to do anything there.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Yeah. One idea I had yesterday was to stash the field names, which in
> large JSON docs tent to be pretty repetitive, in a hash table instead of
> pstrduping each instance. The name would be valid until the end of the
> parse, and would only need to be duplicated by the callback function if
> it were needed beyond that. That's not the case currently with the
> parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON probably wants to discard the duplicated labels and convert other heavily duplicated strings to enum values or something (e.g. if every record has {"color":"red"} or {"color":"green"}). So the hash table lookups will cost something, but they won't really save anything more than just freeing the memory that isn't needed, and they will probably be more expensive.

> The parse_manifest code does seem to pfree the scalar values it no
> longer needs fairly well, so maybe we don't need to to anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

-- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-03 We 08:45, Robert Haas wrote:
> On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Yeah. One idea I had yesterday was to stash the field names, which in
>> large JSON docs tent to be pretty repetitive, in a hash table instead of
>> pstrduping each instance. The name would be valid until the end of the
>> parse, and would only need to be duplicated by the callback function if
>> it were needed beyond that. That's not the case currently with the
>> parse_manifest code. I'll work on using a hash table.
> IMHO, this is not a good direction. Anybody who is parsing JSON
> probably wants to discard the duplicated labels and convert other
> heavily duplicated strings to enum values or something. (e.g. if every
> record has {"color":"red"} or {"color":"green"}). So the hash table
> lookups will cost but won't really save anything more than just
> freeing the memory not needed, but will probably be more expensive.

I don't quite follow.

Say we have a document with an array of 1m objects, each with a field called "color". As it stands we'll allocate space for that field name 1m times. Using a hash table we'd allocate space for it once. And allocating the memory isn't free, although it might be cheaper than doing hash lookups.

I guess we can benchmark it and see what the performance impact of using a hash table might be.

Another possibility would be simply to have the callback free the field name after use. For the parse_manifest code that could be a one-line addition to the code at the bottom of json_object_manifest_field_start().

>> The parse_manifest code does seem to pfree the scalar values it no
>> longer needs fairly well, so maybe we don't need to to anything there.
> Hmm. This makes me wonder if you've measured how much actual leakage there is?

No I haven't. I have simply theorized about how much memory we might consume if nothing were done by the callers to free the memory.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Say we have a document with an array 1m objects, each with a field > called "color". As it stands we'll allocate space for that field name 1m > times. Using a hash table we'd allocated space for it once. And > allocating the memory isn't free, although it might be cheaper than > doing hash lookups. > > I guess we can benchmark it and see what the performance impact of using > a hash table might be. > > Another possibility would be simply to have the callback free the field > name after use. for the parse_manifest code that could be a one-line > addition to the code at the bottom of json_object_manifest_field_start(). Yeah. So I'm arguing that allocating the memory each time and then freeing it sounds cheaper than looking it up in the hash table every time, discovering it's there, and thus skipping the allocate/free. I might be wrong about that. It's just that allocating and freeing a small chunk of memory should boil down to popping it off of a linked list and then pushing it back on. And that sounds cheaper than hashing the string and looking for it in a hash bucket. -- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-03 We 10:12, Robert Haas wrote: > On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Say we have a document with an array 1m objects, each with a field >> called "color". As it stands we'll allocate space for that field name 1m >> times. Using a hash table we'd allocated space for it once. And >> allocating the memory isn't free, although it might be cheaper than >> doing hash lookups. >> >> I guess we can benchmark it and see what the performance impact of using >> a hash table might be. >> >> Another possibility would be simply to have the callback free the field >> name after use. for the parse_manifest code that could be a one-line >> addition to the code at the bottom of json_object_manifest_field_start(). > Yeah. So I'm arguing that allocating the memory each time and then > freeing it sounds cheaper than looking it up in the hash table every > time, discovering it's there, and thus skipping the allocate/free. > > I might be wrong about that. It's just that allocating and freeing a > small chunk of memory should boil down to popping it off of a linked > list and then pushing it back on. And that sounds cheaper than hashing > the string and looking for it in a hash bucket. OK, cleaning up in the client code will be much simpler, so let's go with that for now and revisit it later if necessary. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
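To make the client-side cleanup concrete, a field-start callback that takes ownership of the name might look roughly like this; the callback name, the state type, and the lookup helper are made up for illustration, and the json_ofield_action signature is assumed to match the one in jsonapi.h.

/*
 * Illustrative only: an object-field-start callback that frees the
 * parser-allocated field name once it has been consumed, per option (1)
 * above.  MyParseState and lookup_field_id() are hypothetical.
 */
static JsonParseErrorType
my_object_field_start(void *state, char *fname, bool isnull)
{
    MyParseState *parse = state;

    /* record which field we are about to see the value for */
    parse->next_field = lookup_field_id(fname);

    /* the callback owns the name and may free it */
    pfree(fname);

    return JSON_SUCCESS;
}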
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote: > It seems like a pretty significant savings no matter what. Suppose the > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you > create an 1MB buffer and feed the data to the parser in 1MB chunks. > Well, that saves 2GB less 1MB, full stop. Now if we address the issue > you raise here in some way, we can potentially save even more memory, > which is great, but even if we don't, we still saved a bunch of memory > that could not have been saved in any other way. You could also build a streaming incremental parser. That is, one that outputs a path and a leaf value (where leaf values are scalar values, `null`, `true`, `false`, numbers, and strings). Then if the caller is doing something JSONPath-like then the caller can probably immediately free almost all allocations and even terminate the parse early. Nico --
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams <nico@cryptonector.com> wrote: > On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote: > > It seems like a pretty significant savings no matter what. Suppose the > > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you > > create an 1MB buffer and feed the data to the parser in 1MB chunks. > > Well, that saves 2GB less 1MB, full stop. Now if we address the issue > > you raise here in some way, we can potentially save even more memory, > > which is great, but even if we don't, we still saved a bunch of memory > > that could not have been saved in any other way. > > You could also build a streaming incremental parser. That is, one that > outputs a path and a leaf value (where leaf values are scalar values, > `null`, `true`, `false`, numbers, and strings). Then if the caller is > doing something JSONPath-like then the caller can probably immediately > free almost all allocations and even terminate the parse early. I think our current parser is event-based rather than this ... but it seems like this could easily be built on top of it, if someone wanted to. -- Robert Haas EDB: http://www.enterprisedb.com
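A sketch (hypothetical names throughout; array indexes omitted for brevity) of how such a path-plus-leaf consumer could be layered on the existing event callbacks: keep the current path in the semantic state and handle each (path, scalar) pair as it arrives.

#define MAX_PATH_DEPTH 64

typedef struct PathState
{
    char       *fields[MAX_PATH_DEPTH]; /* current object-field path */
    int         depth;
} PathState;

static JsonParseErrorType
path_field_start(void *state, char *fname, bool isnull)
{
    PathState  *ps = state;

    /* real code would check depth against MAX_PATH_DEPTH here */
    ps->fields[ps->depth++] = fname;
    return JSON_SUCCESS;
}

static JsonParseErrorType
path_field_end(void *state, char *fname, bool isnull)
{
    PathState  *ps = state;

    /* freeing the name is governed by whichever ownership rule is settled on above */
    ps->depth--;
    return JSON_SUCCESS;
}

static JsonParseErrorType
path_scalar(void *state, char *token, JsonTokenType tokentype)
{
    PathState  *ps = state;

    /* handle_leaf() is caller-defined; it may free or keep the token */
    handle_leaf(ps->fields, ps->depth, token, tokentype);
    return JSON_SUCCESS;
}

These would be installed in the object_field_start, object_field_end, and scalar members of a JsonSemAction, with a PathState as the semstate.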
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse that that of the current Recursive
> Descent parser.

The prediction stack is neat. It seems like the main loop is hit so many thousands of times that micro-optimization would be necessary... I attached a sample diff to get rid of the strlen calls during push_prediction(), which speeds things up a bit (8-15%, depending on optimization level) on my machines.

Maybe it's possible to condense some of those productions down, and reduce the loop count? E.g. does every "scalar" production need to go three times through the loop/stack, or can the scalar semantic action just peek at the next token prediction and do all the callback work at once?

> +            case JSON_SEM_SCALAR_CALL:
> +                {
> +                    json_scalar_action sfunc = sem->scalar;
> +
> +                    if (sfunc != NULL)
> +                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
> +                }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or does it disappear if the parser hits an incomplete token?

> One possible use would be in parsing large manifest files for
> incremental backup.

I'm keeping an eye on this thread for OAuth, since the clients have to parse JSON as well. Those responses tend to be smaller, though, so you'd have to really be hurting for resources to need this.

--Jacob
Attachment
On 2024-01-09 Tu 13:46, Jacob Champion wrote: > On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Quite a long time ago Robert asked me about the possibility of an >> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the >> performance is significantly worse that that of the current Recursive >> Descent parser. > The prediction stack is neat. It seems like the main loop is hit so > many thousands of times that micro-optimization would be necessary... > I attached a sample diff to get rid of the strlen calls during > push_prediction(), which speeds things up a bit (8-15%, depending on > optimization level) on my machines. Thanks for looking! I've been playing around with a similar idea, but yours might be better. > > Maybe it's possible to condense some of those productions down, and > reduce the loop count? E.g. does every "scalar" production need to go > three times through the loop/stack, or can the scalar semantic action > just peek at the next token prediction and do all the callback work at > once? Also a good suggestion. Will look and see. IIRC I had trouble with this bit. > >> + case JSON_SEM_SCALAR_CALL: >> + { >> + json_scalar_action sfunc = sem->scalar; >> + >> + if (sfunc != NULL) >> + (*sfunc) (sem->semstate, scalar_val, scalar_tok); >> + } > Is it safe to store state (scalar_val/scalar_tok) on the stack, or > does it disappear if the parser hits an incomplete token? Good point. In fact it might be responsible for the error I'm currently trying to get to the bottom of. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-01-09 Tu 13:46, Jacob Champion wrote: > On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Quite a long time ago Robert asked me about the possibility of an >> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the >> performance is significantly worse that that of the current Recursive >> Descent parser. > The prediction stack is neat. It seems like the main loop is hit so > many thousands of times that micro-optimization would be necessary... > I attached a sample diff to get rid of the strlen calls during > push_prediction(), which speeds things up a bit (8-15%, depending on > optimization level) on my machines. > > Maybe it's possible to condense some of those productions down, and > reduce the loop count? E.g. does every "scalar" production need to go > three times through the loop/stack, or can the scalar semantic action > just peek at the next token prediction and do all the callback work at > once? > >> + case JSON_SEM_SCALAR_CALL: >> + { >> + json_scalar_action sfunc = sem->scalar; >> + >> + if (sfunc != NULL) >> + (*sfunc) (sem->semstate, scalar_val, scalar_tok); >> + } > Is it safe to store state (scalar_val/scalar_tok) on the stack, or > does it disappear if the parser hits an incomplete token? > >> One possible use would be in parsing large manifest files for >> incremental backup. > I'm keeping an eye on this thread for OAuth, since the clients have to > parse JSON as well. Those responses tend to be smaller, though, so > you'd have to really be hurting for resources to need this. > I've incorporated your suggestion, and fixed the bug you identified. The attached also adds support for incrementally parsing backup manifests, and uses that in the three places we call the manifest parser. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there were CFbot test failures last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/4725/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725 Kind Regards, Peter Smith.
On 2024-01-22 Mo 01:29, Peter Smith wrote: > 2024-01 Commitfest. > > Hi, This patch has a CF status of "Needs Review" [1], but it seems > there were CFbot test failures last time it was run [2]. Please have a > look and post an updated version if necessary. Thanks. Let's see if the attached does better. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote: > > On 2024-01-22 Mo 01:29, Peter Smith wrote: >> 2024-01 Commitfest. >> >> Hi, This patch has a CF status of "Needs Review" [1], but it seems >> there were CFbot test failures last time it was run [2]. Please have a >> look and post an updated version if necessary. > > > Thanks. > > Let's see if the attached does better. This time for sure! (Watch me pull a rabbit out of my hat!) It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed it, and I think the test will now pass. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote: > > On 2024-01-22 Mo 14:16, Andrew Dunstan wrote: >> >> On 2024-01-22 Mo 01:29, Peter Smith wrote: >>> 2024-01 Commitfest. >>> >>> Hi, This patch has a CF status of "Needs Review" [1], but it seems >>> there were CFbot test failures last time it was run [2]. Please have a >>> look and post an updated version if necessary. >> >> >> Thanks. >> >> Let's see if the attached does better. > > > > This time for sure! (Watch me pull a rabbit out of my hat!) > > > It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed > it, and I think the test will now pass. > > > Fixed one problem but there are some others. I'm hoping this will satisfy the cfbot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at <https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>, and ran what I think are the same tests dozens of times, but the failure did not reappear in my setup. Unfortunately, the test doesn't show the failing manifest or log the failing field, so trying to divine what happened here is more than difficult.
Not sure how to address this.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
>
> # Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
> pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
> pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
> [02:41:07.830](0.073s) not ok 2 - incremental backup from node1
> [02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
>
> I have set up a Debian 12 EC2 instance following the recipe at <https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>, and ran what I think are the same tests dozens of times, but the failure did not reappear in my setup. Unfortunately, the test doesn't show the failing manifest or log the failing field, so trying to divine what happened here is more than difficult.
>
> Not sure how to address this.

Yeah, that's really odd. The backup size field is printed like this:

appendStringInfo(&buf, "\"Size\": %zu, ", size);

And parsed like this:

size = strtoul(parse->size, &ep, 10);
if (*ep)
    json_manifest_parse_failure(parse->context,
                                "file size is not an integer");

I confess to bafflement -- how could the output of the first fail to be parsed by the second? The manifest has to look pretty much valid in order not to error out before it gets to this check, with just that one field corrupted. But I don't understand how that could happen.

I agree that the error reporting could be better here, but it didn't seem worth spending time on when I wrote the code. I figured the only way we could end up with something like "Size": "Walrus" is if the user was messing with us on purpose. Apparently that's not so, yet the mechanism eludes me. Or maybe it's not some random string, but is something like an empty string or a number with trailing garbage or a number that's out of range. But I don't see how any of those things can happen either.

Maybe you should adjust your patch to dump the manifests into the log file with note(). Then when cfbot runs on it you can see exactly what the raw file looks like. Although I wonder if it's possible that the manifest itself is OK, but somehow it gets garbled when uploaded to the server, either because the client code that sends it or the server code that receives it does something that isn't safe in 32-bit mode. If we hypothesize an off-by-one error or a buffer overrun, that could possibly explain how one field got garbled while the rest of the file is OK.

-- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-24 We 13:08, Robert Haas wrote: > > Maybe you should adjust your patch to dump the manifests into the log > file with note(). Then when cfbot runs on it you can see exactly what > the raw file looks like. Although I wonder if it's possible that the > manifest itself is OK, but somehow it gets garbled when uploaded to > the server, either because the client code that sends it or the server > code that receives it does something that isn't safe in 32-bit mode. > If we hypothesize an off-by-one error or a buffer overrun, that could > possibly explain how one field got garbled while the rest of the file > is OK. Yeah, I thought earlier today I was on the track of an off by one error, but I was apparently mistaken, so here's the same patch set with an extra patch that logs a bunch of stuff, and might let us see what's upsetting the cfbot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:
> On 2024-01-24 We 13:08, Robert Haas wrote:
>> Maybe you should adjust your patch to dump the manifests into the log
>> file with note(). Then when cfbot runs on it you can see exactly what
>> the raw file looks like. Although I wonder if it's possible that the
>> manifest itself is OK, but somehow it gets garbled when uploaded to
>> the server, either because the client code that sends it or the server
>> code that receives it does something that isn't safe in 32-bit mode.
>> If we hypothesize an off-by-one error or a buffer overrun, that could
>> possibly explain how one field got garbled while the rest of the file
>> is OK.
>
> Yeah, I thought earlier today I was on the track of an off by one
> error, but I was apparently mistaken, so here's the same patch set
> with an extra patch that logs a bunch of stuff, and might let us see
> what's upsetting the cfbot.

Well, that didn't help a lot, but meanwhile the CFBot seems to have decided in the last few days that it's now happy, so full steam ahead! ;-)

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Well, that didn't help a lot, but meanwhile the CFBot seems to have > decided in the last few days that it's now happy, so full steam aead! ;-) I haven't been able to track down the root cause yet, but I am able to reproduce the failure consistently on my machine: ERROR: could not parse backup manifest: file size is not an integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 Full log is attached. --Jacob
Attachment
On 2024-02-20 Tu 19:53, Jacob Champion wrote: > On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> Well, that didn't help a lot, but meanwhile the CFBot seems to have >> decided in the last few days that it's now happy, so full steam aead! ;-) > I haven't been able to track down the root cause yet, but I am able to > reproduce the failure consistently on my machine: > > ERROR: could not parse backup manifest: file size is not an > integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", > "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, > { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 > > Full log is attached. > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can you give me details of the build? OS, compiler, path to source, build setup etc.? Anything that might be remotely relevant. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> you give me details of the build? OS, compiler, path to source, build
> setup etc.? Anything that might be remotely relevant.

Sure:

- guest VM running in UTM (QEMU 7.2) is Ubuntu 22.04 for ARM, default core count, 8GB
- host is macOS Sonoma 14.3.1, Apple Silicon (M3 Pro), 36GB
- it's a Meson build (plus a diff for the test utilities, attached, but that's hopefully not relevant to the failure)
- buildtype=debug, optimization=g, cassert=true
- GCC 11.4
- build path is nested a bit (~/src/postgres/worktree-inc-json/build-dev)

--Jacob
Attachment
On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> > you give me details of the build? OS, compiler, path to source, build
> > setup etc.? Anything that might be remotely relevant.

This construction seems suspect, in json_lex_number():

> if (lex->incremental && !lex->inc_state->is_last_chunk &&
>     len >= lex->input_length)
> {
>     appendStringInfoString(&lex->inc_state->partial_token,
>                            lex->token_start);
>     return JSON_INCOMPLETE;
> }

appendStringInfoString() isn't respecting the end of the chunk: if there's extra data after the chunk boundary (as AppendIncrementalManifestData() does) then all of that will be stuck onto the end of the partial_token.

I'm about to context-switch off of this for the day, but I can work on a patch tomorrow if that'd be helpful. It looks like this is not the only call to appendStringInfoString().

--Jacob
On 2024-02-21 We 15:26, Jacob Champion wrote:
> On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion
> <jacob.champion@enterprisedb.com> wrote:
>> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
>>> you give me details of the build? OS, compiler, path to source, build
>>> setup etc.? Anything that might be remotely relevant.
> This construction seems suspect, in json_lex_number():
>
>> if (lex->incremental && !lex->inc_state->is_last_chunk &&
>>     len >= lex->input_length)
>> {
>>     appendStringInfoString(&lex->inc_state->partial_token,
>>                            lex->token_start);
>>     return JSON_INCOMPLETE;
>> }
> appendStringInfoString() isn't respecting the end of the chunk: if
> there's extra data after the chunk boundary (as
> AppendIncrementalManifestData() does) then all of that will be stuck
> onto the end of the partial_token.
>
> I'm about to context-switch off of this for the day, but I can work on
> a patch tomorrow if that'd be helpful. It looks like this is not the
> only call to appendStringInfoString().

Yeah, the issue seems to be with chunks of json that are not null-terminated. We don't require that they be, so this code was buggy. It wasn't picked up earlier because the tests that I wrote did put a null byte at the end. Patch 5 in this series fixes those issues and adjusts most of the tests to add some trailing junk to the pieces of json, so we can be sure that this is done right.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
- v8-0001-Introduce-a-non-recursive-JSON-parser.patch
- v8-0002-Add-support-for-incrementally-parsing-backup-mani.patch
- v8-0003-Use-incremental-parsing-of-backup-manifests.patch
- v8-0004-add-logging-traces-to-see-if-we-can-find-out-what.patch
- v8-0005-fixes-for-non-null-terminated-inputs-for-incremen.patch
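The shape of the non-null-terminated-chunk fix in patch 5 is roughly as follows (a sketch rather than the exact hunk from the attachment): copy only the bytes that remain in the current chunk, instead of relying on a trailing NUL.

/*
 * Sketch only: append just the bytes belonging to this chunk to the
 * partial_token buffer.  The previous appendStringInfoString() call
 * would read past the chunk boundary when the input is not
 * NUL-terminated.
 */
if (lex->incremental && !lex->inc_state->is_last_chunk &&
    len >= lex->input_length)
{
    appendBinaryStringInfo(&lex->inc_state->partial_token,
                           lex->token_start,
                           lex->input + lex->input_length - lex->token_start);
    return JSON_INCOMPLETE;
}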
On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Patch 5 in this series fixes those issues and > adjusts most of the tests to add some trailing junk to the pieces of > json, so we can be sure that this is done right. This fixes the test failure for me, thanks! I've attached my current mesonification diff, which just adds test_json_parser to the suite. It relies on the PATH that's set up, which appears to include the build directory for both VPATH builds and Meson. Are there plans to fill out the test suite more? Since we should be able to control all the initial conditions, it'd be good to get fairly comprehensive coverage of the new code. As an aside, I find the behavior of need_escapes=false to be completely counterintuitive. I know the previous code did this, but it seems like the opposite of "provides unescaped strings" should be "provides raw strings", not "all strings are now NULL". --Jacob
Attachment
On 2024-02-22 Th 15:29, Jacob Champion wrote: > On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Patch 5 in this series fixes those issues and >> adjusts most of the tests to add some trailing junk to the pieces of >> json, so we can be sure that this is done right. > This fixes the test failure for me, thanks! I've attached my current > mesonification diff, which just adds test_json_parser to the suite. It > relies on the PATH that's set up, which appears to include the build > directory for both VPATH builds and Meson. OK, thanks, will add this in the next version. > > Are there plans to fill out the test suite more? Since we should be > able to control all the initial conditions, it'd be good to get fairly > comprehensive coverage of the new code. Well, it's tested (as we know) by the backup manifest tests. During development, I tested by making the regular parser use the non-recursive parser (see FORCE_JSON_PSTACK). That doesn't test the incremental piece of it, but it does check that the rest of it is doing the right thing. We could probably extend the incremental test by making it output a stream of tokens and making sure that was correct. > As an aside, I find the behavior of need_escapes=false to be > completely counterintuitive. I know the previous code did this, but it > seems like the opposite of "provides unescaped strings" should be > "provides raw strings", not "all strings are now NULL". Yes, we could possibly call it "need_strings" or something like that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Feb 22, 2024 at 3:43 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Are there plans to fill out the test suite more? Since we should be
> > able to control all the initial conditions, it'd be good to get fairly
> > comprehensive coverage of the new code.
>
> Well, it's tested (as we know) by the backup manifest tests. During
> development, I tested by making the regular parser use the
> non-recursive parser (see FORCE_JSON_PSTACK). That doesn't test the
> incremental piece of it, but it does check that the rest of it is doing
> the right thing. We could probably extend the incremental test by making
> it output a stream of tokens and making sure that was correct.

That would also cover all the semantic callbacks (currently, OFIELD_END and AELEM_* are uncovered), so +1 from me. Looking at lcov, it'd be good to

- test failure modes as well as the happy path, so we know we're rejecting invalid syntax correctly
- test the prediction stack resizing code
- provide targeted coverage of the partial token support, since at the moment we're at the mercy of the manifest format and the default chunk size

As a brute force example of the latter, with the attached diff I get test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

> > As an aside, I find the behavior of need_escapes=false to be
> > completely counterintuitive. I know the previous code did this, but it
> > seems like the opposite of "provides unescaped strings" should be
> > "provides raw strings", not "all strings are now NULL".
>
> Yes, we could possibly call it "need_strings" or something like that.

+1

--Jacob
On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > As a brute force example of the latter, with the attached diff I get > test failures at chunk sizes 1, 2, 3, 4, 6, and 12. But this time with the diff. --Jacob
Attachment
On 2024-02-26 Mo 10:10, Jacob Champion wrote: > On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: >> As a brute force example of the latter, with the attached diff I get >> test failures at chunk sizes 1, 2, 3, 4, 6, and 12. > But this time with the diff. > Ouch. I'll check it out. Thanks! cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-02-26 Mo 19:20, Andrew Dunstan wrote: > > On 2024-02-26 Mo 10:10, Jacob Champion wrote: >> On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion >> <jacob.champion@enterprisedb.com> wrote: >>> As a brute force example of the latter, with the attached diff I get >>> test failures at chunk sizes 1, 2, 3, 4, 6, and 12. >> But this time with the diff. >> > > Ouch. I'll check it out. Thanks! > > > The good news is that the parser is doing fine - this issue was due to a thinko on my part in the test program that got triggered by the input file size being an exact multiple of the chunk size. I'll have a new patch set later this week. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Feb 26, 2024 at 9:20 PM Andrew Dunstan <andrew@dunslane.net> wrote: > The good news is that the parser is doing fine - this issue was due to a > thinko on my part in the test program that got triggered by the input > file size being an exact multiple of the chunk size. I'll have a new > patch set later this week. Ah, feof()! :D Good to know it's not the partial token logic. --Jacob
Some more observations as I make my way through the patch:

In src/common/jsonapi.c,

> +#define JSON_NUM_NONTERMINALS 6

Should this be 5 now?

> +    res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
> +                                    chunk, size, is_last);
> +
> +    expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;
> +
> +    if (res != expected)
> +        json_manifest_parse_failure(context, "parsing failed");

This leads to error messages like

pg_verifybackup: error: could not parse backup manifest: parsing failed

which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code.

Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`.

For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl:

> -    ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds");
> -    ok(!$stderr, "chunk size $size: no error output");
> +    like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds");
> +    is($stderr, "", "chunk size $size: no error output");

This is particularly helpful when a test fails spuriously due to code coverage spray on stderr.

Thanks,
--Jacob

[1] https://www.postgresql.org/message-id/CAOYmi%2BmSSY4SvOtVN7zLyUCQ4-RDkxkzmTuPEN%2Bt-PsB7GHnZA%40mail.gmail.com
On 2024-03-07 Th 10:28, Jacob Champion wrote: > Some more observations as I make my way through the patch: > > In src/common/jsonapi.c, > >> +#define JSON_NUM_NONTERMINALS 6 > Should this be 5 now? Yep. > >> + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), >> + chunk, size, is_last); >> + >> + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; >> + >> + if (res != expected) >> + json_manifest_parse_failure(context, "parsing failed"); > This leads to error messages like > > pg_verifybackup: error: could not parse backup manifest: parsing failed > > which I would imagine is going to lead to confused support requests in > the event that someone does manage to corrupt their manifest. Can we > make use of json_errdetail() and print the line and column numbers? > Patch 0001 over at [1] has one approach to making json_errdetail() > workable in frontend code. Looks sound on a first look. Maybe we should get that pushed ASAP so we can take advantage of it. > > Top-level scalars like `false` or `12345` do not parse correctly if > the chunk size is too small; instead json_errdetail() reports 'Token > "" is invalid'. With small chunk sizes, json_errdetail() additionally > segfaults on constructions like `[tru]` or `12zz`. Ugh. Will investigate. > > For my local testing, I'm carrying the following diff in > 001_test_json_parser_incremental.pl: > >> - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); >> - ok(!$stderr, "chunk size $size: no error output"); >> + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); >> + is($stderr, "", "chunk size $size: no error output"); > This is particularly helpful when a test fails spuriously due to code > coverage spray on stderr. > Makes sense, thanks. I'll have a fresh patch set soon which will also take care of the bitrot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
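As a concrete sketch of the error-reporting suggestion (assuming json_errdetail() does become available to frontend code, per the patch referenced above), the manifest-parsing call could pass the detail through rather than a fixed string:

/*
 * Sketch only: surface the parser's own error detail instead of the
 * generic "parsing failed" message.
 */
res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
                                chunk, size, is_last);

expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;

if (res != expected)
    json_manifest_parse_failure(context,
                                json_errdetail(res, &(incstate->lex)));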
On 2024-03-07 Th 22:42, Andrew Dunstan wrote:
>> Top-level scalars like `false` or `12345` do not parse correctly if
>> the chunk size is too small; instead json_errdetail() reports 'Token
>> "" is invalid'. With small chunk sizes, json_errdetail() additionally
>> segfaults on constructions like `[tru]` or `12zz`.
>
> Ugh. Will investigate.

I haven't managed to reproduce this. But I'm including some tests for it.

> I'll have a fresh patch set soon which will also take care of the bitrot.

See attached.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> I haven't managed to reproduce this. But I'm including some tests for it.

If I remove the newline from the end of the new tests:

> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--)
>      foreach my $test_string (@inlinetests)
>      {
>          my ($fh, $fname) = tempfile();
> -        print $fh "$test_string\n";
> +        print $fh "$test_string";
>          close($fh);
>
>          foreach my $size (1..6, 10, 20, 30)

then I can reproduce the same result as my local tests. That extra whitespace appears to help the partial token logic out somehow.

--Jacob
On 2024-03-11 Mo 09:47, Jacob Champion wrote: > On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> I haven't managed to reproduce this. But I'm including some tests for it. > If I remove the newline from the end of the new tests: > >> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--) >> foreach my $test_string (@inlinetests) >> { >> my ($fh, $fname) = tempfile(); >> - print $fh "$test_string\n"; >> + print $fh "$test_string"; >> close($fh); >> >> foreach my $size (1..6, 10, 20, 30) > then I can reproduce the same result as my local tests. That extra > whitespace appears to help the partial token logic out somehow. > Yep, here's a patch set with that fixed, and the tests adjusted. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
I've been poking at the partial token logic. The json_errdetail() bug
mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk
size) seems to be due to the disconnection between the "main" lex
instance and the dummy_lex that's created from it. The dummy_lex
contains all the information about the failed token, which is
discarded upon an error return:
> partial_result = json_lex(&dummy_lex);
> if (partial_result != JSON_SUCCESS)
> return partial_result;
In these situations, there's an additional logical error:
lex->token_start is pointing to a spot in the string after
lex->token_terminator, which breaks an invariant that will mess up
later pointer math. Nothing appears to be setting lex->token_start to
point into the partial token buffer until _after_ the partial token is
successfully lexed, which doesn't seem right -- in addition to the
pointer math problems, if a previous chunk was freed (or on a stale
stack frame), lex->token_start will still be pointing off into space.
Similarly, wherever we set token_terminator, we need to know that
token_start is pointing into the same buffer.
Determining the end of a token is now done in two separate places
between the partial- and full-lexer code paths, which is giving me a
little heartburn. I'm concerned that those could drift apart, and if
the two disagree on where to end a token, we could lose data into the
partial token buffer in a way that would be really hard to debug. Is
there a way to combine them?
Attachment
On Mon, Mar 18, 2024 at 3:32 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Not very easily. But I think and hope I've fixed the issue you've identified above about returning before lex->token_start is properly set.
>
> Attached is a new set of patches that does that and is updated for the json_errdetaiil() changes.
Thanks!
> ++ * Normally token_start would be ptok->data, but it could be later,
> ++ * see json_lex_string's handling of invalid escapes.
> + */
> -+ lex->token_start = ptok->data;
> ++ lex->token_start = dummy_lex.token_start;
> + lex->token_terminator = ptok->data + ptok->len;
By the same token (ha), the lex->token_terminator needs to be updated
from dummy_lex for some error paths. (IIUC, on success, the
token_terminator should always point to the end of the buffer. If it's
not possible to combine the two code paths, maybe it'd be good to
check that and assert/error out if we've incorrectly pulled additional
data into the partial token.)
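A sketch of that adjustment, with names taken from the quoted hunk (illustrative, not the actual patch): copy both token pointers back from the throwaway lexer, and check on success that the whole partial-token buffer was consumed.

partial_result = json_lex(&dummy_lex);

/*
 * Keep the main lexer's token pointers inside the partial_token buffer
 * so that error reporting and later pointer math stay consistent.
 */
lex->token_start = dummy_lex.token_start;
lex->token_terminator = dummy_lex.token_terminator;

if (partial_result != JSON_SUCCESS)
    return partial_result;

/* on success the whole partial token should have been consumed */
Assert(lex->token_terminator == ptok->data + ptok->len);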
With the incremental parser, I think prev_token_terminator is not
likely to be safe to use except in very specific circumstances, since
it could be pointing into a stale chunk. Some documentation around how
to use that safely in a semantic action would be good.
It looks like some of the newly added error handling paths cannot be
hit, because the production stack makes it logically impossible to get
there. (For example, if it takes a successfully lexed comma to
transition into JSON_PROD_MORE_ARRAY_ELEMENTS to begin with, then when
we pull that production's JSON_TOKEN_COMMA off the stack, we can't
somehow fail to match that same comma.) Assuming I haven't missed a
different way to get into that situation, could the "impossible" cases
have assert calls added?
I've attached two diffs. One is the group of tests I've been using
locally (called 002_inline.pl; I replaced the existing inline tests
with it), and the other is a set of potential fixes to get those tests
green.
Attachment
It also removes the frontend exits I had. In the case of stack depth, we follow the example of the RD parser and only check stack depth for backend code. In the case of the check that the lexer is set up for incremental parsing, the exit is replaced by an Assert.
On Tue, Mar 19, 2024 at 3:07 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> On Mon, Mar 18, 2024 at 3:35 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>> With the incremental parser, I think prev_token_terminator is not
>> likely to be safe to use except in very specific circumstances, since
>> it could be pointing into a stale chunk. Some documentation around how
>> to use that safely in a semantic action would be good.
>
> Quite right. It's not safe. Should we ensure it's set to something like NULL or -1?

Nulling it out seems reasonable.

> Also, where do you think we should put a warning about it?

I was thinking in the doc comment for JsonLexContext.

> It also removes the frontend exits I had. In the case of stack depth, we follow the example of the RD parser and only check stack depth for backend code. In the case of the check that the lexer is set up for incremental parsing, the exit is replaced by an Assert. That means your test for an over-nested array doesn't work any more, so I have commented it out.

Hm, okay. We really ought to fix the recursive parser, but that's for a separate thread. (Probably OAuth.) The ideal behavior IMO would be for the caller to configure a maximum depth in the JsonLexContext.

Note that the repalloc will eventually still exit() if the pstack gets too big; is that a concern? Alternatively, could unbounded heap growth be a problem for a superuser? I guess the scalars themselves aren't capped for length...

On Wed, Mar 20, 2024 at 12:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On second thoughts, I think it might be better if we invent a new error return code for a lexer mode mismatch.

+1

--Jacob
This new return path...
> + if (!tok_done)
> + {
> + if (lex->inc_state->is_last_chunk)
> + return JSON_INVALID_TOKEN;
...also needs to set the token pointers. See one approach in the
attached diff, which additionally asserts that we've consumed the
entire chunk in this case, along with a handful of new tests.
Attachment
On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Thanks, included that and attended to the other issues we discussed. I think this is pretty close now.
Okay, looking over the thread, there are the following open items:
- extend the incremental test in order to exercise the semantic callbacks [1]
- add Assert calls in impossible error cases [2]
- error out if the non-incremental lex doesn't consume the entire token [2]
- double-check that out of memory is an appropriate failure mode for
the frontend [3]
Just as a general style nit:
> + if (lex->incremental)
> + {
> + lex->input = lex->token_terminator = lex->line_start = json;
> + lex->input_length = len;
> + lex->inc_state->is_last_chunk = is_last;
> + }
> + else
> + return JSON_INVALID_LEXER_TYPE;
I think flipping this around would probably make it more readable;
something like:
if (!lex->incremental)
return JSON_INVALID_LEXER_TYPE;
lex->input = ...
[1] https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com
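Spelled out, the rearranged hunk would look roughly like this (same field names as the quoted code; only the control flow changes):

/* sketch of the suggested early-return shape */
if (!lex->incremental)
    return JSON_INVALID_LEXER_TYPE;

lex->input = lex->token_terminator = lex->line_start = json;
lex->input_length = len;
lex->inc_state->is_last_chunk = is_last;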
On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Well, what's the alternative? The current parser doesn't check stack depth in frontend code. Presumably it too will eventually just run out of memory, possibly rather sooner as the stack frames could be more expensive than the incremental parser stack extensions.
Stack size should be pretty limited, at least on the platforms I'm
familiar with. So yeah, the recursive descent will segfault pretty
quickly, but it won't repalloc() an unbounded amount of heap space.
The alternative would just be to go back to a hardcoded limit in the
short term, I think.
--Jacob
On Mon, Mar 25, 2024 at 4:12 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > Stack size should be pretty limited, at least on the platforms I'm > familiar with. So yeah, the recursive descent will segfault pretty > quickly, but it won't repalloc() an unbounded amount of heap space. > The alternative would just be to go back to a hardcoded limit in the > short term, I think. And I should mention that there are other ways to consume a bunch of memory, but I think they're bounded by the size of the JSON file. Looks like the repalloc()s amplify the JSON size by a factor of ~20 (JS_MAX_PROD_LEN + sizeof(char*) + sizeof(bool)). That may or may not be enough to be concerned about in the end, since I think it's still linear, but I wanted to make sure it was known. --Jacob
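For reference, the arithmetic behind the ~20 figure (assuming JS_MAX_PROD_LEN is 10 and 8-byte pointers) is roughly:

/*
 * Worst-case parser stack cost per byte of input, per the discussion above:
 * each nesting level costs about
 *     JS_MAX_PROD_LEN + sizeof(char *) + sizeof(bool) = 10 + 8 + 1 = 19 bytes
 * while the input that opens a level can be a single byte ("[").
 */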
On Mon, Mar 25, 2024 at 4:24 PM Andrew Dunstan <andrew@dunslane.net> wrote: > OK, so we invent a new error code and have the parser return that if the stack depth gets too big? Yeah, that seems reasonable. I'd potentially be able to build on that via OAuth for next cycle, too, since that client needs to limit its memory usage. --Jacob
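As a sketch of the direction agreed here (JSON_NESTING_TOO_DEEP is the name that shows up later in the thread; the limit macro and exact placement are assumptions):

/* illustrative only, not the committed code */
if (lex->lex_level >= JSON_MAX_NESTING)     /* hypothetical limit, e.g. 6400 */
    return JSON_NESTING_TOO_DEEP;           /* new error code, reported via json_errdetail() */
inc_lex_level(lex);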
On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > - add Assert calls in impossible error cases [2] To expand on this one, I think these parts of the code (marked with `<- here`) are impossible to reach: > switch (top) > { > case JSON_TOKEN_STRING: > if (next_prediction(pstack) == JSON_TOKEN_COLON) > ctx = JSON_PARSE_STRING; > else > ctx = JSON_PARSE_VALUE; <- here > break; > case JSON_TOKEN_NUMBER: <- here > case JSON_TOKEN_TRUE: <- here > case JSON_TOKEN_FALSE: <- here > case JSON_TOKEN_NULL: <- here > case JSON_TOKEN_ARRAY_START: <- here > case JSON_TOKEN_OBJECT_START: <- here > ctx = JSON_PARSE_VALUE; > break; > case JSON_TOKEN_ARRAY_END: <- here > ctx = JSON_PARSE_ARRAY_NEXT; > break; > case JSON_TOKEN_OBJECT_END: <- here > ctx = JSON_PARSE_OBJECT_NEXT; > break; > case JSON_TOKEN_COMMA: <- here > if (next_prediction(pstack) == JSON_TOKEN_STRING) > ctx = JSON_PARSE_OBJECT_NEXT; > else > ctx = JSON_PARSE_ARRAY_NEXT; > break; Since none of these cases are non-terminals, the only way to get to this part of the code is if (top != tok). But inspecting the productions and transitions that can put these tokens on the stack, it looks like the only way for them to be at the top of the stack to begin with is if (tok == top). (Otherwise, we would have chosen a different production, or else errored out on a non-terminal.) This case is possible... > case JSON_TOKEN_STRING: > if (next_prediction(pstack) == JSON_TOKEN_COLON) > ctx = JSON_PARSE_STRING; ...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the corresponding else case is not, because if we're in the middle of a _KEY_PAIRS production, the next_prediction() _must_ be JSON_TOKEN_COLON. Do you agree, or am I missing a way to get there? Thanks, --Jacob
On 2024-03-26 Tu 17:52, Jacob Champion wrote: > On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: >> - add Assert calls in impossible error cases [2] > To expand on this one, I think these parts of the code (marked with > `<- here`) are impossible to reach: > >> switch (top) >> { >> case JSON_TOKEN_STRING: >> if (next_prediction(pstack) == JSON_TOKEN_COLON) >> ctx = JSON_PARSE_STRING; >> else >> ctx = JSON_PARSE_VALUE; <- here >> break; >> case JSON_TOKEN_NUMBER: <- here >> case JSON_TOKEN_TRUE: <- here >> case JSON_TOKEN_FALSE: <- here >> case JSON_TOKEN_NULL: <- here >> case JSON_TOKEN_ARRAY_START: <- here >> case JSON_TOKEN_OBJECT_START: <- here >> ctx = JSON_PARSE_VALUE; >> break; >> case JSON_TOKEN_ARRAY_END: <- here >> ctx = JSON_PARSE_ARRAY_NEXT; >> break; >> case JSON_TOKEN_OBJECT_END: <- here >> ctx = JSON_PARSE_OBJECT_NEXT; >> break; >> case JSON_TOKEN_COMMA: <- here >> if (next_prediction(pstack) == JSON_TOKEN_STRING) >> ctx = JSON_PARSE_OBJECT_NEXT; >> else >> ctx = JSON_PARSE_ARRAY_NEXT; >> break; > Since none of these cases are non-terminals, the only way to get to > this part of the code is if (top != tok). But inspecting the > productions and transitions that can put these tokens on the stack, it > looks like the only way for them to be at the top of the stack to > begin with is if (tok == top). (Otherwise, we would have chosen a > different production, or else errored out on a non-terminal.) > > This case is possible... > >> case JSON_TOKEN_STRING: >> if (next_prediction(pstack) == JSON_TOKEN_COLON) >> ctx = JSON_PARSE_STRING; > ...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the > corresponding else case is not, because if we're in the middle of a > _KEY_PAIRS production, the next_prediction() _must_ be > JSON_TOKEN_COLON. > > Do you agree, or am I missing a way to get there? > One good way of testing would be to add the Asserts, build with -DFORCE_JSON_PSTACK, and run the standard regression suite, which has a fairly comprehensive set of JSON errors. I'll play with that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Mar 25, 2024 at 6:15 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Thanks, included that and attended to the other issues we discussed. I think this is pretty close now.
> Okay, looking over the thread, there are the following open items:
> - extend the incremental test in order to exercise the semantic callbacks [1]

Yeah, I'm on a super long plane trip later this week, so I might get it done then :-)

> - add Assert calls in impossible error cases [2]

ok, will do

> - error out if the non-incremental lex doesn't consume the entire token [2]

ok, will do

> - double-check that out of memory is an appropriate failure mode for
> the frontend [3]

Well, what's the alternative? The current parser doesn't check stack depth in frontend code. Presumably it too will eventually just run out of memory, possibly rather sooner as the stack frames could be more expensive than the incremental parser stack extensions.

> Just as a general style nit:
> > + if (lex->incremental)
> > + {
> > + lex->input = lex->token_terminator = lex->line_start = json;
> > + lex->input_length = len;
> > + lex->inc_state->is_last_chunk = is_last;
> > + }
> > + else
> > + return JSON_INVALID_LEXER_TYPE;
> I think flipping this around would probably make it more readable;
> something like:
> if (!lex->incremental)
> return JSON_INVALID_LEXER_TYPE;
> lex->input = ...

Noted. will do, Thanks.
Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to add semantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to addsemantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing. Thanks! 0004 did highlight one last bug for me -- the return value from semantic callbacks is ignored, so applications can't interrupt the parsing early: > + if (ostart != NULL) > + (*ostart) (sem->semstate); > + inc_lex_level(lex); Anything not JSON_SUCCESS should be returned immediately, I think, for this and the other calls. > + /* make sure we've used all the input */ > + if (lex->token_terminator - lex->token_start != ptok->len) > + return JSON_INVALID_TOKEN; nit: Debugging this, if anyone ever hits it, is going to be confusing. The error message is going to claim that a valid token is invalid, as opposed to saying that the parser has a bug. Short of introducing something like JSON_INTERNAL_ERROR, maybe an Assert() should be added at least? > + case JSON_NESTING_TOO_DEEP: > + return(_("Json nested too deep, maximum permitted depth is 6400")); nit: other error messages use all-caps, "JSON" > + char chunk_start[100], > + chunk_end[100]; > + > + snprintf(chunk_start, 100, "%s", ib->buf.data); > + snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99))); > + elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end); Is this NOTICE intended to be part of the final commit? > + inc_state = json_parse_manifest_incremental_init(&context); > + > + buffer = pg_malloc(chunk_size + 64); These `+ 64`s (there are two of them) seem to be left over from debugging. In v7 they were just `+ 1`. -- From this point onward, I think all of my feedback is around maintenance characteristics, which is firmly in YMMV territory, and I don't intend to hold up a v1 of the feature for it since I'm not the maintainer. :D The complexity around the checksum handling and "final chunk"-ing is unfortunate, but not a fault of this patch -- just a weird consequence of the layering violation in the format, where the checksum is inside the object being checksummed. I don't like it, but I don't think there's much to be done about it. By far the trickiest part of the implementation is the partial token logic, because of all the new ways there are to handle different types of failures. I think any changes to the incremental handling in json_lex() are going to need intense scrutiny from someone who already has the mental model loaded up. I'm going snowblind on the patch and I'm no longer the right person to review how hard it is to get up to speed with the architecture, but I'd say it's not easy. For something as security-critical as a JSON parser, I'd usually want to counteract that by sinking the observable behaviors in concrete and getting both the effective test coverage *and* the code coverage as close to 100% as we can stand. (By that, I mean that purposely introducing observable bugs into the parser should result in test failures. We're not there yet when it comes to the semantic callbacks, at least.) But I don't think the current JSON parser is held to that standard currently, so it seems strange for me to ask for that here. I think it'd be good for a v1.x of this feature to focus on simplification of the code, and hopefully consolidate and/or eliminate some of the duplicated parsing work so that the mental model isn't quite so big. Thanks! --Jacob
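For the first item, the fix at each call site presumably looks something like this (a sketch, not the committed hunk; the names come from the quoted code):

JsonParseErrorType result;

if (ostart != NULL)
{
    result = (*ostart) (sem->semstate);
    if (result != JSON_SUCCESS)
        return result;      /* let the application stop the parse early */
}
inc_lex_level(lex);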
On 2024-03-29 Fr 17:21, Jacob Champion wrote: > On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to addsemantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing. > Thanks! 0004 did highlight one last bug for me -- the return value > from semantic callbacks is ignored, so applications can't interrupt > the parsing early: > >> + if (ostart != NULL) >> + (*ostart) (sem->semstate); >> + inc_lex_level(lex); > Anything not JSON_SUCCESS should be returned immediately, I think, for > this and the other calls. Fixed > >> + /* make sure we've used all the input */ >> + if (lex->token_terminator - lex->token_start != ptok->len) >> + return JSON_INVALID_TOKEN; > nit: Debugging this, if anyone ever hits it, is going to be confusing. > The error message is going to claim that a valid token is invalid, as > opposed to saying that the parser has a bug. Short of introducing > something like JSON_INTERNAL_ERROR, maybe an Assert() should be added > at least? Done > >> + case JSON_NESTING_TOO_DEEP: >> + return(_("Json nested too deep, maximum permitted depth is 6400")); > nit: other error messages use all-caps, "JSON" Fixed > >> + char chunk_start[100], >> + chunk_end[100]; >> + >> + snprintf(chunk_start, 100, "%s", ib->buf.data); >> + snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99))); >> + elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end); > Is this NOTICE intended to be part of the final commit? No, removed. > >> + inc_state = json_parse_manifest_incremental_init(&context); >> + >> + buffer = pg_malloc(chunk_size + 64); > These `+ 64`s (there are two of them) seem to be left over from > debugging. In v7 they were just `+ 1`. Fixed I've also added the Asserts to the error handling code you suggested. This tests fine with the regression suite under FORCE_JSON_PSTACK. > > -- > > >From this point onward, I think all of my feedback is around > maintenance characteristics, which is firmly in YMMV territory, and I > don't intend to hold up a v1 of the feature for it since I'm not the > maintainer. :D > > The complexity around the checksum handling and "final chunk"-ing is > unfortunate, but not a fault of this patch -- just a weird consequence > of the layering violation in the format, where the checksum is inside > the object being checksummed. I don't like it, but I don't think > there's much to be done about it. Yeah. I agree it's ugly, but I don't think we should let it hold us up at all. Working around it doesn't involve too much extra code. > > By far the trickiest part of the implementation is the partial token > logic, because of all the new ways there are to handle different types > of failures. I think any changes to the incremental handling in > json_lex() are going to need intense scrutiny from someone who already > has the mental model loaded up. I'm going snowblind on the patch and > I'm no longer the right person to review how hard it is to get up to > speed with the architecture, but I'd say it's not easy. json_lex() is not really a very hot piece of code. I have added a comment at the top giving a brief description of the partial token handling. Even without the partial token bit it can be a bit scary, but I think the partial token piece here is not so scary that we should not proceed. 
> > For something as security-critical as a JSON parser, I'd usually want > to counteract that by sinking the observable behaviors in concrete and > getting both the effective test coverage *and* the code coverage as > close to 100% as we can stand. (By that, I mean that purposely > introducing observable bugs into the parser should result in test > failures. We're not there yet when it comes to the semantic callbacks, > at least.) But I don't think the current JSON parser is held to that > standard currently, so it seems strange for me to ask for that here. Yeah. > > I think it'd be good for a v1.x of this feature to focus on > simplification of the code, and hopefully consolidate and/or eliminate > some of the duplicated parsing work so that the mental model isn't > quite so big. > I'm not sure how you think that can be done. What parts of the processing are you having difficulty in coming to grips with? Anyway, here are new patches. I've rolled the new semantic test into the first patch. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Anyway, here are new patches. I've rolled the new semantic test into the > first patch. Looks good! I've marked RfC. > json_lex() is not really a very hot piece of code. Sure, but I figure if someone is trying to get the performance of the incremental parser to match the recursive one, so we can eventually replace it, it might get a little warmer. :) > > I think it'd be good for a v1.x of this feature to focus on > > simplification of the code, and hopefully consolidate and/or eliminate > > some of the duplicated parsing work so that the mental model isn't > > quite so big. > > I'm not sure how you think that can be done. I think we'd need to teach the lower levels of the lexer about incremental parsing too, so that we don't have two separate sources of truth about what ends a token. Bonus points if we could keep the parse state across chunks to the extent that we didn't need to restart at the beginning of the token every time. (Our current tools for this are kind of poor, like the restartable state machine in PQconnectPoll. While I'm dreaming, I'd like coroutines.) Now, whether the end result would be more or less maintainable is left as an exercise... Thanks! --Jacob
On 2024-04-02 Tu 15:38, Jacob Champion wrote: > On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> Anyway, here are new patches. I've rolled the new semantic test into the >> first patch. > Looks good! I've marked RfC. Thanks! I appreciate all the work you've done on this. I will give it one more pass and commit RSN. > >> json_lex() is not really a very hot piece of code. > Sure, but I figure if someone is trying to get the performance of the > incremental parser to match the recursive one, so we can eventually > replace it, it might get a little warmer. :) I don't think this is where the performance penalty lies. Rather, I suspect it's the stack operations in the non-recursive parser itself. The speed test doesn't involve any partial token processing at all, and yet the non-recursive parser is significantly slower in that test. >>> I think it'd be good for a v1.x of this feature to focus on >>> simplification of the code, and hopefully consolidate and/or eliminate >>> some of the duplicated parsing work so that the mental model isn't >>> quite so big. >> I'm not sure how you think that can be done. > I think we'd need to teach the lower levels of the lexer about > incremental parsing too, so that we don't have two separate sources of > truth about what ends a token. Bonus points if we could keep the parse > state across chunks to the extent that we didn't need to restart at > the beginning of the token every time. (Our current tools for this are > kind of poor, like the restartable state machine in PQconnectPoll. > While I'm dreaming, I'd like coroutines.) Now, whether the end result > would be more or less maintainable is left as an exercise... > I tried to disturb the main lexer processing as little as possible. We could possibly unify the two paths, but I have a strong suspicion that would result in a performance hit (the main part of the lexer doesn't copy any characters at all, it just keeps track of pointers into the input). And while the code might not undergo lots of change, the routine itself is quite performance critical. Anyway, I think that's all something for another day. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
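To illustrate the point about the main lexer path: for a complete token in the current buffer, the lexer only records where the token lies, e.g. (locals elided, sketch only):

/* no bytes are copied here; the token is just described by pointers */
lex->token_start = s;
lex->token_terminator = p;      /* one past the last byte of the token */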
On 2024-04-02 Tu 17:12, Andrew Dunstan wrote: > > On 2024-04-02 Tu 15:38, Jacob Champion wrote: >> On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> >> wrote: >>> Anyway, here are new patches. I've rolled the new semantic test into >>> the >>> first patch. >> Looks good! I've marked RfC. > > > Thanks! I appreciate all the work you've done on this. I will give it > one more pass and commit RSN. pushed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > pushed. My animals don't like this [1][2]: parse_manifest.c:99:3: error: redefinition of typedef 'JsonManifestParseIncrementalState' is a C11 feature [-Werror,-Wtypedef-redefinition] } JsonManifestParseIncrementalState; ^ ../../src/include/common/parse_manifest.h:23:50: note: previous definition is here typedef struct JsonManifestParseIncrementalState JsonManifestParseIncrementalState; ^ 1 error generated. (and similarly in other places) regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01
On 2024-04-04 Th 10:26, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> pushed. > My animals don't like this [1][2]: > > parse_manifest.c:99:3: error: redefinition of typedef 'JsonManifestParseIncrementalState' is a C11 feature [-Werror,-Wtypedef-redefinition] > } JsonManifestParseIncrementalState; > ^ > ../../src/include/common/parse_manifest.h:23:50: note: previous definition is here > typedef struct JsonManifestParseIncrementalState JsonManifestParseIncrementalState; > ^ > 1 error generated. > > (and similarly in other places) > > regards, tom lane > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02 > [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01 Darn it. Ok, will try to fix. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-04-04 Th 10:26, Tom Lane wrote: >> My animals don't like this [1][2]: > Darn it. Ok, will try to fix. I think you just need to follow the standard pattern: typedef struct foo foo; ... foo* is ok to use here ... struct foo { ... fields }; regards, tom lane
I wrote: > I think you just need to follow the standard pattern: Yeah, the attached is enough to silence it for me. (But personally I'd add comments saying that the typedef appears in thus-and-such header file; see examples in our tree.) regards, tom lane diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index 3d1bd37ac26..0bb46b43024 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -79,7 +79,7 @@ typedef enum * and the token and value for scalars that need to be preserved * across calls. */ -typedef struct JsonParserStack +struct JsonParserStack { int stack_size; char *prediction; @@ -89,18 +89,18 @@ typedef struct JsonParserStack bool *fnull; JsonTokenType scalar_tok; char *scalar_val; -} JsonParserStack; +}; /* * struct containing state used when there is a possible partial token at the * end of a json chunk when we are doing incremental parsing. */ -typedef struct JsonIncrementalState +struct JsonIncrementalState { bool is_last_chunk; bool partial_completed; StringInfoData partial_token; -} JsonIncrementalState; +}; /* * constants and macros used in the nonrecursive parser diff --git a/src/common/parse_manifest.c b/src/common/parse_manifest.c index 040c5597df4..f9c026a6369 100644 --- a/src/common/parse_manifest.c +++ b/src/common/parse_manifest.c @@ -91,12 +91,12 @@ typedef struct char *manifest_checksum; } JsonManifestParseState; -typedef struct JsonManifestParseIncrementalState +struct JsonManifestParseIncrementalState { JsonLexContext lex; JsonSemAction sem; pg_cryptohash_ctx *manifest_ctx; -} JsonManifestParseIncrementalState; +}; static JsonParseErrorType json_manifest_object_start(void *state); static JsonParseErrorType json_manifest_object_end(void *state);
On 2024-04-04 Th 11:16, Tom Lane wrote: > I wrote: >> I think you just need to follow the standard pattern: > Yeah, the attached is enough to silence it for me. Thanks, that's what I came up with too (after converting my setup to use clang) > (But personally I'd add comments saying that the typedef > appears in thus-and-such header file; see examples in > our tree.) Done cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Oh, more problems: after running check-world in a non-VPATH build, there are droppings all over my tree: $ git status On branch master Your branch is up to date with 'origin/master'. Untracked files: (use "git add <file>..." to include in what will be committed) src/interfaces/ecpg/test/sql/sqljson_jsontable src/interfaces/ecpg/test/sql/sqljson_jsontable.c src/test/modules/test_json_parser/test_json_parser_incremental src/test/modules/test_json_parser/test_json_parser_perf src/test/modules/test_json_parser/tmp_check/ nothing added to commit but untracked files present (use "git add" to track) "make clean" or "make distclean" removes some of that but not all: $ make -s distclean $ git status On branch master Your branch is up to date with 'origin/master'. Untracked files: (use "git add <file>..." to include in what will be committed) src/test/modules/test_json_parser/test_json_parser_incremental src/test/modules/test_json_parser/test_json_parser_perf nothing added to commit but untracked files present (use "git add" to track) So we're short several .gitignore entries, and apparently also shy a couple of make-clean rules. regards, tom lane
On 2024-04-04 Th 12:04, Tom Lane wrote: > Oh, more problems: after running check-world in a non-VPATH build, > there are droppings all over my tree: > > $ git status > On branch master > Your branch is up to date with 'origin/master'. > > Untracked files: > (use "git add <file>..." to include in what will be committed) > src/interfaces/ecpg/test/sql/sqljson_jsontable > src/interfaces/ecpg/test/sql/sqljson_jsontable.c > src/test/modules/test_json_parser/test_json_parser_incremental > src/test/modules/test_json_parser/test_json_parser_perf > src/test/modules/test_json_parser/tmp_check/ > > nothing added to commit but untracked files present (use "git add" to track) > > "make clean" or "make distclean" removes some of that but not all: > > $ make -s distclean > $ git status > On branch master > Your branch is up to date with 'origin/master'. > > Untracked files: > (use "git add <file>..." to include in what will be committed) > src/test/modules/test_json_parser/test_json_parser_incremental > src/test/modules/test_json_parser/test_json_parser_perf > > nothing added to commit but untracked files present (use "git add" to track) > > So we're short several .gitignore entries, and apparently also > shy a couple of make-clean rules. Argh. You get out of the habit when you're running with meson :-( There's another issue I'm running down too. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, 4 Apr 2024 at 18:13, Andrew Dunstan <andrew@dunslane.net> wrote: > Argh. You get out of the habit when you're running with meson :-( I'm having some issues with meson too actually. "meson test -C build" is now failing with this for me: Command 'test_json_parser_incremental' not found in /home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin, ...
Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy: ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && | ~~~^~~~~~~~~~~ ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2020 | lex->token_start, s - lex->token_start); | ~~~^~~~~~~~~~~~~ ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Apr 4, 2024 at 10:17 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Command 'test_json_parser_incremental' not found in > /home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin, > ... I can't reproduce this locally... What's in the `...`? I wouldn't expect to find the test binary in your tmp_install. --Jacob
On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > What's in the `...`? I wouldn't expect to find the test binary in your > tmp_install. Oh, I wonder if this is just a build dependency thing? I typically run a bare `ninja` right before testing, because I think most of those dependencies are missing for the tests at the moment. (For example, if I manually remove the `libpq_pipeline` executable and then try to run `meson test` without a rebuild, it fails.) We can certainly fix that (see attached patch as a first draft) but I wonder if there was a reason we decided not to during the Meson port? --Jacob
Attachment
On Thu, Apr 4, 2024 at 11:06 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | ~~~^~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2020 | lex->token_start, s - lex->token_start); > | ~~~^~~~~~~~~~~~~ Ah, it hasn't figured out that the incremental path is disabled. Zero-initializing the dummy_lex would be good regardless. > ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); > | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For this case, though, I think it'd be good for API consistency if *numeric_error were always set on return, instead of relying on the caller to initialize. --Jacob
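Concretely, the two suggestions amount to something like this (a sketch of the IsValidJsonNumber() locals, not the exact patch):

JsonLexContext dummy_lex = {0};     /* zero-init quiets -Wmaybe-uninitialized */
bool        numeric_error = false;  /* defensive, in case the lexer bails early */

plus having json_lex_number() set its error out-parameter unconditionally before returning.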
On 2024-04-04 Th 14:06, Nathan Bossart wrote: > Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy: > > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | ~~~^~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2020 | lex->token_start, s - lex->token_start); > | ~~~^~~~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); > | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Please pile on. I want to get this fixed. It seems odd that my much later gcc didn't complain. Does the attached patch fix it for you? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: > Does the attached patch fix it for you? It clears up 2 of the 3 warnings for me: ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && | -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, 4 Apr 2024 at 20:48, Jacob Champion <jacob.champion@enterprisedb.com> wrote: > > On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: > > What's in the `...`? I wouldn't expect to find the test binary in your > > tmp_install. > > Oh, I wonder if this is just a build dependency thing? I typically run > a bare `ninja` right before testing, because I think most of those > dependencies are missing for the tests at the moment. (For example, if > I manually remove the `libpq_pipeline` executable and then try to run > `meson test` without a rebuild, it fails.) > > We can certainly fix that (see attached patch as a first draft) but I > wonder if there was a reason we decided not to during the Meson port? Yeah, that patch fixed my problem. (as well as running plain ninja) To clarify: I did do a rebuild. But it turns out that install-quiet doesn't build everything... The full command I was having problems with was: ninja -C build install-quiet && meson test -C build Maybe that's something worth addressing too. I expected that install/install-quiet was a strict superset of a plain ninja invocation.
On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Maybe that's something worth addressing too. I expected that > install/install-quiet was a strict superset of a plain ninja > invocation. Maybe that's the intent, but I hope not, because I don't see any reason for `ninja install` to worry about test-only binaries that won't be installed. For the tests, there's a pretty clear rationale for the dependency. --Jacob
On Thu, 4 Apr 2024 at 23:34, Jacob Champion <jacob.champion@enterprisedb.com> wrote: > > On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > Maybe that's something worth addressing too. I expected that > > install/install-quiet was a strict superset of a plain ninja > > invocation. > > Maybe that's the intent, but I hope not, because I don't see any > reason for `ninja install` to worry about test-only binaries that > won't be installed. For the tests, there's a pretty clear rationale > for the dependency. Fair enough, I guess I'll change my invocation to include the ninja "test" target too: ninja -C build test install-quiet && meson test -C build
On Thu, 4 Apr 2024 at 23:52, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Fair enough, I guess I'll change my invocation to include the ninja > "test" target too: > ninja -C build test install-quiet && meson test -C build Actually that doesn't do what I want either because that actually runs the tests too... And I prefer to do that using meson directly, so I can add some filters to it. I'll figure something out when I'm more fresh tomorrow.
On 2024-04-04 Th 15:54, Nathan Bossart wrote: > On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >> Does the attached patch fix it for you? > It clears up 2 of the 3 warnings for me: > > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | > Thanks, please try this instead. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote: > On 2024-04-04 Th 15:54, Nathan Bossart wrote: >> On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >> > Does the attached patch fix it for you? >> It clears up 2 of the 3 warnings for me: >> >> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: >> ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] >> 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && >> | > > Thanks, please try this instead. LGTM, thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 2024-04-05 Fr 11:43, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote: >> On 2024-04-04 Th 15:54, Nathan Bossart wrote: >>> On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >>>> Does the attached patch fix it for you? >>> It clears up 2 of the 3 warnings for me: >>> >>> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: >>> ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] >>> 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && >>> | >> Thanks, please try this instead. > LGTM, thanks! > Thanks for checking. Pushed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Coverity complained that this patch leaks memory: /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: 212 in load_backup_manifest() 206 bytes_left -= rc; 207 json_parse_manifest_incremental_chunk( 208 inc_state, buffer, rc, bytes_left == 0); 209 } 210 211 close(fd); >>> CID 1596259: (RESOURCE_LEAK) >>> Variable "inc_state" going out of scope leaks the storage it points to. 212 } 213 214 /* All done. */ 215 pfree(buffer); 216 return result; 217 } /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 488 in parse_manifest_file() 482 bytes_left -= rc; 483 json_parse_manifest_incremental_chunk( 484 inc_state, buffer, rc, bytes_left == 0); 485 } 486 487 close(fd); >>> CID 1596257: (RESOURCE_LEAK) >>> Variable "inc_state" going out of scope leaks the storage it points to. 488 } 489 490 /* Done with the buffer. */ 491 pfree(buffer); 492 493 return result; It's right about that AFAICS, and not only is the "inc_state" itself leaked but so is its assorted infrastructure. Perhaps we don't care too much about that in the existing applications, but ISTM that isn't going to be a tenable assumption across the board. Shouldn't there be a "json_parse_manifest_incremental_shutdown()" or the like to deallocate all the storage allocated by the parser? regards, tom lane
On 2024-04-07 Su 20:58, Tom Lane wrote: > Coverity complained that this patch leaks memory: > > /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: 212 in load_backup_manifest() > 206 bytes_left -= rc; > 207 json_parse_manifest_incremental_chunk( > 208 inc_state, buffer, rc, bytes_left == 0); > 209 } > 210 > 211 close(fd); >>>> CID 1596259: (RESOURCE_LEAK) >>>> Variable "inc_state" going out of scope leaks the storage it points to. > 212 } > 213 > 214 /* All done. */ > 215 pfree(buffer); > 216 return result; > 217 } > > /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 488 in parse_manifest_file() > 482 bytes_left -= rc; > 483 json_parse_manifest_incremental_chunk( > 484 inc_state, buffer, rc, bytes_left == 0); > 485 } > 486 > 487 close(fd); >>>> CID 1596257: (RESOURCE_LEAK) >>>> Variable "inc_state" going out of scope leaks the storage it points to. > 488 } > 489 > 490 /* Done with the buffer. */ > 491 pfree(buffer); > 492 > 493 return result; > > It's right about that AFAICS, and not only is the "inc_state" itself > leaked but so is its assorted infrastructure. Perhaps we don't care > too much about that in the existing applications, but ISTM that > isn't going to be a tenable assumption across the board. Shouldn't > there be a "json_parse_manifest_incremental_shutdown()" or the like > to deallocate all the storage allocated by the parser? yeah, probably. Will work on it. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
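Presumably something along these lines, mirroring the _init function (a sketch based on the struct definition shown upthread; the real cleanup may need more or less than this):

void
json_parse_manifest_incremental_shutdown(JsonManifestParseIncrementalState *incstate)
{
    pg_cryptohash_free(incstate->manifest_ctx);
    freeJsonLexContext(&incstate->lex);
    pfree(incstate);
}

with the callers invoking it once the last chunk has been fed in, e.g. after the close(fd) in the hunks Coverity flagged.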
Michael pointed out over at [1] that the new tiny.json is pretty inscrutable given its size, and I have to agree. Attached is a patch to pare it down 98% or so. I think people wanting to run the performance comparisons will need to come up with their own gigantic files. Michael, with your "Jacob might be a nefarious cabal of state-sponsored hackers" hat on, is this observable enough, or do we need to get it smaller? I was thinking we may want to replace the URLs with stuff that doesn't link randomly around the Internet. Delicious in its original form is long gone. Thanks, --Jacob [1] https://www.postgresql.org/message-id/ZhCTtpGoXbERacoN@paquier.xyz
Attachment
On 2024-04-08 Mo 14:24, Jacob Champion wrote: > Michael pointed out over at [1] that the new tiny.json is pretty > inscrutable given its size, and I have to agree. Attached is a patch > to pare it down 98% or so. I think people wanting to run the > performance comparisons will need to come up with their own gigantic > files. Let's see if we can do a bit better than that. Maybe a script to construct a larger input for the speed test from the smaller file. Should be pretty simple. > > Michael, with your "Jacob might be a nefarious cabal of > state-sponsored hackers" hat on, is this observable enough, or do we > need to get it smaller? I was thinking we may want to replace the URLs > with stuff that doesn't link randomly around the Internet. Delicious > in its original form is long gone. > Arguably the fact that it points nowhere is a good thing. But feel free to replace it with something else. It doesn't have to be URLs at all. That happened simply because it was easy to extract from a very large piece of JSON I had lying around, probably from the last time I wrote a JSON parser :-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
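For example, a throwaway generator along those lines could just wrap N copies of the small file in one big array (illustrative only; error handling is omitted and the output shape is an assumption):

/* gen_big_json.c - sketch: repeat a small JSON file inside one large array */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    FILE       *in = fopen(argv[1], "rb");
    long        len;
    char       *buf;
    int         reps = (argc > 2) ? atoi(argv[2]) : 10000;

    fseek(in, 0, SEEK_END);
    len = ftell(in);
    rewind(in);
    buf = malloc(len + 1);
    if (fread(buf, 1, len, in) != (size_t) len)
        return 1;
    buf[len] = '\0';
    fclose(in);

    /* emit one large array whose elements are copies of the input */
    putchar('[');
    for (int i = 0; i < reps; i++)
    {
        if (i > 0)
            putchar(',');
        fputs(buf, stdout);
    }
    puts("]");
    free(buf);
    return 0;
}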
On 2024-04-08 Mo 09:29, Andrew Dunstan wrote: > > On 2024-04-07 Su 20:58, Tom Lane wrote: >> Coverity complained that this patch leaks memory: >> >> /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: >> 212 in load_backup_manifest() >> 206 bytes_left -= rc; >> 207 json_parse_manifest_incremental_chunk( >> 208 inc_state, buffer, rc, bytes_left == 0); >> 209 } >> 210 >> 211 close(fd); >>>>> CID 1596259: (RESOURCE_LEAK) >>>>> Variable "inc_state" going out of scope leaks the storage it >>>>> points to. >> 212 } >> 213 >> 214 /* All done. */ >> 215 pfree(buffer); >> 216 return result; >> 217 } >> >> /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: >> 488 in parse_manifest_file() >> 482 bytes_left -= rc; >> 483 json_parse_manifest_incremental_chunk( >> 484 inc_state, buffer, rc, bytes_left == 0); >> 485 } >> 486 >> 487 close(fd); >>>>> CID 1596257: (RESOURCE_LEAK) >>>>> Variable "inc_state" going out of scope leaks the storage it >>>>> points to. >> 488 } >> 489 >> 490 /* Done with the buffer. */ >> 491 pfree(buffer); >> 492 >> 493 return result; >> >> It's right about that AFAICS, and not only is the "inc_state" itself >> leaked but so is its assorted infrastructure. Perhaps we don't care >> too much about that in the existing applications, but ISTM that >> isn't going to be a tenable assumption across the board. Shouldn't >> there be a "json_parse_manifest_incremental_shutdown()" or the like >> to deallocate all the storage allocated by the parser? > > > > yeah, probably. Will work on it. > > > Here's a patch. In addition to the leaks Coverity found, there was another site in the backend code that should call the shutdown function, and a probably memory leak from a logic bug in the incremental json parser code. All these are fixed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Mon, Apr 08, 2024 at 05:42:35PM -0400, Andrew Dunstan wrote: > Arguably the fact that it points nowhere is a good thing. But feel free to > replace it with something else. It doesn't have to be URLs at all. That > happened simply because it was easy to extract from a very large piece of > JSON I had lying around, probably from the last time I wrote a JSON parser > :-) : "http://www.theatermania.com/broadway/", As as matter of fact, this one points to something that seems to be real. Most of the blobs had better be trimmed down. Looking at the code paths in this module, it looks like this could be even simpler than what Jacob is shaving, checking for: - Objects start and end, including booleans, ints and strings for the quotations. - Nested objects, including arrays made of simple JSON objects. Some escape patterns are already checked, but most of them are missing: https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html There is no direct check on test_json_parser_perf.c, either, only a custom rule in the Makefile without specifying something for meson. So it looks like you could do short execution check in a TAP test, at least. -- Michael
Attachment
On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: > There is no direct check on test_json_parser_perf.c, either, only a > custom rule in the Makefile without specifying something for meson. > So it looks like you could do short execution check in a TAP test, at > least. While reading the code, I've noticed a few things, as well: + /* max delicious line length is less than this */ + char buff[6001]; Delicious applies to the contents, nothing to do with this code even if I'd like to think that these lines of code are edible and good. Using a hardcoded limit for some JSON input sounds like a bad idea to me even if this is a test module. Comment that applies to both the perf and the incremental tools. You could use a #define'd buffer size for readability rather than assuming this value in many places. +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c + * This progam tests incremental parsing of json. The input is fed into + * full range of incement handling, especially in the lexer, is exercised. +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c + * Performancet est program for both flavors of the JSON parser + * This progam tests either the standard (recursive descent) JSON parser +++ b/src/test/modules/test_json_parser/README + reads in a file and pases it in very small chunks (60 bytes at a time) to Collection of typos across various files. + appendStringInfoString(&json, "1+23 trailing junk"); What's the purpose here? Perhaps the intention should be documented in a comment? At the end, having a way to generate JSON blobs randomly to test this stuff would be more appealing than what you have currently, with perhaps a few factors like: - Array and simple object density. - Max Level of nesting. - Limitation to ASCII characters that can be escaped. - Perhaps more things I cannot think about? So the current state of things is kind of disappointing, and the size of the data set added to the tree is not that portable either if you want to test various scenarios depending on the data set. It seems to me that this has been committed too hastily and that this is not ready for integration, even if that's just a test module. Tom also has shared some concerns upthread, as far as I can see. -- Michael
Attachment
On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:
There is no direct check on test_json_parser_perf.c, either, only a custom rule in the Makefile without specifying something for meson. So it looks like you could do short execution check in a TAP test, at least.
Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary.
While reading the code, I've noticed a few things, as well: + /* max delicious line length is less than this */ + char buff[6001]; Delicious applies to the contents, nothing to do with this code even if I'd like to think that these lines of code are edible and good. Using a hardcoded limit for some JSON input sounds like a bad idea to me even if this is a test module. Comment that applies to both the perf and the incremental tools. You could use a #define'd buffer size for readability rather than assuming this value in many places.
The comment is a remnant of when I hadn't yet added support for incomplete tokens, and so I had to parse the input line by line. I agree it can go, and we can use a manifest constant for the buffer size.
+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c + * This progam tests incremental parsing of json. The input is fed into + * full range of incement handling, especially in the lexer, is exercised. +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c + * Performancet est program for both flavors of the JSON parser + * This progam tests either the standard (recursive descent) JSON parser +++ b/src/test/modules/test_json_parser/README + reads in a file and pases it in very small chunks (60 bytes at a time) to Collection of typos across various files.
Will fix. (The older I get the more typos I seem to make and the harder it is to notice them. It's very annoying.)
+ appendStringInfoString(&json, "1+23 trailing junk"); What's the purpose here? Perhaps the intention should be documented in a comment?
The purpose is to ensure that if there is not a trailing '\0' on the json chunk the parser will still do the right thing. I'll add a comment to that effect.
At the end, having a way to generate JSON blobs randomly to test this stuff would be more appealing than what you have currently, with perhaps a few factors like: - Array and simple object density. - Max Level of nesting. - Limitation to ASCII characters that can be escaped. - Perhaps more things I cannot think about?
No, I disagree. Maybe we need those things as well, but we do need a static test where we can test the output against known results. I have no objection to changing the input and output files.
It's worth noting that t/002_inline.pl does generate some input and test e.g., the maximum nesting levels among other errors. Perhaps you missed that. If you think we need more tests there adding them would be extremely simple.
So the current state of things is kind of disappointing, and the size of the data set added to the tree is not that portable either if you want to test various scenarios depending on the data set. It seems to me that this has been committed too hastily and that this is not ready for integration, even if that's just a test module. Tom also has shared some concerns upthread, as far as I can see.
I have posted a patch already that addresses the issue Tom raised.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 2024-04-09 Tu 01:23, Michael Paquier wrote: > There is no direct check on test_json_parser_perf.c, either, only a > custom rule in the Makefile without specifying something for meson. > So it looks like you could do short execution check in a TAP test, at > least. > > Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary. Okay, but for what purpose? My understanding during review was that this was a convenience utility for people who were actively hacking on the code (and I used it for exactly that purpose a few months back, so I didn't question that any further). Why does the farm need to spend any time running it at all? --Jacob
On 2024-04-09 Tu 09:45, Jacob Champion wrote: > On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> On 2024-04-09 Tu 01:23, Michael Paquier wrote: >> There is no direct check on test_json_parser_perf.c, either, only a >> custom rule in the Makefile without specifying something for meson. >> So it looks like you could do short execution check in a TAP test, at >> least. >> >> Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary. > Okay, but for what purpose? My understanding during review was that > this was a convenience utility for people who were actively hacking on > the code (and I used it for exactly that purpose a few months back, so > I didn't question that any further). Why does the farm need to spend > any time running it at all? > I think Michael's point was that if we carry the code we should test we can run it. The other possibility would be just to remove it. I can see arguments for both. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan <andrew@dunslane.net> wrote: > I think Michael's point was that if we carry the code we should test we > can run it. The other possibility would be just to remove it. I can see > arguments for both. Hm. If it's not acceptable to carry this (as a worse-is-better smoke test) without also running it during tests, then my personal vote would be to tear it out and just have people write/contribute targeted benchmarks when they end up working on performance. I don't think the cost/benefit makes sense at that point. --Jacob
On Mon, Apr 8, 2024 at 10:24 PM Michael Paquier <michael@paquier.xyz> wrote: > At the end, having a way to generate JSON blobs randomly to test this > stuff would be more appealing For the record, I'm working on an LLVM fuzzer target for the JSON parser. I think that would be a lot more useful than anything we can hand-code. But I want it to cover both the recursive and incremental code paths, and we'd need to talk about where it would live. libfuzzer is seeded with a bunch of huge incomprehensible blobs, which is something we're now trying to avoid checking in. There's also the security aspect of "what do we do when it finds something", and at that point maybe we need to look into a service like oss-fuzz. Thanks, --Jacob
On Mon, Apr 8, 2024 at 5:48 PM Michael Paquier <michael@paquier.xyz> wrote: > : "http://www.theatermania.com/broadway/", > > As as matter of fact, this one points to something that seems to be > real. Most of the blobs had better be trimmed down. This new version keeps all the existing structure but points the hostnames to example.com [1]. --Jacob [1] https://datatracker.ietf.org/doc/html/rfc2606#section-3
Attachment
On 2024-04-09 Tu 07:54, Andrew Dunstan wrote: > > > On 2024-04-09 Tu 01:23, Michael Paquier wrote: >> On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: >>> There is no direct check on test_json_parser_perf.c, either, only a >>> custom rule in the Makefile without specifying something for meson. >>> So it looks like you could do short execution check in a TAP test, at >>> least. > > > Not adding a test for that was deliberate - any sane test takes a > while, and I didn't want to spend that much time on it every time > someone runs "make check-world" or equivalent. However, adding a test > to run it with a trivial number of iterations seems reasonable, so > I'll add that. I'll also add a meson target for the binary. > > >> While reading the code, I've noticed a few things, as well: >> >> + /* max delicious line length is less than this */ >> + char buff[6001]; >> >> Delicious applies to the contents, nothing to do with this code even >> if I'd like to think that these lines of code are edible and good. >> Using a hardcoded limit for some JSON input sounds like a bad idea to >> me even if this is a test module. Comment that applies to both the >> perf and the incremental tools. You could use a #define'd buffer size >> for readability rather than assuming this value in many places. > > > The comment is a remnant of when I hadn't yet added support for > incomplete tokens, and so I had to parse the input line by line. I > agree it can go, and we can use a manifest constant for the buffer size. > > >> +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c >> + * This progam tests incremental parsing of json. The input is fed into >> + * full range of incement handling, especially in the lexer, is exercised. >> +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c >> + * Performancet est program for both flavors of the JSON parser >> + * This progam tests either the standard (recursive descent) JSON parser >> +++ b/src/test/modules/test_json_parser/README >> + reads in a file and pases it in very small chunks (60 bytes at a time) to >> >> Collection of typos across various files. > > > Will fix. (The older I get the more typos I seem to make and the > harder it is to notice them. It's very annoying.) > > >> + appendStringInfoString(&json, "1+23 trailing junk"); >> What's the purpose here? Perhaps the intention should be documented >> in a comment? > > > The purpose is to ensure that if there is not a trailing '\0' on the > json chunk the parser will still do the right thing. I'll add a > comment to that effect. > > >> At the end, having a way to generate JSON blobs randomly to test this >> stuff would be more appealing than what you have currently, with >> perhaps a few factors like: >> - Array and simple object density. >> - Max Level of nesting. >> - Limitation to ASCII characters that can be escaped. >> - Perhaps more things I cannot think about? > > > No, I disagree. Maybe we need those things as well, but we do need a > static test where we can test the output against known results. I have > no objection to changing the input and output files. > > It's worth noting that t/002_inline.pl does generate some input and > test e.g., the maximum nesting levels among other errors. Perhaps you > missed that. If you think we need more tests there adding them would > be extremely simple. 
> > >> So the current state of things is kind of disappointing, and the size >> of the data set added to the tree is not that portable either if you >> want to test various scenarios depending on the data set. It seems to >> me that this has been committed too hastily and that this is not ready >> for integration, even if that's just a test module. Tom also has >> shared some concerns upthread, as far as I can see. > > > I have posted a patch already that addresses the issue Tom raised. > > Here's a consolidated set of cleanup patches, including the memory leak patch and Jacob's shrink-tiny patch. I think the biggest open issue is whether or not we remove the performance test program. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
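As an aside on the buffer-size point above, here is a minimal sketch of what a #define'd chunk buffer could look like in the test reader. The constant name, the helper name, and the surrounding loop are illustrative assumptions, not code from the committed module:

    #include <stdio.h>

    /* illustrative replacement for the hardcoded 6001-byte limit */
    #define TEST_JSON_CHUNK_SIZE 6001

    /*
     * Read the input in fixed-size chunks rather than line by line, so the
     * test makes no assumption about maximum line length.  The call handing
     * each chunk to the incremental parser is elided; only the buffering
     * pattern is shown.
     */
    static void
    feed_chunks(FILE *json_file)
    {
        char    buff[TEST_JSON_CHUNK_SIZE];
        size_t  n;

        while ((n = fread(buff, 1, sizeof(buff), json_file)) > 0)
        {
            /* pass buff (n bytes, not necessarily NUL-terminated) to the parser */
        }
    }

Feeding chunks by pointer and length like this is also what makes the "trailing junk" test above meaningful: the parser must not rely on a terminating '\0' in each chunk.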
On 2024-04-09 Tu 15:42, Andrew Dunstan wrote: > > On 2024-04-09 Tu 07:54, Andrew Dunstan wrote: >> >> >> On 2024-04-09 Tu 01:23, Michael Paquier wrote: >>> On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: >>>> There is no direct check on test_json_parser_perf.c, either, only a >>>> custom rule in the Makefile without specifying something for meson. >>>> So it looks like you could do short execution check in a TAP test, at >>>> least. >> >> >> Not adding a test for that was deliberate - any sane test takes a >> while, and I didn't want to spend that much time on it every time >> someone runs "make check-world" or equivalent. However, adding a test >> to run it with a trivial number of iterations seems reasonable, so >> I'll add that. I'll also add a meson target for the binary. >> >> >>> While reading the code, I've noticed a few things, as well: >>> >>> + /* max delicious line length is less than this */ >>> + char buff[6001]; >>> >>> Delicious applies to the contents, nothing to do with this code even >>> if I'd like to think that these lines of code are edible and good. >>> Using a hardcoded limit for some JSON input sounds like a bad idea to >>> me even if this is a test module. Comment that applies to both the >>> perf and the incremental tools. You could use a #define'd buffer size >>> for readability rather than assuming this value in many places. >> >> >> The comment is a remnant of when I hadn't yet added support for >> incomplete tokens, and so I had to parse the input line by line. I >> agree it can go, and we can use a manifest constant for the buffer size. >> >> >>> +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c >>> + * This progam tests incremental parsing of json. The input is fed >>> into >>> + * full range of incement handling, especially in the lexer, is >>> exercised. >>> +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c >>> + * Performancet est program for both flavors of the JSON parser >>> + * This progam tests either the standard (recursive descent) JSON >>> parser >>> +++ b/src/test/modules/test_json_parser/README >>> + reads in a file and pases it in very small chunks (60 bytes at a >>> time) to >>> >>> Collection of typos across various files. >> >> >> Will fix. (The older I get the more typos I seem to make and the >> harder it is to notice them. It's very annoying.) >> >> >>> + appendStringInfoString(&json, "1+23 trailing junk"); >>> What's the purpose here? Perhaps the intention should be documented >>> in a comment? >> >> >> The purpose is to ensure that if there is not a trailing '\0' on the >> json chunk the parser will still do the right thing. I'll add a >> comment to that effect. >> >> >>> At the end, having a way to generate JSON blobs randomly to test this >>> stuff would be more appealing than what you have currently, with >>> perhaps a few factors like: >>> - Array and simple object density. >>> - Max Level of nesting. >>> - Limitation to ASCII characters that can be escaped. >>> - Perhaps more things I cannot think about? >> >> >> No, I disagree. Maybe we need those things as well, but we do need a >> static test where we can test the output against known results. I >> have no objection to changing the input and output files. >> >> It's worth noting that t/002_inline.pl does generate some input and >> test e.g., the maximum nesting levels among other errors. Perhaps you >> missed that. If you think we need more tests there adding them would >> be extremely simple. 
>> >> >>> So the current state of things is kind of disappointing, and the size >>> of the data set added to the tree is not that portable either if you >>> want to test various scenarios depending on the data set. It seems to >>> me that this has been committed too hastily and that this is not ready >>> for integration, even if that's just a test module. Tom also has >>> shared some concerns upthread, as far as I can see. >> >> >> I have posted a patch already that addresses the issue Tom raised. >> >> > > > Here's a consolidated set of cleanup patches, including the memory > leak patch and Jacob's shrink-tiny patch. Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed out to me by Alexander Lakhin. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Tue, Apr 09, 2024 at 09:26:49AM -0700, Jacob Champion wrote: > On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> I think Michael's point was that if we carry the code we should test we >> can run it. The other possibility would be just to remove it. I can see >> arguments for both. > > Hm. If it's not acceptable to carry this (as a worse-is-better smoke > test) without also running it during tests, then my personal vote > would be to tear it out and just have people write/contribute targeted > benchmarks when they end up working on performance. I don't think the > cost/benefit makes sense at that point. And you may catch a couple of bugs while at it. In my experience, things with custom makefile and/or meson rules tend to rot easily because everybody forgets about them. There are a few of them in the tree that could be ripped out, as well. -- Michael
Attachment
On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote: > Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed > out to me by Alexander Lakhin. I can see that this has been applied as of 42fa4b660143 with some extra commits. Anyway, I have noticed another thing in the surroundings that's annoying. 003 has this logic: use File::Temp qw(tempfile); [...] my ($fh, $fname) = tempfile(); print $fh $stdout,"\n"; close($fh); This creates a temporary file in /tmp/ that remains around, slowly bloating the temporary file space on a node while leaving around some data. Why use File::Temp::tempfile here? Couldn't you just use a file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up once the test finishes? Per [1], escape_json() has no coverage outside its default path. Is that intended? Documenting all these test files with a few comments would be welcome, as well, with some copyright notices... json_file = fopen(testfile, "r"); fstat(fileno(json_file), &statbuf); bytes_left = statbuf.st_size; No checks on failure of fstat() here? json_file = fopen(argv[2], "r"); Second one in test_json_parser_perf.c, with more stuff for fread(). [1]: https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html -- Michael
Attachment
> On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote: >> Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed out to me by Alexander Lakhin. > I can see that this has been applied as of 42fa4b660143 with some extra commits. Anyway, I have noticed another thing in the surroundings that's annoying. 003 has this logic: use File::Temp qw(tempfile); [...] my ($fh, $fname) = tempfile(); print $fh $stdout,"\n"; close($fh); This creates a temporary file in /tmp/ that remains around, slowly bloating the temporary file space on a node while leaving around some data.
My bad, I should have used the UNLINK option like in the other tests.
> Why use File::Temp::tempfile here? Couldn't you just use a file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up once the test finishes?
That's another possibility, but I think the above is the simplest.
> Per [1], escape_json() has no coverage outside its default path. Is that intended?
Not particularly. I'll add some stuff to get complete coverage.
> Documenting all these test files with a few comments would be welcome, as well, with some copyright notices...
ok
> json_file = fopen(testfile, "r"); fstat(fileno(json_file), &statbuf); bytes_left = statbuf.st_size; No checks on failure of fstat() here?
ok will fix
> json_file = fopen(argv[2], "r"); Second one in test_json_parser_perf.c, with more stuff for fread().
ok will fix
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
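A sketch of the fstat()/fopen() checks promised above, using plain libc error reporting to stay self-contained; the helper name and the exit-on-error behaviour are assumptions, and the in-tree programs may well use the frontend logging helpers instead:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    /*
     * Open the test file and determine its size, checking both calls
     * instead of assuming they succeed.
     */
    static FILE *
    open_json_input(const char *testfile, long *bytes_left)
    {
        FILE       *json_file;
        struct stat statbuf;

        json_file = fopen(testfile, "r");
        if (json_file == NULL)
        {
            fprintf(stderr, "could not open \"%s\": %s\n",
                    testfile, strerror(errno));
            exit(1);
        }
        if (fstat(fileno(json_file), &statbuf) != 0)
        {
            fprintf(stderr, "could not stat \"%s\": %s\n",
                    testfile, strerror(errno));
            exit(1);
        }
        *bytes_left = (long) statbuf.st_size;
        return json_file;
    }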
On Thu, Apr 18, 2024 at 06:03:45AM -0400, Andrew Dunstan wrote: > On 2024-04-18 Th 02:04, Michael Paquier wrote: >> Why use File::Temp::tempfile here? Couldn't you just use a >> file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up >> once the test finishes? > > That's another possibility, but I think the above is the simplest. Are you sure that relying on File::Temp is a good thing overall? The current temporary file knowledge is encapsulated within Utils.pm, with files removed or kept depending on PG_TEST_NOCLEAN. So it would be just more consistent to rely on the existing facilities instead? test_json_parser is the only code path in the whole tree that directly uses File::Temp. The rest of the TAP tests rely on Utils.pm for temp file paths. -- Michael
Attachment
On Fri, Apr 19, 2024 at 1:35 AM Michael Paquier <michael@paquier.xyz> wrote: > Are you sure that relying on Temp::File is a good thing overall? The > current temporary file knowledge is encapsulated within Utils.pm, with > files removed or kept depending on PG_TEST_NOCLEAN. So it would be > just more consistent to rely on the existing facilities instead? > test_json_parser is the only code path in the whole tree that directly > uses File::Temp. The rest of the TAP tests relies on Utils.pm for > temp file paths. Yeah, I think this patch invented a new solution to a problem that we've solved in a different way everywhere else. I think we should change it to match what we do in general. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote: > Yeah, I think this patch invented a new solution to a problem that > we've solved in a different way everywhere else. I think we should > change it to match what we do in general. As of ba3e6e2bca97, did you notice that test_json_parser_perf generates two core files because progname is not set, failing an assertion when using the frontend logging? -- Michael
Attachment
On 2024-04-24 We 04:56, Michael Paquier wrote: > On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote: >> Yeah, I think this patch invented a new solution to a problem that >> we've solved in a different way everywhere else. I think we should >> change it to match what we do in general. > As of ba3e6e2bca97, did you notice that test_json_parser_perf > generates two core files because progname is not set, failing an > assertion when using the frontend logging? No, it didn't for me. Thanks for noticing, I've pushed a fix. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
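For context, the usual frontend-program setup that keeps the logging machinery happy looks roughly like the sketch below. This shows the common pattern (pg_logging_init() records progname before any pg_log_*()/pg_fatal() call), not the exact committed fix:

    #include "postgres_fe.h"

    #include "common/logging.h"

    int
    main(int argc, char **argv)
    {
        /*
         * Set up frontend logging first; this records progname, whose
         * absence is what tripped the assertion mentioned above.
         */
        pg_logging_init(argv[0]);

        /* ... argument parsing and the actual test work go here ... */

        return 0;
    }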