Thread: WIP Incremental JSON Parser
Quite a long time ago Robert asked me about the possibility of an incremental JSON parser. I wrote one, and I've tweaked it a bit, but the performance is significantly worse than that of the current Recursive Descent parser. Nevertheless, I'm attaching my current WIP state for it, and I'll add it to the next CF to keep the conversation going.

One possible use would be in parsing large manifest files for incremental backup. However, it struck me a few days ago that this might not work all that well. The current parser and the new parser both palloc() space for each field name and scalar token in the JSON (unless they aren't used, which is normally not the case), and they don't free it, so that, particularly if done in frontend code, this amounts to a possible memory leak unless the semantic routines do the freeing themselves. So while we can save some memory by not having to slurp in the whole JSON in one hit, we aren't saving any of that other allocation of memory, which amounts to almost as much space as the raw JSON.

In any case, I've had fun, so it's not a total loss come what may :-)

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse that that of the current Recursive
> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
> and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than the current parser, although that probably necessitates keeping both, which isn't great, but I don't see a better alternative.

> One possible use would be in parsing large manifest files for
> incremental backup. However, it struck me a few days ago that this might
> not work all that well. The current parser and the new parser both
> palloc() space for each field name and scalar token in the JSON (unless
> they aren't used, which is normally not the case), and they don't free
> it, so that particularly if done in frontend code this amounts to a
> possible memory leak, unless the semantic routines do the freeing
> themselves. So while we can save some memory by not having to slurp in
> the whole JSON in one hit, we aren't saving any of that other allocation
> of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the backup_manifest file is 2GB, and instead of creating a 2GB buffer, you create a 1MB buffer and feed the data to the parser in 1MB chunks. Well, that saves 2GB less 1MB, full stop. Now if we address the issue you raise here in some way, we can potentially save even more memory, which is great, but even if we don't, we still saved a bunch of memory that could not have been saved in any other way.

As far as addressing that other issue, we could address the issue either by having the semantic routines free the memory if they don't need it, or alternatively by having the parser itself free the memory after invoking any callbacks to which it might be passed. The latter approach feels more conceptually pure, but the former might be the more practical approach. I think what really matters here is that we document who must or may do which things. When a callback gets passed a pointer, we can document either that (1) it's a palloc'd chunk that the callback can free if they want or (2) that it's a palloc'd chunk that the caller must not free or (3) that it's not a palloc'd chunk. We can further document the memory context in which the chunk will be allocated, if applicable, and when/if the parser will free it.

-- Robert Haas EDB: http://www.enterprisedb.com
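For illustration, a minimal sketch of the chunked-feed loop described above. Only pg_parse_json_incremental() and JSON_INCOMPLETE come from the WIP patch; the helper name, the buffer size, and the error handling here are assumptions.

#define READ_CHUNK_SIZE (1024 * 1024)

/*
 * Sketch only: feed a large file to the incremental parser in fixed-size
 * chunks instead of slurping the whole thing into one buffer.
 * JSON_INCOMPLETE is taken to mean "the parser wants more input".
 */
static JsonParseErrorType
parse_in_chunks(FILE *fp, JsonLexContext *lex, JsonSemAction *sem)
{
    char       *buf = palloc(READ_CHUNK_SIZE);
    JsonParseErrorType res;

    for (;;)
    {
        size_t      nread = fread(buf, 1, READ_CHUNK_SIZE, fp);
        bool        is_last = (nread < READ_CHUNK_SIZE);    /* EOF (or read error) */

        res = pg_parse_json_incremental(lex, sem, buf, nread, is_last);

        if (res != JSON_INCOMPLETE || is_last)
            break;
    }

    pfree(buf);
    return res;
}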
On 2024-01-02 Tu 10:14, Robert Haas wrote:
> On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Quite a long time ago Robert asked me about the possibility of an
>> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
>> performance is significantly worse that that of the current Recursive
>> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
>> and I'll add it to the next CF to keep the conversation going.
> Thanks for doing this. I think it's useful even if it's slower than
> the current parser, although that probably necessitates keeping both,
> which isn't great, but I don't see a better alternative.
>
>> One possible use would be in parsing large manifest files for
>> incremental backup. However, it struck me a few days ago that this might
>> not work all that well. The current parser and the new parser both
>> palloc() space for each field name and scalar token in the JSON (unless
>> they aren't used, which is normally not the case), and they don't free
>> it, so that particularly if done in frontend code this amounts to a
>> possible memory leak, unless the semantic routines do the freeing
>> themselves. So while we can save some memory by not having to slurp in
>> the whole JSON in one hit, we aren't saving any of that other allocation
>> of memory, which amounts to almost as much space as the raw JSON.
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create an 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.
>
> As far as addressing that other issue, we could address the issue
> either by having the semantic routines free the memory if they don't
> need it, or alternatively by having the parser itself free the memory
> after invoking any callbacks to which it might be passed. The latter
> approach feels more conceptually pure, but the former might be the
> more practical approach. I think what really matters here is that we
> document who must or may do which things. When a callback gets passed
> a pointer, we can document either that (1) it's a palloc'd chunk that
> the calllback can free if they want or (2) that it's a palloc'd chunk
> that the caller must not free or (3) that it's not a palloc'd chunk.
> We can further document the memory context in which the chunk will be
> allocated, if applicable, and when/if the parser will free it.

Yeah. One idea I had yesterday was to stash the field names, which in large JSON docs tend to be pretty repetitive, in a hash table instead of pstrduping each instance. The name would be valid until the end of the parse, and would only need to be duplicated by the callback function if it were needed beyond that. That's not the case currently with the parse_manifest code. I'll work on using a hash table.

The parse_manifest code does seem to pfree the scalar values it no longer needs fairly well, so maybe we don't need to do anything there.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Yeah. One idea I had yesterday was to stash the field names, which in
> large JSON docs tent to be pretty repetitive, in a hash table instead of
> pstrduping each instance. The name would be valid until the end of the
> parse, and would only need to be duplicated by the callback function if
> it were needed beyond that. That's not the case currently with the
> parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON probably wants to discard the duplicated labels and convert other heavily duplicated strings to enum values or something (e.g. if every record has {"color":"red"} or {"color":"green"}). So the hash table lookups will cost something, but they won't really save anything more than just freeing the memory that isn't needed, and they will probably be more expensive.

> The parse_manifest code does seem to pfree the scalar values it no
> longer needs fairly well, so maybe we don't need to to anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

-- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-03 We 08:45, Robert Haas wrote:
> On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Yeah. One idea I had yesterday was to stash the field names, which in
>> large JSON docs tent to be pretty repetitive, in a hash table instead of
>> pstrduping each instance. The name would be valid until the end of the
>> parse, and would only need to be duplicated by the callback function if
>> it were needed beyond that. That's not the case currently with the
>> parse_manifest code. I'll work on using a hash table.
> IMHO, this is not a good direction. Anybody who is parsing JSON
> probably wants to discard the duplicated labels and convert other
> heavily duplicated strings to enum values or something. (e.g. if every
> record has {"color":"red"} or {"color":"green"}). So the hash table
> lookups will cost but won't really save anything more than just
> freeing the memory not needed, but will probably be more expensive.

I don't quite follow.

Say we have a document with an array of 1m objects, each with a field called "color". As it stands we'll allocate space for that field name 1m times. Using a hash table we'd allocate space for it once. And allocating the memory isn't free, although it might be cheaper than doing hash lookups.

I guess we can benchmark it and see what the performance impact of using a hash table might be.

Another possibility would be simply to have the callback free the field name after use. For the parse_manifest code that could be a one-line addition to the code at the bottom of json_object_manifest_field_start().

>> The parse_manifest code does seem to pfree the scalar values it no
>> longer needs fairly well, so maybe we don't need to to anything there.
> Hmm. This makes me wonder if you've measured how much actual leakage there is?

No I haven't. I have simply theorized about how much memory we might consume if nothing were done by the callers to free the memory.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Say we have a document with an array 1m objects, each with a field > called "color". As it stands we'll allocate space for that field name 1m > times. Using a hash table we'd allocated space for it once. And > allocating the memory isn't free, although it might be cheaper than > doing hash lookups. > > I guess we can benchmark it and see what the performance impact of using > a hash table might be. > > Another possibility would be simply to have the callback free the field > name after use. for the parse_manifest code that could be a one-line > addition to the code at the bottom of json_object_manifest_field_start(). Yeah. So I'm arguing that allocating the memory each time and then freeing it sounds cheaper than looking it up in the hash table every time, discovering it's there, and thus skipping the allocate/free. I might be wrong about that. It's just that allocating and freeing a small chunk of memory should boil down to popping it off of a linked list and then pushing it back on. And that sounds cheaper than hashing the string and looking for it in a hash bucket. -- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-03 We 10:12, Robert Haas wrote: > On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Say we have a document with an array 1m objects, each with a field >> called "color". As it stands we'll allocate space for that field name 1m >> times. Using a hash table we'd allocated space for it once. And >> allocating the memory isn't free, although it might be cheaper than >> doing hash lookups. >> >> I guess we can benchmark it and see what the performance impact of using >> a hash table might be. >> >> Another possibility would be simply to have the callback free the field >> name after use. for the parse_manifest code that could be a one-line >> addition to the code at the bottom of json_object_manifest_field_start(). > Yeah. So I'm arguing that allocating the memory each time and then > freeing it sounds cheaper than looking it up in the hash table every > time, discovering it's there, and thus skipping the allocate/free. > > I might be wrong about that. It's just that allocating and freeing a > small chunk of memory should boil down to popping it off of a linked > list and then pushing it back on. And that sounds cheaper than hashing > the string and looking for it in a hash bucket. OK, cleaning up in the client code will be much simpler, so let's go with that for now and revisit it later if necessary. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
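To make the client-side cleanup concrete, a field-start callback that takes ownership of the name might look roughly like this; the callback name, the state type, and the lookup helper are made up for illustration, and the json_ofield_action signature is assumed to match the one in jsonapi.h.

/*
 * Illustrative only: an object-field-start callback that frees the
 * parser-allocated field name once it has been consumed, per option (1)
 * above.  MyParseState and lookup_field_id() are hypothetical.
 */
static JsonParseErrorType
my_object_field_start(void *state, char *fname, bool isnull)
{
    MyParseState *parse = state;

    /* record which field we are about to see the value for */
    parse->next_field = lookup_field_id(fname);

    /* the callback owns the name and may free it */
    pfree(fname);

    return JSON_SUCCESS;
}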
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote: > It seems like a pretty significant savings no matter what. Suppose the > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you > create an 1MB buffer and feed the data to the parser in 1MB chunks. > Well, that saves 2GB less 1MB, full stop. Now if we address the issue > you raise here in some way, we can potentially save even more memory, > which is great, but even if we don't, we still saved a bunch of memory > that could not have been saved in any other way. You could also build a streaming incremental parser. That is, one that outputs a path and a leaf value (where leaf values are scalar values, `null`, `true`, `false`, numbers, and strings). Then if the caller is doing something JSONPath-like then the caller can probably immediately free almost all allocations and even terminate the parse early. Nico --
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams <nico@cryptonector.com> wrote: > On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote: > > It seems like a pretty significant savings no matter what. Suppose the > > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you > > create an 1MB buffer and feed the data to the parser in 1MB chunks. > > Well, that saves 2GB less 1MB, full stop. Now if we address the issue > > you raise here in some way, we can potentially save even more memory, > > which is great, but even if we don't, we still saved a bunch of memory > > that could not have been saved in any other way. > > You could also build a streaming incremental parser. That is, one that > outputs a path and a leaf value (where leaf values are scalar values, > `null`, `true`, `false`, numbers, and strings). Then if the caller is > doing something JSONPath-like then the caller can probably immediately > free almost all allocations and even terminate the parse early. I think our current parser is event-based rather than this ... but it seems like this could easily be built on top of it, if someone wanted to. -- Robert Haas EDB: http://www.enterprisedb.com
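A sketch (hypothetical names throughout; array indexes omitted for brevity) of how such a path-plus-leaf consumer could be layered on the existing event callbacks: keep the current path in the semantic state and handle each (path, scalar) pair as it arrives.

#define MAX_PATH_DEPTH 64

typedef struct PathState
{
    char       *fields[MAX_PATH_DEPTH]; /* current object-field path */
    int         depth;
} PathState;

static JsonParseErrorType
path_field_start(void *state, char *fname, bool isnull)
{
    PathState  *ps = state;

    /* real code would check depth against MAX_PATH_DEPTH here */
    ps->fields[ps->depth++] = fname;
    return JSON_SUCCESS;
}

static JsonParseErrorType
path_field_end(void *state, char *fname, bool isnull)
{
    PathState  *ps = state;

    /* freeing the name is governed by whichever ownership rule is settled on above */
    ps->depth--;
    return JSON_SUCCESS;
}

static JsonParseErrorType
path_scalar(void *state, char *token, JsonTokenType tokentype)
{
    PathState  *ps = state;

    /* handle_leaf() is caller-defined; it may free or keep the token */
    handle_leaf(ps->fields, ps->depth, token, tokentype);
    return JSON_SUCCESS;
}

These would be installed in the object_field_start, object_field_end, and scalar members of a JsonSemAction, with a PathState as the semstate.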
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse that that of the current Recursive
> Descent parser.

The prediction stack is neat. It seems like the main loop is hit so many thousands of times that micro-optimization would be necessary... I attached a sample diff to get rid of the strlen calls during push_prediction(), which speeds things up a bit (8-15%, depending on optimization level) on my machines.

Maybe it's possible to condense some of those productions down, and reduce the loop count? E.g. does every "scalar" production need to go three times through the loop/stack, or can the scalar semantic action just peek at the next token prediction and do all the callback work at once?

> +            case JSON_SEM_SCALAR_CALL:
> +                {
> +                    json_scalar_action sfunc = sem->scalar;
> +
> +                    if (sfunc != NULL)
> +                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
> +                }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or does it disappear if the parser hits an incomplete token?

> One possible use would be in parsing large manifest files for
> incremental backup.

I'm keeping an eye on this thread for OAuth, since the clients have to parse JSON as well. Those responses tend to be smaller, though, so you'd have to really be hurting for resources to need this.

--Jacob
Attachment
On 2024-01-09 Tu 13:46, Jacob Champion wrote: > On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Quite a long time ago Robert asked me about the possibility of an >> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the >> performance is significantly worse that that of the current Recursive >> Descent parser. > The prediction stack is neat. It seems like the main loop is hit so > many thousands of times that micro-optimization would be necessary... > I attached a sample diff to get rid of the strlen calls during > push_prediction(), which speeds things up a bit (8-15%, depending on > optimization level) on my machines. Thanks for looking! I've been playing around with a similar idea, but yours might be better. > > Maybe it's possible to condense some of those productions down, and > reduce the loop count? E.g. does every "scalar" production need to go > three times through the loop/stack, or can the scalar semantic action > just peek at the next token prediction and do all the callback work at > once? Also a good suggestion. Will look and see. IIRC I had trouble with this bit. > >> + case JSON_SEM_SCALAR_CALL: >> + { >> + json_scalar_action sfunc = sem->scalar; >> + >> + if (sfunc != NULL) >> + (*sfunc) (sem->semstate, scalar_val, scalar_tok); >> + } > Is it safe to store state (scalar_val/scalar_tok) on the stack, or > does it disappear if the parser hits an incomplete token? Good point. In fact it might be responsible for the error I'm currently trying to get to the bottom of. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-01-09 Tu 13:46, Jacob Champion wrote: > On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Quite a long time ago Robert asked me about the possibility of an >> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the >> performance is significantly worse that that of the current Recursive >> Descent parser. > The prediction stack is neat. It seems like the main loop is hit so > many thousands of times that micro-optimization would be necessary... > I attached a sample diff to get rid of the strlen calls during > push_prediction(), which speeds things up a bit (8-15%, depending on > optimization level) on my machines. > > Maybe it's possible to condense some of those productions down, and > reduce the loop count? E.g. does every "scalar" production need to go > three times through the loop/stack, or can the scalar semantic action > just peek at the next token prediction and do all the callback work at > once? > >> + case JSON_SEM_SCALAR_CALL: >> + { >> + json_scalar_action sfunc = sem->scalar; >> + >> + if (sfunc != NULL) >> + (*sfunc) (sem->semstate, scalar_val, scalar_tok); >> + } > Is it safe to store state (scalar_val/scalar_tok) on the stack, or > does it disappear if the parser hits an incomplete token? > >> One possible use would be in parsing large manifest files for >> incremental backup. > I'm keeping an eye on this thread for OAuth, since the clients have to > parse JSON as well. Those responses tend to be smaller, though, so > you'd have to really be hurting for resources to need this. > I've incorporated your suggestion, and fixed the bug you identified. The attached also adds support for incrementally parsing backup manifests, and uses that in the three places we call the manifest parser. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there were CFbot test failures last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/4725/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725 Kind Regards, Peter Smith.
On 2024-01-22 Mo 01:29, Peter Smith wrote: > 2024-01 Commitfest. > > Hi, This patch has a CF status of "Needs Review" [1], but it seems > there were CFbot test failures last time it was run [2]. Please have a > look and post an updated version if necessary. Thanks. Let's see if the attached does better. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote: > > On 2024-01-22 Mo 01:29, Peter Smith wrote: >> 2024-01 Commitfest. >> >> Hi, This patch has a CF status of "Needs Review" [1], but it seems >> there were CFbot test failures last time it was run [2]. Please have a >> look and post an updated version if necessary. > > > Thanks. > > Let's see if the attached does better. This time for sure! (Watch me pull a rabbit out of my hat!) It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed it, and I think the test will now pass. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote: > > On 2024-01-22 Mo 14:16, Andrew Dunstan wrote: >> >> On 2024-01-22 Mo 01:29, Peter Smith wrote: >>> 2024-01 Commitfest. >>> >>> Hi, This patch has a CF status of "Needs Review" [1], but it seems >>> there were CFbot test failures last time it was run [2]. Please have a >>> look and post an updated version if necessary. >> >> >> Thanks. >> >> Let's see if the attached does better. > > > > This time for sure! (Watch me pull a rabbit out of my hat!) > > > It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed > it, and I think the test will now pass. > > > Fixed one problem but there are some others. I'm hoping this will satisfy the cfbot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at <https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>, and ran what I think are the same tests dozens of times, but the failure did not reappear in my setup. Unfortunately, the test doesn't show the failing manifest or log the failing field, so trying to divine what happened here is more than difficult.
Not sure how to address this.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
>
> # Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
> pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
> pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
> [02:41:07.830](0.073s) not ok 2 - incremental backup from node1
> [02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
>
> I have set up a Debian 12 EC2 instance following the recipe at <https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>, and ran what I think are the same tests dozens of times, but the failure did not reappear in my setup. Unfortunately, the test doesn't show the failing manifest or log the failing field, so trying to divine what happened here is more than difficult.
>
> Not sure how to address this.

Yeah, that's really odd. The backup size field is printed like this:

appendStringInfo(&buf, "\"Size\": %zu, ", size);

And parsed like this:

size = strtoul(parse->size, &ep, 10);
if (*ep)
    json_manifest_parse_failure(parse->context,
                                "file size is not an integer");

I confess to bafflement -- how could the output of the first fail to be parsed by the second? The manifest has to look pretty much valid in order not to error out before it gets to this check, with just that one field corrupted. But I don't understand how that could happen.

I agree that the error reporting could be better here, but it didn't seem worth spending time on when I wrote the code. I figured the only way we could end up with something like "Size": "Walrus" is if the user was messing with us on purpose. Apparently that's not so, yet the mechanism eludes me. Or maybe it's not some random string, but is something like an empty string or a number with trailing garbage or a number that's out of range. But I don't see how any of those things can happen either.

Maybe you should adjust your patch to dump the manifests into the log file with note(). Then when cfbot runs on it you can see exactly what the raw file looks like. Although I wonder if it's possible that the manifest itself is OK, but somehow it gets garbled when uploaded to the server, either because the client code that sends it or the server code that receives it does something that isn't safe in 32-bit mode. If we hypothesize an off-by-one error or a buffer overrun, that could possibly explain how one field got garbled while the rest of the file is OK.

-- Robert Haas EDB: http://www.enterprisedb.com
On 2024-01-24 We 13:08, Robert Haas wrote: > > Maybe you should adjust your patch to dump the manifests into the log > file with note(). Then when cfbot runs on it you can see exactly what > the raw file looks like. Although I wonder if it's possible that the > manifest itself is OK, but somehow it gets garbled when uploaded to > the server, either because the client code that sends it or the server > code that receives it does something that isn't safe in 32-bit mode. > If we hypothesize an off-by-one error or a buffer overrun, that could > possibly explain how one field got garbled while the rest of the file > is OK. Yeah, I thought earlier today I was on the track of an off by one error, but I was apparently mistaken, so here's the same patch set with an extra patch that logs a bunch of stuff, and might let us see what's upsetting the cfbot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:
> On 2024-01-24 We 13:08, Robert Haas wrote:
>> Maybe you should adjust your patch to dump the manifests into the log
>> file with note(). Then when cfbot runs on it you can see exactly what
>> the raw file looks like. Although I wonder if it's possible that the
>> manifest itself is OK, but somehow it gets garbled when uploaded to
>> the server, either because the client code that sends it or the server
>> code that receives it does something that isn't safe in 32-bit mode.
>> If we hypothesize an off-by-one error or a buffer overrun, that could
>> possibly explain how one field got garbled while the rest of the file
>> is OK.
>
> Yeah, I thought earlier today I was on the track of an off by one
> error, but I was apparently mistaken, so here's the same patch set
> with an extra patch that logs a bunch of stuff, and might let us see
> what's upsetting the cfbot.

Well, that didn't help a lot, but meanwhile the CFBot seems to have decided in the last few days that it's now happy, so full steam ahead! ;-)

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Well, that didn't help a lot, but meanwhile the CFBot seems to have > decided in the last few days that it's now happy, so full steam aead! ;-) I haven't been able to track down the root cause yet, but I am able to reproduce the failure consistently on my machine: ERROR: could not parse backup manifest: file size is not an integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 Full log is attached. --Jacob
Attachment
On 2024-02-20 Tu 19:53, Jacob Champion wrote: > On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> Well, that didn't help a lot, but meanwhile the CFBot seems to have >> decided in the last few days that it's now happy, so full steam aead! ;-) > I haven't been able to track down the root cause yet, but I am able to > reproduce the failure consistently on my machine: > > ERROR: could not parse backup manifest: file size is not an > integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT", > "Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" }, > { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20 > > Full log is attached. > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can you give me details of the build? OS, compiler, path to source, build setup etc.? Anything that might be remotely relevant. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> you give me details of the build? OS, compiler, path to source, build
> setup etc.? Anything that might be remotely relevant.

Sure:

- guest VM running in UTM (QEMU 7.2) is Ubuntu 22.04 for ARM, default core count, 8GB
- host is macOS Sonoma 14.3.1, Apple Silicon (M3 Pro), 36GB
- it's a Meson build (plus a diff for the test utilities, attached, but that's hopefully not relevant to the failure)
- buildtype=debug, optimization=g, cassert=true
- GCC 11.4
- build path is nested a bit (~/src/postgres/worktree-inc-json/build-dev)

--Jacob
Attachment
On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> > you give me details of the build? OS, compiler, path to source, build
> > setup etc.? Anything that might be remotely relevant.

This construction seems suspect, in json_lex_number():

> if (lex->incremental && !lex->inc_state->is_last_chunk &&
>     len >= lex->input_length)
> {
>     appendStringInfoString(&lex->inc_state->partial_token,
>                            lex->token_start);
>     return JSON_INCOMPLETE;
> }

appendStringInfoString() isn't respecting the end of the chunk: if there's extra data after the chunk boundary (as AppendIncrementalManifestData() does) then all of that will be stuck onto the end of the partial_token.

I'm about to context-switch off of this for the day, but I can work on a patch tomorrow if that'd be helpful. It looks like this is not the only call to appendStringInfoString().

--Jacob
On 2024-02-21 We 15:26, Jacob Champion wrote:
> On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion
> <jacob.champion@enterprisedb.com> wrote:
>> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
>>> you give me details of the build? OS, compiler, path to source, build
>>> setup etc.? Anything that might be remotely relevant.
> This construction seems suspect, in json_lex_number():
>
>> if (lex->incremental && !lex->inc_state->is_last_chunk &&
>>     len >= lex->input_length)
>> {
>>     appendStringInfoString(&lex->inc_state->partial_token,
>>                            lex->token_start);
>>     return JSON_INCOMPLETE;
>> }
> appendStringInfoString() isn't respecting the end of the chunk: if
> there's extra data after the chunk boundary (as
> AppendIncrementalManifestData() does) then all of that will be stuck
> onto the end of the partial_token.
>
> I'm about to context-switch off of this for the day, but I can work on
> a patch tomorrow if that'd be helpful. It looks like this is not the
> only call to appendStringInfoString().

Yeah, the issue seems to be with chunks of json that are not null-terminated. We don't require that they be, so this code was buggy. It wasn't picked up earlier because the tests that I wrote did put a null byte at the end. Patch 5 in this series fixes those issues and adjusts most of the tests to add some trailing junk to the pieces of json, so we can be sure that this is done right.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
- v8-0001-Introduce-a-non-recursive-JSON-parser.patch
- v8-0002-Add-support-for-incrementally-parsing-backup-mani.patch
- v8-0003-Use-incremental-parsing-of-backup-manifests.patch
- v8-0004-add-logging-traces-to-see-if-we-can-find-out-what.patch
- v8-0005-fixes-for-non-null-terminated-inputs-for-incremen.patch
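The shape of the non-null-terminated-chunk fix in patch 5 is roughly as follows (a sketch rather than the exact hunk from the attachment): copy only the bytes that remain in the current chunk, instead of relying on a trailing NUL.

/*
 * Sketch only: append just the bytes belonging to this chunk to the
 * partial_token buffer.  The previous appendStringInfoString() call
 * would read past the chunk boundary when the input is not
 * NUL-terminated.
 */
if (lex->incremental && !lex->inc_state->is_last_chunk &&
    len >= lex->input_length)
{
    appendBinaryStringInfo(&lex->inc_state->partial_token,
                           lex->token_start,
                           lex->input + lex->input_length - lex->token_start);
    return JSON_INCOMPLETE;
}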
On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Patch 5 in this series fixes those issues and > adjusts most of the tests to add some trailing junk to the pieces of > json, so we can be sure that this is done right. This fixes the test failure for me, thanks! I've attached my current mesonification diff, which just adds test_json_parser to the suite. It relies on the PATH that's set up, which appears to include the build directory for both VPATH builds and Meson. Are there plans to fill out the test suite more? Since we should be able to control all the initial conditions, it'd be good to get fairly comprehensive coverage of the new code. As an aside, I find the behavior of need_escapes=false to be completely counterintuitive. I know the previous code did this, but it seems like the opposite of "provides unescaped strings" should be "provides raw strings", not "all strings are now NULL". --Jacob
Attachment
On 2024-02-22 Th 15:29, Jacob Champion wrote: > On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Patch 5 in this series fixes those issues and >> adjusts most of the tests to add some trailing junk to the pieces of >> json, so we can be sure that this is done right. > This fixes the test failure for me, thanks! I've attached my current > mesonification diff, which just adds test_json_parser to the suite. It > relies on the PATH that's set up, which appears to include the build > directory for both VPATH builds and Meson. OK, thanks, will add this in the next version. > > Are there plans to fill out the test suite more? Since we should be > able to control all the initial conditions, it'd be good to get fairly > comprehensive coverage of the new code. Well, it's tested (as we know) by the backup manifest tests. During development, I tested by making the regular parser use the non-recursive parser (see FORCE_JSON_PSTACK). That doesn't test the incremental piece of it, but it does check that the rest of it is doing the right thing. We could probably extend the incremental test by making it output a stream of tokens and making sure that was correct. > As an aside, I find the behavior of need_escapes=false to be > completely counterintuitive. I know the previous code did this, but it > seems like the opposite of "provides unescaped strings" should be > "provides raw strings", not "all strings are now NULL". Yes, we could possibly call it "need_strings" or something like that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, Feb 22, 2024 at 3:43 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Are there plans to fill out the test suite more? Since we should be
> > able to control all the initial conditions, it'd be good to get fairly
> > comprehensive coverage of the new code.
>
> Well, it's tested (as we know) by the backup manifest tests. During
> development, I tested by making the regular parser use the
> non-recursive parser (see FORCE_JSON_PSTACK). That doesn't test the
> incremental piece of it, but it does check that the rest of it is doing
> the right thing. We could probably extend the incremental test by making
> it output a stream of tokens and making sure that was correct.

That would also cover all the semantic callbacks (currently, OFIELD_END and AELEM_* are uncovered), so +1 from me. Looking at lcov, it'd be good to

- test failure modes as well as the happy path, so we know we're rejecting invalid syntax correctly
- test the prediction stack resizing code
- provide targeted coverage of the partial token support, since at the moment we're at the mercy of the manifest format and the default chunk size

As a brute force example of the latter, with the attached diff I get test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

> > As an aside, I find the behavior of need_escapes=false to be
> > completely counterintuitive. I know the previous code did this, but it
> > seems like the opposite of "provides unescaped strings" should be
> > "provides raw strings", not "all strings are now NULL".
>
> Yes, we could possibly call it "need_strings" or something like that.

+1

--Jacob
On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > As a brute force example of the latter, with the attached diff I get > test failures at chunk sizes 1, 2, 3, 4, 6, and 12. But this time with the diff. --Jacob
Attachment
On 2024-02-26 Mo 10:10, Jacob Champion wrote: > On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: >> As a brute force example of the latter, with the attached diff I get >> test failures at chunk sizes 1, 2, 3, 4, 6, and 12. > But this time with the diff. > Ouch. I'll check it out. Thanks! cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2024-02-26 Mo 19:20, Andrew Dunstan wrote: > > On 2024-02-26 Mo 10:10, Jacob Champion wrote: >> On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion >> <jacob.champion@enterprisedb.com> wrote: >>> As a brute force example of the latter, with the attached diff I get >>> test failures at chunk sizes 1, 2, 3, 4, 6, and 12. >> But this time with the diff. >> > > Ouch. I'll check it out. Thanks! > > > The good news is that the parser is doing fine - this issue was due to a thinko on my part in the test program that got triggered by the input file size being an exact multiple of the chunk size. I'll have a new patch set later this week. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Feb 26, 2024 at 9:20 PM Andrew Dunstan <andrew@dunslane.net> wrote: > The good news is that the parser is doing fine - this issue was due to a > thinko on my part in the test program that got triggered by the input > file size being an exact multiple of the chunk size. I'll have a new > patch set later this week. Ah, feof()! :D Good to know it's not the partial token logic. --Jacob
Some more observations as I make my way through the patch:

In src/common/jsonapi.c,

> +#define JSON_NUM_NONTERMINALS 6

Should this be 5 now?

> +    res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
> +                                    chunk, size, is_last);
> +
> +    expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;
> +
> +    if (res != expected)
> +        json_manifest_parse_failure(context, "parsing failed");

This leads to error messages like

pg_verifybackup: error: could not parse backup manifest: parsing failed

which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code.

Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`.

For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl:

> -    ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds");
> -    ok(!$stderr, "chunk size $size: no error output");
> +    like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds");
> +    is($stderr, "", "chunk size $size: no error output");

This is particularly helpful when a test fails spuriously due to code coverage spray on stderr.

Thanks,
--Jacob

[1] https://www.postgresql.org/message-id/CAOYmi%2BmSSY4SvOtVN7zLyUCQ4-RDkxkzmTuPEN%2Bt-PsB7GHnZA%40mail.gmail.com
On 2024-03-07 Th 10:28, Jacob Champion wrote: > Some more observations as I make my way through the patch: > > In src/common/jsonapi.c, > >> +#define JSON_NUM_NONTERMINALS 6 > Should this be 5 now? Yep. > >> + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), >> + chunk, size, is_last); >> + >> + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; >> + >> + if (res != expected) >> + json_manifest_parse_failure(context, "parsing failed"); > This leads to error messages like > > pg_verifybackup: error: could not parse backup manifest: parsing failed > > which I would imagine is going to lead to confused support requests in > the event that someone does manage to corrupt their manifest. Can we > make use of json_errdetail() and print the line and column numbers? > Patch 0001 over at [1] has one approach to making json_errdetail() > workable in frontend code. Looks sound on a first look. Maybe we should get that pushed ASAP so we can take advantage of it. > > Top-level scalars like `false` or `12345` do not parse correctly if > the chunk size is too small; instead json_errdetail() reports 'Token > "" is invalid'. With small chunk sizes, json_errdetail() additionally > segfaults on constructions like `[tru]` or `12zz`. Ugh. Will investigate. > > For my local testing, I'm carrying the following diff in > 001_test_json_parser_incremental.pl: > >> - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); >> - ok(!$stderr, "chunk size $size: no error output"); >> + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); >> + is($stderr, "", "chunk size $size: no error output"); > This is particularly helpful when a test fails spuriously due to code > coverage spray on stderr. > Makes sense, thanks. I'll have a fresh patch set soon which will also take care of the bitrot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
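As a concrete sketch of the error-reporting suggestion (assuming json_errdetail() does become available to frontend code, per the patch referenced above), the manifest-parsing call could pass the detail through rather than a fixed string:

/*
 * Sketch only: surface the parser's own error detail instead of the
 * generic "parsing failed" message.
 */
res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
                                chunk, size, is_last);

expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;

if (res != expected)
    json_manifest_parse_failure(context,
                                json_errdetail(res, &(incstate->lex)));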
On 2024-03-07 Th 22:42, Andrew Dunstan wrote:
>> Top-level scalars like `false` or `12345` do not parse correctly if
>> the chunk size is too small; instead json_errdetail() reports 'Token
>> "" is invalid'. With small chunk sizes, json_errdetail() additionally
>> segfaults on constructions like `[tru]` or `12zz`.
>
> Ugh. Will investigate.

I haven't managed to reproduce this. But I'm including some tests for it.

> I'll have a fresh patch set soon which will also take care of the bitrot.

See attached.

cheers

andrew

-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> I haven't managed to reproduce this. But I'm including some tests for it.

If I remove the newline from the end of the new tests:

> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--)
>      foreach my $test_string (@inlinetests)
>      {
>          my ($fh, $fname) = tempfile();
> -        print $fh "$test_string\n";
> +        print $fh "$test_string";
>          close($fh);
>
>          foreach my $size (1..6, 10, 20, 30)

then I can reproduce the same result as my local tests. That extra whitespace appears to help the partial token logic out somehow.

--Jacob
On 2024-03-11 Mo 09:47, Jacob Champion wrote: > On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> I haven't managed to reproduce this. But I'm including some tests for it. > If I remove the newline from the end of the new tests: > >> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--) >> foreach my $test_string (@inlinetests) >> { >> my ($fh, $fname) = tempfile(); >> - print $fh "$test_string\n"; >> + print $fh "$test_string"; >> close($fh); >> >> foreach my $size (1..6, 10, 20, 30) > then I can reproduce the same result as my local tests. That extra > whitespace appears to help the partial token logic out somehow. > Yep, here's a patch set with that fixed, and the tests adjusted. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
I've been poking at the partial token logic. The json_errdetail() bug
mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk
size) seems to be due to the disconnection between the "main" lex
instance and the dummy_lex that's created from it. The dummy_lex
contains all the information about the failed token, which is
discarded upon an error return:
> partial_result = json_lex(&dummy_lex);
> if (partial_result != JSON_SUCCESS)
> return partial_result;
In these situations, there's an additional logical error:
lex->token_start is pointing to a spot in the string after
lex->token_terminator, which breaks an invariant that will mess up
later pointer math. Nothing appears to be setting lex->token_start to
point into the partial token buffer until _after_ the partial token is
successfully lexed, which doesn't seem right -- in addition to the
pointer math problems, if a previous chunk was freed (or on a stale
stack frame), lex->token_start will still be pointing off into space.
Similarly, wherever we set token_terminator, we need to know that
token_start is pointing into the same buffer.
Determining the end of a token is now done in two separate places
between the partial- and full-lexer code paths, which is giving me a
little heartburn. I'm concerned that those could drift apart, and if
the two disagree on where to end a token, we could lose data into the
partial token buffer in a way that would be really hard to debug. Is
there a way to combine them?
Attachment
On Mon, Mar 18, 2024 at 3:32 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Not very easily. But I think and hope I've fixed the issue you've identified above about returning before lex->token_start is properly set.
>
> Attached is a new set of patches that does that and is updated for the json_errdetaiil() changes.
Thanks!
> ++ * Normally token_start would be ptok->data, but it could be later,
> ++ * see json_lex_string's handling of invalid escapes.
> + */
> -+ lex->token_start = ptok->data;
> ++ lex->token_start = dummy_lex.token_start;
> + lex->token_terminator = ptok->data + ptok->len;
By the same token (ha), the lex->token_terminator needs to be updated
from dummy_lex for some error paths. (IIUC, on success, the
token_terminator should always point to the end of the buffer. If it's
not possible to combine the two code paths, maybe it'd be good to
check that and assert/error out if we've incorrectly pulled additional
data into the partial token.)
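A sketch of that adjustment, with names taken from the quoted hunk (illustrative, not the actual patch): copy both token pointers back from the throwaway lexer, and check on success that the whole partial-token buffer was consumed.

partial_result = json_lex(&dummy_lex);

/*
 * Keep the main lexer's token pointers inside the partial_token buffer
 * so that error reporting and later pointer math stay consistent.
 */
lex->token_start = dummy_lex.token_start;
lex->token_terminator = dummy_lex.token_terminator;

if (partial_result != JSON_SUCCESS)
    return partial_result;

/* on success the whole partial token should have been consumed */
Assert(lex->token_terminator == ptok->data + ptok->len);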
With the incremental parser, I think prev_token_terminator is not
likely to be safe to use except in very specific circumstances, since
it could be pointing into a stale chunk. Some documentation around how
to use that safely in a semantic action would be good.
It looks like some of the newly added error handling paths cannot be
hit, because the production stack makes it logically impossible to get
there. (For example, if it takes a successfully lexed comma to
transition into JSON_PROD_MORE_ARRAY_ELEMENTS to begin with, then when
we pull that production's JSON_TOKEN_COMMA off the stack, we can't
somehow fail to match that same comma.) Assuming I haven't missed a
different way to get into that situation, could the "impossible" cases
have assert calls added?
I've attached two diffs. One is the group of tests I've been using
locally (called 002_inline.pl; I replaced the existing inline tests
with it), and the other is a set of potential fixes to get those tests
green.
Attachment
It also removes the frontend exits I had. In the case of stack depth, we follow the example of the RD parser and only check stack depth for backend code. In the case of the check that the lexer is set up for incremental parsing, the exit is replaced by an Assert.
On Tue, Mar 19, 2024 at 3:07 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> On Mon, Mar 18, 2024 at 3:35 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
>> With the incremental parser, I think prev_token_terminator is not
>> likely to be safe to use except in very specific circumstances, since
>> it could be pointing into a stale chunk. Some documentation around how
>> to use that safely in a semantic action would be good.
>
> Quite right. It's not safe. Should we ensure it's set to something like NULL or -1?

Nulling it out seems reasonable.

> Also, where do you think we should put a warning about it?

I was thinking in the doc comment for JsonLexContext.

> It also removes the frontend exits I had. In the case of stack depth, we follow the example of the RD parser and only check stack depth for backend code. In the case of the check that the lexer is set up for incremental parsing, the exit is replaced by an Assert. That means your test for an over-nested array doesn't work any more, so I have commented it out.

Hm, okay. We really ought to fix the recursive parser, but that's for a separate thread. (Probably OAuth.) The ideal behavior IMO would be for the caller to configure a maximum depth in the JsonLexContext.

Note that the repalloc will eventually still exit() if the pstack gets too big; is that a concern? Alternatively, could unbounded heap growth be a problem for a superuser? I guess the scalars themselves aren't capped for length...

On Wed, Mar 20, 2024 at 12:19 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On second thoughts, I think it might be better if we invent a new error return code for a lexer mode mismatch.

+1

--Jacob
This new return path...
> + if (!tok_done)
> + {
> + if (lex->inc_state->is_last_chunk)
> + return JSON_INVALID_TOKEN;
...also needs to set the token pointers. See one approach in the
attached diff, which additionally asserts that we've consumed the
entire chunk in this case, along with a handful of new tests.
Attachment
On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Thanks, included that and attended to the other issues we discussed. I think this is pretty close now.
Okay, looking over the thread, there are the following open items:
- extend the incremental test in order to exercise the semantic callbacks [1]
- add Assert calls in impossible error cases [2]
- error out if the non-incremental lex doesn't consume the entire token [2]
- double-check that out of memory is an appropriate failure mode for
the frontend [3]
Just as a general style nit:
> + if (lex->incremental)
> + {
> + lex->input = lex->token_terminator = lex->line_start = json;
> + lex->input_length = len;
> + lex->inc_state->is_last_chunk = is_last;
> + }
> + else
> + return JSON_INVALID_LEXER_TYPE;
I think flipping this around would probably make it more readable;
something like:
if (!lex->incremental)
return JSON_INVALID_LEXER_TYPE;
lex->input = ...
[1] https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com
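Spelled out, the rearranged hunk would look roughly like this (same field names as the quoted code; only the control flow changes):

/* sketch of the suggested early-return shape */
if (!lex->incremental)
    return JSON_INVALID_LEXER_TYPE;

lex->input = lex->token_terminator = lex->line_start = json;
lex->input_length = len;
lex->inc_state->is_last_chunk = is_last;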
On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Well, what's the alternative? The current parser doesn't check stack depth in frontend code. Presumably it too will eventually just run out of memory, possibly rather sooner as the stack frames could be more expensive than the incremental parser stack extensions.
Stack size should be pretty limited, at least on the platforms I'm
familiar with. So yeah, the recursive descent will segfault pretty
quickly, but it won't repalloc() an unbounded amount of heap space.
The alternative would just be to go back to a hardcoded limit in the
short term, I think.
--Jacob
On Mon, Mar 25, 2024 at 4:12 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > Stack size should be pretty limited, at least on the platforms I'm > familiar with. So yeah, the recursive descent will segfault pretty > quickly, but it won't repalloc() an unbounded amount of heap space. > The alternative would just be to go back to a hardcoded limit in the > short term, I think. And I should mention that there are other ways to consume a bunch of memory, but I think they're bounded by the size of the JSON file. Looks like the repalloc()s amplify the JSON size by a factor of ~20 (JS_MAX_PROD_LEN + sizeof(char*) + sizeof(bool)). That may or may not be enough to be concerned about in the end, since I think it's still linear, but I wanted to make sure it was known. --Jacob
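For reference, the arithmetic behind the ~20 figure (assuming JS_MAX_PROD_LEN is 10 and 8-byte pointers) is roughly:

/*
 * Worst-case parser stack cost per byte of input, per the discussion above:
 * each nesting level costs about
 *     JS_MAX_PROD_LEN + sizeof(char *) + sizeof(bool) = 10 + 8 + 1 = 19 bytes
 * while the input that opens a level can be a single byte ("[").
 */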
On Mon, Mar 25, 2024 at 4:24 PM Andrew Dunstan <andrew@dunslane.net> wrote: > OK, so we invent a new error code and have the parser return that if the stack depth gets too big? Yeah, that seems reasonable. I'd potentially be able to build on that via OAuth for next cycle, too, since that client needs to limit its memory usage. --Jacob
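As a sketch of the direction agreed here (JSON_NESTING_TOO_DEEP is the name that shows up later in the thread; the limit macro and exact placement are assumptions):

/* illustrative only, not the committed code */
if (lex->lex_level >= JSON_MAX_NESTING)     /* hypothetical limit, e.g. 6400 */
    return JSON_NESTING_TOO_DEEP;           /* new error code, reported via json_errdetail() */
inc_lex_level(lex);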
On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > - add Assert calls in impossible error cases [2] To expand on this one, I think these parts of the code (marked with `<- here`) are impossible to reach: > switch (top) > { > case JSON_TOKEN_STRING: > if (next_prediction(pstack) == JSON_TOKEN_COLON) > ctx = JSON_PARSE_STRING; > else > ctx = JSON_PARSE_VALUE; <- here > break; > case JSON_TOKEN_NUMBER: <- here > case JSON_TOKEN_TRUE: <- here > case JSON_TOKEN_FALSE: <- here > case JSON_TOKEN_NULL: <- here > case JSON_TOKEN_ARRAY_START: <- here > case JSON_TOKEN_OBJECT_START: <- here > ctx = JSON_PARSE_VALUE; > break; > case JSON_TOKEN_ARRAY_END: <- here > ctx = JSON_PARSE_ARRAY_NEXT; > break; > case JSON_TOKEN_OBJECT_END: <- here > ctx = JSON_PARSE_OBJECT_NEXT; > break; > case JSON_TOKEN_COMMA: <- here > if (next_prediction(pstack) == JSON_TOKEN_STRING) > ctx = JSON_PARSE_OBJECT_NEXT; > else > ctx = JSON_PARSE_ARRAY_NEXT; > break; Since none of these cases are non-terminals, the only way to get to this part of the code is if (top != tok). But inspecting the productions and transitions that can put these tokens on the stack, it looks like the only way for them to be at the top of the stack to begin with is if (tok == top). (Otherwise, we would have chosen a different production, or else errored out on a non-terminal.) This case is possible... > case JSON_TOKEN_STRING: > if (next_prediction(pstack) == JSON_TOKEN_COLON) > ctx = JSON_PARSE_STRING; ...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the corresponding else case is not, because if we're in the middle of a _KEY_PAIRS production, the next_prediction() _must_ be JSON_TOKEN_COLON. Do you agree, or am I missing a way to get there? Thanks, --Jacob
On 2024-03-26 Tu 17:52, Jacob Champion wrote: > On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: >> - add Assert calls in impossible error cases [2] > To expand on this one, I think these parts of the code (marked with > `<- here`) are impossible to reach: > >> switch (top) >> { >> case JSON_TOKEN_STRING: >> if (next_prediction(pstack) == JSON_TOKEN_COLON) >> ctx = JSON_PARSE_STRING; >> else >> ctx = JSON_PARSE_VALUE; <- here >> break; >> case JSON_TOKEN_NUMBER: <- here >> case JSON_TOKEN_TRUE: <- here >> case JSON_TOKEN_FALSE: <- here >> case JSON_TOKEN_NULL: <- here >> case JSON_TOKEN_ARRAY_START: <- here >> case JSON_TOKEN_OBJECT_START: <- here >> ctx = JSON_PARSE_VALUE; >> break; >> case JSON_TOKEN_ARRAY_END: <- here >> ctx = JSON_PARSE_ARRAY_NEXT; >> break; >> case JSON_TOKEN_OBJECT_END: <- here >> ctx = JSON_PARSE_OBJECT_NEXT; >> break; >> case JSON_TOKEN_COMMA: <- here >> if (next_prediction(pstack) == JSON_TOKEN_STRING) >> ctx = JSON_PARSE_OBJECT_NEXT; >> else >> ctx = JSON_PARSE_ARRAY_NEXT; >> break; > Since none of these cases are non-terminals, the only way to get to > this part of the code is if (top != tok). But inspecting the > productions and transitions that can put these tokens on the stack, it > looks like the only way for them to be at the top of the stack to > begin with is if (tok == top). (Otherwise, we would have chosen a > different production, or else errored out on a non-terminal.) > > This case is possible... > >> case JSON_TOKEN_STRING: >> if (next_prediction(pstack) == JSON_TOKEN_COLON) >> ctx = JSON_PARSE_STRING; > ...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the > corresponding else case is not, because if we're in the middle of a > _KEY_PAIRS production, the next_prediction() _must_ be > JSON_TOKEN_COLON. > > Do you agree, or am I missing a way to get there? > One good way of testing would be to add the Asserts, build with -DFORCE_JSON_PSTACK, and run the standard regression suite, which has a fairly comprehensive set of JSON errors. I'll play with that. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Mon, Mar 25, 2024 at 6:15 PM Jacob Champion <jacob.champion@enterprisedb.com> wrote:
> On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > Thanks, included that and attended to the other issues we discussed. I think this is pretty close now.
> Okay, looking over the thread, there are the following open items:
> - extend the incremental test in order to exercise the semantic callbacks [1]

Yeah, I'm on a super long plane trip later this week, so I might get it done then :-)

> - add Assert calls in impossible error cases [2]

ok, will do

> - error out if the non-incremental lex doesn't consume the entire token [2]

ok, will do

> - double-check that out of memory is an appropriate failure mode for
> the frontend [3]

Well, what's the alternative? The current parser doesn't check stack depth in frontend code. Presumably it too will eventually just run out of memory, possibly rather sooner as the stack frames could be more expensive than the incremental parser stack extensions.

> Just as a general style nit:
> > + if (lex->incremental)
> > + {
> > + lex->input = lex->token_terminator = lex->line_start = json;
> > + lex->input_length = len;
> > + lex->inc_state->is_last_chunk = is_last;
> > + }
> > + else
> > + return JSON_INVALID_LEXER_TYPE;
> I think flipping this around would probably make it more readable;
> something like:
> if (!lex->incremental)
> return JSON_INVALID_LEXER_TYPE;
> lex->input = ...

Noted. will do, Thanks.
Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to add semantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew@dunslane.net> wrote: > Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to addsemantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing. Thanks! 0004 did highlight one last bug for me -- the return value from semantic callbacks is ignored, so applications can't interrupt the parsing early: > + if (ostart != NULL) > + (*ostart) (sem->semstate); > + inc_lex_level(lex); Anything not JSON_SUCCESS should be returned immediately, I think, for this and the other calls. > + /* make sure we've used all the input */ > + if (lex->token_terminator - lex->token_start != ptok->len) > + return JSON_INVALID_TOKEN; nit: Debugging this, if anyone ever hits it, is going to be confusing. The error message is going to claim that a valid token is invalid, as opposed to saying that the parser has a bug. Short of introducing something like JSON_INTERNAL_ERROR, maybe an Assert() should be added at least? > + case JSON_NESTING_TOO_DEEP: > + return(_("Json nested too deep, maximum permitted depth is 6400")); nit: other error messages use all-caps, "JSON" > + char chunk_start[100], > + chunk_end[100]; > + > + snprintf(chunk_start, 100, "%s", ib->buf.data); > + snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99))); > + elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end); Is this NOTICE intended to be part of the final commit? > + inc_state = json_parse_manifest_incremental_init(&context); > + > + buffer = pg_malloc(chunk_size + 64); These `+ 64`s (there are two of them) seem to be left over from debugging. In v7 they were just `+ 1`. -- From this point onward, I think all of my feedback is around maintenance characteristics, which is firmly in YMMV territory, and I don't intend to hold up a v1 of the feature for it since I'm not the maintainer. :D The complexity around the checksum handling and "final chunk"-ing is unfortunate, but not a fault of this patch -- just a weird consequence of the layering violation in the format, where the checksum is inside the object being checksummed. I don't like it, but I don't think there's much to be done about it. By far the trickiest part of the implementation is the partial token logic, because of all the new ways there are to handle different types of failures. I think any changes to the incremental handling in json_lex() are going to need intense scrutiny from someone who already has the mental model loaded up. I'm going snowblind on the patch and I'm no longer the right person to review how hard it is to get up to speed with the architecture, but I'd say it's not easy. For something as security-critical as a JSON parser, I'd usually want to counteract that by sinking the observable behaviors in concrete and getting both the effective test coverage *and* the code coverage as close to 100% as we can stand. (By that, I mean that purposely introducing observable bugs into the parser should result in test failures. We're not there yet when it comes to the semantic callbacks, at least.) But I don't think the current JSON parser is held to that standard currently, so it seems strange for me to ask for that here. I think it'd be good for a v1.x of this feature to focus on simplification of the code, and hopefully consolidate and/or eliminate some of the duplicated parsing work so that the mental model isn't quite so big. Thanks! --Jacob
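For the first item, the fix at each call site presumably looks something like this (a sketch, not the committed hunk; the names come from the quoted code):

JsonParseErrorType result;

if (ostart != NULL)
{
    result = (*ostart) (sem->semstate);
    if (result != JSON_SUCCESS)
        return result;      /* let the application stop the parse early */
}
inc_lex_level(lex);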
On 2024-03-29 Fr 17:21, Jacob Champion wrote: > On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to addsemantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing. > Thanks! 0004 did highlight one last bug for me -- the return value > from semantic callbacks is ignored, so applications can't interrupt > the parsing early: > >> + if (ostart != NULL) >> + (*ostart) (sem->semstate); >> + inc_lex_level(lex); > Anything not JSON_SUCCESS should be returned immediately, I think, for > this and the other calls. Fixed > >> + /* make sure we've used all the input */ >> + if (lex->token_terminator - lex->token_start != ptok->len) >> + return JSON_INVALID_TOKEN; > nit: Debugging this, if anyone ever hits it, is going to be confusing. > The error message is going to claim that a valid token is invalid, as > opposed to saying that the parser has a bug. Short of introducing > something like JSON_INTERNAL_ERROR, maybe an Assert() should be added > at least? Done > >> + case JSON_NESTING_TOO_DEEP: >> + return(_("Json nested too deep, maximum permitted depth is 6400")); > nit: other error messages use all-caps, "JSON" Fixed > >> + char chunk_start[100], >> + chunk_end[100]; >> + >> + snprintf(chunk_start, 100, "%s", ib->buf.data); >> + snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99))); >> + elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end); > Is this NOTICE intended to be part of the final commit? No, removed. > >> + inc_state = json_parse_manifest_incremental_init(&context); >> + >> + buffer = pg_malloc(chunk_size + 64); > These `+ 64`s (there are two of them) seem to be left over from > debugging. In v7 they were just `+ 1`. Fixed I've also added the Asserts to the error handling code you suggested. This tests fine with the regression suite under FORCE_JSON_PSTACK. > > -- > > >From this point onward, I think all of my feedback is around > maintenance characteristics, which is firmly in YMMV territory, and I > don't intend to hold up a v1 of the feature for it since I'm not the > maintainer. :D > > The complexity around the checksum handling and "final chunk"-ing is > unfortunate, but not a fault of this patch -- just a weird consequence > of the layering violation in the format, where the checksum is inside > the object being checksummed. I don't like it, but I don't think > there's much to be done about it. Yeah. I agree it's ugly, but I don't think we should let it hold us up at all. Working around it doesn't involve too much extra code. > > By far the trickiest part of the implementation is the partial token > logic, because of all the new ways there are to handle different types > of failures. I think any changes to the incremental handling in > json_lex() are going to need intense scrutiny from someone who already > has the mental model loaded up. I'm going snowblind on the patch and > I'm no longer the right person to review how hard it is to get up to > speed with the architecture, but I'd say it's not easy. json_lex() is not really a very hot piece of code. I have added a comment at the top giving a brief description of the partial token handling. Even without the partial token bit it can be a bit scary, but I think the partial token piece here is not so scary that we should not proceed. 
> > For something as security-critical as a JSON parser, I'd usually want > to counteract that by sinking the observable behaviors in concrete and > getting both the effective test coverage *and* the code coverage as > close to 100% as we can stand. (By that, I mean that purposely > introducing observable bugs into the parser should result in test > failures. We're not there yet when it comes to the semantic callbacks, > at least.) But I don't think the current JSON parser is held to that > standard currently, so it seems strange for me to ask for that here. Yeah. > > I think it'd be good for a v1.x of this feature to focus on > simplification of the code, and hopefully consolidate and/or eliminate > some of the duplicated parsing work so that the mental model isn't > quite so big. > I'm not sure how you think that can be done. What parts of the processing are you having difficulty in coming to grips with? Anyway, here are new patches. I've rolled the new semantic test into the first patch. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Anyway, here are new patches. I've rolled the new semantic test into the > first patch. Looks good! I've marked RfC. > json_lex() is not really a very hot piece of code. Sure, but I figure if someone is trying to get the performance of the incremental parser to match the recursive one, so we can eventually replace it, it might get a little warmer. :) > > I think it'd be good for a v1.x of this feature to focus on > > simplification of the code, and hopefully consolidate and/or eliminate > > some of the duplicated parsing work so that the mental model isn't > > quite so big. > > I'm not sure how you think that can be done. I think we'd need to teach the lower levels of the lexer about incremental parsing too, so that we don't have two separate sources of truth about what ends a token. Bonus points if we could keep the parse state across chunks to the extent that we didn't need to restart at the beginning of the token every time. (Our current tools for this are kind of poor, like the restartable state machine in PQconnectPoll. While I'm dreaming, I'd like coroutines.) Now, whether the end result would be more or less maintainable is left as an exercise... Thanks! --Jacob
On 2024-04-02 Tu 15:38, Jacob Champion wrote: > On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> Anyway, here are new patches. I've rolled the new semantic test into the >> first patch. > Looks good! I've marked RfC. Thanks! I appreciate all the work you've done on this. I will give it one more pass and commit RSN. > >> json_lex() is not really a very hot piece of code. > Sure, but I figure if someone is trying to get the performance of the > incremental parser to match the recursive one, so we can eventually > replace it, it might get a little warmer. :) I don't think this is where the performance penalty lies. Rather, I suspect it's the stack operations in the non-recursive parser itself. The speed test doesn't involve any partial token processing at all, and yet the non-recursive parser is significantly slower in that test. >>> I think it'd be good for a v1.x of this feature to focus on >>> simplification of the code, and hopefully consolidate and/or eliminate >>> some of the duplicated parsing work so that the mental model isn't >>> quite so big. >> I'm not sure how you think that can be done. > I think we'd need to teach the lower levels of the lexer about > incremental parsing too, so that we don't have two separate sources of > truth about what ends a token. Bonus points if we could keep the parse > state across chunks to the extent that we didn't need to restart at > the beginning of the token every time. (Our current tools for this are > kind of poor, like the restartable state machine in PQconnectPoll. > While I'm dreaming, I'd like coroutines.) Now, whether the end result > would be more or less maintainable is left as an exercise... > I tried to disturb the main lexer processing as little as possible. We could possibly unify the two paths, but I have a strong suspicion that would result in a performance hit (the main part of the lexer doesn't copy any characters at all, it just keeps track of pointers into the input). And while the code might not undergo lots of change, the routine itself is quite performance critical. Anyway, I think that's all something for another day. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
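To illustrate the point about the main lexer path: for a complete token in the current buffer, the lexer only records where the token lies, e.g. (locals elided, sketch only):

/* no bytes are copied here; the token is just described by pointers */
lex->token_start = s;
lex->token_terminator = p;      /* one past the last byte of the token */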
On 2024-04-02 Tu 17:12, Andrew Dunstan wrote: > > On 2024-04-02 Tu 15:38, Jacob Champion wrote: >> On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan <andrew@dunslane.net> >> wrote: >>> Anyway, here are new patches. I've rolled the new semantic test into >>> the >>> first patch. >> Looks good! I've marked RfC. > > > Thanks! I appreciate all the work you've done on this. I will give it > one more pass and commit RSN. pushed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > pushed. My animals don't like this [1][2]: parse_manifest.c:99:3: error: redefinition of typedef 'JsonManifestParseIncrementalState' is a C11 feature [-Werror,-Wtypedef-redefinition] } JsonManifestParseIncrementalState; ^ ../../src/include/common/parse_manifest.h:23:50: note: previous definition is here typedef struct JsonManifestParseIncrementalState JsonManifestParseIncrementalState; ^ 1 error generated. (and similarly in other places) regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01
On 2024-04-04 Th 10:26, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> pushed. > My animals don't like this [1][2]: > > parse_manifest.c:99:3: error: redefinition of typedef 'JsonManifestParseIncrementalState' is a C11 feature [-Werror,-Wtypedef-redefinition] > } JsonManifestParseIncrementalState; > ^ > ../../src/include/common/parse_manifest.h:23:50: note: previous definition is here > typedef struct JsonManifestParseIncrementalState JsonManifestParseIncrementalState; > ^ > 1 error generated. > > (and similarly in other places) > > regards, tom lane > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02 > [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01 Darn it. Ok, will try to fix. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2024-04-04 Th 10:26, Tom Lane wrote: >> My animals don't like this [1][2]: > Darn it. Ok, will try to fix. I think you just need to follow the standard pattern: typedef struct foo foo; ... foo* is ok to use here ... struct foo { ... fields }; regards, tom lane
I wrote: > I think you just need to follow the standard pattern: Yeah, the attached is enough to silence it for me. (But personally I'd add comments saying that the typedef appears in thus-and-such header file; see examples in our tree.) regards, tom lane diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index 3d1bd37ac26..0bb46b43024 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -79,7 +79,7 @@ typedef enum * and the token and value for scalars that need to be preserved * across calls. */ -typedef struct JsonParserStack +struct JsonParserStack { int stack_size; char *prediction; @@ -89,18 +89,18 @@ typedef struct JsonParserStack bool *fnull; JsonTokenType scalar_tok; char *scalar_val; -} JsonParserStack; +}; /* * struct containing state used when there is a possible partial token at the * end of a json chunk when we are doing incremental parsing. */ -typedef struct JsonIncrementalState +struct JsonIncrementalState { bool is_last_chunk; bool partial_completed; StringInfoData partial_token; -} JsonIncrementalState; +}; /* * constants and macros used in the nonrecursive parser diff --git a/src/common/parse_manifest.c b/src/common/parse_manifest.c index 040c5597df4..f9c026a6369 100644 --- a/src/common/parse_manifest.c +++ b/src/common/parse_manifest.c @@ -91,12 +91,12 @@ typedef struct char *manifest_checksum; } JsonManifestParseState; -typedef struct JsonManifestParseIncrementalState +struct JsonManifestParseIncrementalState { JsonLexContext lex; JsonSemAction sem; pg_cryptohash_ctx *manifest_ctx; -} JsonManifestParseIncrementalState; +}; static JsonParseErrorType json_manifest_object_start(void *state); static JsonParseErrorType json_manifest_object_end(void *state);
On 2024-04-04 Th 11:16, Tom Lane wrote: > I wrote: >> I think you just need to follow the standard pattern: > Yeah, the attached is enough to silence it for me. Thanks, that's what I came up with too (after converting my setup to use clang) > (But personally I'd add comments saying that the typedef > appears in thus-and-such header file; see examples in > our tree.) Done cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Oh, more problems: after running check-world in a non-VPATH build, there are droppings all over my tree: $ git status On branch master Your branch is up to date with 'origin/master'. Untracked files: (use "git add <file>..." to include in what will be committed) src/interfaces/ecpg/test/sql/sqljson_jsontable src/interfaces/ecpg/test/sql/sqljson_jsontable.c src/test/modules/test_json_parser/test_json_parser_incremental src/test/modules/test_json_parser/test_json_parser_perf src/test/modules/test_json_parser/tmp_check/ nothing added to commit but untracked files present (use "git add" to track) "make clean" or "make distclean" removes some of that but not all: $ make -s distclean $ git status On branch master Your branch is up to date with 'origin/master'. Untracked files: (use "git add <file>..." to include in what will be committed) src/test/modules/test_json_parser/test_json_parser_incremental src/test/modules/test_json_parser/test_json_parser_perf nothing added to commit but untracked files present (use "git add" to track) So we're short several .gitignore entries, and apparently also shy a couple of make-clean rules. regards, tom lane
On 2024-04-04 Th 12:04, Tom Lane wrote: > Oh, more problems: after running check-world in a non-VPATH build, > there are droppings all over my tree: > > $ git status > On branch master > Your branch is up to date with 'origin/master'. > > Untracked files: > (use "git add <file>..." to include in what will be committed) > src/interfaces/ecpg/test/sql/sqljson_jsontable > src/interfaces/ecpg/test/sql/sqljson_jsontable.c > src/test/modules/test_json_parser/test_json_parser_incremental > src/test/modules/test_json_parser/test_json_parser_perf > src/test/modules/test_json_parser/tmp_check/ > > nothing added to commit but untracked files present (use "git add" to track) > > "make clean" or "make distclean" removes some of that but not all: > > $ make -s distclean > $ git status > On branch master > Your branch is up to date with 'origin/master'. > > Untracked files: > (use "git add <file>..." to include in what will be committed) > src/test/modules/test_json_parser/test_json_parser_incremental > src/test/modules/test_json_parser/test_json_parser_perf > > nothing added to commit but untracked files present (use "git add" to track) > > So we're short several .gitignore entries, and apparently also > shy a couple of make-clean rules. Argh. You get out of the habit when you're running with meson :-( There's another issue I'm running down too. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Thu, 4 Apr 2024 at 18:13, Andrew Dunstan <andrew@dunslane.net> wrote: > Argh. You get out of the habit when you're running with meson :-( I'm having some issues with meson too actually. "meson test -C build" is now failing with this for me: Command 'test_json_parser_incremental' not found in /home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin, ...
Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy: ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && | ~~~^~~~~~~~~~~ ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2020 | lex->token_start, s - lex->token_start); | ~~~^~~~~~~~~~~~~ ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Apr 4, 2024 at 10:17 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Command 'test_json_parser_incremental' not found in > /home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin, > ... I can't reproduce this locally... What's in the `...`? I wouldn't expect to find the test binary in your tmp_install. --Jacob
On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion <jacob.champion@enterprisedb.com> wrote: > What's in the `...`? I wouldn't expect to find the test binary in your > tmp_install. Oh, I wonder if this is just a build dependency thing? I typically run a bare `ninja` right before testing, because I think most of those dependencies are missing for the tests at the moment. (For example, if I manually remove the `libpq_pipeline` executable and then try to run `meson test` without a rebuild, it fails.) We can certainly fix that (see attached patch as a first draft) but I wonder if there was a reason we decided not to during the Meson port? --Jacob
Attachment
On Thu, Apr 4, 2024 at 11:06 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | ~~~^~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2020 | lex->token_start, s - lex->token_start); > | ~~~^~~~~~~~~~~~~ Ah, it hasn't figured out that the incremental path is disabled. Zero-initializing the dummy_lex would be good regardless. > ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); > | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For this case, though, I think it'd be good for API consistency if *numeric_error were always set on return, instead of relying on the caller to initialize. --Jacob
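Concretely, the two suggestions amount to something like this (a sketch of the IsValidJsonNumber() locals, not the exact patch):

JsonLexContext dummy_lex = {0};     /* zero-init quiets -Wmaybe-uninitialized */
bool        numeric_error = false;  /* defensive, in case the lexer bails early */

plus having json_lex_number() set its error out-parameter unconditionally before returning.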
On 2024-04-04 Th 14:06, Nathan Bossart wrote: > Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy: > > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2016 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | ~~~^~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2020 | lex->token_start, s - lex->token_start); > | ~~~^~~~~~~~~~~~~ > ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 302 | return (!numeric_error) && (total_len == dummy_lex.input_length); > | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Please pile on. I want to get this fixed. It seems odd that my much later gcc didn't complain. Does the attached patch fix it for you? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: > Does the attached patch fix it for you? It clears up 2 of the 3 warnings for me: ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && | -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, 4 Apr 2024 at 20:48, Jacob Champion <jacob.champion@enterprisedb.com> wrote: > > On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion > <jacob.champion@enterprisedb.com> wrote: > > What's in the `...`? I wouldn't expect to find the test binary in your > > tmp_install. > > Oh, I wonder if this is just a build dependency thing? I typically run > a bare `ninja` right before testing, because I think most of those > dependencies are missing for the tests at the moment. (For example, if > I manually remove the `libpq_pipeline` executable and then try to run > `meson test` without a rebuild, it fails.) > > We can certainly fix that (see attached patch as a first draft) but I > wonder if there was a reason we decided not to during the Meson port? Yeah, that patch fixed my problem. (as well as running plain ninja) To clarify: I did do a rebuild. But it turns out that install-quiet doesn't build everything... The full command I was having problems with was: ninja -C build install-quiet && meson test -C build Maybe that's something worth addressing too. I expected that install/install-quiet was a strict superset of a plain ninja invocation.
On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Maybe that's something worth addressing too. I expected that > install/install-quiet was a strict superset of a plain ninja > invocation. Maybe that's the intent, but I hope not, because I don't see any reason for `ninja install` to worry about test-only binaries that won't be installed. For the tests, there's a pretty clear rationale for the dependency. --Jacob
On Thu, 4 Apr 2024 at 23:34, Jacob Champion <jacob.champion@enterprisedb.com> wrote: > > On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > Maybe that's something worth addressing too. I expected that > > install/install-quiet was a strict superset of a plain ninja > > invocation. > > Maybe that's the intent, but I hope not, because I don't see any > reason for `ninja install` to worry about test-only binaries that > won't be installed. For the tests, there's a pretty clear rationale > for the dependency. Fair enough, I guess I'll change my invocation to include the ninja "test" target too: ninja -C build test install-quiet && meson test -C build
On Thu, 4 Apr 2024 at 23:52, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > Fair enough, I guess I'll change my invocation to include the ninja > "test" target too: > ninja -C build test install-quiet && meson test -C build Actually that doesn't do what I want either because that actually runs the tests too... And I prefer to do that using meson directly, so I can add some filters to it. I'll figure something out when I'm more fresh tomorrow.
On 2024-04-04 Th 15:54, Nathan Bossart wrote: > On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >> Does the attached patch fix it for you? > It clears up 2 of the 3 warnings for me: > > ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: > ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && > | > Thanks, please try this instead. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote: > On 2024-04-04 Th 15:54, Nathan Bossart wrote: >> On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >> > Does the attached patch fix it for you? >> It clears up 2 of the 3 warnings for me: >> >> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: >> ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] >> 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && >> | > > Thanks, please try this instead. LGTM, thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On 2024-04-05 Fr 11:43, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote: >> On 2024-04-04 Th 15:54, Nathan Bossart wrote: >>> On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote: >>>> Does the attached patch fix it for you? >>> It clears up 2 of the 3 warnings for me: >>> >>> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’: >>> ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be used uninitialized in this function [-Werror=maybe-uninitialized] >>> 2018 | if (lex->incremental && !lex->inc_state->is_last_chunk && >>> | >> Thanks, please try this instead. > LGTM, thanks! > Thanks for checking. Pushed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Coverity complained that this patch leaks memory: /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: 212 in load_backup_manifest() 206 bytes_left -= rc; 207 json_parse_manifest_incremental_chunk( 208 inc_state, buffer, rc, bytes_left == 0); 209 } 210 211 close(fd); >>> CID 1596259: (RESOURCE_LEAK) >>> Variable "inc_state" going out of scope leaks the storage it points to. 212 } 213 214 /* All done. */ 215 pfree(buffer); 216 return result; 217 } /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 488 in parse_manifest_file() 482 bytes_left -= rc; 483 json_parse_manifest_incremental_chunk( 484 inc_state, buffer, rc, bytes_left == 0); 485 } 486 487 close(fd); >>> CID 1596257: (RESOURCE_LEAK) >>> Variable "inc_state" going out of scope leaks the storage it points to. 488 } 489 490 /* Done with the buffer. */ 491 pfree(buffer); 492 493 return result; It's right about that AFAICS, and not only is the "inc_state" itself leaked but so is its assorted infrastructure. Perhaps we don't care too much about that in the existing applications, but ISTM that isn't going to be a tenable assumption across the board. Shouldn't there be a "json_parse_manifest_incremental_shutdown()" or the like to deallocate all the storage allocated by the parser? regards, tom lane
On 2024-04-07 Su 20:58, Tom Lane wrote: > Coverity complained that this patch leaks memory: > > /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: 212 in load_backup_manifest() > 206 bytes_left -= rc; > 207 json_parse_manifest_incremental_chunk( > 208 inc_state, buffer, rc, bytes_left == 0); > 209 } > 210 > 211 close(fd); >>>> CID 1596259: (RESOURCE_LEAK) >>>> Variable "inc_state" going out of scope leaks the storage it points to. > 212 } > 213 > 214 /* All done. */ > 215 pfree(buffer); > 216 return result; > 217 } > > /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 488 in parse_manifest_file() > 482 bytes_left -= rc; > 483 json_parse_manifest_incremental_chunk( > 484 inc_state, buffer, rc, bytes_left == 0); > 485 } > 486 > 487 close(fd); >>>> CID 1596257: (RESOURCE_LEAK) >>>> Variable "inc_state" going out of scope leaks the storage it points to. > 488 } > 489 > 490 /* Done with the buffer. */ > 491 pfree(buffer); > 492 > 493 return result; > > It's right about that AFAICS, and not only is the "inc_state" itself > leaked but so is its assorted infrastructure. Perhaps we don't care > too much about that in the existing applications, but ISTM that > isn't going to be a tenable assumption across the board. Shouldn't > there be a "json_parse_manifest_incremental_shutdown()" or the like > to deallocate all the storage allocated by the parser? yeah, probably. Will work on it. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
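Presumably something along these lines, mirroring the _init function (a sketch based on the struct definition shown upthread; the real cleanup may need more or less than this):

void
json_parse_manifest_incremental_shutdown(JsonManifestParseIncrementalState *incstate)
{
    pg_cryptohash_free(incstate->manifest_ctx);
    freeJsonLexContext(&incstate->lex);
    pfree(incstate);
}

with the callers invoking it once the last chunk has been fed in, e.g. after the close(fd) in the hunks Coverity flagged.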
Michael pointed out over at [1] that the new tiny.json is pretty inscrutable given its size, and I have to agree. Attached is a patch to pare it down 98% or so. I think people wanting to run the performance comparisons will need to come up with their own gigantic files. Michael, with your "Jacob might be a nefarious cabal of state-sponsored hackers" hat on, is this observable enough, or do we need to get it smaller? I was thinking we may want to replace the URLs with stuff that doesn't link randomly around the Internet. Delicious in its original form is long gone. Thanks, --Jacob [1] https://www.postgresql.org/message-id/ZhCTtpGoXbERacoN@paquier.xyz
Attachment
On 2024-04-08 Mo 14:24, Jacob Champion wrote: > Michael pointed out over at [1] that the new tiny.json is pretty > inscrutable given its size, and I have to agree. Attached is a patch > to pare it down 98% or so. I think people wanting to run the > performance comparisons will need to come up with their own gigantic > files. Let's see if we can do a bit better than that. Maybe a script to construct a larger input for the speed test from the smaller file. Should be pretty simple. > > Michael, with your "Jacob might be a nefarious cabal of > state-sponsored hackers" hat on, is this observable enough, or do we > need to get it smaller? I was thinking we may want to replace the URLs > with stuff that doesn't link randomly around the Internet. Delicious > in its original form is long gone. > Arguably the fact that it points nowhere is a good thing. But feel free to replace it with something else. It doesn't have to be URLs at all. That happened simply because it was easy to extract from a very large piece of JSON I had lying around, probably from the last time I wrote a JSON parser :-) cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
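For example, a throwaway generator along those lines could just wrap N copies of the small file in one big array (illustrative only; error handling is omitted and the output shape is an assumption):

/* gen_big_json.c - sketch: repeat a small JSON file inside one large array */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    FILE       *in = fopen(argv[1], "rb");
    long        len;
    char       *buf;
    int         reps = (argc > 2) ? atoi(argv[2]) : 10000;

    fseek(in, 0, SEEK_END);
    len = ftell(in);
    rewind(in);
    buf = malloc(len + 1);
    if (fread(buf, 1, len, in) != (size_t) len)
        return 1;
    buf[len] = '\0';
    fclose(in);

    /* emit one large array whose elements are copies of the input */
    putchar('[');
    for (int i = 0; i < reps; i++)
    {
        if (i > 0)
            putchar(',');
        fputs(buf, stdout);
    }
    puts("]");
    free(buf);
    return 0;
}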
On 2024-04-08 Mo 09:29, Andrew Dunstan wrote: > > On 2024-04-07 Su 20:58, Tom Lane wrote: >> Coverity complained that this patch leaks memory: >> >> /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: >> 212 in load_backup_manifest() >> 206 bytes_left -= rc; >> 207 json_parse_manifest_incremental_chunk( >> 208 inc_state, buffer, rc, bytes_left == 0); >> 209 } >> 210 >> 211 close(fd); >>>>> CID 1596259: (RESOURCE_LEAK) >>>>> Variable "inc_state" going out of scope leaks the storage it >>>>> points to. >> 212 } >> 213 >> 214 /* All done. */ >> 215 pfree(buffer); >> 216 return result; >> 217 } >> >> /srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: >> 488 in parse_manifest_file() >> 482 bytes_left -= rc; >> 483 json_parse_manifest_incremental_chunk( >> 484 inc_state, buffer, rc, bytes_left == 0); >> 485 } >> 486 >> 487 close(fd); >>>>> CID 1596257: (RESOURCE_LEAK) >>>>> Variable "inc_state" going out of scope leaks the storage it >>>>> points to. >> 488 } >> 489 >> 490 /* Done with the buffer. */ >> 491 pfree(buffer); >> 492 >> 493 return result; >> >> It's right about that AFAICS, and not only is the "inc_state" itself >> leaked but so is its assorted infrastructure. Perhaps we don't care >> too much about that in the existing applications, but ISTM that >> isn't going to be a tenable assumption across the board. Shouldn't >> there be a "json_parse_manifest_incremental_shutdown()" or the like >> to deallocate all the storage allocated by the parser? > > > > yeah, probably. Will work on it. > > > Here's a patch. In addition to the leaks Coverity found, there was another site in the backend code that should call the shutdown function, and a probably memory leak from a logic bug in the incremental json parser code. All these are fixed. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Mon, Apr 08, 2024 at 05:42:35PM -0400, Andrew Dunstan wrote: > Arguably the fact that it points nowhere is a good thing. But feel free to > replace it with something else. It doesn't have to be URLs at all. That > happened simply because it was easy to extract from a very large piece of > JSON I had lying around, probably from the last time I wrote a JSON parser > :-) : "http://www.theatermania.com/broadway/", As as matter of fact, this one points to something that seems to be real. Most of the blobs had better be trimmed down. Looking at the code paths in this module, it looks like this could be even simpler than what Jacob is shaving, checking for: - Objects start and end, including booleans, ints and strings for the quotations. - Nested objects, including arrays made of simple JSON objects. Some escape patterns are already checked, but most of them are missing: https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html There is no direct check on test_json_parser_perf.c, either, only a custom rule in the Makefile without specifying something for meson. So it looks like you could do short execution check in a TAP test, at least. -- Michael
Attachment
On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: > There is no direct check on test_json_parser_perf.c, either, only a > custom rule in the Makefile without specifying something for meson. > So it looks like you could do short execution check in a TAP test, at > least. While reading the code, I've noticed a few things, as well: + /* max delicious line length is less than this */ + char buff[6001]; Delicious applies to the contents, nothing to do with this code even if I'd like to think that these lines of code are edible and good. Using a hardcoded limit for some JSON input sounds like a bad idea to me even if this is a test module. Comment that applies to both the perf and the incremental tools. You could use a #define'd buffer size for readability rather than assuming this value in many places. +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c + * This progam tests incremental parsing of json. The input is fed into + * full range of incement handling, especially in the lexer, is exercised. +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c + * Performancet est program for both flavors of the JSON parser + * This progam tests either the standard (recursive descent) JSON parser +++ b/src/test/modules/test_json_parser/README + reads in a file and pases it in very small chunks (60 bytes at a time) to Collection of typos across various files. + appendStringInfoString(&json, "1+23 trailing junk"); What's the purpose here? Perhaps the intention should be documented in a comment? At the end, having a way to generate JSON blobs randomly to test this stuff would be more appealing than what you have currently, with perhaps a few factors like: - Array and simple object density. - Max Level of nesting. - Limitation to ASCII characters that can be escaped. - Perhaps more things I cannot think about? So the current state of things is kind of disappointing, and the size of the data set added to the tree is not that portable either if you want to test various scenarios depending on the data set. It seems to me that this has been committed too hastily and that this is not ready for integration, even if that's just a test module. Tom also has shared some concerns upthread, as far as I can see. -- Michael
Attachment
On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:
There is no direct check on test_json_parser_perf.c, either, only a custom rule in the Makefile without specifying something for meson. So it looks like you could do short execution check in a TAP test, at least.
Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary.
While reading the code, I've noticed a few things, as well: + /* max delicious line length is less than this */ + char buff[6001]; Delicious applies to the contents, nothing to do with this code even if I'd like to think that these lines of code are edible and good. Using a hardcoded limit for some JSON input sounds like a bad idea to me even if this is a test module. Comment that applies to both the perf and the incremental tools. You could use a #define'd buffer size for readability rather than assuming this value in many places.
The comment is a remnant of when I hadn't yet added support for incomplete tokens, and so I had to parse the input line by line. I agree it can go, and we can use a manifest constant for the buffer size.
+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c + * This progam tests incremental parsing of json. The input is fed into + * full range of incement handling, especially in the lexer, is exercised. +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c + * Performancet est program for both flavors of the JSON parser + * This progam tests either the standard (recursive descent) JSON parser +++ b/src/test/modules/test_json_parser/README + reads in a file and pases it in very small chunks (60 bytes at a time) to Collection of typos across various files.
Will fix. (The older I get the more typos I seem to make and the harder it is to notice them. It's very annoying.)
+ appendStringInfoString(&json, "1+23 trailing junk"); What's the purpose here? Perhaps the intention should be documented in a comment?
The purpose is to ensure that if there is not a trailing '\0' on the json chunk the parser will still do the right thing. I'll add a comment to that effect.
At the end, having a way to generate JSON blobs randomly to test this stuff would be more appealing than what you have currently, with perhaps a few factors like: - Array and simple object density. - Max Level of nesting. - Limitation to ASCII characters that can be escaped. - Perhaps more things I cannot think about?
No, I disagree. Maybe we need those things as well, but we do need a static test where we can test the output against known results. I have no objection to changing the input and output files.
It's worth noting that t/002_inline.pl does generate some input and test e.g., the maximum nesting levels among other errors. Perhaps you missed that. If you think we need more tests there adding them would be extremely simple.
So the current state of things is kind of disappointing, and the size of the data set added to the tree is not that portable either if you want to test various scenarios depending on the data set. It seems to me that this has been committed too hastily and that this is not ready for integration, even if that's just a test module. Tom also has shared some concerns upthread, as far as I can see.
I have posted a patch already that addresses the issue Tom raised.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 2024-04-09 Tu 01:23, Michael Paquier wrote: > There is no direct check on test_json_parser_perf.c, either, only a > custom rule in the Makefile without specifying something for meson. > So it looks like you could do short execution check in a TAP test, at > least. > > Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary. Okay, but for what purpose? My understanding during review was that this was a convenience utility for people who were actively hacking on the code (and I used it for exactly that purpose a few months back, so I didn't question that any further). Why does the farm need to spend any time running it at all? --Jacob
On 2024-04-09 Tu 09:45, Jacob Champion wrote: > On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> On 2024-04-09 Tu 01:23, Michael Paquier wrote: >> There is no direct check on test_json_parser_perf.c, either, only a >> custom rule in the Makefile without specifying something for meson. >> So it looks like you could do short execution check in a TAP test, at >> least. >> >> Not adding a test for that was deliberate - any sane test takes a while, and I didn't want to spend that much time on it every time someone runs "make check-world" or equivalent. However, adding a test to run it with a trivial number of iterations seems reasonable, so I'll add that. I'll also add a meson target for the binary. > Okay, but for what purpose? My understanding during review was that > this was a convenience utility for people who were actively hacking on > the code (and I used it for exactly that purpose a few months back, so > I didn't question that any further). Why does the farm need to spend > any time running it at all? > I think Michael's point was that if we carry the code we should test we can run it. The other possibility would be just to remove it. I can see arguments for both. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan <andrew@dunslane.net> wrote: > I think Michael's point was that if we carry the code we should test we > can run it. The other possibility would be just to remove it. I can see > arguments for both. Hm. If it's not acceptable to carry this (as a worse-is-better smoke test) without also running it during tests, then my personal vote would be to tear it out and just have people write/contribute targeted benchmarks when they end up working on performance. I don't think the cost/benefit makes sense at that point. --Jacob
On Mon, Apr 8, 2024 at 10:24 PM Michael Paquier <michael@paquier.xyz> wrote: > At the end, having a way to generate JSON blobs randomly to test this > stuff would be more appealing For the record, I'm working on an LLVM fuzzer target for the JSON parser. I think that would be a lot more useful than anything we can hand-code. But I want it to cover both the recursive and incremental code paths, and we'd need to talk about where it would live. libfuzzer is seeded with a bunch of huge incomprehensible blobs, which is something we're now trying to avoid checking in. There's also the security aspect of "what do we do when it finds something", and at that point maybe we need to look into a service like oss-fuzz. Thanks, --Jacob
On Mon, Apr 8, 2024 at 5:48 PM Michael Paquier <michael@paquier.xyz> wrote: > : "http://www.theatermania.com/broadway/", > > As as matter of fact, this one points to something that seems to be > real. Most of the blobs had better be trimmed down. This new version keeps all the existing structure but points the hostnames to example.com [1]. --Jacob [1] https://datatracker.ietf.org/doc/html/rfc2606#section-3
Attachment
On 2024-04-09 Tu 07:54, Andrew Dunstan wrote: > > > On 2024-04-09 Tu 01:23, Michael Paquier wrote: >> On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: >>> There is no direct check on test_json_parser_perf.c, either, only a >>> custom rule in the Makefile without specifying something for meson. >>> So it looks like you could do short execution check in a TAP test, at >>> least. > > > Not adding a test for that was deliberate - any sane test takes a > while, and I didn't want to spend that much time on it every time > someone runs "make check-world" or equivalent. However, adding a test > to run it with a trivial number of iterations seems reasonable, so > I'll add that. I'll also add a meson target for the binary. > > >> While reading the code, I've noticed a few things, as well: >> >> + /* max delicious line length is less than this */ >> + char buff[6001]; >> >> Delicious applies to the contents, nothing to do with this code even >> if I'd like to think that these lines of code are edible and good. >> Using a hardcoded limit for some JSON input sounds like a bad idea to >> me even if this is a test module. Comment that applies to both the >> perf and the incremental tools. You could use a #define'd buffer size >> for readability rather than assuming this value in many places. > > > The comment is a remnant of when I hadn't yet added support for > incomplete tokens, and so I had to parse the input line by line. I > agree it can go, and we can use a manifest constant for the buffer size. > > >> +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c >> + * This progam tests incremental parsing of json. The input is fed into >> + * full range of incement handling, especially in the lexer, is exercised. >> +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c >> + * Performancet est program for both flavors of the JSON parser >> + * This progam tests either the standard (recursive descent) JSON parser >> +++ b/src/test/modules/test_json_parser/README >> + reads in a file and pases it in very small chunks (60 bytes at a time) to >> >> Collection of typos across various files. > > > Will fix. (The older I get the more typos I seem to make and the > harder it is to notice them. It's very annoying.) > > >> + appendStringInfoString(&json, "1+23 trailing junk"); >> What's the purpose here? Perhaps the intention should be documented >> in a comment? > > > The purpose is to ensure that if there is not a trailing '\0' on the > json chunk the parser will still do the right thing. I'll add a > comment to that effect. > > >> At the end, having a way to generate JSON blobs randomly to test this >> stuff would be more appealing than what you have currently, with >> perhaps a few factors like: >> - Array and simple object density. >> - Max Level of nesting. >> - Limitation to ASCII characters that can be escaped. >> - Perhaps more things I cannot think about? > > > No, I disagree. Maybe we need those things as well, but we do need a > static test where we can test the output against known results. I have > no objection to changing the input and output files. > > It's worth noting that t/002_inline.pl does generate some input and > test e.g., the maximum nesting levels among other errors. Perhaps you > missed that. If you think we need more tests there adding them would > be extremely simple. 
> > >> So the current state of things is kind of disappointing, and the size >> of the data set added to the tree is not that portable either if you >> want to test various scenarios depending on the data set. It seems to >> me that this has been committed too hastily and that this is not ready >> for integration, even if that's just a test module. Tom also has >> shared some concerns upthread, as far as I can see. > > > I have posted a patch already that addresses the issue Tom raised. > > Here's a consolidated set of cleanup patches, including the memory leak patch and Jacob's shrink-tiny patch. I think the biggest open issue is whether or not we remove the performance test program. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
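As an aside on the buffer-size point above, here is a minimal sketch of what a #define'd chunk buffer could look like in the test reader. The constant name, the helper name, and the surrounding loop are illustrative assumptions, not code from the committed module:

    #include <stdio.h>

    /* illustrative replacement for the hardcoded 6001-byte limit */
    #define TEST_JSON_CHUNK_SIZE 6001

    /*
     * Read the input in fixed-size chunks rather than line by line, so the
     * test makes no assumption about maximum line length.  The call handing
     * each chunk to the incremental parser is elided; only the buffering
     * pattern is shown.
     */
    static void
    feed_chunks(FILE *json_file)
    {
        char    buff[TEST_JSON_CHUNK_SIZE];
        size_t  n;

        while ((n = fread(buff, 1, sizeof(buff), json_file)) > 0)
        {
            /* pass buff (n bytes, not necessarily NUL-terminated) to the parser */
        }
    }

Feeding chunks by pointer and length like this is also what makes the "trailing junk" test above meaningful: the parser must not rely on a terminating '\0' in each chunk.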
On 2024-04-09 Tu 15:42, Andrew Dunstan wrote: > > On 2024-04-09 Tu 07:54, Andrew Dunstan wrote: >> >> >> On 2024-04-09 Tu 01:23, Michael Paquier wrote: >>> On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote: >>>> There is no direct check on test_json_parser_perf.c, either, only a >>>> custom rule in the Makefile without specifying something for meson. >>>> So it looks like you could do short execution check in a TAP test, at >>>> least. >> >> >> Not adding a test for that was deliberate - any sane test takes a >> while, and I didn't want to spend that much time on it every time >> someone runs "make check-world" or equivalent. However, adding a test >> to run it with a trivial number of iterations seems reasonable, so >> I'll add that. I'll also add a meson target for the binary. >> >> >>> While reading the code, I've noticed a few things, as well: >>> >>> + /* max delicious line length is less than this */ >>> + char buff[6001]; >>> >>> Delicious applies to the contents, nothing to do with this code even >>> if I'd like to think that these lines of code are edible and good. >>> Using a hardcoded limit for some JSON input sounds like a bad idea to >>> me even if this is a test module. Comment that applies to both the >>> perf and the incremental tools. You could use a #define'd buffer size >>> for readability rather than assuming this value in many places. >> >> >> The comment is a remnant of when I hadn't yet added support for >> incomplete tokens, and so I had to parse the input line by line. I >> agree it can go, and we can use a manifest constant for the buffer size. >> >> >>> +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c >>> + * This progam tests incremental parsing of json. The input is fed >>> into >>> + * full range of incement handling, especially in the lexer, is >>> exercised. >>> +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c >>> + * Performancet est program for both flavors of the JSON parser >>> + * This progam tests either the standard (recursive descent) JSON >>> parser >>> +++ b/src/test/modules/test_json_parser/README >>> + reads in a file and pases it in very small chunks (60 bytes at a >>> time) to >>> >>> Collection of typos across various files. >> >> >> Will fix. (The older I get the more typos I seem to make and the >> harder it is to notice them. It's very annoying.) >> >> >>> + appendStringInfoString(&json, "1+23 trailing junk"); >>> What's the purpose here? Perhaps the intention should be documented >>> in a comment? >> >> >> The purpose is to ensure that if there is not a trailing '\0' on the >> json chunk the parser will still do the right thing. I'll add a >> comment to that effect. >> >> >>> At the end, having a way to generate JSON blobs randomly to test this >>> stuff would be more appealing than what you have currently, with >>> perhaps a few factors like: >>> - Array and simple object density. >>> - Max Level of nesting. >>> - Limitation to ASCII characters that can be escaped. >>> - Perhaps more things I cannot think about? >> >> >> No, I disagree. Maybe we need those things as well, but we do need a >> static test where we can test the output against known results. I >> have no objection to changing the input and output files. >> >> It's worth noting that t/002_inline.pl does generate some input and >> test e.g., the maximum nesting levels among other errors. Perhaps you >> missed that. If you think we need more tests there adding them would >> be extremely simple. 
>> >> >>> So the current state of things is kind of disappointing, and the size >>> of the data set added to the tree is not that portable either if you >>> want to test various scenarios depending on the data set. It seems to >>> me that this has been committed too hastily and that this is not ready >>> for integration, even if that's just a test module. Tom also has >>> shared some concerns upthread, as far as I can see. >> >> >> I have posted a patch already that addresses the issue Tom raised. >> >> > > > Here's a consolidated set of cleanup patches, including the memory > leak patch and Jacob's shrink-tiny patch. Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed out to me by Alexander Lakhin. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Attachment
On Tue, Apr 09, 2024 at 09:26:49AM -0700, Jacob Champion wrote: > On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> I think Michael's point was that if we carry the code we should test we >> can run it. The other possibility would be just to remove it. I can see >> arguments for both. > > Hm. If it's not acceptable to carry this (as a worse-is-better smoke > test) without also running it during tests, then my personal vote > would be to tear it out and just have people write/contribute targeted > benchmarks when they end up working on performance. I don't think the > cost/benefit makes sense at that point. And you may catch a couple of bugs while at it. In my experience, things with custom makefile and/or meson rules tend to rot easily because everybody forgets about them. There are a few of them in the tree that could be ripped out, as well. -- Michael
Attachment
On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote: > Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed > out to me by Alexander Lakhin. I can see that this has been applied as of 42fa4b660143 with some extra commits. Anyway, I have noticed another thing in the surroundings that's annoying. 003 has this logic: use File::Temp qw(tempfile); [...] my ($fh, $fname) = tempfile(); print $fh $stdout,"\n"; close($fh); This creates a temporary file in /tmp/ that remains around, slowly bloating the temporary file space on a node while leaving around some data. Why use File::Temp::tempfile here? Couldn't you just use a file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up once the test finishes? Per [1], escape_json() has no coverage outside its default path. Is that intended? Documenting all these test files with a few comments would be welcome, as well, with some copyright notices... json_file = fopen(testfile, "r"); fstat(fileno(json_file), &statbuf); bytes_left = statbuf.st_size; No checks on failure of fstat() here? json_file = fopen(argv[2], "r"); Second one in test_json_parser_perf.c, with more stuff for fread(). [1]: https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html -- Michael
Attachment
> On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote: >> Here's v2 of the cleanup patch 4, that fixes some more typos kindly pointed out to me by Alexander Lakhin. > I can see that this has been applied as of 42fa4b660143 with some extra commits. Anyway, I have noticed another thing in the surroundings that's annoying. 003 has this logic: use File::Temp qw(tempfile); [...] my ($fh, $fname) = tempfile(); print $fh $stdout,"\n"; close($fh); This creates a temporary file in /tmp/ that remains around, slowly bloating the temporary file space on a node while leaving around some data.
My bad, I should have used the UNLINK option like in the other tests.
> Why use File::Temp::tempfile here? Couldn't you just use a file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up once the test finishes?
That's another possibility, but I think the above is the simplest.
> Per [1], escape_json() has no coverage outside its default path. Is that intended?
Not particularly. I'll add some stuff to get complete coverage.
> Documenting all these test files with a few comments would be welcome, as well, with some copyright notices...
ok
> json_file = fopen(testfile, "r"); fstat(fileno(json_file), &statbuf); bytes_left = statbuf.st_size; No checks on failure of fstat() here?
ok will fix
> json_file = fopen(argv[2], "r"); Second one in test_json_parser_perf.c, with more stuff for fread().
ok will fix
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
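A sketch of the fstat()/fopen() checks promised above, using plain libc error reporting to stay self-contained; the helper name and the exit-on-error behaviour are assumptions, and the in-tree programs may well use the frontend logging helpers instead:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    /*
     * Open the test file and determine its size, checking both calls
     * instead of assuming they succeed.
     */
    static FILE *
    open_json_input(const char *testfile, long *bytes_left)
    {
        FILE       *json_file;
        struct stat statbuf;

        json_file = fopen(testfile, "r");
        if (json_file == NULL)
        {
            fprintf(stderr, "could not open \"%s\": %s\n",
                    testfile, strerror(errno));
            exit(1);
        }
        if (fstat(fileno(json_file), &statbuf) != 0)
        {
            fprintf(stderr, "could not stat \"%s\": %s\n",
                    testfile, strerror(errno));
            exit(1);
        }
        *bytes_left = (long) statbuf.st_size;
        return json_file;
    }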
On Thu, Apr 18, 2024 at 06:03:45AM -0400, Andrew Dunstan wrote: > On 2024-04-18 Th 02:04, Michael Paquier wrote: >> Why use File::Temp::tempfile here? Couldn't you just use a >> file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up >> once the test finishes? > > That's another possibility, but I think the above is the simplest. Are you sure that relying on File::Temp is a good thing overall? The current temporary file knowledge is encapsulated within Utils.pm, with files removed or kept depending on PG_TEST_NOCLEAN. So it would be just more consistent to rely on the existing facilities instead? test_json_parser is the only code path in the whole tree that directly uses File::Temp. The rest of the TAP tests rely on Utils.pm for temp file paths. -- Michael
Attachment
On Fri, Apr 19, 2024 at 1:35 AM Michael Paquier <michael@paquier.xyz> wrote: > Are you sure that relying on Temp::File is a good thing overall? The > current temporary file knowledge is encapsulated within Utils.pm, with > files removed or kept depending on PG_TEST_NOCLEAN. So it would be > just more consistent to rely on the existing facilities instead? > test_json_parser is the only code path in the whole tree that directly > uses File::Temp. The rest of the TAP tests relies on Utils.pm for > temp file paths. Yeah, I think this patch invented a new solution to a problem that we've solved in a different way everywhere else. I think we should change it to match what we do in general. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote: > Yeah, I think this patch invented a new solution to a problem that > we've solved in a different way everywhere else. I think we should > change it to match what we do in general. As of ba3e6e2bca97, did you notice that test_json_parser_perf generates two core files because progname is not set, failing an assertion when using the frontend logging? -- Michael
Attachment
On 2024-04-24 We 04:56, Michael Paquier wrote: > On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote: >> Yeah, I think this patch invented a new solution to a problem that >> we've solved in a different way everywhere else. I think we should >> change it to match what we do in general. > As of ba3e6e2bca97, did you notice that test_json_parser_perf > generates two core files because progname is not set, failing an > assertion when using the frontend logging? No, it didn't for me. Thanks for noticing, I've pushed a fix. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
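For context, the usual frontend-program setup that keeps the logging machinery happy looks roughly like the sketch below. This shows the common pattern (pg_logging_init() records progname before any pg_log_*()/pg_fatal() call), not the exact committed fix:

    #include "postgres_fe.h"

    #include "common/logging.h"

    int
    main(int argc, char **argv)
    {
        /*
         * Set up frontend logging first; this records progname, whose
         * absence is what tripped the assertion mentioned above.
         */
        pg_logging_init(argv[0]);

        /* ... argument parsing and the actual test work go here ... */

        return 0;
    }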