Thread: Why is writing JSONB faster than just JSON?
Hi!

I have a project where, among other data, we want to store static JSON objects that can get pretty large (10-100 KB). I was evaluating how it would work if we simply stored them as an additional column in a PostgreSQL database, so I made a benchmark [1]. The results surprised me a bit and I am writing here because I would like to understand them. Namely, it looks like writing into a jsonb typed column is 30% faster than writing into a json typed column. Why is that? Doesn't jsonb require parsing and conversion of the JSON? That should be slower than just storing the blob as-is?

[1] https://gitlab.com/mitar/benchmark-pg-json

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
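P.S. The benchmark boils down to roughly the following (a sketch with illustrative names, not the exact code from the repository):

    CREATE TABLE docs_json  (id serial PRIMARY KEY, doc json);
    CREATE TABLE docs_jsonb (id serial PRIMARY KEY, doc jsonb);
    CREATE TABLE docs_text  (id serial PRIMARY KEY, doc text);

    -- the app inserts a large compact-formatted payload 10k times into
    -- each table and measures the total time per column type
    INSERT INTO docs_jsonb (doc)
    VALUES ('{"key1": "9b74c9897bac770ffc029102a200c5de"}');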
> On Tue, Apr 13, 2021 at 11:38:04PM -0700, Mitar wrote:
> I have a project where, among other data, we want to store static JSON
> objects that can get pretty large (10-100 KB). I was evaluating how it
> would work if we simply stored them as an additional column in a
> PostgreSQL database, so I made a benchmark [1]. ... Namely, it looks
> like writing into a jsonb typed column is 30% faster than writing into
> a json typed column. Why is that? Doesn't jsonb require parsing and
> conversion of the JSON? That should be slower than just storing the
> blob as-is?
>
> [1] https://gitlab.com/mitar/benchmark-pg-json

Interesting. I've tried to reproduce the schema and queries from the repository above (with different generated data, though, and without the app itself) and indeed got somewhat slower (not 30%, but visibly) execution for the json column than for jsonb. There are a couple of important points:

* Parsing happens in both cases; for json it's done for validation purposes.

* Compression of the json data actually dominates the runtime for large json objects, making other factors less visible and reducing the difference in size (jsonb should also be a bit bigger, which is why I would have expected it to be slower).

* At the same time, perf shows that json compression spends a bit more time in pglz_find_match (mostly around accessing and checking history entries), which is a compression optimization via a history table. So probably, due to differences in data layout, this optimization works slightly worse for plain json than for jsonb?
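The generated data was along these lines (a sketch, not my exact queries): documents with the same structure but different pseudo-random hex-string values:

    -- one ~80 KB document per row, 10k rows; md5() yields 32 hex chars
    INSERT INTO docs_jsonb (doc)
    SELECT (SELECT jsonb_object_agg(i, md5(n || '-' || i))
            FROM generate_series(1, 2000) AS i)
    FROM generate_series(1, 10000) AS n;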
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On Tue, Apr 13, 2021 at 11:38:04PM -0700, Mitar wrote:
>> ... Namely, it looks like writing into a jsonb typed column is 30%
>> faster than writing into a json typed column. Why is that? Doesn't
>> jsonb require parsing and conversion of the JSON? That should be
>> slower than just storing the blob as-is?

> * Parsing happens in both cases; for json it's done for validation
> purposes.

Right.

> * Compression of the json data actually dominates the runtime for
> large json objects, making other factors less visible and reducing the
> difference in size (jsonb should also be a bit bigger, which is why I
> would have expected it to be slower).
>
> * At the same time, perf shows that json compression spends a bit more
> time in pglz_find_match (mostly around accessing and checking history
> entries), which is a compression optimization via a history table. So
> probably, due to differences in data layout, this optimization works
> slightly worse for plain json than for jsonb?

Interesting. I recall that we made some last-minute changes in the JSONB physical representation after finding that the original choices resulted in sucky compression behavior. But I think we were only looking at the resultant compressed size, not time-to-compress.

My own guess about this, without having tried to reproduce it, is that JSONB might end up physically smaller than JSON, resulting in less work to push the toasted datum out to disk. This'd depend a lot on your formatting habits for JSON, of course. But in any case, it'd be worth comparing pg_column_size() results to see what's up with that.

			regards, tom lane
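For example, something like this (illustrative table and column names):

    -- per-value on-disk size, after any compression/TOAST
    SELECT avg(pg_column_size(doc)) FROM docs_json;
    SELECT avg(pg_column_size(doc)) FROM docs_jsonb;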
> On Wed, Apr 14, 2021 at 10:26:23AM -0400, Tom Lane wrote:
> My own guess about this, without having tried to reproduce it, is that
> JSONB might end up physically smaller than JSON, resulting in less work
> to push the toasted datum out to disk. This'd depend a lot on your
> formatting habits for JSON, of course. But in any case, it'd be worth
> comparing pg_column_size() results to see what's up with that.

Oh, of course: I've missed that the input I was using was indeed formatted. Without formatting, both cases perform equally well and I can't reproduce the issue. Although, if I understand correctly, the original code in question doesn't actually do any formatting.
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> Oh, of course: I've missed that the input I was using was indeed
> formatted. Without formatting, both cases perform equally well and I
> can't reproduce the issue. Although, if I understand correctly, the
> original code in question doesn't actually do any formatting.

My point was that for JSON, after validating that the input is syntactically correct, we just store it as-received. So in particular the amount of whitespace in the value would depend on how the client had chosen to format the JSON. This'd affect the stored size of course, and I think it would have an effect on compression time too.

			regards, tom lane
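A quick way to see the difference (json keeps the input text verbatim, while jsonb stores a normalized form):

    SELECT '{ "a" : 1 }'::json::text;   -- '{ "a" : 1 }', whitespace preserved
    SELECT '{ "a" : 1 }'::jsonb::text;  -- '{"a": 1}', whitespace discarded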
> On Thu, Apr 15, 2021 at 12:47:25PM -0400, Tom Lane wrote:
> My point was that for JSON, after validating that the input is
> syntactically correct, we just store it as-received. So in particular
> the amount of whitespace in the value would depend on how the client
> had chosen to format the JSON. This'd affect the stored size of
> course, and I think it would have an effect on compression time too.

Yes, I got it and just wanted to confirm you were right: this was the reason for the slowdown I observed when trying to reproduce the report.
Hi!

On Thu, Apr 15, 2021 at 12:11 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > My point was that for JSON, after validating that the input is
> > syntactically correct, we just store it as-received. ...
>
> Yes, I got it and just wanted to confirm you were right: this was the
> reason for the slowdown I observed when trying to reproduce the report.

Thank you for trying to reproduce the report. I did a bit more digging myself and I am still confused.

First, it is important to note that the JSON I am using contains primarily random strings as values, so not really something which is easy to compress. See the example at [1]. I have realized, though, that in the previous benchmark I was inserting the same JSON document multiple times, so compression might have been working across documents or something. So I ran a version of the benchmark with a different JSON inserted each time (with the same structure; just the values are different random strings). There was no difference.

Second, as you can see from [1], the JSON representation I am using is really compact and has no extra spaces. I also used pg_total_relation_size to get the size of the table after inserting 10k rows, and the numbers are similar, with JSONB being slightly larger than the others. So I think the compression theory does not hold.

So I do not know what is happening and why you cannot reproduce it. Maybe explain a bit how you are trying to reproduce it? Directly from the psql console? Are you using the same version as me (13.2)?

Inserting the same large JSON 10k times:

  Type   Mean      Stddev              pg_total_relation_size
  jsonb  200243.1  1679.7741187433503  4611792896
  json   256938.5  2471.9909890612466  4597833728
  text   248175.3  376.677594236769    4597833728

Inserting a different JSON 10k times:

  Type   Mean      Stddev              pg_total_relation_size
  jsonb  202794.5  978.5346442512907   4611792896
  json   259437.9  1785.8411155531167  4597833728
  text   250060.5  912.9207249263213   4597833728

[1] https://gitlab.com/mitar/benchmark-pg-json/-/blob/master/example.json

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
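P.S. The sizes above come from a query along these lines (table names illustrative):

    -- total relation size, including TOAST and any indexes
    SELECT pg_total_relation_size('docs_jsonb');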
Just a note:

On Fri, Apr 23, 2021 at 10:57 AM Mitar <mmitar@gmail.com> wrote:
> First, it is important to note that the JSON I am using contains
> primarily random strings as values, so not really something which is
> easy to compress. See the example at [1].

A fast look at the link: it seems to be long strings of random LOWER CASE HEX DIGITS. A simple Huffman coder can probably put them in 5 bits per char, and a more sophisticated algorithm can probably approach 4.

FOS.
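Back-of-the-envelope: 16 equiprobable symbols carry log2(16) = 4 bits of entropy per character, but each hex digit is stored as an 8-bit byte, so even "random" hex text is roughly 2x compressible in principle.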
Hi!

On Fri, Apr 23, 2021 at 10:28 AM Francisco Olarte <folarte@peoplecall.com> wrote:
> A fast look at the link: it seems to be long strings of random LOWER
> CASE HEX DIGITS. A simple Huffman coder can probably put them in 5
> bits per char, and a more sophisticated algorithm can probably
> approach 4.

But this compressibility would apply to both the JSONB and JSON column types, no? Moreover, it looks like the JSONB column type ends up larger on disk.

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
Mitar:

On Fri, Apr 23, 2021 at 7:33 PM Mitar <mmitar@gmail.com> wrote:
> But this compressibility would apply to both the JSONB and JSON column
> types, no? Moreover, it looks like the JSONB column type ends up
> larger on disk.

Of course. I did not follow the thread too deeply, just pointed it out in case you were assuming the data was not going to be stored compressed.

Also, I am not surprised JSONB ends up being fatter. When no binary data is transmitted and no compression is used, compact JSON has very little redundancy, about 6 chars per k/v pair (in "key":"value", the two pairs of quotes, the colon and the comma are the only structural overhead); this is hard to beat without using sophisticated encodings, especially with long data. And if jsonb starts out bigger, and IIRC it does trade off compactness to be easily indexable and other stuff, it will normally end up bigger. But I am not an expert on this.

FOS.
Hi!

On Fri, Apr 23, 2021 at 10:49 AM Francisco Olarte <folarte@peoplecall.com> wrote:
> Of course. I did not follow the thread too deeply, just pointed it out
> in case you were assuming the data was not going to be stored
> compressed.

Thanks for pointing that out. I was just trying to make sure I am understanding you correctly and that we are all on the same page about the implications. It seems we are.

> Also, I am not surprised JSONB ends up being fatter.

Yes, by itself this is not surprising. The reason I mentioned it is that in my original post in this thread I wrote that I am surprised that inserting into a JSONB column seems observably faster than into a JSON or TEXT column (for the same data), and I wonder why that is. One theory presented was that JSONB might compress better, so there is less IO and insertion is faster. But JSONB does not look more compressed (I had not measured the size in my original benchmark), so now I am searching for other explanations for the results of my benchmark.

Mitar

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
> On Fri, Apr 23, 2021 at 01:56:57AM -0700, Mitar wrote:
> Thank you for trying to reproduce the report. I did a bit more digging
> myself and I am still confused.
>
> ...
>
> So I do not know what is happening and why you cannot reproduce it.

Could you maybe get a profile with perf for both cases? Since they're executed within a single backend, you can profile just that one pid. A reasonable profiling frequency, --call-graph dwarf, and probably limiting events to user space only with precise tagging (cycles:uppp) should give an impression of what's going on.
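Something along these lines (a sketch; adjust the frequency to taste). First find the backend to attach to:

    SELECT pg_backend_pid();

then, while the benchmark runs:

    perf record -F 999 --call-graph dwarf -e cycles:uppp -p <backend_pid>
    perf report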