Re: Making jsonb_agg() faster - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: Making jsonb_agg() faster
Date
Msg-id CAHyXU0zOLCn_6NXg0ooEm3pijFxZaaiN9A=W4AfFvY4+n1GG3A@mail.gmail.com
Whole thread Raw
In response to Making jsonb_agg() faster  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Making jsonb_agg() faster
List pgsql-hackers
On Tue, Jul 22, 2025 at 10:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
There have been some complaints recently about how jsonb_agg()
is a lot slower than json_agg() [1].  That's annoying considering
that the whole selling point of JSONB is to have faster processing
than the original JSON type, so I poked into that.  What I found
is that jsonb_agg() and its variants are just really inefficiently
implemented.  Basically, for each aggregate input value, they will:

1. Build a JsonbValue tree representation of the input value.
2. Flatten the JsonbValue tree into a Jsonb in on-disk format.
3. Iterate through the Jsonb, building a JsonbValue that is part
of the aggregate's state stored in aggcontext, but is otherwise
identical to what phase 1 built.

The motivation for this seems to have been to make sure that any
memory leakage during phase 1 does not happen in the long-lived
aggcontext.  But it's hard not to call it a Rube Goldberg contraption.

The attached patch series gets rid of phases 2 and 3 by refactoring
pushJsonbValue() and related functions so that the JsonbValue tree
they construct can be constructed in a context that's not
CurrentMemoryContext.  With that and some run-of-the-mill optimization
work, I'm getting 2.5X speedup for jsonb_agg on a text column (as
measured by the attached test script) and a bit over 2X on an int8
column.  It's still a little slower than json_agg, but no longer
slower by integer multiples.

0001 is a somewhat invasive refactoring of the API for
pushJsonbValue and friends.  It doesn't in itself have any
measurable speed consequences as far as I can tell, but I think
it makes the code nicer in any case.  (I really do not like the
existing coding setup where sometimes it's important to capture
the result of pushJsonbValue and sometimes it's not; that
seems awfully confusing and bug-prone.)  The real point though
is to have a way of commanding pushJsonbValue to build the
JsonbValue tree somewhere other than CurrentMemoryContext.

Having laid the groundwork with 0001, 0002 simply amounts to
telling pushJsonbValue to put its handiwork in the aggcontext
and then ripping out phases 2 and 3 of the aggregate transfns.

0003 is some simple micro-optimization of the datatype conversion
code in datum_to_jsonb_internal.

Thoughts?

Really excited about this -- I'll do some testing.  Performance of serialization (generating json output from non json data) is the main reason I still recommend json...jsonb is better in most other cases, but serialization performance is extremely important.  

The other reason json type is needed is being able to serialize WYSIWYG:

superuser@postgres=# select to_json(q), to_jsonb(q) from (select 1 as b, 2 as a, 3 as b) q;
       to_json       |     to_jsonb
---------------------+------------------
 {"b":1,"a":2,"b":3} | {"a": 2, "b": 3}

jsonb output can be obnoxious for aesthetic reasons here as well as uncommon technical ones where key order is important to the receiving processor. Point being, if there was some way to do that in the jsonb interface with good performance, the need for json type (or at least an independent implementation) would completely evaporate.  

The only hypothetical use case for legacy json is precise storage, but that can be worked around with TEXT.  

merlin 

pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: index prefetching
Next
From: Peter Geoghegan
Date:
Subject: Re: index prefetching