Thread: Speed up JSON escape processing with SIMD plus other optimisations

Currently, the escape_json() function takes a cstring and checks each
character in the string, one at a time, up to the NUL, appending the
escape sequence if the character requires it.

Because this function requires a NUL-terminated string, we're having
to do a little more work in some places.  For example, in
jsonb_put_escaped_value() we call pnstrdup() on the non-NUL-terminated
string to make a NUL-terminated string to pass to escape_json().

To make this faster, we can just have a version of escape_json() which
takes a 'len' and stops after processing that many characters rather
than stopping when the NUL char is reached.  There's then no need to
pnstrdup(), which saves some palloc()/memcpy() work.
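
Roughly, the len-taking version is just the existing loop with the
termination condition changed.  Something like this (a sketch only; the
name and layout here are illustrative, not lifted from the patch):

static void
escape_json_with_len(StringInfo buf, const char *str, int len)
{
    appendStringInfoCharMacro(buf, '"');

    for (int i = 0; i < len; i++)       /* bounded by 'len', not by the NUL */
    {
        char        c = str[i];

        switch (c)
        {
            case '\b': appendStringInfoString(buf, "\\b"); break;
            case '\f': appendStringInfoString(buf, "\\f"); break;
            case '\n': appendStringInfoString(buf, "\\n"); break;
            case '\r': appendStringInfoString(buf, "\\r"); break;
            case '\t': appendStringInfoString(buf, "\\t"); break;
            case '"': appendStringInfoString(buf, "\\\""); break;
            case '\\': appendStringInfoString(buf, "\\\\"); break;
            default:
                if ((unsigned char) c < ' ')
                    appendStringInfo(buf, "\\u%04x", (int) c);
                else
                    appendStringInfoCharMacro(buf, c);
                break;
        }
    }

    appendStringInfoCharMacro(buf, '"');
}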

There are also a few places where we call escape_json() with a "text"
typed Datum, where we go and convert the text to a NUL-terminated
cstring so we can pass it along to escape_json().  That's wasteful as
we could just pass the payload of the text Datum directly, and only
allocate memory if the text Datum needs to be de-toasted.  That saves
a useless palloc/memcpy/pfree cycle.
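
For example, something along these lines (sketch only, assuming the
escape_json_with_len() above; the function name is made up):

static void
escape_json_text_datum(StringInfo buf, Datum value)
{
    text   *txt = DatumGetTextPP(value);    /* allocates only if de-toasting */

    escape_json_with_len(buf, VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt));

    /* free only if de-toasting gave us a copy */
    if ((Pointer) txt != DatumGetPointer(value))
        pfree(txt);
}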

Now, to make this more interesting, since we have a version of
escape_json() which takes a 'len', we could start looking at more than
one character at a time.  If you look closely at escape_json(), all the
special chars apart from " and \ are below the space character.
pg_lfind8() and pg_lfind8_le() allow processing of 16 bytes at a time,
so we only need to search each 16-byte chunk 3 times to ensure that no
special chars exist within it.  When that test fails, just fall back to
byte-at-a-time processing, first copying over the portion of the string
that passed the vector test up until that point.
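
In rough terms it's this (sketch only -- escape_json_char() is a made-up
per-character helper, and the real patch resumes the vector search after
dealing with the offending chunk rather than finishing the rest of the
string byte-at-a-time):

#include "port/pg_lfind.h"

static void
escape_json_simd(StringInfo buf, const char *str, int len)
{
    int     i = 0;

    appendStringInfoCharMacro(buf, '"');

    /* scan 16 bytes at a time while nothing needs escaping */
    while (i + 16 <= len)
    {
        uint8  *chunk = (uint8 *) &str[i];

        /* 3 searches per chunk: control chars (<= 0x1F), '"' and '\' */
        if (pg_lfind8_le(0x1F, chunk, 16) ||
            pg_lfind8('"', chunk, 16) ||
            pg_lfind8('\\', chunk, 16))
            break;
        i += 16;
    }

    /* copy over the prefix that the vector test proved is escape-free */
    appendBinaryStringInfo(buf, str, i);

    /* handle the rest byte-at-a-time, as escape_json() does today */
    for (; i < len; i++)
        escape_json_char(buf, str[i]);      /* hypothetical helper */

    appendStringInfoCharMacro(buf, '"');
}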

I've attached 2 patches:

0001 does everything I've described aside from SIMD.
0002 does SIMD

I've not personally done too much work in the area of JSON, so I don't
have any canned workloads to throw at this.  I did try the following:

create table j1 (very_long_column_name_to_test_json_escape text);
insert into j1 select repeat('x', x) from generate_series(0,1024)x;
vacuum freeze j1;

bench.sql:
select row_to_json(j1)::jsonb from j1;

Master:
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 362.494309 (without initial connection time)
tps = 363.182458 (without initial connection time)
tps = 362.679654 (without initial connection time)

Master + 0001 + 0002
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 426.456885 (without initial connection time)
tps = 430.573046 (without initial connection time)
tps = 431.142917 (without initial connection time)

About 18% faster.

It would be much faster if we could also get rid of the
escape_json_cstring() call in the switch default case of
datum_to_json_internal(). row_to_json() would be heaps faster with
that done. I considered adding a special case for the "text" type
there, but in the end felt that we should just fix that with some
hypothetical other patch that changes how output functions work.
Others may feel it's worthwhile. I certainly could be convinced of it.

I did add a new regression test. I'm not sure I'd want to keep that,
but felt it's worth leaving in there for now.

Other things I considered were whether doing 16 bytes at a time is too
much, as it puts quite a bit of work onto byte-at-a-time processing if
just 1 special char exists in a 16-byte chunk.  I considered doing SWAR
[1] processing to do the job of vector8_has_le() and vector8_has(), but
maybe with just uint32s.  It might be worth doing that.  However, I've
not done it yet as it raises the bar for this patch quite a bit.  SWAR
vector processing is pretty much write-only code.  Imagine trying to
write comments for the code in [2] so that the average person could
understand what's going on!?
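
For the record, the SWAR version of those checks would look something
like this (64-bit words rather than uint32s, using the well-known
bit-twiddling-hacks tricks; this is not part of the attached patches):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* true if any byte in 'v' is zero */
static inline bool
haszero64(uint64_t v)
{
    return ((v - UINT64_C(0x0101010101010101)) & ~v &
            UINT64_C(0x8080808080808080)) != 0;
}

/*
 * true if the 8 bytes at 's' contain a control char (< 0x20), a '"' or a
 * '\' -- i.e. anything escape_json() would need to escape.
 */
static inline bool
chunk_needs_escaping(const char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));   /* dodges alignment issues */

    /* "hasless(x, 0x20)": does any byte compare below the space char? */
    if ((chunk - UINT64_C(0x2020202020202020)) & ~chunk &
        UINT64_C(0x8080808080808080))
        return true;

    /* any byte equal to '"' (0x22) or '\' (0x5C)? */
    return haszero64(chunk ^ UINT64_C(0x2222222222222222)) ||
           haszero64(chunk ^ UINT64_C(0x5C5C5C5C5C5C5C5C));
}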

I'd be happy to hear from anyone that can throw these patches at a
real-world JSON workload to see if it runs more quickly.

Parking for July CF.

David

[1] https://en.wikipedia.org/wiki/SWAR
[2] https://dotat.at/@/2022-06-27-tolower-swar.html

On Thu, 23 May 2024 at 13:23, David Rowley <dgrowleyml@gmail.com> wrote:
> Master:
> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
> tps = 362.494309 (without initial connection time)
> tps = 363.182458 (without initial connection time)
> tps = 362.679654 (without initial connection time)
>
> Master + 0001 + 0002
> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
> tps = 426.456885 (without initial connection time)
> tps = 430.573046 (without initial connection time)
> tps = 431.142917 (without initial connection time)
>
> About 18% faster.
>
> It would be much faster if we could also get rid of the
> escape_json_cstring() call in the switch default case of
> datum_to_json_internal(). row_to_json() would be heaps faster with
> that done. I considered adding a special case for the "text" type
> there, but in the end felt that we should just fix that with some
> hypothetical other patch that changes how output functions work.
> Others may feel it's worthwhile. I certainly could be convinced of it.

Just to turn that into performance numbers, I tried the attached
patch.  The numbers came out better than I thought.

Same test as before:

master + 0001 + 0002 + attached hacks:
$ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
tps = 616.094394 (without initial connection time)
tps = 615.928236 (without initial connection time)
tps = 614.175494 (without initial connection time)

About 70% faster than master.

David

On 2024-05-22 We 22:15, David Rowley wrote:
> On Thu, 23 May 2024 at 13:23, David Rowley <dgrowleyml@gmail.com> wrote:
>> Master:
>> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
>> tps = 362.494309 (without initial connection time)
>> tps = 363.182458 (without initial connection time)
>> tps = 362.679654 (without initial connection time)
>>
>> Master + 0001 + 0002
>> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
>> tps = 426.456885 (without initial connection time)
>> tps = 430.573046 (without initial connection time)
>> tps = 431.142917 (without initial connection time)
>>
>> About 18% faster.
>>
>> It would be much faster if we could also get rid of the
>> escape_json_cstring() call in the switch default case of
>> datum_to_json_internal(). row_to_json() would be heaps faster with
>> that done. I considered adding a special case for the "text" type
>> there, but in the end felt that we should just fix that with some
>> hypothetical other patch that changes how output functions work.
>> Others may feel it's worthwhile. I certainly could be convinced of it.
> Just to turn that into performance numbers, I tried the attached
> patch.  The numbers came out better than I thought.
>
> Same test as before:
>
> master + 0001 + 0002 + attached hacks:
> $ pgbench -n -f bench.sql -T 10 -M prepared postgres | grep tps
> tps = 616.094394 (without initial connection time)
> tps = 615.928236 (without initial connection time)
> tps = 614.175494 (without initial connection time)
>
> About 70% faster than master.
>

That's all pretty nice! I'd take the win on this rather than wait for 
some hypothetical patch that changes how output functions work.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




On Fri, 24 May 2024 at 08:34, Andrew Dunstan <andrew@dunslane.net> wrote:
> That's all pretty nice! I'd take the win on this rather than wait for
> some hypothetical patch that changes how output functions work.

On rethinking that: even if we changed the output functions to write
directly to a StringInfo, we wouldn't get the same speedup.  All it
would get us is a better ability to know the length of the string the
output function generated by looking at the StringInfoData.len before
and after calling the output function. That *would* allow us to use
the SIMD escaping, but not save the palloc/memcpy cycle for
non-toasted Datums.  In other words, if we want this speedup then I
don't see another way other than this special case.
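
To spell that out (everything below is hypothetical -- no such output
function API exists today -- it just shows where the length would come
from):

/* hypothetical: an output function that appends to a StringInfo */
typedef void (*StringInfoOutFunc) (StringInfo buf, Datum val);

static void
append_json_value_hypothetical(StringInfo result, Datum val,
                               StringInfoOutFunc out, StringInfo scratch)
{
    int     start = scratch->len;

    out(scratch, val);          /* output function writes into 'scratch' */

    /* the length is now known without strlen(), so SIMD escaping works... */
    escape_json_with_len(result, scratch->data + start, scratch->len - start);

    /*
     * ...but the text still got written to 'scratch' and copied to
     * 'result', so the palloc/memcpy cycle for non-toasted Datums remains.
     */
}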

I've attached a rebased patch series which includes the 3rd patch in a
more complete form.  This one also adds handling for the varchar and
char(n) output functions.  Ideally, these would also use textout() to
avoid having the ORs in the if condition.  The output function code
is the same in each.
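
For reference, the special case amounts to something like this in
datum_to_json_internal()'s default case (sketch only; the real 0003 may
differ, and escape_json_cstring() is the name used earlier in this
thread):

if (outfuncoid == F_TEXTOUT || outfuncoid == F_VARCHAROUT ||
    outfuncoid == F_BPCHAROUT)
{
    /* text, varchar and char(n): use the varlena payload directly */
    text   *txt = DatumGetTextPP(val);

    escape_json_with_len(result, VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt));

    if ((Pointer) txt != DatumGetPointer(val))
        pfree(txt);
}
else
{
    /* everything else still goes through its output function */
    char   *outputstr = OidOutputFunctionCall(outfuncoid, val);

    escape_json_cstring(result, outputstr);
    pfree(outputstr);
}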

Updated benchmarks from the test in [1].

master @ 7c655a04a
$ for i in {1..3}; do pgbench -n -f bench.sql -T 10 -M prepared
postgres | grep tps; done
tps = 366.211426
tps = 359.707014
tps = 362.204383

master + 0001
$ for i in {1..3}; do pgbench -n -f bench.sql -T 10 -M prepared
postgres | grep tps; done
tps = 362.641668
tps = 367.986495
tps = 368.698193 (+1% vs master)

master + 0001 + 0002
$ for i in {1..3}; do pgbench -n -f bench.sql -T 10 -M prepared
postgres | grep tps; done
tps = 430.477314
tps = 425.173469
tps = 431.013275 (+18% vs master)

master + 0001 + 0002 + 0003
$ for i in {1..3}; do pgbench -n -f bench.sql -T 10 -M prepared
postgres | grep tps; done
tps = 606.702305
tps = 625.727031
tps = 617.164822 (+70% vs master)

David

[1] https://postgr.es/m/CAApHDvpLXwMZvbCKcdGfU9XQjGCDm7tFpRdTXuB9PVgpNUYfEQ@mail.gmail.com

Hi David,

Thanks for the patch.

In the 0001 patch, I see that there are some escape_json() calls with NUL-terminated strings that get the length by calling strlen(), like below:

- escape_json(&buf, "timestamp");
+ escape_json(&buf, "timestamp", strlen("timestamp"));

 Wouldn't using escape_json_cstring() be better instead? IIUC there isn't much difference between escape_json() and escape_json_cstring(), right? We would avoid strlen() with escape_json_cstring().

Regards,
--
Melih Mutlu
Microsoft


On 2024-06-11 Tu 08:08, Melih Mutlu wrote:
> Hi David,
>
> Thanks for the patch.
>
> In the 0001 patch, I see that there are some escape_json() calls with
> NUL-terminated strings that get the length by calling strlen(), like below:
>
> - escape_json(&buf, "timestamp");
> + escape_json(&buf, "timestamp", strlen("timestamp"));
>
> Wouldn't using escape_json_cstring() be better instead? IIUC there isn't
> much difference between escape_json() and escape_json_cstring(), right?
> We would avoid strlen() with escape_json_cstring().



or maybe use sizeof("timestamp") - 1


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Thanks for having a look.

On Wed, 12 Jun 2024 at 00:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> In the 0001 patch, I see that there are some escape_json() calls with
> NUL-terminated strings that get the length by calling strlen(), like below:
>
>> - escape_json(&buf, "timestamp");
>> + escape_json(&buf, "timestamp", strlen("timestamp"));
>
> Wouldn't using escape_json_cstring() be better instead? IIUC there isn't
> much difference between escape_json() and escape_json_cstring(), right?
> We would avoid strlen() with escape_json_cstring().

Maybe it would be better, but not for this reason.  Most compilers will
be able to perform constant folding to transform the
strlen("timestamp") into 9.  You can see that's being done by both gcc
and clang in [1].

It might be better to use escape_json_cstring() regardless of that, as
the SIMD only kicks in when there are >= 16 chars, so there might be a
few more instructions calling the SIMD version for such a short
string.  Probably, if we're worried about performance here, we could
just not bother passing the string through the escape function to
search for something we know isn't there, and just
appendBinaryStringInfo() the "\"timestamp\":" literal directly.
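
i.e. something like this (just to illustrate the idea; whether the colon
belongs in the literal depends on the call site):

/* "timestamp" needs no escaping, so emit the already-escaped key directly */
appendBinaryStringInfo(&buf, "\"timestamp\":", sizeof("\"timestamp\":") - 1);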

I don't really have a preference as to which of these we use. I doubt
the JSON escaping rules would ever change sufficiently that the latter
of these methods would be a bad idea. I just doubt it's worth the
debate as I imagine the performance won't matter that much.

David

[1] https://godbolt.org/z/xqj4rKara