Thread: [PATCH] Optimize json_lex_string by batching character copying

[PATCH] Optimize json_lex_string by batching character copying

From
Jelte Fennema
Date:
When parsing JSON, strings need to be converted from the JSON string
format to a C-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that JSON supports. Because of
this, our JSON parser wrote characters into the C-style string buffer
one at a time.

However, copying one character at a time is only necessary for the
escape sequences, which map to different characters. This patch changes
the behaviour for non-escaped characters: these are now copied in
batches instead of one character at a time.
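
To illustrate the batching idea outside the lexer, here is a toy
standalone sketch (hypothetical names, not the patch itself; see the
attachment for the real change):

#include <string.h>

/*
 * Toy sketch: copy src into dst while consuming backslash pairs,
 * flushing each run of ordinary bytes with one memcpy instead of
 * appending byte by byte.
 */
static size_t
unescape_batched(char *dst, const char *src, size_t len)
{
    size_t      out = 0;
    size_t      run = 0;        /* pending copyable bytes */
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if (src[i] == '\\' && i + 1 < len)
        {
            /* flush the batch before handling the escape */
            memcpy(dst + out, src + i - run, run);
            out += run;
            run = 0;
            dst[out++] = src[++i];      /* keep escaped char literally */
        }
        else
            run++;              /* defer the copy */
    }
    memcpy(dst + out, src + len - run, run);    /* final flush */
    return out + run;
}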

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly large JSONB values of ~15kB. The JSONB
values are JSON objects with a single level. They contain a few small
keys and values, but one very big value that's a stringified JSON blob.
So this JSON blob contains a relatively high number of escape characters,
to escape all the " characters. This change improves performance for this
workload on my machine by ~18% (going from 1m24s to 1m09s).

@Andres, there was indeed some low hanging fruit.
@John Naylor, SSE2 indeed sounds like another nice improvement. I'll leave
that to you.

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Zhihong Yu
Date:
Hi,
Looking at the patch,

+           if (copyable_characters_length)
+           {
+               /* flush copyable characters */
+               appendBinaryStringInfo(
+                                      lex->strval,
+                                      s - copyable_characters_length,
+                                      copyable_characters_length);
+
+           }
            break;

I wonder why copyable_characters_length is not reset after flushing.

Cheers

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Jelte Fennema
Date:
> +           if (copyable_characters_length)
> +           {
> +               /* flush copyable characters */
> +               appendBinaryStringInfo(
> +                                      lex->strval,
> +                                      s - copyable_characters_length,
> +                                      copyable_characters_length);
> +
> +           }
>            break;
>
> I wonder why copyable_characters_length is not reset after flushing.

It breaks out of the loop right after, so copyable_characters_length isn't used
again and resetting it is not necessary. But I agree this could use a comment.
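
Something like this, perhaps:

            if (copyable_characters_length)
            {
                /*
                 * Flush copyable characters. There's no need to reset
                 * copyable_characters_length afterwards, because we break
                 * out of the loop right below.
                 */
                appendBinaryStringInfo(
                                       lex->strval,
                                       s - copyable_characters_length,
                                       copyable_characters_length);
            }
            break;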


Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andres Freund
Date:
Hi,

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
> To test performance of this change I used COPY BINARY from a JSONB table
> into another, containing fairly JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
on top of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM
generate_series(1,100);
 
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.


I think your patch doesn't quite go far enough - we still end up looping for
each character, and have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to find how far the string
of copyable characters extends isn't faster.  That then obviously could be SIMDified.


Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler to specialize
the code for us.
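
E.g. a toy sketch of that pattern (hypothetical names, nothing to do
with the actual json code; in PostgreSQL code the worker would use
pg_attribute_always_inline):

/*
 * One always-inlined worker with a constant flag; each wrapper then
 * gets its own copy with the branch folded away by the compiler.
 */
static inline int
scan_string(const char *s, char *out, const int keep_value)
{
    int         n = 0;

    for (; *s; s++, n++)
    {
        if (keep_value)         /* constant-folded in each wrapper */
            out[n] = *s;
    }
    if (keep_value)
        out[n] = '\0';
    return n;
}

static int
scan_with_value(const char *s, char *out)
{
    return scan_string(s, out, 1);
}

static int
scan_length_only(const char *s)
{
    return scan_string(s, NULL, 0);
}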


Might also be worth maintaining ->strval using appendBinaryStringInfoNT().

Greetings,

Andres Freund



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andres Freund
Date:
Hi,

On 2022-06-24 17:18:10 -0700, Andres Freund wrote:
> On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
> > To test performance of this change I used COPY BINARY from a JSONB table
> > into another, containing fairly JSONB values of ~15kB.
> 
> This will have a lot of other costs included (DML is expensive). I'd suggest
> storing the json in a text column and casting it to json[b], with a filter
> on top of the json[b] result that cheaply filters it away. That should end up
> spending nearly all the time somewhere around json parsing.
> 
> It's useful for things like this to include a way for others to use the same
> benchmark...
> 
> I tried your patch with:
> 
> DROP TABLE IF EXISTS json_as_text;
> CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM
> generate_series(1,100);
>
> VACUUM FREEZE json_as_text;
> 
> SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
> 
> Which the patch improves from 846ms to 754ms (best of three). A bit smaller
> than your improvement, but still nice.
> 
> 
> I think your patch doesn't quite go far enough - we still end up looping for
> each character, and have the added complication of needing to flush the
> "buffer". I'd be surprised if a "dedicated" loop to find how far the string
> of copyable characters extends isn't faster.  That then obviously could be SIMDified.

A naive implementation (attached) of that gets me down to 706ms.

Greetings,

Andres Freund

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Sat, Jun 25, 2022 at 8:05 AM Andres Freund <andres@anarazel.de> wrote:

> > I tried your patch with:
> >
> > DROP TABLE IF EXISTS json_as_text;
> > CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM
> > generate_series(1,100);
> >
> > VACUUM FREEZE json_as_text;
> >
> > SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
> >
> > Which the patch improves from 846ms to 754ms (best of three). A bit smaller
> > than your improvement, but still nice.
> >
> >
> > I think your patch doesn't quite go far enough - we still end up looping for
> > each character, and have the added complication of needing to flush the
> > "buffer". I'd be surprised if a "dedicated" loop to find how far the string
> > of copyable characters extends isn't faster.  That then obviously could be SIMDified.
>
> A naive implementation (attached) of that gets me down to 706ms.

Taking this a step further, I modified json_lex and json_lex_string to
use a const end pointer instead of maintaining the length (0001). The
observed speedup is small enough that it might not be real, but the
code is simpler this way, and it makes 0002 and 0003 easier to reason
about. Then I modified your patch to do the same (0002). Hackish SSE2
support is in 0003.

To exercise the SIMD code a bit, I added a second test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
        select repeat(description, 10) from pg_description pd
)
SELECT (select json_agg(row_to_json(long)) as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

SELECT 1 FROM long_json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

With this, I get (turbo disabled, min of 3):

short test:
master: 769ms
0001:   751ms
0002:   703ms
0003:   701ms

long test:
master: 939ms
0001:   883ms
0002:   537ms
0003:   439ms

I think 0001/2 are mostly in committable shape.

With 0003, I'd want to make the portability check a bit nicer and more
centralized. I'm thinking of modifying the CRC check to report that
the host cpu/compiler understands SSE4.2 x86 intrinsics, and then the
compile time SSE2 check can piggyback on top of that without a runtime
check. This is conceptually easy but a bit of work to not look like a
hack (which probably means the ARM CRC check should look more generic
somehow...). The regression tests will likely need some work as well.

> Separately, it seems pretty awful efficiency / code density wise to have the
> NULL checks for ->strval all over. Might be worth forcing json_lex() and
> json_lex_string() to be inlined, with a constant parameter deciding whether
> ->strval is expected. That'd likely be enough to get the compiler to
> specialize the code for us.
> the code for us.

I had a look at this but it's a bit more invasive than I want to
devote time to at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andres Freund
Date:
Hi,

On 2022-07-06 12:10:20 +0700, John Naylor wrote:
> diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
> index eeedc0645a..ad4858c623 100644
> --- a/src/common/jsonapi.c
> +++ b/src/common/jsonapi.c
> @@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
>          }
>          else if (lex->strval != NULL)
>          {
> +            /* start lookahead at next byte */
> +            char       *p = s + 1;
> +
>              if (hi_surrogate != -1)
>                  return JSON_UNICODE_LOW_SURROGATE;
>  
> -            appendStringInfoChar(lex->strval, *s);
> +            while (p < end)
> +            {
> +                if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
> +                    break;
> +                p++;
> +            }
> +
> +            appendBinaryStringInfo(lex->strval, s, p - s);
> +
> +            /*
> +             * s will be incremented at the top of the loop, so set it to just
> +             * behind our lookahead position
> +             */
> +            s = p - 1;
>          }
>      }
>  
> -- 
> 2.36.1

I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo(). And it should simplify the
function, because some of the relevant code is duplicated outside as well...
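
I.e. roughly this, relative to the hunk above (untested sketch):

            while (p < end)
            {
                if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
                    break;
                p++;
            }

            if (lex->strval != NULL)
                appendBinaryStringInfo(lex->strval, s, p - s);

            /* s is incremented at the top of the loop */
            s = p - 1;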

Greetings,

Andres Freund



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Wed, Jul 6, 2022 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:

> I think before committing something along those lines we should make the
> relevant bits also be applicable when ->strval is NULL, as several functions
> use that (notably json_in IIRC). Afaics we'd just need to move the strval
> check to be around the appendBinaryStringInfo().

That makes sense and is easy.

> And it should simplify the
> function, because some of the relevant code is duplicated outside as well...

Not sure how far to take this, but I put the returnable paths inside
the "other" path, so only backslash will go back to the top.

Both the above changes are split into a new 0003 patch for easier
review, but in the end will likely be squashed with 0002.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week. 0004 needs some thought on integrating with
symbols we discover during configure.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Fri, Jul 8, 2022 at 3:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I've pushed 0001 (although the email seems to have been swallowed
> again), and pending additional comments on 0002 and 0003 I'll squash
> and push those next week.

This is done.

> 0004 needs some thought on integrating with
> symbols we discover during configure.

Still needs thought.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andres Freund
Date:
Hi,

On 2022-07-06 15:58:44 +0700, John Naylor wrote:
> From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
> From: John Naylor <john.naylor@postgresql.org>
> Date: Wed, 6 Jul 2022 15:50:09 +0700
> Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86
> 
> ---
>  src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
> index 81e176ad8d..44e8ed2b2f 100644
> --- a/src/common/jsonapi.c
> +++ b/src/common/jsonapi.c
> @@ -24,6 +24,12 @@
>  #include "miscadmin.h"
>  #endif
>  
> +/*  WIP: put somewhere sensible and consider removing CRC from the names */
> +#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
> +#include <nmmintrin.h>
> +#define USE_SSE2
> +#endif
> +
>  /*
>   * The context of the parser is maintained by the recursive descent
>   * mechanism, but is passed explicitly to the error reporting routine
> @@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
>          }
>          else
>          {
> +#ifdef USE_SSE2
> +            __m128i        block,
> +                        has_backslash,
> +                        has_doublequote,
> +                        control,
> +                        has_control,
> +                        error_cum = _mm_setzero_si128();
> +            const        __m128i backslash = _mm_set1_epi8('\\');
> +            const        __m128i doublequote = _mm_set1_epi8('"');
> +            const        __m128i max_control = _mm_set1_epi8(0x1F);
> +#endif
>              /* start lookahead at current byte */
>              char       *p = s;
>  
>              if (hi_surrogate != -1)
>                  return JSON_UNICODE_LOW_SURROGATE;
>  
> +#ifdef USE_SSE2
> +            while (p < end - sizeof(__m128i))
> +            {
> +                block = _mm_loadu_si128((const __m128i *) p);
> +
> +                /* direct comparison to quotes and backslashes */
> +                has_backslash = _mm_cmpeq_epi8(block, backslash);
> +                has_doublequote = _mm_cmpeq_epi8(block, doublequote);
> +
> +                /*
> +                 * use saturation arithmetic to check for <= highest control
> +                 * char
> +                 */
> +                control = _mm_subs_epu8(block, max_control);
> +                has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
> +
> +                /*
> +                 * set bits in error_cum where the corresponding lanes in has_*
> +                 * are set
> +                 */
> +                error_cum = _mm_or_si128(error_cum, has_backslash);
> +                error_cum = _mm_or_si128(error_cum, has_doublequote);
> +                error_cum = _mm_or_si128(error_cum, has_control);
> +
> +                if (_mm_movemask_epi8(error_cum))
> +                    break;
> +
> +                p += sizeof(__m128i);
> +            }
> +#endif                            /* USE_SSE2 */
> +
>              while (p < end)
>              {
>                  if (*p == '\\' || *p == '"')
> -- 
> 2.36.1
> 

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

Greetings,

Andres Freund



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I wonder if we can't abstract this at least a bit better. If we go that route
> a bit further, then add another arch, this code will be pretty much
> unreadable.

IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.

            regards, tom lane



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andres Freund
Date:
Hi,

On 2022-07-11 11:53:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wonder if we can't abstract this at least a bit better. If we go that route
> > a bit further, then add another arch, this code will be pretty much
> > unreadable.
>
> IMO, it's pretty unreadable *now*, for lack of comments about what it's
> doing and why.

Yea, that could at least be addressed by adding comments. But even with a
bunch of comments, it'd still be pretty hard to read once the events above
have happened (and they seem kind of inevitable).

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There are plenty of other places that could
be useful.

Greetings,

Andres Freund



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:

On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:

> I wonder if we can add a somewhat more general function for scanning until
> some characters are found using SIMD? There are plenty of other places that could
> be useful.

In simple cases, we could possibly abstract the entire loop. With this particular case, I imagine the most approachable way to write the loop would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
       !vector_has_byte(p, '\\') &&
       !vector_has_byte(p, '"') &&
       vector_min_byte(p, 0x20))
    p += VECTOR_WIDTH

I wonder if we'd lose a bit of efficiency here by not accumulating set bits from the three conditions, but it's worth trying.
--
John Naylor
EDB: http://www.enterprisedb.com

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Andrew Dunstan
Date:
On 2022-06-24 Fr 20:18, Andres Freund wrote:
> Hi,
>
> On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
>> To test performance of this change I used COPY BINARY from a JSONB table
>> into another, containing fairly JSONB values of ~15kB.
> This will have a lot of other costs included (DML is expensive). I'd suggest
> storing the json in a text column and casting it to json[b], with a filter
> on top of the json[b] result that cheaply filters it away. That should end up
> spending nearly all the time somewhere around json parsing.
>
> It's useful for things like this to include a way for others to use the same
> benchmark...
>
> I tried your patch with:
>
> DROP TABLE IF EXISTS json_as_text;
> CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM
> generate_series(1,100);
>
> VACUUM FREEZE json_as_text;
>
> SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';


I've been doing some other work related to json parsing and John
referred me to this. But it's actually not the best test for pure json
parsing - casting to jsonb involves some extra work besides pure
parsing. Instead I've been using this query with the same table, which
should be almost all json parsing:


select 1 from json_as_text where t::json is null;


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
I wrote

> On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:
>
> > I wonder if we can add a somewhat more general function for scanning until
> > some characters are found using SIMD? There are plenty of other places that could
> > be useful.
>
> In simple cases, we could possibly abstract the entire loop. With this particular case, I imagine the most
> approachable way to write the loop would be a bit more low-level:
>
> while (p < end - VECTOR_WIDTH &&
>        !vector_has_byte(p, '\\') &&
>        !vector_has_byte(p, '"') &&
>        vector_min_byte(p, 0x20))
>     p += VECTOR_WIDTH
>
> I wonder if we'd lose a bit of efficiency here by not accumulating set bits from the three conditions, but it's worth
> trying.

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

master:
356ms

v5:
259ms

v5 disable SSE:
288ms

It still needs a bit of polishing and testing, but I think it's a good
workout for abstracting SIMD out of the way.
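
To spell out the building block of the non-SSE fallback, it's a
variation on the well-known zero-byte bit trick; a self-contained
sketch, not the patch code itself:

#include <stdint.h>
#include <string.h>

/* report whether any of the eight byte lanes of the chunk equals c */
static inline int
chunk_has_byte(const char *p, unsigned char c)
{
    uint64_t    ones = UINT64_C(0x0101010101010101);
    uint64_t    chunk;
    uint64_t    x;

    memcpy(&chunk, p, sizeof(chunk));   /* load the "vector" */
    x = chunk ^ (ones * c);             /* matching lanes become zero */
    /* classic haszero(): high bit set in each lane that was zero */
    return ((x - ones) & ~x & (ones * 128)) != 0;
}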

-------------
test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
        select repeat(description, 11)
        from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null; -- from Andrew upthread

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Ranier Vilela
Date:

Hi,

I ran this test.

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms

ubuntu 64 bits
gcc 9.4.0

Am I missing something?

regards,
Ranier Vilela

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Ranier Vilela
Date:
On Mon, Aug 15, 2022 at 3:34 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

Hi,

I ran this test.

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms
Sorry, too fast: 270,298 ms was with native memchr.

v5
Time: 208,689 ms

regards,
Ranier Vilela

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:
> The attached implements the above, more or less, using new pg_lfind8()
> and pg_lfind8_le(), which in turn are based on helper functions that
> act on a single vector. The pg_lfind* functions have regression tests,
> but I haven't done the same for json yet. I went the extra step to use
> bit-twiddling for non-SSE builds using uint64 as a "vector", which
> still gives a pretty good boost (test below, min of 3):

Looks pretty reasonable to me.

> +#ifdef USE_SSE2
> +        chunk = _mm_loadu_si128((const __m128i *) &base[i]);
> +#else
> +        memcpy(&chunk, &base[i], sizeof(chunk));
> +#endif                            /* USE_SSE2 */

> +#ifdef USE_SSE2
> +        chunk = _mm_loadu_si128((const __m128i *) &base[i]);
> +#else
> +        memcpy(&chunk, &base[i], sizeof(chunk));
> +#endif                            /* USE_SSE2 */

Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.
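
For example, something along these lines (a sketch; the names are just
suggestions):

#ifdef USE_SSE2
typedef __m128i Vector8;
#else
typedef uint64 Vector8;
#endif

static inline Vector8
vector8_load(const uint8 *base)
{
#ifdef USE_SSE2
    return _mm_loadu_si128((const __m128i *) base);
#else
    Vector8     chunk;

    memcpy(&chunk, base, sizeof(chunk));
    return chunk;
#endif
}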

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Tue, Aug 16, 2022 at 4:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:
> > +#ifdef USE_SSE2
> > +             chunk = _mm_loadu_si128((const __m128i *) &base[i]);
> > +#else
> > +             memcpy(&chunk, &base[i], sizeof(chunk));
> > +#endif                                                       /* USE_SSE2 */
>
> Perhaps there should be a macro or inline function for loading a vector so
> that these USE_SSE2 checks can be abstracted away, too.

This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)

TODO: json-specific tests of the new path

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:
> This is done. Also:
> - a complete overhaul of the pg_lfind8* tests
> - using a typedef for the vector type
> - some refactoring, name changes and other cleanups (a few of these
> could also be applied to the 32-byte element path, but that is left
> for future work)
> 
> TODO: json-specific tests of the new path

This looks pretty good to me.  Should we rename vector_broadcast() and
vector_has_zero() to indicate that they are working with bytes (e.g.,
vector_broadcast_byte())?  We might be able to use vector_broadcast_int()
in the 32-bit functions, and your other vector functions already have a
_byte suffix.

In general, the approach you've taken seems like a decent readability
improvement.  I'd be happy to try my hand at adjusting the 32-bit path and
adding ARM versions of all this stuff.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Fri, Aug 19, 2022 at 01:42:15PM -0700, Nathan Bossart wrote:
> On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:
>> This is done. Also:
>> - a complete overhaul of the pg_lfind8* tests
>> - using a typedef for the vector type
>> - some refactoring, name changes and other cleanups (a few of these
>> could also be applied to the 32-byte element path, but that is left
>> for future work)
>> 
>> TODO: json-specific tests of the new path
> 
> This looks pretty good to me.  Should we rename vector_broadcast() and
> vector_has_zero() to indicate that they are working with bytes (e.g.,
> vector_broadcast_byte())?  We might be able to use vector_broadcast_int()
> in the 32-bit functions, and your other vector functions already have a
> _byte suffix.
> 
> In general, the approach you've taken seems like a decent readability
> improvement.  I'd be happy to try my hand at adjusting the 32-bit path and
> adding ARM versions of all this stuff.

I spent some more time looking at this one, and I had a few ideas that I
thought I'd share.  0001 is your v6 patch with a few additional changes,
including simplying the assertions for readability, splitting out the
Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
pg_lfind32() to use the new tools in simd.h.  0002 adds ARM versions of
everything, which obsoletes the other thread I started [0].  This is still
a little rough around the edges (e.g., this should probably be more than 2
patches), but I think it helps demonstrate a more comprehensive design than
what I've proposed in the pg_lfind32-for-ARM thread [0].

Apologies if I'm stepping on your toes a bit here.

[0] https://postgr.es/m/20220819200829.GA395728%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Sun, Aug 21, 2022 at 12:47 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> I spent some more time looking at this one, and I had a few ideas that I
> thought I'd share.  0001 is your v6 patch with a few additional changes,
> including simplifying the assertions for readability, splitting out the
> Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
> pg_lfind32() to use the new tools in simd.h.  0002 adds ARM versions of
> everything, which obsoletes the other thread I started [0].  This is still
> a little rough around the edges (e.g., this should probably be more than 2
> patches), but I think it helps demonstrate a more comprehensive design than
> what I've proposed in the pg_lfind32-for-ARM thread [0].
>
> Apologies if I'm stepping on your toes a bit here.

Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.

Not a full review, but on a brief look:

- I like the idea of simplifying the assertions, but I can't get
behind using platform lfind to do it, since it has a different API,
requires new functions we don't need, and possibly has portability
issues. A simple for-loop is better for assertions.
- A runtime elog is not appropriate for a compile time check -- use
#error instead.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
> Not at all! However, the 32-bit-element changes are irrelevant for
> json, and make review more difficult. I would suggest keeping those in
> the other thread starting with whatever refactoring is needed. I can
> always rebase over that.

Yeah, I'll remove those to keep this thread focused.

> - I like the idea of simplifying the assertions, but I can't get
> behind using platform lfind to do it, since it has a different API,
> requires new functions we don't need, and possibly has portability
> issues. A simple for-loop is better for assertions.

My main goal with this was improving readability, which is likely possible
without lfind().  I'll see what I can do.

> - A runtime elog is not appropriate for a compile time check -- use
> #error instead.

Will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:
> On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
>> Not at all! However, the 32-bit-element changes are irrelevant for
>> json, and make review more difficult. I would suggest keeping those in
>> the other thread starting with whatever refactoring is needed. I can
>> always rebase over that.
> 
> Yeah, I'll remove those to keep this thread focused.

Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:
> > On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
> >> Not at all! However, the 32-bit-element changes are irrelevant for
> >> json, and make review more difficult. I would suggest keeping those in
> >> the other thread starting with whatever refactoring is needed. I can
> >> always rebase over that.
> >
> > Yeah, I'll remove those to keep this thread focused.
>
> Here's a new version of the patch with the 32-bit changes and calls to
> lfind() removed.

LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:
> On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
>> Here's a new version of the patch with the 32-bit changes and calls to
>> lfind() removed.
> 
> LGTM overall. My plan is to split out the json piece, adding tests for
> that, and commit the infrastructure for it fairly soon. Possible
> bikeshedding: Functions like vector8_eq() might be misunderstood as
> comparing two vectors, but here we are comparing each lane with a
> scalar. I wonder if vector8_eq_scalar() et al might be more clear.

Good point.  I had used vector32_veq() to denote vector comparison, which
would extend to something like vector8_seq().  But that doesn't seem
descriptive enough.  It might be worth considering vector8_contains() or
vector8_has() as well.  I don't really have an opinion, but if I had to
pick something, I guess I'd choose vector8_contains().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Wed, Aug 24, 2022 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:
> > On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
> >> Here's a new version of the patch with the 32-bit changes and calls to
> >> lfind() removed.
> >
> > LGTM overall. My plan is to split out the json piece, adding tests for
> > that, and commit the infrastructure for it fairly soon. Possible
> > bikeshedding: Functions like vector8_eq() might be misunderstood as
> > comparing two vectors, but here we are comparing each lane with a
> > scalar. I wonder if vector8_eq_scalar() et al might be more clear.
>
> Good point.  I had used vector32_veq() to denote vector comparison, which
> would extend to something like vector8_seq().  But that doesn't seem
> descriptive enough.  It might be worth considering vector8_contains() or
> vector8_has() as well.  I don't really have an opinion, but if I had to
> pick something, I guess I'd choose vector8_contains().

It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the *_has_zero that
already exists in the earlier patch.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:
> It seems "scalar" would be a bad choice since it already means
> (confusingly) operating on the least significant element of a vector.
> I'm thinking of *_has and *_has_le, matching the *_has_zero that
> already exists in the earlier patch.

That seems reasonable to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Wed, Aug 24, 2022 at 11:56 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:
> > It seems "scalar" would be a bad choice since it already means
> > (confusingly) operating on the least significant element of a vector.
> > I'm thinking of *_has and *_has_le, matching the *_has_zero that
> > already exists in the earlier patch.
>
> That seems reasonable to me.

Okay, done that way, also in v9:
- a convenience macro in the test suite which is handy now and can be
used for 32-bit element tests if we like
- more tests
- pgindent and some additional comment smithing
- split out the json piece for a later commit
- For the following comment, pgindent will put spaced operands on a
separate line, which is not great for readability. Also, our other
reference to the Stanford bithacks page keeps the in-page link, and I
see no reason to exclude it -- if it goes missing, the whole page will
still load. So I put back those two details.

+        * To find bytes <= c, we can use bitwise operations to find bytes < c+1,
+        * but it only works if c+1 <= 128 and if the highest bit in v is not set.
+        * Adapted from
+        * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
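
(For anyone following along: the trick in that comment boils down to
the sketch below. It is only valid when c + 1 <= 128 and no byte of v
has its high bit set, and it is of course not the committed code.)

#include <stdint.h>

/* true if any byte lane of v is <= c, via "bytes < c+1" */
static inline int
has_byte_le(uint64_t v, unsigned char c)
{
    uint64_t    ones = UINT64_C(0x0101010101010101);

    /* lanes whose byte is < c + 1 borrow during the subtraction */
    return ((v - ones * (c + 1)) & ~v & (ones * 128)) != 0;
}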

I think I'll go ahead and commit 0001 in a couple days pending further comments.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Thu, Aug 25, 2022 at 01:35:45PM +0700, John Naylor wrote:
> - For the following comment, pgindent will put spaced operands on a
> separate line, which is not great for readability. Also, our other
> reference to the Stanford bithacks page keeps the in-page link, and I
> see no reason to exclude it -- if it goes missing, the whole page will
> still load. So I put back those two details.
> 
> +        * To find bytes <= c, we can use bitwise operations to find bytes < c+1,
> +        * but it only works if c+1 <= 128 and if the highest bit in v is not set.
> +        * Adapted from
> +        * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord

This was just unnecessary fiddling on my part, sorry about that.

> +test_lfind8_internal(uint8 key)
> +{
> +    uint8        charbuf[LEN_WITH_TAIL(Vector8)];
> +    const int    len_no_tail = LEN_NO_TAIL(Vector8);
> +    const int    len_with_tail = LEN_WITH_TAIL(Vector8);
> +
> +    memset(charbuf, 0xFF, len_with_tail);
> +    /* search tail to test one-byte-at-a-time path */
> +    charbuf[len_with_tail - 1] = key;
> +    if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_with_tail))
> +        elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
> +    if (key < 0xFF && !pg_lfind8(key, charbuf, len_with_tail))
> +        elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
> +    if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_with_tail))
> +        elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
> +
> +    memset(charbuf, 0xFF, len_with_tail);
> +    /* search with vector operations */
> +    charbuf[len_no_tail - 1] = key;
> +    if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_no_tail))
> +        elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
> +    if (key < 0xFF && !pg_lfind8(key, charbuf, len_no_tail))
> +        elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
> +    if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_no_tail))
> +        elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
> +}

nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?

Otherwise, 0001 looks good to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Fri, Aug 26, 2022 at 10:14 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> > +test_lfind8_internal(uint8 key)
[...]
> > +             elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
> > +}
>
> nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?

Good catch, will fix.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Thu, Aug 25, 2022 at 1:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> I think I'll go ahead and commit 0001 in a couple days pending further comments.

Pushed with Nathan's correction and some cosmetic rearrangements.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
Jelte Fennema
Date:
> I wonder why copyable_characters_length is not reset after flushing.

It's not necessary because of the break statement right after. But this part
of the code was refactored away in John's improved patch that's actually
merged: 
https://github.com/postgres/postgres/commit/3838fa269c15706df2b85ce2d6af8aacd5611655 



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Tue, Aug 23, 2022 at 1:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> LGTM overall. My plan is to split out the json piece, adding tests for
> that, and commit the infrastructure for it fairly soon.

Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Attachment

Re: [PATCH] Optimize json_lex_string by batching character copying

From
Nathan Bossart
Date:
On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:
> Here's the final piece. I debated how many tests to add and decided it
> was probably enough to add one each for checking quotes and
> backslashes in the fast path. There is one cosmetic change in the
> code: Before, the vectorized less-equal check compared to 0x1F, but
> the byte-wise path did so with < 32. I made them both "less-equal 31"
> for consistency. I'll commit this by the end of the week unless anyone
> has a better idea about testing.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Optimize json_lex_string by batching character copying

From
John Naylor
Date:
On Wed, Aug 31, 2022 at 11:17 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:
> > Here's the final piece. I debated how many tests to add and decided it
> > was probably enough to add one each for checking quotes and
> > backslashes in the fast path. There is one cosmetic change in the
> > code: Before, the vectorized less-equal check compared to 0x1F, but
> > the byte-wise path did so with < 32. I made them both "less-equal 31"
> > for consistency. I'll commit this by the end of the week unless anyone
> > has a better idea about testing.
>
> LGTM

Pushed, thanks for looking!

-- 
John Naylor
EDB: http://www.enterprisedb.com