Thread: Optimizing COPY with SIMD

Optimizing COPY with SIMD

From: Neil Conway
Inspired by David Rowley's work [1] on optimizing JSON escape processing with SIMD, I noticed that the COPY code could benefit from SIMD instructions in a few places, e.g.:

(1) CopyAttributeOutCSV() has 2 byte-by-byte loops
(2) CopyAttributeOutText() has 1
(3) CopyReadLineText() has 1
(4) CopyReadAttributesCSV() has 1
(5) CopyReadAttributesText() has 1

Attached is a quick POC patch that uses SIMD instructions for case (1) above. For sufficiently large attribute values, this is a significant performance win; for small fields, performance looks to be about the same. Results below are from an M1 MacBook Pro.

======
neilconway=# select count(*), avg(length(a))::int, avg(length(b))::int, avg(length(c))::int from short_strings;
 count  | avg | avg | avg
--------+-----+-----+-----
 524288 |   8 |   8 |   8
(1 row)

neilconway=# select count(*), avg(length(a))::int, avg(length(b))::int, avg(length(c))::int from long_strings;
 count | avg | avg | avg
-------+-----+-----+-----
 65536 | 657 | 657 | 657
(1 row)

master @ 8fea1bd541:

$ for i in ~/*.sql; do hyperfine --warmup 5 "./psql -f $i"; done
Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long-quotes.sql
  Time (mean ± σ):      2.027 s ±  0.075 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.928 s …  2.207 s    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long.sql
  Time (mean ± σ):      1.420 s ±  0.027 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.379 s …  1.473 s    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-short.sql
  Time (mean ± σ):     546.0 ms ±   9.6 ms    [User: 1.4 ms, System: 0.3 ms]
  Range (min … max):   539.0 ms … 572.1 ms    10 runs

master + SIMD patch:

$ for i in ~/*.sql; do hyperfine --warmup 5 "./psql -f $i"; done
Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long-quotes.sql
  Time (mean ± σ):     797.8 ms ±  19.4 ms    [User: 0.9 ms, System: 0.0 ms]
  Range (min … max):   770.0 ms … 828.5 ms    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long.sql
  Time (mean ± σ):     732.3 ms ±  20.8 ms    [User: 1.2 ms, System: 0.0 ms]
  Range (min … max):   701.1 ms … 763.5 ms    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-short.sql
  Time (mean ± σ):     545.7 ms ±  13.5 ms    [User: 1.3 ms, System: 0.1 ms]
  Range (min … max):   533.6 ms … 580.2 ms    10 runs
======

Implementation-wise, it seems complex to use SIMD when encoding_embeds_ascii is true (which should be uncommon). In principle, we could probably still use SIMD here, but it would require juggling between the SIMD chunk size and the multibyte character lengths returned by pg_encoding_mblen(). For now, the POC patch falls back to the old code path when encoding_embeds_ascii is true.
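For readers following along: the core of the optimization is just a chunk-at-a-time scan for bytes that need quoting or escaping, falling back to a byte loop once a chunk tests positive. A minimal portable sketch of that idea, using 64-bit word-at-a-time (SWAR) scanning rather than the Vector8 helpers the patch actually uses — all names here are hypothetical, not from the patch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* True if any byte of v is zero (classic SWAR zero-byte test). */
static inline bool
any_zero_byte(uint64_t v)
{
    return ((v - ONES) & ~v & HIGHS) != 0;
}

/* True if any byte of v equals c. */
static inline bool
any_byte_eq(uint64_t v, unsigned char c)
{
    return any_zero_byte(v ^ (ONES * c));
}

/*
 * Return the index of the first byte that would force CSV quoting or
 * escaping, or len if there is none. The word loop only narrows down
 * which chunk contains a special byte; the byte loop pinpoints it.
 */
static size_t
first_special(const char *s, size_t len, unsigned char quote, unsigned char delim)
{
    size_t i = 0;

    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t))
    {
        uint64_t w;

        memcpy(&w, s + i, sizeof(w));
        if (any_byte_eq(w, quote) || any_byte_eq(w, delim) ||
            any_byte_eq(w, '\n') || any_byte_eq(w, '\r'))
            break;              /* byte loop below finds the exact index */
    }
    for (; i < len; i++)
    {
        unsigned char c = (unsigned char) s[i];

        if (c == quote || c == delim || c == '\n' || c == '\r')
            return i;
    }
    return len;
}
```

The real patch does the same thing with 16-byte Vector8 chunks, which is why the win only shows up once attributes are long enough to fill at least one chunk.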

Any feedback would be very welcome.

Cheers,
Neil


Attachment

Re: Optimizing COPY with SIMD

From: Joe Conway
On 6/2/24 15:17, Neil Conway wrote:
> Inspired by David Rowley's work [1]


Welcome back!

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Optimizing COPY with SIMD

From: Neil Conway
On Mon, Jun 3, 2024 at 9:22 AM Joe Conway <mail@joeconway.com> wrote:
Welcome back!

Thanks Joe! It's been a minute :)

Neil

Re: Optimizing COPY with SIMD

From: Nathan Bossart
On Sun, Jun 02, 2024 at 03:17:21PM -0400, Neil Conway wrote:
> master @ 8fea1bd541:
> 
> $ for i in ~/*.sql; do hyperfine --warmup 5 "./psql -f $i"; done
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long-quotes.sql
>   Time (mean ± σ):      2.027 s ±  0.075 s    [User: 0.001 s, System: 0.000 s]
>   Range (min … max):    1.928 s …  2.207 s    10 runs
> 
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long.sql
>   Time (mean ± σ):      1.420 s ±  0.027 s    [User: 0.001 s, System: 0.000 s]
>   Range (min … max):    1.379 s …  1.473 s    10 runs
> 
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-short.sql
>   Time (mean ± σ):     546.0 ms ±   9.6 ms    [User: 1.4 ms, System: 0.3 ms]
>   Range (min … max):   539.0 ms … 572.1 ms    10 runs
> 
> master + SIMD patch:
> 
> $ for i in ~/*.sql; do hyperfine --warmup 5 "./psql -f $i"; done
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long-quotes.sql
>   Time (mean ± σ):     797.8 ms ±  19.4 ms    [User: 0.9 ms, System: 0.0 ms]
>   Range (min … max):   770.0 ms … 828.5 ms    10 runs
> 
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-long.sql
>   Time (mean ± σ):     732.3 ms ±  20.8 ms    [User: 1.2 ms, System: 0.0 ms]
>   Range (min … max):   701.1 ms … 763.5 ms    10 runs
> 
> Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-short.sql
>   Time (mean ± σ):     545.7 ms ±  13.5 ms    [User: 1.3 ms, System: 0.1 ms]
>   Range (min … max):   533.6 ms … 580.2 ms    10 runs

These are nice results.

> -/*
> - * Send text representation of one attribute, with conversion and escaping
> - */
>  #define DUMPSOFAR() \

IIUC this comment was meant to describe the CopyAttributeOutText() function
just below this macro.  When the macro was added in commit 0a5fdb0 from
2006, the comment became detached from the function.  Maybe we should just
move it back down below the macro.

> +/*
> + * Send text representation of one attribute, with conversion and CSV-style
> + * escaping. This variant uses SIMD instructions to optimize processing, but
> + * we can only use this approach when encoding_embeds_ascii if false.
> + */

nitpick: Can we add a few words about why using SIMD instructions when
encoding_embeds_ascii is true is difficult?  I don't dispute that it is
complex and/or not worth the effort, but it's not clear to me why that's
the case just from reading the patch.

> +static void
> +CopyAttributeOutCSVFast(CopyToState cstate, const char *ptr,
> +                        bool use_quote)

nitpick: Can we add "vector" or "simd" to the name instead of "fast"?  IMHO
it's better to be more descriptive.

At a glance, the code looks pretty reasonable to me.  I might have some
other nitpicks, such as styling tricks to avoid too many levels of
indentation, but that's not terribly important.

-- 
nathan



Re: Optimizing COPY with SIMD

From: Neil Conway
Thanks for the review and feedback!

On Mon, Jun 3, 2024 at 10:56 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> -/*
> - * Send text representation of one attribute, with conversion and escaping
> - */
>  #define DUMPSOFAR() \

IIUC this comment was meant to describe the CopyAttributeOutText() function
just below this macro.  When the macro was added in commit 0a5fdb0 from
2006, the comment became detached from the function.  Maybe we should just
move it back down below the macro.

Ah, that makes sense -- done.
 
> +/*
> + * Send text representation of one attribute, with conversion and CSV-style
> + * escaping. This variant uses SIMD instructions to optimize processing, but
> + * we can only use this approach when encoding_embeds_ascii if false.
> + */

nitpick: Can we add a few words about why using SIMD instructions when
encoding_embeds_ascii is true is difficult?  I don't dispute that it is
complex and/or not worth the effort, but it's not clear to me why that's
the case just from reading the patch.

Sounds good.
 
> +static void
> +CopyAttributeOutCSVFast(CopyToState cstate, const char *ptr,
> +                                             bool use_quote)

nitpick: Can we add "vector" or "simd" to the name instead of "fast"?  IMHO
it's better to be more descriptive.

Sure, done.

Attached is a revised patch series, that incorporates the feedback above and makes two additional changes:

* Add some regression tests to cover COPY behavior with octal and hex escape sequences
* Optimize the COPY TO text (non-CSV) code path (CopyAttributeOutText()).

In CopyAttributeOutText(), I refactored some code into a helper function to reduce code duplication, on the theory that field delimiters and escape sequences are rare, so we don't mind taking a function call in those cases.

We could go further and use the same code to handle both the tail of the string in the vectorized case and the entire string in the non-vectorized case, but I didn't bother with that -- as written, it would require taking an unnecessary strlen() of the input string in the non-vectorized case.
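To illustrate the refactoring described above — this is a hedged sketch of the shape of the code, with hypothetical names rather than the actual patch: the hot loop copies runs of ordinary bytes wholesale (a DUMPSOFAR-style flush), and an out-of-line helper handles the rare delimiter/escape bytes so the common path stays small:

```c
#include <stddef.h>
#include <string.h>

/* Out-of-line helper: emit the escaped form of one special byte.
 * Rarely called, so the function-call overhead is acceptable. */
static void
emit_escaped(char *out, size_t *outlen, char c)
{
    out[(*outlen)++] = '\\';
    switch (c)
    {
        case '\n': out[(*outlen)++] = 'n'; break;
        case '\t': out[(*outlen)++] = 't'; break;
        default:   out[(*outlen)++] = c;   break;
    }
}

/* Hot loop: scan for special bytes, flushing the plain run so far
 * before each one. Returns the output length. */
static size_t
escape_text(const char *in, size_t len, char *out)
{
    size_t start = 0, outlen = 0;

    for (size_t i = 0; i < len; i++)
    {
        char c = in[i];

        if (c == '\\' || c == '\n' || c == '\t')
        {
            memcpy(out + outlen, in + start, i - start);    /* flush run */
            outlen += i - start;
            emit_escaped(out, &outlen, c);
            start = i + 1;
        }
    }
    memcpy(out + outlen, in + start, len - start);          /* final run */
    outlen += len - start;
    return outlen;
}
```

In the vectorized version, the scalar scan in the loop above is replaced by the chunk-at-a-time search, but the flush-then-helper structure is the same.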

Performance for COPY TO in text (non-CSV) mode:

===
master

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-text-long-strings.sql
  Time (mean ± σ):      1.240 s ±  0.013 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.220 s …  1.256 s    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-text-short.sql
  Time (mean ± σ):     522.3 ms ±  11.3 ms    [User: 1.2 ms, System: 0.0 ms]
  Range (min … max):   512.0 ms … 544.3 ms    10 runs

master + SIMD patches:

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-text-long-strings.sql
  Time (mean ± σ):     867.6 ms ±  12.7 ms    [User: 1.2 ms, System: 0.0 ms]
  Range (min … max):   842.1 ms … 891.6 ms    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-out-bench-text-short.sql
  Time (mean ± σ):     536.7 ms ±  10.9 ms    [User: 1.2 ms, System: 0.0 ms]
  Range (min … max):   530.1 ms … 566.8 ms    10 runs
===

Looks like there is a slight regression for short attribute values, but I think the tradeoff is a net win.

I'm going to take a look at applying similar ideas to COPY FROM next.

Neil

Attachment

Re: Optimizing COPY with SIMD

From: Nathan Bossart
On Wed, Jun 05, 2024 at 01:46:44PM -0400, Neil Conway wrote:
> We could go further and use the same code to handle both the tail of the
> string in the vectorized case and the entire string in the non-vectorized
> case, but I didn't bother with that -- as written, it would require taking
> an unnecessary strlen() of the input string in the non-vectorized case.

For pg_lfind32(), we ended up using an overlapping approach for the
vectorized case (see commit 7644a73).  That appeared to help more than it
harmed in the many (admittedly branch predictor friendly) tests I ran.  I
wonder if you could do something similar here.

> Looks like there is a slight regression for short attribute values, but I
> think the tradeoff is a net win.

It'd be interesting to see the threshold where your patch starts winning.
IIUC the vector stuff won't take effect until there are 16 bytes to
process.  If we don't expect attributes to ordinarily be >= 16 bytes, it
might be worth trying to mitigate this ~3% regression.  Maybe we can find
some other small gains elsewhere to offset it.

-- 
nathan



Re: Optimizing COPY with SIMD

From: Neil Conway
On Wed, Jun 5, 2024 at 3:05 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
For pg_lfind32(), we ended up using an overlapping approach for the
vectorized case (see commit 7644a73).  That appeared to help more than it
harmed in the many (admittedly branch predictor friendly) tests I ran.  I
wonder if you could do something similar here.

I didn't entirely follow what you are suggesting here -- seems like we would need to do strlen() for the non-SIMD case if we tried to use a similar approach.

It'd be interesting to see the threshold where your patch starts winning.
IIUC the vector stuff won't take effect until there are 16 bytes to
process.  If we don't expect attributes to ordinarily be >= 16 bytes, it
might be worth trying to mitigate this ~3% regression.  Maybe we can find
some other small gains elsewhere to offset it.

For the particular short-strings benchmark I have been using (3 columns with 8-character ASCII strings in each), I suspect the regression is caused by the need to do a strlen(), rather than the vectorized loop itself (we skip the vectorized loop anyway because sizeof(Vector8) == 16 on this machine). This explains why we see a regression on short strings for text but not CSV: CSV needed to do a strlen() for the non-quoted-string case regardless.

Unfortunately this makes it tricky to make the optimization conditional on the length of the string. I suppose we could play some games where we start with a byte-by-byte loop and then switch over to the vectorized path (and take a strlen()) if we have seen more than, say, sizeof(Vector8) bytes so far. Seems a bit kludgy though.
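Concretely, the hybrid dispatch described above might look like this — a hedged sketch with hypothetical names, where the second loop stands in for the vectorized path (shown here as a scalar loop for brevity). Short attributes hit the NUL terminator inside the first loop and never pay for strlen():

```c
#include <stddef.h>
#include <string.h>

enum { VEC_THRESHOLD = 16 };    /* stand-in for sizeof(Vector8) */

/* Count bytes that would need escaping in text mode. */
static size_t
count_specials(const char *s, unsigned char delim)
{
    size_t n = 0, i = 0;

    /* Byte loop over at most the first VEC_THRESHOLD bytes; a short
     * attribute terminates here without ever calling strlen(). */
    for (; i < VEC_THRESHOLD; i++)
    {
        if (s[i] == '\0')
            return n;
        if ((unsigned char) s[i] == delim || s[i] == '\\' || s[i] == '\n')
            n++;
    }

    /* Long attribute: pay for strlen() once, then use the length-aware
     * path (the real code would go chunk-at-a-time from here). */
    size_t len = i + strlen(s + i);

    for (; i < len; i++)
        if ((unsigned char) s[i] == delim || s[i] == '\\' || s[i] == '\n')
            n++;
    return n;
}
```

The kludge is visible in the structure: the function has two copies of the per-byte test, one for the warm-up loop and one for the length-aware path.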

I will do some more benchmarking and report back. For the time being, I'm not inclined to push to get the CopyAttributeOutTextVector() into the tree in its current state, as I agree that the short-attribute case is quite important.

In the meantime, attached is a revised patch series. This uses SIMD to optimize CopyReadLineText in COPY FROM. Performance results:

====
master @ 8fea1bd5411b:

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql
  Time (mean ± σ):      1.944 s ±  0.013 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.927 s …  1.975 s    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql
  Time (mean ± σ):      1.021 s ±  0.017 s    [User: 0.002 s, System: 0.001 s]
  Range (min … max):    1.005 s …  1.053 s    10 runs

master + SIMD patches:

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql
  Time (mean ± σ):      1.513 s ±  0.022 s    [User: 0.001 s, System: 0.000 s]
  Range (min … max):    1.493 s …  1.552 s    10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql
  Time (mean ± σ):      1.032 s ±  0.032 s    [User: 0.002 s, System: 0.001 s]
  Range (min … max):    1.009 s …  1.113 s    10 runs
====

Neil
 
Attachment