
Re: Speed up COPY TO text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	KAZAR Ayoub
Subject	Re: Speed up COPY TO text/CSV parsing using SIMD
Date	March 15 01:43:38
Msg-id	CA+K2Rum7+Jm2rm65K5msxaiAM8QTkhSNAYarPBP9O7nBXYo12Q@mail.gmail.com
In response to	Re: Speed up COPY TO text/CSV parsing using SIMD (Nathan Bossart <nathandbossart@gmail.com>)
Responses	Re: Speed up COPY TO text/CSV parsing using SIMD
List	pgsql-hackers


Hello,

On Tue, Mar 10, 2026 at 8:17 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

> On Sat, Feb 14, 2026 at 04:02:21PM +0100, KAZAR Ayoub wrote:
> > On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
> >> I have a hard time believing that adding a strlen() to the handling of a
> >> short column won't be a measurable overhead with lots of short attributes.
> >> Particularly because the patch afaict will call it repeatedly if there are
> >> any to-be-escaped characters.
> >
> > [...]
> >
> > 1000 columns:
> > TEXT: 17% regression
> > CSV: 3.4% regression
> >
> > 500 columns:
> > TEXT: 17.7% regression
> > CSV: 3.1% regression
> >
> > 100 columns:
> > TEXT: 17.3% regression
> > CSV: 3% regression
> >
> > A bit unstable results, but yeah the overhead for worse cases like this is
> > really significant, I can't argue whether this is worth it or not, so
> > thoughts on this ?
>
> I seriously doubt we'd commit something that produces a 17% regression
> here. Perhaps we should skip the SIMD paths whenever transcoding is
> required.
>
> --
> nathan

I've spent some time rethinking this, and here's what I've done in v3:

SIMD is only used for varlena attributes whose text representation is longer than a single SIMD vector, and only when no transcoding is required.

Fixed-size types such as integers mostly produce short ASCII output, for which SIMD provides no benefit.

For eligible attributes, the stored varlena size is used as a cheap pre-filter to avoid an
unnecessary strlen() call on short values.
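
The eligibility rule described above can be sketched roughly as follows. This is a minimal standalone illustration, not the patch's code: `VECTOR_BYTES`, `AttrInfo`, and `simd_eligible()` are hypothetical names, and the 16-byte vector width is an assumption (the patch compares against `sizeof(Vector8)`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed vector width in bytes; the patch uses sizeof(Vector8) instead. */
#define VECTOR_BYTES 16

/* Hypothetical, simplified view of what the check needs to know. */
typedef struct AttrInfo
{
	int		attlen;			/* -1 marks a varlena type, as in pg_attribute */
	size_t	stored_size;	/* stored payload size in bytes, header excluded */
} AttrInfo;

/*
 * Decide whether an attribute is worth sending down the SIMD path.
 * Only cheap, already-known values are consulted; strlen() is never called.
 */
static bool
simd_eligible(const AttrInfo *attr, bool need_transcoding)
{
	assert(attr != NULL);

	if (need_transcoding)
		return false;		/* skip SIMD whenever transcoding is required */
	if (attr->attlen != -1)
		return false;		/* fixed-size types: output is mostly short */

	/* Cheap pre-filter: the stored size must exceed one vector. */
	return attr->stored_size > VECTOR_BYTES;
}
```

The point of ordering the checks this way is that short and fixed-size attributes are rejected using values that are already at hand, so the wide-table case with many short columns pays almost nothing.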

Here are the benchmark results after many runs compared to master (4deecb52aff):

TEXT clean: -34.0%
CSV clean: -39.3%
TEXT 1/3: +4.7%
CSV 1/3: -2.3%

The numbers above vary by 1% to 3%, as improvements or regressions, across 20+ runs.

WIDE tables, short attributes, TEXT:
50 columns: -3.7%
100 columns: -1.7%
200 columns: +1.8%
500 columns: -0.5%
1000 columns: -0.3%

WIDE tables, short attributes, CSV:
50 columns: -2.5%
100 columns: +1.8%
200 columns: +1.4%
500 columns: -0.9%
1000 columns: -1.1%

The wide-table benchmarks all showed similar noise: across 20+ runs, the results always fall between about -2% and +4% for every column count.

One small concern: some varlenas have a binary size larger than their text representation, for example:
SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
pg_column_size
----------------
32

Its text representation is shorter than sizeof(Vector8), so v3 currently enters the SIMD path and exits right at the start (two extra branches), because it does this:

+ if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
+ VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))

I thought we could maybe scale this check by a factor of 2 or 4 relative to the binary size. It really depends on the type, but this is just a suggestion in case this scenario is a concern.
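
One way to read this suggestion is to require the binary size to exceed a small multiple of the vector width before taking the SIMD path. The sketch below only illustrates that reading: `VECTOR_BYTES`, `passes_prefilter()`, and the sample sizes are hypothetical, the 16-byte width is an assumption, and the right factor would likely depend on the type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed vector width in bytes; stands in for sizeof(Vector8). */
#define VECTOR_BYTES 16

/*
 * Pre-filter with a safety factor: a value qualifies for the SIMD path
 * only if its binary size exceeds factor * VECTOR_BYTES. A larger factor
 * leaves headroom for types whose binary form is bigger than their text.
 */
static bool
passes_prefilter(size_t binary_size, size_t factor)
{
	assert(factor >= 1);
	return binary_size > factor * VECTOR_BYTES;
}
```

With these made-up numbers, a 28-byte value passes at factor 1 but is filtered out at factor 2. The trade-off is that plain text values whose size falls between one and two vector widths would then skip SIMD as well.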

Thoughts?

Regards,

Ayoub

Attachment

v3-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch
