Re: Speed up COPY TO text/CSV parsing using SIMD - Mailing list pgsql-hackers

From KAZAR Ayoub
Subject Re: Speed up COPY TO text/CSV parsing using SIMD
Date
Msg-id CA+K2Rum7+Jm2rm65K5msxaiAM8QTkhSNAYarPBP9O7nBXYo12Q@mail.gmail.com
In response to Re: Speed up COPY TO text/CSV parsing using SIMD  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Speed up COPY TO text/CSV parsing using SIMD
List pgsql-hackers
Hello,
On Tue, Mar 10, 2026 at 8:17 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Sat, Feb 14, 2026 at 04:02:21PM +0100, KAZAR Ayoub wrote:
>> On Thu, Feb 12, 2026 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
>>> I have a hard time believing that adding a strlen() to the handling of a
>>> short column won't be a measurable overhead with lots of short attributes.
>>> Particularly because the patch afaict will call it repeatedly if there are
>>> any to-be-escaped characters.
>>
>> [...]
>>
>> 1000 columns:
>> TEXT: 17% regression
>> CSV: 3.4% regression
>>
>> 500 columns:
>> TEXT: 17.7% regression
>> CSV: 3.1% regression
>>
>> 100 columns:
>> TEXT: 17.3% regression
>> CSV: 3% regression
>>
>> A bit unstable results, but yeah the overhead for worse cases like this is
>> really significant, I can't argue whether this is worth it or not, so
>> thoughts on this ?
>
> I seriously doubt we'd commit something that produces a 17% regression
> here.  Perhaps we should skip the SIMD paths whenever transcoding is
> required.
>
> --
> nathan

I've spent some time rethinking this, and here's what I've done in v3:
SIMD is only used for varlena attributes whose text representation is longer than a single SIMD vector, and only when no transcoding is required.
Fixed-size types such as integers mostly produce short ASCII output, for which SIMD provides no benefit.

For eligible attributes, the stored varlena size is used as a cheap pre-filter to avoid an
unnecessary strlen() call on short values.
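
To make the pre-filter idea concrete, here is a rough standalone sketch of the eligibility logic described above. It is not the code from the patch: use_simd_path(), its parameters, and VECTOR_SIZE (a stand-in for sizeof(Vector8), assumed to be 16 here) are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for sizeof(Vector8); the real value is platform-dependent. */
#define VECTOR_SIZE 16

/*
 * Hypothetical helper: decide whether an attribute should take the SIMD path.
 *
 * attlen           -1 marks a varlena attribute (as in pg_attribute.attlen)
 * stored_size      size of the stored varlena payload, in bytes
 * need_transcoding true when the output must be converted to another encoding
 * text             the attribute's NUL-terminated text representation
 */
static bool
use_simd_path(int attlen, size_t stored_size, bool need_transcoding,
              const char *text)
{
    assert(text != NULL);

    /* Skip SIMD entirely when transcoding is required. */
    if (need_transcoding)
        return false;

    /* Fixed-size types mostly produce short output; skip them. */
    if (attlen != -1)
        return false;

    /* Cheap pre-filter: a small stored size avoids the strlen() call. */
    if (stored_size <= VECTOR_SIZE)
        return false;

    /* Only now pay for strlen() on the text representation. */
    return strlen(text) > VECTOR_SIZE;
}
```

The ordering is the point of the sketch: the checks that cost nothing run first, so strlen() is only reached for values that are plausibly long.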

Here are the benchmark results after many runs compared to master (4deecb52aff):
TEXT clean: -34.0%
CSV clean: -39.3%
TEXT 1/3: +4.7%
CSV 1/3: -2.3%
The above numbers vary by 1% to 3% (improvement or regression) across 20+ runs.

Wide tables, short attributes, TEXT:
50 columns: -3.7% 
100 columns: -1.7% 
200 columns: +1.8% 
500 columns: -0.5% 
1000 columns: -0.3%

Wide tables, short attributes, CSV:
50 columns: -2.5%
100 columns: +1.8%
200 columns: +1.4% 
500 columns: -0.9% 
1000 columns: -1.1%

The wide-table benchmarks were all similar noise: across 20+ runs, the results always fall between about -2% and +4% for every number of columns.

Just a small concern about cases where a varlena has a larger binary size than its text representation, e.g.:
SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
 pg_column_size
----------------
             32

Its text representation is shorter than sizeof(Vector8), so currently v3 would enter the SIMD path and exit right at the beginning (two extra branches),
because it does this:
+    if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
+        VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))

I thought maybe we could apply a factor of 2 or 4 to the binary size comparison. The right factor really depends on the type, but this is just a suggestion in case this case is a concern.
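
One possible reading of that suggestion, as a standalone sketch rather than patch code: require the stored size to exceed a multiple of the vector width before entering the SIMD path. passes_prefilter(), the factor parameter, and VECTOR_SIZE (assumed 16, standing in for sizeof(Vector8)) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: stand-in for sizeof(Vector8); the real value is platform-dependent. */
#define VECTOR_SIZE 16

/*
 * Hypothetical pre-filter with a scaling factor.
 *
 * factor = 1 matches the plain "stored size > vector width" comparison;
 * factor = 2 or 4 demands a larger stored size, so types whose binary form
 * is bigger than their text form are less likely to enter the SIMD path
 * only to leave it immediately.
 */
static bool
passes_prefilter(size_t stored_size, unsigned factor)
{
    return stored_size > (size_t) factor * VECTOR_SIZE;
}
```

With a 28-byte payload, for instance, factor 1 lets the value through while factor 2 rejects it. The trade-off is that a larger factor also keeps some genuinely long text values off the SIMD path.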

Thoughts?


Regards,
Ayoub
Attachment
