
Re: Speed up COPY TO text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	KAZAR Ayoub
Subject	Re: Speed up COPY TO text/CSV parsing using SIMD
Date	March 24 03:16:37
Msg-id	CA+K2Runt9Pfst31BBmX9pya-2AwvxgoRwZQP-PcEyg4Hoejbug@mail.gmail.com
In response to	Re: Speed up COPY TO text/CSV parsing using SIMD (KAZAR Ayoub <ma_kazar@esi.dz>)
List	pgsql-hackers


On Wed, Mar 18, 2026 at 3:29 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:

On Wed, Mar 18, 2026 at 12:02 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
On Tue, Mar 17, 2026 at 7:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Sat, Mar 14, 2026 at 11:43:38PM +0100, KAZAR Ayoub wrote:
> Just a small concern about where some varlenas have a larger binary size
> than its text representation ex:
> SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
> pg_column_size
> ----------------
> 32
>
> its text representation is less than sizeof(Vector8) so currently v3 would
> enter SIMD path and exit out just from the beginning (two extra branches)
> because it does this:
> + if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
> + VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))
>
> I thought maybe we could do * 2 or * 4 its binary size, depends on the type
> really but this is just a proposition if this case is something concerning.

Can we measure the impact of this? How likely is this case?
I'll respond to this separately in a different email.
My example was incorrect (the text representation of a tsvector is lexemes and positions, not the original text as we read it; it's lossy), but the point still holds.
If we have a json(b) column containing something like {"key1":"val1","key2":"val2"}, then in CSV format it immediately exits the SIMD path because of the quote character; for json(b) this will always be the case.
I measured the overhead of exiting the SIMD path many times (8 million times for one COPY TO command) and found only a 3% regression for this case, sometimes 2%.
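
To illustrate the early exit described above, here is a minimal sketch in portable C. It is not code from the patch: CHUNK_SIZE stands in for sizeof(Vector8) (16 bytes is an assumption), the names csv_is_special and csv_clean_prefix are hypothetical, and the inner loop only emulates what a single vector comparison would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for sizeof(Vector8); 16 bytes is an assumption. */
#define CHUNK_SIZE 16

/* True if c would need special handling in CSV output (illustrative set). */
static bool
csv_is_special(char c, char delim, char quote)
{
	return c == delim || c == quote || c == '\n' || c == '\r';
}

/*
 * Walk s in CHUNK_SIZE-byte steps, the way a vectorized loop would, and
 * return how many leading bytes were covered by chunks free of special
 * characters.  The caller is expected to continue byte by byte from there.
 */
static size_t
csv_clean_prefix(const char *s, size_t len, char delim, char quote)
{
	size_t		i = 0;

	assert(s != NULL);

	while (i + CHUNK_SIZE <= len)
	{
		bool		found = false;

		/* Scalar emulation of one vector compare over the chunk. */
		for (size_t j = 0; j < CHUNK_SIZE; j++)
			found |= csv_is_special(s[i + j], delim, quote);

		if (found)
			break;				/* leave the chunked path immediately */
		i += CHUNK_SIZE;
	}
	return i;
}
```

With the JSON value above and '"' as the quote character, the first chunk already contains a quote, so the function returns 0 and all of the work falls to the scalar loop. That is the repeated "enter, then exit right away" pattern whose cost was measured.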

For cases where we falsely commit to the SIMD path because the binary size we read is >= sizeof(Vector8), which I also found to be very niche, the cost of short-circuiting to the scalar path each time is even more negligible (the CSV JSON case above is the absolute worst case).
So I don't think any of this should be a concern.
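
For readers following along, a rough sketch of what "false commitment" means here. The gate mirrors the shape of the v3 condition quoted earlier (varlena attribute and binary payload larger than one vector), but the helper names, the 16-byte CHUNK_SIZE, and the lengths used below are all hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for sizeof(Vector8); 16 bytes is an assumption. */
#define CHUNK_SIZE 16

/*
 * Gate modeled on the quoted condition: only varlena attributes
 * (attlen == -1) whose binary payload exceeds one chunk take the
 * chunked path.
 */
static bool
gate_allows_chunked_path(int attlen, size_t binary_len)
{
	return attlen == -1 && binary_len > CHUNK_SIZE;
}

/*
 * A "false commitment": the gate passes based on the binary size, but the
 * text actually emitted is too short to fill even one chunk, so the
 * chunked loop does no useful work before the scalar loop takes over.
 */
static bool
is_false_commitment(int attlen, size_t binary_len, size_t text_len)
{
	return gate_allows_chunked_path(attlen, binary_len) &&
		text_len < CHUNK_SIZE;
}
```

For example, is_false_commitment(-1, 32, 12) is true under these assumptions: a 32-byte binary value passes the gate, yet its 12-byte text form never fills a chunk.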

Regards,
Ayoub

Rebased patch.

Regards,

Ayoub

Attachment

v3-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch
