Re: Speed up COPY TO text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	KAZAR Ayoub
Subject	Re: Speed up COPY TO text/CSV parsing using SIMD
Date	March 27 21:48:38
Msg-id	CA+K2Runq+1gy8p6a-DsxpT2OkEkEu3cUGsZ9tdiGNrg_=P39gg@mail.gmail.com
In response to	Re: Speed up COPY TO text/CSV parsing using SIMD (Nathan Bossart <nathandbossart@gmail.com>)
Responses	Re: Speed up COPY TO text/CSV parsing using SIMD
List	pgsql-hackers

Hello,

On Thu, Mar 26, 2026 at 10:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

> On Wed, Mar 18, 2026 at 03:29:32AM +0100, KAZAR Ayoub wrote:
> > If we have some json(b) column like : {"key1":"val1","key2":"val2"}, for
> > CSV format this would immediately exit the SIMD path because of quote
> > character, for json(b) this is going to be always the case.
> > I measured the overhead of exiting the SIMD path a lot (8 million times for
> > one COPY TO command), i only found 3% regression for this case, sometimes
> > 2%.
>
> I'm a little worried that we might be dismissing small-yet-measurable
> regressions for extremely common workloads. Unlike the COPY FROM work,
> this operates on a per-attribute level, meaning we only use SIMD when an
> attribute is at least 16 bytes. The extra branching for each attribute
> might not be something we can just ignore.

Thanks for the review.

I added a prescan loop inside the SIMD helpers that tries to catch special characters within the first sizeof(Vector8) characters. I measured how well this reduces the overhead of starting SIMD and exiting at the first vector:

The scalar loop beats SIMD for one vector if it finds a special character before the 6th character. In the worst case, with a vector that is not clean, the scalar loop needs 20 more cycles than SIMD.

This helps mitigate the JSON(B) case in CSV format, which is why I added it for the CSV case only.

In a benchmark with 10M early SIMD exits, like the JSONB case, the previous 3% regression is gone.
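To make the idea concrete, here is a rough standalone sketch of such a prescan. This is not the code from the patch: CHUNK_WIDTH stands in for sizeof(Vector8) and is assumed to be 16, the names csv_is_special and csv_prescan are made up for illustration, and the set of characters treated as special (delimiter, quote, newline, carriage return) is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sizeof(Vector8); 16 bytes is an assumption for this sketch. */
#define CHUNK_WIDTH 16

/*
 * Hypothetical helper: would this byte force special handling in CSV
 * output?  The exact set of characters is an assumption.
 */
static bool
csv_is_special(char c, char delim, char quote)
{
	return c == delim || c == quote || c == '\n' || c == '\r';
}

/*
 * Scalar prescan over at most the first CHUNK_WIDTH bytes of an attribute.
 *
 * Returns the offset of the first special character, or the number of
 * bytes scanned (min(len, CHUNK_WIDTH)) if none was found.
 */
static size_t
csv_prescan(const char *s, size_t len, char delim, char quote)
{
	size_t		limit = len < CHUNK_WIDTH ? len : CHUNK_WIDTH;
	size_t		i;

	assert(s != NULL);

	for (i = 0; i < limit; i++)
	{
		if (csv_is_special(s[i], delim, quote))
			break;
	}

	return i;
}
```

For a JSON-like value such as {"key1":"val1"} this returns 1, so a caller could stay on the scalar path without ever setting up a vector. A return value equal to CHUNK_WIDTH means the first chunk is clean, and a SIMD loop could start from that offset.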

For the normal benchmarks (clean, 1/3 specials, wide table), I ran v4 for longer this time and found this:
Test                   Master    V4
TEXT clean             1619ms    -28.0%
CSV clean              1866ms    -37.1%
TEXT 1/3 backslashes   3913ms    +1.2%
CSV 1/3 quotes         4012ms    -3.0%

Wide table TEXT:

Cols   Master    V4
50     2109ms    -2.9%
100    2029ms    -1.6%
200    3982ms    -2.9%
500    1962ms    -6.1%
1000   3812ms    -3.6%

Wide table CSV:

Cols   Master    V4
50     2531ms    +0.3%
100    2465ms    +1.1%
200    4965ms    -0.2%
500    2346ms    +1.4%
1000   4709ms    -0.4%

Do we need more benchmarks for other kinds of workloads? Am I perhaps missing something else that has noticeable overhead?

Regards,

Ayoub

Attachment

v4-0001-Speed-up-COPY-TO-FORMAT-text-csv-using-SIMD.patch