
Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	Manni Wood
Subject	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date	December 6, 2025 04:39:56
Msg-id	CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com
In response to	Re: Speed up COPY FROM text/CSV parsing using SIMD (Manni Wood <manni.wood@enterprisedb.com>)
Responses	Re: Speed up COPY FROM text/CSV parsing using SIMD
List	pgsql-hackers


On Wed, Nov 26, 2025 at 8:21 AM Manni Wood <manni.wood@enterprisedb.com> wrote:

On Wed, Nov 26, 2025 at 5:51 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello,
On Wed, Nov 19, 2025 at 10:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
> Thanks, done.

I took a look at the v3 patches. Here are my high-level thoughts:

+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->simd_continue)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);

I'm curious whether this actually generates different code, and if it does,
if it's actually faster. We're already branching on cstate->simd_continue
here.
I've compiled both versions with -O2 and confirmed that they generate different code. When simd_continue is passed as a constant to CopyReadLineText, the compiler optimizes the condition checks out of the SIMD path.
A small benchmark on a 1GB+ file shows the expected benefit: around a 6% performance improvement.
I've attached the assembly outputs in case anyone wants to check something else.
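
For readers who want to see the pattern in isolation, here is a minimal sketch of calling an inlinable function with a constant boolean from each arm of a branch. It is not the patch's code: count_newlines() and count_newlines_dispatch() are hypothetical stand-ins for CopyReadLineText() and its caller, and the 8-byte loop is only a placeholder for a real SIMD loop. Whether the compiler actually emits two specialized copies depends on its inlining decisions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for CopyReadLineText(): counts newlines in buf.
 * When use_fast_path is a compile-time constant at the call site and the
 * function is inlined, the compiler can drop the branch on it entirely.
 */
static inline size_t
count_newlines(const char *buf, size_t len, bool use_fast_path)
{
    size_t i = 0;
    size_t n = 0;

    assert(buf != NULL || len == 0);

    if (use_fast_path)
    {
        /* Placeholder for a SIMD loop: handle 8 bytes per iteration. */
        for (; i + 8 <= len; i += 8)
            for (size_t j = 0; j < 8; j++)
                n += (buf[i + j] == '\n');
    }

    /* Scalar loop: the whole input, or the tail left by the fast path. */
    for (; i < len; i++)
        n += (buf[i] == '\n');

    return n;
}

/*
 * Branch once on the runtime flag, then pass a literal true/false so each
 * inlined copy of count_newlines() can be specialized.
 */
size_t
count_newlines_dispatch(const char *buf, size_t len, bool fast)
{
    if (fast)
        return count_newlines(buf, len, true);
    else
        return count_newlines(buf, len, false);
}
```

Both arms return the same result; only the generated code is expected to differ, which is what comparing the assembly output would confirm.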

Regards,
Ayoub Kazar

Correction to my last post:

I also tried files that alternated lines with no special characters and lines with 1/3rd special characters, thinking I could force the algorithm to continually check whether or not it should use SIMD and therefore force more overhead in the try-SIMD/don't-try-SIMD housekeeping code. The text file was still 20% faster (not 50% faster as I originally stated; that was a typo). The CSV file was still 13% faster.

Also, apologies for posting at the top in my last e-mail.
--
-- Manni Wood EDB: https://www.enterprisedb.com

Hello, all.

Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.

I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayoub Kazar.

The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.

This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
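
To illustrate the general idea of sampling the start of the input, here is a rough sketch. It is not taken from the v4 patches: the function name simd_worthwhile(), the SAMPLE_BYTES size, and the MAX_SPECIAL_PER_1000 threshold are all assumptions made up for this example, and the patches may count different characters or use a different rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SAMPLE_BYTES          65536  /* hypothetical: bytes to inspect */
#define MAX_SPECIAL_PER_1000  50     /* hypothetical: allowed density */

/*
 * Inspect at most the first SAMPLE_BYTES of buf and report whether the
 * density of special characters is low enough that a SIMD scan is likely
 * to pay off. For text format a caller could pass the escape character
 * for both quote and escape.
 */
bool
simd_worthwhile(const char *buf, size_t len, char quote, char escape)
{
    size_t sample = len < SAMPLE_BYTES ? len : SAMPLE_BYTES;
    size_t special = 0;

    assert(buf != NULL || len == 0);

    if (sample == 0)
        return false;           /* nothing to judge from */

    for (size_t i = 0; i < sample; i++)
    {
        if (buf[i] == quote || buf[i] == escape)
            special++;
    }

    /* Integer comparison of special/sample against the threshold. */
    return special * 1000 <= sample * MAX_SPECIAL_PER_1000;
}
```

A one-time check like this decides once per file; re-checking periodically, as in the later suggestion, would mean calling something similar again on later chunks.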

-- Manni Wood EDB: https://www.enterprisedb.com

