Home > mailing lists

Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	Nathan Bossart
Subject	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date	February 14 02:09:25
Msg-id	aY-vJe_ENCB-fux9@nathan Whole thread
In response to	Re: Speed up COPY FROM text/CSV parsing using SIMD (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses	Re: Speed up COPY FROM text/CSV parsing using SIMD Re: Speed up COPY FROM text/CSV parsing using SIMD Re: Speed up COPY FROM text/CSV parsing using SIMD
List	pgsql-hackers

Tree view

On Fri, Feb 13, 2026 at 02:45:30PM +0300, Nazir Bilal Yavuz wrote:
> Also, if I change this code to:
> 
>     if (cstate->simd_enabled)
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, true);
>         else
>             result = CopyReadLineText(cstate, false, true);
>     }
>     else
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, false);
>         else
>             result = CopyReadLineText(cstate, false, false);
>     }
> 
> then I see ~%5 performance improvement in scalar path compared to master.

Hm.  What difference do you see if you just do

    if (is_csv)
        result = CopyReadLineText(cstate, true);
    else
        result = CopyReadLineText(cstate, false);

both with and without the SIMD stuff?  IIUC this is allowing the compiler
to remove several branches in CopyReadLineText(), which might be a nice
improvement on its own.  That being said, I'm less convinced that adding a
simd_enabled parameter to CopyReadLineText() helps, because 1) it's
involved in fewer branches and 2) we change it within the function, so the
compiler can't remove the branches, anyway.  But perhaps I'm missing
something.

Some other random thoughts:

+                    match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

+                match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

Since \n and \r are well below "normal" ASCII values, I wonder if we could
simplify these to something like

    match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);

+            /* Check if we found any special characters */
+            mask = vector8_highbit_mask(match);
+            if (mask != 0)

vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
waiting until we enter the "if" block to calculate it has any benefit.

+                simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);

If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
place, right?

+                simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
+
+                /*
+                 * Do not disable SIMD when we hit EOL or EOF characters. In
+                 * practice, it does not matter for EOF because parsing ends
+                 * there, but we keep the behavior consistent.
+                 */
+                if (!(simd_hit_eof || simd_hit_eol))

I'd think that doing less unnecessary work would outweigh the benefits of
consistency for the EOF case.

-- 
nathan

pgsql-hackers by date:

From: Daniel Gustafsson
Date: 14 February, 01:32:18
Subject: Re: ecdh support causes unnecessary roundtrips

From: Chao Li
Date: 14 February, 02:37:07
Subject: Re: Make wal_receiver_timeout configurable per subscription

Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

Previous

Next