Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id aY-vJe_ENCB-fux9@nathan
Whole thread Raw
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Speed up COPY FROM text/CSV parsing using SIMD
Re: Speed up COPY FROM text/CSV parsing using SIMD
Re: Speed up COPY FROM text/CSV parsing using SIMD
List pgsql-hackers
On Fri, Feb 13, 2026 at 02:45:30PM +0300, Nazir Bilal Yavuz wrote:
> Also, if I change this code to:
> 
>     if (cstate->simd_enabled)
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, true);
>         else
>             result = CopyReadLineText(cstate, false, true);
>     }
>     else
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, false);
>         else
>             result = CopyReadLineText(cstate, false, false);
>     }
> 
> then I see ~%5 performance improvement in scalar path compared to master.

Hm.  What difference do you see if you just do

    if (is_csv)
        result = CopyReadLineText(cstate, true);
    else
        result = CopyReadLineText(cstate, false);

both with and without the SIMD stuff?  IIUC this is allowing the compiler
to remove several branches in CopyReadLineText(), which might be a nice
improvement on its own.  That being said, I'm less convinced that adding a
simd_enabled parameter to CopyReadLineText() helps, because 1) it's
involved in fewer branches and 2) we change it within the function, so the
compiler can't remove the branches, anyway.  But perhaps I'm missing
something.

Some other random thoughts:

+                    match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

+                match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

Since \n and \r are well below "normal" ASCII values, I wonder if we could
simplify these to something like

    match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);

+            /* Check if we found any special characters */
+            mask = vector8_highbit_mask(match);
+            if (mask != 0)

vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
waiting until we enter the "if" block to calculate it has any benefit.

+                simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);

If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
place, right?

+                simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
+
+                /*
+                 * Do not disable SIMD when we hit EOL or EOF characters. In
+                 * practice, it does not matter for EOF because parsing ends
+                 * there, but we keep the behavior consistent.
+                 */
+                if (!(simd_hit_eof || simd_hit_eol))

I'd think that doing less unnecessary work would outweigh the benefits of
consistency for the EOF case.

-- 
nathan



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: ecdh support causes unnecessary roundtrips
Next
From: Chao Li
Date:
Subject: Re: Make wal_receiver_timeout configurable per subscription