Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Manni Wood
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id CAKWEB6p-Y54yWA5kq6OXEYV=ABdHenJ559i0MshOoYkP4i=o5A@mail.gmail.com
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Speed up COPY FROM text/CSV parsing using SIMD
List pgsql-hackers
Hello!

I ran some COPY FROM tests using master and then Nazir's v7-0001 and v7-0002 patches applied to master.

x86 master
TXT :                 29222.524250 ms
CSV :                 36162.588500 ms
TXT with 1/3 escapes: 32922.649750 ms
CSV with 1/3 quotes:  47631.423750 ms

x86 v7-0001
TXT :                 23247.834250 ms  20.445496% improvement
CSV :                 23162.711750 ms  35.948413% improvement
TXT with 1/3 escapes: 31786.386000 ms  3.451313% improvement
CSV with 1/3 quotes:  43330.475500 ms  9.029645% improvement

x86 v7-0002
TXT :                 22394.812500 ms  23.364552% improvement
CSV :                 22374.645750 ms  38.127643% improvement
TXT with 1/3 escapes: 32378.929750 ms  1.651507% improvement
CSV with 1/3 quotes:  47139.171750 ms  1.033461% improvement

arm master
TXT :                 9448.900500 ms
CSV :                 11135.871500 ms
TXT with 1/3 escapes: 10786.418750 ms
CSV with 1/3 quotes:  14115.335500 ms

arm v7-0001
TXT :                 7271.170500 ms  23.047443% improvement
CSV :                 7259.866750 ms  34.806479% improvement
TXT with 1/3 escapes: 10894.445500 ms  1.001507% regression
CSV with 1/3 quotes:  13398.444000 ms  5.078813% improvement

arm v7-0002
TXT :                 7165.707250 ms  24.163587% improvement
CSV :                 7140.497250 ms  35.878416% improvement
TXT with 1/3 escapes: 10308.782250 ms  4.428129% improvement
CSV with 1/3 quotes:  12576.179500 ms  10.904140% improvement

v7-0001 + v7-0002 applied to master certainly seems promising: nice to see speed improvements across the board on both x86 and arm!
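
For reference, the percentages in the tables follow from (master - patched) / master. A minimal sketch of that arithmetic; improvement_pct() is a name made up here for illustration and is not part of any benchmark script:

```c
#include <assert.h>

/*
 * Percentage change of patched_ms relative to baseline_ms.
 * Positive means the patched build was faster; negative means it was slower.
 * (Hypothetical helper, for illustration only.)
 */
static double
improvement_pct(double baseline_ms, double patched_ms)
{
    assert(baseline_ms > 0.0);
    return (baseline_ms - patched_ms) / baseline_ms * 100.0;
}
```

Feeding it the x86 TXT timings (29222.524250 ms on master, 23247.834250 ms with v7-0001) gives roughly 20.4455; the arm "TXT with 1/3 escapes" pair for v7-0001 comes out negative, which is the one regression in the tables.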

On Fri, Feb 13, 2026 at 5:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Feb 13, 2026 at 02:45:30PM +0300, Nazir Bilal Yavuz wrote:
> Also, if I change this code to:
>
>     if (cstate->simd_enabled)
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, true);
>         else
>             result = CopyReadLineText(cstate, false, true);
>     }
>     else
>     {
>         if (is_csv)
>             result = CopyReadLineText(cstate, true, false);
>         else
>             result = CopyReadLineText(cstate, false, false);
>     }
>
> then I see a ~5% performance improvement in the scalar path compared to master.

Hm.  What difference do you see if you just do

        if (is_csv)
                result = CopyReadLineText(cstate, true);
        else
                result = CopyReadLineText(cstate, false);

both with and without the SIMD stuff?  IIUC this is allowing the compiler
to remove several branches in CopyReadLineText(), which might be a nice
improvement on its own.  That being said, I'm less convinced that adding a
simd_enabled parameter to CopyReadLineText() helps, because 1) it's
involved in fewer branches and 2) we change it within the function, so the
compiler can't remove the branches, anyway.  But perhaps I'm missing
something.
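
To make the branch-removal point concrete, here is a minimal sketch, not code from the patch. count_special() is a hypothetical stand-in for CopyReadLineText(), and the always_inline attribute is the GCC/Clang spelling, assumed here so that each call site is specialized. Because every call passes a literal true or false, the compiler can drop the untaken side of the is_csv branch inside each inlined copy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the line reader: counts bytes that would need
 * special handling.  is_csv is tested on every iteration, but once the
 * function is inlined with a constant argument that test can be folded away.
 */
static inline __attribute__((always_inline)) size_t
count_special(const char *buf, size_t len, bool is_csv)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
    {
        char c = buf[i];

        if (is_csv)
            n += (c == '"' || c == '\n' || c == '\r');
        else
            n += (c == '\\' || c == '\n' || c == '\r');
    }
    return n;
}

/* Dispatch once at the call site; each branch passes a compile-time constant. */
static size_t
count_special_dispatch(const char *buf, size_t len, bool is_csv)
{
    assert(buf != NULL);
    if (is_csv)
        return count_special(buf, len, true);
    else
        return count_special(buf, len, false);
}
```

A flag that the function itself modifies at run time does not get this benefit, since it is no longer a constant after the first assignment.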

Some other random thoughts:

+                    match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

+                match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));

Since \n and \r are well below "normal" ASCII values, I wonder if we could
simplify these to something like

        match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);
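
A scalar sketch of the trade-off, with a plain 16-byte array standing in for a vector register (the function names are made up for illustration and this is not the simd.h API). The range test needs only one compare per lane, but it also fires for any other byte at or below '\r', such as '\t', so whatever runs after a match has to tolerate or recheck those hits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16                /* assumed vector width in bytes */

/* Exact test: two equality compares per lane, OR-ed together. */
static bool
chunk_has_eol_exact(const uint8_t *chunk)
{
    bool found = false;

    assert(chunk != NULL);
    for (size_t i = 0; i < CHUNK; i++)
        found |= (chunk[i] == '\n') | (chunk[i] == '\r');
    return found;
}

/*
 * Range test: one unsigned compare per lane against '\r' + 1.
 * Matches '\n' and '\r', but also every other byte <= '\r' (e.g. '\t', NUL).
 */
static bool
chunk_has_low_byte(const uint8_t *chunk)
{
    bool found = false;

    assert(chunk != NULL);
    for (size_t i = 0; i < CHUNK; i++)
        found |= (chunk[i] < '\r' + 1);
    return found;
}
```

On a chunk of ordinary letters both tests return false; a chunk containing only a tab trips the range test but not the exact one.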

+            /* Check if we found any special characters */
+            mask = vector8_highbit_mask(match);
+            if (mask != 0)

vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
waiting until we enter the "if" block to calculate it has any benefit.
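
Roughly what I have in mind, as a portable sketch rather than real SIMD code: Vec16, vec_any_set() and vec_highbit_mask() are hypothetical stand-ins, and the sketch assumes every lane of a comparison result is either 0x00 or 0xFF. The cheap "is anything set" test guards the block, and the per-lane mask is only built once we know it is needed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t lane[16]; } Vec16;    /* stand-in for a 16-byte vector */

/* Cheap test: OR the two 64-bit halves and compare against zero. */
static bool
vec_any_set(const Vec16 *v)
{
    uint64_t lo, hi;

    memcpy(&lo, v->lane, 8);
    memcpy(&hi, v->lane + 8, 8);
    return (lo | hi) != 0;
}

/* Costlier step: gather the high bit of every lane into a 16-bit mask. */
static uint32_t
vec_highbit_mask(const Vec16 *v)
{
    uint32_t mask = 0;

    for (int i = 0; i < 16; i++)
        mask |= (uint32_t) (v->lane[i] >> 7) << i;
    return mask;
}

/* Returns the index of the first matching lane, or -1 if there is none. */
static int
first_match(const Vec16 *match)
{
    uint32_t mask;
    int      pos = 0;

    if (!vec_any_set(match))
        return -1;              /* common case: the mask is never computed */

    mask = vec_highbit_mask(match);
    assert(mask != 0);          /* holds because lanes are 0x00 or 0xFF */
    while ((mask & 1u) == 0)
    {
        mask >>= 1;
        pos++;
    }
    return pos;
}
```

Whether this helps in practice depends on how often chunks contain a match, so it would need measuring.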

+                simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);

If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
place, right?

+                simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
+
+                /*
+                 * Do not disable SIMD when we hit EOL or EOF characters. In
+                 * practice, it does not matter for EOF because parsing ends
+                 * there, but we keep the behavior consistent.
+                 */
+                if (!(simd_hit_eof || simd_hit_eol))

I'd think that doing less unnecessary work would outweigh the benefits of
consistency for the EOF case.

--
nathan


--
Manni Wood
EDB: https://www.enterprisedb.com
