Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Bilal Yavuz
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Speed up COPY FROM text/CSV parsing using SIMD
List pgsql-hackers
Hi,

On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> > Hello, all.
> >
> > Andrew, I tried your suggestion of just reading the first chunk of the
> > copy file to determine if SIMD is worth using. Attached are v4 versions
> > of the patches showing a first attempt at doing that.
>
> Thank you for doing this!
>
> > I attached test.sh.txt to show how I've been testing, with 5 million
> > lines of the various copy file variations introduced by Ayub Kazar.
> >
> > The text copy with no special chars is 30% faster. The CSV copy with no
> > special chars is 48% faster. The text with 1/3rd escapes is 3% slower.
> > The CSV with 1/3rd quotes is 0.27% slower.
> >
> > This set of patches follows the simplest suggestion of just testing the
> > first N lines (actually first N bytes) of the file and then deciding
> > whether or not to enable SIMD. This set of patches does not follow
> > Andrew's later suggestion of maybe checking again every million lines
> > or so.
>
> My input-generation script is not ready to share yet, but the inputs
> follow this format: text_${n}.input, where n represents the number of
> normal characters before the delimiter. For example:
>
> n = 0 -> "\n\n\n\n\n..." (no normal characters)
> n = 1 -> "a\n..." (1 normal character before the delimiter)
> ...
> n = 5 -> "aaaaa\n..."
> … continuing up to n = 32.
>
> Each line has 4096 chars and there are a total of 100000 lines in each
> input file.
>
> I only benchmarked the text format. I compared the latest heuristic I
> shared [1] with the current method. The benchmarks show a regression of
> roughly 16% in the worst case (n = 2), with regressions up to n = 5.
> For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and passed the
use_simd variable as an argument so that the compiler can take
advantage of inlining.

2 - The main for loop in the CopyReadLineText() function runs many
times, so I moved the use_simd check out of it and into the
CopyReadLine() function.

3 - Instead of 'bytes_processed', I used 'chars_processed' because
cstate->bytes_processed is increased before the bytes are actually
processed, and this can cause wrong results.

4 - Because of #2 and #3, instead of using
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.

5 - cstate->special_chars_encountered is incremented incorrectly in the
CSV case: it is not incremented for the quote and escape characters. I
moved all increments of cstate->special_chars_encountered to one
central place and tried to optimize it, but it still causes a
regression because it adds one more branch.

With these changes, I was able to reduce the regression from 16% to
10%. The regression drops to 7% if I apply #5 to text input only, but I
did not do that.

My changes are in 0003.
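
For readers without the patch at hand, the general shape of #1 and #2
can be sketched as below. This is an illustration under stated
assumptions, not the patch itself: find_newline() and read_line() are
hypothetical stand-ins for CopyReadLineText() and CopyReadLine(),
ALWAYS_INLINE is a local macro defined here, and the "SIMD" path is
plain C that checks 8 bytes per step so the example stays portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local helper macro for this sketch only. */
#if defined(__GNUC__) || defined(__clang__)
#define ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define ALWAYS_INLINE static inline
#endif

/*
 * Return the offset of the first '\n' in buf, or len if there is none.
 *
 * use_simd is an ordinary argument, but every caller passes a literal, so
 * after inlining the compiler can drop the unused path from each copy.
 */
ALWAYS_INLINE size_t
find_newline(const char *buf, size_t len, bool use_simd)
{
	size_t		i = 0;

	if (use_simd)
	{
		/* Stand-in for a vector scan: skip whole 8-byte chunks. */
		while (i + 8 <= len)
		{
			bool		found = false;

			for (size_t j = 0; j < 8; j++)
				found |= (buf[i + j] == '\n');
			if (found)
				break;
			i += 8;
		}
	}

	/* Scalar path, also used to finish the last partial chunk. */
	while (i < len && buf[i] != '\n')
		i++;

	return i;
}

/*
 * The use_simd check happens once per line here, in the caller, instead of
 * inside the hot loop.
 */
static size_t
read_line(const char *buf, size_t len, bool use_simd)
{
	if (use_simd)
		return find_newline(buf, len, true);
	return find_newline(buf, len, false);
}
```

Both calls return the same offset for the same input; only the code
path differs. Whether the compiler actually specializes the two copies
depends on the compiler and optimization level.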

--
Regards,
Nazir Bilal Yavuz
Microsoft
