Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Nazir Bilal Yavuz
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Msg-id CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> I am able to reproduce the regression you mentioned but both
> >> regressions are %20 on my end. I found that (by experimenting) SIMD
> >> causes a regression if it advances less than 5 characters.
> >>
> >> So, I implemented a small heuristic. It works like that:
> >>
> >> - If advance < 5 -> insert a sleep penalty (n cycles).
> > 'sleep' might be a poor word choice here. I meant skipping SIMD for n
> > number of times.
> >
>
> I was thinking a bit about that this morning. I wonder if it might be
> better instead of having a constantly applied heuristic like this, it
> might be better to do a little extra accounting in the first, say, 1000
> lines of an input file, and if less than some portion of the input is
> found to be special characters then switch to the SIMD code. What that
> portion should be would need to be determined by some experimentation
> with a variety of typical workloads, but given your findings 20% seems
> like a good starting point.

I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea, and it works as follows:

The overall logic is that we will not run SIMD for the entire line, and
we decide whether it is worth running SIMD for the next lines.

1 - We try SIMD and decide whether it is worth running.
1.1 - If it is worth it, we continue to run SIMD and halve the
simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we double simd_last_sleep_cycle and do
not run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we go back to #1.
Note: simd_last_sleep_cycle cannot exceed 1024, so in the worst case we
still try SIMD once every 1024 lines.

With this heuristic the regression is limited to 2% in the worst case.

Patches are attached. The first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
