Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Mark Wong
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id aTx-LDyiHV-7wfOP@ltdrgnflg2
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Speed up COPY FROM text/CSV parsing using SIMD
List pgsql-hackers
Hi everyone,

On Tue, Dec 09, 2025 at 04:40:19PM +0300, Bilal Yavuz wrote:
> Hi,
> 
> On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> > > Hello, all.
> > >
> > > Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth
> > > using. Attached are v4 versions of the patches showing a first attempt at doing that.
> >
> > Thank you for doing this!
> >
> > > I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations
> > > introduced by Ayub Kazar.
> > >
> > > The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text
> > > with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
> > >
> > > This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of
> > > the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later
> > > suggestion of maybe checking again every million lines or so.
> >
> > My input-generation script is not ready to share yet, but the inputs
> > follow this format: text_${n}.input, where n represents the number of
> > normal characters before the delimiter. For example:
> >
> > n = 0 -> "\n\n\n\n\n..." (no normal characters)
> > n = 1 -> "a\n..." (1 normal character before the delimiter)
> > ...
> > n = 5 -> "aaaaa\n..."
> > … continuing up to n = 32.
> >
> > Each line has 4096 chars and there are a total of 100000 lines in each
> > input file.
> >
> > I only benchmarked the text format. I compared the latest heuristic I
> > shared [1] with the current method. The benchmarks show roughly a ~16%
> > regression at the worst case (n = 2), with regressions up to n = 5.
> > For the remaining values, performance was similar.
> 
> I tried to improve the v4 patchset. My changes are:
> 
> 1 - I changed CopyReadLineText() to an inline function and passed the
> use_simd variable as an argument so the compiler can benefit from inlining.
> 
> 2 - The main for loop in CopyReadLineText() is executed many times, so
> I moved the use_simd check out into the CopyReadLine() function.
> 
> 3 - Instead of 'bytes_processed', I used 'chars_processed', because
> cstate->bytes_processed is incremented before the bytes are actually
> processed, which can cause wrong results.
> 
> 4 - Because of #2 and #3, instead of having
> 'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
> special_chars_encountered' to determine whether we want to use SIMD.
> 
> 5 - cstate->special_chars_encountered is incremented incorrectly for the
> CSV case: it is not incremented for the quote and escape characters. I
> moved all increments of cstate->special_chars_encountered to one
> central place and tried to optimize it, but it still causes a
> regression because it adds one more branch.
> 
> With these changes, I am able to decrease the regression from 16% to
> 10%. The regression decreases to 7% if I apply #5 only for text input,
> but I did not do that.
> 
> My changes are in the 0003.
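
(To make #3 and #4 above concrete for anyone skimming the thread, a
minimal sketch of a ratio-based check could look like the following.  The
names and thresholds here are hypothetical placeholders, not the
identifiers or values used in the actual patches.)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning knobs, not values from the patch set. */
#define SIMD_SAMPLE_CHARS		65536	/* sample this much input first */
#define SIMD_MIN_CHARS_PER_SPECIAL	32	/* want chunks mostly free of specials */

static inline bool
simd_looks_worthwhile(uint64_t chars_processed,
					  uint64_t special_chars_encountered)
{
	/* Not enough data sampled yet to judge; stay on the scalar path. */
	if (chars_processed < SIMD_SAMPLE_CHARS)
		return false;

	/* No special characters at all: SIMD scanning is a clear win. */
	if (special_chars_encountered == 0)
		return true;

	/*
	 * Use SIMD only when special characters are rare enough that most
	 * chunks can be skipped whole.
	 */
	return (chars_processed / special_chars_encountered) >=
		SIMD_MIN_CHARS_PER_SPECIAL;
}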

I was helping collect some data, but I'm a little behind in sharing what I
ran against the v4.1 patches (on commit 07961ef8), now that the v4.2
version is out there...

I hope it's still helpful to share what I collected, even though the
results are not quite as nice; maybe that's more about how/where I ran
them.

My laptop has an Intel(R) Core(TM) Ultra 7 165H; most of these tests were
using 95%+ of one of the cores (I have hyperthreading disabled) and about
10% of the SSD's capacity.

Summarizing my results from the same script Manni ran: I didn't see as
much of an improvement in the positive tests, and saw larger regressions
in the other tests.

text copy with no special chars: 18% improvement, 15s faster than the 80s
before the patch

CSV copy with no special chars: 23% improvement, 23s faster than the 96s
before the patch

text with 1/3rd escapes: 6% slower, an additional 5s on top of the 85s
before the patch

CSV with 1/3rd quotes: 7% slower, an additional 10 seconds on top of the
129 seconds before the patch
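
(For context on what these numbers are exercising: the fast path scans the
input a chunk at a time looking for bytes that need per-character
handling.  A generic SSE2 sketch of that kind of scan, not the patch's
actual code, might look like this:

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdbool.h>

/*
 * Illustration only: return true if any of the 16 bytes at 'p' is a byte
 * that needs per-character handling (here just '\n', '\r' and '\\').  The
 * real COPY code also has to care about the configurable delimiter, the
 * CSV quote/escape characters, and text vs. CSV differences.
 */
static bool
chunk_has_special(const char *p)
{
	__m128i chunk = _mm_loadu_si128((const __m128i *) p);
	__m128i hits = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));

	hits = _mm_or_si128(hits, _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\r')));
	hits = _mm_or_si128(hits, _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\\')));

	return _mm_movemask_epi8(hits) != 0;
}

When most chunks come back clean, whole chunks are skipped without a
per-byte branch, which is presumably where the gains in the first two
tests come from; with 1/3rd escapes or quotes nearly every chunk hits, so
the extra scan is mostly wasted work and shows up as the small
regressions.)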


I'm wondering if my laptop/processor isn't the best test bed for this...

Regards,
Mark
--
Mark Wong <markwkm@gmail.com>
EDB https://enterprisedb.com


