Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Nazir Bilal Yavuz
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id CAN55FZ2DE2XSrFUhsOqbpBo+BtzTwsJWOD0MffvdGnHtbsPRuw@mail.gmail.com
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Manni Wood <manni.wood@enterprisedb.com>)
List pgsql-hackers
Hi,

On Sat, 13 Dec 2025 at 02:09, Manni Wood <manni.wood@enterprisedb.com> wrote:
>
> Hello, Everyone!
>
> I have attached two files: 1) the shell script that Mark and I have
> been using to get our test results, and 2) a screenshot of a
> spreadsheet of my latest test results. (Please let me know if there's
> a different format than a screenshot that I could share my spreadsheet
> in.)
>
> I took greater care this time to compile all three variants of
> Postgres (master at bfb335df, master at bfb335df with v4.2 patches
> installed, master at bfb335df with v3 patches installed) with the same
> gcc optimization flags that would be used to build Postgres packages.
> To the best of my knowledge, the two gcc flags of greatest interest
> would be -g and -O2. I built all three variants of Postgres using
> meson like so:
>
> BRANCH=$(git branch --show-current)
> meson setup build --prefix=/home/mwood/compiled-pg-instances/${BRANCH} --buildtype=debugoptimized
>
> It occurred to me that end users only care about 1) wall clock time
> (is the speedup noticeable in "real time" or just technically faster /
> uses less CPU?) and 2) Postgres binaries compiled with the same
> optimization level one would get when installing Postgres from
> packages like .deb or .rpm; in other words, will the user see speedups
> without having to manually compile Postgres?
>
> My interesting finding, on my laptop (ThinkPad P14s Gen 1 running
> Ubuntu 24.04.3), is different from Mark Wong's. On my laptop, using
> three Postgres installations all compiled with the -O2 optimization
> flag, I see speedups with the v4.2 patch except for a 2% slowdown
> with CSV with 1/3rd quotes. But with Nazir's proposed v3 patch, I see
> improvements across the board. So even for a text file with 1/3rd
> escape characters, and even with a CSV file with 1/3rd quotes, I see
> speedups of 11% and 26% respectively.
>
> The format of these test files originally comes from Ayoub Kazar's
> test scripts; all Mark and I have done in playing with them is make
> them much larger: 5,000,000 rows, based on the assumption that longer
> tests are better tests.
>
> I find my results interesting enough that I'd be curious to know if
> anybody else can reproduce them. It is very interesting that Mark's
> results are noticeably different from mine.

Thank you for sharing the benchmark script! I ran the benchmarks using
your script with --buildtype=debugoptimized. My results are below:

master: 85ddcc2f4c

text, no special: 102294
text, 1/3 special: 108946
csv, no special: 121831
csv, 1/3 special: 140063

v3

text, no special: 88890 (13.1% speedup)
text, 1/3 special: 110463 (1.4% regression)
csv, no special: 89781 (26.3% speedup)
csv, 1/3 special: 147094 (5.0% regression)

v4.2

text, no special: 87785 (14.2% speedup)
text, 1/3 special: 127008 (16.6% regression)
csv, no special: 88093 (27.7% speedup)
csv, 1/3 special: 164487 (17.4% regression)
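
For anyone comparing their own numbers: the percentages above are the
change relative to master, i.e. (master - patched) / master. Here is a
minimal sketch of that arithmetic, assuming lower timings are better.
pct_change is a hypothetical helper name used only for illustration; it
is not part of the benchmark script.

```shell
# pct_change BASE NEW: print the relative change of NEW versus BASE.
# Lower timings are better, so NEW < BASE is reported as a speedup.
# (Hypothetical helper, not part of the shared benchmark script.)
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN {
        delta = (base - new) / base * 100
        if (delta >= 0)
            printf "%.1f%% speedup\n", delta
        else
            printf "%.1f%% regression\n", -delta
    }'
}

pct_change 102294 88890    # text, no special: master vs v3
pct_change 140063 164487   # csv, 1/3 special: master vs v4.2
```

With the timings listed above, the two calls print "13.1% speedup" and
"17.4% regression".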

One thing I noticed is that your benchmark timings appear to have some
variance. In my runs, I did not observe differences greater than one
second between runs. It is possible that this variance is affecting
your results.

Before running the benchmarks, I use these commands [1] to improve
result stability; they might be helpful if you are not already using
something similar.

I ran this benchmark on my local machine; its specs are Intel i5
13600k, 32GB memory, and a SATA SSD.

[1]
sudo cpupower frequency-set --governor=performance
sudo cpupower idle-set -D 0 # disable idle
echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo # Intel only

--
Regards,
Nazir Bilal Yavuz
Microsoft


