Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers
| From | Manni Wood |
|---|---|
| Subject | Re: Speed up COPY FROM text/CSV parsing using SIMD |
| Date | |
| Msg-id | CAKWEB6o3yfVAo0Mrkhyyo2581DG=8wR4W6G+0g7Ke6BUxK-PaQ@mail.gmail.com |
| In response to | Re: Speed up COPY FROM text/CSV parsing using SIMD (Manni Wood <manni.wood@enterprisedb.com>) |
| Responses | Re: Speed up COPY FROM text/CSV parsing using SIMD; Re: Speed up COPY FROM text/CSV parsing using SIMD |
| List | pgsql-hackers |
On Wed, Jan 7, 2026 at 1:13 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
On Tue, Jan 6, 2026 at 2:05 PM Manni Wood <manni.wood@enterprisedb.com> wrote:
On Wed, Dec 31, 2025 at 7:04 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip data and advance as far as the first special character, then falls back to scalar processing for that character and re-enters the SIMD path again...
> There are two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue on the scalar path without re-entering SIMD again.
> - This gives 10% to 30% speedups depending on the weight of special characters in the attribute; we don't lose anything here since it advances with SIMD until it can't (using the previous scripts: 1/3, 2/3 special chars).
>
> 2) Do the SIMD path, then use the scalar path when we hit a special character, and keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story: we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: around 20% to 30% with 1/3, 2/3 special chars).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached, 0004 is option 1 (SIMD without re-entering) and 0005 is option 2.
Patches look correct to me. I think we could move these SIMD code
portions into a shared function to remove duplication, although that
might have a performance impact. I have not benchmarked these patches
yet.
Another consideration is that these patches might need their own
thread, though I am not completely sure about this yet.
One question: what do you think about having a 0004-style approach for
COPY FROM? What I have in mind is running SIMD for each line & column,
stopping SIMD once it can no longer skip an entire chunk, and then
continuing with the next line & column.
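To make that concrete, the loop shape I have in mind looks roughly like the sketch below, written against the helpers in src/include/port/simd.h. The function name and the exact set of special bytes are only for illustration, not taken from the attached patches:

#include "port/simd.h"

/*
 * Illustrative sketch only: advance over bytes that need no special
 * handling, one Vector8 chunk at a time, and stop at the first chunk
 * that contains a byte of interest.  The caller finishes that chunk
 * with the existing scalar code.
 */
static inline int
skip_plain_bytes(const char *s, int len, char delim)
{
    int         i = 0;

#ifndef USE_NO_SIMD
    for (; len - i >= (int) sizeof(Vector8); i += sizeof(Vector8))
    {
        Vector8     chunk;

        vector8_load(&chunk, (const uint8 *) &s[i]);

        /* any delimiter, backslash, CR or NL in this chunk? */
        if (vector8_has(chunk, (uint8) delim) ||
            vector8_has(chunk, (uint8) '\\') ||
            vector8_has(chunk, (uint8) '\n') ||
            vector8_has(chunk, (uint8) '\r'))
            break;
    }
#endif
    return i;
}

Whether the caller re-enters this loop after handling the special byte (0005) or stays on the scalar path for the rest of the field (0004) is exactly the difference between the two options, and a helper along these lines is also what I meant above by sharing the SIMD portion between the COPY FROM and COPY TO code paths.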
--
Regards,
Nazir Bilal Yavuz
Microsoft

Hello, Nazir. I tried your suggested cpupower commands as well as disabling turbo, and my results are indeed more uniform (see the attached screenshot of my spreadsheet).
This time, I ran the tests on my tower PC instead of on my laptop.
I also followed Mark Wong's advice and used the taskset command to pin my Postgres postmaster (and all of its children) to a single CPU core.
So when I start Postgres, I do this to pin it to core 27:
# Start the server, then read the postmaster's PID from the first line of
# postmaster.pid and pin it (and any children it forks) to CPU core 27.
${PGHOME}/bin/pg_ctl -D ${PGHOME}/data -l ${PGHOME}/logfile.txt start
PGPID=$(head -1 ${PGHOME}/data/postmaster.pid)
taskset --cpu-list -p 27 ${PGPID}
My results seem similar to yours:
master (Nazir: 85ddcc2f4c | Manni: 877ae5db)
text, no special: 102294 | 302651
text, 1/3 special: 108946 | 326208
csv, no special: 121831 | 348930
csv, 1/3 special: 140063 | 439786
v3
text, no special: 88890 (13.1% speedup) | 227874 (24.7% speedup)
text, 1/3 special: 110463 (1.4% regression) | 322637 (1.1% speedup)
csv, no special: 89781 (26.3% speedup) | 226525 (35.1% speedup)
csv, 1/3 special: 147094 (5.0% regression) | 461501 (4.9% regression)
v4.2
text, no special: 87785 (14.2% speedup) | 225702 (25.4% speedup)
text, 1/3 special: 127008 (16.6% regression) | 343480 (5.3% regression)
csv, no special: 88093 (27.7% speedup) | 226633 (35.0% speedup)
csv, 1/3 special: 164487 (17.4% regression) | 510954 (16.2% regression)

It would seem that both your results and mine show a more serious worst-case regression for the v4.2 patches than for the v3 patches. It also seems that the speedups for v4.2 and v3 are similar. I'm currently working with Mark Wong to see if his results continue to be dissimilar (as they currently are) and, if so, why.

--
Manni Wood
EDB: https://www.enterprisedb.com

Hello, all.

Now that I am following Nazir's advice on how to configure my CPU for performance test runs, and now that I am following Mark's advice on pinning the postmaster to a particular CPU core, I figured I would share the scripts I have been using to build, run, and test Postgres with the various patches applied: https://github.com/manniwood/copysimdperf

With Nazir's and Mark's tips, I have seen more consistent numbers on my tower PC, as shared in a previous e-mail. But Mark and I saw rather variable results on a different Linux system he has access to. So this has inspired me to spin up an AWS EC2 instance and test there when I find the time, and maybe re-test on my Linux laptop.

If anybody else is inspired to test on different setups, that would be great.

--
Manni Wood
EDB: https://www.enterprisedb.com
I tested master (bfb335d) and the v3 and v4.2 patches on an Amazon EC2 instance (t2.small) and, with Mark's help, confirmed that on such a small system with the default storage configuration, IO is the bottleneck: the v3 and v4.2 patches show no significant difference over master because the CPU is always waiting on IO. This is presumably the experience Postgres users will have when running on systems with IO so slow that the CPU is always waiting for data.
I went in the other direction and tested an all-RAM setup on my tower PC. I put the entire data directory in RAM for each Postgres instance (master, v3 patch, v4.2 patch), and wrote and copied the test copy files from RAM as well. On Linux, /run/user/<myuserid> is tmpfs (a RAM disk), so I just put everything there. I had to shrink the data sizes compared to previous runs (so as not to run out of RAM-disk space), but Nazir's cpupower tips are making all of my test runs much more uniform, so I no longer feel that I need huge data sizes to get good results.
Here are the results when all of the files are on RAM disks:
master: bfb335df
text, no special: 30372
text, 1/3 special: 32665
csv, no special: 34925
csv, 1/3 special: 44044
v3
text, no special: 22840 (24.7% speedup)
text, 1/3 special: 32448 (0.6% speedup)
csv, no special: 22642 (35.1% speedup)
csv, 1/3 special: 46280 (5.1% regression)
v4.2
text, no special: 22677 (25.3% speedup)
text, 1/3 special: 34512 (6.5% regression)
csv, no special: 22686 (35.0% speedup)
csv, 1/3 special: 51411 (16.7% regression)
Assuming the all-RAM setup gets us closer to the theoretical limit of each patch, it looks like v3 keeps pace with v4.2 in the best-case scenarios, while regressing less than v4.2 in the worst-case scenarios.
Let me know what you think!
--
Manni Wood
EDB: https://www.enterprisedb.com