
Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	Andrew Dunstan
Subject	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date	October 20 17:02:23
Msg-id	673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net
In response to	Re: Speed up COPY FROM text/CSV parsing using SIMD (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses	Re: Speed up COPY FROM text/CSV parsing using SIMD
List	pgsql-hackers


On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:

Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:


On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:

Hi,

On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I am able to reproduce the regression you mentioned, but both
regressions are 20% on my end. I found by experimenting that SIMD
causes a regression if it advances fewer than 5 characters.

So, I implemented a small heuristic. It works like this:

- If advance < 5 -> insert a sleep penalty (n cycles).

'sleep' might be a poor word choice here. I meant skipping SIMD for
the next n attempts.

I was thinking a bit about that this morning. I wonder if, instead of
having a constantly applied heuristic like this, it might be better to
do a little extra accounting in the first, say, 1000 lines of an input
file, and if less than some portion of the input is found to be special
characters then switch to the SIMD code. What that portion should be
would need to be determined by some experimentation with a variety of
typical workloads, but given your findings 20% seems like a good
starting point.
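
A minimal sketch of the sampling idea above, in C. It is only an
illustration, not code from the patch: SimdSampler,
simd_sampler_account_line, SAMPLE_LINES and SPECIAL_PCT_THRESHOLD are
hypothetical names, and which bytes count as "special" (delimiter,
quote, escape, line endings) is an assumption. The 1000-line window
and the 20% threshold are the starting points suggested in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_LINES          1000	/* sampling window suggested in the mail */
#define SPECIAL_PCT_THRESHOLD 20	/* starting point suggested in the mail */

/* Hypothetical per-COPY accounting state; zero-initialize before use. */
typedef struct SimdSampler
{
	uint64_t	total_bytes;	/* bytes seen during sampling */
	uint64_t	special_bytes;	/* of which were special characters */
	uint32_t	lines_seen;		/* lines accounted so far */
	bool		decided;		/* true once the sample is complete */
	bool		use_simd;		/* the decision; valid only if decided */
} SimdSampler;

/* Assumption: these are the bytes the scalar parser has to stop on. */
static bool
is_special(char c, char delim, char quote, char escape)
{
	return c == delim || c == quote || c == escape || c == '\n' || c == '\r';
}

/*
 * Account for one input line. After SAMPLE_LINES lines, decide once
 * whether special characters are rare enough to switch to SIMD.
 */
void
simd_sampler_account_line(SimdSampler *s, const char *line, size_t len,
						  char delim, char quote, char escape)
{
	assert(s != NULL);

	if (s->decided)
		return;					/* decision is made only once */

	for (size_t i = 0; i < len; i++)
	{
		if (is_special(line[i], delim, quote, escape))
			s->special_bytes++;
	}
	s->total_bytes += len;

	if (++s->lines_seen >= SAMPLE_LINES)
	{
		/* Integer comparison of special/total against the threshold. */
		s->use_simd = s->special_bytes * 100 <
			s->total_bytes * SPECIAL_PCT_THRESHOLD;
		s->decided = true;
	}
}
```

Unlike a heuristic that is applied continuously, this decides once per
input, so a file whose character mix changes after the sampled lines
would keep the initial decision.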

I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea, and it works like this:

Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.

1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for that many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to step 1.
Note: simd_last_sleep_cycle cannot exceed 1024, so in the worst case we
will still try SIMD once every 1024 lines.
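
The steps above can be sketched in C as a small exponential backoff.
This is an illustration under stated assumptions, not the patch
itself: only simd_last_sleep_cycle (here the last_sleep_cycle field)
and the 1024 cap come from the mail. The SimdBackoff struct, the
lines_to_skip counter, both function names, and the choice to go from
0 to 1 on the first failure are hypothetical. How "worth it" is
measured is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIMD_SLEEP_MAX 1024		/* cap stated in the mail */

/* Hypothetical state; zero-initialize before the first line. */
typedef struct SimdBackoff
{
	uint32_t	last_sleep_cycle;	/* simd_last_sleep_cycle in the mail */
	uint32_t	lines_to_skip;		/* lines left to parse without SIMD */
} SimdBackoff;

/* Step 1 / 1.3: should the next line be tried with SIMD? */
bool
simd_backoff_should_try(SimdBackoff *b)
{
	if (b->lines_to_skip > 0)
	{
		b->lines_to_skip--;		/* still inside the skip window */
		return false;
	}
	return true;
}

/* Steps 1.1 and 1.2: record whether the last SIMD attempt paid off. */
void
simd_backoff_report(SimdBackoff *b, bool worth_it)
{
	assert(b->last_sleep_cycle <= SIMD_SLEEP_MAX);

	if (worth_it)
	{
		/* 1.1: keep using SIMD and halve the penalty */
		b->last_sleep_cycle /= 2;
	}
	else
	{
		/* 1.2: double the penalty (assumption: 0 becomes 1), capped */
		uint32_t	next = (b->last_sleep_cycle == 0)
			? 1 : b->last_sleep_cycle * 2;

		if (next > SIMD_SLEEP_MAX)
			next = SIMD_SLEEP_MAX;
		b->last_sleep_cycle = next;
		b->lines_to_skip = next;	/* skip this many lines before retrying */
	}
}
```

A caller would ask simd_backoff_should_try() once per line and call
simd_backoff_report() only after a line on which SIMD was actually
tried.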

With this heuristic the regression is limited to 2% in the worst case.

My worry is that the worst case is actually quite common. Sparse data sets dominated by a lot of null values (and hence lots of special characters) are very common. Are people prepared to accept a 2% regression on load times for such data sets?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
