Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From	Bilal Yavuz
Subject	Re: Speed up COPY FROM text/CSV parsing using SIMD
Date	December 6 10:55:50
Msg-id	CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com
In response to	Re: Speed up COPY FROM text/CSV parsing using SIMD (Manni Wood <manni.wood@enterprisedb.com>)
List	pgsql-hackers

Hi,

On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> Hello, all.
>
> Andrew, I tried your suggestion of just reading the first chunk of the
> copy file to determine if SIMD is worth using. Attached are v4 versions
> of the patches showing a first attempt at doing that.

Thank you for doing this!

> I attached test.sh.txt to show how I've been testing, with 5 million
> lines of the various copy file variations introduced by Ayub Kazar.
>
> The text copy with no special chars is 30% faster. The CSV copy with no
> special chars is 48% faster. The text with 1/3rd escapes is 3% slower.
> The CSV with 1/3rd quotes is 0.27% slower.
>
> This set of patches follows the simplest suggestion of just testing the
> first N lines (actually first N bytes) of the file and then deciding
> whether or not to enable SIMD. This set of patches does not follow
> Andrew's later suggestion of maybe checking again every million lines
> or so.

My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:

n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.

Each line has 4096 chars and there are a total of 100000 lines in each
input file.
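
Until the real script is shared, the layout described above could be
reproduced with something like the sketch below. This is not the actual
generation script. It assumes that "\n" in the examples means the
two-character escape sequence (backslash followed by 'n') rather than a
real newline, and that any leftover space at the end of a line is padded
with normal characters. The names make_line and write_input are
hypothetical.

```python
import os

LINE_LEN = 4096      # characters per line, as stated above
NUM_LINES = 100_000  # lines per input file, as stated above


def make_line(n: int, line_len: int = LINE_LEN) -> str:
    """Build one line: n normal characters followed by an escape, repeated."""
    # Assumption: "\n" in the examples is the two-character escape
    # sequence (backslash + 'n'), not a real newline.
    unit = "a" * n + "\\n"
    line = unit * (line_len // len(unit))
    # Assumption: pad the tail with normal characters so the line reaches
    # line_len without ever ending in a dangling backslash.
    return line + "a" * (line_len - len(line))


def write_input(directory: str, n: int, num_lines: int = NUM_LINES) -> str:
    """Write text_{n}.input into directory and return its path."""
    path = os.path.join(directory, f"text_{n}.input")
    row = make_line(n) + "\n"  # a real newline terminates each row
    with open(path, "w", newline="\n") as f:
        for _ in range(num_lines):
            f.write(row)
    return path
```

Calling write_input(".", n) for each n from 0 to 32 would produce the
full set of files under these assumptions.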

I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show a regression of
roughly 16% in the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.

Actual comparison of timings (in ms):

current method / heuristic
n = 0 -> 3252.7253 / 2856.2753 (12%)
n = 1 -> 2910.321 / 2520.7717 (13%)
n = 2 -> 2865.008 / 2403.2017 (16%)
n = 3 -> 2608.649 / 2353.1477 (9%)
n = 4 -> 2460.74 / 2300.1783 (6%)
n = 5 -> 2451.696 / 2362.1573 (3%)
No difference for the rest.
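
The percentages in parentheses appear to be the difference between the
two timings relative to the current method, truncated to a whole
number. That formula is inferred from the numbers rather than stated, so
treat the sketch below as a consistency check only; percent_diff is a
hypothetical helper name.

```python
# (current method ms, heuristic ms) for each n, copied from the list above.
timings = {
    0: (3252.7253, 2856.2753),
    1: (2910.321, 2520.7717),
    2: (2865.008, 2403.2017),
    3: (2608.649, 2353.1477),
    4: (2460.74, 2300.1783),
    5: (2451.696, 2362.1573),
}


def percent_diff(current: float, heuristic: float) -> int:
    # Assumption: difference relative to the current method's timing,
    # truncated (not rounded) to a whole percent.
    return int((current - heuristic) / current * 100)


diffs = {n: percent_diff(cur, heur) for n, (cur, heur) in timings.items()}

for n, pct in diffs.items():
    print(f"n = {n} -> {pct}%")
```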

Side note: Sorry for the delay in responding. I will continue working
on this next week.

[1] https://postgr.es/m/CAN55FZ1KF7XNpm2XyG%3DM-sFUODai%3D6Z8a11xE3s4YRBeBKY3tA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft
