Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> Hello, all.
>
> Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
Thank you for doing this!
> I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
>
> The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
>
> This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:
n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.
Each line has 4096 chars and there are a total of 100000 lines in each
input file.
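
For illustration only, here is a minimal sketch of how inputs in this
format could be generated. It is not the actual script, which has not
been shared. It assumes that "\n" in the examples above means the
two-character sequence backslash + n inside a line, and that each line
is filled with whole "a...a\n" units, so a line can end up a few
characters shorter than 4096. The names make_line and write_input are
hypothetical.

```python
import os

LINE_LEN = 4096      # characters per line, from the description above
NUM_LINES = 100000   # lines per input file, from the description above
SPECIAL = "\\n"      # ASSUMPTION: the two characters backslash + n


def make_line(n, line_len=LINE_LEN):
    """Build one line: n normal characters, then the special sequence, repeated."""
    unit = "a" * n + SPECIAL
    # Use only whole units so the line never ends in a dangling backslash.
    return unit * (line_len // len(unit))


def write_input(n, num_lines=NUM_LINES, directory="."):
    """Write text_<n>.input with num_lines identical lines and return its path."""
    path = os.path.join(directory, f"text_{n}.input")
    line = make_line(n) + "\n"  # a real newline ends each line
    with open(path, "w", newline="\n") as f:
        for _ in range(num_lines):
            f.write(line)
    return path
```

Calling write_input(n) for n from 0 to 32 would produce the full set of
files; each file is roughly 400 MB at the sizes given above.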
I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show a regression of
roughly 16% in the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.
Actual comparison of timings (in ms):
current method / heuristic
n = 0 -> 3252.7253 / 2856.2753 (12%)
n = 1 -> 2910.321 / 2520.7717 (13%)
n = 2 -> 2865.008 / 2403.2017 (16%)
n = 3 -> 2608.649 / 2353.1477 (9%)
n = 4 -> 2460.74 / 2300.1783 (6%)
n = 5 -> 2451.696 / 2362.1573 (3%)
No difference for the rest.
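
The percentages in the list above appear to be the difference between
the two timings relative to the first column, truncated to a whole
percent. A quick sketch to reproduce them (the names timings and
percent_diff are illustrative):

```python
# Timings in ms, copied from the list above: (first column, second column).
timings = {
    0: (3252.7253, 2856.2753),
    1: (2910.321, 2520.7717),
    2: (2865.008, 2403.2017),
    3: (2608.649, 2353.1477),
    4: (2460.74, 2300.1783),
    5: (2451.696, 2362.1573),
}


def percent_diff(first, second):
    """Difference relative to the first column, truncated to a whole percent."""
    return int((first - second) / first * 100)


for n, (first, second) in timings.items():
    print(f"n = {n} -> {percent_diff(first, second)}%")
```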
Side note: Sorry for the delay in responding. I will continue working
on this next week.
[1] https://postgr.es/m/CAN55FZ1KF7XNpm2XyG%3DM-sFUODai%3D6Z8a11xE3s4YRBeBKY3tA%40mail.gmail.com
--
Regards,
Nazir Bilal Yavuz
Microsoft