Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers
From | KAZAR Ayoub |
---|---|
Subject | Re: Speed up COPY FROM text/CSV parsing using SIMD |
Date | |
Msg-id | CA+K2RumH-b=3-v0rfQ-oAbuQFxY8JLSSpVhmaJn+gRnX3t1_vg@mail.gmail.com |
In response to | Re: Speed up COPY FROM text/CSV parsing using SIMD (Nazir Bilal Yavuz <byavuz81@gmail.com>) |
Responses | Re: Speed up COPY FROM text/CSV parsing using SIMD; Re: Speed up COPY FROM text/CSV parsing using SIMD |
List | pgsql-hackers |
On Sat, Oct 18, 2025 at 10:01 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> Thank you so much for doing this! The results look nice, do you think
> there are any other benchmarks that might be interesting to try?
>
> > I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry less multiplication avoiding the slow path in all cases even with weird looking input, but it needs to take into consideration the availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
>
> I can not quite picture this, would you mind sharing a few examples or patches?
The idea is to avoid stopping at characters that are not actually special in their position (inside quotes, escaped, etc.).
This is done by building several masks from the original chunk, such as quote_mask, escape_mask, and a mask of odd-length escape sequences; from these we can deduce which quotes are not special and need not be stopped at.
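To make the mask-building step concrete, here is a minimal sketch, not the code from the patch. The struct ChunkMasks, the helpers match_mask and build_chunk_masks, and the delim_mask and eol_mask fields are hypothetical names chosen for illustration. It assumes a full 16-byte chunk, uses SSE2 when the compiler enables it, and falls back to a scalar loop otherwise. The odd-length escape sequence mask is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CHUNK_SIZE 16

/* Hypothetical container: bit i of each mask describes byte i of the chunk. */
typedef struct ChunkMasks
{
    uint16_t quote_mask;
    uint16_t escape_mask;
    uint16_t delim_mask;
    uint16_t eol_mask;          /* LF or CR */
} ChunkMasks;

#if defined(__SSE2__)
/* One byte-wise compare plus movemask gives a 16-bit mask of matching bytes. */
static uint16_t
match_mask(__m128i chunk, char c)
{
    return (uint16_t) _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, _mm_set1_epi8(c)));
}
#endif

static ChunkMasks
build_chunk_masks(const char *buf, size_t len, char quote, char escape, char delim)
{
    ChunkMasks m = {0, 0, 0, 0};

    assert(len == CHUNK_SIZE);  /* this sketch handles full chunks only */

#if defined(__SSE2__)
    {
        __m128i chunk = _mm_loadu_si128((const __m128i *) buf);

        m.quote_mask = match_mask(chunk, quote);
        m.escape_mask = match_mask(chunk, escape);
        m.delim_mask = match_mask(chunk, delim);
        m.eol_mask = (uint16_t) (match_mask(chunk, '\n') | match_mask(chunk, '\r'));
    }
#else
    /* Scalar fallback that produces the same masks one byte at a time. */
    for (size_t i = 0; i < len; i++)
    {
        uint16_t bit = (uint16_t) (1u << i);

        if (buf[i] == quote)
            m.quote_mask |= bit;
        if (buf[i] == escape)
            m.escape_mask |= bit;
        if (buf[i] == delim)
            m.delim_mask |= bit;
        if (buf[i] == '\n' || buf[i] == '\r')
            m.eol_mask |= bit;
    }
#endif
    return m;
}
```

For the 16-byte chunk `a,"b,c",d\nefghij` the quotes sit at bytes 2 and 6, so quote_mask has exactly those two bits set; the comma at byte 4 shows up in delim_mask even though it is inside the quoted field, which is what the later steps have to filter out.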
Then, for quoted regions, we want to know which characters in the chunk are inside quotes (while also keeping track of the previous chunk's quote state), and there is a clever and fast way to do that [1].
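A rough sketch of that trick on a 16-bit quote mask follows, under my own naming (prefix_xor16, prefix_xor16_clmul and inside_quote_mask are hypothetical helpers, not functions from the patch). The portable version uses unrolled shifts and XORs. The CLMUL variant multiplies by an all-ones operand as described in [1], and is only compiled when the compiler has PCLMULQDQ enabled; otherwise it just calls the portable version.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__PCLMUL__) && defined(__x86_64__)
#include <emmintrin.h>
#include <wmmintrin.h>
#endif

/* Portable prefix XOR: bit i of the result is the XOR of bits 0..i of x. */
static uint16_t
prefix_xor16(uint16_t x)
{
    x ^= (uint16_t) (x << 1);
    x ^= (uint16_t) (x << 2);
    x ^= (uint16_t) (x << 4);
    x ^= (uint16_t) (x << 8);
    return x;
}

/* Same result via carry-less multiplication by all-ones, when available. */
static uint16_t
prefix_xor16_clmul(uint16_t x)
{
#if defined(__PCLMUL__) && defined(__x86_64__)
    __m128i v = _mm_set_epi64x(0, (long long) x);
    __m128i ones = _mm_set1_epi8((char) 0xFF);

    return (uint16_t) _mm_cvtsi128_si64(_mm_clmulepi64_si128(v, ones, 0));
#else
    return prefix_xor16(x);
#endif
}

/*
 * Returns a mask whose bit i is set when byte i is inside quotes (an opening
 * quote counts as inside, a closing quote does not).  *in_quote carries the
 * quote state from the previous chunk and is updated for the next one.
 */
static uint16_t
inside_quote_mask(uint16_t quote_mask, int *in_quote)
{
    uint16_t inside;

    assert(*in_quote == 0 || *in_quote == 1);

    inside = prefix_xor16(quote_mask);
    if (*in_quote)
        inside = (uint16_t) ~inside;    /* previous chunk ended inside quotes */
    *in_quote = (inside >> 15) & 1;
    return inside;
}
```

A doubled quote inside a quoted field toggles the state twice, so it falls out naturally. A quote preceded by a distinct escape character would first have to be cleared from quote_mask, which is where the escape masks come in.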
After this you start matching LF, CR, etc., all while maintaining the state of what you have seen so far (the annoying part).
In the end you only reach the scalar path after advancing to the position of the first real special character, the one that actually requires special treatment.
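As a small illustration of that last step, assuming a mask of candidate special bytes and an inside-quotes mask are already available for the chunk, the number of bytes that can be skipped could be computed like this. The function first_real_special is a hypothetical helper, and the plain loop stands in for whatever count-trailing-zeros primitive the real code would use.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE 16

/*
 * Returns the offset of the first special byte that is not inside quotes,
 * or CHUNK_SIZE when the whole chunk can be skipped.
 */
static int
first_real_special(uint16_t special_mask, uint16_t inside_mask)
{
    uint16_t real = (uint16_t) (special_mask & ~inside_mask);
    int pos = 0;

    if (real == 0)
        return CHUNK_SIZE;

    /* Portable stand-in for a count-trailing-zeros instruction. */
    while ((real & 1) == 0)
    {
        real >>= 1;
        pos++;
    }
    assert(pos < CHUNK_SIZE);
    return pos;
}
```

With special bytes at offsets 4 and 9 and bytes 2 to 5 inside quotes, the result is 9: the byte at offset 4 is ignored and the scalar path only starts at the real line ending.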
However, after trying to implement this on top of the existing COPY pipeline [2] (a broken, hopeless attempt, but it shows the idea), it becomes very unreasonable for a number of reasons:
- It is very challenging to correctly handle commas inside quoted fields and to track quoted vs. unquoted state (especially across chunk boundaries, or with escaped quotes).
- Using carry-less multiplication (CLMUL) for prefix XOR on a 16-byte chunk is overkill on some architectures where PCLMULQDQ latency is high [3][4], to the point where it performs worse than unrolled shifts + XOR (5 cycles).
- It starts to feel like handling these cases is inherently scalar. Doing all that work for a 16-byte chunk is unreasonable since it is not free, compared to Nazir's approach of simple SIMD assistance plus a heuristic, which is much nicer in general.
Currently we are at 200-400Mbps, which isn't that terrible compared to production and non-production grade parsers (and of course parsing is not the only thing we do in our case). We are also using only SSE2, so in theory, if we add AVX support later on, we will get even better numbers.
Maybe more micro-optimizations to the current heuristic can squeeze out a bit more.
[1] https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2] https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf
Regards,
Ayoub Kazar.