On Wed, Mar 18, 2026 at 03:29:32AM +0100, KAZAR Ayoub wrote:
> If we have some json(b) column like : {"key1":"val1","key2":"val2"}, for
> CSV format this would immediately exit the SIMD path because of quote
> character, for json(b) this is going to be always the case.
> I measured the overhead of exiting the SIMD path a lot (8 million times for
> one COPY TO command), i only found 3% regression for this case, sometimes
> 2%.
I'm a little worried that we might be dismissing small-yet-measurable
regressions for extremely common workloads. Unlike the COPY FROM work,
this operates on a per-attribute level, meaning we only use SIMD when an
attribute is at least 16 bytes. The extra branching for each attribute
might not be something we can just ignore.
Thanks for the review.
I added a prescan loop inside the SIMD helpers that tries to catch special characters within the first sizeof(Vector8) characters. I measured how well this reduces the overhead of starting SIMD and exiting at the first vector:
The scalar loop beats SIMD for a single vector if it finds a special character before the 6th character. The worst case is a clean vector (no special character found), where the scalar loop needs 20 more cycles than SIMD.
This helps mitigate the JSON(B) case in CSV format, which is why I added it for the CSV path only.
In a benchmark with 10M early SIMD exits, like the JSONB case, the previous 3% regression is gone.
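
To make the prescan idea concrete, here is a minimal standalone sketch, not the code from the patch. The names first_special_csv, is_csv_special and CHUNK_SIZE are hypothetical, the set of "special" characters is an assumption, and the 16-byte chunk loop is only a portable stand-in for the real Vector8 path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 16           /* assumption: stands in for sizeof(Vector8) */

/* Hypothetical helper: does c need CSV-specific handling? */
static bool
is_csv_special(char c, char quote, char delim)
{
    return c == quote || c == delim || c == '\n' || c == '\r';
}

/*
 * Return the offset of the first special character in s[0..len),
 * or len if there is none.
 */
static size_t
first_special_csv(const char *s, size_t len, char quote, char delim)
{
    size_t      i = 0;
    size_t      prescan_end = (len < CHUNK_SIZE) ? len : CHUNK_SIZE;

    assert(s != NULL);

    /*
     * Scalar prescan of the first chunk: values such as JSON(B) text hit a
     * quote almost immediately, so we return before paying any setup cost
     * for the chunked path.
     */
    for (; i < prescan_end; i++)
    {
        if (is_csv_special(s[i], quote, delim))
            return i;
    }

    /*
     * Chunked scan, a stand-in for the SIMD loop: test a whole chunk at
     * once and stop at the first chunk that contains a special character.
     */
    while (len - i >= CHUNK_SIZE)
    {
        bool        hit = false;

        for (size_t j = 0; j < CHUNK_SIZE; j++)
            hit |= is_csv_special(s[i + j], quote, delim);
        if (hit)
            break;
        i += CHUNK_SIZE;
    }

    /* Scalar tail: locates the hit inside a chunk and handles leftovers. */
    for (; i < len; i++)
    {
        if (is_csv_special(s[i], quote, delim))
            return i;
    }
    return len;
}
```

With a JSON-like value such as {"key1":"val1"}, the prescan returns at offset 1 and the chunked loop is never entered, which is the early-exit case discussed above.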
For the normal benchmarks (clean, 1/3 specials, wide table), I ran v4 for longer this time and found this:
Test                    Master    V4
TEXT clean              1619ms    -28.0%
CSV clean               1866ms    -37.1%
TEXT 1/3 backslashes    3913ms    +1.2%
CSV 1/3 quotes          4012ms    -3.0%
Wide table TEXT:
Cols    Master    V4
50      2109ms    -2.9%
100     2029ms    -1.6%
200     3982ms    -2.9%
500     1962ms    -6.1%
1000    3812ms    -3.6%
Wide table CSV:
Cols    Master    V4
50      2531ms    +0.3%
100     2465ms    +1.1%
200     4965ms    -0.2%
500     2346ms    +1.4%
1000    4709ms    -0.4%
Do we need more benchmarks for other kinds of workloads? Am I perhaps missing something else that has noticeable overhead?
Regards,
Ayoub