A few suggestions:
* I'm curious if we'll see better performance on large inputs if we flush to `line_buf` periodically (e.g., at least every few thousand bytes or so). Otherwise we might see poor data cache behavior if large inputs with no control characters get evicted before we've copied them over. See the approach taken in escape_json_with_len() in utils/adt/json.c.
Hello Neil, thanks for the suggestions!
This might well be true, given that the JSON code does exactly that; I'll take a look later.
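For reference, the pattern would look roughly like the following self-contained sketch (Buf, buf_append() and FLUSH_THRESHOLD are made-up stand-ins for StringInfo, appendBinaryStringInfo() and a tuned constant; this is not the patch code):

#include <stddef.h>
#include <string.h>

/* Made-up stand-in for StringInfo; the real code would use line_buf. */
typedef struct
{
    char    data[65536];
    size_t  len;
} Buf;

static void
buf_append(Buf *b, const char *p, size_t n)
{
    memcpy(b->data + b->len, p, n);
    b->len += n;
}

#define FLUSH_THRESHOLD 4096    /* made-up value, would need tuning */

/*
 * Scan input, copying clean spans into out.  Instead of deferring the
 * copy until a special byte shows up, flush the pending span every
 * FLUSH_THRESHOLD bytes so the source is still warm in the data cache
 * when we memcpy it.
 */
static void
copy_with_periodic_flush(const char *input, size_t len, Buf *out)
{
    const char *span_start = input;
    const char *p = input;
    const char *end = input + len;

    while (p < end)
    {
        if (*p == '\\' || *p == '\r' || *p == '\n')
        {
            buf_append(out, span_start, p - span_start);
            /* ... special-byte handling would go here ... */
            span_start = ++p;
            continue;
        }
        p++;
        if ((size_t) (p - span_start) >= FLUSH_THRESHOLD)
        {
            buf_append(out, span_start, p - span_start);
            span_start = p;
        }
    }
    buf_append(out, span_start, p - span_start);    /* final partial span */
}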
* Did you compare the approach taken in the patch with a simpler approach that just does
if (!(vector8_has(chunk, '\\') ||
      vector8_has(chunk, '\r') ||
      vector8_has(chunk, '\n') /* and so on, accounting for CSV / escapec / quotec stuff */))
{
    /* skip chunk */
}
That's roughly what we do elsewhere (e.g., escape_json_with_len). It has the advantage of being more readable, along with potentially having fewer data dependencies.
That's true, and it's similar to the concern behind the very first patches, where I tried to replace the special-character detection with something with fewer data dependencies; I had just missed vector8_has().
I've tried this now; here are my thoughts and results:
v5: the simplest of all: take a chunk and check it for any special character; if it's clean, skip it entirely, otherwise fall back to scalar WITHOUT advancing to the first special character found. This one is included just for reference and comparison.
Numbers compared to master:
text, no special: +16.8%
text, 1/3 special: -21.3%
csv, no special: +29.5%
csv, 1/3 special: -33.5%
v5.1: take a chunk; if it's clean of specials, skip it; if not, process the whole chunk byte by byte, then try the same thing on the next chunk, and so on (see the sketch after the numbers below).
Numbers:
text, no special: +16.5%
text, 1/3 special: -20.1% (oddly); but overall the idea is clear, I think.
csv, no special: +29.4%
csv, 1/3 special: -4.4%
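For concreteness, here is a rough self-contained sketch of the v5.1 loop on top of the port/simd.h API (process_byte() is a hypothetical stand-in for the scalar per-byte logic, and the real code would also have to account for the CSV / escapec / quotec characters):

#include "c.h"
#include "port/simd.h"

static void
process_byte(uint8 b)
{
    (void) b;                   /* scalar handling elided in this sketch */
}

/*
 * v5.1-style loop: skip whole chunks that contain no special byte;
 * when a chunk is dirty, run every byte of it through the scalar
 * path (without first advancing to the offending byte), then go
 * back to chunking.
 */
static void
scan_input(const uint8 *buf, size_t len)
{
    size_t  i = 0;

    while (i + sizeof(Vector8) <= len)
    {
        Vector8 chunk;

        vector8_load(&chunk, &buf[i]);

        if (!(vector8_has(chunk, '\\') ||
              vector8_has(chunk, '\r') ||
              vector8_has(chunk, '\n')))
        {
            /* clean chunk: skip it wholesale */
            i += sizeof(Vector8);
            continue;
        }

        /* dirty chunk: scalar-process all of its bytes */
        for (size_t j = 0; j < sizeof(Vector8); j++)
            process_byte(buf[i + j]);
        i += sizeof(Vector8);
    }

    /* leftover tail shorter than a chunk goes through scalar code too */
    for (; i < len; i++)
        process_byte(buf[i]);
}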
For files with 1/3 special characters it's almost always a loss to use SIMD, because it ends up falling back to scalar without even advancing to the first special character found; I suspect the same holds well below 1/3 too. I've attached an extremely simple analysis of the probability of getting a clean 16-byte chunk depending on the data we have.
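To give the flavor of that analysis (assuming each byte is special independently with probability p): a 16-byte chunk is clean with probability (1 - p)^16. At p = 1/3 that's (2/3)^16 ≈ 0.15%, so essentially every chunk takes the scalar fallback, and even at p = 1/10 only (9/10)^16 ≈ 18.5% of chunks can be skipped.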
Also attached is an overall comparison graph between master, v3 and v5.1 (with the patch).
Regards,
Ayoub