Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Neil Conway
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Msg-id CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
List pgsql-hackers
A few suggestions:

* I'm curious whether we'd see better performance on large inputs if we flushed to `line_buf` periodically (e.g., at least every few thousand bytes). Otherwise we might see poor data cache behavior: on a large input with no control characters, the earliest bytes could be evicted from cache before we've copied them over. See the approach taken in escape_json_with_len() in utils/adt/json.c.

* Did you compare the approach taken in the patch with a simpler approach that just does

if (!(vector8_has(chunk, '\\') ||
      vector8_has(chunk, '\r') ||
      vector8_has(chunk, '\n') /* and so on, accounting for CSV / escapec / quotec stuff */))
{
    /* skip chunk */
}

That's roughly what we do elsewhere (e.g., escape_json_with_len). It has the advantage of being more readable, along with potentially having fewer data dependencies.
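
To make both suggestions concrete, here is a rough standalone sketch, not taken from the patch. It skips chunks that contain no special bytes and flushes the pending run to the destination once it grows past a threshold. Everything in it is an assumption made for illustration: chunk_has() is a scalar memchr() stand-in for vector8_has(), CHUNK_BYTES and FLUSH_AFTER are made-up constants, copy_plain_chunks() is a hypothetical helper, and the set of bytes treated as special in each mode is a simplification of what COPY actually has to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES 16     /* stand-in for sizeof(Vector8); an assumption */
#define FLUSH_AFTER 4096   /* hypothetical flush threshold, in bytes */

/* Scalar stand-in for a vector8_has()-style check on one chunk. */
static bool
chunk_has(const char *chunk, char c)
{
    return memchr(chunk, c, CHUNK_BYTES) != NULL;
}

/* True if the chunk holds any byte the byte-at-a-time parser must inspect. */
static bool
chunk_is_special(const char *chunk, bool csv_mode, char quotec, char escapec)
{
    if (chunk_has(chunk, '\r') || chunk_has(chunk, '\n'))
        return true;
    if (csv_mode)
        return chunk_has(chunk, quotec) || chunk_has(chunk, escapec);
    return chunk_has(chunk, '\\');
}

/*
 * Copy the leading run of whole "plain" chunks from src to dst and return
 * its length; the caller resumes byte-at-a-time parsing from that offset.
 * The pending run is flushed whenever it reaches FLUSH_AFTER bytes, so the
 * source bytes are copied while they are likely still in cache.
 * *nflushes reports how many memcpy() calls were made.
 */
static size_t
copy_plain_chunks(char *dst, size_t dst_cap, const char *src, size_t len,
                  bool csv_mode, char quotec, char escapec, int *nflushes)
{
    size_t pos = 0;        /* bytes scanned so far */
    size_t flushed = 0;    /* bytes already copied to dst */

    assert(dst_cap >= len);
    *nflushes = 0;

    while (len - pos >= CHUNK_BYTES &&
           !chunk_is_special(src + pos, csv_mode, quotec, escapec))
    {
        pos += CHUNK_BYTES;
        if (pos - flushed >= FLUSH_AFTER)
        {
            /* periodic flush of the run accumulated so far */
            memcpy(dst + flushed, src + flushed, pos - flushed);
            flushed = pos;
            (*nflushes)++;
        }
    }

    if (pos > flushed)
    {
        /* final flush of whatever is left of the plain run */
        memcpy(dst + flushed, src + flushed, pos - flushed);
        (*nflushes)++;
    }
    return pos;
}
```

The per-chunk checks are independent of each other, which is the "fewer data dependencies" point; whether this actually beats the approach in the patch would need to be benchmarked.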

Neil

On Wed, Dec 10, 2025 at 7:00 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,

On Wed, 10 Dec 2025 at 01:13, Manni Wood <manni.wood@enterprisedb.com> wrote:
>
> Bilal Yavuz (Nazir Bilal Yavuz?),

It is Nazir Bilal Yavuz. I changed some settings on my phone and it
seems that affected my mail account; hopefully it is fixed now.

> I did not get a chance to do any work on this today, but wanted to thank you for finding my logic errors in counting special chars for CSV, and hacking on my naive solution to make it faster. By attempting Andrew Dunstan's suggestion, I got a better feel for the reality that the "housekeeping" code produces a significant amount of overhead.

You are welcome! v4.1 has some problems with the in_quote case in the
SIMD handling code and with how cstate->chars_processed is counted. I
fixed them in v4.2.

--
Regards,
Nazir Bilal Yavuz
Microsoft
