Re: Speed up COPY FROM text/CSV parsing using SIMD - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Speed up COPY FROM text/CSV parsing using SIMD
Date
Msg-id aa8QlTVEDhG1JU0Z@nathan
Whole thread Raw
In response to Re: Speed up COPY FROM text/CSV parsing using SIMD  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Speed up COPY FROM text/CSV parsing using SIMD
Re: Speed up COPY FROM text/CSV parsing using SIMD
List pgsql-hackers
On Wed, Mar 04, 2026 at 06:15:53PM +0300, Nazir Bilal Yavuz wrote:
> +#ifndef USE_NO_SIMD
> +static bool CopyReadLineTextSIMDHelper(CopyFromState cstate, bool is_csv,
> +                                       bool *temp_hit_eof, int *temp_input_buf_ptr);
> +#endif

Should we inline this, too?

> +                /*
> +                 * Do not disable SIMD when we hit EOL or EOF characters. In
> +                 * practice, it does not matter for EOF because parsing ends
> +                 * there, but we keep the behavior consistent.
> +                 */
> +                if (!(simd_hit_eof || simd_hit_eol))
> +                    cstate->simd_enabled = false;

nitpick: I would personally avoid disabling it for EOF.  It probably
doesn't amount to much, but I don't see any point in the extra
complexity/work solely for consistency.

> +                /*
> +                 * We encountered a EOL or EOF on the first vector. This means
> +                 * lines are not long enough to skip fully sized vector. If
> +                 * this happens two times consecutively, then disable the
> +                 * SIMD.
> +                 */
> +                if (first_vector)
> +                {
> +                    if (cstate->simd_failed_first_vector)
> +                        cstate->simd_enabled = false;
> +
> +                    cstate->simd_failed_first_vector = true;
> +                }

The first time I saw this, my mind immediately went to the extreme case
where this likely regresses: alternating long and short lines.  We might
just want to disable it the first time we see a short line, like we do for
special characters.  This is another thing that we can improve
independently later on.

> +    /* First try to run SIMD, then continue with the scalar path */
> +    if (cstate->simd_enabled)
> +    {
> +        int            temp_input_buf_ptr = input_buf_ptr;
> +        bool        temp_hit_eof = false;
> +
> +        result = CopyReadLineTextSIMDHelper(cstate, is_csv, &temp_hit_eof,
> +                                            &temp_input_buf_ptr);
> +        input_buf_ptr = temp_input_buf_ptr;
> +        hit_eof = temp_hit_eof;

Given CopyReadLineTextSIMDHelper() doesn't have too much duplicated code,
moving the SIMD stuff to its own function is nice.  The temp variables seem
a bit too magical to me, though.  If those really make a difference, IMHO
there ought to be a big comment explaining why.

-- 
nathan



pgsql-hackers by date:

Previous
From: Hüseyin Demir
Date:
Subject: Re: client_connection_check_interval default value
Next
From: Corey Huinker
Date:
Subject: Re: Add starelid, attnum to pg_stats and leverage this in pg_dump