Thread: pgsql: Refactor COPY FROM to use format callback functions.

pgsql: Refactor COPY FROM to use format callback functions.

From
Masahiko Sawada
Date:
Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/7717f63006935de00fafd000bff450280508adf1

Modified Files
--------------
src/backend/commands/copyfrom.c          | 192 ++++++++++---
src/backend/commands/copyfromparse.c     | 446 +++++++++++++++++--------------
src/include/commands/copy.h              |   2 -
src/include/commands/copyapi.h           |  50 +++-
src/include/commands/copyfrom_internal.h |  11 +
src/tools/pgindent/typedefs.list         |   1 +
6 files changed, 459 insertions(+), 243 deletions(-)


Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Andrew Dunstan
Date:


On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions.

This commit introduces a new CopyFromRoutine struct, which is a set of
callback routines to read tuples in a specific format. It also makes
COPY FROM with the existing formats (text, CSV, and binary) utilize
these format callbacks.

This change is a preliminary step towards making the COPY FROM command
extensible in terms of input formats.

Similar to 2e4127b6d2d, this refactoring contributes to a performance
improvement by reducing the number of "if" branches that need to be
checked on a per-row basis when sending field representations in text
or CSV mode. The performance benchmark results showed ~5% performance
gain in text or CSV mode.

Author: Sutou Kouhei <kou@clear-code.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com



This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Masahiko Sawada
Date:
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
>
> Refactor COPY FROM to use format callback functions.
>
> This commit introduces a new CopyFromRoutine struct, which is a set of
> callback routines to read tuples in a specific format. It also makes
> COPY FROM with the existing formats (text, CSV, and binary) utilize
> these format callbacks.
>
> This change is a preliminary step towards making the COPY FROM command
> extensible in terms of input formats.
>
> Similar to 2e4127b6d2d, this refactoring contributes to a performance
> improvement by reducing the number of "if" branches that need to be
> checked on a per-row basis when sending field representations in text
> or CSV mode. The performance benchmark results showed ~5% performance
> gain in text or CSV mode.
>
> Author: Sutou Kouhei <kou@clear-code.com>
> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
> Reviewed-by: Michael Paquier <michael@paquier.xyz>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
>
>
>
> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API
isnot a good thing. 
>

Thank you for pointing it out.

I've just posted my analysis[1] and am planning to revive that API
(Sutou-san already proposed an idea). Could you please check if the
idea would work for file_text_array_fdw?

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDr13%3Ddx%2Bk8gmQnR5_bY%2BNskyN4mbSWN0KhQncL6xuPMA%40mail.gmail.com


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Andrew Dunstan
Date:
On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:
> On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
>>
>> Refactor COPY FROM to use format callback functions.
>>
>> This commit introduces a new CopyFromRoutine struct, which is a set of
>> callback routines to read tuples in a specific format. It also makes
>> COPY FROM with the existing formats (text, CSV, and binary) utilize
>> these format callbacks.
>>
>> This change is a preliminary step towards making the COPY FROM command
>> extensible in terms of input formats.
>>
>> Similar to 2e4127b6d2d, this refactoring contributes to a performance
>> improvement by reducing the number of "if" branches that need to be
>> checked on a per-row basis when sending field representations in text
>> or CSV mode. The performance benchmark results showed ~5% performance
>> gain in text or CSV mode.
>>
>> Author: Sutou Kouhei <kou@clear-code.com>
>> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
>> Reviewed-by: Michael Paquier <michael@paquier.xyz>
>> Reviewed-by: Andres Freund <andres@anarazel.de>
>> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
>> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
>>
>>
>>
>> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API
isnot a good thing.
 
>>
> Thank you for pointing it out.
>
> I've just posted my analysis[1] and am planning to revive that API
> (Sutou-san already proposed an idea). Could you please check if the
> idea would work for file_text_array_fdw?
>

Looks OK, I think. You could even use the Internal function further down 
in the file and avoid a function call.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Masahiko Sawada
Date:
On Fri, Feb 28, 2025 at 12:14 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote:
> > On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> >>
> >> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
> >>
> >> Refactor COPY FROM to use format callback functions.
> >>
> >> This commit introduces a new CopyFromRoutine struct, which is a set of
> >> callback routines to read tuples in a specific format. It also makes
> >> COPY FROM with the existing formats (text, CSV, and binary) utilize
> >> these format callbacks.
> >>
> >> This change is a preliminary step towards making the COPY FROM command
> >> extensible in terms of input formats.
> >>
> >> Similar to 2e4127b6d2d, this refactoring contributes to a performance
> >> improvement by reducing the number of "if" branches that need to be
> >> checked on a per-row basis when sending field representations in text
> >> or CSV mode. The performance benchmark results showed ~5% performance
> >> gain in text or CSV mode.
> >>
> >> Author: Sutou Kouhei <kou@clear-code.com>
> >> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
> >> Reviewed-by: Michael Paquier <michael@paquier.xyz>
> >> Reviewed-by: Andres Freund <andres@anarazel.de>
> >> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
> >> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
> >> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
> >>
> >>
> >>
> >> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from
APIis not a good thing. 
> >>
> > Thank you for pointing it out.
> >
> > I've just posted my analysis[1] and am planning to revive that API
> > (Sutou-san already proposed an idea). Could you please check if the
> > idea would work for file_text_array_fdw?
> >
>
> Looks OK, I think. You could even use the Internal function further down
> in the file and avoid a function call.

Right. I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Sutou Kouhei
Date:
Hi,

Thanks for following up the patch!

In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
  "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
  Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Right. I've attached the updated patch.

In general, this approach will work but could you keep
the same signature for backward compatibility?

> --- a/src/backend/commands/copyfromparse.c
> +++ b/src/backend/commands/copyfromparse.c

> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> +    return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}

NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
    return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}


> @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
>  /*
>   * Read raw fields in the next line for COPY FROM in text or csv mode.
>   * Return false if no more lines.
> + */
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> +    return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
> +
> +/*
> + * Workhorse for NextCopyFromRawFields().

It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/*
 * See NextCopyFromRawFieldsInternal() for details.
 */
bool
NextCopyFromRawFields(...)
{
    ...
}

/*
 * Workhorse for NextCopyFromRawFields().
 *
 * Read raw fields ...
 *
 * ...
 */
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
    ...
}


Thanks,
-- 
kou



Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Masahiko Sawada
Date:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
>
> Hi,
>
> Thanks for following up the patch!
>
> In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
>   "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
>   Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Right. I've attached the updated patch.
>
> In general, this approach will work but could you keep
> the same signature for backward compatibility?
>
> > --- a/src/backend/commands/copyfromparse.c
> > +++ b/src/backend/commands/copyfromparse.c
>
> > +bool
> > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> > +{
> > +     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> > +}
>
> NextCopyFromRawFields() uses
> NextCopyFromRawFields(CopyFromState cstate, char ***fields,
> int *nfields) (no "bool is_csv") before we remove it. So
> could you use the no "bool is_csv" signature for backward
> compatibility?
>
> bool
> NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> {
>         return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
> }

I like this idea.

>
>
> > @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
> >  /*
> >   * Read raw fields in the next line for COPY FROM in text or csv mode.
> >   * Return false if no more lines.
> > + */
> > +bool
> > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> > +{
> > +     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> > +}
> > +
> > +/*
> > + * Workhorse for NextCopyFromRawFields().
>
> It may be better that we don't split docs for
> NextCopyFromRawFields() and
> NextCopyFromRawFieldsInternal(). How about referring the
> NextCopyFromRawFieldsInternal() doc from the
> NextCopyFromRawFields() doc something like the following?
>
> /*
>  * See NextCopyFromRawFieldsInternal() for details.
>  */
> bool
> NextCopyFromRawFields(...)
> {
>         ...
> }
>
> /*
>  * Workhorse for NextCopyFromRawFields().
>  *
>  * Read raw fields ...
>  *
>  * ...
>  */
> static pg_attribute_always_inline bool
> NextCopyFromRawFieldsInternal()
> {
>         ...
> }
>

Agreed.

I've updated the patch. I'm going to push it, barring any objections
and further comments.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Andrew Dunstan
Date:


On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,

Thanks for following up the patch!

In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>  "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,  Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?

--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?

bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{        return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}
I like this idea.


@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /*  * Read raw fields in the next line for COPY FROM in text or csv mode.  * Return false if no more lines.
+ */
+bool
+NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
+{
+     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
+}
+
+/*
+ * Workhorse for NextCopyFromRawFields().
It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?

/* * See NextCopyFromRawFieldsInternal() for details. */
bool
NextCopyFromRawFields(...)
{        ...
}

/* * Workhorse for NextCopyFromRawFields(). * * Read raw fields ... * * ... */
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{        ...
}

Agreed.

I've updated the patch. I'm going to push it, barring any objections
and further comments.



I'm OK either way - I have made changes to adapt to the API change, and tested them.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pgsql: Refactor COPY FROM to use format callback functions.

From
Masahiko Sawada
Date:
On Fri, Feb 28, 2025 at 3:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
>
> On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
>
> Hi,
>
> Thanks for following up the patch!
>
> In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
>   "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
>   Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Right. I've attached the updated patch.
>
> In general, this approach will work but could you keep
> the same signature for backward compatibility?
>
> --- a/src/backend/commands/copyfromparse.c
> +++ b/src/backend/commands/copyfromparse.c
>
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> +     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
>
> NextCopyFromRawFields() uses
> NextCopyFromRawFields(CopyFromState cstate, char ***fields,
> int *nfields) (no "bool is_csv") before we remove it. So
> could you use the no "bool is_csv" signature for backward
> compatibility?
>
> bool
> NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> {
>         return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
> }
>
> I like this idea.
>
>
> @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
>  /*
>   * Read raw fields in the next line for COPY FROM in text or csv mode.
>   * Return false if no more lines.
> + */
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> +     return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
> +
> +/*
> + * Workhorse for NextCopyFromRawFields().
>
> It may be better that we don't split docs for
> NextCopyFromRawFields() and
> NextCopyFromRawFieldsInternal(). How about referring the
> NextCopyFromRawFieldsInternal() doc from the
> NextCopyFromRawFields() doc something like the following?
>
> /*
>  * See NextCopyFromRawFieldsInternal() for details.
>  */
> bool
> NextCopyFromRawFields(...)
> {
>         ...
> }
>
> /*
>  * Workhorse for NextCopyFromRawFields().
>  *
>  * Read raw fields ...
>  *
>  * ...
>  */
> static pg_attribute_always_inline bool
> NextCopyFromRawFieldsInternal()
> {
>         ...
> }
>
> Agreed.
>
> I've updated the patch. I'm going to push it, barring any objections
> and further comments.
>
>
>
> I'm OK either way - I have made changes to adapt to the API change, and tested them.
>

Thank you for the confirmation. Pushed the fix.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com