Thread: pgsql: Refactor COPY FROM to use format callback functions.
Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to 2e4127b6d2d, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/7717f63006935de00fafd000bff450280508adf1 Modified Files -------------- src/backend/commands/copyfrom.c | 192 ++++++++++--- src/backend/commands/copyfromparse.c | 446 +++++++++++++++++-------------- src/include/commands/copy.h | 2 - src/include/commands/copyapi.h | 50 +++- src/include/commands/copyfrom_internal.h | 11 + src/tools/pgindent/typedefs.list | 1 + 6 files changed, 459 insertions(+), 243 deletions(-)
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to 2e4127b6d2d, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: > > Refactor COPY FROM to use format callback functions. > > This commit introduces a new CopyFromRoutine struct, which is a set of > callback routines to read tuples in a specific format. It also makes > COPY FROM with the existing formats (text, CSV, and binary) utilize > these format callbacks. > > This change is a preliminary step towards making the COPY FROM command > extensible in terms of input formats. > > Similar to 2e4127b6d2d, this refactoring contributes to a performance > improvement by reducing the number of "if" branches that need to be > checked on a per-row basis when sending field representations in text > or CSV mode. The performance benchmark results showed ~5% performance > gain in text or CSV mode. > > Author: Sutou Kouhei <kou@clear-code.com> > Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> > Reviewed-by: Michael Paquier <michael@paquier.xyz> > Reviewed-by: Andres Freund <andres@anarazel.de> > Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> > Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> > Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com > > > > This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API isnot a good thing. > Thank you for pointing it out. I've just posted my analysis[1] and am planning to revive that API (Sutou-san already proposed an idea). Could you please check if the idea would work for file_text_array_fdw? Regards, [1] https://www.postgresql.org/message-id/CAD21AoDr13%3Ddx%2Bk8gmQnR5_bY%2BNskyN4mbSWN0KhQncL6xuPMA%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote: > On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: >> >> Refactor COPY FROM to use format callback functions. >> >> This commit introduces a new CopyFromRoutine struct, which is a set of >> callback routines to read tuples in a specific format. It also makes >> COPY FROM with the existing formats (text, CSV, and binary) utilize >> these format callbacks. >> >> This change is a preliminary step towards making the COPY FROM command >> extensible in terms of input formats. >> >> Similar to 2e4127b6d2d, this refactoring contributes to a performance >> improvement by reducing the number of "if" branches that need to be >> checked on a per-row basis when sending field representations in text >> or CSV mode. The performance benchmark results showed ~5% performance >> gain in text or CSV mode. >> >> Author: Sutou Kouhei <kou@clear-code.com> >> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> >> Reviewed-by: Michael Paquier <michael@paquier.xyz> >> Reviewed-by: Andres Freund <andres@anarazel.de> >> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> >> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> >> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com >> >> >> >> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API isnot a good thing. >> > Thank you for pointing it out. > > I've just posted my analysis[1] and am planning to revive that API > (Sutou-san already proposed an idea). Could you please check if the > idea would work for file_text_array_fdw? > Looks OK, I think. You could even use the Internal function further down in the file and avoid a function call. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 12:14 PM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote: > > On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: > >> > >> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: > >> > >> Refactor COPY FROM to use format callback functions. > >> > >> This commit introduces a new CopyFromRoutine struct, which is a set of > >> callback routines to read tuples in a specific format. It also makes > >> COPY FROM with the existing formats (text, CSV, and binary) utilize > >> these format callbacks. > >> > >> This change is a preliminary step towards making the COPY FROM command > >> extensible in terms of input formats. > >> > >> Similar to 2e4127b6d2d, this refactoring contributes to a performance > >> improvement by reducing the number of "if" branches that need to be > >> checked on a per-row basis when sending field representations in text > >> or CSV mode. The performance benchmark results showed ~5% performance > >> gain in text or CSV mode. > >> > >> Author: Sutou Kouhei <kou@clear-code.com> > >> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> > >> Reviewed-by: Michael Paquier <michael@paquier.xyz> > >> Reviewed-by: Andres Freund <andres@anarazel.de> > >> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> > >> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> > >> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com > >> > >> > >> > >> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from APIis not a good thing. > >> > > Thank you for pointing it out. > > > > I've just posted my analysis[1] and am planning to revive that API > > (Sutou-san already proposed an idea). Could you please check if the > > idea would work for file_text_array_fdw? > > > > Looks OK, I think. You could even use the Internal function further down > in the file and avoid a function call. Right. I've attached the updated patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
Hi, Thanks for following up the patch! In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com> "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Right. I've attached the updated patch. In general, this approach will work but could you keep the same signature for backward compatibility? > --- a/src/backend/commands/copyfromparse.c > +++ b/src/backend/commands/copyfromparse.c > +bool > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > +{ > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > +} NextCopyFromRawFields() uses NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) (no "bool is_csv") before we remove it. So could you use the no "bool is_csv" signature for backward compatibility? bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) { return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode); } > @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) > /* > * Read raw fields in the next line for COPY FROM in text or csv mode. > * Return false if no more lines. > + */ > +bool > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > +{ > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > +} > + > +/* > + * Workhorse for NextCopyFromRawFields(). It may be better that we don't split docs for NextCopyFromRawFields() and NextCopyFromRawFieldsInternal(). How about referring the NextCopyFromRawFieldsInternal() doc from the NextCopyFromRawFields() doc something like the following? /* * See NextCopyFromRawFieldsInternal() for details. */ bool NextCopyFromRawFields(...) { ... } /* * Workhorse for NextCopyFromRawFields(). * * Read raw fields ... * * ... */ static pg_attribute_always_inline bool NextCopyFromRawFieldsInternal() { ... } Thanks, -- kou
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote: > > Hi, > > Thanks for following up the patch! > > In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com> > "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800, > Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Right. I've attached the updated patch. > > In general, this approach will work but could you keep > the same signature for backward compatibility? > > > --- a/src/backend/commands/copyfromparse.c > > +++ b/src/backend/commands/copyfromparse.c > > > +bool > > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > > +{ > > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > > +} > > NextCopyFromRawFields() uses > NextCopyFromRawFields(CopyFromState cstate, char ***fields, > int *nfields) (no "bool is_csv") before we remove it. So > could you use the no "bool is_csv" signature for backward > compatibility? > > bool > NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) > { > return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode); > } I like this idea. > > > > @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) > > /* > > * Read raw fields in the next line for COPY FROM in text or csv mode. > > * Return false if no more lines. > > + */ > > +bool > > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > > +{ > > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > > +} > > + > > +/* > > + * Workhorse for NextCopyFromRawFields(). > > It may be better that we don't split docs for > NextCopyFromRawFields() and > NextCopyFromRawFieldsInternal(). How about referring the > NextCopyFromRawFieldsInternal() doc from the > NextCopyFromRawFields() doc something like the following? > > /* > * See NextCopyFromRawFieldsInternal() for details. > */ > bool > NextCopyFromRawFields(...) > { > ... > } > > /* > * Workhorse for NextCopyFromRawFields(). > * > * Read raw fields ... > * > * ... > */ > static pg_attribute_always_inline bool > NextCopyFromRawFieldsInternal() > { > ... > } > Agreed. I've updated the patch. I'm going to push it, barring any objections and further comments. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:Hi, Thanks for following up the patch! In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com> "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800, Masahiko Sawada <sawada.mshk@gmail.com> wrote:Right. I've attached the updated patch.In general, this approach will work but could you keep the same signature for backward compatibility?--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c+bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}NextCopyFromRawFields() uses NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) (no "bool is_csv") before we remove it. So could you use the no "bool is_csv" signature for backward compatibility? bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) { return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode); }I like this idea.@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().It may be better that we don't split docs for NextCopyFromRawFields() and NextCopyFromRawFieldsInternal(). How about referring the NextCopyFromRawFieldsInternal() doc from the NextCopyFromRawFields() doc something like the following? /* * See NextCopyFromRawFieldsInternal() for details. */ bool NextCopyFromRawFields(...) { ... } /* * Workhorse for NextCopyFromRawFields(). * * Read raw fields ... * * ... */ static pg_attribute_always_inline bool NextCopyFromRawFieldsInternal() { ... }Agreed. I've updated the patch. I'm going to push it, barring any objections and further comments.
I'm OK either way - I have made changes to adapt to the API change, and tested them.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 3:06 PM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote: > > On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote: > > Hi, > > Thanks for following up the patch! > > In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com> > "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800, > Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Right. I've attached the updated patch. > > In general, this approach will work but could you keep > the same signature for backward compatibility? > > --- a/src/backend/commands/copyfromparse.c > +++ b/src/backend/commands/copyfromparse.c > > +bool > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > +{ > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > +} > > NextCopyFromRawFields() uses > NextCopyFromRawFields(CopyFromState cstate, char ***fields, > int *nfields) (no "bool is_csv") before we remove it. So > could you use the no "bool is_csv" signature for backward > compatibility? > > bool > NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) > { > return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode); > } > > I like this idea. > > > @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) > /* > * Read raw fields in the next line for COPY FROM in text or csv mode. > * Return false if no more lines. > + */ > +bool > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) > +{ > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); > +} > + > +/* > + * Workhorse for NextCopyFromRawFields(). > > It may be better that we don't split docs for > NextCopyFromRawFields() and > NextCopyFromRawFieldsInternal(). How about referring the > NextCopyFromRawFieldsInternal() doc from the > NextCopyFromRawFields() doc something like the following? > > /* > * See NextCopyFromRawFieldsInternal() for details. > */ > bool > NextCopyFromRawFields(...) > { > ... > } > > /* > * Workhorse for NextCopyFromRawFields(). > * > * Read raw fields ... > * > * ... > */ > static pg_attribute_always_inline bool > NextCopyFromRawFieldsInternal() > { > ... > } > > Agreed. > > I've updated the patch. I'm going to push it, barring any objections > and further comments. > > > > I'm OK either way - I have made changes to adapt to the API change, and tested them. > Thank you for the confirmation. Pushed the fix. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com