Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id CAD21AoBa0Wm3C2H12jaqkvLidP2zEhsC+gf=3w7XiA4LQnvx0g@mail.gmail.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Mon, Jul 28, 2025 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jul 18, 2025 at 3:05 AM Sutou Kouhei <kou@clear-code.com> wrote:
> >
> > Hi,
> >
> > In <CAD21AoAZL2RzPM4RLOJKm_73z5LXq2_VOVF+S+T0tnbjHdWTFA@mail.gmail.com>
> >   "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 17 Jul 2025 13:44:11 -0700,
> >   Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >> > How about adding accessors instead of splitting
> > >> > Copy{From,To}State to Copy{From,To}ExecutionData? If we use
> > >> > the accessors approach, we can export only needed
> > >> > information step by step without breaking ABI.
> > >
> > > Yeah, while it can export required fields without breaking ABI, I'm
> > > concerned that setter and getter functions could be bloated if we need
> > > to have them for many fields.
> >
> > In general, I choose this approach in my projects even when
> > I need to define many accessors. Because I can hide
> > implementation details from users. I can change
> > implementation details without breaking API/ABI.
> >
> > But PostgreSQL isn't my project. Is there any guideline for
> > PostgreSQL API(/ABI?) design that we can refer for this
> > case?
>
> As far as I know there is no official guideline for PostgreSQL API and
> ABI design, but I've never seen structs having more than 10 getter and
> setter in PostgreSQL source code.
>
> >
> > FYI: We need to export at least the following fields:
> >
> >
https://www.postgresql.org/message-id/flat/20250714.173803.865595983884510428.kou%40clear-code.com#78fdbccf89742f856aa2cf95eaf42032
> >
> > > FROM:
> > >
> > > - attnumlist (*)
> > > - bytes_processed
> > > - cur_attname
> > > - escontext
> > > - in_functions (*)
> > > - input_buf
> > > - input_reached_eof
> > > - line_buf
> > > - opts (*)
> > > - raw_buf
> > > - raw_buf_index
> > > - raw_buf_len
> > > - rel (*)
> > > - typioparams (*)
> > >
> > > TO:
> > >
> > > - attnumlist (*)
> > > - fe_msgbuf
> > > - opts (*)
>
> I think we can think of the minimum list of fields that we need to
> expose. For instance, fields used for buffered reads for COPY FROM
> such as input_buf and raw_buf related fields don't necessarily need to
> be exposed as extension can implement it in its own way. We can start
> with the supporting simple copy format extensions like that read and
> parse the binary data from the data source and fill 'values' and
> 'nulls' arrays as output. Considering these facts, it might be
> sufficient for copy format extensions if they could access 'rel',
> 'attnumlist', and 'opts' in both COPY FROM and COPY TO (and
> CopyFromErrorCallback related fields for COPY FROM).
>
> Apart from this, we might want to reorganize CopyFromStateData fields
> and CopyToStateData fields since they have mixed fields of general
> purpose fields for COPY operations (e.g., num_defaults, whereClause,
> and range_table) and built-in format specific fields (e.g., line_buf
> and input_buf). Text and CSV formats are using some fields for parsing
> fields with buffered reads so one idea is that we move related fields
> to another struct so that both built-in formats (text and CSV) and
> external extensions that want to use the buffered reads for text
> parsing can use this functionality.

So probably it might be worth refactoring the codes in terms of:

1. hiding internal data from format callbacks
2. separating format-specific fields from the main state data.

I categorized the fields in CopyFromStateData. I think there are
roughly three different kind of fields mixed there:

1. fields used only the core (not by format callback)
- filename
- is_program
- whereClause
- cur_relname
- copycontext
- defmap
- num_defaults
- volatile_defexprs
- range_table
- rtrperminfos
- qualexpr
- transition_capture

2. fields used by both the core and format callbacks
- rel
- attnumlist
- cur_lineno
- cur_attname
- cur_attval
- relname_only
- num_errors
- opts
- in_functions
- typioparams
- escontext
- defexprs
- Input-related fields
    - copy_src
    - copy_file
    - fe_msgbuf
    - data_source_cb
    - byteprocessed

3. built-in format specific fields (mostly for text and csv)
- eol_type
- defaults
- Encoding related fields
    - file_encoding
    - need_transcoding
    - conversion_proc
- convert_select_flags
- raw data pointers
    - max_fields
    - raw_fields
- attribute_buf
- line_buf related fields
    - line_buf
    - line_buf_valid
- input_buf related fields
    - input_buf
    - input_buf_index
    - input_buf_len
    - input_reached_eof
    - input_reached_error
- raw_buf related fields
    - raw_buf
    - raw_buf_index
    - raw_buf_len
    - raw_reached_eof

The fields in 1 are mostly static fields, and the fields in 2 and 3
are likely to be accessed in hot functions during COPY FROM. Would it
be a good idea to restructure these fields so that we can hide the
fields in 1 from callback functions and having the fields in 3 in a
separate format-specific struct that can be accessed via an opaque
pointer? But could the latter change potentially cause performance
overheads?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: [Patch] add new parameter to pg_replication_origin_session_setup
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Conflict detection for update_deleted in logical replication