Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id 20240130.144531.1257430878438173740.kou@clear-code.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
Hi,

In <CAD21AoBmNiWwrspuedgAPgbAqsn7e7NoZYF6gNnYBf+gXEk9Mg@mail.gmail.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 30 Jan 2024 11:11:59 +0900,
  Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> ---
> +        if (!format_specified)
> +                /* Set the default format. */
> +                ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL);
> +
> 
> I think we can pass "text" in this case instead of NULL. That way,
> ProcessCopyOptionFormatTo doesn't need to handle NULL case.

Yes, we can do it. But it needs a DefElem allocation. Is it
acceptable?

> We need curly brackets for this "if branch" as follows:
> 
> if (!format_specifed)
> {
>     /* Set the default format. */
>     ProcessCopyOptionFormatTo(pstate, opts_out, cstate, NULL);
> }

Oh, sorry. I assumed that pgindent adjusts the style too.

> ---
> +        /* Process not built-in options. */
> +        foreach(option, unknown_options)
> +        {
> +                DefElem    *defel = lfirst_node(DefElem, option);
> +                bool           processed = false;
> +
> +                if (!is_from)
> +                        processed =
> opts_out->to_routine->CopyToProcessOption(cstate, defel);
> +                if (!processed)
> +                        ereport(ERROR,
> +                                        (errcode(ERRCODE_SYNTAX_ERROR),
> +                                         errmsg("option \"%s\" not recognized",
> +                                                        defel->defname),
> +                                         parser_errposition(pstate,
> defel->location)));
> +        }
> +        list_free(unknown_options);
> 
> I think we can check the duplicated options in the core as we discussed.

Oh, sorry. I missed the part. I'll implement it.

> ---
> +static void
> +CopyToTextBasedInit(CopyToState cstate)
> +{
> +}
> 
> and
> 
> +static void
> +CopyToBinaryInit(CopyToState cstate)
> +{
> +}
> 
> Do we really need separate callbacks for the same behavior? I think we
> can have a common init function say CopyToBuitinInit() that does
> nothing. Or we can make the init callback optional.
> 
> The same is true for process-option callback.

OK. I'll make them optional.

> ---
>          List      *convert_select; /* list of column names (can be NIL) */
> +        const          CopyToRoutine *to_routine;      /* callback
> routines for COPY TO */
>  } CopyFormatOptions;
> 
> I think CopyToStateData is a better place to have CopyToRoutine.
> copy_data_dest_cb is also there.

We can do it but ProcessCopyOptions() accepts NULL
CopyToState for file_fdw. Can we create an empty
CopyToStateData internally like we did for opts_out in
ProcessCopyOptions()? (But it requires exporting
CopyToStateData. We'll export it in a later patch but it's
not yet in 0001.)

> The 0002 patch replaces the options checks with
> ProcessCopyOptionFormatFrom(). However, both
> ProcessCopyOptionFormatTo() and ProcessCOpyOptionFormatFrom() would
> set format-related options such as opts_out->csv_mode etc, which seems
> not elegant. IIUC the reason why we process only the "format" option
> first is to set the callback functions and call the init callback. So
> I think we don't necessarily need to do both setting callbacks and
> setting format-related options together. Probably we can do only the
> callback stuff first and then set format-related options in the
> original place we used to do?

If we do it, we need to write the (strcmp(format, "csv") ==
0) condition in copyto.c and copy.c. I wanted to avoid it. I
think that the duplication (setting opts_out->csv_mode in
copyto.c and copyfrom.c) is not a problem. But it's not a
strong opinion. If (strcmp(format, "csv") == 0) duplication
is better than opts_out->csv_mode = true duplication, I'll
do it.

BTW, if we want to make the CSV format implementation more
modularized, we will remove opts_out->csv_mode, move CSV
related options to CopyToCSVProcessOption() and keep CSV
related options in its opaque space. For example,
opts_out->force_quote exists in COPY TO opaque space but
doesn't exist in COPY FROM opaque space because it's not
used in COPY FROM.


> +static void
> +CopyToTextBasedFillCopyOutResponse(CopyToState cstate, StringInfoData *buf)
> +{
> +        int16          format = 0;
> +        int                    natts = list_length(cstate->attnumlist);
> +        int                    i;
> +
> +        pq_sendbyte(buf, format);      /* overall format */
> +        pq_sendint16(buf, natts);
> +        for (i = 0; i < natts; i++)
> +                pq_sendint16(buf, format);     /* per-column formats */
> +}
> 
> This function and CopyToBinaryFillCopyOutResponse() fill three things:
> overall format, the number of columns, and per-column formats. While
> this approach is flexible, extensions will have to understand the
> format of CopyOutResponse message. An alternative is to have one or
> more callbacks that return these three things.

Yes, we can choose the approach. I don't have a strong
opinion on which approach to choose.

> +        /* Get info about the columns we need to process. */
> +        cstate->out_functions = (FmgrInfo *) palloc(num_phys_attrs *
> sizeof(Fmgr\Info));
> +        foreach(cur, cstate->attnumlist)
> +        {
> +                int                    attnum = lfirst_int(cur);
> +                Oid                    out_func_oid;
> +                bool           isvarlena;
> +                Form_pg_attribute attr = TupleDescAttr(tupDesc, attnum - 1);
> +
> +                getTypeOutputInfo(attr->atttypid, &out_func_oid, &isvarlena);
> +                fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]);
> +        }
> 
> Is preparing the out functions an extension's responsibility? I
> thought the core could prepare them based on the overall format
> specified by extensions, as long as the overall format matches the
> actual data format to send. What do you think?

Hmm. I want to keep the preparation as an extension's
responsibility. Because it's not needed for all formats. For
example, Apache Arrow FORMAT doesn't need it. And JSON
FORMAT doesn't need it too because it use
composite_to_json().

> +        /*
> +         * Called when COPY TO via the PostgreSQL protocol is
> started. This must
> +         * fill buf as a valid CopyOutResponse message:
> +         *
> +         */
> +        /*--
> +         * +--------+--------+--------+--------+--------+   +--------+--------+
> +         * | Format | N attributes    | Attr1's format  |...| AttrN's format  |
> +         * +--------+--------+--------+--------+--------+   +--------+--------+
> +         * 0: text                      0: text               0: text
> +         * 1: binary                    1: binary             1: binary
> +         */
> 
> I think this kind of diagram could be missed from being updated when
> we update the CopyOutResponse format. It's better to refer to the
> documentation instead.

It makes sense. I couldn't find the documentation when I
wrote it but I found it now...:
https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY

Is there recommended comment style to refer a documentation?
"See doc/src/sgml/protocol.sgml for the CopyOutResponse
message details" is OK?


Thanks,
-- 
kou



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
Next
From: jian he
Date:
Subject: Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value