Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id ZcbLIf4ezfiw6xf4@paquier.xyz
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Feb 09, 2024 at 11:27:05AM -0800, Andres Freund wrote:
> On 2024-02-09 13:21:34 +0900, Michael Paquier wrote:
>> +static void
>> +CopyFromTextInFunc(CopyFromState cstate, Oid atttypid,
>> +                   FmgrInfo *finfo, Oid *typioparam)
>> +{
>> +    Oid            func_oid;
>> +
>> +    getTypeInputInfo(atttypid, &func_oid, typioparam);
>> +    fmgr_info(func_oid, finfo);
>> +}
>
> FWIW, we should really change the copy code to initialize FunctionCallInfoData
> instead of re-initializing that on every call, realy makes a difference
> performance wise.

You mean to initialize once its memory and let the internal routines
call InitFunctionCallInfoData for each attribute.  Sounds like a good
idea, doing that for HEAD before the main patch.  More impact with
more attributes.

>> +/*
>> + * CopyFromTextStart
>> + *
>> + * Start of COPY FROM for text/CSV format.
>> + */
>> +static void
>> +CopyFromTextStart(CopyFromState cstate, TupleDesc tupDesc)
>> +{
>> +    AttrNumber    attr_count;
>> +
>> +    /*
>> +     * If encoding conversion is needed, we need another buffer to hold the
>> +     * converted input data.  Otherwise, we can just point input_buf to the
>> +     * same buffer as raw_buf.
>> +     */
>> +    if (cstate->need_transcoding)
>> +    {
>> +        cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);
>> +        cstate->input_buf_index = cstate->input_buf_len = 0;
>> +    }
>> +    else
>> +        cstate->input_buf = cstate->raw_buf;
>> +    cstate->input_reached_eof = false;
>> +
>> +    initStringInfo(&cstate->line_buf);
>
> Seems kinda odd that we have a supposedly extensible API that then stores all
> this stuff in the non-extensible CopyFromState.

That relates to the introduction of the the opaque pointer mentioned
upthread to point to a per-format structure, where we'd store data
specific to each format.

>> +    /* create workspace for CopyReadAttributes results */
>> +    attr_count = list_length(cstate->attnumlist);
>> +    cstate->max_fields = attr_count;
>
> Why is this here? This seems like generic code, not text format specific.

We don't care about that for binary.

>> +/*
>> + * CopyFromTextOneRow
>> + *
>> + * Copy one row to a set of `values` and `nulls` for the text and CSV
>> + * formats.
>> + */
>
> I'm very doubtful it's a good idea to combine text and CSV here. They have
> basically no shared parsing code, so what's the point in sending them through
> one input routine?

The code shared between text and csv involves a path called once per
attribute.  TBH, I am not sure how much of the NULL handling should be
put outside the per-row routine as these options are embedded in the
core options.  So I don't have a better idea on this one than what's
proposed here if we cannot dispatch the routine calls once per
attribute.

>> +    /* read raw fields in the next line */
>> +    if (!NextCopyFromRawFields(cstate, &field_strings, &fldct))
>> +        return false;
>> +
>> +    /* check for overflowing fields */
>> +    if (attr_count > 0 && fldct > attr_count)
>> +        ereport(ERROR,
>> +                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
>> +                 errmsg("extra data after last expected column")));
>
> It bothers me that we look to be ending up with different error handling
> across the various output formats, particularly if they're ending up in
> extensions. That'll make it harder to evolve this code in the future.

But different formats may have different requirements, including the
number of attributes detected vs expected.  That was not really
nothing me.

>> +        if (cstate->opts.csv_mode)
>> +        {
>
> More unfortunate intermingling of multiple formats in a single
> routine.

Similar answer as a few paragraphs above.  Sutou-san was suggesting to
use an internal routine with fixed arguments instead, which would be
enough at the end with some inline instructions?

>> +
>> +        if (cstate->defaults[m])
>> +        {
>> +            /*
>> +             * The caller must supply econtext and have switched into the
>> +             * per-tuple memory context in it.
>> +             */
>> +            Assert(econtext != NULL);
>> +            Assert(CurrentMemoryContext == econtext->ecxt_per_tuple_memory);
>> +
>> +            values[m] = ExecEvalExpr(defexprs[m], econtext, &nulls[m]);
>> +        }
>
> I don't think it's good that we end up with this code in different copy
> implementations.

Yeah, still we don't care about that for binary.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Simplify documentation related to Windows builds
Next
From: Michael Paquier
Date:
Subject: Re: Small fix on query_id_enabled