Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id 20251117.154013.2019068769091256791.kou@clear-code.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,

In <CAD21AoDCEfe0PQhMEx8G1rpS7RrzGCJPobeqm3Mpn2bgbUH9nQ@mail.gmail.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 14 Nov 2025 12:19:47 -0800,
  Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> This thread has involved extensive discussion, and the patch needs to
> be rebased. I'd like to summarize the current status of this patch and
> our discussions. I've attached updated patches that implement the
> whole ideas of this feature to help provide a clearer overall picture.

Thanks!

I'm not sure whether we should include option parsing
feature to this patch's scope or not but I'm OK with this
approach.

Here are my review comments but they are minor
comments. They don't require design change.

0001:

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index cef452584e5..6a0a66507ba 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -155,6 +120,7 @@ static void CopySendInt16(CopyToState cstate, int16 val);
 
 /* text format */
 static const CopyToRoutine CopyToRoutineText = {
+    .CopyToEstimateStateSpace = CopyToEstimateStateTextLike,

How about including "Space"
(CopyToEstimateStateSpaceTextLike)?

How about renaming this to
"CopyToTextLikeEstimateStateSpace" because other functions
use "CopyToTextLikeXXX" style.

@@ -171,62 +138,77 @@ static const CopyToRoutine CopyToRoutineCSV = {
...
 /* Implementation of the start callback for text and CSV formats */
 static void
-CopyToTextLikeStart(CopyToState cstate, TupleDesc tupDesc)
+CopyToTextLikeStart(CopyToState ccstate, TupleDesc tupDesc)
 {
+    CopyToStateTextLike *cstate = (CopyToStateTextLike *) ccstate;

How about always using "cstate" for "CopyToState"? In other
functions in this patch, "CopyToState" is referred "cstate"
or "cctate".

We can use "state" or something for
"CopyToStateTextLike". (0002 uses "state" for
"CopyFromStateBuiltins".)

+        cstate->base.opts.null_print_client = pg_server_to_any(cstate->base.opts.null_print,
+                                                               cstate->base.opts.null_print_len,
+                                                               cstate->file_encoding);

We can use "ccstate->" instead of "cstate->base." in this
function.

@@ -614,6 +603,54 @@ EndCopy(CopyToState cstate)
     pfree(cstate);
 }
 
+/*
+ * Allocate COPY TO state data based on the format's EsimateStateSpace
+ * callback.
+ */
+static CopyToState
+create_copyto_state(ParseState *pstate, List *options)

"Esimate" ->
"Estimate"

How about using CamelCase like "CreateCopyToState" for
function name like other functions in this file?

+    Size        req_size;

"state_size" may be better.

@@ -1233,16 +1246,16 @@ CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)
...
 static void
-CopyAttributeOutText(CopyToState cstate, const char *string)
+CopyAttributeOutText(CopyToStateTextLike * cstate, const char *string)
 {

CopyAttributeOutText(CopyToState cstate, const char *string)
{
    CopyToStateTextLike *state = (CopyToStateTextLike *)cstate;

may reduce "cstate->base." and "(CopyToState) cstate" in
this function.

BTW, could you remove a needless space after "*" in
"CopyToStateTextLike * cstate"?


0002:

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 12781963b4f..0c51e5ba5e1 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -129,6 +131,7 @@ static void CopyFromBinaryEnd(CopyFromState cstate);
 
 /* text format */
 static const CopyFromRoutine CopyFromRoutineText = {
+    .CopyFromEstimateStateSpace = CopyFromBuiltinsEstimateSpace,

How about including "State"
("CopyFromBuiltinsEstimateStateSpace")?

@@ -145,54 +149,129 @@ static const CopyFromRoutine CopyFromRoutineCSV = {
...
+/*
+ * Common routine to initialize CopyFromStateBuiltins data.
+ */
+static void
+initialize_copyfrom_bultins_state(CopyFromStateBuiltins * state, TupleDesc tupDesc)

* How about using CamelCase like other functions in this file?
* How about using the same name as the struct?

InitializeCopyFromStateBuiltins?

@@ -1379,8 +1462,8 @@ CopyFrom(CopyFromState cstate)
                     /* Add this tuple to the tuple buffer */
                     CopyMultiInsertInfoStore(&multiInsertInfo,
                                              resultRelInfo, myslot,
-                                             cstate->line_buf.len,
-                                             cstate->cur_lineno);
+                                             rowinfo.tuplen,
+                                             rowinfo.lineno);

How about passing "CopyFromRowInfo *" instead of
"rowinfo.tuplen" and "rowinfo.lineno"?

@@ -1512,6 +1595,50 @@ CopyFrom(CopyFromState cstate)
     return processed;
 }
 
+static CopyFromState
+create_copyfrom_state(ParseState *pstate, List *options)

How about using CamelCase like other functions in this file?

CreateCopyFromState?

@@ -1138,19 +1156,29 @@ CopyFromBinaryOneRow(CopyFromState cstate, ExprContext *econtext, Datum *values,
...
+    if (rowinfo)
+    {
+        /*
+         * XXX: We used to use line_buf.len but we don't actually use line_buf
+         * in binary format.
+         */
+        rowinfo->lineno = cstate->base.cur_lineno;
+        rowinfo->tuplen = cstate->line_buf.len;
     }

How about always setting "0" to "rowinfo->tuplen" instead of
using "cstate->line_buf.len"?

diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index d75a70715a4..30a1d2bff6e 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -48,6 +48,12 @@ typedef enum CopyLogVerbosityChoice
...
+typedef struct CopyFromRowInfo
+{
+    uint64        lineno;
+    int            tuplen;
+}            CopyFromRowInfo;

I don't have a strong opinion nor alternative but I'm not
sure whether this name is suitable or not...


diff --git a/src/include/commands/copyapi.h b/src/include/commands/copyapi.h
index 3b9982d54b8..c3d2199a0b6 100644
--- a/src/include/commands/copyapi.h
+++ b/src/include/commands/copyapi.h
@@ -99,12 +104,11 @@ typedef struct CopyFromRoutine
      * Returns false if there are no more tuples to read.
      */
     bool        (*CopyFromOneRow) (CopyFromState cstate, ExprContext *econtext,
-                                   Datum *values, bool *nulls);
+                                   Datum *values, bool *nulls, CopyFromRowInfo * rowinfo);

Can we add some docstrings for "rowinfo"?


diff --git a/src/include/commands/copystate.h b/src/include/commands/copystate.h
index 7561083a323..145dccd0f8f 100644
--- a/src/include/commands/copystate.h
+++ b/src/include/commands/copystate.h
@@ -65,4 +67,99 @@ typedef struct CopyToStateData
...
+/*
+ * Represents the different source cases we need to worry about at
+ * the bottom level
+ */
+typedef enum CopySource
+{
+    COPY_FILE,                    /* from file (or a piped program) */
+    COPY_FRONTEND,                /* from frontend */
+    COPY_CALLBACK,                /* from callback function */
+} CopySource;

Can we use "COPY_SOURCE_" prefix instead of "COPY_" prefix
such as "COPY_SOURCE_FILE"?


0003:

diff --git a/src/backend/commands/copy_custom_format.c b/src/backend/commands/copy_custom_format.c
new file mode 100644
index 00000000000..8bef6e779ac
--- /dev/null
+++ b/src/backend/commands/copy_custom_format.c

+/*
+ * Returns true if the given custom format name is registered.
+ */
+bool
+FindCustomCopyFormat(const char *fmt_name)
+{
+    for (int i = 0; i < CopyCustomFormatAssigned; i++)
+    {
+        if (strcmp(CopyCustomFormatArray[i].fmt_name, fmt_name) == 0)
+            return true;
+    }
+
+    return false;
+}

How about using other word than "Find" here? I expect that
"FindXXX()" returns a found value instead of "whether found
or not" as bool.

CustomCopyFormatExists()?

+bool
+GetCustomCopyToRoutine(const char *fmt_name, const CopyToRoutine **to_routine)
+{
+    for (int i = 0; i < CopyCustomFormatAssigned; i++)
+    {
+        if (strcmp(CopyCustomFormatArray[i].fmt_name, fmt_name) == 0)
+        {
+            *to_routine = CopyCustomFormatArray[i].to_routine;
+            return true;
+        }
+    }
+
+    return false;
+}

How about returning "const CopyToRoutine *" instead of
"bool"? We can use "FindCustomCopyFormat()" whether the
given format name exists or not.

+bool
+GetCustomCopyFromRoutine(const char *fmt_name, const CopyFromRoutine **from_routine)

Ditto.

diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 30a1d2bff6e..82f07b05823 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -120,6 +120,12 @@ extern uint64 CopyFrom(CopyFromState cstate);
 
 extern DestReceiver *CreateCopyDestReceiver(void);
 
+extern void ProcessCopyBuiltinOptions(List *options, CopyFormatOptions *opts_out,
+                                      bool is_from, List **other_options, ParseState *pstate);

It seems that this change should exist in 0004.


0004:

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index b1b3ae141eb..ea31fa911f9 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -1570,3 +1571,32 @@ CreateCopyDestReceiver(void)
 
     return (DestReceiver *) self;
 }
+
+static void
+ProcessCopyToOptions(CopyToState cstate, List *options, ParseState *pstate)
+{
+    bool        temp_state = false;
+    List       *other_options = NIL;
+    CopyFormatOptions *opts;
+
+    if (cstate == NULL)
+    {
+        cstate = create_copyto_state(pstate, options);
+        temp_state = true;
+    }
+
+    opts = &cstate->opts;
+
+    ProcessCopyBuiltinOptions(options, opts, false, &other_options, pstate);
+
+    foreach_node(DefElem, option, options)

options ->
other_options

(This change exists in 0005.)


0006:

diff --git a/src/test/modules/test_copy_custom_format/test_copy_custom_format.c
b/src/test/modules/test_copy_custom_format/test_copy_custom_format.c
new file mode 100644
index 00000000000..a63390e875b
--- /dev/null
+++ b/src/test/modules/test_copy_custom_format/test_copy_custom_format.c


+static Size
+TestCopyToEsimateSpace(void)
+{
+    return sizeof(TestCopyToState);
+}
+
+static Size
+TestCopyFromEsimateSpace(void)
+{
+    return sizeof(TestCopyFromState);
+}

EsimateSpace ->
EstimateStateSpace

+    if (strcmp(option->defname, "common_int") == 0)
+    {
+        int            val = defGetInt32(option);
+
+        opt->common_int = val;
+
+        return true;
+    }
+    else if (strcmp(option->defname, "common_bool") == 0)
+    {
+        bool        val = defGetBoolean(option);
+
+        opt->common_bool = val;
+
+        return true;

We may not need to use local variables:

opt->common_int = defGetInt32(option);
opt->common_bool = defGetBoolean(option);


> One potential improvement would be adding support for random file
> access in COPY FROM operations. For example, with parquet files, it
> would be much more efficient to read the footer section first since it
> contains metadata, allowing selective reading of necessary file
> sections. The current sequential read API (CopyFromGetData()) requires
> reading all data to access the metadata.

This is outside the scope of this patch but I've created
a custom COPY format implementation for Apache Parquet:
https://github.com/kou/pg-copy-parquet

We need to start parsing from footer as mentioned above. So
the implementation reads all data before it starts parsing:

https://github.com/kou/pg-copy-parquet/blob/7da367ea81d8964f5045fe0b1514a798d4ecbbc7/copy_parquet.cc#L410-L434

If we have random access API, we don't need to read all
data. It improves performance.

Anyway, this is outside the scope of this patch. We can
discuss this in a separated thread after we merge this
patch.


Thanks,
-- 
kou



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Next
From: Chao Li
Date:
Subject: Re: CREATE/ALTER PUBLICATION improvements for syntax synopsis