Re: Parallel copy - Mailing list pgsql-hackers

From: Ashutosh Sharma
Subject: Re: Parallel copy
Msg-id: CAE9k0P=hANHFodoycfA--CtFMJGkPfaUBqa_WQsxV_dYjU8Ecw@mail.gmail.com
In response to: Re: Parallel copy (vignesh C <vignesh21@gmail.com>)
List: pgsql-hackers

Hi Vignesh,

Thanks for the updated patches. Here are some more comments from my
review of the latest patches:

+/*
+ * This structure helps in storing the common data from CopyStateData that are
+ * required by the workers. This information will then be allocated and stored
+ * into the DSM for the worker to retrieve and copy it to CopyStateData.
+ */
+typedef struct SerializedParallelCopyState
+{
+   /* low-level state data */
+   CopyDest    copy_dest;      /* type of copy source/destination */
+   int         file_encoding;  /* file or remote side's character encoding */
+   bool        need_transcoding;   /* file encoding diff from server? */
+   bool        encoding_embeds_ascii;  /* ASCII can be non-first byte? */
+
...
...
+
+   /* Working state for COPY FROM */
+   AttrNumber  num_defaults;
+   Oid         relid;
+} SerializedParallelCopyState;

Could the above structure instead be made a part of the CopyStateData
structure? I am asking because all of the fields in it are already
present in CopyStateData, so embedding it there would reduce the code
duplication and also make CopyStateData a bit shorter.
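
For example, something along these lines (just a rough sketch to
illustrate the idea; the member name "shared" and the comments are only
placeholders):

typedef struct CopyStateData
{
    /*
     * State that must be shared with the parallel copy workers, kept
     * together so that it can be copied into the DSM in one step.
     */
    SerializedParallelCopyState shared;

    /* ... the remaining CopyStateData fields stay as they are ... */
} CopyStateData;

The shared fields would then be accessed as cstate->shared.copy_dest
and so on, and serializing them into the DSM could become a plain copy
of cstate->shared.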

--

+           pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
+                                    relid);

Do we need to pass cstate->nworkers and relid to BeginParallelCopy()
when we are already passing the cstate structure, from which both of
these can be retrieved?
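
That is, the call could simply become something like (assuming
cstate->nworkers and cstate->rel are both set at this point):

    pcxt = BeginParallelCopy(cstate, stmt->attlist);

with BeginParallelCopy() using cstate->nworkers and
RelationGetRelid(cstate->rel) internally.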

--

+/* DSM keys for parallel copy.  */
+#define PARALLEL_COPY_KEY_SHARED_INFO              1
+#define PARALLEL_COPY_KEY_CSTATE                   2
+#define PARALLEL_COPY_WAL_USAGE                    3
+#define PARALLEL_COPY_BUFFER_USAGE                 4

DSM key names do not appear to be consistent. For shared info and
cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY",
but for WalUsage and BufferUsage structures, it is prefixed with
"PARALLEL_COPY". I think it would be better to make them consistent.

--

    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
    {
        /*
         * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
         * triggers on the table. Such triggers might query the table we're
         * inserting into and act differently if the tuples that have already
         * been processed and prepared for insertion are not there.
         */
        insertMethod = CIM_SINGLE;
    }
    else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
             resultRelInfo->ri_TrigDesc->trig_insert_new_table)
    {
        /*
         * For partitioned tables we can't support multi-inserts when there
         * are any statement level insert triggers. It might be possible to
         * allow partitioned tables with such triggers in the future, but for
         * now, CopyMultiInsertInfoFlush expects that any before row insert
         * and statement level insert triggers are on the same relation.
         */
        insertMethod = CIM_SINGLE;
    }
    else if (resultRelInfo->ri_FdwRoutine != NULL ||
             cstate->volatile_defexprs)
    {
...
...

I think, if possible, all these if-else checks in CopyFrom() could be
moved into a single function, perhaps named IdentifyCopyInsertMethod(),
which could then also be called from IsParallelCopyAllowed(). That way,
once the leader has performed these checks for a parallel copy, the
workers won't have to repeat them, and I feel it would also make the
code look a bit cleaner.
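
Roughly along these lines (just a sketch of the shape I have in mind;
the exact parameter list and the remaining checks would need to be
filled in from CopyFrom()):

static CopyInsertMethod
IdentifyCopyInsertMethod(CopyState cstate, ResultRelInfo *resultRelInfo,
                         PartitionTupleRouting *proute)
{
    /* BEFORE/INSTEAD OF row triggers force single inserts. */
    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
        return CIM_SINGLE;

    /* Likewise for partitioned tables with statement-level insert triggers. */
    if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
        resultRelInfo->ri_TrigDesc->trig_insert_new_table)
        return CIM_SINGLE;

    /* Likewise for foreign tables and volatile default expressions. */
    if (resultRelInfo->ri_FdwRoutine != NULL ||
        cstate->volatile_defexprs)
        return CIM_SINGLE;

    /* ... remaining checks from CopyFrom() ... */

    return (proute != NULL) ? CIM_MULTI_CONDITIONAL : CIM_MULTI;
}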

--

+void
+ParallelCopyMain(dsm_segment *seg, shm_toc *toc)
+{
...
...
+   InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+                         &walusage[ParallelWorkerNumber]);
+
+   MemoryContextSwitchTo(oldcontext);
+   pfree(cstate);
+   return;
+}

It seems like you also need to delete the memory context
(cstate->copycontext) here.
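
That is, something like (mirroring what EndCopy() does):

    MemoryContextSwitchTo(oldcontext);
    MemoryContextDelete(cstate->copycontext);
    pfree(cstate);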

--

+void
+ExecBeforeStmtTrigger(CopyState cstate)
+{
+   EState     *estate = CreateExecutorState();
+   ResultRelInfo *resultRelInfo;

This function has a lot of comments that have been copied as-is from
the CopyFrom() function. I think it would be good to remove those
comments from here and instead mention that the code in this function
has been taken from CopyFrom(), so that anyone with questions can refer
to CopyFrom(). This will again avoid unnecessary duplication in the
patch.

--

As Heikki rightly pointed out in his previous email, we need some
high-level description of how parallel copy works somewhere in the
copyparallel.c file. For reference, please see how a brief description
of parallel vacuum has been added in the vacuumlazy.c file:

 * Lazy vacuum supports parallel execution with parallel worker processes.  In
 * a parallel vacuum, we perform both index vacuum and index cleanup with
 * parallel worker processes.  Individual indexes are processed by one vacuum
...
...

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> > >
> > > Hi Vignesh,
> > >
> > > After having a look over the patch,
> > > I have some suggestions for
> > > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> > >
> > > 1.
> > >
> > > +static uint32
> > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > > +                                  char **whereClauseStr, char **rangeTableStr,
> > > +                                  char **attnameListStr, char **notnullListStr,
> > > +                                  char **nullListStr, char **convertListStr)
> > > +{
> > > +       uint32          strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > > +
> > > +       strsize += EstimateStringSize(cstate->null_print);
> > > +       strsize += EstimateStringSize(cstate->delim);
> > > +       strsize += EstimateStringSize(cstate->quote);
> > > +       strsize += EstimateStringSize(cstate->escape);
> > >
> > >
> > > It uses the EstimateStringSize function to get the strlen of null_print, delim, quote and escape.
> > > But the length of null_print already seems to be stored in null_print_len,
> > > and delim/quote/escape must be 1 byte, so calling strlen again seems unnecessary.
> > >
> > > How about  " strsize += sizeof(uint32) + cstate->null_print_len + 1"
> > >
> >
> > +1. This seems like a good suggestion but add comments for
> > delim/quote/escape to indicate that we are considering one-byte for
> > each. I think this will obviate the need of function
> > EstimateStringSize. Another thing in this regard is that we normally
> > use add_size function to compute the size but I don't see that being
> > used in this and nearby computation. That helps us to detect overflow
> > of addition if any.
> >
> > EstimateCstateSize()
> > {
> > ..
> > +
> > + strsize++;
> > ..
> > }
> >
> > Why do we need this additional one-byte increment? Does it make sense
> > to add a small comment for the same?
> >
>
> Changed it to handle null_print, delim, quote & escape accordingly in
> the attached patch. The one-byte increment is not required; I have
> removed it.
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com


