Re: Parallel copy - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel copy
Date
Msg-id CAA4eK1LkGaad0Ucf8rR1MvxJfDD86od1s=FvWk4X6E2UeAq3Vg@mail.gmail.com
Whole thread Raw
In response to Re: Parallel copy  (Greg Nancarrow <gregn4422@gmail.com>)
Responses Re: Parallel copy  (Greg Nancarrow <gregn4422@gmail.com>)
Re: Parallel copy  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
>     v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
>     uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
>     memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption related to string
length is correct. If we can't ensure that then we need to probably
use four bytes uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this required for the purpose of correctness. AFAIU,
we do store/estimate multiple parameters in same way at other places,
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?

While looking at the latest code, I observed below issue in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ ¬nullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "Deng, Gang"
Date:
Subject: RE: [PoC] Non-volatile WAL buffer
Next
From: Pavel Stehule
Date:
Subject: Re: range_agg