Thread: Improvements in Copy From

Improvements in Copy From

From
vignesh C
Date:
Hi,

While reviewing copy from I identified few  improvements for copy from
that can be done :
a) copy from stdin copies lesser amount of data to buffer even though
space is available in buffer because minread was passed as 1 to
CopyGetData, Hence it only reads until the data read from libpq is
less than minread. This can be fixed by passing the actual space
available in buffer, this reduces the unnecessary frequent calls to
CopyGetData.
b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
that is not being used, it can be removed.
c) Copy from reads header line and do nothing for the header line, we
need not clear EOL & need not convert to server encoding for the
header line.

Attached patch has the changes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Improvements in Copy From

From
vignesh C
Date:
On Wed, Jul 1, 2020 at 6:16 PM vignesh C <vignesh21@gmail.com> wrote:
> Attached patch has the changes for the same.
> Thoughts?
>

Added a commitfest entry for this:
https://commitfest.postgresql.org/29/2642/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Improvements in Copy From

From
David Rowley
Date:
On Thu, 2 Jul 2020 at 00:46, vignesh C <vignesh21@gmail.com> wrote:
> b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> that is not being used, it can be removed.

This was raised in [1]. We decided not to remove it.

David

[1]
https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef



Re: Improvements in Copy From

From
David Rowley
Date:
On Tue, 14 Jul 2020 at 17:22, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 2 Jul 2020 at 00:46, vignesh C <vignesh21@gmail.com> wrote:
> > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > that is not being used, it can be removed.
>
> This was raised in [1]. We decided not to remove it.

I just added a comment to the function to mention why we want to keep
the parameter. I hope that will save any wasted time proposing its
removal in the future.

FWIW, the function is inlined.  Removing it will gain us nothing
performance-wise anyway.

David

> [1]
https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef



Re: Improvements in Copy From

From
vignesh C
Date:
On Tue, Jul 14, 2020 at 11:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 14 Jul 2020 at 17:22, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Thu, 2 Jul 2020 at 00:46, vignesh C <vignesh21@gmail.com> wrote:
> > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > that is not being used, it can be removed.
> >
> > This was raised in [1]. We decided not to remove it.
>
> I just added a comment to the function to mention why we want to keep
> the parameter. I hope that will save any wasted time proposing its
> removal in the future.
>
> FWIW, the function is inlined.  Removing it will gain us nothing
> performance-wise anyway.
>
> David
>
> > [1]
https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef

Thanks David for pointing it out, as this has been discussed and
concluded no point in discussing the same thing again. This patch has
a couple of other improvements which can still be taken forward. I
will remove this change and post a new patch to retain the other
issues that were fixed.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Improvements in Copy From

From
vignesh C
Date:
On Tue, Jul 14, 2020 at 12:17 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jul 14, 2020 at 11:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Tue, 14 Jul 2020 at 17:22, David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Thu, 2 Jul 2020 at 00:46, vignesh C <vignesh21@gmail.com> wrote:
> > > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > > that is not being used, it can be removed.
> > >
> > > This was raised in [1]. We decided not to remove it.
> >
> > I just added a comment to the function to mention why we want to keep
> > the parameter. I hope that will save any wasted time proposing its
> > removal in the future.
> >
> > FWIW, the function is inlined.  Removing it will gain us nothing
> > performance-wise anyway.
> >
> > David
> >
> > > [1]
https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef
>
> Thanks David for pointing it out, as this has been discussed and
> concluded no point in discussing the same thing again. This patch has
> a couple of other improvements which can still be taken forward. I
> will remove this change and post a new patch to retain the other
> issues that were fixed.
>

I have removed the changes that david had pointed out and retained the
remaining changes. Attaching the patch for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Improvements in Copy From

From
Peter Smith
Date:
Hello.

FYI - that patch has conflicts when applied.

Kind Regards
Peter Smith
Fujitsu Australia.

On Thu, Aug 27, 2020 at 3:11 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jul 14, 2020 at 12:17 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Jul 14, 2020 at 11:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Tue, 14 Jul 2020 at 17:22, David Rowley <dgrowleyml@gmail.com> wrote:
> > > >
> > > > On Thu, 2 Jul 2020 at 00:46, vignesh C <vignesh21@gmail.com> wrote:
> > > > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > > > that is not being used, it can be removed.
> > > >
> > > > This was raised in [1]. We decided not to remove it.
> > >
> > > I just added a comment to the function to mention why we want to keep
> > > the parameter. I hope that will save any wasted time proposing its
> > > removal in the future.
> > >
> > > FWIW, the function is inlined.  Removing it will gain us nothing
> > > performance-wise anyway.
> > >
> > > David
> > >
> > > > [1]
https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef
> >
> > Thanks David for pointing it out, as this has been discussed and
> > concluded no point in discussing the same thing again. This patch has
> > a couple of other improvements which can still be taken forward. I
> > will remove this change and post a new patch to retain the other
> > issues that were fixed.
> >
>
> I have removed the changes that david had pointed out and retained the
> remaining changes. Attaching the patch for the same.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com



Re: Improvements in Copy From

From
vignesh C
Date:
On Thu, Aug 27, 2020 at 11:02 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hello.
>
> FYI - that patch has conflicts when applied.
>

Thanks for letting me know. Attached new patch which is rebased on top of head.

Regards,
VIgnesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Improvements in Copy From

From
Surafel Temesgen
Date:

Hi Vignesh

On Wed, Jul 1, 2020 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote:
Hi,

While reviewing copy from I identified few  improvements for copy from
that can be done :
a) copy from stdin copies lesser amount of data to buffer even though
space is available in buffer because minread was passed as 1 to
CopyGetData, Hence it only reads until the data read from libpq is
less than minread. This can be fixed by passing the actual space
available in buffer, this reduces the unnecessary frequent calls to
CopyGetData.

why not applying the same optimization on file read ?

 
c) Copy from reads header line and do nothing for the header line, we
need not clear EOL & need not convert to server encoding for the
header line.

We have a patch for column matching feature [1] that may need a header line to be further processed. Even without that I think it is preferable to process the header line for nothing than adding those checks to the loop, performance-wise.

[1]. https://www.postgresql.org/message-id/flat/CAF1-J-0PtCWMeLtswwGV2M70U26n4g33gpe1rcKQqe6wVQDrFA@mail.gmail.com

regards

Surafel 

Re: Improvements in Copy From

From
Peter Smith
Date:
My basic understanding of first part of your patch is that by
adjusting the "minread" it now allows it to loop multiple times
internally within the CopyGetData rather than calling CopyLoadRawBuf
for every N lines. There doesn't seem to be much change to what other
code gets executed so the saving is essentially whatever is the cost
of making 2 x function calls (CopyLoadRawBuff + CopyGetData) x N. Is
that understanding correct?

But with that change there seems to be opportunity for yet another
tiny saving possible. IIUC, now you are processing a lot more data
within the CopyGetData so it is now very likely that you will also
gobble the COPY_NEW_FE's 'c' marker. So cstate->reached_eof will be
set. So this means the calling code of CopyReadLineText may no longer
need to call the CopyLoadRawBuf one last time just to discover there
are no more bytes to read - something that it already knows if
cstate->reached_eof == true.

For example, with your change can't you also modify CopyReadLineText like below:

BEFORE
            if (!CopyLoadRawBuf(cstate))
                hit_eof = true;

AFTER
            if (cstate->reached_eof)
            {
                cstate->raw_buf[0] = '\0';
                cstate->raw_buf_index = cstate->raw_buf_len = 0;
                hit_eof = true;
            }
            else if (!CopyLoadRawBuf(cstate))
            {
                hit_eof = true;
            }

Whether such a micro-optimisation is worth doing is another question.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

On Sun, Aug 30, 2020 at 5:25 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Aug 27, 2020 at 11:02 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hello.
> >
> > FYI - that patch has conflicts when applied.
> >
>
> Thanks for letting me know. Attached new patch which is rebased on top of head.
>
> Regards,
> VIgnesh
> EnterpriseDB: http://www.enterprisedb.com



Re: Improvements in Copy From

From
vignesh C
Date:
On Mon, Sep 7, 2020 at 1:19 PM Surafel Temesgen <surafel3000@gmail.com> wrote:
>
>
> Hi Vignesh
>
> On Wed, Jul 1, 2020 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Hi,
>>
>> While reviewing copy from I identified few  improvements for copy from
>> that can be done :
>> a) copy from stdin copies lesser amount of data to buffer even though
>> space is available in buffer because minread was passed as 1 to
>> CopyGetData, Hence it only reads until the data read from libpq is
>> less than minread. This can be fixed by passing the actual space
>> available in buffer, this reduces the unnecessary frequent calls to
>> CopyGetData.
>
>
> why not applying the same optimization on file read ?

For file read this is already taken care as you can see from below code:
bytesread = fread(databuf, 1, maxread, cstate->copy_file);
if (ferror(cstate->copy_file))
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read from COPY file: %m")));
if (bytesread == 0)
cstate->reached_eof = true;
break;

We do not have any condition to break unlike the case of stdin, we
read 1 * maxread size of data, So no need to do anything for it.

>
>>
>> c) Copy from reads header line and do nothing for the header line, we
>> need not clear EOL & need not convert to server encoding for the
>> header line.
>
>
> We have a patch for column matching feature [1] that may need a header line to be further processed. Even without
thatI think it is preferable to process the header line for nothing than adding those checks to the loop,
performance-wise.

I had seen that patch, I feel that change to match the header if the
header is specified can be addressed in this patch if that patch gets
committed first or vice versa. We are doing a lot of processing for
the data which we need not do anything. Shouldn't this be skipped if
not required. Similar check is present in NextCopyFromRawFields also
to skip header.

Thoughts?

Regards,
VIgnesh
EnterpriseDB: http://www.enterprisedb.com



Re: Improvements in Copy From

From
vignesh C
Date:
On Wed, Sep 9, 2020 at 12:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> My basic understanding of first part of your patch is that by
> adjusting the "minread" it now allows it to loop multiple times
> internally within the CopyGetData rather than calling CopyLoadRawBuf
> for every N lines. There doesn't seem to be much change to what other
> code gets executed so the saving is essentially whatever is the cost
> of making 2 x function calls (CopyLoadRawBuff + CopyGetData) x N. Is
> that understanding correct?

Yes you are right, we will avoid the function calls and try to get as
many records as possible from the buffer & insert it to the relation.

> But with that change there seems to be opportunity for yet another
> tiny saving possible. IIUC, now you are processing a lot more data
> within the CopyGetData so it is now very likely that you will also
> gobble the COPY_NEW_FE's 'c' marker. So cstate->reached_eof will be
> set. So this means the calling code of CopyReadLineText may no longer
> need to call the CopyLoadRawBuf one last time just to discover there
> are no more bytes to read - something that it already knows if
> cstate->reached_eof == true.
>
> For example, with your change can't you also modify CopyReadLineText like below:
>
> BEFORE
>             if (!CopyLoadRawBuf(cstate))
>                 hit_eof = true;
>
> AFTER
>             if (cstate->reached_eof)
>             {
>                 cstate->raw_buf[0] = '\0';
>                 cstate->raw_buf_index = cstate->raw_buf_len = 0;
>                 hit_eof = true;
>             }
>             else if (!CopyLoadRawBuf(cstate))
>             {
>                 hit_eof = true;
>             }
>
> Whether such a micro-optimisation is worth doing is another question.
Yes, what you suggested can also be done, but even I have the same
question as you. Because we will reduce just one function call, the
eof check is present immediately in the function, Should we include
this or not?

Regards,
VIgnesh
EnterpriseDB: http://www.enterprisedb.com



Re: Improvements in Copy From

From
Surafel Temesgen
Date:


On Thu, Sep 10, 2020 at 1:17 PM vignesh C <vignesh21@gmail.com> wrote:

>
> We have a patch for column matching feature [1] that may need a header line to be further processed. Even without that I think it is preferable to process the header line for nothing than adding those checks to the loop, performance-wise.

I had seen that patch, I feel that change to match the header if the
header is specified can be addressed in this patch if that patch gets
committed first or vice versa. We are doing a lot of processing for
the data which we need not do anything. Shouldn't this be skipped if
not required. Similar check is present in NextCopyFromRawFields also
to skip header.

The existing check is unavoidable but we can live better without the checks added by the patch. For very large files the loop may iterate millions of times if it is not in billion and I am sure doing the check that many times will incur noticeable performance degradation than further processing a single line.

regards

Surafel 

Re: Improvements in Copy From

From
Kyotaro Horiguchi
Date:
At Thu, 10 Sep 2020 21:55:27 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in 
> On Thu, Sep 10, 2020 at 1:17 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> >
> > >
> > > We have a patch for column matching feature [1] that may need a header
> > line to be further processed. Even without that I think it is preferable to
> > process the header line for nothing than adding those checks to the loop,
> > performance-wise.
> >
> > I had seen that patch, I feel that change to match the header if the
> > header is specified can be addressed in this patch if that patch gets
> > committed first or vice versa. We are doing a lot of processing for
> > the data which we need not do anything. Shouldn't this be skipped if
> > not required. Similar check is present in NextCopyFromRawFields also
> > to skip header.
> >
> 
> The existing check is unavoidable but we can live better without the checks
> added by the patch. For very large files the loop may iterate millions of
> times if it is not in billion and I am sure doing the check that many times
> will incur noticeable performance degradation than further processing a
> single line.

FWIW, I thought the same thing seeing the additional if-conditions. It
gives more loss than gain.

For the first part, the patch reveals COPY_NEW_FE, which I don't think
to be a knowledge for the function, to CopyGetData. Considering that
that doesn't seem to offer noticeable performance gain, I don't think
we should do that. On the contrary, if incoming data were
intermittently delayed for some reasons (heavy load of client or
in-between network), this patch would make things worse by waiting for
delayed bits before processing already received bits.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Improvements in Copy From

From
Peter Smith
Date:
On Thu, Sep 10, 2020 at 9:21 PM vignesh C <vignesh21@gmail.com> wrote:
> > Whether such a micro-optimisation is worth doing is another question.
> Yes, what you suggested can also be done, but even I have the same
> question as you. Because we will reduce just one function call, the
> eof check is present immediately in the function, Should we include
> this or not?

I expect the difference from my suggestion is too small to be measured.

Probably it is not worth changing the already complicated code unless
those changes can achieve something observable.

~~

FYI, I ran a few performance tests BEFORE/AFTER applying your patch.

Perf results for \COPY 5GB CSV file to UNLOGGED table.

perf -a –g <pid>
psql -d test -c "\copy tbl from '/my/path/data_5GB.csv' with (format csv);”
perf report –g

BEFORE
#1 CopyReadLineText = 12.70%, CopyLoadRawBuf = 0.81%
#2 CopyReadLineText = 12.54%, CopyLoadRawBuf = 0.81%
#3 CopyReadLineText = 12.52%, CopyLoadRawBuf = 0.81%

AFTER
#1 CopyReadLineText = 12.55%, CopyLoadRawBuf = 1.20%
#2 CopyReadLineText = 12.15%, CopyLoadRawBuf = 1.10%
#3 CopyReadLineText = 13.11%, CopyLoadRawBuf = 1.24%
#4 CopyReadLineText = 12.86%, CopyLoadRawBuf = 1.18%

I didn't quite know how to interpret those results. It was opposite
what I expected. Perhaps the original excessive CopyLoadRawBuf calls
were so brief they could often avoid being sampled? Anyway, I hope you
have a better understanding of perf than I do and can explain it.

I then repeated/times same tests but without perf

BEFORE
#1 4min.36s
#2 4min.45s
#3 4min.43s
#4 4min.34s

AFTER
#1 4min.41s
#2 4min.37s
#3 4min.34s

As you can see, unfortunately, the patch gave no observable benefit
for my test case.

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Improvements in Copy From

From
Kyotaro Horiguchi
Date:
At Fri, 11 Sep 2020 18:44:13 +1000, Peter Smith <smithpb2250@gmail.com> wrote in 
> On Thu, Sep 10, 2020 at 9:21 PM vignesh C <vignesh21@gmail.com> wrote:
> > > Whether such a micro-optimisation is worth doing is another question.
> > Yes, what you suggested can also be done, but even I have the same
> > question as you. Because we will reduce just one function call, the
> > eof check is present immediately in the function, Should we include
> > this or not?
> 
> I expect the difference from my suggestion is too small to be measured.
> 
> Probably it is not worth changing the already complicated code unless
> those changes can achieve something observable.
> 
> ~~
> 
> FYI, I ran a few performance tests BEFORE/AFTER applying your patch.
> 
> Perf results for \COPY 5GB CSV file to UNLOGGED table.
> 
> perf -a –g <pid>
> psql -d test -c "\copy tbl from '/my/path/data_5GB.csv' with (format csv);”
> perf report –g
> 
> BEFORE
> #1 CopyReadLineText = 12.70%, CopyLoadRawBuf = 0.81%
> #2 CopyReadLineText = 12.54%, CopyLoadRawBuf = 0.81%
> #3 CopyReadLineText = 12.52%, CopyLoadRawBuf = 0.81%
> 
> AFTER
> #1 CopyReadLineText = 12.55%, CopyLoadRawBuf = 1.20%
> #2 CopyReadLineText = 12.15%, CopyLoadRawBuf = 1.10%
> #3 CopyReadLineText = 13.11%, CopyLoadRawBuf = 1.24%
> #4 CopyReadLineText = 12.86%, CopyLoadRawBuf = 1.18%
> 
> I didn't quite know how to interpret those results. It was opposite
> what I expected. Perhaps the original excessive CopyLoadRawBuf calls
> were so brief they could often avoid being sampled? Anyway, I hope you
> have a better understanding of perf than I do and can explain it.
> 
> I then repeated/times same tests but without perf
> 
> BEFORE
> #1 4min.36s
> #2 4min.45s
> #3 4min.43s
> #4 4min.34s
> 
> AFTER
> #1 4min.41s
> #2 4min.37s
> #3 4min.34s
> 
> As you can see, unfortunately, the patch gave no observable benefit
> for my test case.

That observation agrees with my assumption.

At Fri, 11 Sep 2020 15:58:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
me> we should do that. On the contrary, if incoming data were
me> intermittently delayed for some reasons (heavy load of client or
me> in-between network), this patch would make things worse by waiting for
me> delayed bits before processing already received bits.

It seems that a slow network is enough to cause that behavior even
without any trouble,

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center