Re: [PATCH] Performance Improvement For Copy From Binary Files - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: [PATCH] Performance Improvement For Copy From Binary Files
Date
Msg-id CALj2ACVe_qKV6ky5bJCtMLbLdU07MPcPvk9iaMfdh3eKHRjWwg@mail.gmail.com
In response to Re: [PATCH] Performance Improvement For Copy From Binary Files  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: [PATCH] Performance Improvement For Copy From Binary Files
List pgsql-hackers
>
> Considering these points, I came up with the attached patch with a
> much smaller footprint.  As far as I can see, it implements the same
> basic idea as your patch.  With it, I was able to see an improvement
> in loading time consistent with your numbers.  I measured the time of
> loading 10 million rows into tables with 5, 10, 20 text columns as
> follows:
>

Thanks, Amit, for buying into the idea. I agree that your patch looks
cleaner and simpler than mine, and I'm okay with your patch.

I reviewed and tested your patch; below are a few comments:

I think we can remove the CopyGetInt16() function altogether and put
its code directly at the call sites to save the function call cost. It
gets called for each attribute/column of each row/tuple just to call
CopyReadFromRawBuf() and fix the byte order. From a readability
perspective it's okay to have this function, but cost-wise I feel
there is no need for it at all. In one of our earlier works (parallel
copy), we observed that having a new function or a few extra
statements in this copy-from path, which gets hit for each row, incurs
a noticeable execution cost.

In the same way, we can also avoid the CopyGetInt32() function call in
CopyReadBinaryAttribute(), for the reason stated above.
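To illustrate, here is a minimal sketch of the kind of inlining I have
in mind; the RawBuf struct and function name are made up for
illustration and are not the actual copy.c symbols:

```c
#include <stdint.h>

/* Hypothetical stand-in for the raw-buffer bookkeeping in CopyState. */
typedef struct RawBuf
{
    const char *buf;            /* raw bytes already read from the file */
    int         index;          /* current read position */
    int         len;            /* number of valid bytes in buf */
} RawBuf;

/*
 * Inlined equivalent of CopyGetInt16(): pull two bytes straight out of
 * the raw buffer and assemble them from network (big-endian) order,
 * avoiding a per-attribute function call. Returns 1 on success, 0 if
 * the buffer doesn't hold two bytes (the caller would refill it).
 */
static inline int
read_int16_from_buf(RawBuf *rb, int16_t *val)
{
    if (rb->len - rb->index < 2)
        return 0;
    *val = (int16_t) (((uint16_t) (unsigned char) rb->buf[rb->index] << 8) |
                      (uint16_t) (unsigned char) rb->buf[rb->index + 1]);
    rb->index += 2;
    return 1;
}
```

In the real code this would just be CopyGetInt16()'s body expanded at
its call sites, with the same treatment for the 32-bit reads.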

In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
and used directly with a (char *) cast, instead of going through a
local variable? Though it may or may not be a standard practice, let's
keep the parameter name all lower case to stay consistent with the
other function parameters in copy.c.

There seems to be a bug in the code below. The use case is simple: put
some junk value at the end of the binary file; with your patch the
query succeeds, but it should report the error below. Here, on
fld_count == -1, instead of reading from the file we must read from
the buffer, as we would already have read all the data from the file
into the buffer.
        if (cstate->copy_dest != COPY_OLD_FE &&
            CopyGetData(cstate, &dummy, 1, 1) > 0)
            ereport(ERROR,
                    (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                     errmsg("received copy data after EOF marker")));

I also tried with some intentionally corrupted binary data sets and,
apart from the above issue, the patch works fine.

For the case where the required nbytes may not fit into the buffer in
CopyReadFromRawBuf(): I'm sure this can happen only for field data
(the field count and field size are of fixed length and always fit in
the buffer). Instead of reading the data into the buffer in
buffer-sized chunks (using CopyLoadRawBuf) and then
DRAIN_COPY_RAW_BUF()-ing each chunk to the destination, I think we can
detect this condition from the requested byte count and the buffer
size, read directly from the file into the destination buffer, and
then reload raw_buf for further processing. I think this way it will
be good.

I have a few synthesized test cases where fields can be of larger
size. I executed them against your patch, but didn't debug to see
whether we actually hit the code where the required nbytes can't fit
in the buffer. I will try this on the next version of the patch.

>
>             HEAD    patched
> foo5        8.5     6.5
> foo10       14      10
> foo20       25      18
>

The numbers might improve a bit if we remove the extra function calls
as stated above.

Overall, thanks for your suggestions in the previous mail. My patch
was prepared in a bit of a hurry; I will take more care next time.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


