Re: [PATCH] Performance Improvement For Copy From Binary Files - Mailing list pgsql-hackers
From: Bharath Rupireddy
Subject: Re: [PATCH] Performance Improvement For Copy From Binary Files
Msg-id: CALj2ACVe_qKV6ky5bJCtMLbLdU07MPcPvk9iaMfdh3eKHRjWwg@mail.gmail.com
In response to: Re: [PATCH] Performance Improvement For Copy From Binary Files (Amit Langote <amitlangote09@gmail.com>)
List: pgsql-hackers
> Considering these points, I came up with the attached patch with a
> much smaller footprint. As far as I can see, it implements the same
> basic idea as your patch. With it, I was able to see an improvement
> in loading time consistent with your numbers. I measured the time of
> loading 10 million rows into tables with 5, 10, 20 text columns as
> follows:

Thanks Amit for buying into the idea. I agree that your patch looks
clean and simple compared to mine, and I'm okay with your patch. I
reviewed and tested it; below are a few comments:

I think we can remove CopyGetInt16() (and delete the function from the
code) and inline its work directly, to save the function call cost. It
gets called for each attribute/column of each row/tuple just to call
CopyReadFromRawBuf() and fix the byte order. From a readability
perspective it's okay to have this function, but cost-wise I see no
need for it at all. In one of our earlier works (parallel copy), we
observed that having a new function or a few extra statements in this
COPY FROM path, which gets hit for each row, incurs noticeable
execution cost.

In the same way, we can also avoid the CopyGetInt32() function call in
CopyReadBinaryAttribute(), for the reason stated above.

In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest"
and used directly with a (char *) typecast, instead of going through a
local variable? Though it may or may not be a standard practice, let's
have the parameter name all lower case, to keep it consistent with the
other functions' parameters in copy.c.

There seems to be a bug in the part of the code below. The use case is
simple: put some junk value at the end of the binary file; with your
patch the query succeeds, but it should report the error below. Here,
on fld_count == -1, instead of reading from the file we must read from
the buffer, as we would have already read all the data from the file
into the buffer.
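[Editor's note: to make the trailing-junk scenario concrete, here is a
minimal standalone sketch. All names in it are invented for
illustration; it is not the PostgreSQL code. It shows detecting extra
bytes after the binary-format EOF marker (field count -1) by looking
at what remains in the in-memory buffer, rather than issuing another
read on the underlying file.]

    #include <stdint.h>
    #include <stdio.h>

    /* Read a big-endian int16 from buf at *off, advancing *off.
     * Returns 0 on success, -1 if fewer than 2 bytes remain. */
    static int read_be_int16(const uint8_t *buf, size_t len,
                             size_t *off, int16_t *out)
    {
        if (len - *off < 2)
            return -1;
        *out = (int16_t) ((buf[*off] << 8) | buf[*off + 1]);
        *off += 2;
        return 0;
    }

    int main(void)
    {
        /* Simulated tail of a COPY BINARY stream: one tuple with 0
         * fields, the EOF marker (field count -1), then junk. */
        const uint8_t stream[] = {0x00, 0x00,  /* field count 0 */
                                  0xFF, 0xFF,  /* -1: EOF marker */
                                  0xAB};       /* trailing junk */
        size_t  off = 0;
        int16_t fld_count;

        while (read_be_int16(stream, sizeof(stream), &off,
                             &fld_count) == 0)
        {
            if (fld_count == -1)
            {
                /* EOF marker seen: any bytes still left in the
                 * buffer are junk.  Check the buffer, not the file,
                 * because the file's data has already been read into
                 * the buffer. */
                if (off < sizeof(stream))
                    printf("received copy data after EOF marker\n");
                else
                    printf("clean EOF\n");
                return 0;
            }
            /* ... parse fld_count fields here ... */
        }
        printf("unexpected end of stream\n");
        return 0;
    }

The key point is that the post-EOF-marker check must consult the
buffered bytes; a fresh read from the file would miss junk that was
already slurped into the buffer. The actual check in the patch follows.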
    if (cstate->copy_dest != COPY_OLD_FE &&
        CopyGetData(cstate, &dummy, 1, 1) > 0)
        ereport(ERROR,
                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                 errmsg("received copy data after EOF marker")));

I also tried with some intentionally corrupted binary datasets; apart
from the above issue, the patch works fine.

For the case where the required nbytes may not fit into the buffer in
CopyReadFromRawBuf(): I'm sure this can happen only for field data
(the field count and the field size are of fixed length and always fit
in the buffer). Instead of reading such a field in buffer-sized chunks
into the buffer (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF()
to the destination, I think we can detect this condition from the
requested byte count and the buffer size, read directly from the file
into the destination buffer, and then reload the raw buffer for
further processing. I think it will be better this way. I have a few
synthesized test cases where fields can be of larger size. I executed
them on your patch, but didn't debug to see whether we actually hit
the code where the required nbytes can't fit in the entire buffer. I
will try this on the next version of the patch.

>         HEAD    patched
> foo5    8.5     6.5
> foo10   14      10
> foo20   25      18

The numbers might improve a bit if we remove the extra function calls,
as stated above.

Overall, thanks for your suggestions in the previous mail; my patch
was prepared in a bit of a hurried manner. I will take more care next
time.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com