Thread: Re: Force lookahead in COPY FROM parsing
The cfbot thinks this patch no longer applies, but it works for me, so still set to RFC. It looks like it's because the thread to remove the v2 FE/BE protocol was still attached to the commitfest entry. I've deleted that, so let's see if that helps.
To answer the side question of whether it makes any difference in performance, I used the blackhole AM [1] to isolate the copy code path as much as possible. Forcing lookahead seems to not make a noticeable difference (min of 5 runs):
master:
306ms
force lookahead:
304ms
306ms
force lookahead:
304ms
On 18/03/2021 17:09, John Naylor wrote: > The cfbot thinks this patch no longer applies, but it works for me, so > still set to RFC. It looks like it's because the thread to remove the v2 > FE/BE protocol was still attached to the commitfest entry. I've deleted > that, so let's see if that helps. It was now truly bitrotted by commit f82de5c46b (the encoding conversion changes). Rebase attached. > To answer the side question of whether it makes any difference in > performance, I used the blackhole AM [1] to isolate the copy code path > as much as possible. Forcing lookahead seems to not make a noticeable > difference (min of 5 runs): > > master: > 306ms > > force lookahead: > 304ms > > I forgot to mention the small detail of what the test was: > > create extension blackhole_am; > create table blackhole_tab (a text) using blackhole_am ; > copy blackhole_tab from program 'for i in {1..100}; do cat /path/to/UTF-8\ Sampler.htm ; done;' ; > > ...where the .htm file is at http://kermitproject.org/utf8.html Ok, I wouldn't expect to see much difference in that test, it gets drowned in all the other parsing overhead. I tested this now with this: copy (select repeat('x', 10000) from generate_series(1, 100000)) to '/tmp/copydata-x.txt' create table blackhole_tab (a text); copy blackhole_tab from '/tmp/copydata-x.txt' where false; I took the min of 5 runs of that COPY FROM statement: master: 4107 ms v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch: 3172 ms I was actually surprised it was so effective on that test, I expected a small but noticeable gain. But I'll take it. Replying to your earlier comments (sorry for the delay): > Looks good to me. Just a couple minor things: > > + * Look at the next character. If we're at EOF, c2 will wind > + * up as '\0' because of the guaranteed pad of raw_buf. > */ > - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0); > - > - /* get next char */ > c = copy_raw_buf[raw_buf_ptr]; > > The new comment seems copy-pasted from the c2 statements further down. Fixed. > - if (raw_buf_ptr >= copy_buf_len || need_data) > +#define COPY_READ_LINE_LOOKAHEAD 4 > + if (raw_buf_ptr + COPY_READ_LINE_LOOKAHEAD >= copy_buf_len) > > Is this #define deliberately put down here rather than at the top of the file? Yeah, it's only used right here locally. Matter of taste, but I'd prefer to keep it here. > - * of the buffer and then we load more data after that. This case occurs only > - * when a multibyte character crosses a bufferload boundary. > + * of the buffer and then we load more data after that. > > Is the removed comment really invalidated by this patch? I figured it was something not affected until the patch to dothe encoding conversion in larger chunks. Not sure anymore, but this is moot now, since the other patch was committed. Thanks for the review and the testing! - Heikki
Attachment
On Thu, Apr 1, 2021 at 4:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok, I wouldn't expect to see much difference in that test, it gets
> drowned in all the other parsing overhead. I tested this now with this:
>
> copy (select repeat('x', 10000) from generate_series(1, 100000)) to
> '/tmp/copydata-x.txt'
> create table blackhole_tab (a text);
>
> copy blackhole_tab from '/tmp/copydata-x.txt' where false;
>
> I took the min of 5 runs of that COPY FROM statement:
>
> master:
> 4107 ms
>
> v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
> 3172 ms
>
> I was actually surprised it was so effective on that test, I expected a
> small but noticeable gain. But I'll take it.
Nice! With this test on my laptop I only get 7-8% increase, but that's much better than what I saw before.
I have nothing further so it's RFC. The patch is pretty simple compared to the earlier ones, but is worth running the fuzzer again as added insurance?
As an aside, I noticed the URL near the top of copyfromparse.c that explains a detail of macros has moved from
http://www.cit.gu.edu.au/~anthony/info/C/C.macros
to
https://antofthy.gitlab.io/info/C/C_macros.txt
--
John Naylor
EDB: http://www.enterprisedb.com
On 02/04/2021 20:21, John Naylor wrote: > I have nothing further so it's RFC. The patch is pretty simple compared > to the earlier ones, but is worth running the fuzzer again as added > insurance? Good idea. I did that, and indeed it revealed bugs. If the client sent just a single byte in one CopyData message, we only loaded that one byte into the buffer, instead of the full 4 bytes needed for lookahead. Attached is a new version that fixes that. Unfortunately, that's not the end of it. Consider the byte sequence "\.<NL><some invalid bytes>" appearing at the end of the input. We should detect the end-of-copy marker \. and stop reading without complaining about the garbage after the end-of-copy marker. That doesn't work if we force 4 bytes of lookahead; the invalid byte sequence fits in the lookahead window, so we will try to convert it. I'm sure that can be fixed, for example by adding special handling for the last few bytes of the input. But it needs some more thinking, this patch isn't quite ready to be committed yet :-(. - Heikki