Thread: Re: Force lookahead in COPY FROM parsing

Re: Force lookahead in COPY FROM parsing

From
John Naylor
Date:
The cfbot thinks this patch no longer applies, but it works for me, so still set to RFC. It looks like it's because the thread to remove the v2 FE/BE protocol was still attached to the commitfest entry. I've deleted that, so let's see if that helps.

To answer the side question of whether it makes any difference in performance, I used the blackhole AM [1] to isolate the copy code path as much as possible. Forcing lookahead seems to not make a noticeable difference (min of 5 runs):

master:
306ms

force lookahead:
304ms

--

Re: Force lookahead in COPY FROM parsing

From
Heikki Linnakangas
Date:
On 18/03/2021 17:09, John Naylor wrote:
> The cfbot thinks this patch no longer applies, but it works for me, so 
> still set to RFC. It looks like it's because the thread to remove the v2 
> FE/BE protocol was still attached to the commitfest entry. I've deleted 
> that, so let's see if that helps.

It was now truly bitrotted by commit f82de5c46b (the encoding conversion 
changes). Rebase attached.

> To answer the side question of whether it makes any difference in 
> performance, I used the blackhole AM [1] to isolate the copy code path 
> as much as possible. Forcing lookahead seems to not make a noticeable 
> difference (min of 5 runs):
> 
> master:
> 306ms
> 
> force lookahead:
> 304ms
> 
> I forgot to mention the small detail of what the test was:
> 
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> copy blackhole_tab from program 'for i in {1..100}; do cat /path/to/UTF-8\ Sampler.htm ;  done;' ;
> 
> ...where the .htm file is at http://kermitproject.org/utf8.html

Ok, I wouldn't expect to see much difference in that test, it gets 
drowned in all the other parsing overhead. I tested this now with this:

copy (select repeat('x', 10000) from generate_series(1, 100000)) to 
'/tmp/copydata-x.txt'
create table blackhole_tab (a text);

copy blackhole_tab from '/tmp/copydata-x.txt' where false;

I took the min of 5 runs of that COPY FROM statement:

master:
4107 ms

v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
3172 ms

I was actually surprised it was so effective on that test, I expected a 
small but noticeable gain. But I'll take it.

Replying to your earlier comments (sorry for the delay):

> Looks good to me. Just a couple minor things:
> 
> + * Look at the next character.  If we're at EOF, c2 will wind
> + * up as '\0' because of the guaranteed pad of raw_buf.
>   */
> - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> -
> - /* get next char */
>   c = copy_raw_buf[raw_buf_ptr];
> 
> The new comment seems copy-pasted from the c2 statements further down.

Fixed.

> - if (raw_buf_ptr >= copy_buf_len || need_data)
> +#define COPY_READ_LINE_LOOKAHEAD 4
> + if (raw_buf_ptr + COPY_READ_LINE_LOOKAHEAD >= copy_buf_len)
> 
> Is this #define deliberately put down here rather than at the top of the file?

Yeah, it's only used right here locally. Matter of taste, but I'd prefer 
to keep it here.

> - * of the buffer and then we load more data after that.  This case occurs only
> - * when a multibyte character crosses a bufferload boundary.
> + * of the buffer and then we load more data after that.
> 
> Is the removed comment really invalidated by this patch? I figured it was something not affected until the patch to
dothe encoding conversion in larger chunks.
 

Not sure anymore, but this is moot now, since the other patch was committed.

Thanks for the review and the testing!

- Heikki

Attachment

Re: Force lookahead in COPY FROM parsing

From
John Naylor
Date:

On Thu, Apr 1, 2021 at 4:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok, I wouldn't expect to see much difference in that test, it gets
> drowned in all the other parsing overhead. I tested this now with this:
>
> copy (select repeat('x', 10000) from generate_series(1, 100000)) to
> '/tmp/copydata-x.txt'
> create table blackhole_tab (a text);
>
> copy blackhole_tab from '/tmp/copydata-x.txt' where false;
>
> I took the min of 5 runs of that COPY FROM statement:
>
> master:
> 4107 ms
>
> v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
> 3172 ms
>
> I was actually surprised it was so effective on that test, I expected a
> small but noticeable gain. But I'll take it.

Nice! With this test on my laptop I only get 7-8% increase, but that's much better than what I saw before.

I have nothing further so it's RFC. The patch is pretty simple compared to the earlier ones, but is worth running the fuzzer again as added insurance?

As an aside, I noticed the URL near the top of copyfromparse.c that explains a detail of macros has moved from

http://www.cit.gu.edu.au/~anthony/info/C/C.macros

to

https://antofthy.gitlab.io/info/C/C_macros.txt

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Force lookahead in COPY FROM parsing

From
Heikki Linnakangas
Date:
On 02/04/2021 20:21, John Naylor wrote:
> I have nothing further so it's RFC. The patch is pretty simple compared 
> to the earlier ones, but is worth running the fuzzer again as added 
> insurance?

Good idea. I did that, and indeed it revealed bugs. If the client sent 
just a single byte in one CopyData message, we only loaded that one byte 
into the buffer, instead of the full 4 bytes needed for lookahead. 
Attached is a new version that fixes that.

Unfortunately, that's not the end of it. Consider the byte sequence 
"\.<NL><some invalid bytes>" appearing at the end of the input. We 
should detect the end-of-copy marker \. and stop reading without 
complaining about the garbage after the end-of-copy marker. That doesn't 
work if we force 4 bytes of lookahead; the invalid byte sequence fits in 
the lookahead window, so we will try to convert it.

I'm sure that can be fixed, for example by adding special handling for 
the last few bytes of the input. But it needs some more thinking, this 
patch isn't quite ready to be committed yet :-(.

- Heikki

Attachment