Re: Force lookahead in COPY FROM parsing - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Force lookahead in COPY FROM parsing
Date
Msg-id bf73c57e-b2e3-9df6-4412-6ca552bc64a6@iki.fi
Whole thread Raw
In response to Re: Force lookahead in COPY FROM parsing  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: Force lookahead in COPY FROM parsing
List pgsql-hackers
On 18/03/2021 17:09, John Naylor wrote:
> The cfbot thinks this patch no longer applies, but it works for me, so 
> still set to RFC. It looks like it's because the thread to remove the v2 
> FE/BE protocol was still attached to the commitfest entry. I've deleted 
> that, so let's see if that helps.

It was now truly bitrotted by commit f82de5c46b (the encoding conversion 
changes). Rebase attached.

> To answer the side question of whether it makes any difference in 
> performance, I used the blackhole AM [1] to isolate the copy code path 
> as much as possible. Forcing lookahead seems to not make a noticeable 
> difference (min of 5 runs):
> 
> master:
> 306ms
> 
> force lookahead:
> 304ms
> 
> I forgot to mention the small detail of what the test was:
> 
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> copy blackhole_tab from program 'for i in {1..100}; do cat /path/to/UTF-8\ Sampler.htm ;  done;' ;
> 
> ...where the .htm file is at http://kermitproject.org/utf8.html

Ok, I wouldn't expect to see much difference in that test, it gets 
drowned in all the other parsing overhead. I tested this now with this:

copy (select repeat('x', 10000) from generate_series(1, 100000)) to 
'/tmp/copydata-x.txt'
create table blackhole_tab (a text);

copy blackhole_tab from '/tmp/copydata-x.txt' where false;

I took the min of 5 runs of that COPY FROM statement:

master:
4107 ms

v3-0001-Simplify-COPY-FROM-parsing-by-forcing-lookahead.patch:
3172 ms

I was actually surprised it was so effective on that test, I expected a 
small but noticeable gain. But I'll take it.

Replying to your earlier comments (sorry for the delay):

> Looks good to me. Just a couple minor things:
> 
> + * Look at the next character.  If we're at EOF, c2 will wind
> + * up as '\0' because of the guaranteed pad of raw_buf.
>   */
> - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
> -
> - /* get next char */
>   c = copy_raw_buf[raw_buf_ptr];
> 
> The new comment seems copy-pasted from the c2 statements further down.

Fixed.

> - if (raw_buf_ptr >= copy_buf_len || need_data)
> +#define COPY_READ_LINE_LOOKAHEAD 4
> + if (raw_buf_ptr + COPY_READ_LINE_LOOKAHEAD >= copy_buf_len)
> 
> Is this #define deliberately put down here rather than at the top of the file?

Yeah, it's only used right here locally. Matter of taste, but I'd prefer 
to keep it here.

> - * of the buffer and then we load more data after that.  This case occurs only
> - * when a multibyte character crosses a bufferload boundary.
> + * of the buffer and then we load more data after that.
> 
> Is the removed comment really invalidated by this patch? I figured it was something not affected until the patch to
dothe encoding conversion in larger chunks.
 

Not sure anymore, but this is moot now, since the other patch was committed.

Thanks for the review and the testing!

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: pg_amcheck contrib application
Next
From: Thomas Munro
Date:
Subject: Re: OpenBSD versus semaphores