Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Perform COPY FROM encoding conversions in larger chunks
Date
Msg-id 11d39e63-b80a-5f8d-8043-fff04201fadc@iki.fi
Whole thread Raw
In response to Re: Perform COPY FROM encoding conversions in larger chunks  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: Perform COPY FROM encoding conversions in larger chunks
List pgsql-hackers
On 30/01/2021 20:47, John Naylor wrote:
> I took a look at v2, and for the first encoding I tried, it fails to 
> report the error for invalid input:

That's embarassing...

> I believe the problem is in UtfToLocal(). I've attached a fix formatted 
> as a text file to avoid confusing the cfbot. The fix keeps the debugging 
> ereport() in case you find it useful.

Thanks. I fixed it slightly differently, and also changed LocalToUtf() 
to follow the same pattern, even though LocalToUtf() did not have the 
same bug.

> Some additional test coverage might be good here, but not sure how
> much work that would be. I didn't check any other conversions yet.
I added a bunch of tests for various built-in conversions.

> v2-0002 seems fine to me, I just have cosmetic comments here:
> 
> + * the same, no conversion is required by we must still validate that the
> 
> s/by/but/
> 
> This comment in copyfrom_internal.h above the *StateData struct is the 
> same as the corresponding one in copyto.c:
> 
>   * Multi-byte encodings: all supported client-side encodings encode 
> multi-byte
>   * characters by having the first byte's high bit set. Subsequent bytes 
> of the
>   * character can have the high bit not set. When scanning data in such an
>   * encoding to look for a match to a single-byte (ie ASCII) character, 
> we must
>   * use the full pg_encoding_mblen() machinery to skip over multibyte
>   * characters, else we might find a false match to a trailing byte. In
>   * supported server encodings, there is no possibility of a false 
> match, and
>   * it's faster to make useless comparisons to trailing bytes than it is to
>   * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii 
> is true
>   * when we have to do it the hard way.
> 
> The references to pg_encoding_mblen() and encoding_embeds_ascii, are out 
> of date for copy-from. I'm not sure the rest is relevant to copy-from 
> anymore, either. Can you confirm?

Yeah, that comment is obsolete for COPY FROM, the encoding conversion 
works differently now. Removed it from copyfrom_internal.h.

> This comment inside the struct is now out of date as well:
> 
>   * Similarly, line_buf holds the whole input line being processed. The
>   * input cycle is first to read the whole line into line_buf, convert it
>   * to server encoding there, and then extract the individual attribute
> 
> HEAD has this macro already:
> 
> /* Shorthand for number of unconsumed bytes available in raw_buf */
> #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - 
> (cstate)->raw_buf_index)
> 
> It might make sense to create a CONVERSION_BUF_BYTES equivalent since 
> the patch calculates cstate->conversion_buf_len - 
> cstate->conversion_buf_index in a couple places.

Thanks for the review!

I spent some time refactoring and adding comments all around the patch, 
hopefully making it all more clear. One notable difference is that I 
renamed 'raw_buf' (which exists in master too) to 'input_buf', and 
renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this 
patch again another day with fresh eyes, and also try to add some tests 
for the corner cases at buffer boundaries.

Attached is a new set of patches. I added some regression tests for the 
built-in conversion functions, which cover the bug you found, and many 
other interesting cases that did not have test coverage yet. It comes in 
two patches: the first patch uses just the existing convert_from() SQL 
function, and the second one uses the new "noError" variants of the 
conversion functions. I also kept the bug-fixes compared to the previous 
patch version as a separate commit, for easier review.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Alexey Kondratov
Date:
Subject: Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Next
From: Stephen Frost
Date:
Subject: Re: Proposal: Save user's original authenticated identity for logging