Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From John Naylor
Subject Re: Perform COPY FROM encoding conversions in larger chunks
Date
Msg-id CAFBsxsFp8QCnjdUdj+ZeXR3PzmTn2Xn=WTXUBCnf6=ovyn4Q+Q@mail.gmail.com
Whole thread Raw
In response to Re: Perform COPY FROM encoding conversions in larger chunks  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Perform COPY FROM encoding conversions in larger chunks  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Even more surprising was that the second patch
> (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
> actually made things worse again. I thought it would give a modest gain,
> but nope.

Hmm, that surprised me too.

> Based on these results, I'm going to commit the first patch, but not the
> second one. There are much faster UTF-8 verification routines out there,
> using SIMD instructions and whatnot, and we should consider adopting one
> of those, but that's future work.

I have something in mind for that.

I took a look at v2, and for the first encoding I tried, it fails to report the error for invalid input:

create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN' LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;

\c euctest
create table foo (a text);

master:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
ERROR:  character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no equivalent in encoding "EUC_CN"
CONTEXT:  COPY foo, line 1

patch:

euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
COPY 0
euctest=#

I believe the problem is in UtfToLocal(). I've attached a fix formatted as a text file to avoid confusing the cfbot. The fix keeps the debugging ereport() in case you find it useful. Some additional test coverage might be good here, but not sure how much work that would be. I didn't check any other conversions yet.


v2-0002 seems fine to me, I just have cosmetic comments here:

+ * the same, no conversion is required by we must still validate that the

s/by/but/

This comment in copyfrom_internal.h above the *StateData struct is the same as the corresponding one in copyto.c:

 * Multi-byte encodings: all supported client-side encodings encode multi-byte
 * characters by having the first byte's high bit set. Subsequent bytes of the
 * character can have the high bit not set. When scanning data in such an
 * encoding to look for a match to a single-byte (ie ASCII) character, we must
 * use the full pg_encoding_mblen() machinery to skip over multibyte
 * characters, else we might find a false match to a trailing byte. In
 * supported server encodings, there is no possibility of a false match, and
 * it's faster to make useless comparisons to trailing bytes than it is to
 * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
 * when we have to do it the hard way.

The references to pg_encoding_mblen() and encoding_embeds_ascii, are out of date for copy-from. I'm not sure the rest is relevant to copy-from anymore, either. Can you confirm?

This comment inside the struct is now out of date as well:

 * Similarly, line_buf holds the whole input line being processed. The
 * input cycle is first to read the whole line into line_buf, convert it
 * to server encoding there, and then extract the individual attribute

HEAD has this macro already:

/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)

It might make sense to create a CONVERSION_BUF_BYTES equivalent since the patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index in a couple places.

--
John Naylor
EDB: http://www.enterprisedb.com
Attachment

pgsql-hackers by date:

Previous
From: Michail Nikolaev
Date:
Subject: Re: Thoughts on "killed tuples" index hint bits support on standby
Next
From: Peter Eisentraut
Date:
Subject: Re: Add primary keys to system catalogs