Thread: Perform COPY FROM encoding conversions in larger chunks
I've been looking at the COPY FROM parsing code, trying to refactor it so that the parallel COPY would be easier to implement. I haven't touched parallelism itself, just looking for ways to smoothen the way, and for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is converted to the database encoding separately, or if the file encoding matches the database encoding, we just check that the input is valid for the encoding. It would be more efficient to do the encoding conversion/verification in larger chunks. At least potentially; the current conversion/verification implementations work one byte at a time so it doesn't matter too much, but there are faster algorithms out there that use SIMD instructions or lookup tables and benefit from larger inputs.

So I'd like to change it so that the encoding conversion/verification is done before splitting the input into lines. The problem is that the conversion and verification functions throw an error on incomplete input, so we can't pass them a chunk of N raw bytes if we don't know where the character boundaries are. The first step in this effort is to change the encoding conversion and verification routines to allow that. Attached patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion function by adding a "bool noError" argument and making it return the number of input bytes successfully converted. That way, the conversion function can be called in a streaming fashion: load a buffer with raw input without caring about the character boundaries, call the conversion function to convert everything except the few bytes at the end that might be an incomplete character, load the buffer with more data, and repeat.

For encoding verification, add a new function that works similarly. It takes N bytes of raw input, verifies as much of it as possible, and returns the number of input bytes that were valid. In principle, this could have been implemented by calling the existing pg_encoding_mblen() and pg_encoding_verifymb() functions in a loop, but that would be too slow, so this adds encoding-specific functions for it. The UTF-8 implementation is slightly optimized by basically inlining the pg_utf8_mblen() call; the other implementations are pretty naive.

- Heikki
Attachment
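A minimal sketch of the streaming pattern described in the message above. The names convert_chunk, read_more_input and report_invalid are hypothetical stand-ins, not actual PostgreSQL functions; convert_chunk models a new-style routine that consumes as much of the buffer as it can and returns the number of bytes converted instead of erroring out on a truncated multibyte character at the end.

#include <stdbool.h>
#include <string.h>

extern int  read_more_input(char *dst, int space);   /* returns 0 at EOF */
extern int  convert_chunk(char *buf, int len, bool noError);
extern void report_invalid(const char *buf, int len);

static void
stream_convert(void)
{
    char    buf[65536];
    int     len = 0;            /* bytes currently held in buf */
    bool    eof = false;

    while (!eof || len > 0)
    {
        int     nread;
        int     nconv;

        /* refill the buffer, without caring about character boundaries */
        nread = read_more_input(buf + len, sizeof(buf) - len);
        if (nread == 0)
            eof = true;
        len += nread;

        /* convert/verify everything except a possibly truncated tail */
        nconv = convert_chunk(buf, len, true);

        if (eof && nconv < len)
        {
            /* no more input is coming, so the leftover bytes are invalid */
            report_invalid(buf + nconv, len - nconv);
            break;
        }

        /* carry the unconsumed tail over to the next round */
        memmove(buf, buf + nconv, len - nconv);
        len -= nconv;
    }
}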
On Wed, Dec 16, 2020 at 02:17:58PM +0200, Heikki Linnakangas wrote:
> I've been looking at the COPY FROM parsing code, trying to refactor it so
> that the parallel COPY would be easier to implement. I haven't touched
> parallelism itself, just looking for ways to smoothen the way. And for ways
> to speed up COPY in general.

Yes, this makes a lot of sense. Glad you are looking into this.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
One of the patches in this patch set is worth calling out separately: 0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per commit message:

    Add direct conversion routines between EUC_TW and Big5.

    Conversions between EUC_TW and Big5 were previously implemented by
    converting the whole input to MIC first, and then from MIC to the
    target encoding. Implement functions to convert directly between the
    two.

    The reason to do this now is that the next patch will change the
    conversion function signature so that if the input is invalid, we
    convert as much as we can and return the number of bytes successfully
    converted. That's not possible if we use an intermediary format,
    because if an error happens in the intermediary -> final conversion,
    we lose track of the location of the invalid character in the
    original input. Avoiding the intermediate step should be faster too.

This patch is fairly independent of the others. It could be reviewed and applied separately.

In order to verify that the new code is correct, I wrote some helper plpgsql functions to generate all valid EUC_TW and Big5 byte sequences that encode one character, and tested converting each of them. Then I compared the results with an unpatched server, to check that the new code performs the same conversion. This is perhaps overkill, but since it's pretty straightforward to enumerate all the input characters, we might as well do it.

For the sake of completeness, I wrote similar helpers for all the other encodings and conversions. Except for UTF-8, there are too many formally valid codepoints for that to be feasible. This does test round-trip conversions of all codepoints from all the other encodings to UTF-8 and back, though, so there's pretty good coverage of UTF-8 too.

This test suite is probably too large to add to the source tree, but for the sake of the archives, I'm attaching it here. The first patch adds the test suite, including the expected output of each conversion. The second patch contains expected output changes for the above patch to add direct conversions between EUC_TW and Big5. It affected the error messages for some byte sequences that cannot be converted. For example, on unpatched master:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0x95 0xfd 0xcc in encoding "MULE_INTERNAL" has no equivalent in encoding "BIG5"

With the patch:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has no equivalent in encoding "BIG5"

The old message talked about "MULE_INTERNAL", which exposes the implementation detail that we used it as an intermediate in the conversion. That can be confusing to a user; the new message makes more sense. So that's also nice.

- Heikki
Attachment
On Wed, Dec 16, 2020 at 8:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Currently, COPY FROM parses the input one line at a time. Each line is
> converted to the database encoding separately, or if the file encoding
> matches the database encoding, we just check that the input is valid for
> the encoding. It would be more efficient to do the encoding
> conversion/verification in larger chunks. At least potentially; the
> current conversion/verification implementations work one byte a time so
> it doesn't matter too much, but there are faster algorithms out there
> that use SIMD instructions or lookup tables that benefit from larger inputs.
Hi Heikki,
This is great news. I've seen examples of such algorithms and that'd be nice to have. I haven't studied the patch in detail, but it looks fine on the whole.
In 0004, it seems you have some doubts about upgrade compatibility. Is that because user-defined conversions would no longer have the right signature?
--
John Naylor
EDB: http://www.enterprisedb.com
On 22/12/2020 22:01, John Naylor wrote:
> In 0004, it seems you have some doubts about upgrade compatibility. Is
> that because user-defined conversions would no longer have the right
> signature?

Exactly. If you have an extension that adds a custom conversion function and does CREATE CONVERSION, the old installation script will fail on the new version. That causes trouble for pg_dump+restore and pg_upgrade.

Perhaps we could accept the old signature in the server when you do CREATE CONVERSION, but somehow mark the conversion as broken in the catalog so that you would get a runtime error if you tried to use it. That would be enough to make pg_dump+restore (and pg_upgrade) not throw an error, and you could then upgrade the extension later (ALTER EXTENSION UPDATE).

I'm not sure it's worth the trouble, though. Custom conversions are very rare. And I don't think any other object can depend on a conversion, so you can always drop the conversion before upgrade, and re-create it with the new function signature afterwards. A note in the release notes and a check in pg_upgrade, with instructions to drop and recreate the conversion, are probably enough.

- Heikki
On Wed, Dec 23, 2020 at 3:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm not sure it's worth the trouble, though. Custom conversions are very
> rare. And I don't think any other object can depend on a conversion, so
> you can always drop the conversion before upgrade, and re-create it with
> the new function signature afterwards. A note in the release notes and a
> check in pg_upgrade, with instructions to drop and recreate the
> conversion, are probably enough.
Hi Heikki,
0001 through 0003 are straightforward, and I think they can be committed now if you like.
0004 is also pretty straightforward. The check you proposed upthread for pg_upgrade seems like the best solution to make that workable. I'll take a look at 0005 soon.
I measured the conversions that were rewritten in 0003, and there is indeed a noticeable speedup:
Big5 to EUC-TW:
head 196ms
0001-3 152ms
EUC-TW to Big5:
head 190ms
0001-3 144ms
I've attached the driver function for reference. Example use:
select drive_conversion(
1000, 'euc_tw'::name, 'big5'::name,
convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);
I took a look at the test suite also, and the only thing to note is a couple places where the comment doesn't match the code:
+ -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+ byte1 = hex('0e');
+ for byte2 in hex('a1')..hex('df') loop
+ return next b(byte1, byte2);
+ end loop;
+
+ -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+ byte1 = hex('0f');
+ for byte2 in hex('a1')..hex('fe') loop
+ for byte3 in hex('a1')..hex('fe') loop
+ return next b(byte1, byte2, byte3);
+ end loop;
+ end loop;
Not sure if it matters, but thought I'd mention it anyway.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.

Thanks for the review!

I did some more rigorous microbenchmarking of patches 1 and 2. I used the attached test script, which calls the convert_from() function to perform UTF-8 verification on two large strings, about 60 kB each. One of the strings is pure ASCII, and the other is an HTML page that contains a mix of ASCII and multibyte characters.

Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)

           | mixed | ascii
-----------+-------+-------
 master    |  1866 |  1250
 patch 1   |   959 |   507
 patch 1+2 |  1396 |   987

So the first patch, 0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge difference, even with pure ASCII input. That's very surprising, because there is already a fast path for pure-ASCII input in pg_verify_mbstr_len(). Even more surprising was that the second patch (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch) actually made things worse again. I thought it would give a modest gain, but nope.

It seems to me that GCC is not doing a good job at optimizing the loop in pg_verify_mbstr(). The first patch fixes that, but the second patch somehow trips up GCC again. So I also tried this with "gcc -O3" and clang:

Compiled with "gcc -O3"

           | mixed | ascii
-----------+-------+-------
 master    |  1522 |  1225
 patch 1   |   753 |   507
 patch 1+2 |   868 |   507

Compiled with "clang -O2", Debian clang version 11.0.1-2

           | mixed | ascii
-----------+-------+-------
 master    |  1257 |   520
 patch 1   |   899 |   507
 patch 1+2 |   884 |   508

With gcc -O3, the results are a bit better, but the second patch still seems harmful. With clang, I got the result I expected: almost no difference with pure-ASCII input, because there's already a fast path for that, and a nice speedup with multibyte characters. Still, I was surprised how big the speedup from the first patch was, and how little additional gain the second patch gives.

Based on these results, I'm going to commit the first patch, but not the second one. There are much faster UTF-8 verification routines out there, using SIMD instructions and whatnot, and we should consider adopting one of those, but that's future work.

- Heikki
Attachment
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.
>
> 0004 is also pretty straightforward. The check you proposed upthread for
> pg_upgrade seems like the best solution to make that workable. I'll take
> a look at 0005 soon.
>
> I measured the conversions that were rewritten in 0003, and there is
> indeed a noticeable speedup:
>
> Big5 to EUC-TW:
>
> head    196ms
> 0001-3  152ms
>
> EUC-TW to Big5:
>
> head    190ms
> 0001-3  144ms
>
> I've attached the driver function for reference. Example use:
>
> select drive_conversion(
>   1000, 'euc_tw'::name, 'big5'::name,
>   convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
> );

Thanks! I have committed patches 0001 and 0003 in this series, with minor comment fixes. Next I'm going to write the pg_upgrade check for patch 0004, to get that into a committable state too.

> I took a look at the test suite also, and the only thing to note is a
> couple places where the comment doesn't match the code:
>
> + -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
> + byte1 = hex('0e');
> + for byte2 in hex('a1')..hex('df') loop
> + return next b(byte1, byte2);
> + end loop;
> +
> + -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
> + byte1 = hex('0f');
> + for byte2 in hex('a1')..hex('fe') loop
> + for byte3 in hex('a1')..hex('fe') loop
> + return next b(byte1, byte2, byte3);
> + end loop;
> + end loop;
>
> Not sure if it matters, but thought I'd mention it anyway.

Good catch! The comments were correct, and the tests were wrong, not testing those 2- and 3-byte encoded characters as intended. It doesn't matter for testing this patch, I only included those euc_jis_2004 tests for the sake of completeness, but if someone finds this test suite in the archives and wants to use it for something real, make sure you fix that first.

- Heikki
On 28/01/2021 15:05, Heikki Linnakangas wrote:
> Next I'm going to write the pg_upgrade check for
> patch 0004, to get that into a committable state too.

As promised, here are new versions of the remaining patches, with the pg_upgrade check added. If you have any custom encoding conversions in the old cluster, pg_upgrade now fails:

> Performing Consistency Checks
> -----------------------------
> Checking cluster versions                                   ok
> Checking database user is the install user                  ok
> Checking database connection settings                       ok
> Checking for prepared transactions                          ok
> Checking for reg* data types in user tables                 ok
> Checking for contrib/isn with bigint-passing mismatch       ok
> Checking for user-defined encoding conversions              fatal
>
> Your installation contains user-defined encoding conversions.
> The conversion function parameters changed in PostgreSQL version 14
> so this cluster cannot currently be upgraded.  You can remove the
> encoding conversions in the old cluster and restart the upgrade.
> A list of user-defined encoding conversions is in the file:
>     encoding_conversions.txt
>
> Failure, exiting

To test this, I wrote a dummy conversion function, also attached.

- Heikki
Attachment
On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Even more surprising was that the second patch
> (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
> actually made things worse again. I thought it would give a modest gain,
> but nope.
Hmm, that surprised me too.
> Based on these results, I'm going to commit the first patch, but not the
> second one. There are much faster UTF-8 verification routines out there,
> using SIMD instructions and whatnot, and we should consider adopting one
> of those, but that's future work.
I have something in mind for that.
I took a look at v2, and for the first encoding I tried, it fails to report the error for invalid input:
create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN' LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;
\c euctest
create table foo (a text);
master:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
ERROR: character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no equivalent in encoding "EUC_CN"
CONTEXT: COPY foo, line 1
patch:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
COPY 0
euctest=#
I believe the problem is in UtfToLocal(). I've attached a fix formatted as a text file to avoid confusing the cfbot. The fix keeps the debugging ereport() in case you find it useful. Some additional test coverage might be good here, but not sure how much work that would be. I didn't check any other conversions yet.
v2-0002 seems fine to me, I just have cosmetic comments here:
+ * the same, no conversion is required by we must still validate that the
s/by/but/
This comment in copyfrom_internal.h above the *StateData struct is the same as the corresponding one in copyto.c:
* Multi-byte encodings: all supported client-side encodings encode multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
* when we have to do it the hard way.
The references to pg_encoding_mblen() and encoding_embeds_ascii are out of date for copy-from. I'm not sure the rest is relevant to copy-from anymore, either. Can you confirm?
This comment inside the struct is now out of date as well:
* Similarly, line_buf holds the whole input line being processed. The
* input cycle is first to read the whole line into line_buf, convert it
* to server encoding there, and then extract the individual attribute
HEAD has this macro already:
/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
It might make sense to create a CONVERSION_BUF_BYTES equivalent since the patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index in a couple places.
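Something along these lines would mirror the existing RAW_BUF_BYTES definition; a sketch only, using the field names quoted above:

/* Shorthand for number of unconsumed bytes available in conversion_buf */
#define CONVERSION_BUF_BYTES(cstate) \
    ((cstate)->conversion_buf_len - (cstate)->conversion_buf_index)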
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On 30/01/2021 20:47, John Naylor wrote:
> I took a look at v2, and for the first encoding I tried, it fails to
> report the error for invalid input:

That's embarrassing...

> I believe the problem is in UtfToLocal(). I've attached a fix formatted
> as a text file to avoid confusing the cfbot. The fix keeps the debugging
> ereport() in case you find it useful.

Thanks. I fixed it slightly differently, and also changed LocalToUtf() to follow the same pattern, even though LocalToUtf() did not have the same bug.

> Some additional test coverage might be good here, but not sure how
> much work that would be. I didn't check any other conversions yet.

I added a bunch of tests for various built-in conversions.

> v2-0002 seems fine to me, I just have cosmetic comments here:
>
> + * the same, no conversion is required by we must still validate that the
>
> s/by/but/
>
> This comment in copyfrom_internal.h above the *StateData struct is the
> same as the corresponding one in copyto.c:
>
> * Multi-byte encodings: all supported client-side encodings encode multi-byte
> * characters by having the first byte's high bit set. Subsequent bytes of the
> * character can have the high bit not set. When scanning data in such an
> * encoding to look for a match to a single-byte (ie ASCII) character, we must
> * use the full pg_encoding_mblen() machinery to skip over multibyte
> * characters, else we might find a false match to a trailing byte. In
> * supported server encodings, there is no possibility of a false match, and
> * it's faster to make useless comparisons to trailing bytes than it is to
> * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
> * when we have to do it the hard way.
>
> The references to pg_encoding_mblen() and encoding_embeds_ascii are out
> of date for copy-from. I'm not sure the rest is relevant to copy-from
> anymore, either. Can you confirm?

Yeah, that comment is obsolete for COPY FROM, the encoding conversion works differently now. Removed it from copyfrom_internal.h.

> This comment inside the struct is now out of date as well:
>
> * Similarly, line_buf holds the whole input line being processed. The
> * input cycle is first to read the whole line into line_buf, convert it
> * to server encoding there, and then extract the individual attribute
>
> HEAD has this macro already:
>
> /* Shorthand for number of unconsumed bytes available in raw_buf */
> #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
>
> It might make sense to create a CONVERSION_BUF_BYTES equivalent since
> the patch calculates cstate->conversion_buf_len -
> cstate->conversion_buf_index in a couple places.

Thanks for the review! I spent some time refactoring and adding comments all around the patch, hopefully making it all more clear. One notable difference is that I renamed 'raw_buf' (which exists in master too) to 'input_buf', and renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this patch again another day with fresh eyes, and also try to add some tests for the corner cases at buffer boundaries.

Attached is a new set of patches. I added some regression tests for the built-in conversion functions, which cover the bug you found, and many other interesting cases that did not have test coverage yet. It comes in two patches: the first patch uses just the existing convert_from() SQL function, and the second one uses the new "noError" variants of the conversion functions.

I also kept the bug fixes compared to the previous patch version as a separate commit, for easier review.

- Heikki
Attachment
- v3-0001-Add-regression-tests-for-built-in-encoding-conver.patch
- v3-0002-Change-conversion-function-signature.patch
- v3-0003-Add-tests-for-the-new-noError-variants-of-built-i.patch
- v3-0004-Fix-bugs-in-the-commit-to-change-conversion-funct.patch
- v3-0005-Do-COPY-FROM-encoding-conversion-verification-in-.patch
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks. I fixed it slightly differently, and also changed LocalToUtf()
> to follow the same pattern, even though LocalToUtf() did not have the
> same bug.
Looks good to me.
> I added a bunch of tests for various built-in conversions.
Nice! I would like to have utf8 tests for every category of invalid byte (overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch.
> I spent some time refactoring and adding comments all around the patch,
> hopefully making it all more clear. One notable difference is that I
> renamed 'raw_buf' (which exists in master too) to 'input_buf', and
> renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
> patch again another day with fresh eyes, and also try to add some tests
> for the corner cases at buffer boundaries.
The comments and renaming are really helpful in understanding that file!
Although a new patch is likely forthcoming, I did take a brief look and found the following:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.
One of your FIXME comments seems to allude to this, but if we really need a difference here, maybe it should be explained:
+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */
Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003 accidentally?
--
John Naylor
EDB: http://www.enterprisedb.com
On 02/02/2021 23:42, John Naylor wrote:
> Although a new patch is likely forthcoming, I did take a brief look and
> found the following:
>
> In copyfromparse.c, this is now out of date:
>
> * Read the next input line and stash it in line_buf, with conversion to
> * server encoding.
>
> One of your FIXME comments seems to allude to this, but if we really
> need a difference here, maybe it should be explained:
>
> +#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
>
> +#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */

We do in fact still need the +1 for the NUL terminator. It was missing from the last patch version, but that was wrong; my fuzz testing actually uncovered a bug caused by that. Fixed.

Attached are new patch versions. The first patch is the same as before, but rebased, pgindented, and with a couple of tiny fixes where conversion functions were still missing the "if (noError) break;" checks. I've hacked on the second patch more, doing more refactoring and commenting for readability. I think it's in pretty good shape now.

- Heikki
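Concretely, the difference amounts to one extra byte in the allocation; an illustrative fragment, not a literal excerpt from the patch:

/* the +1 leaves room for a terminating NUL at the end of the input buffer */
cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);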
Attachment
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/02/2021 23:42, John Naylor wrote:
> >
> > In copyfromparse.c, this is now out of date:
> >
> > * Read the next input line and stash it in line_buf, with conversion to
> > * server encoding.
This comment for CopyReadLine() is still there. Conversion already happened by now, so I think this comment is outdated.
Other than that, I think this is ready for commit.
--
John Naylor
EDB: http://www.enterprisedb.com
On 09/02/2021 15:40, John Naylor wrote:
> On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/02/2021 23:42, John Naylor wrote:
> > >
> > > In copyfromparse.c, this is now out of date:
> > >
> > > * Read the next input line and stash it in line_buf, with conversion to
> > > * server encoding.
>
> This comment for CopyReadLine() is still there. Conversion already
> happened by now, so I think this comment is outdated.
>
> Other than that, I think this is ready for commit.

Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1" somehow went missing again. That was causing a failure on Windows at cfbot.cputube.org.

I'll read through this one more time with fresh eyes tomorrow or the day after, and push. Thanks for all the review!

- Heikki
On 09/02/2021 19:36, Heikki Linnakangas wrote:
> On 09/02/2021 15:40, John Naylor wrote:
>> On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> >
>> > On 02/02/2021 23:42, John Naylor wrote:
>> > >
>> > > In copyfromparse.c, this is now out of date:
>> > >
>> > > * Read the next input line and stash it in line_buf, with conversion to
>> > > * server encoding.
>>
>> This comment for CopyReadLine() is still there. Conversion already
>> happened by now, so I think this comment is outdated.
>>
>> Other than that, I think this is ready for commit.
>
> Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
> somehow went missing again. That was causing a failure on Windows at
> cfbot.cputube.org.
>
> I'll read through this one more time with fresh eyes tomorrow or the day
> after, and push. Thanks for all the review!

Forgot attachment..

- Heikki
Attachment
On Tue, Feb 9, 2021 at 1:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
> > somehow went missing again. That was causing a failure on Windows at
> > cfbot.cputube.org.
> >
> > I'll read through this one more time with fresh eyes tomorrow or the day
> > after, and push. Thanks for all the review!
>
> Forgot attachment..
>
> - Heikki
I went ahead and rebased these.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
I wrote:
> I went ahead and rebased these.
It looks like FreeBSD doesn't like this for some reason.
I also wanted to see if this patch set had any performance effect, with and without changing how UTF-8 is validated, using the blackhole am from https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.
create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"
...where test-copy.txt is made by
for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ; done;
On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):
master:
109ms
v6 do encoding in larger chunks:
109ms
v7 utf8 SIMD:
98ms
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Mar 18, 2021 at 2:05 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I wrote:
>
> > I went ahead and rebased these.
>
> It looks like FreeBSD doesn't like this for some reason.
On closer examination, make check was "terminated", so it's not that the tests failed...
--
John Naylor
EDB: http://www.enterprisedb.com
On 18/03/2021 20:05, John Naylor wrote:
> I wrote:
>
> > I went ahead and rebased these.

Thanks!

> I also wanted to see if this patch set had any performance effect, with
> and without changing how UTF-8 is validated, using the blackhole am from
> https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.
>
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"
>
> ...where test-copy.txt is made by
>
> for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ; done;
>
> On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):
>
> master:
> 109ms
>
> v6 do encoding in larger chunks:
> 109ms
>
> v7 utf8 SIMD:
> 98ms

That's disappointing. Perhaps the file size is just too small to see the effect? I'm seeing results between 40 ms and 75 ms on my laptop when I run a test like that multiple times. I used "WHERE false" instead of the blackhole AM, but I don't think that makes much difference (only showing a few runs here for brevity):

for i in {1..100}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ; done;

postgres=# create table blackhole_tab (a text) ;
CREATE TABLE
postgres=# \timing
Timing is on.
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 53.166 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 43.981 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 71.850 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
...

I tested that with a larger file:

for i in {1..10000}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ; done;

postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;

v6 do encoding in larger chunks (best of five):
Time: 3955.514 ms (00:03.956)

master (best of five):
Time: 4133.767 ms (00:04.134)

So with that, I'm seeing a measurable difference.

- Heikki
On 01/04/2021 11:09, Heikki Linnakangas wrote:
> On 18/03/2021 20:05, John Naylor wrote:
>> I wrote:
>>
>> > I went ahead and rebased these.
>
> Thanks!

I read through the patches one more time, made a few small comment fixes, and pushed.

- Heikki
On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears would have perked right up if the subject had been something like 'improve encoding conversion API to stream a buffer at a time'.

I think this is of great interest beyond one particular use case in COPY FROM. For example, it could limit the allocations needed when streaming a large text value out to a client; it might be used to advantage with the recent work in incrementally detoasting large values, and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great.  Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer.  Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer.  Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().

If I understand correctly, this patch already makes a breaking change to the conversion function API. If that's going to be the case anyway, I wonder if it's worth going further and changing the API to eliminate this odd limitation.

There seems to be a sort of common shape that conversion APIs have evolved toward, that can be seen in both the ICU4C converters [0] and in Java's [1]. This current tweak to our conversion API seems to get allllmmoooosst there, but just not quite. For example, noError allows us to keep control when the function has stopped converting, but we don't find out why it stopped.

If we just went the rest of the way and structured the API like those existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if there were any we wanted to use;
- I could very easily write wrappers presenting any PG-supported charset as a Java charset.

The essence of the API common to ICU4C and Java is this:

1. You pass the function the address and length of a source buffer, the address and length of a destination buffer, and a flag that is true if you know there is no further input where this source buffer came from. (It's allowable to pass false and only then discover you really have no more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much as it can of the destination buffer, and returns indicating one of four reasons it stopped:

   underflow  - ran out of source buffer
   overflow   - ran out of destination buffer
   malformed  - something in source buffer isn't valid in that representation
   unmappable - a valid codepoint not available in destination encoding

   Based on that, the caller refills the source buffer, or drains the destination buffer, or handles or reports the malformed or unmappable, then repeats.

3. The function should update pointers on return to indicate how much of the source buffer it consumed and how much of the destination buffer it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (so it is an object in Java, or has a UConverter struct allocated in ICU4C, to have a place for its state). The state gets flushed on the final call where the flag is passed true. In many cases, the converter can be implemented without keeping internal state, if it simply leaves, for example, an incomplete sequence at the end of the source buffer unconsumed, so the caller will move it to the front and supply the rest. On the other hand, any unconsumed input after the final call with flush flag true must be treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed at the start of the offending sequence and the length in bytes of that sequence is available for error reporting/recovery.

The efficient handling of states, returning updated pointers, and so on, probably requires a function signature with 'internal' in it ... but the current function signature already has 'internal', so that doesn't seem like a deal-breaker.

Thoughts? It seems a shame to make a breaking change in the conversion API, only to still end up with an API that "is not great" and is still impedance-mismatched to other existing prominent conversion APIs.

Regards,
-Chap

[0] https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed
[1] https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-
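To make the shape in points 1-6 above concrete, a rough C rendering; the type and function names here are illustrative only, not an actual PostgreSQL or ICU interface:

#include <stdbool.h>

typedef enum ConvResult
{
    CONV_UNDERFLOW,     /* ran out of source buffer */
    CONV_OVERFLOW,      /* ran out of destination buffer */
    CONV_MALFORMED,     /* source bytes invalid in the source encoding */
    CONV_UNMAPPABLE     /* valid codepoint with no equivalent in the target */
} ConvResult;

typedef struct Converter Converter;    /* opaque; may hold internal state */

/*
 * Consume as much of the source buffer as possible, writing into the
 * destination buffer.  On return, *src and *dst are advanced past the
 * bytes consumed and produced; any unconsumed source bytes must be
 * preserved by the caller for the next call.  'final' is true on the
 * last call, which also flushes any internal state.  On CONV_MALFORMED
 * or CONV_UNMAPPABLE, *src points at the offending sequence and
 * *bad_len gives its length in bytes.
 */
extern ConvResult convert_buffer(Converter *cv,
                                 const char **src, const char *src_end,
                                 char **dst, char *dst_end,
                                 bool final, int *bad_len);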