Thread: Perform COPY FROM encoding conversions in larger chunks
I've been looking at the COPY FROM parsing code, trying to refactor it so that the parallel COPY would be easier to implement. I haven't touched parallelism itself, just looking for ways to smoothen the way, and for ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is converted to the database encoding separately, or if the file encoding matches the database encoding, we just check that the input is valid for the encoding. It would be more efficient to do the encoding conversion/verification in larger chunks. At least potentially; the current conversion/verification implementations work one byte at a time so it doesn't matter too much, but there are faster algorithms out there that use SIMD instructions or lookup tables and benefit from larger inputs.

So I'd like to change it so that the encoding conversion/verification is done before splitting the input into lines. The problem is that the conversion and verification functions throw an error on incomplete input, so we can't pass them a chunk of N raw bytes if we don't know where the character boundaries are. The first step in this effort is to change the encoding conversion and verification routines to allow that. Attached patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion function by adding a "bool noError" argument and making it return the number of input bytes successfully converted. That way, the conversion function can be called in a streaming fashion: load a buffer with raw input without caring about the character boundaries, call the conversion function to convert everything except the few bytes at the end that might be an incomplete character, load the buffer with more data, and repeat.

For encoding verification, add a new function that works similarly. It takes N bytes of raw input, verifies as much of it as possible, and returns the number of input bytes that were valid. In principle, this could have been implemented by calling the existing pg_encoding_mblen() and pg_encoding_verifymb() functions in a loop, but that would be too slow, so this adds encoding-specific functions for it. The UTF-8 implementation is slightly optimized by basically inlining the pg_utf8_mblen() call; the other implementations are pretty naive.

- Heikki
Attachment
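A minimal sketch of the streaming pattern described in the message above. The names convert_chunk, read_more_input and report_invalid are hypothetical stand-ins, not actual PostgreSQL functions; convert_chunk models a new-style routine that consumes as much of the buffer as it can and returns the number of bytes converted instead of erroring out on a truncated multibyte character at the end.

#include <stdbool.h>
#include <string.h>

extern int  read_more_input(char *dst, int space);   /* returns 0 at EOF */
extern int  convert_chunk(char *buf, int len, bool noError);
extern void report_invalid(const char *buf, int len);

static void
stream_convert(void)
{
    char    buf[65536];
    int     len = 0;            /* bytes currently held in buf */
    bool    eof = false;

    while (!eof || len > 0)
    {
        int     nread;
        int     nconv;

        /* refill the buffer, without caring about character boundaries */
        nread = read_more_input(buf + len, sizeof(buf) - len);
        if (nread == 0)
            eof = true;
        len += nread;

        /* convert/verify everything except a possibly truncated tail */
        nconv = convert_chunk(buf, len, true);

        if (eof && nconv < len)
        {
            /* no more input is coming, so the leftover bytes are invalid */
            report_invalid(buf + nconv, len - nconv);
            break;
        }

        /* carry the unconsumed tail over to the next round */
        memmove(buf, buf + nconv, len - nconv);
        len -= nconv;
    }
}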
On Wed, Dec 16, 2020 at 02:17:58PM +0200, Heikki Linnakangas wrote:
> I've been looking at the COPY FROM parsing code, trying to refactor it so
> that the parallel COPY would be easier to implement. I haven't touched
> parallelism itself, just looking for ways to smoothen the way. And for ways
> to speed up COPY in general.

Yes, this makes a lot of sense. Glad you are looking into this.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
One of the patches in this patch set is worth calling out separately: 0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per commit message:

    Add direct conversion routines between EUC_TW and Big5.

    Conversions between EUC_TW and Big5 were previously implemented by
    converting the whole input to MIC first, and then from MIC to the
    target encoding. Implement functions to convert directly between the
    two.

    The reason to do this now is that the next patch will change the
    conversion function signature so that if the input is invalid, we
    convert as much as we can and return the number of bytes successfully
    converted. That's not possible if we use an intermediary format,
    because if an error happens in the intermediary -> final conversion,
    we lose track of the location of the invalid character in the
    original input. Avoiding the intermediate step should be faster too.

This patch is fairly independent of the others. It could be reviewed and applied separately.

In order to verify that the new code is correct, I wrote some helper plpgsql functions to generate all valid EUC_TW and Big5 byte sequences that encode one character, and tested converting each of them. Then I compared the results with an unpatched server, to check that the new code performs the same conversion. This is perhaps overkill, but since it's pretty straightforward to enumerate all the input characters, we might as well do it.

For the sake of completeness, I wrote similar helpers for all the other encodings and conversions. Except for UTF-8, there are too many formally valid codepoints for that to be feasible. This does test round-trip conversions of all codepoints from all the other encodings to UTF-8 and back, though, so there's pretty good coverage of UTF-8 too.

This test suite is probably too large to add to the source tree, but for the sake of the archives, I'm attaching it here. The first patch adds the test suite, including the expected output of each conversion. The second patch contains expected output changes for the above patch to add direct conversions between EUC_TW and Big5. It affected the error messages for some byte sequences that cannot be converted. For example, on unpatched master:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0x95 0xfd 0xcc in encoding "MULE_INTERNAL" has no equivalent in encoding "BIG5"

With the patch:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has no equivalent in encoding "BIG5"

The old message talked about "MULE_INTERNAL", which exposes the implementation detail that we used it as an intermediate in the conversion. That can be confusing to a user; the new message makes more sense. So that's also nice.

- Heikki
Attachment
On Wed, Dec 16, 2020 at 8:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Currently, COPY FROM parses the input one line at a time. Each line is
> converted to the database encoding separately, or if the file encoding
> matches the database encoding, we just check that the input is valid for
> the encoding. It would be more efficient to do the encoding
> conversion/verification in larger chunks. At least potentially; the
> current conversion/verification implementations work one byte a time so
> it doesn't matter too much, but there are faster algorithms out there
> that use SIMD instructions or lookup tables that benefit from larger inputs.
Hi Heikki,
This is great news. I've seen examples of such algorithms and that'd be nice to have. I haven't studied the patch in detail, but it looks fine on the whole.
In 0004, it seems you have some doubts about upgrade compatibility. Is that because user-defined conversions would no longer have the right signature?
--
John Naylor
EDB: http://www.enterprisedb.com
On 22/12/2020 22:01, John Naylor wrote:
> In 0004, it seems you have some doubts about upgrade compatibility. Is
> that because user-defined conversions would no longer have the right
> signature?

Exactly. If you have an extension that adds a custom conversion function and does CREATE CONVERSION, the old installation script will fail on the new version. That causes trouble for pg_dump+restore and pg_upgrade.

Perhaps we could accept the old signature in the server when you do CREATE CONVERSION, but somehow mark the conversion as broken in the catalog so that you would get a runtime error if you tried to use it. That would be enough to make pg_dump+restore (and pg_upgrade) not throw an error, and you could then upgrade the extension later (ALTER EXTENSION UPDATE).

I'm not sure it's worth the trouble, though. Custom conversions are very rare. And I don't think any other object can depend on a conversion, so you can always drop the conversion before upgrade, and re-create it with the new function signature afterwards. A note in the release notes and a check in pg_upgrade, with instructions to drop and recreate the conversion, are probably enough.

- Heikki
On Wed, Dec 23, 2020 at 3:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm not sure it's worth the trouble, though. Custom conversions are very
> rare. And I don't think any other object can depend on a conversion, so
> you can always drop the conversion before upgrade, and re-create it with
> the new function signature afterwards. A note in the release notes and a
> check in pg_upgrade, with instructions to drop and recreate the
> conversion, are probably enough.
Hi Heikki,
0001 through 0003 are straightforward, and I think they can be committed now if you like.
0004 is also pretty straightforward. The check you proposed upthread for pg_upgrade seems like the best solution to make that workable. I'll take a look at 0005 soon.
I measured the conversions that were rewritten in 0003, and there is indeed a noticeable speedup:
Big5 to EUC-TW:
head 196ms
0001-3 152ms
EUC-TW to Big5:
head 190ms
0001-3 144ms
I've attached the driver function for reference. Example use:
select drive_conversion(
1000, 'euc_tw'::name, 'big5'::name,
convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);
I took a look at the test suite also, and the only thing to note is a couple places where the comment doesn't match the code:
+ -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+ byte1 = hex('0e');
+ for byte2 in hex('a1')..hex('df') loop
+ return next b(byte1, byte2);
+ end loop;
+
+ -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+ byte1 = hex('0f');
+ for byte2 in hex('a1')..hex('fe') loop
+ for byte3 in hex('a1')..hex('fe') loop
+ return next b(byte1, byte2, byte3);
+ end loop;
+ end loop;
Not sure if it matters, but thought I'd mention it anyway.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.

Thanks for the review!

I did some more rigorous microbenchmarking of patches 1 and 2. I used the attached test script, which calls the convert_from() function to perform UTF-8 verification on two large strings, about 60 kB each. One of the strings is pure ASCII, and the other is an HTML page that contains a mix of ASCII and multibyte characters.

Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)

           | mixed | ascii
-----------+-------+-------
 master    |  1866 |  1250
 patch 1   |   959 |   507
 patch 1+2 |  1396 |   987

So the first patch, 0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge difference, even with pure ASCII input. That's very surprising, because there is already a fast path for pure-ASCII input in pg_verify_mbstr_len(). Even more surprising was that the second patch (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch) actually made things worse again. I thought it would give a modest gain, but nope.

It seems to me that GCC is not doing a good job at optimizing the loop in pg_verify_mbstr(). The first patch fixes that, but the second patch somehow trips up GCC again. So I also tried this with "gcc -O3" and clang:

Compiled with "gcc -O3"

           | mixed | ascii
-----------+-------+-------
 master    |  1522 |  1225
 patch 1   |   753 |   507
 patch 1+2 |   868 |   507

Compiled with "clang -O2", Debian clang version 11.0.1-2

           | mixed | ascii
-----------+-------+-------
 master    |  1257 |   520
 patch 1   |   899 |   507
 patch 1+2 |   884 |   508

With gcc -O3, the results are a bit better, but the second patch still seems harmful. With clang, I got the result I expected: almost no difference with pure-ASCII input, because there's already a fast path for that, and a nice speedup with multibyte characters. Still, I was surprised how big the speedup from the first patch was, and how little additional gain the second patch gives.

Based on these results, I'm going to commit the first patch, but not the second one. There are much faster UTF-8 verification routines out there, using SIMD instructions and whatnot, and we should consider adopting one of those, but that's future work.

- Heikki
Attachment
On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.
>
> 0004 is also pretty straightforward. The check you proposed upthread for
> pg_upgrade seems like the best solution to make that workable. I'll take
> a look at 0005 soon.
>
> I measured the conversions that were rewritten in 0003, and there is
> indeed a noticeable speedup:
>
> Big5 to EUC-TW:
>
> head    196ms
> 0001-3  152ms
>
> EUC-TW to Big5:
>
> head    190ms
> 0001-3  144ms
>
> I've attached the driver function for reference. Example use:
>
> select drive_conversion(
>   1000, 'euc_tw'::name, 'big5'::name,
>   convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
> );

Thanks! I have committed patches 0001 and 0003 in this series, with minor comment fixes. Next I'm going to write the pg_upgrade check for patch 0004, to get that into a committable state too.

> I took a look at the test suite also, and the only thing to note is a
> couple places where the comment doesn't match the code:
>
> + -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
> + byte1 = hex('0e');
> + for byte2 in hex('a1')..hex('df') loop
> + return next b(byte1, byte2);
> + end loop;
> +
> + -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
> + byte1 = hex('0f');
> + for byte2 in hex('a1')..hex('fe') loop
> + for byte3 in hex('a1')..hex('fe') loop
> + return next b(byte1, byte2, byte3);
> + end loop;
> + end loop;
>
> Not sure if it matters, but thought I'd mention it anyway.

Good catch! The comments were correct, and the tests were wrong, not testing those 2- and 3-byte encoded characters as intended. It doesn't matter for testing this patch, I only included those euc_jis_2004 tests for the sake of completeness, but if someone finds this test suite in the archives and wants to use it for something real, make sure you fix that first.

- Heikki
On 28/01/2021 15:05, Heikki Linnakangas wrote:
> Next I'm going to write the pg_upgrade check for
> patch 0004, to get that into a committable state too.

As promised, here are new versions of the remaining patches, with the pg_upgrade check added. If you have any custom encoding conversions in the old cluster, pg_upgrade now fails:

> Performing Consistency Checks
> -----------------------------
> Checking cluster versions                                   ok
> Checking database user is the install user                  ok
> Checking database connection settings                       ok
> Checking for prepared transactions                          ok
> Checking for reg* data types in user tables                 ok
> Checking for contrib/isn with bigint-passing mismatch       ok
> Checking for user-defined encoding conversions              fatal
>
> Your installation contains user-defined encoding conversions.
> The conversion function parameters changed in PostgreSQL version 14
> so this cluster cannot currently be upgraded.  You can remove the
> encoding conversions in the old cluster and restart the upgrade.
> A list of user-defined encoding conversions is in the file:
>     encoding_conversions.txt
>
> Failure, exiting

To test this, I wrote a dummy conversion function, also attached.

- Heikki
Attachment
On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Even more surprising was that the second patch
> (0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
> actually made things worse again. I thought it would give a modest gain,
> but nope.
Hmm, that surprised me too.
> Based on these results, I'm going to commit the first patch, but not the
> second one. There are much faster UTF-8 verification routines out there,
> using SIMD instructions and whatnot, and we should consider adopting one
> of those, but that's future work.
I have something in mind for that.
I took a look at v2, and for the first encoding I tried, it fails to report the error for invalid input:
create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN' LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;
\c euctest
create table foo (a text);
master:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
ERROR: character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no equivalent in encoding "EUC_CN"
CONTEXT: COPY foo, line 1
patch:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> ä
>> \.
COPY 0
euctest=#
I believe the problem is in UtfToLocal(). I've attached a fix formatted as a text file to avoid confusing the cfbot. The fix keeps the debugging ereport() in case you find it useful. Some additional test coverage might be good here, but not sure how much work that would be. I didn't check any other conversions yet.
v2-0002 seems fine to me, I just have cosmetic comments here:
+ * the same, no conversion is required by we must still validate that the
s/by/but/
This comment in copyfrom_internal.h above the *StateData struct is the same as the corresponding one in copyto.c:
* Multi-byte encodings: all supported client-side encodings encode multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
* when we have to do it the hard way.
The references to pg_encoding_mblen() and encoding_embeds_ascii are out of date for copy-from. I'm not sure the rest is relevant to copy-from anymore, either. Can you confirm?
This comment inside the struct is now out of date as well:
* Similarly, line_buf holds the whole input line being processed. The
* input cycle is first to read the whole line into line_buf, convert it
* to server encoding there, and then extract the individual attribute
HEAD has this macro already:
/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
It might make sense to create a CONVERSION_BUF_BYTES equivalent since the patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index in a couple places.
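Something along these lines would mirror the existing RAW_BUF_BYTES definition; a sketch only, using the field names quoted above:

/* Shorthand for number of unconsumed bytes available in conversion_buf */
#define CONVERSION_BUF_BYTES(cstate) \
    ((cstate)->conversion_buf_len - (cstate)->conversion_buf_index)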
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
On 30/01/2021 20:47, John Naylor wrote:
> I took a look at v2, and for the first encoding I tried, it fails to
> report the error for invalid input:

That's embarrassing...

> I believe the problem is in UtfToLocal(). I've attached a fix formatted
> as a text file to avoid confusing the cfbot. The fix keeps the debugging
> ereport() in case you find it useful.

Thanks. I fixed it slightly differently, and also changed LocalToUtf() to follow the same pattern, even though LocalToUtf() did not have the same bug.

> Some additional test coverage might be good here, but not sure how
> much work that would be. I didn't check any other conversions yet.

I added a bunch of tests for various built-in conversions.

> v2-0002 seems fine to me, I just have cosmetic comments here:
>
> + * the same, no conversion is required by we must still validate that the
>
> s/by/but/
>
> This comment in copyfrom_internal.h above the *StateData struct is the
> same as the corresponding one in copyto.c:
>
> * Multi-byte encodings: all supported client-side encodings encode multi-byte
> * characters by having the first byte's high bit set. Subsequent bytes of the
> * character can have the high bit not set. When scanning data in such an
> * encoding to look for a match to a single-byte (ie ASCII) character, we must
> * use the full pg_encoding_mblen() machinery to skip over multibyte
> * characters, else we might find a false match to a trailing byte. In
> * supported server encodings, there is no possibility of a false match, and
> * it's faster to make useless comparisons to trailing bytes than it is to
> * invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
> * when we have to do it the hard way.
>
> The references to pg_encoding_mblen() and encoding_embeds_ascii are out
> of date for copy-from. I'm not sure the rest is relevant to copy-from
> anymore, either. Can you confirm?

Yeah, that comment is obsolete for COPY FROM, the encoding conversion works differently now. Removed it from copyfrom_internal.h.

> This comment inside the struct is now out of date as well:
>
> * Similarly, line_buf holds the whole input line being processed. The
> * input cycle is first to read the whole line into line_buf, convert it
> * to server encoding there, and then extract the individual attribute
>
> HEAD has this macro already:
>
> /* Shorthand for number of unconsumed bytes available in raw_buf */
> #define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
>
> It might make sense to create a CONVERSION_BUF_BYTES equivalent since
> the patch calculates cstate->conversion_buf_len -
> cstate->conversion_buf_index in a couple places.

Thanks for the review! I spent some time refactoring and adding comments all around the patch, hopefully making it all more clear. One notable difference is that I renamed 'raw_buf' (which exists in master too) to 'input_buf', and renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this patch again another day with fresh eyes, and also try to add some tests for the corner cases at buffer boundaries.

Attached is a new set of patches. I added some regression tests for the built-in conversion functions, which cover the bug you found, and many other interesting cases that did not have test coverage yet. It comes in two patches: the first patch uses just the existing convert_from() SQL function, and the second one uses the new "noError" variants of the conversion functions.

I also kept the bug fixes compared to the previous patch version as a separate commit, for easier review.

- Heikki
Attachment
- v3-0001-Add-regression-tests-for-built-in-encoding-conver.patch
- v3-0002-Change-conversion-function-signature.patch
- v3-0003-Add-tests-for-the-new-noError-variants-of-built-i.patch
- v3-0004-Fix-bugs-in-the-commit-to-change-conversion-funct.patch
- v3-0005-Do-COPY-FROM-encoding-conversion-verification-in-.patch
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Thanks. I fixed it slightly differently, and also changed LocalToUtf()
> to follow the same pattern, even though LocalToUtf() did not have the
> same bug.
Looks good to me.
> I added a bunch of tests for various built-in conversions.
Nice! I would like to have utf8 tests for every category of invalid byte (overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch.
> I spent some time refactoring and adding comments all around the patch,
> hopefully making it all more clear. One notable difference is that I
> renamed 'raw_buf' (which exists in master too) to 'input_buf', and
> renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
> patch again another day with fresh eyes, and also try to add some tests
> for the corner cases at buffer boundaries.
The comments and renaming are really helpful in understanding that file!
Although a new patch is likely forthcoming, I did take a brief look and found the following:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.
One of your FIXME comments seems to allude to this, but if we really need a difference here, maybe it should be explained:
+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */
Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003 accidentally?
--
John Naylor
EDB: http://www.enterprisedb.com
On 02/02/2021 23:42, John Naylor wrote:
> Although a new patch is likely forthcoming, I did take a brief look and
> found the following:
>
> In copyfromparse.c, this is now out of date:
>
> * Read the next input line and stash it in line_buf, with conversion to
> * server encoding.
>
> One of your FIXME comments seems to allude to this, but if we really
> need a difference here, maybe it should be explained:
>
> +#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
>
> +#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */

We do in fact still need the +1 for the NUL terminator. It was missing from the last patch version, but that was wrong; my fuzz testing actually uncovered a bug caused by that. Fixed.

Attached are new patch versions. The first patch is the same as before, but rebased, pgindented, and with a couple of tiny fixes where conversion functions were still missing the "if (noError) break;" checks. I've hacked on the second patch more, doing more refactoring and commenting for readability. I think it's in pretty good shape now.

- Heikki
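Concretely, the difference amounts to one extra byte in the allocation; an illustrative fragment, not a literal excerpt from the patch:

/* the +1 leaves room for a terminating NUL at the end of the input buffer */
cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);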
Attachment
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/02/2021 23:42, John Naylor wrote:
> >
> > In copyfromparse.c, this is now out of date:
> >
> > * Read the next input line and stash it in line_buf, with conversion to
> > * server encoding.
This comment for CopyReadLine() is still there. Conversion already happened by now, so I think this comment is outdated.
Other than that, I think this is ready for commit.
--
John Naylor
EDB: http://www.enterprisedb.com
On 09/02/2021 15:40, John Naylor wrote:
> On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/02/2021 23:42, John Naylor wrote:
> > >
> > > In copyfromparse.c, this is now out of date:
> > >
> > > * Read the next input line and stash it in line_buf, with conversion to
> > > * server encoding.
>
> This comment for CopyReadLine() is still there. Conversion already
> happened by now, so I think this comment is outdated.
>
> Other than that, I think this is ready for commit.

Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1" somehow went missing again. That was causing a failure on Windows at cfbot.cputube.org.

I'll read through this one more time with fresh eyes tomorrow or the day after, and push. Thanks for all the review!

- Heikki
On 09/02/2021 19:36, Heikki Linnakangas wrote:
> On 09/02/2021 15:40, John Naylor wrote:
>> On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> >
>> > On 02/02/2021 23:42, John Naylor wrote:
>> > >
>> > > In copyfromparse.c, this is now out of date:
>> > >
>> > > * Read the next input line and stash it in line_buf, with conversion to
>> > > * server encoding.
>>
>> This comment for CopyReadLine() is still there. Conversion already
>> happened by now, so I think this comment is outdated.
>>
>> Other than that, I think this is ready for commit.
>
> Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
> somehow went missing again. That was causing a failure on Windows at
> cfbot.cputube.org.
>
> I'll read through this one more time with fresh eyes tomorrow or the day
> after, and push. Thanks for all the review!

Forgot attachment..

- Heikki
Attachment
On Tue, Feb 9, 2021 at 1:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
> > somehow went missing again. That was causing a failure on Windows at
> > cfbot.cputube.org.
> >
> > I'll read through this one more time with fresh eyes tomorrow or the day
> > after, and push. Thanks for all the review!
>
> Forgot attachment..
>
> - Heikki
I went ahead and rebased these.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachment
I wrote:
> I went ahead and rebased these.
It looks like FreeBSD doesn't like this for some reason.
I also wanted to see if this patch set had any performance effect, with and without changing how UTF-8 is validated, using the blackhole am from https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.
create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"
...where test-copy.txt is made by
for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ; done;
On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):
master:
109ms
v6 do encoding in larger chunks:
109ms
v7 utf8 SIMD:
98ms
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Mar 18, 2021 at 2:05 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I wrote:
>
> > I went ahead and rebased these.
>
> It looks like FreeBSD doesn't like this for some reason.
On closer examination, make check was "terminated", so it's not that the tests failed...
--
John Naylor
EDB: http://www.enterprisedb.com
On 18/03/2021 20:05, John Naylor wrote:
> I wrote:
>
> > I went ahead and rebased these.

Thanks!

> I also wanted to see if this patch set had any performance effect, with
> and without changing how UTF-8 is validated, using the blackhole am from
> https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.
>
> create extension blackhole_am;
> create table blackhole_tab (a text) using blackhole_am ;
> time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"
>
> ...where test-copy.txt is made by
>
> for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ; done;
>
> On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):
>
> master:
> 109ms
>
> v6 do encoding in larger chunks:
> 109ms
>
> v7 utf8 SIMD:
> 98ms

That's disappointing. Perhaps the file size is just too small to see the effect? I'm seeing results between 40 ms and 75 ms on my laptop when I run a test like that multiple times. I used "WHERE false" instead of the blackhole AM, but I don't think that makes much difference (only showing a few runs here for brevity):

for i in {1..100}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ; done;

postgres=# create table blackhole_tab (a text) ;
CREATE TABLE
postgres=# \timing
Timing is on.
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 53.166 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 43.981 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
Time: 71.850 ms
postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;
COPY 0
...

I tested that with a larger file:

for i in {1..10000}; do cat /tmp/utf8.html >> /tmp/test-copy.txt ; done;

postgres=# copy blackhole_tab from '/tmp/test-copy.txt' where false;

v6 do encoding in larger chunks (best of five):
Time: 3955.514 ms (00:03.956)

master (best of five):
Time: 4133.767 ms (00:04.134)

So with that, I'm seeing a measurable difference.

- Heikki
On 01/04/2021 11:09, Heikki Linnakangas wrote:
> On 18/03/2021 20:05, John Naylor wrote:
>> I wrote:
>>
>> > I went ahead and rebased these.
>
> Thanks!

I read through the patches one more time, made a few small comment fixes, and pushed.

- Heikki
On 04/01/21 05:27, Heikki Linnakangas wrote:
> I read through the patches one more time, made a few small comment fixes,
> and pushed.

Wow, this whole thread escaped my attention at the time, though my ears would have perked right up if the subject had been something like 'improve encoding conversion API to stream a buffer at a time'.

I think this is of great interest beyond one particular use case in COPY FROM. For example, it could limit the allocations needed when streaming a large text value out to a client; it might be used to advantage with the recent work in incrementally detoasting large values, and so on.

This part seems a little underdeveloped:

> * TODO: The conversion function interface is not great.  Firstly, it
> * would be nice to pass through the destination buffer size to the
> * conversion function, so that if you pass a shorter destination buffer, it
> * could still continue to fill up the whole buffer.  Currently, we have to
> * assume worst case expansion and stop the conversion short, even if there
> * is in fact space left in the destination buffer.  Secondly, it would be
> * nice to return the number of bytes written to the caller, to avoid a call
> * to strlen().

If I understand correctly, this patch already makes a breaking change to the conversion function API. If that's going to be the case anyway, I wonder if it's worth going further and changing the API to eliminate this odd limitation.

There seems to be a sort of common shape that conversion APIs have evolved toward, that can be seen in both the ICU4C converters [0] and in Java's [1]. This current tweak to our conversion API seems to get allllmmoooosst there, but just not quite. For example, noError allows us to keep control when the function has stopped converting, but we don't find out why it stopped.

If we just went the rest of the way and structured the API like those existing ones, then:

- it would be super easy to write wrappers around ICU4C converters, if there were any we wanted to use;
- I could very easily write wrappers presenting any PG-supported charset as a Java charset.

The essence of the API common to ICU4C and Java is this:

1. You pass the function the address and length of a source buffer, the address and length of a destination buffer, and a flag that is true if you know there is no further input where this source buffer came from. (It's allowable to pass false and only then discover you really have no more input after all; then you just make one final call passing true.)

2. The function eats as much as it can of the source buffer, fills as much as it can of the destination buffer, and returns indicating one of four reasons it stopped:

   underflow  - ran out of source buffer
   overflow   - ran out of destination buffer
   malformed  - something in source buffer isn't valid in that representation
   unmappable - a valid codepoint not available in destination encoding

   Based on that, the caller refills the source buffer, or drains the destination buffer, or handles or reports the malformed or unmappable, then repeats.

3. The function should update pointers on return to indicate how much of the source buffer it consumed and how much of the destination buffer it filled.

4. If it left any bytes unconsumed in the source buffer, the caller must preserve them (perhaps moved to the front) for the next call.

5. The converter can have internal state (so it is an object in Java, or has a UConverter struct allocated in ICU4C, to have a place for its state). The state gets flushed on the final call where the flag is passed true. In many cases, the converter can be implemented without keeping internal state, if it simply leaves, for example, an incomplete sequence at the end of the source buffer unconsumed, so the caller will move it to the front and supply the rest. On the other hand, any unconsumed input after the final call with flush flag true must be treated as malformed.

6. On a malformed or unmappable return, the source buffer is left pointed at the start of the offending sequence and the length in bytes of that sequence is available for error reporting/recovery.

The efficient handling of states, returning updated pointers, and so on, probably requires a function signature with 'internal' in it ... but the current function signature already has 'internal', so that doesn't seem like a deal-breaker.

Thoughts? It seems a shame to make a breaking change in the conversion API, only to still end up with an API that "is not great" and is still impedance-mismatched to other existing prominent conversion APIs.

Regards,
-Chap

[0] https://unicode-org.github.io/icu/userguide/conversion/converters.html#3-buffered-or-streamed
[1] https://docs.oracle.com/javase/9/docs/api/java/nio/charset/CharsetDecoder.html#decode-java.nio.ByteBuffer-java.nio.CharBuffer-boolean-
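To make the shape in points 1-6 above concrete, a rough C rendering; the type and function names here are illustrative only, not an actual PostgreSQL or ICU interface:

#include <stdbool.h>

typedef enum ConvResult
{
    CONV_UNDERFLOW,     /* ran out of source buffer */
    CONV_OVERFLOW,      /* ran out of destination buffer */
    CONV_MALFORMED,     /* source bytes invalid in the source encoding */
    CONV_UNMAPPABLE     /* valid codepoint with no equivalent in the target */
} ConvResult;

typedef struct Converter Converter;    /* opaque; may hold internal state */

/*
 * Consume as much of the source buffer as possible, writing into the
 * destination buffer.  On return, *src and *dst are advanced past the
 * bytes consumed and produced; any unconsumed source bytes must be
 * preserved by the caller for the next call.  'final' is true on the
 * last call, which also flushes any internal state.  On CONV_MALFORMED
 * or CONV_UNMAPPABLE, *src points at the offending sequence and
 * *bad_len gives its length in bytes.
 */
extern ConvResult convert_buffer(Converter *cv,
                                 const char **src, const char *src_end,
                                 char **dst, char *dst_end,
                                 bool final, int *bad_len);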