Thread: confusing / inefficient "need_transcoding" handling in copy
Hi, Looking at the profiles in [1], and similar profiles locally, made me wonder why a basic COPY TO shows pg_server_to_any() and the strlen() to compute the length of the to-be-converted string so heavily in profiles. Example profile, for [2]: - 88.11% 12.02% postgres postgres [.] CopyOneRowTo - 76.09% CopyOneRowTo - 37.24% CopyAttributeOutText + 14.25% __strlen_evex + 2.76% pg_server_to_any + 0.03% 0xffffffff82a00c86 + 31.82% OutputFunctionCall + 2.98% CopySendEndOfRow + 2.75% appendBinaryStringInfo + 0.58% MemoryContextReset + 0.02% 0xffffffff82a00c86 + 12.01% standard_ExecutorRun + 0.02% PostgresMain In the basic cases the client and server encoding should be the same after all, so why do we need to do any conversion? The code has a comment about this: /* * Set up encoding conversion info. Even if the file and server encodings * are the same, we must apply pg_any_to_server() to validate data in * multibyte encodings. */ cstate->need_transcoding = (cstate->file_encoding != GetDatabaseEncoding() || pg_database_encoding_max_length() > 1); I don't really understand why we need to validate anything during COPY TO? Which is good, because it turns out that we don't actually validate anything, as pg_server_to_any() returns without doing anything if the encoding matches: if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII) return unconstify(char *, s); /* assume data is valid */ This means that the strlen() we do in the call do pg_server_to_any(), which on its own takes 14.25% of the cycles, computes something that will never be used. Unsurprisingly, only doing transcoding when encodings differ yields a sizable improvement, about 18% for [2]. I haven't yet dug into the code history. One guess is that this should only have been set this way for COPY FROM. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/ZcGE8LrjGW8pmtOf%40paquier.xyz [2] COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1,1000000::int4)) TO '/dev/null';
On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > I don't really understand why we need to validate anything during COPY TO? > Which is good, because it turns out that we don't actually validate anything, > as pg_server_to_any() returns without doing anything if the encoding matches: > > if (encoding == DatabaseEncoding->encoding || > encoding == PG_SQL_ASCII) > return unconstify(char *, s); /* assume data is valid */ > > This means that the strlen() we do in the call do pg_server_to_any(), which on > its own takes 14.25% of the cycles, computes something that will never be > used. Indeed, that's wasting cycles for nothing when the client and server encoding match. > Unsurprisingly, only doing transcoding when encodings differ yields a sizable > improvement, about 18% for [2]. > > I haven't yet dug into the code history. One guess is that this should only > have been set this way for COPY FROM. Looking the git history, this looks like an oversight of c61a2f58418e that has added the condition on pg_database_encoding_max_length(), no? Adding Tom and Ishii-san, even if this comes from 2005. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: >> I haven't yet dug into the code history. One guess is that this should only >> have been set this way for COPY FROM. > Looking the git history, this looks like an oversight of c61a2f58418e > that has added the condition on pg_database_encoding_max_length(), no? > Adding Tom and Ishii-san, even if this comes from 2005. Yeah, back in 8.1 that code was applied for both directions, but probably it should have enforced validation for same-encoding cases only for COPY FROM. It looks like now we have a mess, because the condition was copied verbatim into copyto.c but not copyfrom.c. Aren't we failing to validate encoding in this case in COPY FROM, which is where we actually need to? regards, tom lane
Hi, On 2024-02-06 12:51:48 -0500, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: > > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > >> I haven't yet dug into the code history. One guess is that this should only > >> have been set this way for COPY FROM. > > > Looking the git history, this looks like an oversight of c61a2f58418e > > that has added the condition on pg_database_encoding_max_length(), no? > > Adding Tom and Ishii-san, even if this comes from 2005. > > Yeah, back in 8.1 that code was applied for both directions, but > probably it should have enforced validation for same-encoding > cases only for COPY FROM. > > It looks like now we have a mess, because the condition was copied > verbatim into copyto.c but not copyfrom.c. Aren't we failing to > validate encoding in this case in COPY FROM, which is where we > actually need to? I think the code is just very confusing - there actually *is* verification of the encoding, it just happens at a different, earlier, layer, namely in copyfromparse.c: CopyConvertBuf() which says: /* * If the file and server encoding are the same, no encoding conversion is * required. However, we still need to verify that the input is valid for * the encoding. */ And does indeed verify that. One unfortunate issue: We don't have any tests verifying that COPY FROM catches encoding issues. Regards, Andres
On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: > I think the code is just very confusing - there actually *is* verification of > the encoding, it just happens at a different, earlier, layer, namely in > copyfromparse.c: CopyConvertBuf() which says: > /* > * If the file and server encoding are the same, no encoding conversion is > * required. However, we still need to verify that the input is valid for > * the encoding. > */ > > And does indeed verify that. This has been switched by Heikki in f82de5c46bdf, in 2021, that has removed pg_database_encoding_max_length() in the COPY FROM case. (Adding him now in CC, in case he has any comments). > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. Oops. Anyway, I was looking at the copyto.c code because I need to get something on this thread to be able to do something about the pluggable APIs in COPY, and echoing with what you mentioned upthread, what we only need to do is to set need_transcoding only when the client and the server encodings are not the same? Am I missing something? Runtime gets much better in average, around 1260ms on HEAD vs 1023ms with the attached for the example of upthread on a single process. Some profile data from CopyOneRowTo(), if relevant: * HEAD: - 82.78% 10.96% postgres postgres [.] CopyOneRowTo - 71.82% CopyOneRowTo + 30.87% OutputFunctionCall - 13.21% CopyAttributeOutText pg_server_to_any - 9.48% appendBinaryStringInfo 4.93% enlargeStringInfo 3.33% 0xffffa4e1e234 + 3.20% CopySendEndOfRow 2.66% 0xffffa4e1e214 1.02% pgstat_progress_update_param 0.86% memcpy@plt 0.74% 0xffffa4e1cba4 0.72% MemoryContextReset 0.72% 0xffffa4e1cba8 0.59% enlargeStringInfo 0.55% 0xffffa4e1cb40 0.54% 0xffffa4e1cb74 0.52% 0xffffa4e1cb8c + 10.96% _start * patch: - 80.82% 12.25% postgres postgres [.] CopyOneRowTo - 68.57% CopyOneRowTo + 36.55% OutputFunctionCall 11.44% CopyAttributeOutText + 8.87% appendBinaryStringInfo + 3.79% CopySendEndOfRow 1.01% pgstat_progress_update_param 0.79% int2out 0.66% MemoryContextReset 0.63% 0xffffaa624ba8 0.60% memcpy@plt 0.60% enlargeStringInfo 0.53% 0xffffaa624ba4 + 12.25% _start That's a performance-only change, but there may be a good argument for backpatching something, perhaps? -- Michael
Attachment
Hi, In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, Andres Freund <andres@anarazel.de> wrote: > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. How about the attached patch for it? How about the following to avoid needless transcoding? ---- diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..80ec26cafe 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* - * Set up encoding conversion info. Even if the file and server encodings - * are the same, we must apply pg_any_to_server() to validate data in - * multibyte encodings. + * Set up encoding conversion info. We use pg_server_to_any() for the + * conversion. pg_server_to_any() does nothing with an encoding that + * equals to GetDatabaseEncoding() and PG_SQL_ASCII. If + * cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII, + * we don't need to transcode. */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || - pg_database_encoding_max_length() > 1); + cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII); /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); ---- Numbers on my environment: master: 861.646ms patched: 697.547ms SQL: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1,1000000::int4)) TO '/dev/null' \watch c=5 Thanks, -- kou From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Thu, 8 Feb 2024 17:17:25 +0900 Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM The test data uses UTF-8 "hello" in Japanese but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 8 ++++++++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 10 ++++++++++ 3 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 0000000000..aa4ecea09e --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,8 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 0000000000..07c85e988b --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,10 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +こんにちは +\. + +DROP TABLE test; -- 2.43.0
On 08/02/2024 09:05, Michael Paquier wrote: > On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: >> I think the code is just very confusing - there actually *is* verification of >> the encoding, it just happens at a different, earlier, layer, namely in >> copyfromparse.c: CopyConvertBuf() which says: >> /* >> * If the file and server encoding are the same, no encoding conversion is >> * required. However, we still need to verify that the input is valid for >> * the encoding. >> */ >> >> And does indeed verify that. > > This has been switched by Heikki in f82de5c46bdf, in 2021, that has > removed pg_database_encoding_max_length() in the COPY FROM case. > (Adding him now in CC, in case he has any comments). Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY TO is unnecessary. It's always been like that, but now that the COPY TO and COPY FROM cases don't share the code, it's more obvious. Let's remove it. On your patch: > + * Set up encoding conversion info, validating data if server and > + * client encodings are not the same (see also pg_server_to_any). There's no validation, just conversion. I'd suggest: "Set up encoding conversion info if the file and server encodings differ (see also pg_server_to_any)." Other than that, +1 > That's a performance-only change, but there may be a good argument for > backpatching something, perhaps? -1 for backpatching, out of principle. This is not a regression, it's always been like that. Seems innocent, but why is this different from any other performance improvement we make. BTW, I can see an optimization opportunity even if the encodings differ: Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels through the string to find any characters that need to be quoted. You could do it the other way round and handle quoting before the conversion. That has two benefits: 1. You don't need the strlen() call, because you just scanned through the string so you already know its length. 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on the server encoding. -- Heikki Linnakangas Neon (https://neon.tech)
Hi, In <20240208.172501.2177371292839763981.kou@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST), Sutou Kouhei <kou@clear-code.com> wrote: > How about the following to avoid needless transcoding? Oh, sorry. I missed the Michael's patch: https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a I withdraw my change. Thanks, -- kou
On Thu, Feb 08, 2024 at 05:29:46PM +0900, Sutou Kouhei wrote: > Oh, sorry. I missed the Michael's patch: > https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a > > I withdraw my change. No problem. Thanks for caring about that. -- Michael
Attachment
On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > There's no validation, just conversion. I'd suggest: > > "Set up encoding conversion info if the file and server encodings differ > (see also pg_server_to_any)." > > Other than that, +1 Cool. I've used your wording and applied that on HEAD. > BTW, I can see an optimization opportunity even if the encodings differ: > Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels > through the string to find any characters that need to be quoted. You could > do it the other way round and handle quoting before the conversion. That has > two benefits: > > 1. You don't need the strlen() call, because you just scanned through the > string so you already know its length. > 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on > the server encoding. That sounds right, still it looks like there would be cases where you'd need the strlen() call if !encoding_embeds_ascii. -- Michael
Attachment
Hi, On 2024-02-09 09:36:28 +0900, Michael Paquier wrote: > On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > > There's no validation, just conversion. I'd suggest: > > > > "Set up encoding conversion info if the file and server encodings differ > > (see also pg_server_to_any)." > > > > Other than that, +1 > > Cool. I've used your wording and applied that on HEAD. Thanks. LGTM. Greetings, Andres Freund
On Thu, Feb 08, 2024 at 05:25:01PM +0900, Sutou Kouhei wrote: > In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de> > "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, > Andres Freund <andres@anarazel.de> wrote: > >> One unfortunate issue: We don't have any tests verifying that COPY FROM >> catches encoding issues. > > How about the attached patch for it? > > +CREATE TABLE test (t text); > +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); > +こんにちは > +\. > + > +DROP TABLE test; We have a couple of non-ASCII characters in the tests, but I suspect that this one will not be digested correctly everywhere, even if EUC_JP should be OK to use for the check. How about writing an arbitrary sequence of bytes into a temporary file that gets used for the COPY FROM instead? See for example how we do that with abs_builddir in copy.sql. -- Michael
Attachment
Hi, In <ZcvlgMEjt3qY8eiL@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900, Michael Paquier <michael@paquier.xyz> wrote: > We have a couple of non-ASCII characters in the tests, but I suspect > that this one will not be digested correctly everywhere, even if > EUC_JP should be OK to use for the check. How about writing an > arbitrary sequence of bytes into a temporary file that gets used for > the COPY FROM instead? See for example how we do that with > abs_builddir in copy.sql. It makes sense. How about the attached patch? Thanks, -- kou From 6eb9669f97c54f8b85fac63db40ad80664692d12 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v2] Add a test for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 13 +++++++++++++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 15 +++++++++++++++ 3 files changed, 29 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 0000000000..32a9d918fa --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,13 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 0000000000..89e2124996 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,15 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +DROP TABLE test; -- 2.43.0
Hi Michael, ping. (Do you think that this patch is still needed?) Thanks, -- kou In <20240214.114608.2091541942684063981.kou@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 11:46:08 +0900 (JST), Sutou Kouhei <kou@clear-code.com> wrote: > Hi, > > In <ZcvlgMEjt3qY8eiL@paquier.xyz> > "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900, > Michael Paquier <michael@paquier.xyz> wrote: > >> We have a couple of non-ASCII characters in the tests, but I suspect >> that this one will not be digested correctly everywhere, even if >> EUC_JP should be OK to use for the check. How about writing an >> arbitrary sequence of bytes into a temporary file that gets used for >> the COPY FROM instead? See for example how we do that with >> abs_builddir in copy.sql. > > It makes sense. How about the attached patch? > > > Thanks, > -- > kou
Hi, In <Z1fKrTkT-eIVAK7F@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900, Michael Paquier <michael@paquier.xyz> wrote: > client_encoding would be used by COPY when not specifying ENCODING > option. Perhaps more tests should be added with this value specified > by a SET client_encoding? It makes sense. I missed the case. I've added the case to the v3 patch. > Another one would be valid conversions back and forth. For example, > I recall that LATIN1 accepts any bytes and can apply a conversion to > UTF-8, so we could use it and expand a bit more the proposed tests? > Or something like that? OK. I've added valid cases too by using LATIN1 as you suggested. > This is not going to be portable across the buildfarm. Two reasons > are spotted by the CI (there may be others): > 1) For Windows, as in the following regression.diffs: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252" > 2) Second failure on Linux, with 32-bit builds: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: conversion between UTF8 and SQL_ASCII is not supported > > Likely, this should be made conditional, based on the fact that the > database needs to be able to support utf8? There are a couple of > examples like that in the tree, based on the following SQL trick: > SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > \if :skip_test > \quit > \endif Thanks. I didn't notice the portability problem. I've added the skip trick. > This requires an alternate output for the non-utf8 case. Oh! I didn't know the "XXX_1.out" feature. Thanks, -- kou From 104c21bbd8d4fe063b74510d16be32372c4ac1bc Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v3] Add tests for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 27 +++++++++++++++++++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 30 ++++++++++++++++++++++ 3 files changed, 58 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00000000000..86fcb640313 --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,27 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00000000000..46a467cc31e --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,30 @@ +-- +-- Test cases for COPY encoding +-- + +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + +DROP TABLE test; -- 2.45.2
Hi, In <Z1ukEe2d7ml6-oaZ@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900, Michael Paquier <michael@paquier.xyz> wrote: >> OK. I've added valid cases too by using LATIN1 as you >> suggested. > > I may have missed something but v3 does not use that for a valid > conversion? Oh, sorry... I attached wrong patch... I attach the v4 patch that includes this case. >> Oh! I didn't know the "XXX_1.out" feature. > > You have missed the inclusion of an alternate output, which should be > something like that to bypass the test rather than failing: > --- /dev/null > +++ b/src/test/regress/expected/copyencoding_1.out > @@ -0,0 +1,7 @@ > +-- > +-- Test cases for COPY encoding > +-- > +-- skip test if not UTF8 server encoding > +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > +\if :skip_test > +\quit > > I guess that you have the file, forgot a `git add`. I did "git add" but I attached a wrong file... The v4 patch includes this too. Thanks, -- kou From 65976f2357da73ea0d821b3b071f479a1650169f Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v4] Add tests for encoding for COPY FROM This adds valid cases and invalid cases. Valid cases read UTF8 character as LATIN1 character. Because LATIN1 accepts any bytes. Invalid cases read UTF8 character as EUC_JP character. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 43 ++++++++++++++++ src/test/regress/expected/copyencoding_1.out | 7 +++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 52 ++++++++++++++++++++ 4 files changed, 103 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/expected/copyencoding_1.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00000000000..d7ee888997c --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,43 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Valid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as LATIN1: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1'); +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as LATIN1: No error +SET client_encoding TO LATIN1; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +-- Invalid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as EUC_JP: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as EUC_JP: No error +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/expected/copyencoding_1.out b/src/test/regress/expected/copyencoding_1.out new file mode 100644 index 00000000000..c754e23ff7d --- /dev/null +++ b/src/test/regress/expected/copyencoding_1.out @@ -0,0 +1,7 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00000000000..3e41e6d0c98 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,52 @@ +-- +-- Test cases for COPY encoding +-- + +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' + + +-- Valid cases + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as LATIN1: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as LATIN1: No error +SET client_encoding TO LATIN1; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + + +-- Invalid cases + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as EUC_JP: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as EUC_JP: No error +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + +DROP TABLE test; -- 2.45.2
Hi, In <Z1-PjE6KPimF8w55@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Mon, 16 Dec 2024 11:25:16 +0900, Michael Paquier <michael@paquier.xyz> wrote: > On Sat, Dec 14, 2024 at 04:46:57PM +0900, Michael Paquier wrote: >> Note that using "test" as table name for the tests is not a good idea, >> as this could very likely conflict with some concurrent activity. I >> would also add two RESET queries to remove the dependency to >> client_encoding once the test has no need to rely on it. No need to >> send a new patch for all that, just noticing in passing. > > I got some time to look again at this one. Applied with the tweaks > for the table name and the two RESET queries. Thanks! I'll use better table name next time. Thanks, -- kou