Thread: confusing / inefficient "need_transcoding" handling in copy

confusing / inefficient "need_transcoding" handling in copy

From

Andres Freund

Date:

06 February 2024, 02:05:04

Hi,

Looking at the profiles in [1], and similar profiles locally, made me wonder
why a basic COPY TO shows pg_server_to_any() and the strlen() to compute the
length of the to-be-converted string so heavily in profiles. Example
profile, for [2]:

-   88.11%    12.02%  postgres      postgres          [.] CopyOneRowTo
   - 76.09% CopyOneRowTo
      - 37.24% CopyAttributeOutText
         + 14.25% __strlen_evex
         + 2.76% pg_server_to_any
         + 0.03% 0xffffffff82a00c86
      + 31.82% OutputFunctionCall
      + 2.98% CopySendEndOfRow
      + 2.75% appendBinaryStringInfo
      + 0.58% MemoryContextReset
      + 0.02% 0xffffffff82a00c86
   + 12.01% standard_ExecutorRun
   + 0.02% PostgresMain

In the basic cases the client and server encoding should be the same after
all, so why do we need to do any conversion?

The code has a comment about this:

    /*
     * Set up encoding conversion info.  Even if the file and server encodings
     * are the same, we must apply pg_any_to_server() to validate data in
     * multibyte encodings.
     */
    cstate->need_transcoding =
        (cstate->file_encoding != GetDatabaseEncoding() ||
         pg_database_encoding_max_length() > 1);

I don't really understand why we need to validate anything during COPY TO?
Which is good, because it turns out that we don't actually validate anything,
as pg_server_to_any() returns without doing anything if the encoding matches:

    if (encoding == DatabaseEncoding->encoding ||
        encoding == PG_SQL_ASCII)
        return unconstify(char *, s);    /* assume data is valid */

This means that the strlen() we do in the call do pg_server_to_any(), which on
its own takes 14.25% of the cycles, computes something that will never be
used.


Unsurprisingly, only doing transcoding when encodings differ yields a sizable
improvement, about 18% for [2].


I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/ZcGE8LrjGW8pmtOf%40paquier.xyz
[2] COPY (SELECT
1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
generate_series(1,1000000::int4)) TO '/dev/null';

Re: confusing / inefficient "need_transcoding" handling in copy

From

Michael Paquier

Date:

06 February 2024, 04:49:38

On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:
> I don't really understand why we need to validate anything during COPY TO?
> Which is good, because it turns out that we don't actually validate anything,
> as pg_server_to_any() returns without doing anything if the encoding matches:
>
>     if (encoding == DatabaseEncoding->encoding ||
>         encoding == PG_SQL_ASCII)
>         return unconstify(char *, s);    /* assume data is valid */
>
> This means that the strlen() we do in the call do pg_server_to_any(), which on
> its own takes 14.25% of the cycles, computes something that will never be
> used.

Indeed, that's wasting cycles for nothing when the client and server
encoding match.

> Unsurprisingly, only doing transcoding when encodings differ yields a sizable
> improvement, about 18% for [2].
>
> I haven't yet dug into the code history. One guess is that this should only
> have been set this way for COPY FROM.

Looking the git history, this looks like an oversight of c61a2f58418e
that has added the condition on pg_database_encoding_max_length(), no?
Adding Tom and Ishii-san, even if this comes from 2005.
--
Michael

Attachment

signature.asc

Re: confusing / inefficient "need_transcoding" handling in copy

From

Tom Lane

Date:

06 February 2024, 17:51:48

Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:
>> I haven't yet dug into the code history. One guess is that this should only
>> have been set this way for COPY FROM.

> Looking the git history, this looks like an oversight of c61a2f58418e
> that has added the condition on pg_database_encoding_max_length(), no?
> Adding Tom and Ishii-san, even if this comes from 2005.

Yeah, back in 8.1 that code was applied for both directions, but
probably it should have enforced validation for same-encoding
cases only for COPY FROM.

It looks like now we have a mess, because the condition was copied
verbatim into copyto.c but not copyfrom.c.  Aren't we failing to
validate encoding in this case in COPY FROM, which is where we
actually need to?

            regards, tom lane

Re: confusing / inefficient "need_transcoding" handling in copy

From

Andres Freund

Date:

06 February 2024, 22:24:45

Hi,

On 2024-02-06 12:51:48 -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:
> >> I haven't yet dug into the code history. One guess is that this should only
> >> have been set this way for COPY FROM.
>
> > Looking the git history, this looks like an oversight of c61a2f58418e
> > that has added the condition on pg_database_encoding_max_length(), no?
> > Adding Tom and Ishii-san, even if this comes from 2005.
>
> Yeah, back in 8.1 that code was applied for both directions, but
> probably it should have enforced validation for same-encoding
> cases only for COPY FROM.
>
> It looks like now we have a mess, because the condition was copied
> verbatim into copyto.c but not copyfrom.c.  Aren't we failing to
> validate encoding in this case in COPY FROM, which is where we
> actually need to?

I think the code is just very confusing - there actually *is* verification of
the encoding, it just happens at a different, earlier, layer, namely in
copyfromparse.c: CopyConvertBuf() which says:
    /*
     * If the file and server encoding are the same, no encoding conversion is
     * required.  However, we still need to verify that the input is valid for
     * the encoding.
     */

And does indeed verify that.


One unfortunate issue: We don't have any tests verifying that COPY FROM
catches encoding issues.

Regards,

Andres

Re: confusing / inefficient "need_transcoding" handling in copy

From

Michael Paquier

Date:

08 February 2024, 07:05:39

On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote:
> I think the code is just very confusing - there actually *is* verification of
> the encoding, it just happens at a different, earlier, layer, namely in
> copyfromparse.c: CopyConvertBuf() which says:
>     /*
>      * If the file and server encoding are the same, no encoding conversion is
>      * required.  However, we still need to verify that the input is valid for
>      * the encoding.
>      */
>
> And does indeed verify that.

This has been switched by Heikki in f82de5c46bdf, in 2021, that has
removed pg_database_encoding_max_length() in the COPY FROM case.
(Adding him now in CC, in case he has any comments).

> One unfortunate issue: We don't have any tests verifying that COPY FROM
> catches encoding issues.

Oops.

Anyway, I was looking at the copyto.c code because I need to get
something on this thread to be able to do something about the
pluggable APIs in COPY, and echoing with what you mentioned upthread,
what we only need to do is to set need_transcoding only when the
client and the server encodings are not the same?  Am I missing
something?

Runtime gets much better in average, around 1260ms on HEAD vs 1023ms
with the attached for the example of upthread on a single process.
Some profile data from CopyOneRowTo(), if relevant:
* HEAD:
-   82.78%    10.96%  postgres  postgres            [.] CopyOneRowTo
    - 71.82% CopyOneRowTo
       + 30.87% OutputFunctionCall
       - 13.21% CopyAttributeOutText
            pg_server_to_any
       - 9.48% appendBinaryStringInfo
            4.93% enlargeStringInfo
         3.33% 0xffffa4e1e234
       + 3.20% CopySendEndOfRow
         2.66% 0xffffa4e1e214
         1.02% pgstat_progress_update_param
         0.86% memcpy@plt
         0.74% 0xffffa4e1cba4
         0.72% MemoryContextReset
         0.72% 0xffffa4e1cba8
         0.59% enlargeStringInfo
         0.55% 0xffffa4e1cb40
         0.54% 0xffffa4e1cb74
         0.52% 0xffffa4e1cb8c
    + 10.96% _start
* patch:
-   80.82%    12.25%  postgres  postgres            [.] CopyOneRowTo
    - 68.57% CopyOneRowTo
       + 36.55% OutputFunctionCall
         11.44% CopyAttributeOutText
       + 8.87% appendBinaryStringInfo
       + 3.79% CopySendEndOfRow
         1.01% pgstat_progress_update_param
         0.79% int2out
         0.66% MemoryContextReset
         0.63% 0xffffaa624ba8
         0.60% memcpy@plt
         0.60% enlargeStringInfo
         0.53% 0xffffaa624ba4
    + 12.25% _start

That's a performance-only change, but there may be a good argument for
backpatching something, perhaps?
--
Michael

Attachment

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

08 February 2024, 08:25:01

Hi,

In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800,
  Andres Freund <andres@anarazel.de> wrote:

> One unfortunate issue: We don't have any tests verifying that COPY FROM
> catches encoding issues.

How about the attached patch for it?


How about the following to avoid needless transcoding?

----
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index bd4710a79b..80ec26cafe 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate,
         cstate->file_encoding = cstate->opts.file_encoding;
 
     /*
-     * Set up encoding conversion info.  Even if the file and server encodings
-     * are the same, we must apply pg_any_to_server() to validate data in
-     * multibyte encodings.
+     * Set up encoding conversion info. We use pg_server_to_any() for the
+     * conversion. pg_server_to_any() does nothing with an encoding that
+     * equals to GetDatabaseEncoding() and PG_SQL_ASCII. If
+     * cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII,
+     * we don't need to transcode.
      */
-    cstate->need_transcoding =
-        (cstate->file_encoding != GetDatabaseEncoding() ||
-         pg_database_encoding_max_length() > 1);
+    cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() ||
+                                 cstate->file_encoding == PG_SQL_ASCII);
     /* See Multibyte encoding comment above */
     cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding);
 
----

Numbers on my environment:

master:  861.646ms
patched: 697.547ms

SQL:
COPY (SELECT
1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
generate_series(1,1000000::int4)) TO '/dev/null' \watch c=5
 


Thanks,
-- 
kou
From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Thu, 8 Feb 2024 17:17:25 +0900
Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM

The test data uses UTF-8 "hello" in Japanese but the test specifies
EUC_JP. So it's an invalid data.
---
 src/test/regress/expected/copyencoding.out |  8 ++++++++
 src/test/regress/parallel_schedule         |  2 +-
 src/test/regress/sql/copyencoding.sql      | 10 ++++++++++
 3 files changed, 19 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 0000000000..aa4ecea09e
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,8 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+CREATE TABLE test (t text);
+COPY test FROM stdin WITH (ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 1d8a414eea..238cef28c4 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 0000000000..07c85e988b
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,10 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+
+CREATE TABLE test (t text);
+COPY test FROM stdin WITH (ENCODING 'EUC_JP');
+こんにちは
+\.
+
+DROP TABLE test;
-- 
2.43.0

Re: confusing / inefficient "need_transcoding" handling in copy

From

Heikki Linnakangas

Date:

08 February 2024, 08:25:07

On 08/02/2024 09:05, Michael Paquier wrote:
> On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote:
>> I think the code is just very confusing - there actually *is* verification of
>> the encoding, it just happens at a different, earlier, layer, namely in
>> copyfromparse.c: CopyConvertBuf() which says:
>>     /*
>>      * If the file and server encoding are the same, no encoding conversion is
>>      * required.  However, we still need to verify that the input is valid for
>>      * the encoding.
>>      */
>>
>> And does indeed verify that.
> 
> This has been switched by Heikki in f82de5c46bdf, in 2021, that has
> removed pg_database_encoding_max_length() in the COPY FROM case.
> (Adding him now in CC, in case he has any comments).

Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY 
TO is unnecessary. It's always been like that, but now that the COPY TO 
and COPY FROM cases don't share the code, it's more obvious. Let's 
remove it.

On your patch:

> +     * Set up encoding conversion info, validating data if server and
> +     * client encodings are not the same (see also pg_server_to_any).

There's no validation, just conversion. I'd suggest:

"Set up encoding conversion info if the file and server encodings differ 
(see also pg_server_to_any)."

Other than that, +1

> That's a performance-only change, but there may be a good argument for
> backpatching something, perhaps?

-1 for backpatching, out of principle. This is not a regression, it's 
always been like that. Seems innocent, but why is this different from 
any other performance improvement we make.

BTW, I can see an optimization opportunity even if the encodings differ: 
Currently, CopyAttributeOutText() calls pg_server_to_any(), and then 
grovels through the string to find any characters that need to be 
quoted. You could do it the other way round and handle quoting before 
the conversion. That has two benefits:

1. You don't need the strlen() call, because you just scanned through 
the string so you already know its length.
2. You don't need to worry about 'encoding_embeds_ascii' when you 
operate on the server encoding.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

08 February 2024, 08:29:46

Hi,

In <20240208.172501.2177371292839763981.kou@clear-code.com>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST),
  Sutou Kouhei <kou@clear-code.com> wrote:

> How about the following to avoid needless transcoding?

Oh, sorry. I missed the Michael's patch:
https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a

I withdraw my change.


Thanks,
-- 
kou

Re: confusing / inefficient "need_transcoding" handling in copy

From

Michael Paquier

Date:

08 February 2024, 23:33:52

On Thu, Feb 08, 2024 at 05:29:46PM +0900, Sutou Kouhei wrote:
> Oh, sorry. I missed the Michael's patch:
> https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a
>
> I withdraw my change.

No problem.  Thanks for caring about that.
--
Michael

Attachment

signature.asc

Re: confusing / inefficient "need_transcoding" handling in copy

From

Michael Paquier

Date:

09 February 2024, 00:36:28

On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote:
> There's no validation, just conversion. I'd suggest:
>
> "Set up encoding conversion info if the file and server encodings differ
> (see also pg_server_to_any)."
>
> Other than that, +1

Cool.  I've used your wording and applied that on HEAD.

> BTW, I can see an optimization opportunity even if the encodings differ:
> Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels
> through the string to find any characters that need to be quoted. You could
> do it the other way round and handle quoting before the conversion. That has
> two benefits:
>
> 1. You don't need the strlen() call, because you just scanned through the
> string so you already know its length.
> 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on
> the server encoding.

That sounds right, still it looks like there would be cases where
you'd need the strlen() call if !encoding_embeds_ascii.
--
Michael

Attachment

signature.asc

Re: confusing / inefficient "need_transcoding" handling in copy

From

Andres Freund

Date:

09 February 2024, 00:43:07

Hi,

On 2024-02-09 09:36:28 +0900, Michael Paquier wrote:
> On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote:
> > There's no validation, just conversion. I'd suggest:
> > 
> > "Set up encoding conversion info if the file and server encodings differ
> > (see also pg_server_to_any)."
> > 
> > Other than that, +1
> 
> Cool.  I've used your wording and applied that on HEAD.

Thanks. LGTM.

Greetings,

Andres Freund

Re: confusing / inefficient "need_transcoding" handling in copy

From

Michael Paquier

Date:

13 February 2024, 21:56:16

On Thu, Feb 08, 2024 at 05:25:01PM +0900, Sutou Kouhei wrote:
> In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de>
>   "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800,
>   Andres Freund <andres@anarazel.de> wrote:
>
>> One unfortunate issue: We don't have any tests verifying that COPY FROM
>> catches encoding issues.
>
> How about the attached patch for it?
>
> +CREATE TABLE test (t text);
> +COPY test FROM stdin WITH (ENCODING 'EUC_JP');
> +こんにちは
> +\.
> +
> +DROP TABLE test;

We have a couple of non-ASCII characters in the tests, but I suspect
that this one will not be digested correctly everywhere, even if
EUC_JP should be OK to use for the check.  How about writing an
arbitrary sequence of bytes into a temporary file that gets used for
the COPY FROM instead?  See for example how we do that with
abs_builddir in copy.sql.
--
Michael

Attachment

signature.asc

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

14 February 2024, 02:46:08

Hi,

In <ZcvlgMEjt3qY8eiL@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

> We have a couple of non-ASCII characters in the tests, but I suspect
> that this one will not be digested correctly everywhere, even if
> EUC_JP should be OK to use for the check.  How about writing an
> arbitrary sequence of bytes into a temporary file that gets used for 
> the COPY FROM instead?  See for example how we do that with
> abs_builddir in copy.sql.

It makes sense. How about the attached patch?


Thanks,
-- 
kou
From 6eb9669f97c54f8b85fac63db40ad80664692d12 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Wed, 14 Feb 2024 11:44:13 +0900
Subject: [PATCH v2] Add a test for invalid encoding for COPY FROM

The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but
the test specifies EUC_JP. So it's an invalid data.
---
 src/test/regress/expected/copyencoding.out | 13 +++++++++++++
 src/test/regress/parallel_schedule         |  2 +-
 src/test/regress/sql/copyencoding.sql      | 15 +++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 0000000000..32a9d918fa
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,13 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+CREATE TABLE test (t text);
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 1d8a414eea..238cef28c4 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 0000000000..89e2124996
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,15 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+
+CREATE TABLE test (t text);
+
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+
+DROP TABLE test;
-- 
2.43.0

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

06 December 2024, 10:20:42

Hi Michael,

ping.

(Do you think that this patch is still needed?)


Thanks,
-- 
kou

In <20240214.114608.2091541942684063981.kou@clear-code.com>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 11:46:08 +0900 (JST),
  Sutou Kouhei <kou@clear-code.com> wrote:

> Hi,
> 
> In <ZcvlgMEjt3qY8eiL@paquier.xyz>
>   "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900,
>   Michael Paquier <michael@paquier.xyz> wrote:
> 
>> We have a couple of non-ASCII characters in the tests, but I suspect
>> that this one will not be digested correctly everywhere, even if
>> EUC_JP should be OK to use for the check.  How about writing an
>> arbitrary sequence of bytes into a temporary file that gets used for 
>> the COPY FROM instead?  See for example how we do that with
>> abs_builddir in copy.sql.
> 
> It makes sense. How about the attached patch?
> 
> 
> Thanks,
> -- 
> kou

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

12 December 2024, 09:25:41

Hi,

In <Z1fKrTkT-eIVAK7F@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

> client_encoding would be used by COPY when not specifying ENCODING
> option.  Perhaps more tests should be added with this value specified
> by a SET client_encoding?

It makes sense. I missed the case. I've added the case to
the v3 patch.

> Another one would be valid conversions back and forth.  For example,
> I recall that LATIN1 accepts any bytes and can apply a conversion to
> UTF-8, so we could use it and expand a bit more the proposed tests?
> Or something like that?

OK. I've added valid cases too by using LATIN1 as you
suggested.

> This is not going to be portable across the buildfarm.  Two reasons
> are spotted by the CI (there may be others):
> 1) For Windows, as in the following regression.diffs:
>  COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
> +ERROR:  character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
> 2) Second failure on Linux, with 32-bit builds:
>  COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
> +ERROR:  conversion between UTF8 and SQL_ASCII is not supported
> 
> Likely, this should be made conditional, based on the fact that the
> database needs to be able to support utf8?  There are a couple of
> examples like that in the tree, based on the following SQL trick:
> SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
> \if :skip_test
> \quit
> \endif

Thanks. I didn't notice the portability problem. I've added
the skip trick.

> This requires an alternate output for the non-utf8 case.

Oh! I didn't know the "XXX_1.out" feature.


Thanks,
-- 
kou
From 104c21bbd8d4fe063b74510d16be32372c4ac1bc Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Wed, 14 Feb 2024 11:44:13 +0900
Subject: [PATCH v3] Add tests for invalid encoding for COPY FROM

The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but
the test specifies EUC_JP. So it's an invalid data.

The added tests use ENCODING and client_encoding. The client_encoding
value is used when we don't specify ENCODING explicitly.
---
 src/test/regress/expected/copyencoding.out | 27 +++++++++++++++++++
 src/test/regress/parallel_schedule         |  2 +-
 src/test/regress/sql/copyencoding.sql      | 30 ++++++++++++++++++++++
 3 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 00000000000..86fcb640313
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,27 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+CREATE TABLE test (t text);
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..1edd9e45ebb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 00000000000..46a467cc31e
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,30 @@
+--
+-- Test cases for COPY encoding
+--
+
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+
+CREATE TABLE test (t text);
+
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+DROP TABLE test;
-- 
2.45.2

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

13 December 2024, 06:27:37

Hi,

In <Z1ukEe2d7ml6-oaZ@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

>> OK. I've added valid cases too by using LATIN1 as you
>> suggested.
> 
> I may have missed something but v3 does not use that for a valid
> conversion?

Oh, sorry... I attached wrong patch...
I attach the v4 patch that includes this case.

>> Oh! I didn't know the "XXX_1.out" feature.
> 
> You have missed the inclusion of an alternate output, which should be
> something like that to bypass the test rather than failing:
> --- /dev/null
> +++ b/src/test/regress/expected/copyencoding_1.out
> @@ -0,0 +1,7 @@
> +--
> +-- Test cases for COPY encoding
> +--
> +-- skip test if not UTF8 server encoding
> +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
> +\if :skip_test
> +\quit
> 
> I guess that you have the file, forgot a `git add`.

I did "git add" but I attached a wrong file...

The v4 patch includes this too.


Thanks,
-- 
kou
From 65976f2357da73ea0d821b3b071f479a1650169f Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Wed, 14 Feb 2024 11:44:13 +0900
Subject: [PATCH v4] Add tests for encoding for COPY FROM

This adds valid cases and invalid cases.

Valid cases read UTF8 character as LATIN1 character. Because LATIN1
accepts any bytes.

Invalid cases read UTF8 character as EUC_JP character.

The added tests use ENCODING and client_encoding. The client_encoding
value is used when we don't specify ENCODING explicitly.
---
 src/test/regress/expected/copyencoding.out   | 43 ++++++++++++++++
 src/test/regress/expected/copyencoding_1.out |  7 +++
 src/test/regress/parallel_schedule           |  2 +-
 src/test/regress/sql/copyencoding.sql        | 52 ++++++++++++++++++++
 4 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/expected/copyencoding_1.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 00000000000..d7ee888997c
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,43 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+CREATE TABLE test (t text);
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- Valid cases
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as LATIN1: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as LATIN1: No error
+SET client_encoding TO LATIN1;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+-- Invalid cases
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as EUC_JP: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as EUC_JP: No error
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/expected/copyencoding_1.out b/src/test/regress/expected/copyencoding_1.out
new file mode 100644
index 00000000000..c754e23ff7d
--- /dev/null
+++ b/src/test/regress/expected/copyencoding_1.out
@@ -0,0 +1,7 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..1edd9e45ebb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 00000000000..3e41e6d0c98
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,52 @@
+--
+-- Test cases for COPY encoding
+--
+
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+
+CREATE TABLE test (t text);
+
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+
+
+-- Valid cases
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as LATIN1: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as LATIN1: No error
+SET client_encoding TO LATIN1;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+
+-- Invalid cases
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as EUC_JP: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as EUC_JP: No error
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+DROP TABLE test;
-- 
2.45.2

Re: confusing / inefficient "need_transcoding" handling in copy

From

Sutou Kouhei

Date:

16 December 2024, 05:39:58

Hi,

In <Z1-PjE6KPimF8w55@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Mon, 16 Dec 2024 11:25:16 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

> On Sat, Dec 14, 2024 at 04:46:57PM +0900, Michael Paquier wrote:
>> Note that using "test" as table name for the tests is not a good idea,
>> as this could very likely conflict with some concurrent activity.  I
>> would also add two RESET queries to remove the dependency to
>> client_encoding once the test has no need to rely on it.  No need to
>> send a new patch for all that, just noticing in passing.
> 
> I got some time to look again at this one.  Applied with the tweaks 
> for the table name and the two RESET queries.

Thanks!

I'll use better table name next time.


Thanks,
-- 
kou