Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: confusing / inefficient "need_transcoding" handling in copy
Date
Msg-id 20241212.152541.1227846217843897891.kou@clear-code.com
Whole thread Raw
In response to confusing / inefficient "need_transcoding" handling in copy  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

In <Z1fKrTkT-eIVAK7F@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

> client_encoding would be used by COPY when not specifying ENCODING
> option.  Perhaps more tests should be added with this value specified
> by a SET client_encoding?

It makes sense. I missed the case. I've added the case to
the v3 patch.

> Another one would be valid conversions back and forth.  For example,
> I recall that LATIN1 accepts any bytes and can apply a conversion to
> UTF-8, so we could use it and expand a bit more the proposed tests?
> Or something like that?

OK. I've added valid cases too by using LATIN1 as you
suggested.

> This is not going to be portable across the buildfarm.  Two reasons
> are spotted by the CI (there may be others):
> 1) For Windows, as in the following regression.diffs:
>  COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
> +ERROR:  character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
> 2) Second failure on Linux, with 32-bit builds:
>  COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
> +ERROR:  conversion between UTF8 and SQL_ASCII is not supported
> 
> Likely, this should be made conditional, based on the fact that the
> database needs to be able to support utf8?  There are a couple of
> examples like that in the tree, based on the following SQL trick:
> SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
> \if :skip_test
> \quit
> \endif

Thanks. I didn't notice the portability problem. I've added
the skip trick.

> This requires an alternate output for the non-utf8 case.

Oh! I didn't know the "XXX_1.out" feature.


Thanks,
-- 
kou
From 104c21bbd8d4fe063b74510d16be32372c4ac1bc Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Wed, 14 Feb 2024 11:44:13 +0900
Subject: [PATCH v3] Add tests for invalid encoding for COPY FROM

The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but
the test specifies EUC_JP. So it's an invalid data.

The added tests use ENCODING and client_encoding. The client_encoding
value is used when we don't specify ENCODING explicitly.
---
 src/test/regress/expected/copyencoding.out | 27 +++++++++++++++++++
 src/test/regress/parallel_schedule         |  2 +-
 src/test/regress/sql/copyencoding.sql      | 30 ++++++++++++++++++++++
 3 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 00000000000..86fcb640313
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,27 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+CREATE TABLE test (t text);
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..1edd9e45ebb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 00000000000..46a467cc31e
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,30 @@
+--
+-- Test cases for COPY encoding
+--
+
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+
+CREATE TABLE test (t text);
+
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+DROP TABLE test;
-- 
2.45.2


pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Skip collecting decoded changes of already-aborted transactions
Next
From: Shubham Khanna
Date:
Subject: Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.