Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: confusing / inefficient "need_transcoding" handling in copy
Date
Msg-id 20241213.122737.1043800892304338262.kou@clear-code.com
Whole thread Raw
In response to confusing / inefficient "need_transcoding" handling in copy  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

In <Z1ukEe2d7ml6-oaZ@paquier.xyz>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900,
  Michael Paquier <michael@paquier.xyz> wrote:

>> OK. I've added valid cases too by using LATIN1 as you
>> suggested.
> 
> I may have missed something but v3 does not use that for a valid
> conversion?

Oh, sorry... I attached wrong patch...
I attach the v4 patch that includes this case.

>> Oh! I didn't know the "XXX_1.out" feature.
> 
> You have missed the inclusion of an alternate output, which should be
> something like that to bypass the test rather than failing:
> --- /dev/null
> +++ b/src/test/regress/expected/copyencoding_1.out
> @@ -0,0 +1,7 @@
> +--
> +-- Test cases for COPY encoding
> +--
> +-- skip test if not UTF8 server encoding
> +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
> +\if :skip_test
> +\quit
> 
> I guess that you have the file, forgot a `git add`.

I did "git add" but I attached a wrong file...

The v4 patch includes this too.


Thanks,
-- 
kou
From 65976f2357da73ea0d821b3b071f479a1650169f Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Wed, 14 Feb 2024 11:44:13 +0900
Subject: [PATCH v4] Add tests for encoding for COPY FROM

This adds valid cases and invalid cases.

Valid cases read UTF8 character as LATIN1 character. Because LATIN1
accepts any bytes.

Invalid cases read UTF8 character as EUC_JP character.

The added tests use ENCODING and client_encoding. The client_encoding
value is used when we don't specify ENCODING explicitly.
---
 src/test/regress/expected/copyencoding.out   | 43 ++++++++++++++++
 src/test/regress/expected/copyencoding_1.out |  7 +++
 src/test/regress/parallel_schedule           |  2 +-
 src/test/regress/sql/copyencoding.sql        | 52 ++++++++++++++++++++
 4 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/expected/copyencoding_1.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 00000000000..d7ee888997c
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,43 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+CREATE TABLE test (t text);
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+-- Valid cases
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as LATIN1: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as LATIN1: No error
+SET client_encoding TO LATIN1;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+-- Invalid cases
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as EUC_JP: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as EUC_JP: No error
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/expected/copyencoding_1.out b/src/test/regress/expected/copyencoding_1.out
new file mode 100644
index 00000000000..c754e23ff7d
--- /dev/null
+++ b/src/test/regress/expected/copyencoding_1.out
@@ -0,0 +1,7 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 81e4222d26a..1edd9e45ebb 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 00000000000..3e41e6d0c98
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,52 @@
+--
+-- Test cases for COPY encoding
+--
+
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- directory paths are passed to us in environment variables
+\getenv abs_builddir PG_ABS_BUILDDIR
+
+CREATE TABLE test (t text);
+
+\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv'
+
+
+-- Valid cases
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as LATIN1: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as LATIN1: No error
+SET client_encoding TO LATIN1;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+
+-- Invalid cases
+
+-- Use ENCODING explicitly
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+-- Read UTF8 data as EUC_JP: No error
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+
+-- Use client_encoding
+SET client_encoding TO UTF8;
+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv);
+-- Read UTF8 data as EUC_JP: No error
+SET client_encoding TO EUC_JP;
+COPY test FROM :'utf8_csv' WITH (FORMAT csv);
+
+DROP TABLE test;
-- 
2.45.2


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Add Postgres module info
Next
From: Kirill Reshke
Date:
Subject: Re: Allow subfield references without parentheses