Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers
From | Sutou Kouhei |
---|---|
Subject | Re: confusing / inefficient "need_transcoding" handling in copy |
Date | |
Msg-id | 20241213.122737.1043800892304338262.kou@clear-code.com Whole thread Raw |
In response to | confusing / inefficient "need_transcoding" handling in copy (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hi, In <Z1ukEe2d7ml6-oaZ@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900, Michael Paquier <michael@paquier.xyz> wrote: >> OK. I've added valid cases too by using LATIN1 as you >> suggested. > > I may have missed something but v3 does not use that for a valid > conversion? Oh, sorry... I attached wrong patch... I attach the v4 patch that includes this case. >> Oh! I didn't know the "XXX_1.out" feature. > > You have missed the inclusion of an alternate output, which should be > something like that to bypass the test rather than failing: > --- /dev/null > +++ b/src/test/regress/expected/copyencoding_1.out > @@ -0,0 +1,7 @@ > +-- > +-- Test cases for COPY encoding > +-- > +-- skip test if not UTF8 server encoding > +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > +\if :skip_test > +\quit > > I guess that you have the file, forgot a `git add`. I did "git add" but I attached a wrong file... The v4 patch includes this too. Thanks, -- kou From 65976f2357da73ea0d821b3b071f479a1650169f Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v4] Add tests for encoding for COPY FROM This adds valid cases and invalid cases. Valid cases read UTF8 character as LATIN1 character. Because LATIN1 accepts any bytes. Invalid cases read UTF8 character as EUC_JP character. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 43 ++++++++++++++++ src/test/regress/expected/copyencoding_1.out | 7 +++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 52 ++++++++++++++++++++ 4 files changed, 103 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/expected/copyencoding_1.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00000000000..d7ee888997c --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,43 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Valid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as LATIN1: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1'); +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as LATIN1: No error +SET client_encoding TO LATIN1; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +-- Invalid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as EUC_JP: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as EUC_JP: No error +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/expected/copyencoding_1.out b/src/test/regress/expected/copyencoding_1.out new file mode 100644 index 00000000000..c754e23ff7d --- /dev/null +++ b/src/test/regress/expected/copyencoding_1.out @@ -0,0 +1,7 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00000000000..3e41e6d0c98 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,52 @@ +-- +-- Test cases for COPY encoding +-- + +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' + + +-- Valid cases + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as LATIN1: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as LATIN1: No error +SET client_encoding TO LATIN1; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + + +-- Invalid cases + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as EUC_JP: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as EUC_JP: No error +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + +DROP TABLE test; -- 2.45.2
pgsql-hackers by date: