Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers
From | Sutou Kouhei |
---|---|
Subject | Re: confusing / inefficient "need_transcoding" handling in copy |
Date | |
Msg-id | 20241212.152541.1227846217843897891.kou@clear-code.com Whole thread Raw |
In response to | confusing / inefficient "need_transcoding" handling in copy (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
Hi, In <Z1fKrTkT-eIVAK7F@paquier.xyz> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900, Michael Paquier <michael@paquier.xyz> wrote: > client_encoding would be used by COPY when not specifying ENCODING > option. Perhaps more tests should be added with this value specified > by a SET client_encoding? It makes sense. I missed the case. I've added the case to the v3 patch. > Another one would be valid conversions back and forth. For example, > I recall that LATIN1 accepts any bytes and can apply a conversion to > UTF-8, so we could use it and expand a bit more the proposed tests? > Or something like that? OK. I've added valid cases too by using LATIN1 as you suggested. > This is not going to be portable across the buildfarm. Two reasons > are spotted by the CI (there may be others): > 1) For Windows, as in the following regression.diffs: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252" > 2) Second failure on Linux, with 32-bit builds: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: conversion between UTF8 and SQL_ASCII is not supported > > Likely, this should be made conditional, based on the fact that the > database needs to be able to support utf8? There are a couple of > examples like that in the tree, based on the following SQL trick: > SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > \if :skip_test > \quit > \endif Thanks. I didn't notice the portability problem. I've added the skip trick. > This requires an alternate output for the non-utf8 case. Oh! I didn't know the "XXX_1.out" feature. Thanks, -- kou From 104c21bbd8d4fe063b74510d16be32372c4ac1bc Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v3] Add tests for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 27 +++++++++++++++++++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 30 ++++++++++++++++++++++ 3 files changed, 58 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00000000000..86fcb640313 --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,27 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00000000000..46a467cc31e --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,30 @@ +-- +-- Test cases for COPY encoding +-- + +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' + +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); + +DROP TABLE test; -- 2.45.2
pgsql-hackers by date: