Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: confusing / inefficient "need_transcoding" handling in copy
Date
Msg-id 20240208.172501.2177371292839763981.kou@clear-code.com
Whole thread Raw
In response to Re: confusing / inefficient "need_transcoding" handling in copy  (Andres Freund <andres@anarazel.de>)
Responses Re: confusing / inefficient "need_transcoding" handling in copy
Re: confusing / inefficient "need_transcoding" handling in copy
List pgsql-hackers
Hi,

In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de>
  "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800,
  Andres Freund <andres@anarazel.de> wrote:

> One unfortunate issue: We don't have any tests verifying that COPY FROM
> catches encoding issues.

How about the attached patch for it?


How about the following to avoid needless transcoding?

----
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index bd4710a79b..80ec26cafe 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate,
         cstate->file_encoding = cstate->opts.file_encoding;
 
     /*
-     * Set up encoding conversion info.  Even if the file and server encodings
-     * are the same, we must apply pg_any_to_server() to validate data in
-     * multibyte encodings.
+     * Set up encoding conversion info. We use pg_server_to_any() for the
+     * conversion. pg_server_to_any() does nothing with an encoding that
+     * equals to GetDatabaseEncoding() and PG_SQL_ASCII. If
+     * cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII,
+     * we don't need to transcode.
      */
-    cstate->need_transcoding =
-        (cstate->file_encoding != GetDatabaseEncoding() ||
-         pg_database_encoding_max_length() > 1);
+    cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() ||
+                                 cstate->file_encoding == PG_SQL_ASCII);
     /* See Multibyte encoding comment above */
     cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding);
 
----

Numbers on my environment:

master:  861.646ms
patched: 697.547ms

SQL:
COPY (SELECT
1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
generate_series(1,1000000::int4)) TO '/dev/null' \watch c=5
 


Thanks,
-- 
kou
From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001
From: Sutou Kouhei <kou@clear-code.com>
Date: Thu, 8 Feb 2024 17:17:25 +0900
Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM

The test data uses UTF-8 "hello" in Japanese but the test specifies
EUC_JP. So it's an invalid data.
---
 src/test/regress/expected/copyencoding.out |  8 ++++++++
 src/test/regress/parallel_schedule         |  2 +-
 src/test/regress/sql/copyencoding.sql      | 10 ++++++++++
 3 files changed, 19 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/copyencoding.out
 create mode 100644 src/test/regress/sql/copyencoding.sql

diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out
new file mode 100644
index 0000000000..aa4ecea09e
--- /dev/null
+++ b/src/test/regress/expected/copyencoding.out
@@ -0,0 +1,8 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+CREATE TABLE test (t text);
+COPY test FROM stdin WITH (ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 1d8a414eea..238cef28c4 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment
 # execute two copy tests in parallel, to check that copy itself
 # is concurrent safe.
 # ----------
-test: copy copyselect copydml insert insert_conflict
+test: copy copyselect copydml copyencoding insert insert_conflict
 
 # ----------
 # More groups of parallel tests
diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql
new file mode 100644
index 0000000000..07c85e988b
--- /dev/null
+++ b/src/test/regress/sql/copyencoding.sql
@@ -0,0 +1,10 @@
+--
+-- Test cases for COPY WITH (ENCODING)
+--
+
+CREATE TABLE test (t text);
+COPY test FROM stdin WITH (ENCODING 'EUC_JP');
+こんにちは
+\.
+
+DROP TABLE test;
-- 
2.43.0


pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: What about Perl autodie?
Next
From: Heikki Linnakangas
Date:
Subject: Re: confusing / inefficient "need_transcoding" handling in copy