Re: confusing / inefficient "need_transcoding" handling in copy - Mailing list pgsql-hackers
From | Sutou Kouhei |
---|---|
Subject | Re: confusing / inefficient "need_transcoding" handling in copy |
Date | |
Msg-id | 20240208.172501.2177371292839763981.kou@clear-code.com Whole thread Raw |
In response to | Re: confusing / inefficient "need_transcoding" handling in copy (Andres Freund <andres@anarazel.de>) |
Responses |
Re: confusing / inefficient "need_transcoding" handling in copy
Re: confusing / inefficient "need_transcoding" handling in copy |
List | pgsql-hackers |
Hi, In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, Andres Freund <andres@anarazel.de> wrote: > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. How about the attached patch for it? How about the following to avoid needless transcoding? ---- diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..80ec26cafe 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* - * Set up encoding conversion info. Even if the file and server encodings - * are the same, we must apply pg_any_to_server() to validate data in - * multibyte encodings. + * Set up encoding conversion info. We use pg_server_to_any() for the + * conversion. pg_server_to_any() does nothing with an encoding that + * equals to GetDatabaseEncoding() and PG_SQL_ASCII. If + * cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII, + * we don't need to transcode. */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || - pg_database_encoding_max_length() > 1); + cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII); /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); ---- Numbers on my environment: master: 861.646ms patched: 697.547ms SQL: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1,1000000::int4)) TO '/dev/null' \watch c=5 Thanks, -- kou From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001 From: Sutou Kouhei <kou@clear-code.com> Date: Thu, 8 Feb 2024 17:17:25 +0900 Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM The test data uses UTF-8 "hello" in Japanese but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 8 ++++++++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 10 ++++++++++ 3 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 0000000000..aa4ecea09e --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,8 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # ---------- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # ---------- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 0000000000..07c85e988b --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,10 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +こんにちは +\. + +DROP TABLE test; -- 2.43.0
pgsql-hackers by date: