Thread: BUG #17501: COPY is failing with "ERROR: invalid byte sequence for encoding "UTF8": 0xe5"

The following bug has been logged on the website:

Bug reference:      17501
Logged by:          Vitaly Voronov
Email address:      wizard_1024@tut.by
PostgreSQL version: 14.3
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

Hello,

We've seen a such bug: COPY command shows error "ERROR:  invalid byte
sequence for encoding "UTF8": 0xe5" on file.
The same file with small amount of lines is imported without any errors.

How to reproduce bug:
# create database
# create database with
# SQL_ASCII, C, C
createdb --encoding=SQL_ASCII --lc-collate=C --lc-ctype=C
--template=template0 test

# connect to the database
psql test

# Create table
CREATE TABLE test_data (
    test_data text
);

# Import without error
truncate table test_data;
COPY test_data (test_data) FROM '/tmp/test_pass.csv' WITH DELIMITER AS ','
CSV QUOTE AS '"';

COPY 207

# Import with error
truncate table test_data;
COPY test_data (test_data) FROM '/tmp/test_fail.csv' WITH DELIMITER AS ','
CSV QUOTE AS '"';

ERROR:  invalid byte sequence for encoding "UTF8": 0xe5
CONTEXT:  COPY test_data, line 627

# both files contains the same rows, but test_fail contains more rows
# seems that the file more than 65K size cannot be imported
# if create DB with UTF8 encoding instead of SQL_ASCII - both tests will be
passed

# How to generate files:
# Imported without errors
for i in $(seq 1 207); do echo

"NURO光です。明日の宅内工事お立合いよろしくお願い致します。2回目の屋外工事につきましては具体的な工事日案内の準備が整い次第、こちらからご連絡いたします。※詳細はこちら【工事について】https://www.test.jp/1234/5678.html&id=12211"
>> /tmp/test_pass.csv; done;
# Imported with errors
for i in $(seq 1 5722); do echo

"NURO光です。明日の宅内工事お立合いよろしくお願い致します。2回目の屋外工事につきましては具体的な工事日案内の準備が整い次第、こちらからご連絡いたします。※詳細はこちら【工事について】https://www.test.jp/1234/5678.html&id=12211"
>> /tmp/test_fail.csv; done;

# Both files can be imported without any problem to PostgreSQL 11.


Sounds a lot like a bug in commit 
f82de5c46bdf8cd65812a7b04c9509c218e1545d. Thanks for the report, I'll 
investigate!

- Heikki

On 28/05/2022 23:57, Vitaly V. Voronov wrote:
> Hello,
> Right commands:
> # Imported without errors
> for i in $(seq 1 207); do echo
> "NURO光です。明日の宅内工事お立合いよろしくお願い致します。2回目の屋外工 
> 事につきましては具体的な工事日案内の準備が整い次第、こちらからご連絡いた 
> します。※詳細はこちら【工事について】https://www.test.jp/1234 
> /5678.html&id=12211 <https://www.test.jp/1234/5678.html&id=12211>" >> 
> /tmp/test_pass.csv; done;
> # Imported with errors
> for i in $(seq 1 5722); do echo "NURO光です。明日の宅内工事お立合いよろ 
> しくお願い致します。2回目の屋外工事につきましては具体的な工事日案内の準 
> 備が整い次第、こちらからご連絡いたします。※詳細はこちら【工事について】 
> https://www.test.jp/1234/5678.html&id=12211" >> /tmp/test_fail.csv; done;
> 
> 
> 28.05.2022, 23:53, "PG Bug reporting form" <noreply@postgresql.org>:
> 
>     The following bug has been logged on the website:
> 
>     Bug reference: 17501
>     Logged by: Vitaly Voronov
>     Email address: wizard_1024@tut.by <mailto:wizard_1024@tut.by>
>     PostgreSQL version: 14.3
>     Operating system: CentOS Linux release 7.9.2009 (Core)
>     Description:
> 
>     Hello,
> 
>     We've seen a such bug: COPY command shows error "ERROR: invalid byte
>     sequence for encoding "UTF8": 0xe5" on file.
>     The same file with small amount of lines is imported without any errors.
> 
>     How to reproduce bug:
>     # create database
>     # create database with
>     # SQL_ASCII, C, C
>     createdb --encoding=SQL_ASCII --lc-collate=C --lc-ctype=C
>     --template=template0 test
> 
>     # connect to the database
>     psql test
> 
>     # Create table
>     CREATE TABLE test_data (
>          test_data text
>     );
> 
>     # Import without error
>     truncate table test_data;
>     COPY test_data (test_data) FROM '/tmp/test_pass.csv' WITH DELIMITER
>     AS ','
>     CSV QUOTE AS '"';
> 
>     COPY 207
> 
>     # Import with error
>     truncate table test_data;
>     COPY test_data (test_data) FROM '/tmp/test_fail.csv' WITH DELIMITER
>     AS ','
>     CSV QUOTE AS '"';
> 
>     ERROR: invalid byte sequence for encoding "UTF8": 0xe5
>     CONTEXT: COPY test_data, line 627
> 
>     # both files contains the same rows, but test_fail contains more rows
>     # seems that the file more than 65K size cannot be imported
>     # if create DB with UTF8 encoding instead of SQL_ASCII - both tests
>     will be
>     passed
> 
>     # How to generate files:
>     # Imported without errors
>     for i in $(seq 1 207); do echo
>     "NURO光です。明日の宅内工事お立合いよろしくお願い致します。2回目の屋
>     外工事につきましては具体的な工事日案内の準備が整い次第、こちらからご
>     連絡いたします。※詳細はこちら【工事について】https://www.test.jp
>     /1234/5678.html&id=12211 <https://www.test.jp/1234/5678.html&id=12211>"
> 
>               /tmp/test_pass.csv; done;
> 
>     # Imported with errors
>     for i in $(seq 1 5722); do echo
>     "NURO光です。明日の宅内工事お立合いよろしくお願い致します。2回目の屋
>     外工事につきましては具体的な工事日案内の準備が整い次第、こちらからご
>     連絡いたします。※詳細はこちら【工事について】https://www.test.jp
>     /1234/5678.html&id=12211 <https://www.test.jp/1234/5678.html&id=12211>"
> 
>               /tmp/test_fail.csv; done;
> 
> 
>     # Both files can be imported without any problem to PostgreSQL 11.
> 




On 28/05/2022 23:52, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      17501
> Logged by:          Vitaly Voronov
> Email address:      wizard_1024@tut.by
> PostgreSQL version: 14.3
> Operating system:   CentOS Linux release 7.9.2009 (Core)
> Description:
> 
> Hello,
> 
> We've seen a such bug: COPY command shows error "ERROR:  invalid byte
> sequence for encoding "UTF8": 0xe5" on file.
> The same file with small amount of lines is imported without any errors.

I was able to reproduce this with your instructions. Thank you!

The bug was indeed introduced by commit f82de5c46b, in v14. Attached 
patch fixes it. Barring any objections, I'll commit and backpatch this.

I also noticed that there are some outdated comments in copyfromparse.c 
that still talk about the old protocol version 2 and COPY_OLD_FE. I'll 
clean that up as a separate commit.

- Heikki
Attachment
On 29/05/2022 13:39, Heikki Linnakangas wrote:
> On 28/05/2022 23:52, PG Bug reporting form wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      17501
>> Logged by:          Vitaly Voronov
>> Email address:      wizard_1024@tut.by
>> PostgreSQL version: 14.3
>> Operating system:   CentOS Linux release 7.9.2009 (Core)
>> Description:
>>
>> Hello,
>>
>> We've seen a such bug: COPY command shows error "ERROR:  invalid byte
>> sequence for encoding "UTF8": 0xe5" on file.
>> The same file with small amount of lines is imported without any errors.
> 
> I was able to reproduce this with your instructions. Thank you!
> 
> The bug was indeed introduced by commit f82de5c46b, in v14. Attached
> patch fixes it. Barring any objections, I'll commit and backpatch this.
> 
> I also noticed that there are some outdated comments in copyfromparse.c
> that still talk about the old protocol version 2 and COPY_OLD_FE. I'll
> clean that up as a separate commit.

Committed. And I included those comment changes in the same commit, 
after all.

Thanks for the report!

- Heikki