Thread: Issue with loading unicode characters with copy command

Issue with loading unicode characters with copy command

From

Kiran K V

Date:

12 January, 18:23:06

Hi,

I have a UTF8 database and simple table with two columns (integer and varchar). Created a csv file with some multibyte characters and trying to perform load operation using the copy command.

Database info:

Postgresql database details:

-----------+----------+----------+--------------------+--------------------+-----------------------

(Note: I also tried with collate utf8 and no luck)

postgres=# set client_encoding='UTF8';

SET

Table:

create table public.test ( PKCOL integer not null, STR1 character varying(64) null, primary key( PKCOL ))

csv contents:

1|"àáâãäåæçèéêëìíîï"

After data loading, actual data is becoming

Ã Ã¡Ã¢Ã£Ã¤Ã¥Ã¦Ã§Ã¨Ã©ÃªÃ«Ã¬ÃÃ®Ã¯

hex of this is - c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af

The hex values are indeed the UTF-8 encodings of the characters in your expected string, and the presence of `C2` before each character is indicative of how UTF-8 represents certain characters.

In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`, etc.) are represented as two bytes. The first byte `C2` or `C3` indicates that this is a two-byte character, and the second byte specifies the character. For example:

- `à` is represented as `C3 A0`

- `á` is `C3 A1`

- `â` is `C3 A2`, and so on.

In this case, the `C2` byte is getting interpreted as a separate character and that is the likely reason that an `Â` (which corresponds to `C2`) is seen before each intended character. Looks like UTF-8 encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or Windows-1252, where each byte is treated as a separate character.

Please advise. Thank you very much.

Regards,

Kiran

Re: Issue with loading unicode characters with copy command

From

Adrian Klaver

Date:

12 January, 20:18:38

On 1/12/24 07:23, Kiran K V wrote:
> Hi,
> 
> 
> I have a UTF8 database and simple table with two columns (integer and 
> varchar). Created a csv file with some multibyte characters and trying 
> to perform load operation using the copy command.

The multibyte characters come from what character set?

> 
> __ __
> 
> Database info:____
> 
> Postgresql database details:____
> 
>     Name    |  Owner   | Encoding |      Collate       |       
> Ctype        |   Access privileges____
> 
> -----------+----------+----------+--------------------+--------------------+-----------------------____
> 
> postgres  | postgres | UTF8     | English_India.1252 | 
> English_India.1252 |____
> 
> __ __
> 
> (Note: I also tried with collate utf8 and no luck)
> 
> 
> postgres=# set client_encoding='UTF8';____
> 
> SET____
> 
> __ __
> 
> Table:____
> 
> create table public.test ( PKCOL integer not null, STR1 character 
> varying(64) null, primary key( PKCOL )) ____
> 
> ____
> 
> csv contents:____
> 
> 1|"àáâãäåæçèéêëìíîï"____
> 
> __ __
> 
> After data loading, actual data is becoming____
> 
> Ã Ã¡Ã¢Ã£Ã¤Ã¥Ã¦Ã§Ã¨Ã©ÃªÃ«Ã¬ÃÃ®Ã¯____
> 
> hex of this is -  
> c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af____
> 
> __ __
> 
> The hex values are indeed the UTF-8 encodings of the characters in your 
> expected string, and the presence of `C2` before each character is 
> indicative of how UTF-8 represents certain characters.____
> 
> In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`, 
> etc.) are represented as two bytes. The first byte `C2` or `C3` 
> indicates that this is a two-byte character, and the second byte 
> specifies the character. For example:____
> 
> - `à` is represented as `C3 A0`____
> 
> - `á` is `C3 A1`____
> 
> - `â` is `C3 A2`, and so on.____
> 
> In this case, the `C2` byte is getting interpreted as a separate 
> character and that is the likely reason that an `Â` (which corresponds 
> to `C2`) is seen before each intended character. Looks like UTF-8 
> encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or 
> Windows-1252, where each byte is treated as a separate character.
> 
> 
> Please advise. Thank you very much.
> 
> 
> Regards,
> 
> Kiran
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com

Re: Issue with loading unicode characters with copy command

From

Kiran K V

Date:

12 January, 20:47:27

Its UTF-8. Also verified the load file and its utf-8.

Regards,

Kiran

On Fri, Jan 12, 2024 at 10:48 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 1/12/24 07:23, Kiran K V wrote:
> Hi,
>
>
> I have a UTF8 database and simple table with two columns (integer and
> varchar). Created a csv file with some multibyte characters and trying
> to perform load operation using the copy command.

The multibyte characters come from what character set?

>
> __ __
>
> Database info:____
>
> Postgresql database details:____
>
>    Name    | Owner   | Encoding |      Collate       |
> Ctype        |   Access privileges____
>
> -----------+----------+----------+--------------------+--------------------+-----------------------____
>
> postgres | postgres | UTF8     | English_India.1252 |
> English_India.1252 |____
>
> __ __
>
> (Note: I also tried with collate utf8 and no luck)
>
>
> postgres=# set client_encoding='UTF8';____
>
> SET____
>
> __ __
>
> Table:____
>
> create table public.test ( PKCOL integer not null, STR1 character
> varying(64) null, primary key( PKCOL )) ____
>
> ____
>
> csv contents:____
>
> 1|"àáâãäåæçèéêëìíîï"____
>
> __ __
>
> After data loading, actual data is becoming____
>
> Ã Ã¡Ã¢Ã£Ã¤Ã¥Ã¦Ã§Ã¨Ã©ÃªÃ«Ã¬ÃÃ®Ã¯____
>
> hex of this is -
> c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af____
>
> __ __
>
> The hex values are indeed the UTF-8 encodings of the characters in your
> expected string, and the presence of `C2` before each character is
> indicative of how UTF-8 represents certain characters.____
>
> In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`,
> etc.) are represented as two bytes. The first byte `C2` or `C3`
> indicates that this is a two-byte character, and the second byte
> specifies the character. For example:____
>
> - `à` is represented as `C3 A0`____
>
> - `á` is `C3 A1`____
>
> - `â` is `C3 A2`, and so on.____
>
> In this case, the `C2` byte is getting interpreted as a separate
> character and that is the likely reason that an `Â` (which corresponds
> to `C2`) is seen before each intended character. Looks like UTF-8
> encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or
> Windows-1252, where each byte is treated as a separate character.
>
>
> Please advise. Thank you very much.
>
>
> Regards,
>
> Kiran
>

--
Adrian Klaver
adrian.klaver@aklaver.com