Thread: Maintaining accents with "COPY" ?

Maintaining accents with "COPY" ?

From
Laura Smith
Date:
Hi

I'm currently doing a CSV export using COPY:

COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;


This works great apart from accents are not preserved in the output, for example é gets converted to random characters,
e.g.é or similar. 


How can I preserve accents ?


Thanks !


Laura



Re: Maintaining accents with "COPY" ?

From
Erik Wienhold
Date:
> On 25/05/2023 09:14 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:
>
> I'm currently doing a CSV export using COPY:
>
> COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
>
> This works great apart from accents are not preserved in the output, for
> example é gets converted to random characters, e.g. √© or similar.
>
> How can I preserve accents ?

Looks like an encoding issue and a mismatch between database encoding and client
encoding.  You can check both with:

    SHOW server_encoding;
    SHOW client_encoding;

Then either set the client encoding or use COPY's encoding option to match the
database encoding (I assume utf8 in this example):

    SET client_encoding = 'utf8';
    COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';

--
Erik



Re: Maintaining accents with "COPY" ?

From
Laura Smith
Date:
> Looks like an encoding issue and a mismatch between database encoding and client
> encoding. You can check both with:
>
> SHOW server_encoding;
> SHOW client_encoding;
>
> Then either set the client encoding or use COPY's encoding option to match the
> database encoding (I assume utf8 in this example):
>
> SET client_encoding = 'utf8';
> COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';
>
> --
> Erik

Hi Erik,

Looks like you could well be right about encoding:

postgres=# SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

postgres=# SHOW client_encoding;
 client_encoding
-----------------
 SQL_ASCII
(1 row)

I will try your suggestion...




Re: Maintaining accents with "COPY" ?

From
"Peter J. Holzer"
Date:
On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
> I'm currently doing a CSV export using COPY:
>
> COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
>
>
> This works great apart from accents are not preserved in the output,
> for example é gets converted to random characters, e.g. √© or similar.

How do you check the output?

If a single character is turned into 2 or 3 characters the issue is
usually that the program which produces the output (in the case of COPY
I think that would be the PostgreSQL server, not the client) produces
UTF-8, but the program consuming it expects an 8-bit character set
(typically windows-1252). See if oyu can tell that program that the file
is in UTF-8.

> How can I preserve accents ?

They probably already are preserved.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Maintaining accents with "COPY" ?

From
Erik Wienhold
Date:
> On 25/05/2023 13:26 CEST Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> On 2023-05-25 07:14:40 +0000, Laura Smith wrote:
> > I'm currently doing a CSV export using COPY:
> >
> > COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER;
> >
> > This works great apart from accents are not preserved in the output,
> > for example é gets converted to random characters, e.g. √© or similar.
>
> How do you check the output?
>
> If a single character is turned into 2 or 3 characters the issue is
> usually that the program which produces the output (in the case of COPY
> I think that would be the PostgreSQL server, not the client) produces
> UTF-8, but the program consuming it expects an 8-bit character set
> (typically windows-1252). See if oyu can tell that program that the file
> is in UTF-8.
>
> > How can I preserve accents ?
>
> They probably already are preserved.

You're right.  The bytes are probably interpreted as Mac OS Roman:

    $ echo é | iconv -f macintosh
    é
    $ echo -n é | xxd
    00000000: c3a9

--
Erik



Re: Maintaining accents with "COPY" ?

From
Erik Wienhold
Date:
> On 25/05/2023 12:08 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote:
>
> > Looks like an encoding issue and a mismatch between database encoding and
> > client encoding. You can check both with:
> >
> > SHOW server_encoding;
> > SHOW client_encoding;
> >
> > Then either set the client encoding or use COPY's encoding option to match
> > the database encoding (I assume utf8 in this example):
> >
> > SET client_encoding = 'utf8';
> > COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8';
>
> Hi Erik,
>
> Looks like you could well be right about encoding:
>
> postgres=# SHOW server_encoding;
>  server_encoding
> -----------------
>  UTF8
> (1 row)
>
> postgres=# SHOW client_encoding;
>  client_encoding
> -----------------
>  SQL_ASCII
> (1 row)
>
> I will try your suggestion...

The client encoding is not the problem here.  Using SQL_ASCII effectively uses
the server encoding.  SQL_ASCII basically means uninterpreted bytes/characters.

From https://www.postgresql.org/docs/15/multibyte.html#id-1.6.11.5.7:

"If the client character set is defined as SQL_ASCII, encoding conversion is
 disabled, regardless of the server's character set. (However, if the server's
 character set is not SQL_ASCII, the server will still check that incoming data
 is valid for that encoding; so the net effect is as though the client character
 set were the same as the server's.) Just as for the server, use of SQL_ASCII is
 unwise unless you are working with all-ASCII data."

--
Erik