Thread: Maintaining accents with "COPY" ?
Hi I'm currently doing a CSV export using COPY: COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER; This works great apart from accents are not preserved in the output, for example é gets converted to random characters, e.g.√© or similar. How can I preserve accents ? Thanks ! Laura
> On 25/05/2023 09:14 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote: > > I'm currently doing a CSV export using COPY: > > COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER; > > This works great apart from accents are not preserved in the output, for > example é gets converted to random characters, e.g. √© or similar. > > How can I preserve accents ? Looks like an encoding issue and a mismatch between database encoding and client encoding. You can check both with: SHOW server_encoding; SHOW client_encoding; Then either set the client encoding or use COPY's encoding option to match the database encoding (I assume utf8 in this example): SET client_encoding = 'utf8'; COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8'; -- Erik
> Looks like an encoding issue and a mismatch between database encoding and client > encoding. You can check both with: > > SHOW server_encoding; > SHOW client_encoding; > > Then either set the client encoding or use COPY's encoding option to match the > database encoding (I assume utf8 in this example): > > SET client_encoding = 'utf8'; > COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8'; > > -- > Erik Hi Erik, Looks like you could well be right about encoding: postgres=# SHOW server_encoding; server_encoding ----------------- UTF8 (1 row) postgres=# SHOW client_encoding; client_encoding ----------------- SQL_ASCII (1 row) I will try your suggestion...
On 2023-05-25 07:14:40 +0000, Laura Smith wrote: > I'm currently doing a CSV export using COPY: > > COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER; > > > This works great apart from accents are not preserved in the output, > for example é gets converted to random characters, e.g. √© or similar. How do you check the output? If a single character is turned into 2 or 3 characters the issue is usually that the program which produces the output (in the case of COPY I think that would be the PostgreSQL server, not the client) produces UTF-8, but the program consuming it expects an 8-bit character set (typically windows-1252). See if oyu can tell that program that the file is in UTF-8. > How can I preserve accents ? They probably already are preserved. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
> On 25/05/2023 13:26 CEST Peter J. Holzer <hjp-pgsql@hjp.at> wrote: > > On 2023-05-25 07:14:40 +0000, Laura Smith wrote: > > I'm currently doing a CSV export using COPY: > > > > COPY (select * from foo where bar='foo') TO '/tmp/bar.csv' DELIMITER ',' CSV HEADER; > > > > This works great apart from accents are not preserved in the output, > > for example é gets converted to random characters, e.g. √© or similar. > > How do you check the output? > > If a single character is turned into 2 or 3 characters the issue is > usually that the program which produces the output (in the case of COPY > I think that would be the PostgreSQL server, not the client) produces > UTF-8, but the program consuming it expects an 8-bit character set > (typically windows-1252). See if oyu can tell that program that the file > is in UTF-8. > > > How can I preserve accents ? > > They probably already are preserved. You're right. The bytes are probably interpreted as Mac OS Roman: $ echo é | iconv -f macintosh √© $ echo -n é | xxd 00000000: c3a9 -- Erik
> On 25/05/2023 12:08 CEST Laura Smith <n5d9xq3ti233xiyif2vp@protonmail.ch> wrote: > > > Looks like an encoding issue and a mismatch between database encoding and > > client encoding. You can check both with: > > > > SHOW server_encoding; > > SHOW client_encoding; > > > > Then either set the client encoding or use COPY's encoding option to match > > the database encoding (I assume utf8 in this example): > > > > SET client_encoding = 'utf8'; > > COPY (...) TO /tmp/bar.csv DELIMITER ',' CSV HEADER ENCODING 'utf8'; > > Hi Erik, > > Looks like you could well be right about encoding: > > postgres=# SHOW server_encoding; > server_encoding > ----------------- > UTF8 > (1 row) > > postgres=# SHOW client_encoding; > client_encoding > ----------------- > SQL_ASCII > (1 row) > > I will try your suggestion... The client encoding is not the problem here. Using SQL_ASCII effectively uses the server encoding. SQL_ASCII basically means uninterpreted bytes/characters. From https://www.postgresql.org/docs/15/multibyte.html#id-1.6.11.5.7: "If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. (However, if the server's character set is not SQL_ASCII, the server will still check that incoming data is valid for that encoding; so the net effect is as though the client character set were the same as the server's.) Just as for the server, use of SQL_ASCII is unwise unless you are working with all-ASCII data." -- Erik