Thread: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

The issue is that pg_dump refuses to dump the database, with bizarre errors outputted, whenever the "--exclude-table-data" is included in the command *and* its value contains anything but lowercase a-z. See for yourselves:

The database in question is verified to have "UTF8" as encoding. It has one schema called "Test schäma" containing one table called "Test täble", containing one column called "Test cålumn" with one row where the column has the text "This should not be in the dump." (so that I can easily check if it's included in the dump).

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

Now I try this series of commands:

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
[retracted]
pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="test" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
[retracted]
pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="\"Test schäma\".\"Test täble\"" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(Test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(Test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma"."Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

--exclude-table-data="ä"

= FAILS!

--exclude-table-data="a"

= WORKS!

I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

I believe that I have taken every reasonable step at this point to debug this on my own, and it truly appears as if the issue is that pg_dump has some broken internal logic which fails to account for non-simplistic identifiers, specifically the code which powers the "--exclude-table-data" option. However, it *could* still be "my fault"... although I don't see how!
On Thursday, July 23, 2020, <tutiluren@tutanota.com> wrote:
After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:


I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but rather with most likely any client command run in the cmd.exe shell.  Or even more generally with the data flowing through the pipeline, which is libpq.  Thus what you want is a libpq option.


And here is an old, similar, bug report, with an apparent work-around (or solution depending on details that as an English Linux user I am not in a position to judge).


David J.

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
Juan José Santamaría Flecha
Date:

On Fri, Jul 24, 2020 at 8:14 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thursday, July 23, 2020, <tutiluren@tutanota.com> wrote:
After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but rather with most likely any client command run in the cmd.exe shell.

The problem is coming from CMD and its UTF8 support, but I guess I am not making a great job in convincing you. I will try to explain it through an example:

1. Start a CMD using "CMD /U", and set the code page to UTF8:

C:\postgres>chcp 65001
Active code page: 65001

2.  Create a database named 'test', an schema named "Ö" and table named "Ä":

C:\postgres>psql -c "create database test" postgres
CREATE DATABASE

C:\postgres>psql -c "\l test" postgres
                                              List of databases
 Name |    Owner     | Encoding |          Collate           |           Ctype            | Access privileges
------+--------------+----------+----------------------------+----------------------------+-------------------
 test | postgres     | UTF8     | English_United States.1252 | English_United States.1252 |
(1 row)

C:\postgres>psql -c "create schema "Ö"" test
CREATE SCHEMA
C:\postgres>psql -c " create table "Ö"."Ä" (c1 char(1))" test
CREATE TABLE

3. Test the table and check your client encoding:

C:\postgres>psql -c "select * from "Ö"."Ä"" test
 c1
----
(0 rows)

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 WIN1252
(1 row)

4. Change the client encoding to UTF8 and repeat the test:

C:\postgres>set PGCLIENTENCODING=UTF8

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 UTF8
(1 row)

C:\postgres>psql -c "select * from "Ö"."Ä"" test
2020-07-24 11:25:01.710 CEST [1452] ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e
ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e

The reason why this happens is because the limited CMD's UTF8 support, 'chcp' only changes the code page of the displayed characters, shell commands, piping, redirection and most commands are still ANSI only.

Does explicitly setting 'PGCLIENTENCODING=WIN1252' on your shell solves the issue?

Regards,

Juan José Santamaría Flecha

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
Jul 24, 2020, 12:30 PM by juanjo.santamaria@gmail.com:

On Fri, Jul 24, 2020 at 8:14 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thursday, July 23, 2020, <tutiluren@tutanota.com> wrote:
After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all.

First, I set cmd.exe to use Unicode, just to be sure:

C:\pg_dump_test>chcp 65001
Active code page: 65001

pg_dump: creating CONSTRAINT "Test schäma.Test täble Test täble_pkey"

= WORKS. The dump was successful. (In spite of weird output chars.)

C:\pg_dump_test>pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="Test schäma.Test täble" --host="localhost" --port="5432" --username="postgres" --dbname="test"
pg_dump: last built-in OID is 16383
pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe4 0x62 0x6c
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(test täble)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(test schäma)$'

= FAILED. The dump was aborted with these nonsensical errors.

Finally, I tried the same command again like this:

I looked everywhere for some kind of "client-encoding" option in the pg_dump manual, but there is no such thing. The only thing I can think of is that the client's encoding (that is, pg_dump) is for some reason not set to "UTF8" even though that's the encoding of the "test" database which I'm connecting to.

The main thing here is this isn’t properly considered a bug in pg_dump but rather with most likely any client command run in the cmd.exe shell.

The problem is coming from CMD and its UTF8 support, but I guess I am not making a great job in convincing you. I will try to explain it through an example:

1. Start a CMD using "CMD /U", and set the code page to UTF8:

C:\postgres>chcp 65001
Active code page: 65001

2.  Create a database named 'test', an schema named "Ö" and table named "Ä":

C:\postgres>psql -c "create database test" postgres
CREATE DATABASE

C:\postgres>psql -c "\l test" postgres
                                              List of databases
 Name |    Owner     | Encoding |          Collate           |           Ctype            | Access privileges
------+--------------+----------+----------------------------+----------------------------+-------------------
 test | postgres     | UTF8     | English_United States.1252 | English_United States.1252 |
(1 row)

C:\postgres>psql -c "create schema "Ö"" test
CREATE SCHEMA
C:\postgres>psql -c " create table "Ö"."Ä" (c1 char(1))" test
CREATE TABLE

3. Test the table and check your client encoding:

C:\postgres>psql -c "select * from "Ö"."Ä"" test
 c1
----
(0 rows)

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 WIN1252
(1 row)

4. Change the client encoding to UTF8 and repeat the test:

C:\postgres>set PGCLIENTENCODING=UTF8

C:\postgres>psql -c "show client_encoding" test
 client_encoding
-----------------
 UTF8
(1 row)

C:\postgres>psql -c "select * from "Ö"."Ä"" test
2020-07-24 11:25:01.710 CEST [1452] ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e
ERROR:  invalid byte sequence for encoding "UTF8": 0xd6 0x2e

The reason why this happens is because the limited CMD's UTF8 support, 'chcp' only changes the code page of the displayed characters, shell commands, piping, redirection and most commands are still ANSI only.

Does explicitly setting 'PGCLIENTENCODING=WIN1252' on your shell solves the issue?

Regards,

Juan José Santamaría Flecha
  1. Thanks for putting in the effort to help.
  2. Thanks for teaching me about the "cmd.exe /U" option; I never even knew that it existed and was always annoyed by how Unicode characters looked messed up even though Unicode worked (outside of this case).
  3. As a matter of fact, 'PGCLIENTENCODING=WIN1252' causes it to stop failing and does dump the database. To make 100% sure that this was it, I set 'PGCLIENTENCODING=UTF8' immediately afterwards and repeated the command: failed. Then changed it back to 'PGCLIENTENCODING=WIN1252' once again and repeated: worked. So it's definitely due to this, which makes zero sense since the database *HAS* the encoding UTF8!
  4. (Lots of too-early negative conclusions by me removed here.) It turns out that, using your =WIN1252 hack, and this specific version of the exclude syntax: --exclude-table-data="\"Test schäma\".\"Test täble\"" ... it works. It dumps correctly and ignores that specific table in that schema.
  5. If anyone responsible for the documentation for PostgreSQL is reading this, please update it to make clear what syntax one is supposed to use for the confusing --exclude-table-data option!
  6. Thanks for finding a workaround, but I sure don't like the fact that I have to lie to PG about the encoding to be able to dump my database. Who knows what kind of unintended consequences this might have? Will it dump the DB's contents incorrectly? Will it remove/garble any non-WIN1252-compatible data? Will it corrupt my backups? It feels horribly wrong to set the wrong encoding like this, when *everything* I do/use works in UTF-8...

On Friday, July 24, 2020, <tutiluren@tutanota.com> wrote:
  1. If anyone responsible for the documentation for PostgreSQL is reading this, please update it to make clear what syntax one is supposed to use for the confusing --exclude-table-data option!
While not that exact example a sufficient command is already documented that deals with handling identifiers in shell command.

 pg_dump -t "\"MixedCaseName\"" mydb > mytab.sql

David J.

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
Although I now have a work-around, which *appears* to work, I'm still utterly confused as to *why* it works. To be clear: why is it that setting the environment variable 'PGCLIENTENCODING' to 'WIN1252' makes the pg_dump succeed when everything, including the database I'm trying to dump, uses UTF8? The only thing not using UTF8 is the "postgres" database (which indeed uses 'WIN1252', certainly not out of my active choice), but that's not involved here... My command connects to my testdb, does it not?

pg_dump --format plain --verbose --file "testdump.txt" --exclude-table-data="\"Test schäma\".\"Test täble\"" --host="localhost" --port="5432" --username="postgres" --dbname="test"

The '--dbname="test"' part suggests to me that I'm not connecting to "postgres" or involving it in any way. So why does it (apparently) have something to do with these bizarre errors which only happen if I have the PGCLIENTENCODING set to UTF8 but not WIN1252? My test database is verified UTF8.
On 2020-Jul-25, tutiluren@tutanota.com wrote:

> Although I now have a work-around, which *appears* to work, I'm still utterly confused as to *why* it works. To be
clear:why is it that setting the environment variable 'PGCLIENTENCODING' to 'WIN1252' makes the pg_dump succeed when
everything,including the database I'm trying to dump, uses UTF8? The only thing not using UTF8 is the "postgres"
database(which indeed uses 'WIN1252', certainly not out of my active choice), but that's not involved here... My
commandconnects to my testdb, does it not?
 

What doesn't use UTF8 is your terminal.  So when you enter the accented
letter, you're not actually sending the UTF8 character you think you're
sending, but instead a Win1252 character which matches nothing.

When you use "cmd /U", then your terminal *is* UTF8, so the character is
correct and things work.

You can further test this by setting log_statements=all in
postgresql.conf, sending a reload signal to postgres, and then rerunning
the pg_dump command.  When run in the non-UTF8 terminal, the server log
file will show different garbage than when run in the UTF8 terminal.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
Alright. Sigh. I noticed a huge difference in the file sizes between my normal backups and my latest one. I started getting suspicious, so I made a new test WITHOUT using "--exclude-table-data" at all:

I dumped the same database:

1. With setting PGCLIENTENCODING=WIN1252, the weird "workaround".
2. Without setting that. (That is, UTF8.)

The first one is *MUCH* smaller. Opening it up in a visual diff viewer, I can see that HUGE amounts of my data has simply not been copied in the first case. Which is a nightmare scenario and thank God that I noticed this instead of just assuming that it was working now... my backups would've been worthless.

In other words: setting PGCLIENTENCODING=WIN1252 when the database is UTF8 makes pg_dump ignore massive amounts of the data in the database. For this reason, I cannot possibly use this as a "workaround" for my "--exclude-table-data" problem.

Yes, I have very carefully tried with this with "cmd.exe /U" as well as setting the Unicode codepage; it makes *no difference*. Nothing seems to make a difference; pg_dump doesn't seem to *want* to work. It's "all or nothing". I can't exclude any part of the database. The "PGCLIENTENCODING=WIN1252" workaround is sadly insanely dangerous and unusable.

Frankly, at this point I don't have the slightest idea what I could possibly do/try. I have tried so many things by now. If this is truly due to cmd.exe somehow, it must be doing something extraordinarily evil and broken even for Microsoft.

What do you suggest? Should I just pretend as if the --exclude* options don't exist as I've been forced to do many times with a lot of "crucial" features in a lot of software which just won't work correctly? It's just so incredibly inconvenient to have to either store all that temporary junk data or be forced to TRUNCATE it all for that specific table before each backup; what if I'm working right around midnight when the backup is made and I want the last few hours worth of data? I'd have just seconds worth of data instead since I couldn't exclude the table due to "some bug" which I don't even know where it lives: pg_dump or cmd.exe.

Is it entirely unthinkable that this is a pg_dump bug?

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
I don't know if anyone is getting or reading my messages, but I just want to make it clear that my problem has
*not* been resolved, and I suspect more strongly than ever that pg_dump is at fault.

I don't know what I could possibly add to what I've already posted describing my issue, so I probably shouldn't repeat myself any further now, but simply hope that somebody will read my previous posts and be able to help me.

It truly looks as if I have to forget that the "--exclude*" options exist at all...
On Sat, Jul 25, 2020 at 6:44 PM <tutiluren@tutanota.com> wrote:
I don't know if anyone is getting or reading my messages,

You can search the mailing list archives to confirm receipt by the mailing list.


but I just want to make it clear that my problem has
*not* been resolved, and I suspect more strongly than ever that pg_dump is at fault.

I don't know what I could possibly add to what I've already posted describing my issue, so I probably shouldn't repeat myself any further now, but simply hope that somebody will read my previous posts and be able to help me.

Pretty much.  Consider too that the people likely to be of help here (a significant subset of available hackers given that it seems to be Windows only) are busy wrapping up coding for version 13 due out within a month.

David J.

tutiluren:

Since at least part of your struggle has been with cmd.exe not handling UTF8 well, I wonder if you have considered
WindowsTerminal, which purports to handle UTF8 better?
 

https://www.microsoft.com/en-us/p/windows-terminal/9n0dx20hk701

I'm not saying it will solve everything, but it might at least remove one layer of confusion.

Graham

At 7/25/2020 06:44 PM, tutiluren@tutanota.com wrote:
>I don't know if anyone is getting or reading my messages, but I just want to make it clear that my problem has 
>*not* been resolved, and I suspect more strongly than ever that pg_dump is at fault.
>
>I don't know what I could possibly add to what I've already posted describing my issue, so I probably shouldn't repeat
myselfany further now, but simply hope that somebody will read my previous posts and be able to help me.
 
>
>It truly looks as if I have to forget that the "--exclude*" options exist at all...




Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
Juan José Santamaría Flecha
Date:
On Sat, Jul 25, 2020 at 4:21 AM <tutiluren@tutanota.com> wrote:
Alright. Sigh. I noticed a huge difference in the file sizes between my normal backups and my latest one. I started getting suspicious, so I made a new test WITHOUT using "--exclude-table-data" at all:

I dumped the same database:

1. With setting PGCLIENTENCODING=WIN1252, the weird "workaround".
2. Without setting that. (That is, UTF8.)

The first one is *MUCH* smaller. Opening it up in a visual diff viewer, I can see that HUGE amounts of my data has simply not been copied in the first case. Which is a nightmare scenario and thank God that I noticed this instead of just assuming that it was working now... my backups would've been worthless.

In other words: setting PGCLIENTENCODING=WIN1252 when the database is UTF8 makes pg_dump ignore massive amounts of the data in the database. For this reason, I cannot possibly use this as a "workaround" for my "--exclude-table-data" problem.

You are comparing a file that uses a single-byte encoding (WIN1252) with another file that uses multibyte encoding (UTF8), so the size difference is not unexplainable.

Also, diff-ing two files with mismatched encodings is not going to work as expected. What you can do is, change the display code page of the CMD to match the PGCLIENTENCODING (chcp 1252 & chcp 65001), and use the command "type" to print on screen the content of the dump files generated with both encodings. If you find a mismatch, please share.
 
Yes, I have very carefully tried with this with "cmd.exe /U" as well as setting the Unicode codepage; it makes *no difference*. Nothing seems to make a difference; pg_dump doesn't seem to *want* to work. It's "all or nothing". I can't exclude any part of the database. The "PGCLIENTENCODING=WIN1252" workaround is sadly insanely dangerous and unusable.

Your OS code page is WIN1252, that is something with a heavy impact in the system. In fact, your client is natively WIN1252 and explicitly setting the PGCLIENTENCODING is not a weird hack, but a regular configuration parameter. With all the configuring on the CMD code page, we can only change the encoding of the displayed text.

Is it entirely unthinkable that this is a pg_dump bug?

What you are describing does not look like a bug to me, but a client encoding problem. 
 
If PGCLIENTENCODING=WIN1252 was failing for pg_dump, it would not do it silently. You would see something like:

pg_dump: error: Dumping the contents of table "Ä" failed: PQgetResult() failed.
pg_dump: error: Error message from server: ERROR:  character with byte sequence 0xe5 0x82 0x89 in encoding "UTF8" has no equivalent in encoding "WIN1252"
pg_dump: error: The command was: COPY "Ö"."Ä" (c1) TO stdout;

If you see this error, then PGCLIENTENCODING=1252 will not be a viable workaround for you, and will have to resort to any of the possible solutions that have already been suggested: activate the beta UTF8 support of the Windows Regional settings or access your database from a system with true UTF8 terminal support.

Regards,

Juan José Santamaría Flecha

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:

Since at least part of your struggle has been with cmd.exe not handling UTF8 well, I wonder if you have considered Windows Terminal, which purports to handle UTF8 better?

https://www.microsoft.com/en-us/p/windows-terminal/9n0dx20hk701

I'm not saying it will solve everything, but it might at least remove one layer of confusion.

Well, I'll probably start using that "Windows Terminal" thing when it actually is built into Windows. I've heard it mentioned numerous times, but it's some kind of "external download" and involves the vile "Windows Store" on top of it, and I know it will cause numerous issues since everything about my system expects cmd.exe. I spent ages trying to get PowerShell (which is apparently now already outdated if Windows Terminal is the new "hot" terminal to use...) to work properly, but it never did, and thus I'm not very eager to switch.

Plus nothing else other than pg_dump has any problems with Unicode, so that's also a bit... odd.

PS: I actually just tried following the link to download it, but it just gets stuck at a nag screen to make me "log in" to my (nonexistent) "Microsoft account". Ugh. Not a chance. Sorry.

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:

Also, diff-ing two files with mismatched encodings is not going to work as expected.

I used WinMerge and visually saw how the "COPY" command was "cut off". It truly had not copied the data. Maybe the error code was "error", but that's what the dumped file looked like. I'm not making this up.

If you see this error, then PGCLIENTENCODING=1252 will not be a viable workaround
Since it skips huge amounts of the data in the database, it's not viable, no... And it never made any sense in the first place, since the database is UTF-8 and everything is UTF-8...
activate the beta UTF8 support of the Windows Regional settings
"Beta UTF8 support"? Regional settings? Nothing other than pg_dump has these problems, yet it's somehow Windows' fault? pg_dump is the program that either errors out or silently ignores data. It's not Windows/cmd.exe not sending the right table/schema names...

To prove my point, also to myself, I opened up a normal cmd.exe, without even using any /U flag, and ran:

php test.php --exclude-table-data="\"Test ❆ schäma\".\"Test ❆ täble\""

My test.php simply outputs the value of the "--exclude-table-data" parameter. Guess what it outputs?

"Test ❆ schäma"."Test ❆ täble"

So, evidently, the standard cmd.exe has no problems sending non-ANSI characters or double quotes to a program/script. Or even to output them (at least in this case). In other words: since my own script receives the Unicode chars and the double quotes, pg_dump also should, unless there's some issue with it. Unless I'm still not understanding what you're saying.

or access your database from a system with true UTF8 terminal support.
I don't have a random system just lying around, ready to migrate my entire "life" machine to. If that were the case, this world would look very different from what it is, and I would have finally have an alternative to this nightmare OS. (No, Linux is not an alternative. I've spent the last 20 years regularly coming back to Windows due to the neverending problems with literally *everything* in Linux/Unix... even considering how horrible Windows is. Which is quite a feat. Please don't think I'm defending modern Windows for one second.)
On Mon, Jul 27, 2020 at 1:06 AM <tutiluren@tutanota.com> wrote:

or access your database from a system with true UTF8 terminal support.
I don't have a random system just lying around, ready to migrate my entire "life" machine to. If that were the case, this world would look very different from what it is, and I would have finally have an alternative to this nightmare OS. (No, Linux is not an alternative. I've spent the last 20 years regularly coming back to Windows due to the neverending problems with literally *everything* in Linux/Unix... even considering how horrible Windows is. Which is quite a feat. Please don't think I'm defending modern Windows for one second.)

Ten years ago my first reaction would be to download msys and try the command there.  It looks, however, that the development stalled around 2008: http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't tried it: https://www.msys2.org/

You'll have to quote the special characters differently, but that way of quoting will be familiar to the majority of folks here, I guess.

--
Alex

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:

Ten years ago my first reaction would be to download msys and try the command there.  It looks, however, that the development stalled around 2008: http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't tried it: https://www.msys2.org/

You'll have to quote the special characters differently, but that way of quoting will be familiar to the majority of folks here, I guess.

I don't want to download or install anything. I thought I would finally convince you all when I showed that my own script is fully able to receive the Unicode characters and quotes from the cmd.exe which doesn't even use the /U flag, so clearly pg_dump is not reading this correctly? I can't see any other explanation at this point. It frankly feels like you don't *want* to support Windows. "Cross-platform" seems to mean "several Linux distributions" these days...
On Mon, Jul 27, 2020 at 4:15 PM <tutiluren@tutanota.com> wrote:
I don't want to download or install anything. I thought I would finally convince you all when I showed that my own script is fully able to receive the Unicode characters and quotes from the cmd.exe which doesn't even use the /U flag, so clearly pg_dump is not reading this correctly? I can't see any other explanation at this point. It frankly feels like you don't *want* to support Windows. "Cross-platform" seems to mean "several Linux distributions" these days...

I do believe there is something here worth exploring.  You will just need to be patient and wait for someone capable to agree.  If it's really important to you paid support services are an option.

The fact that there are few reports being made of this nature, coupled with the upfront hostility of the reports, makes it easy to put the concern aside while less difficult problems exist - and, to repeat, while being at the very end of a development cycle for a new release.

David J.

On Tue, Jul 28, 2020 at 1:15 AM <tutiluren@tutanota.com> wrote:
Ten years ago my first reaction would be to download msys and try the command there.  It looks, however, that the development stalled around 2008: http://www.mingw.org/wiki/MSYS
Now I read that a second generation of that is available, but haven't tried it: https://www.msys2.org/

You'll have to quote the special characters differently, but that way of quoting will be familiar to the majority of folks here, I guess.
I don't want to download or install anything. I thought I would finally convince you all when I showed that my own script is fully able to receive the Unicode characters and quotes from the cmd.exe which doesn't even use the /U flag, so clearly pg_dump is not reading this correctly? I can't see any other explanation at this point. It frankly feels like you don't *want* to support Windows. "Cross-platform" seems to mean "several Linux distributions" these days...

I'm not entirely convinced that reading the arguments from the command line and printing them back on the terminal proves anything about the actual encoding of those characters.  In the end it is cmd.exe that interprets them in both cases.  "Garbage in—garbage out", as they say.

I also don't think that pg_dump or any other CLI program has to do anything special in order to "read it correctly".  It reads what the system (Windows, cmd.exe) gives it.  If the system chooses to play tricks, there's little we can do about it.

I do not have access to a Windows system these days, but I guess the most reasonable thing at this point is to come up with a minimal and self-contained reproducer, so that someone who does can try it and see what you see on your end.  So far you have only provided some terminal invocations accompanied by error messages and vague descriptions of comparing the actual and expected output of pg_dump.

Regards,
--
Alex

Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
tutiluren@tutanota.com
Date:
> come up with a minimal and self-contained reproducer, so that someone who does can try it and see what you see on your end.  So far you have only provided some terminal invocations accompanied by error messages and vague descriptions of comparing the actual and expected output of pg_dump.

Did you miss my e-mail starting with "After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all."? (I don't now how I'd link to it.) There I go through everything with full commands and a minimal test database (which I also describe how I created).

I don't see how I could possibly create a smaller or clearer example of the issue. Whatever "strange things" cmd.exe does clearly doesn't show up for my cross-platform PHP CLI script. (I guess it's theoretically possible that PHP for Windows does some sort of "processing" automatically which pg_dump lacks, but I doubt it.)
On Tue, Jul 28, 2020 at 3:40 PM <tutiluren@tutanota.com> wrote: 
(I don't now how I'd link to it.)

Messages and threads in the mailing list archive I mentioned before can be linked to.


There I go through everything with full commands and a minimal test database (which I also describe how I created).

I don't see how I could possibly create a smaller or clearer example of the issue.

A psql script for setup would increase the self-contained nature of the test.  In Linux I'd also use something like grep to test the output file and/or script outputs.  More generally the test script should probably include pg_restore and then various psql commands to check that the resulting restored database contains the expected data.

That said, If I was setup for experimenting in Windows, probably including compiling, the provided information is enough to put together a complete test case to experiment with.

Whatever "strange things" cmd.exe does clearly doesn't show up for my cross-platform PHP CLI script. (I guess it's theoretically possible that PHP for Windows does some sort of "processing" automatically which pg_dump lacks, but I doubt it.)

The PHP script you show is a pure client-only script.  The observed behavior is occurring in a client-server architecture, specifically in the communication interface between the two.  It's also, quite probably, not restricted to pg_dump (a more useful test script would include trying to demonstrate the problem using just psql, probably using the -v option).

David J.
On Wed, Jul 29, 2020 at 12:40 AM <tutiluren@tutanota.com> wrote:
> come up with a minimal and self-contained reproducer, so that someone who does can try it and see what you see on your end.  So far you have only provided some terminal invocations accompanied by error messages and vague descriptions of comparing the actual and expected output of pg_dump.

Did you miss my e-mail starting with "After taking a longer break from my problem, I have now made a fresh, clean, scientifically conducted experiment in order to truly get to the bottom of this annoying problem once and for all."? (I don't now how I'd link to it.) There I go through everything with full commands and a minimal test database (which I also describe how I created).

I don't see how I could possibly create a smaller or clearer example of the issue. Whatever "strange things" cmd.exe does clearly doesn't show up for my cross-platform PHP CLI script. (I guess it's theoretically possible that PHP for Windows does some sort of "processing" automatically which pg_dump lacks, but I doubt it.)

I haven't missed that.  I'm trying to point out that your description of the problem is not self-contained: it lacks the schema definitions and test data.

Didn't you mention that once you could overcome the runtime errors you still have seen some unexpected output from pg_dump?  A script that one can download and run on their (Windows) system would help to confirm the problem (if it does reproduce on others' systems) and speed up diagnosys and a fix, if required.

My expectation from such a script is, that it:

1. Connects to a postgres server on localhost:5432.
2. Creates a test database and re-connects to it.
3. Creates the test schema and table(s), and populatest the tables with the test data.
4. Issues the pg_dump command.
5. Compares the resulting output to the expected one (or runs checks against the output produced by pg_dump).

Kind regards,
--
Alex

Oleksandr Shulgin <oleksandr.shulgin@zalando.de> writes:
> Didn't you mention that once you could overcome the runtime errors you
> still have seen some unexpected output from pg_dump?  A script that one can
> download and run on their (Windows) system would help to confirm the
> problem (if it does reproduce on others' systems) and speed up diagnosys
> and a fix, if required.

My understanding of what's happening here (admittedly, I've not been
paying really close attention) is that some component is misprocessing
non-ASCII data as it's being typed into the pg_dump shell command.
That being the case, I'm not sure that a pre-written shell script could
reproduce the issue accurately.  A script to create the initial database
contents would surely be helpful, but the steps after that might need to
be more like "type this into the terminal, and then you will get X instead
of Y".

            regards, tom lane



Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.

From
Juan José Santamaría Flecha
Date:

On Wed, Jul 29, 2020 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Oleksandr Shulgin <oleksandr.shulgin@zalando.de> writes:
> Didn't you mention that once you could overcome the runtime errors you
> still have seen some unexpected output from pg_dump?  A script that one can
> download and run on their (Windows) system would help to confirm the
> problem (if it does reproduce on others' systems) and speed up diagnosys
> and a fix, if required.

My understanding of what's happening here (admittedly, I've not been
paying really close attention) is that some component is misprocessing
non-ASCII data as it's being typed into the pg_dump shell command.
That being the case, I'm not sure that a pre-written shell script could
reproduce the issue accurately.  A script to create the initial database
contents would surely be helpful, but the steps after that might need to
be more like "type this into the terminal, and then you will get X instead
of Y".

On a Windows "char* argv[]" uses the system default Windows ANSI code page (CP_ACP), this can only be changed through the system Regional settings. The "chcp" command changes the code page of the printed characters on CMD.

Please find attached a simple test program: "test_argv.c". The output for this program is:

C:\test>chcp 65001
Active code page: 65001

C:\test>test_argv Hëllö
Hëllö
0x48C3AB6C6CC3B6
Hll
0x48EB6C6CF6
C:\test>chcp 1252
Active code page: 1252

C:\test>test_argv Hëllö
Hëllö
0x48C3AB6C6CC3B6
Hëllö
0x48EB6C6CF6

To match this behaviour PGCLIENTENCODING should match the Windows system ANSI encoding.

Regards,

Juan José Santamaría Flecha
Attachment