Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows. - Mailing list pgsql-bugs

From Juan José Santamaría Flecha
Subject Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.
Date
Msg-id CAC+AXB1XEkkWRn-3it=nY3n4kyOGMiqVwPeTiRVTT6CebJapvg@mail.gmail.com
In response to Re: pg_dump seems to be broken in regards to the "--exclude-table-data" option on Windows.  (tutiluren@tutanota.com)
List pgsql-bugs
On Sat, Jul 25, 2020 at 4:21 AM <tutiluren@tutanota.com> wrote:
> Alright. Sigh. I noticed a huge difference in the file sizes between my normal backups and my latest one. I started getting suspicious, so I made a new test WITHOUT using "--exclude-table-data" at all:
>
> I dumped the same database:
>
> 1. With setting PGCLIENTENCODING=WIN1252, the weird "workaround".
> 2. Without setting that. (That is, UTF8.)
>
> The first one is *MUCH* smaller. Opening it up in a visual diff viewer, I can see that HUGE amounts of my data has simply not been copied in the first case. Which is a nightmare scenario and thank God that I noticed this instead of just assuming that it was working now... my backups would've been worthless.
>
> In other words: setting PGCLIENTENCODING=WIN1252 when the database is UTF8 makes pg_dump ignore massive amounts of the data in the database. For this reason, I cannot possibly use this as a "workaround" for my "--exclude-table-data" problem.

You are comparing a file that uses a single-byte encoding (WIN1252) with another file that uses a multibyte encoding (UTF8), so a difference in size is to be expected.
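
As a rough illustration of why the two files differ in size, the following Python sketch encodes the same text both ways. It assumes that Python's "cp1252" codec is a close enough stand-in for PostgreSQL's WIN1252; the sample string is only an example and is not taken from the reporter's data.

```python
# Encode the same text with a single-byte and a multibyte encoding.
# Assumption: Python's "cp1252" codec approximates PostgreSQL's WIN1252.
sample = 'COPY "Ö"."Ä" (c1) TO stdout;'

win1252_bytes = sample.encode("cp1252")
utf8_bytes = sample.encode("utf-8")

# Each non-ASCII character here takes one byte in WIN1252 and two in UTF8.
print("WIN1252 bytes:", len(win1252_bytes))
print("UTF8 bytes:   ", len(utf8_bytes))

# Decoded with the matching codec, both byte strings carry the same text.
print(win1252_bytes.decode("cp1252") == utf8_bytes.decode("utf-8"))
```

The more non-ASCII text a table holds, the larger this gap becomes, so a size difference alone does not show that rows are missing.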

Also, diffing two files with mismatched encodings is not going to work as expected. What you can do is change the display code page of the CMD to match each PGCLIENTENCODING (chcp 1252 for WIN1252, chcp 65001 for UTF8) and use the "type" command to print the contents of the dump files generated with both encodings to the screen. If you find a mismatch, please share.
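
An alternative to eyeballing the "type" output is to decode each dump with its own encoding and compare the resulting text. The sketch below is only an illustration: the function name `decoded_diff` and the file labels are hypothetical, and Python's "cp1252" codec is assumed to approximate WIN1252.

```python
import difflib


def decoded_diff(win1252_dump: bytes, utf8_dump: bytes) -> list[str]:
    """Decode each dump with its own encoding, then diff the text line by line."""
    left = win1252_dump.decode("cp1252").splitlines()
    right = utf8_dump.decode("utf-8").splitlines()
    return list(
        difflib.unified_diff(
            left, right, fromfile="win1252.sql", tofile="utf8.sql", lineterm=""
        )
    )


# In-memory stand-ins for the two dump files (not real pg_dump output).
text = 'COPY "Ö"."Ä" (c1) FROM stdin;\nÄpple\n'
for line in decoded_diff(text.encode("cp1252"), text.encode("utf-8")):
    print(line)
```

For real files, the bytes could be read with `pathlib.Path(...).read_bytes()`. An empty result means the decoded contents match; a difference in the `SET client_encoding` line near the top of each dump would be expected and harmless.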
 
> Yes, I have very carefully tried with this with "cmd.exe /U" as well as setting the Unicode codepage; it makes *no difference*. Nothing seems to make a difference; pg_dump doesn't seem to *want* to work. It's "all or nothing". I can't exclude any part of the database. The "PGCLIENTENCODING=WIN1252" workaround is sadly insanely dangerous and unusable.

Your OS code page is WIN1252, which has a heavy impact on the system. In fact, your client is natively WIN1252, and explicitly setting PGCLIENTENCODING is not a weird hack but a regular configuration parameter. Changing the CMD code page only changes the encoding of the displayed text.

> Is it entirely unthinkable that this is a pg_dump bug?

What you are describing does not look like a bug to me, but a client encoding problem. 
 
If PGCLIENTENCODING=WIN1252 were failing for pg_dump, it would not fail silently. You would see something like:

pg_dump: error: Dumping the contents of table "Ä" failed: PQgetResult() failed.
pg_dump: error: Error message from server: ERROR:  character with byte sequence 0xe5 0x82 0x89 in encoding "UTF8" has no equivalent in encoding "WIN1252"
pg_dump: error: The command was: COPY "Ö"."Ä" (c1) TO stdout;
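
To see what the server is complaining about in the example error above, the byte sequence can be decoded and re-encoded in Python. This is a sketch only: it assumes Python's "cp1252" codec approximates WIN1252, and `fits_win1252` is a hypothetical helper name.

```python
# The byte sequence quoted in the server error message, decoded as UTF8.
char = bytes([0xE5, 0x82, 0x89]).decode("utf-8")


def fits_win1252(text: str) -> bool:
    """Return True if every character of text can be encoded as cp1252."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


# A CJK character has no WIN1252 equivalent; Ä and Ö do.
print(f"U+{ord(char):04X}", fits_win1252(char))
print("ÄÖ", fits_win1252("ÄÖ"))
```

A table whose text passes this kind of check for every value should dump cleanly with a WIN1252 client encoding; a single character outside the code page is enough to trigger the error.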

If you see this error, then PGCLIENTENCODING=WIN1252 will not be a viable workaround for you, and you will have to resort to one of the solutions that have already been suggested: activate the beta UTF8 support in the Windows Regional settings, or access your database from a system with true UTF8 terminal support.

Regards,

Juan José Santamaría Flecha
