Thread: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade
Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I know, please keep the out of support comments to a minimum, thanks :).
The old database was in SQL_ASCII and it needs to stay that way for now unfortunately. The dump and restore itself works fine, but we're now running into issues with some data returning encoding errors unless we specifically set the client_encoding value to SQL_ASCII.
Looking at the 8.3 database, it has the client_encoding value set to UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 not enforcing encoding properly?
The other thing I noticed on the 10 instance was that, while the LOCALE was set to SQL_ASCII, the COLLATE and CTYPE values for the restored databases were en_US.UTF-8. Could this be having an affect? Is there any way to see what these values were on the old 8.3 database? The pg_database catalog does not have these values stored back then.
--
On 04/16/2018 08:16 AM, Keith Fiske wrote: > Running into an issue with helping a client upgrade from 8.3 to 10 (yes, > I know, please keep the out of support comments to a minimum, thanks :). > > The old database was in SQL_ASCII and it needs to stay that way for now > unfortunately. The dump and restore itself works fine, but we're now > running into issues with some data returning encoding errors unless we > specifically set the client_encoding value to SQL_ASCII. > > Looking at the 8.3 database, it has the client_encoding value set to > UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 > not enforcing encoding properly? > > The other thing I noticed on the 10 instance was that, while the LOCALE > was set to SQL_ASCII, the COLLATE and CTYPE values for the restored > databases were en_US.UTF-8. Could this be having an affect? Is there any > way to see what these values were on the old 8.3 database? The > pg_database catalog does not have these values stored back then. If I remember correctly: https://www.postgresql.org/docs/8.3/static/app-pgcontroldata.html > > -- > Keith Fiske > Senior Database Engineer > Crunchy Data - http://crunchydata.com -- Adrian Klaver adrian.klaver@aklaver.com
On 04/16/2018 08:16 AM, Keith Fiske wrote: > Running into an issue with helping a client upgrade from 8.3 to 10 (yes, > I know, please keep the out of support comments to a minimum, thanks :). > > The old database was in SQL_ASCII and it needs to stay that way for now > unfortunately. The dump and restore itself works fine, but we're now > running into issues with some data returning encoding errors unless we > specifically set the client_encoding value to SQL_ASCII. > > Looking at the 8.3 database, it has the client_encoding value set to > UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 > not enforcing encoding properly?e AFAIK, SQL_ASCII basically means no encoding: https://www.postgresql.org/docs/10/static/multibyte.html "The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters." What client are you working with? If psql then its behavior has changed between 8.3 and 10: https://www.postgresql.org/docs/10/static/release-9-1.html#id-1.11.6.121.3 " Have psql set the client encoding from the operating system locale by default (Heikki Linnakangas) This only happens if the PGCLIENTENCODING environment variable is not set. " https://www.postgresql.org/docs/10/static/app-psql.html "If both standard input and standard output are a terminal, then psql sets the client encoding to “auto”, which will detect the appropriate client encoding from the locale settings (LC_CTYPE environment variable on Unix systems). If this doesn't work out as expected, the client encoding can be overridden using the environment variable PGCLIENTENCODING." > > The other thing I noticed on the 10 instance was that, while the LOCALE > was set to SQL_ASCII, the COLLATE and CTYPE values for the restored > databases were en_US.UTF-8. Could this be having an affect? Is there any > way to see what these values were on the old 8.3 database? The > pg_database catalog does not have these values stored back then. > > -- > Keith Fiske > Senior Database Engineer > Crunchy Data - http://crunchydata.com -- Adrian Klaver adrian.klaver@aklaver.com
Keith Fiske <keith.fiske@crunchydata.com> writes: > Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I > know, please keep the out of support comments to a minimum, thanks :). > The old database was in SQL_ASCII and it needs to stay that way for now > unfortunately. The dump and restore itself works fine, but we're now > running into issues with some data returning encoding errors unless we > specifically set the client_encoding value to SQL_ASCII. I'm guessing you might be hitting this 9.1 change: * Have psql set the client encoding from the operating system locale by default (Heikki Linnakangas) This only happens if the PGCLIENTENCODING environment variable is not set. I think the previous default was to set client encoding equal to the server encoding. > Looking at the 8.3 database, it has the client_encoding value set to UTF8 > and queries seem to work fine. Is this just a bug in the old 8.3 not > enforcing encoding properly? Somewhere along the line we made SQL_ASCII -> something else conversions check that the data was valid per the other encoding, even though no actual data change happens. > The other thing I noticed on the 10 instance was that, while the LOCALE was > set to SQL_ASCII, You mean encoding, I assume. > the COLLATE and CTYPE values for the restored databases > were en_US.UTF-8. Could this be having an affect? This is not a great idea, no. You could be getting strange misbehaviors in e.g. string comparison, because strcoll() will expect UTF8 data and will likely not cope well with data that isn't valid in that encoding. If you can't sanitize the encoding of your data, I'd suggest running with lc_collate and lc_ctype set to "C". regards, tom lane
On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Keith Fiske <keith.fiske@crunchydata.com> writes:
> Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I
> know, please keep the out of support comments to a minimum, thanks :).
> The old database was in SQL_ASCII and it needs to stay that way for now
> unfortunately. The dump and restore itself works fine, but we're now
> running into issues with some data returning encoding errors unless we
> specifically set the client_encoding value to SQL_ASCII.
I'm guessing you might be hitting this 9.1 change:
* Have psql set the client encoding from the operating system locale
by default (Heikki Linnakangas)
This only happens if the PGCLIENTENCODING environment variable is
not set.
I think the previous default was to set client encoding equal to the
server encoding.
> Looking at the 8.3 database, it has the client_encoding value set to UTF8
> and queries seem to work fine. Is this just a bug in the old 8.3 not
> enforcing encoding properly?
Somewhere along the line we made SQL_ASCII -> something else conversions
check that the data was valid per the other encoding, even though no
actual data change happens.
> The other thing I noticed on the 10 instance was that, while the LOCALE was
> set to SQL_ASCII,
You mean encoding, I assume.
> the COLLATE and CTYPE values for the restored databases
> were en_US.UTF-8. Could this be having an affect?
This is not a great idea, no. You could be getting strange misbehaviors
in e.g. string comparison, because strcoll() will expect UTF8 data and
will likely not cope well with data that isn't valid in that encoding.
If you can't sanitize the encoding of your data, I'd suggest running
with lc_collate and lc_ctype set to "C".
regards, tom lane
Thanks to both of you Adrian & Tom.
It is the 9.1 change to the psql client that seems to be causing this.
And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on the old system. If that's the case, do you still think it's a good idea to set the COLLATE and CTYPE to "C"?
Keith Fiske <keith.fiske@crunchydata.com> writes: > On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This is not a great idea, no. You could be getting strange misbehaviors >> in e.g. string comparison, because strcoll() will expect UTF8 data and >> will likely not cope well with data that isn't valid in that encoding. > And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on > the old system. If that's the case, do you still think it's a good idea to > set the COLLATE and CTYPE to "C"? Well, if the customer's been happy with the behavior of the system so far, maybe it's all right. But this is sure the first thing I'd look at if there are any gripes about its behavior with non-UTF8 strings. I'd be especially worried about this if you try to migrate the database to any new platform, as it's a bet about the behavior of libc not PG itself. regards, tom lane
Hi Keith,
Not sure if this will help but a couple of years ago I migrated from an SQL_ASCII encoding to UTF8. The data was primarily ASCII with some windows garbage, and a little bit of UTF8 from customers filling out forms that were not specifically encoded anything.
I wrote a utility that in-place scans and updates the tables in your SQL_ASCII-encoded database and ensures that everything is 100% UTF8 NFC at the end. For us, there were some characters in some bizarre local encodings, and we had to either toss or make educated guesses for them.
After the cleaning, you dump with client encoding UTF8, then restore into the final database with UTF8 encoding.
You can find it on my github along with documentation and tests to verify it works: https://github.com/khera/utf8-inline-cleaner
On Mon, Apr 16, 2018 at 11:16 AM, Keith Fiske <keith.fiske@crunchydata.com> wrote:
Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I know, please keep the out of support comments to a minimum, thanks :).The old database was in SQL_ASCII and it needs to stay that way for now unfortunately. The dump and restore itself works fine, but we're now running into issues with some data returning encoding errors unless we specifically set the client_encoding value to SQL_ASCII.Looking at the 8.3 database, it has the client_encoding value set to UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 not enforcing encoding properly?The other thing I noticed on the 10 instance was that, while the LOCALE was set to SQL_ASCII, the COLLATE and CTYPE values for the restored databases were en_US.UTF-8. Could this be having an affect? Is there any way to see what these values were on the old 8.3 database? The pg_database catalog does not have these values stored back then.--
On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Keith Fiske <keith.fiske@crunchydata.com> writes:
> On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is not a great idea, no. You could be getting strange misbehaviors
>> in e.g. string comparison, because strcoll() will expect UTF8 data and
>> will likely not cope well with data that isn't valid in that encoding.
> And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on
> the old system. If that's the case, do you still think it's a good idea to
> set the COLLATE and CTYPE to "C"?
Well, if the customer's been happy with the behavior of the system so far,
maybe it's all right. But this is sure the first thing I'd look at if
there are any gripes about its behavior with non-UTF8 strings. I'd be
especially worried about this if you try to migrate the database to any
new platform, as it's a bet about the behavior of libc not PG itself.
regards, tom lane
This is going from RHEL 6.7 to RHEL 7.4
It is a dump and restore upgrade as well.
On Mon, Apr 16, 2018 at 12:30 PM, Vick Khera <vivek@khera.org> wrote:
Hi Keith,Not sure if this will help but a couple of years ago I migrated from an SQL_ASCII encoding to UTF8. The data was primarily ASCII with some windows garbage, and a little bit of UTF8 from customers filling out forms that were not specifically encoded anything.I wrote a utility that in-place scans and updates the tables in your SQL_ASCII-encoded database and ensures that everything is 100% UTF8 NFC at the end. For us, there were some characters in some bizarre local encodings, and we had to either toss or make educated guesses for them.After the cleaning, you dump with client encoding UTF8, then restore into the final database with UTF8 encoding.You can find it on my github along with documentation and tests to verify it works: https://github.com/khera/utf8-inline-cleaner On Mon, Apr 16, 2018 at 11:16 AM, Keith Fiske <keith.fiske@crunchydata.com> wrote:Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I know, please keep the out of support comments to a minimum, thanks :).The old database was in SQL_ASCII and it needs to stay that way for now unfortunately. The dump and restore itself works fine, but we're now running into issues with some data returning encoding errors unless we specifically set the client_encoding value to SQL_ASCII.Looking at the 8.3 database, it has the client_encoding value set to UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 not enforcing encoding properly?The other thing I noticed on the 10 instance was that, while the LOCALE was set to SQL_ASCII, the COLLATE and CTYPE values for the restored databases were en_US.UTF-8. Could this be having an affect? Is there any way to see what these values were on the old 8.3 database? The pg_database catalog does not have these values stored back then.--
Thanks Vick! We will hopefully be helping them to get off SQL_ASCII after this upgrade. Was challenging enough to get the upgrade itself done, so doing the encoding migration at the same time unfortunately wasn't possible. It's more than just the database that needs fixing, it's an entire data ingestion process as well.
On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Keith Fiske <keith.fiske@crunchydata.com> writes:
> On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is not a great idea, no. You could be getting strange misbehaviors
>> in e.g. string comparison, because strcoll() will expect UTF8 data and
>> will likely not cope well with data that isn't valid in that encoding.
> And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on
> the old system. If that's the case, do you still think it's a good idea to
> set the COLLATE and CTYPE to "C"?
Well, if the customer's been happy with the behavior of the system so far,
maybe it's all right. But this is sure the first thing I'd look at if
there are any gripes about its behavior with non-UTF8 strings. I'd be
especially worried about this if you try to migrate the database to any
new platform, as it's a bet about the behavior of libc not PG itself.
regards, tom lane
So playing around with the "client_encoding" configuration option in postgresql.conf. According to the docs, setting this should set the default encoding for any client that connects, right?
That description also seems misleading in saying that the default client encoding is what the database encoding is as well. At least as far as psql is concerned, right? I've tried setting that value in postgresql.conf but psql keeps setting it to UTF8 when I connect to the SQL_ASCII database unless I then specifically run "set client_encoding = 'SQL_ASCII';"
On 04/16/2018 10:18 AM, Keith Fiske wrote: > > > So playing around with the "client_encoding" configuration option in > postgresql.conf. According to the docs, setting this should set the > default encoding for any client that connects, right? For any client that does not set a client_encoding when it connects. > > https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING > <https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING> > > That description also seems misleading in saying that the default client > encoding is what the database encoding is as well. At least as far as > psql is concerned, right? I've tried setting that value in > postgresql.conf but psql keeps setting it to UTF8 when I connect to the > SQL_ASCII database unless I then specifically run "set client_encoding = > 'SQL_ASCII';" That is due to this: https://www.postgresql.org/docs/10/static/app-psql.html "If both standard input and standard output are a terminal, then psql sets the client encoding to “auto”, which will detect the appropriate client encoding from the locale settings (LC_CTYPE environment variable on Unix systems). If this doesn't work out as expected, the client encoding can be overridden using the environment variable PGCLIENTENCODING." If you want to override set the PGCLIENTENCODING env variable to SQL_ASCII. > > -- > Keith Fiske > Senior Database Engineer > Crunchy Data - http://crunchydata.com -- Adrian Klaver adrian.klaver@aklaver.com
On 04/16/2018 10:18 AM, Keith Fiske wrote: > > > > So playing around with the "client_encoding" configuration option in > postgresql.conf. According to the docs, setting this should set the > default encoding for any client that connects, right? > > https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING > <https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING> > > That description also seems misleading in saying that the default client > encoding is what the database encoding is as well. At least as far as > psql is concerned, right? I've tried setting that value in > postgresql.conf but psql keeps setting it to UTF8 when I connect to the > SQL_ASCII database unless I then specifically run "set client_encoding = > 'SQL_ASCII';" The order that parameters are evaluated for libpq programs is: https://www.postgresql.org/docs/10/static/libpq-connect.html "PQconnectdbParams ... If any parameter is NULL or an empty string, the corresponding environment variable (see Section 33.14) is checked. If the environment variable is not set either, then the indicated built-in defaults are used. ... " > > -- > Keith Fiske > Senior Database Engineer > Crunchy Data - http://crunchydata.com -- Adrian Klaver adrian.klaver@aklaver.com
On 04/16/2018 10:18 AM, Keith Fiske wrote: > > > On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us > <mailto:tgl@sss.pgh.pa.us>> wrote: > > > So playing around with the "client_encoding" configuration option in > postgresql.conf. According to the docs, setting this should set the > default encoding for any client that connects, right? > > https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING > <https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING> > > That description also seems misleading in saying that the default client > encoding is what the database encoding is as well. At least as far as > psql is concerned, right? I've tried setting that value in > postgresql.conf but psql keeps setting it to UTF8 when I connect to the > SQL_ASCII database unless I then specifically run "set client_encoding = > 'SQL_ASCII';" Realized that while my previous suggestion of setting PGCLIENTENCODING to 'SQL_ASCII' would work it would also affect libpq programs beside psql. For a finer grained approach set \encoding 'SQL_ASCII' in a .psqlrc file.: https://www.postgresql.org/docs/10/static/app-psql.html "Files psqlrc and ~/.psqlrc Unless it is passed an -X option, psql attempts to read and execute commands from the system-wide startup file (psqlrc) and then the user's personal startup file (~/.psqlrc), after connecting to the database but before accepting normal commands. These files can be used to set up the client and/or the server to taste, typically with \set and SET commands. The system-wide startup file is named psqlrc and is sought in the installation's “system configuration” directory, which is most reliably identified by running pg_config --sysconfdir. By default this directory will be ../etc/ relative to the directory containing the PostgreSQL executables. The name of this directory can be set explicitly via the PGSYSCONFDIR environment variable. The user's personal startup file is named .psqlrc and is sought in the invoking user's home directory. On Windows, which lacks such a concept, the personal startup file is named %APPDATA%\postgresql\psqlrc.conf. The location of the user's startup file can be set explicitly via the PSQLRC environment variable. Both the system-wide startup file and the user's personal startup file can be made psql-version-specific by appending a dash and the PostgreSQL major or minor release number to the file name, for example ~/.psqlrc-9.2 or ~/.psqlrc-9.2.5. The most specific version-matching file will be read in preference to a non-version-specific file. " > > -- > Keith Fiske > Senior Database Engineer > Crunchy Data - http://crunchydata.com -- Adrian Klaver adrian.klaver@aklaver.com
On Mon, Apr 16, 2018 at 2:26 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 04/16/2018 10:18 AM, Keith Fiske wrote:
On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>> wrote:
So playing around with the "client_encoding" configuration option in postgresql.conf. According to the docs, setting this should set the default encoding for any client that connects, right?
https://www.postgresql.org/docs/10/static/runtime-config-cli ent.html#GUC-CLIENT-ENCODING <https://www.postgresql.org/do cs/10/static/runtime-config-cl ient.html#GUC-CLIENT-ENCODING>
That description also seems misleading in saying that the default client encoding is what the database encoding is as well. At least as far as psql is concerned, right? I've tried setting that value in postgresql.conf but psql keeps setting it to UTF8 when I connect to the SQL_ASCII database unless I then specifically run "set client_encoding = 'SQL_ASCII';"
Realized that while my previous suggestion of setting PGCLIENTENCODING to 'SQL_ASCII' would work it would also affect libpq programs beside psql.
For a finer grained approach set \encoding 'SQL_ASCII' in a .psqlrc file.:
https://www.postgresql.org/docs/10/static/app-psql.html
"Files
psqlrc and ~/.psqlrc
Unless it is passed an -X option, psql attempts to read and execute commands from the system-wide startup file (psqlrc) and then the user's personal startup file (~/.psqlrc), after connecting to the database but before accepting normal commands. These files can be used to set up the client and/or the server to taste, typically with \set and SET commands.
The system-wide startup file is named psqlrc and is sought in the installation's “system configuration” directory, which is most reliably identified by running pg_config --sysconfdir. By default this directory will be ../etc/ relative to the directory containing the PostgreSQL executables. The name of this directory can be set explicitly via the PGSYSCONFDIR environment variable.
The user's personal startup file is named .psqlrc and is sought in the invoking user's home directory. On Windows, which lacks such a concept, the personal startup file is named %APPDATA%\postgresql\psqlrc.conf. The location of the user's startup file can be set explicitly via the PSQLRC environment variable.
Both the system-wide startup file and the user's personal startup file can be made psql-version-specific by appending a dash and the PostgreSQL major or minor release number to the file name, for example ~/.psqlrc-9.2 or ~/.psqlrc-9.2.5. The most specific version-matching file will be read in preference to a non-version-specific file.--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian,
Thanks for all the info! This will definitely help a lot getting the new environment set up until we can start working on migrating off SQL_ASCII.