Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade - Mailing list pgsql-general

From	Keith Fiske
Subject	Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade
Date	April 16, 2018 19:36:45
Msg-id	CAODZiv5G3QGQ_TnPHSb1p6-wowzkwekw6AHjqCvrTvsm=osP5w@mail.gmail.com
In response to	Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade (Vick Khera <vivek@khera.org>)
List	pgsql-general

On Mon, Apr 16, 2018 at 12:30 PM, Vick Khera <vivek@khera.org> wrote:

Hi Keith,

Not sure if this will help, but a couple of years ago I migrated from an SQL_ASCII encoding to UTF8. The data was primarily ASCII with some Windows garbage, and a little bit of UTF8 from customers filling out forms that were not specifically encoded as anything.

I wrote a utility that scans and updates the tables in your SQL_ASCII-encoded database in place and ensures that everything is 100% UTF8 NFC at the end. For us, there were some characters in some bizarre local encodings, and we had to either toss them or make educated guesses for them.

After the cleaning, you dump with client encoding UTF8, then restore into the final database with UTF8 encoding.

You can find it on my GitHub along with documentation and tests to verify it works: https://github.com/khera/utf8-inline-cleaner
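To illustrate the kind of per-value cleanup described above, here is a minimal sketch in Python. It is not the code from utf8-inline-cleaner; the function name clean_to_utf8_nfc is hypothetical, and the choice of Windows-1252 as the fallback guess for bytes that are not valid UTF-8 is an assumption that will not suit every data set.

```python
import unicodedata


def clean_to_utf8_nfc(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes of unknown encoding and return NFC-normalized text.

    Hypothetical helper: the fallback encoding is an educated guess,
    not something the bytes themselves can confirm.
    """
    try:
        # Values that are already valid UTF-8 (including plain ASCII) pass through.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is Windows-1252. Bytes that are undefined
        # in that code page become U+FFFD instead of raising an error.
        text = raw.decode(fallback, errors="replace")
    # Compose combining sequences so every value ends up in NFC form.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(clean_to_utf8_nfc(b"caf\xe9"))          # Windows-1252 byte for e-acute
    print(clean_to_utf8_nfc(b"cafe\xcc\x81"))     # UTF-8 'e' + combining acute
```

A real cleanup would also need to decide, per column, whether to drop, replace, or hand-review the values that fall through to the fallback branch, since a wrong guess silently changes the data.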

On Mon, Apr 16, 2018 at 11:16 AM, Keith Fiske <keith.fiske@crunchydata.com> wrote:
Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I know, please keep the out-of-support comments to a minimum, thanks :).

The old database was in SQL_ASCII and it needs to stay that way for now, unfortunately. The dump and restore itself works fine, but we're now running into issues with some data returning encoding errors unless we specifically set the client_encoding value to SQL_ASCII.

Looking at the 8.3 database, it has the client_encoding value set to UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 not enforcing encoding properly?

The other thing I noticed on the 10 instance was that, while the ENCODING was set to SQL_ASCII, the COLLATE and CTYPE values for the restored databases were en_US.UTF-8. Could this be having an effect? Is there any way to see what these values were on the old 8.3 database? The pg_database catalog did not store these values back then.

--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com

Thanks Vick! We will hopefully be helping them to get off SQL_ASCII after this upgrade. It was challenging enough to get the upgrade itself done, so doing the encoding migration at the same time unfortunately wasn't possible. It's more than just the database that needs fixing; it's an entire data ingestion process as well.

Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com
