Thread: Mixed Locales and Upgrading

Mixed Locales and Upgrading

From
Don Seiler
Date:
Good morning,

I have a few clusters that need to be upgraded from PG 9.6 or 10 to 12. Normally I'd just plan to pg_upgrade the lot and be good to go. However I've found that quite a few (including our biggest/busiest database) have mixed locales. In the case of the biggest/busiest database, the cluster was created with locale en_US (NOT en_US.UTF-8), and so the databases have encoding LATIN1. 

However this database has encoding UTF8 while still having ctype and collation of en_US. I've since found that when this was last upgraded, they ran "update pg_database set encoding = pg_char_to_encoding('UTF8') where datname = 'test';" to change the encoding. In my testing, pg_upgrade breaks when trying to restore this since UTF8 isn't supported in en_US for the CREATE DATABASE command used during pg_restore:

command: "/usr/lib/postgresql/12/bin/pg_restore" --host /var/lib/postgresql --port 50432 --username postgres --create --exit-on-error --verbose --dbname template1 "pg_upgrade_dump_16385.custom" >> "pg_upgrade_dump_16385.log" 2>&1
pg_restore: connecting to database for restore
pg_restore: creating DATABASE "test"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 2164; 1262 16385 DATABASE test postgres
pg_restore: error: could not execute query: ERROR:  encoding "UTF8" does not match locale "en_US"
DETAIL:  The chosen LC_CTYPE setting requires encoding "LATIN1".
Command was: CREATE DATABASE "test" WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US' LC_CTYPE = 'en_US';


Is there a way around this while still using pg_upgrade? My understanding is that a full dump and restore into a new DB with everything set to en_US.UTF-8 is the only to convert these (if I have to dump/restore, I wouldn't want to keep the mixed environment). Even with parallel jobs, I imagine that's a bit of downtime but I'll have to wait until I can get a copy of prod data to test with to be sure.

Is logical replication an option here? Either maintaining the mixed environment or converting everything to en_US.UTF-8? I'm relatively new in this shop but I'm told they didn't mean to use en_US and there's no reason they wouldn't want to just use the standard/default UTF-8.

Thanks,
Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Tom Lane
Date:
Don Seiler <don@seiler.us> writes:
> However this database has encoding UTF8 while still having ctype and
> collation of en_US. I've since found that when this was last upgraded, they
> ran "update pg_database set encoding = pg_char_to_encoding('UTF8') where
> datname = 'test';" to change the encoding.

Egad.

> In my testing, pg_upgrade breaks
> when trying to restore this since UTF8 isn't supported in en_US for the CREATE
> DATABASE command used during pg_restore:

Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?

I don't think you should use pg_upgrade here at all.  A dump/restore
is really the only way to make sure that you have validly encoded data.

However, if it's only one database out of a bunch, you could do something
like

* pg_dump that one database;
* drop said database;
* pg_upgrade everything else;
* restore that one database from dump.

            regards, tom lane



Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Egad.

My thoughts exactly.
 
Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?

No steps that I've seen from the chat history I've been able to search. I'm not sure if there was an (invalid) assumption that LATIN1 is a subset of UTF-8 or if it was done in a panic to get the import/update working years ago.
 
I don't think you should use pg_upgrade here at all.  A dump/restore
is really the only way to make sure that you have validly encoded data.

That is what I thought, and probably not what they'll want to hear given the downtime involved. Even with parallel dump/restore jobs, I imagine it will take quite a while (this first DB is almost 900GB).
 
However, if it's only one database out of a bunch, you could do something
like

* pg_dump that one database;
* drop said database;
* pg_upgrade everything else;
* restore that one database from dump.

In the case of this busy cluster, the layout is like this:

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 foooo_all | postgres | UTF8     | en_US      | en_US      |
 postgres  | postgres | LATIN1   | en_US      | en_US      | =Tc/postgres         +
           |          |          |            |            | postgres=CTc/postgres
 template0 | postgres | LATIN1   | en_US      | en_US      | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
(4 rows)


So the template1 DB was dropped and recreated with the collate and ctype as well and then set to be a template again. But I believe that was well after foooo_all was changed (and so probably no need for the template1 change). In this case, if this is the only DB in the cluster, would it make sense to just create a new one as en_US.UTF-8 and then restore a dump of foooo_all into a pre-created en_US.UTF-8 DB?

We have a few other mixed environments similar to this as well. Some have postgres and both template DBs with this same UTF8/en_US/en_US configuration.

Is logical replication an option here? If the target DB were setup as en_US.UTF-8 across the board, would logical replication safely replicate and convert the data until we could then cut over?


Thanks,
Don.
--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Tue, Mar 17, 2020 at 8:06 AM Don Seiler <don@seiler.us> wrote:
On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, in principle you could likewise manually update pg_database's
datcollate and datctype columns to say "en_US.utf8".  However, there's
a much bigger problem here --- what steps if any did this cowboy take
to ensure that the data inside the database was valid UTF8?

Is there a way to programmatically check for data that might be a problem now?

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Tom Lane
Date:
Don Seiler <don@seiler.us> writes:
> On Mon, Mar 16, 2020 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't think you should use pg_upgrade here at all.  A dump/restore
>> is really the only way to make sure that you have validly encoded data.

> That is what I thought, and probably not what they'll want to hear given
> the downtime involved. Even with parallel dump/restore jobs, I imagine it
> will take quite a while (this first DB is almost 900GB).

Yikes.  Well, if there aren't obvious operational problems, it might be
that the data is actually UTF8-clean, or almost entirely so.  Maybe you
could look at the problem as being one of validation.  In that case,
it'd be possible to consider not taking the production DB down, but just
doing a pg_dump from it and seeing if you can restore somewhere else.
If not, fix the broken data; repeat till clean.  After that you could
do pg_upgrade with a clear conscience.  I think you'll still end up
manually fixing the inconsistent datcollate/datctype settings though.

> Is logical replication an option here? If the target DB were setup as
> en_US.UTF-8 across the board, would logical replication safely replicate
> and convert the data until we could then cut over?

I think you need to make sure the data is clean first.  I doubt that
logical replication will magically fix any problems in data it's trying
to push over, and I also doubt that we have any really good answer to
what happens if a replication update fails due to bad data.

            regards, tom lane



Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Tue, Mar 17, 2020 at 8:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yikes.  Well, if there aren't obvious operational problems, it might be
that the data is actually UTF8-clean, or almost entirely so.  Maybe you
could look at the problem as being one of validation.  In that case,
it'd be possible to consider not taking the production DB down, but just
doing a pg_dump from it and seeing if you can restore somewhere else.
If not, fix the broken data; repeat till clean.  After that you could
do pg_upgrade with a clear conscience.  I think you'll still end up
manually fixing the inconsistent datcollate/datctype settings though.

For this test, would we restore into an en_US.UTF-8/UTF8 database? Then, assuming no errors (or fixing any errors until clean), we change the datcollate/datctype settings in prod and proceed with pg_upgrade (obviously after testing all of that heavily)?

What are the ramifications of changing collation like that? Should we consider rebuilding indexes ASAP after that?

Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Tom Lane
Date:
Don Seiler <don@seiler.us> writes:
> On Tue, Mar 17, 2020 at 8:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yikes.  Well, if there aren't obvious operational problems, it might be
>> that the data is actually UTF8-clean, or almost entirely so.  Maybe you
>> could look at the problem as being one of validation.

> For this test, would we restore into an en_US.UTF-8/UTF8 database? Then,
> assuming no errors (or fixing any errors until clean), we change the
> datcollate/datctype settings in prod and proceed with pg_upgrade (obviously
> after testing all of that heavily)?

Yeah, that's the basic idea.

> What are the ramifications of changing collation like that? Should we
> consider rebuilding indexes ASAP after that?

Text indexes would definitely be at risk here.  I'm not really certain
how bad the problem would be.  Do you have a feeling for how much of
the data is 100% ASCII?  If you could be sure of that for any given
column, you wouldn't have to reindex indexes on that column.

            regards, tom lane



Re: Mixed Locales and Upgrading

From
Michael Paquier
Date:
On Tue, Mar 17, 2020 at 10:45:50AM -0400, Tom Lane wrote:
> Don Seiler <don@seiler.us> writes:
>> What are the ramifications of changing collation like that? Should we
>> consider rebuilding indexes ASAP after that?
>
> Text indexes would definitely be at risk here.  I'm not really certain
> how bad the problem would be.  Do you have a feeling for how much of
> the data is 100% ASCII?  If you could be sure of that for any given
> column, you wouldn't have to reindex indexes on that column.

There is no way to know how much indexes would get broken without
having a look at it.  Anything ASCII-based should be of no problem.
If you have a doubt, reindexing evey index which includes text column
data is the best course of action in my opinion if you have any
doubts, because that's safe even if it has a higher cost.
--
Michael

Attachment

Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Tue, Mar 17, 2020 at 9:25 PM Michael Paquier <michael@paquier.xyz> wrote:

There is no way to know how much indexes would get broken without
having a look at it.  Anything ASCII-based should be of no problem.
If you have a doubt, reindexing evey index which includes text column
data is the best course of action in my opinion if you have any
doubts, because that's safe even if it has a higher cost.

Here's the fun part. A lot of the tables use UUIDv4 strings for primary keys. However these are stored in text/varchar columns.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Sun, Mar 22, 2020 at 4:48 PM Don Seiler <don@seiler.us> wrote:

Here's the fun part. A lot of the tables use UUIDv4 strings for primary keys. However these are stored in text/varchar columns.

Actually, would I need to re-index on text columns that we know contain UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII characters. 

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Tom Lane
Date:
Don Seiler <don@seiler.us> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

            regards, tom lane



Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Mon, Mar 30, 2020 at 4:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don Seiler <don@seiler.us> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

That's what I was hoping to hear. Thanks!

Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Mon, Mar 30, 2020 at 4:39 PM Don Seiler <don@seiler.us> wrote:
On Mon, Mar 30, 2020 at 4:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don Seiler <don@seiler.us> writes:
> Actually, would I need to re-index on text columns that we know contain
> UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> characters.

I think you're all right with respect to those, since they're the
same under any encoding.  It's columns containing non-ASCII characters
that you'd want to worry about reindexing.

Follow-up question, the locale setting on the host would still be set to en_US (as would the postgres and template0 databases). Should I look to change that locale on the system to en_US.UTF-8, or even just for the postgres user that the DB cluster runs as? What are the ramification for doing (or not doing) so?

Don. 

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Peter Eisentraut
Date:
On 2020-04-07 18:41, Don Seiler wrote:
> Follow-up question, the locale setting on the host would still be set to 
> en_US (as would the postgres and template0 databases). Should I look to 
> change that locale on the system to en_US.UTF-8, or even just for the 
> postgres user that the DB cluster runs as? What are the ramification for 
> doing (or not doing) so?

I think the only place where this would really matter is that psql by 
default sets the client encoding based on the current OS locale setting. 
  So if you don't change the setting to en_US.UTF-8, then you might get 
encoding errors when selecting data that is not representable as LATIN1 
or whatever.  However, if you change the setting, that doesn't mean your 
terminal setup will actually display Unicode correctly.  You said you're 
dealing with mostly ASCII-ish data anyway, so it will probably not make 
a difference.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Mixed Locales and Upgrading

From
"Peter J. Holzer"
Date:
On 2020-03-30 17:30:32 -0400, Tom Lane wrote:
> Don Seiler <don@seiler.us> writes:
> > Actually, would I need to re-index on text columns that we know contain
> > UUID strings? UUID characters seem to be pretty basic alphanumeric ASCII
> > characters.
>
> I think you're all right with respect to those, since they're the
> same under any encoding.  It's columns containing non-ASCII characters
> that you'd want to worry about reindexing.

I'm not so sure whether all locales agree on whether to sort digits
before or after letters. However there are only two possibilities and
that's easy to check.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
On Tue, Apr 7, 2020 at 11:41 AM Don Seiler <don@seiler.us> wrote:

Follow-up question, the locale setting on the host would still be set to en_US (as would the postgres and template0 databases). Should I look to change that locale on the system to en_US.UTF-8, or even just for the postgres user that the DB cluster runs as? What are the ramification for doing (or not doing) so?

One more question around the GUC settings for locale. It is currently set to this:

# select name,setting from pg_settings where name like 'lc%';
    name     | setting
-------------+---------
 lc_collate  | en_US
 lc_ctype    | en_US
 lc_messages |
 lc_monetary | C
 lc_numeric  | C
 lc_time     | C


Since I'm not changing the postgres or template0 databases (leaving those as en_US/LATIN1), do I keep lc_collate/lc_ctype as en_US? It's just the template1 and application database that I've set to en_US.UTF-8.

I'm also struggling to see how lc_messages is an empty string. It is commented out in postgresql.conf but suggests 'C' will be the default. The OS locale LC_MESSAGES is set to en_US on the primary but I also see it is set to en_US on the newer replica hosts. What value would be used for lc_messages? I'm trying to create an empty DB with these same settings but if I omit --lc-messages it uses the OS locale value, and I can't set it to an empty string.

Don.

--
Don Seiler
www.seiler.us

Re: Mixed Locales and Upgrading

From
Adrian Klaver
Date:
On 4/29/20 1:10 PM, Don Seiler wrote:
> On Tue, Apr 7, 2020 at 11:41 AM Don Seiler <don@seiler.us 
> <mailto:don@seiler.us>> wrote:
> 
> 
>     Follow-up question, the locale setting on the host would still be
>     set to en_US (as would the postgres and template0 databases). Should
>     I look to change that locale on the system to en_US.UTF-8, or even
>     just for the postgres user that the DB cluster runs as? What are the
>     ramification for doing (or not doing) so?
> 
> 
> One more question around the GUC settings for locale. It is currently 
> set to this:
> 
> # select name,setting from pg_settings where name like 'lc%';
>      name     | setting
> -------------+---------
>   lc_collate  | en_US
>   lc_ctype    | en_US
>   lc_messages |
>   lc_monetary | C
>   lc_numeric  | C
>   lc_time     | C
> 
> Since I'm not changing the postgres or template0 databases (leaving 
> those as en_US/LATIN1), do I keep lc_collate/lc_ctype as en_US? It's 
> just the template1 and application database that I've set to en_US.UTF-8.
> 
> I'm also struggling to see how lc_messages is an empty string. It is 
> commented out in postgresql.conf but suggests 'C' will be the default. 

Are you sure?:

https://www.postgresql.org/docs/12/runtime-config-client.html#RUNTIME-CONFIG-CLIENT-FORMAT
"lc_messages (string)

     Sets the language in which messages are displayed. Acceptable 
values are system-dependent; see Section 23.1 for more information. If 
this variable is set to the empty string (which is the default) then the 
value is inherited from the execution environment of the server in a 
system-dependent way. ..."

> The OS locale LC_MESSAGES is set to en_US on the primary but I also see 
> it is set to en_US on the newer replica hosts. What value would be used 
> for lc_messages? I'm trying to create an empty DB with these same 
> settings but if I omit --lc-messages it uses the OS locale value, and I 
> can't set it to an empty string.
> 
> Don.
> 
> -- 
> Don Seiler
> www.seiler.us <http://www.seiler.us>


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Mixed Locales and Upgrading

From
Tom Lane
Date:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 4/29/20 1:10 PM, Don Seiler wrote:
>> On Tue, Apr 7, 2020 at 11:41 AM Don Seiler <don@seiler.us 
>> <mailto:don@seiler.us>> wrote:
>> Since I'm not changing the postgres or template0 databases (leaving 
>> those as en_US/LATIN1), do I keep lc_collate/lc_ctype as en_US? It's 
>> just the template1 and application database that I've set to en_US.UTF-8.

I believe that those settings only affect locale-dependent processing
in the postmaster and/or background processes, which there should be
little or none of anyway.  Processes connected to specific databases will
adopt the settings defined for those databases.

>> I'm also struggling to see how lc_messages is an empty string. It is 
>> commented out in postgresql.conf but suggests 'C' will be the default. 

> Are you sure?:
>      Sets the language in which messages are displayed. Acceptable 
> values are system-dependent; see Section 23.1 for more information. If 
> this variable is set to the empty string (which is the default) then the 
> value is inherited from the execution environment of the server in a 
> system-dependent way. ..."

The "system-dependent way" is "adopt whatever the LANG/LC_foo environment
variables say at server startup", at least on non-Windows machines.
I think that C is the fallback if none of those variables are set, though.

Short answer is you shouldn't need to mess with these.

            regards, tom lane



Re: Mixed Locales and Upgrading

From
Don Seiler
Date:
Good morning,

Back with a follow-up question to all this. I'm wondering if we shouldn't also change the locale settings for postgres/template0/template1 to match our new desires setting en_US.UTF-8 with UTF8 encoding. We haven't written anything to postgres. Some of our DB clusters have changed these already (but still leaving things mixed) but I want to make things uniform for future upgrades to go a lot more smoothly.

Some examples of current DB clusters that need to be fixed. The first one has the more problematic incompatible mixing within the DBs (the original point of this thread), the others just have different settings across DBs that I'd like change going forward as well. 

      datname       | encoding | datcollate | datctype | size_mb
--------------------+----------+------------+----------+---------
 xxxxxxxxxxxxxxxxxx | UTF8     | en_US      | en_US    | 1390789
 postgres           | UTF8     | en_US      | en_US    |       6
 template0          | UTF8     | en_US      | en_US    |       6
 template1          | UTF8     | en_US      | en_US    |       6

       datname        | encoding | datcollate |  datctype  | size_mb
----------------------+----------+------------+------------+---------
 xxxxxxxxxxxxxxxxxxx  | UTF8     | en_US.utf8 | en_US.utf8 |    2178
 postgres             | LATIN1   | en_US      | en_US      |       7
 template0            | LATIN1   | en_US      | en_US      |       7
 template1            | UTF8     | en_US.utf8 | en_US.utf8 |       7

  datname  | encoding | datcollate |  datctype  | size_mb
-----------+----------+------------+------------+---------
 xxxxxx    | UTF8     | en_US.utf8 | en_US.utf8 |     345
 postgres  | LATIN1   | en_US      | en_US      |       7
 template0 | UTF8     | en_US.utf8 | en_US.utf8 |       7
 template1 | UTF8     | en_US.utf8 | en_US.utf8 |       7


For the smaller DBs I was planning to just dump/restore into a new cluster as I upgrade to PG12 as well. However two that have the problem are the two biggest ones where the downtime for dump/restore would be too great. So I'm wondering if there is risk or harm in running an UPDATE pg_database command on postgres/template0/template1 as needed and re-indexing afterward.

--
Don Seiler
www.seiler.us