Thread: [GENERAL] Changing collate & ctype for an existing database

[GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

10 July 2017, 19:21:41

Hi there. We have a working database that was unfortunately created by
initdb with default ("C") collation & ctype. All other locale specific
settings have the value en_US.UTF-8 in postgresql.conf. The database
itself is multilingual and all its data is stored in UTF-8. Sorting
doesn't work correctly, though. To fix that, can I just do this:


update pg_database set datcollate='en_US.UTF-8', datctype='en_US.UTF-8'
where datname='mydb';


This does seem to work on a testing copy of the database, i.e. select
lower('БлаБлаБла') now works correctly when connected to that database.


Is there still any chance for corrupting data by doing this, or indexes
stopping working etc?

p.s.: postgres 9.6.3

Thanks.

Re: [GENERAL] Changing collate & ctype for an existing database

From

Albe Laurenz

Date:

12 July 2017, 12:54:43

rihad wrote:
> Hi there. We have a working database that was unfortunately created by
> initdb with default ("C") collation & ctype. All other locale specific
> settings have the value en_US.UTF-8 in postgresql.conf. The database
> itself is multilingual and all its data is stored in UTF-8. Sorting
> doesn't work correctly, though. To fix that, can I just do this:
> 
> 
> update pg_database set datcollate='en_US.UTF-8', datctype='en_US.UTF-8'
> where datname='mydb';
> 
> 
> This does seem to work on a testing copy of the database, i.e. select
> lower('БлаБлаБла') now works correctly when connected to that database.
> 
> 
> Is there still any chance for corrupting data by doing this, or indexes
> stopping working etc?
> 
> p.s.: postgres 9.6.3

As explained, yes.  Indexes on string columns will be corrupted.

See this example:

test=# CREATE DATABASE breakme LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
test=# \c breakme
breakme=# CREATE TABLE sort(id integer PRIMARY KEY, val text NOT NULL);
breakme=# INSERT INTO sort VALUES (1, 'LITTLE'), (2, 'big'), (3, 'b-less');
breakme=# CREATE INDEX ON sort(val);
breakme=# SET enable_seqscan=off;  -- force index use
breakme=# SELECT * FROM sort ORDER BY val;
┌────┬────────┐
│ id │  val   │
├────┼────────┤
│  1 │ LITTLE │
│  3 │ b-less │
│  2 │ big    │
└────┴────────┘
(3 rows)

breakme=# UPDATE pg_database SET datcollate='en_US.UTF-8', datctype='en_US.UTF-8' WHERE datname='breakme';
breakme=# \c breakme
breakme=# SET enable_seqscan=off;  -- force index use
breakme=# SELECT * FROM sort ORDER BY val;
┌────┬────────┐
│ id │  val   │
├────┼────────┤
│  1 │ LITTLE │
│  3 │ b-less │
│  2 │ big    │
└────┴────────┘
(3 rows)

breakme=# SET enable_seqscan=on;  -- this and the following force sequential scan
breakme=# SET enable_bitmapscan=off;
breakme=# SET enable_indexscan=off;
breakme=# SET enable_indexonlyscan=off;
breakme=# SELECT * FROM sort ORDER BY val;  -- this returns the correct order
┌────┬────────┐
│ id │  val   │
├────┼────────┤
│  2 │ big    │
│  3 │ b-less │
│  1 │ LITTLE │
└────┴────────┘
(3 rows)

As you see, your index is still sorted according to the C collation
and scanning it returns wrong results.

Yours,
Laurenz Albe

Re: [GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

12 July 2017, 19:56:13

On 07/12/2017 01:54 PM, Albe Laurenz wrote:
> rihad wrote:
>> Hi there. We have a working database that was unfortunately created by
>> initdb with default ("C") collation & ctype. All other locale specific
>> settings have the value en_US.UTF-8 in postgresql.conf. The database
>> itself is multilingual and all its data is stored in UTF-8. Sorting
>> doesn't work correctly, though. To fix that, can I just do this:
>>
>>
>> update pg_database set datcollate='en_US.UTF-8', datctype='en_US.UTF-8'
>> where datname='mydb';
>>
>>
>> This does seem to work on a testing copy of the database, i.e. select
>> lower('БлаБлаБла') now works correctly when connected to that database.
>>
>>
>> Is there still any chance for corrupting data by doing this, or indexes
>> stopping working etc?
>>
>> p.s.: postgres 9.6.3
> As explained, yes.  Indexes on string columns will be corrupted.
>
> See this example:
>
> test=# CREATE DATABASE breakme LC_COLLATE "C" LC_CTYPE "C" TEMPLATE template0;
> test=# \c breakme
> breakme=# CREATE TABLE sort(id integer PRIMARY KEY, val text NOT NULL);
> breakme=# INSERT INTO sort VALUES (1, 'LITTLE'), (2, 'big'), (3, 'b-less');
> breakme=# CREATE INDEX ON sort(val);
> breakme=# SET enable_seqscan=off;  -- force index use
> breakme=# SELECT * FROM sort ORDER BY val;
> ┌────┬────────┐
> │ id │  val   │
> ├────┼────────┤
> │  1 │ LITTLE │
> │  3 │ b-less │
> │  2 │ big    │
> └────┴────────┘
> (3 rows)
>
> breakme=# UPDATE pg_database SET datcollate='en_US.UTF-8', datctype='en_US.UTF-8' WHERE datname='breakme';
> breakme=# \c breakme
> breakme=# SET enable_seqscan=off;  -- force index use
> breakme=# SELECT * FROM sort ORDER BY val;
> ┌────┬────────┐
> │ id │  val   │
> ├────┼────────┤
> │  1 │ LITTLE │
> │  3 │ b-less │
> │  2 │ big    │
> └────┴────────┘
> (3 rows)
>
> breakme=# SET enable_seqscan=on;  -- this and the following force sequential scan
> breakme=# SET enable_bitmapscan=off;
> breakme=# SET enable_indexscan=off;
> breakme=# SET enable_indexonlyscan=off;
> breakme=# SELECT * FROM sort ORDER BY val;  -- this returns the correct order
> ┌────┬────────┐
> │ id │  val   │
> ├────┼────────┤
> │  2 │ big    │
> │  3 │ b-less │
> │  1 │ LITTLE │
> └────┴────────┘
> (3 rows)
>
> As you see, your index is still sorted according to the C collation
> and scanning it returns wrong results.
>
> Yours,
> Laurenz Albe

This ordering issue can certainly be classified as an inconsistency, but
nothing to lose sleep over. Is this all that is normally meant when
saying "index corruption"? What about updating or deleting the wrong row
addressed by the textual index that hasn't been rebuilt after
datcollate/datctype change, complete table/database corruption, or other
scary night-time stories of this kind? Possible?

Re: [GENERAL] Changing collate & ctype for an existing database

From

Tom Lane

Date:

12 July 2017, 20:31:23

rihad <rihad@mail.ru> writes:
> On 07/12/2017 01:54 PM, Albe Laurenz wrote:
>> As you see, your index is still sorted according to the C collation
>> and scanning it returns wrong results.

> This ordering issue can certainly be classified as an inconsistency, but
> nothing to lose sleep over. Is this all that is normally meant when
> saying "index corruption"?

Laurenz neglected to point out that if the index isn't sorted the way that
the system assumes it is, then searches may fail to find values that are
present (due to descending into the wrong subtree), and by the same token
insertions may fail to enforce uniqueness.  That's pretty corrupt in
my book.

            regards, tom lane

Re: [GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

12 July 2017, 20:41:38

On 07/12/2017 09:31 PM, Tom Lane wrote:
> rihad <rihad@mail.ru> writes:
>> On 07/12/2017 01:54 PM, Albe Laurenz wrote:
>>> As you see, your index is still sorted according to the C collation
>>> and scanning it returns wrong results.
>> This ordering issue can certainly be classified as an inconsistency, but
>> nothing to lose sleep over. Is this all that is normally meant when
>> saying "index corruption"?
> Laurenz neglected to point out that if the index isn't sorted the way that
> the system assumes it is, then searches may fail to find values that are
> present (due to descending into the wrong subtree), and by the same token
> insertions may fail to enforce uniqueness.  That's pretty corrupt in
> my book.
>
>             regards, tom lane
>
Wow. It sure is.

Re: [GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

12 July 2017, 20:48:08

On 07/12/2017 09:31 PM, Tom Lane wrote:
> rihad <rihad@mail.ru> writes:
>> On 07/12/2017 01:54 PM, Albe Laurenz wrote:
>>> As you see, your index is still sorted according to the C collation
>>> and scanning it returns wrong results.
>> This ordering issue can certainly be classified as an inconsistency, but
>> nothing to lose sleep over. Is this all that is normally meant when
>> saying "index corruption"?
> Laurenz neglected to point out that if the index isn't sorted the way that
> the system assumes it is, then searches may fail to find values that are
> present (due to descending into the wrong subtree), and by the same token
> insertions may fail to enforce uniqueness.  That's pretty corrupt in
> my book.
>
>             regards, tom lane
>
What if only English letters are used in the textual indices (ascii
0-127), would they still be impacted after datctype&datcollate
"C"->"en_US.UTF-8" change? Encoding has always been UTF8, btw.


postgres=# \l
                                   List of databases
    Name    |  Owner   | Encoding |   Collate   |    Ctype    | Access
privileges
-----------+----------+----------+-------------+-------------+-----------------------
  mydb    | myuser   | UTF8     | C           | C           |

Re: [GENERAL] Changing collate & ctype for an existing database

From

Tom Lane

Date:

12 July 2017, 22:25:56

rihad <rihad@mail.ru> writes:
> What if only English letters are used in the textual indices (ascii
> 0-127), would they still be impacted after datctype&datcollate
> "C"->"en_US.UTF-8" change?

Yes, as even minimal testing would have told you.  C sort order is
more case-sensitive, for instance.

            regards, tom lane

Re: [GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

13 July 2017, 08:41:42

On 07/12/2017 11:25 PM, Tom Lane wrote:
> rihad <rihad@mail.ru> writes:
>> What if only English letters are used in the textual indices (ascii
>> 0-127), would they still be impacted after datctype&datcollate
>> "C"->"en_US.UTF-8" change?
> Yes, as even minimal testing would have told you.  C sort order is
> more case-sensitive, for instance.
>
>             regards, tom lane
> .
>
Thanks. It would be great if initdb didn't assume an implicit encoding,
to prevent such fundamental configuration mistakes in the future. More
often than not collation/ctype settings of an ssh login session used to
run initdb aren't what must be used to set up the cluster. It'd be great
if initdb didn't go any further if not provided with an explicit
encoding. The error message would require the user to think twice before
proceeding, and to read up on the matter. Explicit is better than
implicit, as the old saying goes :)

Re: [GENERAL] Changing collate & ctype for an existing database

From

rihad

Date:

13 July 2017, 16:37:25

On 07/12/2017 11:25 PM, Tom Lane wrote:
> rihad <rihad@mail.ru> writes:
>> What if only English letters are used in the textual indices (ascii
>> 0-127), would they still be impacted after datctype&datcollate
>> "C"->"en_US.UTF-8" change?
> Yes, as even minimal testing would have told you.  C sort order is
> more case-sensitive, for instance.
>
>             regards, tom lane
> .
>
Btw, can we wrap the update of datcollate &datctype and rebuilding of
textual indices inside a transaction with effectively 0 downtime?

Re: [GENERAL] Changing collate & ctype for an existing database

From

Albe Laurenz

Date:

14 July 2017, 12:41:12

rihad wrote:
> Btw, can we wrap the update of datcollate &datctype and rebuilding of
> textual indices inside a transaction with effectively 0 downtime?

No.  Building indexes takes time and will lock the tables
until the transaction is done.

Yours,
Laurenz Albe