Thread: Ways to change a database collation while removing duplicates
Could anyone suggest a way to change a database collation and remove all the duplicates caused by the change?

I have a pretty large database (around 500 GB) that was created with `en_US.UTF-8` collation, but the new version of the application requires that the collation be strictly `C`.

I can successfully create a dump of the old database using `pg_dump`.

But when I import the dump into the new database with `COLLATE=C`, I see a lot of errors on `ALTER TABLE` when creating primary keys, and the same for `CREATE INDEX` commands:
ALTER TABLE
ERROR:  could not create unique index "access_tokens_pkey"
DETAIL:  Key (id)=(16734) is duplicated.
ERROR:  could not create unique index "access_tokens_token_key"
DETAIL:  Key (token)=(XXX) is duplicated.
CONTEXT:  parallel worker
ERROR:  could not create unique index "account_data_uniqueness"
DETAIL:  Key (user_id, account_data_type)=(@username:XXX, im.vector.setting.breadcrumbs) is duplicated.

CREATE INDEX
ERROR:  could not create unique index "e2e_cross_signing_keys_idx"
DETAIL:  Key (user_id, keytype, stream_id)=(@-----------------:matrix.org, master, 1606172) is duplicated.
ERROR:  could not create unique index "e2e_cross_signing_keys_stream_idx"
DETAIL:  Key (stream_id)=(1779009) is duplicated.
So, could anyone give some advice on how to perform the collation change while cleaning out all the duplicates? Thanks!
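For reference, the target database described above can be created like this (a sketch; `newdb` is a placeholder name, and `template0` must be used as the template because the new locale settings differ from the default template's):

```sql
-- Create the target database with strict byte-order ("C") collation.
-- TEMPLATE template0 is required when overriding the locale settings.
CREATE DATABASE newdb
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'C';
```

The `pg_dump` output would then be restored into `newdb` with `psql` or `pg_restore` as usual.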
--
Best regards,
Alexey Murz Korepov.
E-mail: murznn@gmail.com
Messengers: Matrix - https://matrix.to/#/@murz:ru-matrix.org Telegram - @MurzNN
> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>
> Could anyone suggest to me the ways to change a database collation with
> removing all the duplicates, caused by this change?

Collations can only affect uniqueness if they are nondeterministic or if you
have functional indexes, e.g. using lower(text) for a case-insensitive unique
index. Otherwise the collations only affect text ordering.

You need to find all duplicates in the original database (or a copy) by using
the target collation for comparisons [0]. Delete all duplicates, pg_dump that
database, and restore it in a database with the new collation.

> I have a pretty large database (around 500 Gb) that was created with
> `en_US.UTF-8` collation, but the new version of the application requires
> that the collation should be strictly `C`.

Both collations should be deterministic [1], which means they consider byte
sequences when comparing strings for equality. What does pg_collation say?

select * from pg_collation where collname in ('C', 'en_US.utf8');

> I can successfully create a dump of the old database using `pgdump`.
>
> But when I'm importing the dump to the new database with `COLLATE=C`, I see
> a lot of errors on ALTER TABLE when creating primary keys, and the same -
> for `CREATE INDEX` commands:

Restore the dump in a new database with the same collation en_US.UTF-8 to rule
out that the duplicates come from collation C and do not already exist in the
original database.

> ALTER TABLE
> ERROR: could not create unique index "access_tokens_pkey"
> DETAIL: Key (id)=(16734) is duplicated.

Is column "id" an integer? If so then this looks like a corrupt index in the
original database, because collations should not affect indexes on non-text
columns.

> ERROR: could not create unique index "access_tokens_token_key"
> DETAIL: Key (token)=(XXX) is duplicated.
> CONTEXT: parallel worker
> ERROR: could not create unique index "account_data_uniqueness"
> DETAIL: Key (user_id, account_data_type)=(@username:XXX, im.vector.setting.breadcrumbs) is duplicated.
> CREATE INDEX
> ERROR: could not create unique index "e2e_cross_signing_keys_idx"
> DETAIL: Key (user_id, keytype, stream_id)=(@-----------------:matrix.org, master, 1606172) is duplicated.
> ERROR: could not create unique index "e2e_cross_signing_keys_stream_idx"
> DETAIL: Key (stream_id)=(1779009) is duplicated.
>
> So, could anyone give some advice on how to perform the collation change
> with cleaning out all the duplicates? Thanks!

[0] https://www.postgresql.org/docs/current/indexes-collations.html
[1] https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC

--
Erik
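The duplicate hunt Erik describes — comparing with the target collation in the original database, then deleting the extras — might be sketched per constraint like this. The table and column names (`access_tokens`, `token`) are taken from the error messages above; each failing unique index would need its own variant:

```sql
-- List keys that would collide under the target collation "C".
SELECT token COLLATE "C" AS key, count(*)
FROM access_tokens
GROUP BY token COLLATE "C"
HAVING count(*) > 1;

-- Keep one row per key and delete the rest, using the physical row
-- identifier (ctid) to distinguish otherwise-identical rows.
DELETE FROM access_tokens a
USING access_tokens b
WHERE a.token COLLATE "C" = b.token COLLATE "C"
  AND a.ctid > b.ctid;
```

Which of the duplicated rows to keep is an application-level decision; `ctid` ordering here is only a placeholder for that choice.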
Erik Wienhold <ewie@ewie.name> writes:
> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>> Could anyone suggest to me the ways to change a database collation with
>> removing all the duplicates, caused by this change?

> Collations can only affect uniqueness if they are nondeterministic or if you
> have functional indexes, e.g. using lower(text) for a case-insensitive unique
> index. Otherwise the collations only affect text ordering.

Yeah.  I suspect that what actually happened here was a previous change in
the host system's sort ordering (cf [1]), leading to text indexes becoming
functionally corrupt and unable to enforce uniqueness correctly, after
which you accumulated some unintentional duplicates.  If you try
reindexing on the source database you'll probably find that it fails with
the same errors.  I don't know of any automatic tools for fixing up such
duplications, and wouldn't trust one hugely anyway --- you'll probably
need manual curation of the fixes.

			regards, tom lane

[1] https://wiki.postgresql.org/wiki/Locale_data_changes
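The reindex check Tom suggests could be run on the source database roughly as follows (a sketch; the index name is taken from the errors above, and `amcheck` is an optional contrib extension that may not be installed):

```sql
-- Rebuild one suspect index; if the index is corrupt and duplicates
-- have accumulated, this should fail with the same
-- "could not create unique index" error as the restore did.
REINDEX INDEX access_tokens_token_key;

-- Alternatively, the amcheck extension can inspect a btree index
-- in place without rebuilding it.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('access_tokens_token_key');
```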
Thanks for the explanation! Yes, it seems this is not related to collations, because I now see an integer duplicate there as well — thanks for pointing that out. Maybe the restore of the database somehow started twice, which produced those duplicates, so I will try to restart the migration process from scratch.
On Sun, Mar 12, 2023 at 12:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Erik Wienhold <ewie@ewie.name> writes:
>> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>>> Could anyone suggest to me the ways to change a database collation with
>>> removing all the duplicates, caused by this change?
>
>> Collations can only affect uniqueness if they are nondeterministic or if you
>> have functional indexes, e.g. using lower(text) for a case-insensitive unique
>> index. Otherwise the collations only affect text ordering.
>
> Yeah.  I suspect that what actually happened here was a previous change in
> the host system's sort ordering (cf [1]), leading to text indexes becoming
> functionally corrupt and unable to enforce uniqueness correctly, after
> which you accumulated some unintentional duplicates.  If you try
> reindexing on the source database you'll probably find that it fails with
> the same errors.  I don't know of any automatic tools for fixing up such
> duplications, and wouldn't trust one hugely anyway --- you'll probably
> need manual curation of the fixes.
>
> 			regards, tom lane
>
> [1] https://wiki.postgresql.org/wiki/Locale_data_changes
Best regards,
Alexey Murz Korepov.
E-mail: murznn@gmail.com
Messengers: Matrix - https://matrix.to/#/@murz:ru-matrix.org Telegram - @MurzNN