Thread: Ways to change a database collation while removing duplicates
Could anyone suggest a way to change a database collation and remove all the duplicates caused by the change?

I have a pretty large database (around 500 GB) that was created with `en_US.UTF-8` collation, but the new version of the application requires that the collation be strictly `C`.

I can successfully create a dump of the old database using `pg_dump`.

But when I import the dump into the new database with `COLLATE=C`, I see a lot of errors on `ALTER TABLE` when creating primary keys, and the same for `CREATE INDEX` commands:
ALTER TABLE
ERROR:  could not create unique index "access_tokens_pkey"
DETAIL:  Key (id)=(16734) is duplicated.
ERROR:  could not create unique index "access_tokens_token_key"
DETAIL:  Key (token)=(XXX) is duplicated.
CONTEXT:  parallel worker
ERROR:  could not create unique index "account_data_uniqueness"
DETAIL:  Key (user_id, account_data_type)=(@username:XXX, im.vector.setting.breadcrumbs) is duplicated.

CREATE INDEX
ERROR:  could not create unique index "e2e_cross_signing_keys_idx"
DETAIL:  Key (user_id, keytype, stream_id)=(@-----------------:matrix.org, master, 1606172) is duplicated.
ERROR:  could not create unique index "e2e_cross_signing_keys_stream_idx"
DETAIL:  Key (stream_id)=(1779009) is duplicated.
So, could anyone give some advice on how to perform the collation change while cleaning out all the duplicates? Thanks!
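For reference, the target database described above can be created like this (a sketch; `newdb` is a placeholder name, and `template0` must be used as the template because the new locale settings differ from the default template's):

```sql
-- Create the target database with strict byte-order ("C") collation.
-- TEMPLATE template0 is required when overriding the locale settings.
CREATE DATABASE newdb
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'C';
```

The `pg_dump` output would then be restored into `newdb` with `psql` or `pg_restore` as usual.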
--
Best regards,
Alexey Murz Korepov.
E-mail: murznn@gmail.com
Messengers: Matrix - https://matrix.to/#/@murz:ru-matrix.org Telegram - @MurzNN
> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>
> Could anyone suggest to me the ways to change a database collation with
> removing all the duplicates, caused by this change?

Collations can only affect uniqueness if they are nondeterministic or if you
have functional indexes, e.g. using lower(text) for a case-insensitive unique
index. Otherwise the collations only affect text ordering.

You need to find all duplicates in the original database (or a copy) by using
the target collation for comparisons [0]. Delete all duplicates, pg_dump that
database, and restore it in a database with the new collation.

> I have a pretty large database (around 500 Gb) that was created with
> `en_US.UTF-8` collation, but the new version of the application requires
> that the collation should be strictly `C`.

Both collations should be deterministic [1], which means they consider byte
sequences when comparing strings for equality. What does pg_collation say?

select * from pg_collation where collname in ('C', 'en_US.utf8');

> I can successfully create a dump of the old database using `pgdump`.
>
> But when I'm importing the dump to the new database with `COLLATE=C`, I see
> a lot of errors on ALTER TABLE when creating primary keys, and the same -
> for `CREATE INDEX` commands:

Restore the dump in a new database with the same collation en_US.UTF-8 to rule
out that the duplicates come from collation C and do not already exist in the
original database.

> ALTER TABLE
> ERROR: could not create unique index "access_tokens_pkey"
> DETAIL: Key (id)=(16734) is duplicated.

Is column "id" an integer? If so then this looks like a corrupt index in the
original database, because collations should not affect indexes on non-text
columns.

> ERROR: could not create unique index "access_tokens_token_key"
> DETAIL: Key (token)=(XXX) is duplicated.
> CONTEXT: parallel worker
> ERROR: could not create unique index "account_data_uniqueness"
> DETAIL: Key (user_id, account_data_type)=(@username:XXX, im.vector.setting.breadcrumbs) is duplicated.
> CREATE INDEX
> ERROR: could not create unique index "e2e_cross_signing_keys_idx"
> DETAIL: Key (user_id, keytype, stream_id)=(@-----------------:matrix.org, master, 1606172) is duplicated.
> ERROR: could not create unique index "e2e_cross_signing_keys_stream_idx"
> DETAIL: Key (stream_id)=(1779009) is duplicated.
>
> So, could anyone give some advice on how to perform the collation change
> with cleaning out all the duplicates? Thanks!

[0] https://www.postgresql.org/docs/current/indexes-collations.html
[1] https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC

--
Erik
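The duplicate hunt Erik describes — comparing with the target collation in the original database, then deleting the extras — might be sketched per constraint like this. The table and column names (`access_tokens`, `token`) are taken from the error messages above; each failing unique index would need its own variant:

```sql
-- List keys that would collide under the target collation "C".
SELECT token COLLATE "C" AS key, count(*)
FROM access_tokens
GROUP BY token COLLATE "C"
HAVING count(*) > 1;

-- Keep one row per key and delete the rest, using the physical row
-- identifier (ctid) to distinguish otherwise-identical rows.
DELETE FROM access_tokens a
USING access_tokens b
WHERE a.token COLLATE "C" = b.token COLLATE "C"
  AND a.ctid > b.ctid;
```

Which of the duplicated rows to keep is an application-level decision; `ctid` ordering here is only a placeholder for that choice.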
Erik Wienhold <ewie@ewie.name> writes:
> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>> Could anyone suggest to me the ways to change a database collation with
>> removing all the duplicates, caused by this change?

> Collations can only affect uniqueness if they are nondeterministic or if you
> have functional indexes, e.g. using lower(text) for a case-insensitive unique
> index. Otherwise the collations only affect text ordering.

Yeah.  I suspect that what actually happened here was a previous change in
the host system's sort ordering (cf [1]), leading to text indexes becoming
functionally corrupt and unable to enforce uniqueness correctly, after
which you accumulated some unintentional duplicates.  If you try
reindexing on the source database you'll probably find that it fails with
the same errors.  I don't know of any automatic tools for fixing up such
duplications, and wouldn't trust one hugely anyway --- you'll probably
need manual curation of the fixes.

			regards, tom lane

[1] https://wiki.postgresql.org/wiki/Locale_data_changes
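The reindex check Tom suggests could be run on the source database roughly as follows (a sketch; the index name is taken from the errors above, and `amcheck` is an optional contrib extension that may not be installed):

```sql
-- Rebuild one suspect index; if the index is corrupt and duplicates
-- have accumulated, this should fail with the same
-- "could not create unique index" error as the restore did.
REINDEX INDEX access_tokens_token_key;

-- Alternatively, the amcheck extension can inspect a btree index
-- in place without rebuilding it.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('access_tokens_token_key');
```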
Thanks for the explanation! Yes, it seems this is not related to collations, because I now see an integer duplicate there as well — thanks for pointing that out. Maybe the restore of the database somehow started twice, which produced those duplicates, so I will try to restart the migration process from scratch.
On Sun, Mar 12, 2023 at 12:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Erik Wienhold <ewie@ewie.name> writes:
>> On 11/03/2023 19:23 CET Alexey Murz Korepov <murznn@gmail.com> wrote:
>>> Could anyone suggest to me the ways to change a database collation with
>>> removing all the duplicates, caused by this change?
>
>> Collations can only affect uniqueness if they are nondeterministic or if you
>> have functional indexes, e.g. using lower(text) for a case-insensitive unique
>> index. Otherwise the collations only affect text ordering.
>
> Yeah.  I suspect that what actually happened here was a previous change in
> the host system's sort ordering (cf [1]), leading to text indexes becoming
> functionally corrupt and unable to enforce uniqueness correctly, after
> which you accumulated some unintentional duplicates.  If you try
> reindexing on the source database you'll probably find that it fails with
> the same errors.  I don't know of any automatic tools for fixing up such
> duplications, and wouldn't trust one hugely anyway --- you'll probably
> need manual curation of the fixes.
>
> 			regards, tom lane
>
> [1] https://wiki.postgresql.org/wiki/Locale_data_changes
Best regards,
Alexey Murz Korepov.
E-mail: murznn@gmail.com
Messengers: Matrix - https://matrix.to/#/@murz:ru-matrix.org Telegram - @MurzNN