Thread: Inexplicable duplicate rows with unique constraint
I'm trying to track down the cause of some duplicate rows in a table which I would expect to be impossible due to a unique constraint. I'm hoping that somebody here will be able to suggest something I might have missed. The problem relates to a bug filed against our application (https://github.com/matrix-org/synapse/issues/6696). At first I put this down to random data corruption on a single user's postgres instance, but I've now seen three separate reports in as many days and am wondering if there is more to it. We have a table whose schema is as follows: synapse=# \d current_state_events Table "public.current_state_events" Column | Type | Modifiers ------------+------+----------- event_id | text | not null room_id | text | not null type | text | not null state_key | text | not null membership | text | Indexes: "current_state_events_event_id_key" UNIQUE CONSTRAINT, btree (event_id) "current_state_events_room_id_type_state_key_key" UNIQUE CONSTRAINT, btree (room_id, type, state_key) "current_state_events_member_index" btree (state_key) WHERE type = 'm.room.member'::text Despite the presence of the current_state_events_room_id_type_state_key_key constraint, several users have reported seeing errors which suggest that their tables have duplicate rows for the same (room_id, type, state_key) triplet and indeed querying confirms that to be the case: synapse=> select count(*), room_id, type, state_key from current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; count | room_id | type | state_key -------+-----------------------------------+---------------+------------------------------------- 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | @irc_ebi_:darkfasel.net 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | @freenode_AlmarShenwan_:matrix.org (2 rows) Further investigation suggests that these are genuinely separate rows rather than duplicate entries in an index. The index appears to consider itself valid: synapse=> select i.* from pg_class c join pg_index i on i.indexrelid=c.oid where relname='current_state_events_room_id_type_state_key_key'; indexrelid | indrelid | indnatts | indisunique | indisprimary | indisexclusion | indimmediate | indisclustered | indisvalid | indcheckxmin | indisready | indislive | indisreplident | indkey | indcollation | indclass | indoption | indexprs | indpred ------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+--------- 17023 | 16456 | 3 | t | f | f | t | f | t | f | t | t | f | 2 3 4 | 100 100 100 | 3126 3126 3126 | 0 0 0 | | (1 row) So, question: what could we be doing wrong to get ourselves into this situation? Some other datapoints which may be relevant: * this has been reported by one user on postgres 9.6.15 and one on 10.10, though it's hard to be certain of the version that was running when the duplication occurred * the constraint is added when the table is first created (before any data is added) * At least one user reports that he has recently migrated his database from one server to another via a `pg_dump -C` and later piping into psql.
On 1/16/20 8:50 AM, Richard van der Hoff wrote: > I'm trying to track down the cause of some duplicate rows in a table > which I would expect to be impossible due to a unique constraint. I'm > hoping that somebody here will be able to suggest something I might have > missed. > > The problem relates to a bug filed against our application > (https://github.com/matrix-org/synapse/issues/6696). At first I put this > down to random data corruption on a single user's postgres instance, but > I've now seen three separate reports in as many days and am wondering if > there is more to it. > > We have a table whose schema is as follows: > > synapse=# \d current_state_events > Table "public.current_state_events" > Column | Type | Modifiers > ------------+------+----------- > event_id | text | not null > room_id | text | not null > type | text | not null > state_key | text | not null > membership | text | > Indexes: > "current_state_events_event_id_key" UNIQUE CONSTRAINT, btree > (event_id) > "current_state_events_room_id_type_state_key_key" UNIQUE > CONSTRAINT, btree (room_id, type, state_key) > "current_state_events_member_index" btree (state_key) WHERE type = > 'm.room.member'::text > > Despite the presence of the > current_state_events_room_id_type_state_key_key constraint, several > users have reported seeing errors which suggest that their tables have > duplicate rows for the same (room_id, type, state_key) triplet and > indeed querying confirms that to be the case: > > synapse=> select count(*), room_id, type, state_key from > current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; > count | room_id | type | state_key > -------+-----------------------------------+---------------+------------------------------------- > > 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | > @irc_ebi_:darkfasel.net > 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | > @freenode_AlmarShenwan_:matrix.org > (2 rows) > I'm assuming the above are obfuscated? > > Further investigation suggests that these are genuinely separate rows > rather than duplicate entries in an index. If you use length() on the values are they the same? > > The index appears to consider itself valid: > > synapse=> select i.* from pg_class c join pg_index i on > i.indexrelid=c.oid where > relname='current_state_events_room_id_type_state_key_key'; > indexrelid | indrelid | indnatts | indisunique | indisprimary | > indisexclusion | indimmediate | indisclustered | indisvalid | > indcheckxmin | indisready | indislive | indisreplident | indkey | > indcollation | indclass | indoption | indexprs | indpred > ------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+--------- > > 17023 | 16456 | 3 | t | f | f > | t | f | t | f | t > | t | f | 2 3 4 | 100 100 100 | 3126 > 3126 3126 | 0 0 0 | | > (1 row) > > So, question: what could we be doing wrong to get ourselves into this > situation? > > Some other datapoints which may be relevant: > * this has been reported by one user on postgres 9.6.15 and one on > 10.10, though it's hard to be certain of the version that was running > when the duplication occurred > * the constraint is added when the table is first created (before any > data is added) > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. > > -- Adrian Klaver adrian.klaver@aklaver.com
Richard van der Hoff <richard@matrix.org> writes: > I'm trying to track down the cause of some duplicate rows in a table > which I would expect to be impossible due to a unique constraint. I'm > hoping that somebody here will be able to suggest something I might have > missed. Since these are text columns, one possibility you should be looking into is that the indexes have become corrupt due to a change in the operating system's sorting rules for the underlying locale. I don't recall details at the moment, but I do remember that a recent glibc update changed the sorting rules for some popular locale settings. If an installation had applied such an update underneath an existing database, you'd have a situation where existing entries in an index are not in-order according to the new behavior of the text comparison operators, leading to havoc because btree searching relies on the entries being correctly sorted. Unless you happen to notice searches failing to find rows you know are there, the first visible symptom is often appearance of "impossible" duplicate rows, after the search to verify uniqueness of a new entry fails to find the old entry. > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. Dump-and-restore wouldn't cause this (and, indeed, is one way to clean up the mess). But this is suspicious anyway because it suggests there may have been some general system upgrades going on in the vicinity. Reindexing all text indexes is the recommended remediation procedure if you suspect a locale behavior change. There's some work afoot to make PG notice the need for this automatically, but it's not done yet. regards, tom lane
Richard van der Hoff wrote: > So, question: what could we be doing wrong to get ourselves into this > situation? OS/libc upgrades without reindexing come to mind. See https://wiki.postgresql.org/wiki/Collations > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. This kind of migration recreates indexes (since it recreates everything) so it's not subject to the same hazard as an OS/libc upgrade without reindexing. In fact it would detect the problem if it existed, as the creation of the unique constraint in the new db would fail if the data in the dump did not satisfy it. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Richard van der Hoff <richard@matrix.org> writes: > > I'm trying to track down the cause of some duplicate rows in a table > > which I would expect to be impossible due to a unique constraint. I'm > > hoping that somebody here will be able to suggest something I might have > > missed. > > Since these are text columns, one possibility you should be looking into > is that the indexes have become corrupt due to a change in the operating > system's sorting rules for the underlying locale. I don't recall details > at the moment, but I do remember that a recent glibc update changed the > sorting rules for some popular locale settings. If an installation had > applied such an update underneath an existing database, you'd have a > situation where existing entries in an index are not in-order according > to the new behavior of the text comparison operators, leading to havoc > because btree searching relies on the entries being correctly sorted. See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on which linux distros updated when. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
On 16/01/2020 17:12, Magnus Hagander wrote: > On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Richard van der Hoff <richard@matrix.org> writes: >>> I'm trying to track down the cause of some duplicate rows in a table >>> which I would expect to be impossible due to a unique constraint. I'm >>> hoping that somebody here will be able to suggest something I might have >>> missed. >> >> Since these are text columns, one possibility you should be looking into >> is that the indexes have become corrupt due to a change in the operating >> system's sorting rules for the underlying locale. I don't recall details >> at the moment, but I do remember that a recent glibc update changed the >> sorting rules for some popular locale settings. If an installation had >> applied such an update underneath an existing database, you'd have a >> situation where existing entries in an index are not in-order according >> to the new behavior of the text comparison operators, leading to havoc >> because btree searching relies on the entries being correctly sorted. > > See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on > which linux distros updated when. > Right, thanks to all who have suggested this. It seems like a plausible explanation but it's worth noting that all the indexed data here is (despite being in text columns), plain ascii. I'm surprised that a change in collation rules would change the sorting of such strings, and hence that it could lead to this problem. Am I naive? To answer Adrian's question: the lengths of the values in the indexed columns are identical between the duplicated rows.
Richard van der Hoff wrote: > synapse=> select count(*), room_id, type, state_key from > current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; > count | room_id | type | state_key > -------+-----------------------------------+---------------+------------------------------------- > 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | > @irc_ebi_:darkfasel.net > 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | > @freenode_AlmarShenwan_:matrix.org Looking at these columns which are of type text but do not contain words of any particular language, there's probably no point in using a linguistic-aware collation for them. If you maintain the database schema, what you could do to avoid the dependency on the OS collation and stay clear of the particular upgrade difficulty of collations is to use COLLATE "C" for this kind of field, as opposed to the default collation of the database. As a bonus, operations with the "C" collations tend to be faster, sometimes even much faster. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 1/16/20 9:24 AM, Richard van der Hoff wrote: > > On 16/01/2020 17:12, Magnus Hagander wrote: >> On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>> Richard van der Hoff <richard@matrix.org> writes: >>>> I'm trying to track down the cause of some duplicate rows in a table >>>> which I would expect to be impossible due to a unique constraint. I'm >>>> hoping that somebody here will be able to suggest something I might >>>> have >>>> missed. >>> >>> Since these are text columns, one possibility you should be looking into >>> is that the indexes have become corrupt due to a change in the operating >>> system's sorting rules for the underlying locale. I don't recall >>> details >>> at the moment, but I do remember that a recent glibc update changed the >>> sorting rules for some popular locale settings. If an installation had >>> applied such an update underneath an existing database, you'd have a >>> situation where existing entries in an index are not in-order according >>> to the new behavior of the text comparison operators, leading to havoc >>> because btree searching relies on the entries being correctly sorted. >> >> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >> which linux distros updated when. >> > Right, thanks to all who have suggested this. > > It seems like a plausible explanation but it's worth noting that all the > indexed data here is (despite being in text columns), plain ascii. I'm > surprised that a change in collation rules would change the sorting of > such strings, and hence that it could lead to this problem. Am I naive? In psql who does: \l the_database_name show? > > To answer Adrian's question: the lengths of the values in the indexed > columns are identical between the duplicated rows. > > -- Adrian Klaver adrian.klaver@aklaver.com
On 16/01/2020 17:24, Daniel Verite wrote: > Looking at these columns which are of type text but do not > contain words of any particular language, there's probably > no point in using a linguistic-aware collation for them. > > If you maintain the database schema, what you could do to avoid > the dependency on the OS collation and stay clear of the particular > upgrade difficulty of collations is to use COLLATE "C" for this kind of > field, as opposed to the default collation of the database. > As a bonus, operations with the "C" collations tend to be faster, > sometimes even much faster. Good to know, thanks Daniel!
On 16/01/2020 17:27, Adrian Klaver wrote: > On 1/16/20 9:24 AM, Richard van der Hoff wrote: >> >> It seems like a plausible explanation but it's worth noting that all >> the indexed data here is (despite being in text columns), plain ascii. >> I'm surprised that a change in collation rules would change the >> sorting of such strings, and hence that it could lead to this problem. >> Am I naive? > > In psql who does: > > \l the_database_name > > show? synapse=> \l synapse List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges ---------+----------+----------+-------------+-------------+----------------------- synapse | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | postgres=CTc/postgres+ | | | | | =Tc/postgres + | | | | | synapse=CTc/postgres (1 row)
Richard van der Hoff <richard@matrix.org> writes: > On 16/01/2020 17:12, Magnus Hagander wrote: >> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >> which linux distros updated when. > It seems like a plausible explanation but it's worth noting that all the > indexed data here is (despite being in text columns), plain ascii. I'm > surprised that a change in collation rules would change the sorting of > such strings, and hence that it could lead to this problem. Am I naive? Unfortunately, strings containing punctuation do sort differently after these changes, even with all-ASCII data. The example given on that wiki page demonstrates this. RHEL6 (old glibc): $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort 11 1-1 Fedora 30 (new glibc): $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort 1-1 11 I concur with Daniel's suggestion that maybe "C" locale is the thing to use for this data. regards, tom lane
On 16/01/2020 17:48, Tom Lane wrote: > Richard van der Hoff <richard@matrix.org> writes: >> On 16/01/2020 17:12, Magnus Hagander wrote: >>> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >>> which linux distros updated when. > >> It seems like a plausible explanation but it's worth noting that all the >> indexed data here is (despite being in text columns), plain ascii. I'm >> surprised that a change in collation rules would change the sorting of >> such strings, and hence that it could lead to this problem. Am I naive? > > Unfortunately, strings containing punctuation do sort differently > after these changes, even with all-ASCII data. The example given > on that wiki page demonstrates this. Thank you Tom. I should learn to read properly. I think this very much explains the symptoms we're seeing. Thanks very much for the detailed and very helpful answers!
That's why I created a virtual_string function to squeeze out everything but alpha characters and numbers 0-9 from any varchar or text columns that I want to use as business key columns. For example, if I have a column named job_name, I will have a companion column named v_job_name. The v_ column is to replicate Oracle's virtual column, since postgres doesn't have it. You don't put any values in the v_ column directly. I simply have a trigger on insert or update to put the value in the v_job_name column using the virtual_string(new.job_name) function. It's the v_job_name column that use in my unique constraint so that I avoid any unexpected sorting. Meanwhile, my job_name column is still human readable with whatever characters I want to see, including diacritics. Here is my function, if you want to try it out: create or replace function store.virtual_string(string_in text) returns text as $body$ declare l_return text; begin l_return := regexp_replace (lower(unaccent(string_in)),'[^0-9a-z]','','g'); return l_return; end; $body$ language plpgsql volatile security definer ; Sue --- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Susan E Hurst Principal Consultant Brookhurst Data LLC Email: susan.hurst@brookhurstdata.com Mobile: 314-486-3261 On 2020-01-16 11:48, Tom Lane wrote: > Richard van der Hoff <richard@matrix.org> writes: >> On 16/01/2020 17:12, Magnus Hagander wrote: >>> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >>> which linux distros updated when. > >> It seems like a plausible explanation but it's worth noting that all >> the >> indexed data here is (despite being in text columns), plain ascii. I'm >> surprised that a change in collation rules would change the sorting of >> such strings, and hence that it could lead to this problem. Am I >> naive? > > Unfortunately, strings containing punctuation do sort differently > after these changes, even with all-ASCII data. The example given > on that wiki page demonstrates this. > > RHEL6 (old glibc): > > $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort > 11 > 1-1 > > Fedora 30 (new glibc): > > $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort > 1-1 > 11 > > I concur with Daniel's suggestion that maybe "C" locale is > the thing to use for this data. > > regards, tom lane
On Thu, 2020-01-16 at 16:50 +0000, Richard van der Hoff wrote: > I'm trying to track down the cause of some duplicate rows in a table > which I would expect to be impossible due to a unique constraint. I'm > hoping that somebody here will be able to suggest something I might have > missed. > > The problem relates to a bug filed against our application > (https://github.com/matrix-org/synapse/issues/6696). At first I put this > down to random data corruption on a single user's postgres instance, but > I've now seen three separate reports in as many days and am wondering if > there is more to it. [...] > So, question: what could we be doing wrong to get ourselves into this > situation? > > Some other datapoints which may be relevant: > * this has been reported by one user on postgres 9.6.15 and one on > 10.10, though it's hard to be certain of the version that was running > when the duplication occurred > * the constraint is added when the table is first created (before any > data is added) > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. I see no hint that this may be the problem, but I have seen corruption like this because of changes in the collations of the C library (which PostgreSQL uses). This only happens with collations other than C, and it cannot be caused by dump/restore. It may, however, be caused by the following: - Upgrading the operating system where PostgreSQL is running to a different glibc. - Streaming replication between machines with different glibc version, and failing over to the standby. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com