Thread: BUG #11141: Duplicate primary key values corruption
The following bug has been logged on the website:

Bug reference:      11141
Logged by:          Gerd Behrmann
Email address:      behrmann@ndgf.org
PostgreSQL version: 9.3.4
Operating system:   Ubuntu 14.04
Description:

We have observed two rows with the same primary key in one of our
production databases. We don't know how to reproduce this problem, but
given that this looks like a corruption bug, I figure you will want to
know about this.

The table description is:

spacemanager=# \d srmlinkgroup
                                        Table "public.srmlinkgroup"
        Column         |           Type           |                         Modifiers
-----------------------+--------------------------+-----------------------------------------------------------
 id                    | bigint                   | not null default nextval('srmlinkgroup_id_seq'::regclass)
 name                  | character varying(32672) | not null
 availablespaceinbytes | bigint                   | not null
 lastupdatetime        | bigint                   |
 onlineallowed         | integer                  |
 nearlineallowed       | integer                  |
 replicaallowed        | integer                  |
 outputallowed         | integer                  |
 custodialallowed      | integer                  |
 reservedspaceinbytes  | bigint                   | not null
Indexes:
    "srmlinkgroup_pkey" PRIMARY KEY, btree (id)
    "srmlinkgroup_name_unique" UNIQUE CONSTRAINT, btree (name)
Referenced by:
    TABLE "srmlinkgroupvos" CONSTRAINT "fk_srmlinkgroupvos_l" FOREIGN KEY (linkgroupid) REFERENCES srmlinkgroup(id) ON DELETE RESTRICT
    TABLE "srmspace" CONSTRAINT "fk_srmspace_l" FOREIGN KEY (linkgroupid) REFERENCES srmlinkgroup(id)

Note the primary key on the id field and the unique constraint on the
name field. We now have this content in the table:

spacemanager=# select * from srmlinkgroup;
    id    |          name          | availablespaceinbytes | lastupdatetime | onlineallowed | nearlineallowed | replicaallowed | outputallowed | custodialallowed | reservedspaceinbytes
----------+------------------------+-----------------------+----------------+---------------+-----------------+----------------+---------------+------------------+----------------------
        4 | ops-group              |       193697853978554 |  1404265701756 |             1 |               1 |              1 |             1 |                1 |           8609513666
        1 | atlas-disk-group       |       227082050145530 |  1407745149142 |             1 |               0 |              1 |             0 |                0 |      516957429039712
        3 | alice-disk-group       |       489413225195319 |  1407745149142 |             1 |               0 |              1 |             0 |                0 |                    0
        4 | ops-group              |       169808990342525 |  1407745149142 |             1 |               1 |              1 |             1 |                1 |             -1395777
 30041585 | atlas-disk-dk-group    |         1018807348216 |  1407745149142 |             1 |               0 |              1 |             0 |                0 |       20257443592787
        2 | alice-tape-group       |                     0 |  1407745149142 |             0 |               1 |              0 |             0 |                1 |                    0
  6021392 | biogrid-disk-group     |                     0 |  1407745149142 |             1 |               0 |              1 |             0 |                0 |                    0
        0 | atlas-tape-write-group |        -9137070489748 |  1407745149142 |             0 |               1 |              0 |             0 |                1 |       21228972097684
(8 rows)

Note how we got two rows with the key value 4, and both have the same name.

Just to verify we didn't misunderstand some corner-case semantics somewhere,
we tried to dump the database and restore it under a different name. That
resulted in

psql:/tmp/spacemanager.sql:31528493: ERROR:  there is no unique constraint matching given keys for referenced table "srmlinkgroup"
ALTER TABLE
psql:/tmp/spacemanager.sql:31528509: ERROR:  there is no unique constraint matching given keys for referenced table "srmlinkgroup"
ALTER TABLE

errors as expected.

How could we possibly have ended up with two rows with the same primary key?
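(For reference, a minimal sketch of how such a duplicate can be confirmed and
located on disk; only the table and column names from the report above are
assumed, everything else is illustrative.)

-- Which key values occur more than once?
SELECT id, count(*)
  FROM srmlinkgroup
 GROUP BY id
HAVING count(*) > 1;

-- Physical address (ctid) and visibility information of every copy of one of
-- those key values.  Note: if the planner picks an index scan here, a copy
-- that is missing from the index will not show up; setting enable_indexscan,
-- enable_indexonlyscan and enable_bitmapscan to off forces a sequential scan
-- that sees every heap copy.
SELECT ctid, xmin, xmax, id, name
  FROM srmlinkgroup
 WHERE id = 4;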
behrmann@ndgf.org wrote:

> We have observed two rows with the same primary key in one of our production
> databases. We don't know how to reproduce this problem, but given that this
> looks like a corruption bug, I figure you will want to know about this.

Can you please install extension pageinspect, then grab the TIDs of both
copies (select ctid from srmlinkgroup where id = ..), then post the
output of:

select *
  from heap_page_items(get_raw_page('srmlinkgroup', BLKNO))
 where lp = (OFFNO)

where the BLKNO value is the first half of the TID, and OFFNO is the
second half of the TID, for each of those tuples?

Please attach the results in text files rather than pasting in the
email, as whitespace is mangled by your mail program.

Thanks

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
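(Concretely, the recipe above looks something like the following; the ctid
shown is a hypothetical example, not a value from this report.)

-- pageinspect ships with PostgreSQL as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The ctid of each copy is its physical address: (block number, line pointer).
SELECT ctid FROM srmlinkgroup WHERE id = 4;

-- Suppose one copy is at ctid (3,7): block 3, line pointer 7.
SELECT *
  FROM heap_page_items(get_raw_page('srmlinkgroup', 3))
 WHERE lp = 7;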
On 12/08/14 22.42, Alvaro Herrera wrote:
> behrmann@ndgf.org wrote:
>
>> We have observed two rows with the same primary key in one of our production
>> databases. We don't know how to reproduce this problem, but given that this
>> looks like a corruption bug, I figure you will want to know about this.
>
> Can you please install extension pageinspect, then grab the TIDs of both
> copies (select ctid from srmlinkgroup where id = ..), then post the
> output of:
>
> select *
>   from heap_page_items(get_raw_page('srmlinkgroup', BLKNO))
>  where lp = (OFFNO)
>
> where the BLKNO value is the first half of the TID, and OFFNO is the
> second half of the TID, for each of those tuples?
>
> Please attach the results in text files rather than pasting in the
> email, as whitespace is mangled by your mail program.

Please find the result of the above query attached.

I can add that filtering by the id or name fields only gives me one of the
rows (the one corresponding to the first row in the attached file). I guess
that means only that row is linked from the indexes.

Thanks for your support,

/gerd

--
Acting NT1 Area Coordinator, NeIC
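(That guess can be checked directly, since pageinspect also has btree
inspection functions. A sketch, assuming the tiny index fits in a single
root/leaf page: block 0 is the metapage, so the first leaf is typically
block 1; bt_metap() reports the actual root and level.)

SELECT * FROM bt_metap('srmlinkgroup_pkey');

-- List the heap TIDs that the primary key index actually contains; if only
-- one of the two copies appears here, only that copy is reachable through
-- the index.
SELECT itemoffset, ctid
  FROM bt_page_items('srmlinkgroup_pkey', 1);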
Gerd Behrmann wrote:

>  lp | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
> ----+--------+----------+--------+-----------+--------+----------+--------+-------------+------------+--------+--------+-------
>   5 |   7992 |        1 |     96 | 541168217 |      0 |        3 | (21,5) |       32778 |      10498 |     24 |        |
> (1 row)
>
>  lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
> ----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
>  62 |   8096 |        1 |     96 |      2 |      0 |        4 | (5,62) |       32778 |      10498 |     24 |        |
> (1 row)

So t_infomask is 0x2902, or
HEAP_UPDATED | HEAP_XMAX_INVALID | HEAP_XMIN_COMMITTED | HEAP_HASVARWIDTH

Note both tuples have the same t_infomask. Andres Freund suggests these
might be two updated versions from a common "ancestor" tuple. I don't
have reason to think different, except that t_xmin in one of them is
frozen and so it's probably considerably older than the other one (if
enough time has passed to have one of them frozen, then why didn't you
detect this earlier?)

Anyway this might be fixed in 9.3.5, per the commit below. I suggest
you upgrade to that one, remove one of the copies, and verify other
tables for duplicates.

commit c0bd128c81c2b23a1cbc53305180fca51b3b61c3
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Thu Apr 24 15:41:55 2014 -0300

    Fix race when updating a tuple concurrently locked by another process

    If a tuple is locked, and this lock is later upgraded either to an
    update or to a stronger lock, and in the meantime some other process
    tries to lock, update or delete the same tuple, it (the tuple) could end
    up being updated twice, or having conflicting locks held.

    The reason for this is that the second updater checks for a change in
    Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
    the first lock; and if there's a change, it restarts and re-evaluates
    its ability to update the tuple.  But it neglected to check for changes
    in lock strength or in lock-vs-update status when those two properties
    stayed the same.  This would lead it to take the wrong decision and
    continue with its own update, when in reality it shouldn't do so but
    instead restart from the top.

    This could lead to either an assertion failure much later (when a
    multixact containing multiple updates is detected), or duplicate copies
    of tuples.

    To fix, make sure to compare the other relevant infomask bits alongside
    the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
    necessary.

    Also, in the belt-and-suspenders spirit, add a check to
    MultiXactCreateFromMembers that a multixact being created does not have
    two or more members that are claimed to be updates.  This should protect
    against other bugs that might cause similar bogus situations.

    Backpatch to 9.3, where the possibility of multixacts containing updates
    was introduced.  (In prior versions it was possible to have the tuple
    lock upgraded from shared to exclusive, and an update would not restart
    from the top; yet we're protected against a bug there because there's
    always a sleep to wait for the locking transaction to complete before
    continuing to do anything.  Really, the fact that tuple locks always
    conflicted with concurrent updates is what protected against bugs here.)

    Per report from Andrew Dunstan and Josh Berkus in thread at
    http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com

    Bug analysis by Andres Freund.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
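(A sketch of the suggested cleanup, with placeholder values: the ctid of the
copy to delete must come from the real ctid/pageinspect output, the duplicate
check has to be repeated per table with that table's own key columns, and
foreign keys referencing the table may need extra care.)

-- Remove the stale copy by its physical address; '(0,1)' is only a
-- placeholder, use the ctid of the copy that is not reachable via the index.
BEGIN;
DELETE FROM srmlinkgroup WHERE ctid = '(0,1)'::tid;
COMMIT;

-- Re-check a table for duplicate primary key values, forcing a sequential
-- scan so that a copy missing from an index cannot hide.
BEGIN;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_indexonlyscan = off;
SET LOCAL enable_bitmapscan = off;
SELECT id, count(*) FROM srmlinkgroup GROUP BY id HAVING count(*) > 1;
COMMIT;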
On 13/08/14 15.48, Alvaro Herrera wrote:
> Gerd Behrmann wrote:
>
>>  lp | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
>> ----+--------+----------+--------+-----------+--------+----------+--------+-------------+------------+--------+--------+-------
>>   5 |   7992 |        1 |     96 | 541168217 |      0 |        3 | (21,5) |       32778 |      10498 |     24 |        |
>> (1 row)
>>
>>  lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
>> ----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
>>  62 |   8096 |        1 |     96 |      2 |      0 |        4 | (5,62) |       32778 |      10498 |     24 |        |
>> (1 row)
>
> So t_infomask is 0x2902, or
> HEAP_UPDATED | HEAP_XMAX_INVALID | HEAP_XMIN_COMMITTED | HEAP_HASVARWIDTH
>
> Note both tuples have the same t_infomask. Andres Freund suggests these
> might be two updated versions from a common "ancestor" tuple. I don't
> have reason to think different, except that t_xmin in one of them is
> frozen and so it's probably considerably older than the other one (if
> enough time has passed to have one of them frozen, then why didn't you
> detect this earlier?)
>
> Anyway this might be fixed in 9.3.5, per the commit below. I suggest
> you upgrade to that one, remove one of the copies, and verify other
> tables for duplicates.
>
> commit c0bd128c81c2b23a1cbc53305180fca51b3b61c3
> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date:   Thu Apr 24 15:41:55 2014 -0300
>
>     Fix race when updating a tuple concurrently locked by another process
>
>     If a tuple is locked, and this lock is later upgraded either to an
>     update or to a stronger lock, and in the meantime some other process
>     tries to lock, update or delete the same tuple, it (the tuple) could end
>     up being updated twice, or having conflicting locks held.
>
>     The reason for this is that the second updater checks for a change in
>     Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
>     the first lock; and if there's a change, it restarts and re-evaluates
>     its ability to update the tuple.  But it neglected to check for changes
>     in lock strength or in lock-vs-update status when those two properties
>     stayed the same.  This would lead it to take the wrong decision and
>     continue with its own update, when in reality it shouldn't do so but
>     instead restart from the top.
>
>     This could lead to either an assertion failure much later (when a
>     multixact containing multiple updates is detected), or duplicate copies
>     of tuples.
>
>     To fix, make sure to compare the other relevant infomask bits alongside
>     the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
>     necessary.
>
>     Also, in the belt-and-suspenders spirit, add a check to
>     MultiXactCreateFromMembers that a multixact being created does not have
>     two or more members that are claimed to be updates.  This should protect
>     against other bugs that might cause similar bogus situations.
>
>     Backpatch to 9.3, where the possibility of multixacts containing updates
>     was introduced.  (In prior versions it was possible to have the tuple
>     lock upgraded from shared to exclusive, and an update would not restart
>     from the top; yet we're protected against a bug there because there's
>     always a sleep to wait for the locking transaction to complete before
>     continuing to do anything.  Really, the fact that tuple locks always
>     conflicted with concurrent updates is what protected against bugs here.)
>
>     Per report from Andrew Dunstan and Josh Berkus in thread at
>     http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com
>
>     Bug analysis by Andres Freund.

Thanks. One tuple is indeed considerably older (a month or two, maybe).
Since only the new tuple shows up when filtering on the key, our application
continued to work, and hence nobody noticed the problem.

I will do as suggested. You can close the report.

Cheers,

/gerd

--
Acting NT1 Area Coordinator, NeIC