Thread: Urgent: Key constraints behaving weirdly

Urgent: Key constraints behaving weirdly

From
"Russell Garrett"
Date:
Your name               : Russ Garrett
Your email address      : russ@last.fm

System Configuration
---------------------
  Architecture (example: Intel Pentium)         : P4 Xeon
  Operating System (example: Linux 2.0.26 ELF)  : Linux 2.4.22 ELF
  PostgreSQL version (example: PostgreSQL-7.4)  : PostgreSQL-7.4
  Compiler used (example:  gcc 2.95.2)          : gcc version 3.2.2 20030222
(Red Hat Linux 3.2.2-5)

Please enter a FULL description of your problem:
------------------------------------------------

Constraints are being weird. The reproduction instructions speak for
themselves.
We're using this on a production database and we're understandably getting a
bit edgy.


Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

last=> \d profile
...
Indexes:
    "profile_pkey" primary key, btree (id)

last=> \di+ profile_pkey
                       List of relations
 Schema |     Name     | Type  | Owner  | Description |  Table
--------+--------------+-------+--------+-------------+---------
 public | profile_pkey | index | lastfm |             | profile
(1 row)

last=> UPDATE profile SET lastsongtime=now(),  r1=r1+0, r2=r2+0, r3=r3+0,
r4=r4+0, r5=r5+0, scrobcount=scrobcount+1 WHERE id=1017057;
ERROR:  duplicate key violates unique constraint "profile_pkey"

last=> SELECT * FROM profile WHERE id=1017057;
   id    |  user   |  name   | description | songspaminterval |
lastsongtime     | entropy | newbie |        creationdate        | r1 | r2 |
r3 | r4 | r5 | scrobcount | newbielistenas | e1 | e2 | e3 | e4  | e5 |
numdisconnections | totalstreamtime
---------+---------+---------+-------------+------------------+-------------
--------+---------+--------+----------------------------+----+----+----+----
+----+------------+----------------+----+----+----+-----+----+--------------
-----+-----------------
 1017057 | 1017075 | Default |             |               60 | 2003-12-18
04:32:34 |       1 | f      | 2003-12-17 23:57:01.582757 |  0 |  0 |  0 |  0
|  0 |         23 |                |  2 |  0 |  0 | 219 |  0 |
0 |               0
(1 row)

last=>


If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

No idea :).

Re: Urgent: Key constraints behaving weirdly

From
Tom Lane
Date:
"Russell Garrett" <rg@tcslon.com> writes:
> Constraints are being weird. The reproduction instructions speak for
> themselves.

You haven't really provided enough info to let anyone do anything about
this.  Certainly no one else is going to be able to reproduce the
problem based on what you've provided.

I can think of a number of theories:

1. There's a rule or trigger ON UPDATE that is modifying the behavior of
the update you showed, and is causing the duplicate-key error (ie,
there's a bug in the rule or trigger logic and the error report is
legitimate).

2. The index is corrupt, possibly due to a hardware glitch.  (This seems
unlikely because the SELECT result appears normal, but I can't rule it
out entirely.)

3. You've stumbled across a corner-case bug that no one has seen before.

If the problem is #2 then the likely recovery method would be a REINDEX,
hopefully followed by some intensive hardware testing.

If the problem is #3 then it's fairly likely that a REINDEX would make
the failure go away, but we'd lose this opportunity to find the bug.

If you want to pursue theory #3, I would recommend the following
procedure:
    A. Shut down postmaster for long enough to make a tarball copy
       of the whole $PGDATA directory tree.  You should then be able
       to investigate the problem at leisure by loading the tarball
       on a playpen machine.
    B. REINDEX to (hopefully) make the problem go away on the
       production machine, so it can get back to work.
    C. Build a debug-enabled Postgres installation on your playpen
       machine, and start tracing through the failure with gdb.

I'd be willing to help out if you don't feel competent to carry through
step C.

            regards, tom lane

Re: Urgent: Key constraints behaving weirdly

From
"Russell Garrett"
Date:
Tom Lane wrote:
> "Russell Garrett" <rg@tcslon.com> writes:
>> Constraints are being weird. The reproduction instructions speak for
>> themselves.
>
> You haven't really provided enough info to let anyone do anything
> about this.  Certainly no one else is going to be able to reproduce
> the problem based on what you've provided.

I'm aware of that, but it's really quite difficult to rule out everything.
Our database isn't that small, and we don't have the resources to
immediately deduce a test case (i.e. it's just me ;).

> I can think of a number of theories:
>
> 1. There's a rule or trigger ON UPDATE that is modifying the behavior
> of the update you showed, and is causing the duplicate-key error (ie,
> there's a bug in the rule or trigger logic and the error report is
> legitimate).

There's nothing triggered by that update.

> 2. The index is corrupt, possibly due to a hardware glitch.  (This
> seems unlikely because the SELECT result appears normal, but I can't
> rule it out entirely.)

We had a table error a few weeks back, however we re-imported the table from
scratch. So it may well be this. I doubt it's a actually a Postgres bug now,
since we aren't doing anything particularly unusual, and it's been working
fine for several weeks. Still, I'm not ruling it out.

Russ

Re: Urgent: Key constraints behaving weirdly

From
"Russell Garrett"
Date:
>> 2. The index is corrupt, possibly due to a hardware glitch.  (This
>> seems unlikely because the SELECT result appears normal, but I can't
>> rule it out entirely.)
>
> We had a table error a few weeks back, however we re-imported the
> table from scratch. So it may well be this. I doubt it's a actually a
> Postgres bug now, since we aren't doing anything particularly
> unusual, and it's been working fine for several weeks. Still, I'm not
> ruling it out.

OK, I'm not sure whether this is index corruption or a postgres bug or both:

last=> select id, count(id) AS a FROM profile GROUP BY id ORDER BY a DESC;
   id    | a
---------+---
 1017057 | 2
 1018316 | 2
       0 | 1
      80 | 1
...

last=> SELECT id from profile WHERE id = 1017057;
   id
---------
 1017057
(1 row)

last=> SELECT id from profile WHERE id = 1018316;
   id
---------
 1018316
(1 row)

last=> reindex index profile_pkey;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.


Does REINDEX use the current index to check for duplicates? :)

Many thanks for all the help,

Russ

Re: Urgent: Key constraints behaving weirdly

From
Tom Lane
Date:
"Russell Garrett" <rg@tcslon.com> writes:
> last=> reindex index profile_pkey;
> ERROR:  could not create unique index
> DETAIL:  Table contains duplicated values.

Okay, it looks like the table actually contains duplicate rows but the
index has managed to lose track of that fact.  You can see the
duplicates in the GROUP BY query (which isn't using the index) but
not when you do "select * from profile where id = 1017057", because that
query will use the index.

If you did "set enable_indexscan = off" then the "select * from profile
where id = 1017057" query would probably show you two rows.  I'd be
interested to see the results of

select ctid, oid, xmin, xmax, cmin, cmax from profile where id = 1017057;

with enable_indexscan off.

> Does REINDEX use the current index to check for duplicates? :)

No, it's building a new index from scratch, and so it notices the
duplicates.

What you've got here is definitely a case of index corruption that has
led to logical corruption of the table (ie, duplicate rows).  To get
back to a valid state you will need to delete whichever of the
duplicates seems to be out-of-date, and then do a REINDEX.  But I think
it is important first to try to determine what caused the corruption
(software error or hardware?).  Again, if you can take a physical dump
of the data directory, that would provide an opportunity to study the
problem later after you get the production machine back on its feet.

            regards, tom lane

Re: Urgent: Key constraints behaving weirdly

From
"Russell Garrett"
Date:
Tom Lane wrote:
> "Russell Garrett" <rg@tcslon.com> writes:
>> last=> reindex index profile_pkey;
>> ERROR:  could not create unique index
>> DETAIL:  Table contains duplicated values.
>
> Okay, it looks like the table actually contains duplicate rows but the
> index has managed to lose track of that fact.  You can see the
> duplicates in the GROUP BY query (which isn't using the index) but
> not when you do "select * from profile where id = 1017057", because
> that query will use the index.

Ah, it went so quick I was sure it was using the index :).

> If you did "set enable_indexscan = off" then the "select * from
> profile where id = 1017057" query would probably show you two rows.
> I'd be interested to see the results of
>
> select ctid, oid, xmin, xmax, cmin, cmax from profile where id =
> 1017057;
> with enable_indexscan off.

last=> set enable_indexscan = off;
SET
last=> select ctid, oid, xmin, xmax, cmin, cmax from profile where id =
1017057;
   ctid    |   oid    |   xmin   |   xmax   |   cmin   | cmax
-----------+----------+----------+----------+----------+------
  (482,27) | 65196187 | 21095783 | 21953819 | 21953819 |    0
 (3095,56) | 64140531 | 20037571 | 20037571 |      545 |  555
(2 rows)

>> Does REINDEX use the current index to check for duplicates? :)
>
> No, it's building a new index from scratch, and so it notices the
> duplicates.

Yeah, I see now.

> What you've got here is definitely a case of index corruption that has
> led to logical corruption of the table (ie, duplicate rows).  To get
> back to a valid state you will need to delete whichever of the
> duplicates seems to be out-of-date, and then do a REINDEX.  But I
> think it is important first to try to determine what caused the
> corruption (software error or hardware?).  Again, if you can take a
> physical dump of the data directory, that would provide an
> opportunity to study the problem later after you get the production
> machine back on its feet.

OK, I've kept a copy of the db and then fixed the problem.

We had an incidence of table corruption a few weeks back, but we just
ignored it (wishful thinking, maybe). Postgres is compiled normally, with
gcc3. OS is Red Hat 9.

Hardware is a Dell Poweredge 2.4Ghz Xeon (can't remember the model number),
DB is using XFS on a MegaRaid U320 controller, running the MegaRaid 1.18j
drivers. Naturally, the drivers don't report any errors. I doubt it's memory
corruption as the system has been rock-solid. So I'm guessing it's probably
MegaRaid, or XFS, or the kernel I suppose (although I'm using the latest).

Problem is, it's quite hard to reproduce the corruption, since it seems to
be fairly rare.

Maybe this should be moved to pgsql-general now?

Russ