Tom Lane wrote:
> "Russell Garrett" <rg@tcslon.com> writes:
>> last=> reindex index profile_pkey;
>> ERROR: could not create unique index
>> DETAIL: Table contains duplicated values.
>
> Okay, it looks like the table actually contains duplicate rows but the
> index has managed to lose track of that fact. You can see the
> duplicates in the GROUP BY query (which isn't using the index) but
> not when you do "select * from profile where id = 1017057", because
> that query will use the index.
Ah, it went so quick I was sure it was using the index :).
> If you did "set enable_indexscan = off" then the "select * from
> profile where id = 1017057" query would probably show you two rows.
> I'd be interested to see the results of
>
> select ctid, oid, xmin, xmax, cmin, cmax from profile where id =
> 1017057;
> with enable_indexscan off.
last=> set enable_indexscan = off;
SET
last=> select ctid, oid, xmin, xmax, cmin, cmax from profile where id =
1017057;
ctid | oid | xmin | xmax | cmin | cmax
-----------+----------+----------+----------+----------+------
(482,27) | 65196187 | 21095783 | 21953819 | 21953819 | 0
(3095,56) | 64140531 | 20037571 | 20037571 | 545 | 555
(2 rows)
>> Does REINDEX use the current index to check for duplicates? :)
>
> No, it's building a new index from scratch, and so it notices the
> duplicates.
Yeah, I see now.
> What you've got here is definitely a case of index corruption that has
> led to logical corruption of the table (ie, duplicate rows). To get
> back to a valid state you will need to delete whichever of the
> duplicates seems to be out-of-date, and then do a REINDEX. But I
> think it is important first to try to determine what caused the
> corruption (software error or hardware?). Again, if you can take a
> physical dump of the data directory, that would provide an
> opportunity to study the problem later after you get the production
> machine back on its feet.
OK, I've kept a copy of the db and then fixed the problem.
We had an incidence of table corruption a few weeks back, but we just
ignored it (wishful thinking, maybe). Postgres is compiled normally, with
gcc3. OS is Red Hat 9.
Hardware is a Dell Poweredge 2.4Ghz Xeon (can't remember the model number),
DB is using XFS on a MegaRaid U320 controller, running the MegaRaid 1.18j
drivers. Naturally, the drivers don't report any errors. I doubt it's memory
corruption as the system has been rock-solid. So I'm guessing it's probably
MegaRaid, or XFS, or the kernel I suppose (although I'm using the latest).
Problem is, it's quite hard to reproduce the corruption, since it seems to
be fairly rare.
Maybe this should be moved to pgsql-general now?
Russ