Thread: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE constraint

Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE constraint

From
Travis Cross
Date:
I have a table that I am using to store email token data for DSPAM.

I'm noticing that a handful (4-16) of rows with duplicate columns
(uid,token) are sneaking into the table every day despite the
primary key constraint.

The server currently processes a few thousand emails per day, and
this particular table currently has about 4.5 million rows in it.

I feel as though I must be missing something here, because I have
always strongly assumed that postgresql prevents this sort of chaos
from happening by default.

When these duplicate pairs make their way into the table, all havoc
breaks loose ;)  The rows with the duplicate pairs seem to become
land mines.  The postgresql process handling a query that
subsequently runs across one of these rows dies, taking down
the DSPAM daemon with it, and sometimes corrupting the postgresql
shared memory enough that the postmaster has to shutdown the other
processes and restart everything anew [1].

I am usually able to clean things up by running the following, but
once or twice I've had to drop the unique constraint before
postgresql would process the request without choking:

delete from dspam_token_data
    where row(uid,token) in
            (select uid,token
              from dspam_token_data
              group by uid,token
              having count(*) > 1);

(I don't worry about preserving one of the duplicates here.)

I'm running postgresql-8.1.3.  Here is the table in question:

CREATE TABLE dspam.dspam_token_data
(
    uid int4 NOT NULL,
    token int8 NOT NULL,
    spam_hits int4,
    innocent_hits int4,
    last_hit date,
    CONSTRAINT dspam_token_data_pkey PRIMARY KEY (uid, token)
)
WITHOUT OIDS;

[2]

What steps can I take to fix this?  Is there any additional
information I can provide?

I've cleaned the table many times now.  I then run VACUUM ANALYZE.

My next step will probably be to hack the DSPAM sources to make the
application more careful about not trying to insert rows that would
violate the unique constraint.  Even still, it seems that would only
reduce the frequency of these occurrences, not eliminate them
completely.

Thanks!

Cheers,

-- Travis


----
Notes:
[1] A condensed log file showing off exactly what happens here is
attached.

[2]  Previously, the table lacked a primary key and instead used a
unique constraint and index.  This yielded the exact same results I
am currently seeing using a two-column primary key, as above.  The
old table schema was:

CREATE TABLE dspam.dspam_token_data
(
    uid int4,
    token int8,
    spam_hits int4,
    innocent_hits int4,
    last_hit date,
    CONSTRAINT dspam_token_data_uid_key UNIQUE (uid, token)
)
WITHOUT OIDS;

Jun  5 17:58:14 shadow postgres[28775]: [21-1] PANIC:  right sibling's left-link doesn't match
Jun  5 17:58:14 shadow postgres[28775]: [21-2] STATEMENT:  PREPARE dspam_update_plan (bigint) AS UPDATE
dspam_token_dataSET last_hit = CURRENT_DATE, innocent_hits = 
Jun  5 17:58:14 shadow postgres[28775]: [21-3]  innocent_hits + 1 WHERE uid = '1' AND token = $1;PREPARE
dspam_insert_plan(bigint, int, int) AS INSERT INTO dspam_token_data 
Jun  5 17:58:14 shadow postgres[28775]: [21-4]  (uid, token, spam_hits, innocent_hits, last_hit) VALUES (1, $1, $2, $3,
CURRENT_DATE);
Jun  5 17:58:14 shadow dspam[7780]: query error: (null): see sql.errors for more details
Jun  5 17:58:14 shadow postgres[10566]: [21-1] LOG:  server process (PID 28775) was terminated by signal 6
Jun  5 17:58:14 shadow postgres[10566]: [22-1] LOG:  terminating any other active server processes
Jun  5 17:58:14 shadow postgres[7219]: [21-1] WARNING:  terminating connection because of crash of another server
process
Jun  5 17:58:14 shadow postgres[7219]: [21-2] DETAIL:  The postmaster has commanded this server process to roll back
thecurrent transaction and exit, because another server 
Jun  5 17:58:14 shadow postgres[7219]: [21-3]  process exited abnormally and possibly corrupted shared memory.
....
Jun  5 17:58:14 shadow postgres[7525]: [24-1] LOG:  database system was interrupted at 2006-06-05 17:58:01 UTC
Jun  5 17:58:14 shadow postgres[7525]: [25-1] LOG:  checkpoint record is at 16/DA1D6868
Jun  5 17:58:14 shadow postgres[7525]: [26-1] LOG:  redo record is at 16/DA0081D8; undo record is at 0/0; shutdown
FALSE
Jun  5 17:58:14 shadow postgres[7525]: [27-1] LOG:  next transaction ID: 5691748; next OID: 148615
Jun  5 17:58:14 shadow postgres[7525]: [28-1] LOG:  next MultiXactId: 4; next MultiXactOffset: 10
Jun  5 17:58:14 shadow postgres[7525]: [29-1] LOG:  database system was not properly shut down; automatic recovery in
progress
Jun  5 17:58:14 shadow postgres[7525]: [30-1] LOG:  redo starts at 16/DA0081D8
Jun  5 17:58:14 shadow postgres[7525]: [31-1] LOG:  record with zero length at 16/DC2858F8
Jun  5 17:58:14 shadow postgres[7525]: [32-1] LOG:  redo done at 16/DC2858C8
Jun  5 17:58:16 shadow postgres[7542]: [24-1] FATAL:  the database system is starting up
Jun  5 17:58:16 shadow postgres[7543]: [24-1] FATAL:  the database system is starting up
Jun  5 17:58:24 shadow postgres[7525]: [33-1] LOG:  database system is ready
Jun  5 17:58:24 shadow postgres[7525]: [34-1] LOG:  transaction ID wrap limit is 1078513827, limited by database
"xmail"
Jun  5 17:59:24 shadow postgres[7632]: [24-1] LOG:  unexpected EOF on client connection
Jun  5 17:59:24 shadow postgres[7768]: [24-1] LOG:  unexpected EOF on client connection
Jun  5 17:59:37 shadow postgres[7836]: [24-1] WARNING:  there is no transaction in progress


Travis Cross <travis@crosswirecorp.com> writes:
> I'm noticing that a handful (4-16) of rows with duplicate columns
> (uid,token) are sneaking into the table every day despite the
> primary key constraint.

Corrupt index, looks like ... you might try reindexing the index.

I don't believe that the PANIC you show has anything directly to do
with duplicate entries.  It is a symptom of corrupt index structure.
Now a corrupt index might also explain failure to notice duplications,
but changing your application isn't going to fix whatever is causing
it.  You need to look for server-side causes.

Any database or system crashes on this server (before this problem
started)?  Do you *know* that the disk drive will not lie about write
complete?  What is the platform and storage system, anyway?
        regards, tom lane


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE constraint

From
"Ian Barwick"
Date:
On 6/6/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Travis Cross <travis@crosswirecorp.com> writes:
> > I'm noticing that a handful (4-16) of rows with duplicate columns
> > (uid,token) are sneaking into the table every day despite the
> > primary key constraint.
>
> Corrupt index, looks like ... you might try reindexing the index.
>
> I don't believe that the PANIC you show has anything directly to do
> with duplicate entries.  It is a symptom of corrupt index structure.
> Now a corrupt index might also explain failure to notice duplications,
> but changing your application isn't going to fix whatever is causing
> it.  You need to look for server-side causes.
>
> Any database or system crashes on this server (before this problem
> started)?  Do you *know* that the disk drive will not lie about write
> complete?  What is the platform and storage system, anyway?

FWIW I've seen similar behaviour to this (PostgreSQL processes exiting
"abnormally", index corruption with duplicate primary keys) on servers
with defective RAM chips.

Ian Barwick


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE

From
Travis Cross
Date:
Kenneth Marshall wrote:
> We have used postgresql 7.4, 8.0, and 8.1 with DSPAM and have
> never had a single problem like you are describing. In the past
> on this mailing list, these sorts of issues have been caused by
> hardware problems on the DB server in some cases. Good luck with
> tracking it down.

Thanks.  It is always good to know that something *should* and 
*does* work well.  That gives me a good base for tracking down the 
issue.

Cheers,

-- Travis


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE

From
Travis Cross
Date:
Tom Lane wrote:
> Travis Cross <travis@crosswirecorp.com> writes:
>> I'm noticing that a handful (4-16) of rows with duplicate columns
>> (uid,token) are sneaking into the table every day despite the
>> primary key constraint.
> 
> Corrupt index, looks like ... you might try reindexing the index.

I probably should have mentioned that I have indeed done a REINDEX 
on the table a couple of times in the past, suspecting that issue, 
and having seen it resolve similar issues on this list.  Upon your 
suggestion, I'm running one right now, and I will probably dump and 
reload the entire database after hours, unless anyone thinks that 
would be a bad idea (or unproductive in tracking this down).

> I don't believe that the PANIC you show has anything directly to do
> with duplicate entries.  It is a symptom of corrupt index structure.
> Now a corrupt index might also explain failure to notice duplications,
> but changing your application isn't going to fix whatever is causing
> it.  You need to look for server-side causes.

Indeed, you are correct.  I should also mention that the problem 
seems to build over time, in the sense that everything will run fine 
for awhile (a few days), and then will crash repeatedly.  Deleting 
the duplicate rows seems to reset the counter -- of course, I cannot 
run a successful REINDEX until I have deleted those duplicate rows.

> Any database or system crashes on this server (before this problem
> started)?

No.  In fact, this box, and a sister box running similar hardware, 
have been models of system stability.  My uptimes are 46 and 87 
days, respectively, representing the time since I've done a kernel 
upgrade and the time since I plugged the boxes into the rack.  The 
sister box is running real-time voice services.

> Do you *know* that the disk drive will not lie about write
> complete?

"Know" is such a strong word ;)  Honestly, I have very little idea.  I understand the nature of the problem this
presents,as I've read 
 
the very fine PostgreSQL manual many times over the years.

Because the drives I use are specifically designed to operate well 
in a RAID environment, I would 'hope' that the drives perform honest 
write operations.

I wonder if there is a utility to perform a deterministic test of 
this...

> What is the platform and storage system, anyway?

The platform is:

Linux 2.6.16.9 (w/o loadable modules)
Supermicro PDSMi (a single processor P-D board)
2G ECC DDRII SDRAM

The storage system is:

On-board SATA ICH7R Controller
2 x WD3200SD hard drives running in a Linux RAID 1 configuration.
That is to say: Western Digital 320G SATA 'enterprise' drives.  The 
drives have a somewhat unique feature: time-limited error recovery, 
which is supposed to let the RAID controller/software deal with 
errors after a certain point (7 seconds), rather than continuing to 
block, and causing the drive to fall out of the array.

The drive:
http://www.westerndigital.com/en/products/products.asp?driveid=114&language=en

I'll run file system consistency checks tonight to see if I can pick 
out a proximal cause for all this chaos.

I really do appreciate the assistance.

Cheers,

-- Travis


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE

From
Travis Cross
Date:
Ian Barwick wrote:
> On 6/6/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Travis Cross <travis@crosswirecorp.com> writes:
>> > I'm noticing that a handful (4-16) of rows with duplicate columns
>> > (uid,token) are sneaking into the table every day despite the
>> > primary key constraint.
>>
>> Corrupt index, looks like ... you might try reindexing the index.
>>
>> I don't believe that the PANIC you show has anything directly to do
>> with duplicate entries.  It is a symptom of corrupt index structure.
>> Now a corrupt index might also explain failure to notice duplications,
>> but changing your application isn't going to fix whatever is causing
>> it.  You need to look for server-side causes.
>>
>> Any database or system crashes on this server (before this problem
>> started)?  Do you *know* that the disk drive will not lie about write
>> complete?  What is the platform and storage system, anyway?
> 
> FWIW I've seen similar behaviour to this (PostgreSQL processes exiting
> "abnormally", index corruption with duplicate primary keys) on servers
> with defective RAM chips.

That's a good thought, and also a possibility.  I tend to distrust 
RAM inherently.  If nothing else seems verifiable as the cause, I 
may have to take the system down on a Saturday night for a good 
thorough run through memtest86.

Cheers,

-- Travis


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE

From
"Jim C. Nasby"
Date:
On Tue, Jun 06, 2006 at 12:58:10PM -0400, Travis Cross wrote:
> >Do you *know* that the disk drive will not lie about write
> >complete?
> 
> "Know" is such a strong word ;)  Honestly, I have very little idea. 
>  I understand the nature of the problem this presents, as I've read 
> the very fine PostgreSQL manual many times over the years.
> 
> Because the drives I use are specifically designed to operate well 
> in a RAID environment, I would 'hope' that the drives perform honest 
> write operations.
> 
> I wonder if there is a utility to perform a deterministic test of 
> this...

Brad from livejournal.com wrote a utility that does just that, thought
it requires 2 machines to run the test.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE constraint

From
Kenneth Marshall
Date:
Travis,

We have used postgresql 7.4, 8.0, and 8.1 with DSPAM and have
never had a single problem like you are describing. In the past
on this mailing list, these sorts of issues have been caused by
hardware problems on the DB server in some cases. Good luck with
tracking it down.

Ken

On Tue, Jun 06, 2006 at 05:45:26AM -0400, Travis Cross wrote:
> I have a table that I am using to store email token data for DSPAM.
> 
> I'm noticing that a handful (4-16) of rows with duplicate columns
> (uid,token) are sneaking into the table every day despite the
> primary key constraint.
> 
> The server currently processes a few thousand emails per day, and
> this particular table currently has about 4.5 million rows in it.
> 
> I feel as though I must be missing something here, because I have
> always strongly assumed that postgresql prevents this sort of chaos
> from happening by default.
> 
> When these duplicate pairs make their way into the table, all havoc
> breaks loose ;)  The rows with the duplicate pairs seem to become
> land mines.  The postgresql process handling a query that
> subsequently runs across one of these rows dies, taking down
> the DSPAM daemon with it, and sometimes corrupting the postgresql
> shared memory enough that the postmaster has to shutdown the other
> processes and restart everything anew [1].
> 
> I am usually able to clean things up by running the following, but
> once or twice I've had to drop the unique constraint before
> postgresql would process the request without choking:
> 
> delete from dspam_token_data
>    where row(uid,token) in
>            (select uid,token
>              from dspam_token_data
>              group by uid,token
>              having count(*) > 1);
> 
> (I don't worry about preserving one of the duplicates here.)
> 
> I'm running postgresql-8.1.3.  Here is the table in question:
> 
> CREATE TABLE dspam.dspam_token_data
> (
>    uid int4 NOT NULL,
>    token int8 NOT NULL,
>    spam_hits int4,
>    innocent_hits int4,
>    last_hit date,
>    CONSTRAINT dspam_token_data_pkey PRIMARY KEY (uid, token)
> )
> WITHOUT OIDS;
> 
> [2]
> 
> What steps can I take to fix this?  Is there any additional
> information I can provide?
> 
> I've cleaned the table many times now.  I then run VACUUM ANALYZE.
> 
> My next step will probably be to hack the DSPAM sources to make the
> application more careful about not trying to insert rows that would
> violate the unique constraint.  Even still, it seems that would only
> reduce the frequency of these occurrences, not eliminate them
> completely.
> 
> Thanks!
> 
> Cheers,
> 
> -- Travis
> 
> 
> ----
> Notes:
> [1] A condensed log file showing off exactly what happens here is
> attached.
> 
> [2]  Previously, the table lacked a primary key and instead used a
> unique constraint and index.  This yielded the exact same results I
> am currently seeing using a two-column primary key, as above.  The
> old table schema was:
> 
> CREATE TABLE dspam.dspam_token_data
> (
>    uid int4,
>    token int8,
>    spam_hits int4,
>    innocent_hits int4,
>    last_hit date,
>    CONSTRAINT dspam_token_data_uid_key UNIQUE (uid, token)
> )
> WITHOUT OIDS;
> 

> Jun  5 17:58:14 shadow postgres[28775]: [21-1] PANIC:  right sibling's left-link doesn't match
> Jun  5 17:58:14 shadow postgres[28775]: [21-2] STATEMENT:  PREPARE dspam_update_plan (bigint) AS UPDATE
dspam_token_dataSET last_hit = CURRENT_DATE, innocent_hits =
 
> Jun  5 17:58:14 shadow postgres[28775]: [21-3]  innocent_hits + 1 WHERE uid = '1' AND token = $1;PREPARE
dspam_insert_plan(bigint, int, int) AS INSERT INTO dspam_token_data
 
> Jun  5 17:58:14 shadow postgres[28775]: [21-4]  (uid, token, spam_hits, innocent_hits, last_hit) VALUES (1, $1, $2,
$3,CURRENT_DATE);
 
> Jun  5 17:58:14 shadow dspam[7780]: query error: (null): see sql.errors for more details
> Jun  5 17:58:14 shadow postgres[10566]: [21-1] LOG:  server process (PID 28775) was terminated by signal 6
> Jun  5 17:58:14 shadow postgres[10566]: [22-1] LOG:  terminating any other active server processes
> Jun  5 17:58:14 shadow postgres[7219]: [21-1] WARNING:  terminating connection because of crash of another server
process
> Jun  5 17:58:14 shadow postgres[7219]: [21-2] DETAIL:  The postmaster has commanded this server process to roll back
thecurrent transaction and exit, because another server
 
> Jun  5 17:58:14 shadow postgres[7219]: [21-3]  process exited abnormally and possibly corrupted shared memory.
> ....
> Jun  5 17:58:14 shadow postgres[7525]: [24-1] LOG:  database system was interrupted at 2006-06-05 17:58:01 UTC
> Jun  5 17:58:14 shadow postgres[7525]: [25-1] LOG:  checkpoint record is at 16/DA1D6868
> Jun  5 17:58:14 shadow postgres[7525]: [26-1] LOG:  redo record is at 16/DA0081D8; undo record is at 0/0; shutdown
FALSE
> Jun  5 17:58:14 shadow postgres[7525]: [27-1] LOG:  next transaction ID: 5691748; next OID: 148615
> Jun  5 17:58:14 shadow postgres[7525]: [28-1] LOG:  next MultiXactId: 4; next MultiXactOffset: 10
> Jun  5 17:58:14 shadow postgres[7525]: [29-1] LOG:  database system was not properly shut down; automatic recovery in
progress
> Jun  5 17:58:14 shadow postgres[7525]: [30-1] LOG:  redo starts at 16/DA0081D8
> Jun  5 17:58:14 shadow postgres[7525]: [31-1] LOG:  record with zero length at 16/DC2858F8
> Jun  5 17:58:14 shadow postgres[7525]: [32-1] LOG:  redo done at 16/DC2858C8
> Jun  5 17:58:16 shadow postgres[7542]: [24-1] FATAL:  the database system is starting up
> Jun  5 17:58:16 shadow postgres[7543]: [24-1] FATAL:  the database system is starting up
> Jun  5 17:58:24 shadow postgres[7525]: [33-1] LOG:  database system is ready
> Jun  5 17:58:24 shadow postgres[7525]: [34-1] LOG:  transaction ID wrap limit is 1078513827, limited by database
"xmail"
> Jun  5 17:59:24 shadow postgres[7632]: [24-1] LOG:  unexpected EOF on client connection
> Jun  5 17:59:24 shadow postgres[7768]: [24-1] LOG:  unexpected EOF on client connection
> Jun  5 17:59:37 shadow postgres[7836]: [24-1] WARNING:  there is no transaction in progress
> 

> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match