Thread: Duplicate rows sneaking in despite PRIMARY KEY / UNIQUE constraint
I have a table that I am using to store email token data for DSPAM. I'm noticing that a handful (4-16) of rows with duplicate columns (uid,token) are sneaking into the table every day despite the primary key constraint. The server currently processes a few thousand emails per day, and this particular table currently has about 4.5 million rows in it. I feel as though I must be missing something here, because I have always strongly assumed that postgresql prevents this sort of chaos from happening by default. When these duplicate pairs make their way into the table, all havoc breaks loose ;) The rows with the duplicate pairs seem to become land mines. The postgresql process handling a query that subsequently runs across one of these rows dies, taking down the DSPAM daemon with it, and sometimes corrupting the postgresql shared memory enough that the postmaster has to shutdown the other processes and restart everything anew [1]. I am usually able to clean things up by running the following, but once or twice I've had to drop the unique constraint before postgresql would process the request without choking: delete from dspam_token_data where row(uid,token) in (select uid,token from dspam_token_data group by uid,token having count(*) > 1); (I don't worry about preserving one of the duplicates here.) I'm running postgresql-8.1.3. Here is the table in question: CREATE TABLE dspam.dspam_token_data ( uid int4 NOT NULL, token int8 NOT NULL, spam_hits int4, innocent_hits int4, last_hit date, CONSTRAINT dspam_token_data_pkey PRIMARY KEY (uid, token) ) WITHOUT OIDS; [2] What steps can I take to fix this? Is there any additional information I can provide? I've cleaned the table many times now. I then run VACUUM ANALYZE. My next step will probably be to hack the DSPAM sources to make the application more careful about not trying to insert rows that would violate the unique constraint. Even still, it seems that would only reduce the frequency of these occurrences, not eliminate them completely. Thanks! Cheers, -- Travis ---- Notes: [1] A condensed log file showing off exactly what happens here is attached. [2] Previously, the table lacked a primary key and instead used a unique constraint and index. This yielded the exact same results I am currently seeing using a two-column primary key, as above. The old table schema was: CREATE TABLE dspam.dspam_token_data ( uid int4, token int8, spam_hits int4, innocent_hits int4, last_hit date, CONSTRAINT dspam_token_data_uid_key UNIQUE (uid, token) ) WITHOUT OIDS; Jun 5 17:58:14 shadow postgres[28775]: [21-1] PANIC: right sibling's left-link doesn't match Jun 5 17:58:14 shadow postgres[28775]: [21-2] STATEMENT: PREPARE dspam_update_plan (bigint) AS UPDATE dspam_token_dataSET last_hit = CURRENT_DATE, innocent_hits = Jun 5 17:58:14 shadow postgres[28775]: [21-3] innocent_hits + 1 WHERE uid = '1' AND token = $1;PREPARE dspam_insert_plan(bigint, int, int) AS INSERT INTO dspam_token_data Jun 5 17:58:14 shadow postgres[28775]: [21-4] (uid, token, spam_hits, innocent_hits, last_hit) VALUES (1, $1, $2, $3, CURRENT_DATE); Jun 5 17:58:14 shadow dspam[7780]: query error: (null): see sql.errors for more details Jun 5 17:58:14 shadow postgres[10566]: [21-1] LOG: server process (PID 28775) was terminated by signal 6 Jun 5 17:58:14 shadow postgres[10566]: [22-1] LOG: terminating any other active server processes Jun 5 17:58:14 shadow postgres[7219]: [21-1] WARNING: terminating connection because of crash of another server process Jun 5 17:58:14 shadow postgres[7219]: [21-2] DETAIL: The postmaster has commanded this server process to roll back thecurrent transaction and exit, because another server Jun 5 17:58:14 shadow postgres[7219]: [21-3] process exited abnormally and possibly corrupted shared memory. .... Jun 5 17:58:14 shadow postgres[7525]: [24-1] LOG: database system was interrupted at 2006-06-05 17:58:01 UTC Jun 5 17:58:14 shadow postgres[7525]: [25-1] LOG: checkpoint record is at 16/DA1D6868 Jun 5 17:58:14 shadow postgres[7525]: [26-1] LOG: redo record is at 16/DA0081D8; undo record is at 0/0; shutdown FALSE Jun 5 17:58:14 shadow postgres[7525]: [27-1] LOG: next transaction ID: 5691748; next OID: 148615 Jun 5 17:58:14 shadow postgres[7525]: [28-1] LOG: next MultiXactId: 4; next MultiXactOffset: 10 Jun 5 17:58:14 shadow postgres[7525]: [29-1] LOG: database system was not properly shut down; automatic recovery in progress Jun 5 17:58:14 shadow postgres[7525]: [30-1] LOG: redo starts at 16/DA0081D8 Jun 5 17:58:14 shadow postgres[7525]: [31-1] LOG: record with zero length at 16/DC2858F8 Jun 5 17:58:14 shadow postgres[7525]: [32-1] LOG: redo done at 16/DC2858C8 Jun 5 17:58:16 shadow postgres[7542]: [24-1] FATAL: the database system is starting up Jun 5 17:58:16 shadow postgres[7543]: [24-1] FATAL: the database system is starting up Jun 5 17:58:24 shadow postgres[7525]: [33-1] LOG: database system is ready Jun 5 17:58:24 shadow postgres[7525]: [34-1] LOG: transaction ID wrap limit is 1078513827, limited by database "xmail" Jun 5 17:59:24 shadow postgres[7632]: [24-1] LOG: unexpected EOF on client connection Jun 5 17:59:24 shadow postgres[7768]: [24-1] LOG: unexpected EOF on client connection Jun 5 17:59:37 shadow postgres[7836]: [24-1] WARNING: there is no transaction in progress
Travis Cross <travis@crosswirecorp.com> writes: > I'm noticing that a handful (4-16) of rows with duplicate columns > (uid,token) are sneaking into the table every day despite the > primary key constraint. Corrupt index, looks like ... you might try reindexing the index. I don't believe that the PANIC you show has anything directly to do with duplicate entries. It is a symptom of corrupt index structure. Now a corrupt index might also explain failure to notice duplications, but changing your application isn't going to fix whatever is causing it. You need to look for server-side causes. Any database or system crashes on this server (before this problem started)? Do you *know* that the disk drive will not lie about write complete? What is the platform and storage system, anyway? regards, tom lane
On 6/6/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Travis Cross <travis@crosswirecorp.com> writes: > > I'm noticing that a handful (4-16) of rows with duplicate columns > > (uid,token) are sneaking into the table every day despite the > > primary key constraint. > > Corrupt index, looks like ... you might try reindexing the index. > > I don't believe that the PANIC you show has anything directly to do > with duplicate entries. It is a symptom of corrupt index structure. > Now a corrupt index might also explain failure to notice duplications, > but changing your application isn't going to fix whatever is causing > it. You need to look for server-side causes. > > Any database or system crashes on this server (before this problem > started)? Do you *know* that the disk drive will not lie about write > complete? What is the platform and storage system, anyway? FWIW I've seen similar behaviour to this (PostgreSQL processes exiting "abnormally", index corruption with duplicate primary keys) on servers with defective RAM chips. Ian Barwick
Kenneth Marshall wrote: > We have used postgresql 7.4, 8.0, and 8.1 with DSPAM and have > never had a single problem like you are describing. In the past > on this mailing list, these sorts of issues have been caused by > hardware problems on the DB server in some cases. Good luck with > tracking it down. Thanks. It is always good to know that something *should* and *does* work well. That gives me a good base for tracking down the issue. Cheers, -- Travis
Tom Lane wrote: > Travis Cross <travis@crosswirecorp.com> writes: >> I'm noticing that a handful (4-16) of rows with duplicate columns >> (uid,token) are sneaking into the table every day despite the >> primary key constraint. > > Corrupt index, looks like ... you might try reindexing the index. I probably should have mentioned that I have indeed done a REINDEX on the table a couple of times in the past, suspecting that issue, and having seen it resolve similar issues on this list. Upon your suggestion, I'm running one right now, and I will probably dump and reload the entire database after hours, unless anyone thinks that would be a bad idea (or unproductive in tracking this down). > I don't believe that the PANIC you show has anything directly to do > with duplicate entries. It is a symptom of corrupt index structure. > Now a corrupt index might also explain failure to notice duplications, > but changing your application isn't going to fix whatever is causing > it. You need to look for server-side causes. Indeed, you are correct. I should also mention that the problem seems to build over time, in the sense that everything will run fine for awhile (a few days), and then will crash repeatedly. Deleting the duplicate rows seems to reset the counter -- of course, I cannot run a successful REINDEX until I have deleted those duplicate rows. > Any database or system crashes on this server (before this problem > started)? No. In fact, this box, and a sister box running similar hardware, have been models of system stability. My uptimes are 46 and 87 days, respectively, representing the time since I've done a kernel upgrade and the time since I plugged the boxes into the rack. The sister box is running real-time voice services. > Do you *know* that the disk drive will not lie about write > complete? "Know" is such a strong word ;) Honestly, I have very little idea. I understand the nature of the problem this presents,as I've read the very fine PostgreSQL manual many times over the years. Because the drives I use are specifically designed to operate well in a RAID environment, I would 'hope' that the drives perform honest write operations. I wonder if there is a utility to perform a deterministic test of this... > What is the platform and storage system, anyway? The platform is: Linux 2.6.16.9 (w/o loadable modules) Supermicro PDSMi (a single processor P-D board) 2G ECC DDRII SDRAM The storage system is: On-board SATA ICH7R Controller 2 x WD3200SD hard drives running in a Linux RAID 1 configuration. That is to say: Western Digital 320G SATA 'enterprise' drives. The drives have a somewhat unique feature: time-limited error recovery, which is supposed to let the RAID controller/software deal with errors after a certain point (7 seconds), rather than continuing to block, and causing the drive to fall out of the array. The drive: http://www.westerndigital.com/en/products/products.asp?driveid=114&language=en I'll run file system consistency checks tonight to see if I can pick out a proximal cause for all this chaos. I really do appreciate the assistance. Cheers, -- Travis
Ian Barwick wrote: > On 6/6/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Travis Cross <travis@crosswirecorp.com> writes: >> > I'm noticing that a handful (4-16) of rows with duplicate columns >> > (uid,token) are sneaking into the table every day despite the >> > primary key constraint. >> >> Corrupt index, looks like ... you might try reindexing the index. >> >> I don't believe that the PANIC you show has anything directly to do >> with duplicate entries. It is a symptom of corrupt index structure. >> Now a corrupt index might also explain failure to notice duplications, >> but changing your application isn't going to fix whatever is causing >> it. You need to look for server-side causes. >> >> Any database or system crashes on this server (before this problem >> started)? Do you *know* that the disk drive will not lie about write >> complete? What is the platform and storage system, anyway? > > FWIW I've seen similar behaviour to this (PostgreSQL processes exiting > "abnormally", index corruption with duplicate primary keys) on servers > with defective RAM chips. That's a good thought, and also a possibility. I tend to distrust RAM inherently. If nothing else seems verifiable as the cause, I may have to take the system down on a Saturday night for a good thorough run through memtest86. Cheers, -- Travis
On Tue, Jun 06, 2006 at 12:58:10PM -0400, Travis Cross wrote: > >Do you *know* that the disk drive will not lie about write > >complete? > > "Know" is such a strong word ;) Honestly, I have very little idea. > I understand the nature of the problem this presents, as I've read > the very fine PostgreSQL manual many times over the years. > > Because the drives I use are specifically designed to operate well > in a RAID environment, I would 'hope' that the drives perform honest > write operations. > > I wonder if there is a utility to perform a deterministic test of > this... Brad from livejournal.com wrote a utility that does just that, thought it requires 2 machines to run the test. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Travis, We have used postgresql 7.4, 8.0, and 8.1 with DSPAM and have never had a single problem like you are describing. In the past on this mailing list, these sorts of issues have been caused by hardware problems on the DB server in some cases. Good luck with tracking it down. Ken On Tue, Jun 06, 2006 at 05:45:26AM -0400, Travis Cross wrote: > I have a table that I am using to store email token data for DSPAM. > > I'm noticing that a handful (4-16) of rows with duplicate columns > (uid,token) are sneaking into the table every day despite the > primary key constraint. > > The server currently processes a few thousand emails per day, and > this particular table currently has about 4.5 million rows in it. > > I feel as though I must be missing something here, because I have > always strongly assumed that postgresql prevents this sort of chaos > from happening by default. > > When these duplicate pairs make their way into the table, all havoc > breaks loose ;) The rows with the duplicate pairs seem to become > land mines. The postgresql process handling a query that > subsequently runs across one of these rows dies, taking down > the DSPAM daemon with it, and sometimes corrupting the postgresql > shared memory enough that the postmaster has to shutdown the other > processes and restart everything anew [1]. > > I am usually able to clean things up by running the following, but > once or twice I've had to drop the unique constraint before > postgresql would process the request without choking: > > delete from dspam_token_data > where row(uid,token) in > (select uid,token > from dspam_token_data > group by uid,token > having count(*) > 1); > > (I don't worry about preserving one of the duplicates here.) > > I'm running postgresql-8.1.3. Here is the table in question: > > CREATE TABLE dspam.dspam_token_data > ( > uid int4 NOT NULL, > token int8 NOT NULL, > spam_hits int4, > innocent_hits int4, > last_hit date, > CONSTRAINT dspam_token_data_pkey PRIMARY KEY (uid, token) > ) > WITHOUT OIDS; > > [2] > > What steps can I take to fix this? Is there any additional > information I can provide? > > I've cleaned the table many times now. I then run VACUUM ANALYZE. > > My next step will probably be to hack the DSPAM sources to make the > application more careful about not trying to insert rows that would > violate the unique constraint. Even still, it seems that would only > reduce the frequency of these occurrences, not eliminate them > completely. > > Thanks! > > Cheers, > > -- Travis > > > ---- > Notes: > [1] A condensed log file showing off exactly what happens here is > attached. > > [2] Previously, the table lacked a primary key and instead used a > unique constraint and index. This yielded the exact same results I > am currently seeing using a two-column primary key, as above. The > old table schema was: > > CREATE TABLE dspam.dspam_token_data > ( > uid int4, > token int8, > spam_hits int4, > innocent_hits int4, > last_hit date, > CONSTRAINT dspam_token_data_uid_key UNIQUE (uid, token) > ) > WITHOUT OIDS; > > Jun 5 17:58:14 shadow postgres[28775]: [21-1] PANIC: right sibling's left-link doesn't match > Jun 5 17:58:14 shadow postgres[28775]: [21-2] STATEMENT: PREPARE dspam_update_plan (bigint) AS UPDATE dspam_token_dataSET last_hit = CURRENT_DATE, innocent_hits = > Jun 5 17:58:14 shadow postgres[28775]: [21-3] innocent_hits + 1 WHERE uid = '1' AND token = $1;PREPARE dspam_insert_plan(bigint, int, int) AS INSERT INTO dspam_token_data > Jun 5 17:58:14 shadow postgres[28775]: [21-4] (uid, token, spam_hits, innocent_hits, last_hit) VALUES (1, $1, $2, $3,CURRENT_DATE); > Jun 5 17:58:14 shadow dspam[7780]: query error: (null): see sql.errors for more details > Jun 5 17:58:14 shadow postgres[10566]: [21-1] LOG: server process (PID 28775) was terminated by signal 6 > Jun 5 17:58:14 shadow postgres[10566]: [22-1] LOG: terminating any other active server processes > Jun 5 17:58:14 shadow postgres[7219]: [21-1] WARNING: terminating connection because of crash of another server process > Jun 5 17:58:14 shadow postgres[7219]: [21-2] DETAIL: The postmaster has commanded this server process to roll back thecurrent transaction and exit, because another server > Jun 5 17:58:14 shadow postgres[7219]: [21-3] process exited abnormally and possibly corrupted shared memory. > .... > Jun 5 17:58:14 shadow postgres[7525]: [24-1] LOG: database system was interrupted at 2006-06-05 17:58:01 UTC > Jun 5 17:58:14 shadow postgres[7525]: [25-1] LOG: checkpoint record is at 16/DA1D6868 > Jun 5 17:58:14 shadow postgres[7525]: [26-1] LOG: redo record is at 16/DA0081D8; undo record is at 0/0; shutdown FALSE > Jun 5 17:58:14 shadow postgres[7525]: [27-1] LOG: next transaction ID: 5691748; next OID: 148615 > Jun 5 17:58:14 shadow postgres[7525]: [28-1] LOG: next MultiXactId: 4; next MultiXactOffset: 10 > Jun 5 17:58:14 shadow postgres[7525]: [29-1] LOG: database system was not properly shut down; automatic recovery in progress > Jun 5 17:58:14 shadow postgres[7525]: [30-1] LOG: redo starts at 16/DA0081D8 > Jun 5 17:58:14 shadow postgres[7525]: [31-1] LOG: record with zero length at 16/DC2858F8 > Jun 5 17:58:14 shadow postgres[7525]: [32-1] LOG: redo done at 16/DC2858C8 > Jun 5 17:58:16 shadow postgres[7542]: [24-1] FATAL: the database system is starting up > Jun 5 17:58:16 shadow postgres[7543]: [24-1] FATAL: the database system is starting up > Jun 5 17:58:24 shadow postgres[7525]: [33-1] LOG: database system is ready > Jun 5 17:58:24 shadow postgres[7525]: [34-1] LOG: transaction ID wrap limit is 1078513827, limited by database "xmail" > Jun 5 17:59:24 shadow postgres[7632]: [24-1] LOG: unexpected EOF on client connection > Jun 5 17:59:24 shadow postgres[7768]: [24-1] LOG: unexpected EOF on client connection > Jun 5 17:59:37 shadow postgres[7836]: [24-1] WARNING: there is no transaction in progress > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match