Thread: 8.0 -> 8.1 dump duplicate key problem?
I have an interesting situation...am staging a mock upgrade from 8.0 to 8.1 on a bzip compressed backup image from one of our production servers. The dump image is big (270 mb compressed) and so is the file in question ~ 440k rows. Am piping the backup image to the server via psql -cd backup.bz2 | psql zyx When the dump gets to the point where the indexes/keys are built, the primary key fails to build due to duplicate key constraint failure. However, after dump is complete, I can create the p-key without any modification to the table and everything is fine. The error is reproducible...I have confirmed in two separate runs. The source 8.0 database is still up and does not report any dups in the p-key fields, which of course it shouldn't. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > When the dump gets to the point where the indexes/keys are built, the > primary key fails to build due to duplicate key constraint failure. > However, after dump is complete, I can create the p-key without any > modification to the table and everything is fine. That's pretty bizarre. What's the datatype of the key column(s)? Can you reduce it to a smaller test case, or perhaps send me the full dump off-list? (270m is a bit much for email, but web or ftp would work ... also, presumably only the pkey column is needed to generate the error ...) regards, tom lane
On Thu, 10 Nov 2005, Tom Lane wrote: > > When the dump gets to the point where the indexes/keys are built, the > > primary key fails to build due to duplicate key constraint failure. > > That's pretty bizarre. What's the datatype of the key column(s)? There was one guy on IRC that had the same problem. We didn't get that much info about it and suggested that he check the original database for corruption (rebuilding indexes) to make sure it's in a proper state. Either this is the same guy or/and we might have a bigger problem. The guy on irc also had an integer PK. -- /Dennis Björklund
> "Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > > When the dump gets to the point where the indexes/keys are built, the > > primary key fails to build due to duplicate key constraint failure. > > However, after dump is complete, I can create the p-key without any > > modification to the table and everything is fine. > > That's pretty bizarre. What's the datatype of the key column(s)? > > Can you reduce it to a smaller test case, or perhaps send me the full > dump off-list? (270m is a bit much for email, but web or ftp would > work ... also, presumably only the pkey column is needed to generate > the error ...) I am working on pairing down a test case. As you can imagine this is a nasty business. My best shot is to go to the source server and trying to dump just that file, but that may mask the problem. Come to think of it, I did make one configuration change: I bumped work_mem after the dump was loaded after noticing a lot of activity in pg_dump. Table schema follows. If the schema seems a bit odd, it is because it was converted from an ISAM file. [Dennis: I am not the guy from IRC] esp=# \d data1.parts_order_line_file Table "data1.parts_order_line_file" Column | Type | Modifiers --------------------------+-------------------------+-----------id | cuid |lastmod | ts |prl_combined_key | character(9) | not nullprl_seq_no | pic_9_3 | not nullprl_combined_key_2 | character(9) |prl_item_no | character varying(15) |prl_comment_desc | character varying(2500) |prl_location | character(4) |prl_workstation | character(4) |prl_stock_loc | character(4) |prl_qty | numeric(7,3) |prl_adj_price | numeric(8,2) |prl_cost | numeric(11,5) |prl_weight | numeric(7,2) |prl_uom | character(2) |prl_vendor_no | character(6) |prl_vendor_part_no |character varying(15) |prl_track_this_part | character(1) |prl_warranty_period | character varying(10) |prl_comments_1 | character varying(30) |prl_comments_2 | character varying(30) |prl_qty_shipped | numeric(6,2)[] |prl_qty_still_on_bo | numeric(6,2)[] |prl_qty_credited | numeric(6,2)[] |prl_credit_reason | character(2)[] |prl_credit_reason_type | character(1)[] |prl_cancel_ship | character(1)[] |prl_exchange_part | character(1) |prl_authorization_code | character varying(10) |prl_item_status | character(1) |prl_item_status_alpha | character(1) |prl_cancel_flag | character(1) |prl_charge_type_flag | character(1) |prl_ct_taxable_flag | character(1) |prl_account_cat_code | character(6) |prl_retail_price | numeric(8,2) |prl_line_needs_serials | character(1) |prl_chrg_type_ship_indx | pic_9_1 |prl_claim_type_flag | character(1) |prl_attached_wc_seq_no | pic_9_3 |prl_attached_claim_type | character varying(10) |prl_already_issued | character(1) |prl_returned_part_flag | character(1) |prl_prev_qty_shipped | numeric(6,2)[] |prl_prev_qty_still_on_bo| numeric(6,2)[] |prl_prev_qty_credited | numeric(6,2)[] | Indexes: "parts_order_line_file_pkey" PRIMARY KEY, btree (prl_combined_key, prl_seq_no) "parts_order_line_file_prl_exchange_part_key" UNIQUE, btree (prl_exchange_part, id) "parts_order_line_file_prl_item_no_key" UNIQUE, btree (prl_item_no, id) "parts_order_line_file_prl_trx_type_2_key" UNIQUE, btree (prl_combined_key_2, prl_item_no, id) CREATE DOMAIN public.pic_9_3 AS int2 DEFAULT 0 CONSTRAINT valid_range CHECK (((VALUE >= 0) AND (VALUE <= 999)));
> That's pretty bizarre. What's the datatype of the key column(s)? > > Can you reduce it to a smaller test case, or perhaps send me the full > dump off-list? (270m is a bit much for email, but web or ftp would > work ... also, presumably only the pkey column is needed to generate > the error ...) I just confirmed that there are duplicate p-keys in the source table :(. Three currently but last week there were six. Just FYI I am not 100% sure pg rebuilt the p-key with dups in it...I need to double check this. esp=# select * from esp-# ( esp(# select prl_combined_key, prl_seq_no, count(*) as c from parts_order_line_file group by 1,2 esp(# ) q where q.c > 1;prl_combined_key | prl_seq_no | c ------------------+------------+--- 00136860 | 20 | 2 00136860 | 23 | 2 00137050 | 1 | 2 (3 rows) esp=# \d parts_order_line_file Table "data1.parts_order_line_file" Column | Type | Modifiers --------------------------+-------------------------+----------- [...] Indexes: "parts_order_line_file_pkey" PRIMARY KEY, btree (prl_combined_key, prl_seq_no) "parts_order_line_file_prl_exchange_part_key" UNIQUE, btree (prl_exchange_part, id) "parts_order_line_file_prl_item_no_key" UNIQUE, btree (prl_item_no, id) "parts_order_line_file_prl_trx_type_2_key" UNIQUE, btree (prl_combined_key_2, prl_item_no, id) I keep a timestamp on every row for last modified date: esp=# select lastmod from parts_order_line_file where prl_combined_key = ' 00136860' and prl_seq_no in (20, 23); lastmod -------------------------2005-09-15 11:17:17.0622005-09-15 11:17:17.187 (2 rows) There have been no schema changes since 9/15... Merlin
On 11/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: > > I just confirmed that there are duplicate p-keys in the source table :(. > > Well, that's not very good either, but at least it narrows down the > problem. > > Do the "duplicate" rows appear to be independent insertions, or > successive states of the same logical row? Looking at their xmin/xmax > might help determine this, and if the table has OIDs then looking at the > oid would be pretty conclusive. no oid. esp=# select xmin, xmax, lastmod from parts_order_line_file where prl_combined_key = esp-# ' 00136860' and prl_seq_no in (20, 23); xmin | xmax | lastmod -----------+------+-------------------------584527952 | 0 | 2005-09-15 11:17:17.062584527961 | 0 | 2005-09-15 11:17:17.187 (2 rows) the nature of the file, unfortunately is that a record may be rewritten several times over it's lifespan, due to ISAM style resequencing on the table. the file is taken directly from user entry app so high speec race condition type behavior is unlikely, save for a order duplication aspect I need to check into. Being a ported COBOL app, tranasactions are one-record wonders, save for places where parts have been rewritten in pl/pgsql (does not apply here). Merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On 11/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Do the "duplicate" rows appear to be independent insertions, or >> successive states of the same logical row? > esp=# select xmin, xmax, lastmod from parts_order_line_file where > prl_combined_key = > esp-# ' 00136860' and prl_seq_no in (20, 23); > xmin | xmax | lastmod > -----------+------+------------------------- > 584527952 | 0 | 2005-09-15 11:17:17.062 > 584527961 | 0 | 2005-09-15 11:17:17.187 > (2 rows) I think you need to try this with enable_indexscan = 0; it should be showing us 4 rows according to your prior result, and it's only showing 2, which suggests that the indexscan is short-circuiting because it "knows" there can only be 1 result row. Also, since you're probing for more than one primary key value, please include the pkey columns in the query so we can tell what's what... regards, tom lane
On 11/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: > > esp=# select xmin, xmax, lastmod from parts_order_line_file where > > prl_combined_key = > > esp-# ' 00136860' and prl_seq_no in (20, 23); > > xmin | xmax | lastmod > > -----------+------+------------------------- > > 584527952 | 0 | 2005-09-15 11:17:17.062 > > 584527961 | 0 | 2005-09-15 11:17:17.187 > > (2 rows) > > I think you need to try this with enable_indexscan = 0; it should be right, I missed that! esp=# select prl_combined_key, prl_seq_no, xmin, xmax, lastmod from parts_order_line_file where prl_combined_key = ' 00136860' and prl_seq_no in (20, 23);prl_combined_key | prl_seq_no | xmin | xmax | lastmod ------------------+------------+-----------+------+------------------------- 00136860 | 20 | 584527952 | 0 | 2005-09-15 11:17:17.062 00136860 | 20 | 584412245 | 0 | 2005-09-15 09:31:35.381 00136860 | 23 | 584527961 | 0 | 2005-09-15 11:17:17.187 00136860 | 23 | 584415243 | 0 | 2005-09-15 09:32:18.898 merlin
On 11/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think you need to try this with enable_indexscan = 0; it should be > showing us 4 rows according to your prior result, and it's only showing one thing I forgot to mention....there is a sequence on the table. Sequence is global for all tables hooked via default on a domain..highly unlikely records were inserted back to back. esp=# select id, prl_combined_key, prl_seq_no, xmin, xmax, lastmod from parts_order_line_file where prl_combined_key = ' 00136860' and prl_seq_no in (20, 23); id | prl_combined_key | prl_seq_no | xmin | xmax | lastmod ----------+------------------+------------+-----------+------+-------------------------15077227 | 00136860 | 20 | 584412245 | 0 | 2005-09-15 09:31:35.38115077260 | 00136860 | 23 | 584415243 | 0 | 2005-09-15 09:32:18.89815082475 | 00136860 | 20 | 584527952 | 0 | 2005-09-15 11:17:17.06215082522 | 00136860 | 23 | 584527961 | 0 | 2005-09-15 11:17:17.187 (4 rows)
Merlin Moncure <mmoncure@gmail.com> writes: > esp=# select prl_combined_key, prl_seq_no, xmin, xmax, lastmod from > parts_order_line_file where prl_combined_key = ' 00136860' and > prl_seq_no in (20, 23); > prl_combined_key | prl_seq_no | xmin | xmax | lastmod > ------------------+------------+-----------+------+------------------------- > 00136860 | 20 | 584527952 | 0 | 2005-09-15 11:17:17.062 > 00136860 | 20 | 584412245 | 0 | 2005-09-15 09:31:35.381 > 00136860 | 23 | 584527961 | 0 | 2005-09-15 11:17:17.187 > 00136860 | 23 | 584415243 | 0 | 2005-09-15 09:32:18.898 OK, so the fact that they all have xmax=0 proves that none are UPDATEd versions of others, which leaves us with the presumption that there was an outright failure of duplicate-key detection during INSERT :-( The pairing seems very odd though: judging from the proximity of xmin and lastmod, the first and third rows were inserted at almost the same time, and they do *not* have equal keys; the rows they should have conflicted with were inserted some time earlier. Can you think of anything special about the client-side logic that might have created unusual behavior in this situation? Not that it's not clearly a backend-side bug, I'm just looking for a handle to attack it with. (If you can determine the exact SQL commands that were issued to create these records, it'd be great.) One question that seems interesting is whether there might have been a VACUUM working on the pkey index concurrently with the later insertions. If you can either positively confirm or positively deny that, it'd be useful information. regards, tom lane
On Mon, 2005-11-14 at 14:51 -0500, Tom Lane wrote: > Merlin Moncure <mmoncure@gmail.com> writes: > > esp=# select prl_combined_key, prl_seq_no, xmin, xmax, lastmod from > > parts_order_line_file where prl_combined_key = ' 00136860' and > > prl_seq_no in (20, 23); > > prl_combined_key | prl_seq_no | xmin | xmax | lastmod > > ------------------+------------+-----------+------+------------------------- > > 00136860 | 20 | 584527952 | 0 | 2005-09-15 11:17:17.062 > > 00136860 | 20 | 584412245 | 0 | 2005-09-15 09:31:35.381 > > 00136860 | 23 | 584527961 | 0 | 2005-09-15 11:17:17.187 > > 00136860 | 23 | 584415243 | 0 | 2005-09-15 09:32:18.898 > > OK, so the fact that they all have xmax=0 proves that none are UPDATEd > versions of others, which leaves us with the presumption that there was > an outright failure of duplicate-key detection during INSERT :-( Is there any chance it is related to the 8.0 problem I reported on Wednesday? http://groups.google.ca/group/pgsql.hackers/browse_frm/thread/a832e9919fb53f92/83e8f0bd272b8597?lnk=st&q=%22possible+savepoint+bug%22&rnum=1&hl=en#83e8f0bd272b8597 --
Rod Taylor <pg@rbt.ca> writes: > Is there any chance it is related to the 8.0 problem I reported on > Wednesday? Too soon to tell ... though one would like to think we don't have more than one bug in that area ;-). If either of you can come up with even a low-probability test case, it'd be a great help. regards, tom lane
On 11/14/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The pairing seems very odd though: judging from the proximity of xmin > and lastmod, the first and third rows were inserted at almost the same > time, and they do *not* have equal keys; the rows they should have > conflicted with were inserted some time earlier. > Can you think of anything special about the client-side logic that might > have created unusual behavior in this situation? Not that it's not > clearly a backend-side bug, I'm just looking for a handle to attack it > with. (If you can determine the exact SQL commands that were issued to > create these records, it'd be great.) > One question that seems interesting is whether there might have been a > VACUUM working on the pkey index concurrently with the later insertions. > If you can either positively confirm or positively deny that, it'd be > useful information. VACUUM is scheuled to run weekly, so we can rul that out. Can rule out exotic transaction behavor (savepoint) and long running transactions generally, save for pg_dump. In my app, all I/O save for rare pgadmin edit is with PQExecParams/PQExecPrepared through libpq. Inserts and updates are currently not ever prepared. Following is an example of the insert logged from the ISAM drver logging utility. Transaction load on the table is moderate to heavy but is 10:1 read/write ratio at least. This is of course running windows...reading Rod's mail it seems we might have similar problem (running 8.0.2). The update would be virtually the same and matching on the p-key. 0.0181023 sec: insert into parts_order_line_file (prl_combined_key, prl_seq_no, prl_combined_key_2, prl_item_no, prl_comment_desc, prl_location, prl_workstation, prl_stock_loc, prl_qty, prl_adj_price, prl_cost, prl_weight, prl_uom, prl_vendor_no, prl_vendor_part_no, prl_track_this_part, prl_warranty_period, prl_comments_1, prl_comments_2, prl_qty_shipped, prl_qty_still_on_bo, prl_qty_credited, prl_credit_reason, prl_credit_reason_type, prl_cancel_ship, prl_exchange_part, prl_authorization_code, prl_item_status, prl_item_status_alpha, prl_cancel_flag, prl_charge_type_flag, prl_ct_taxable_flag, prl_account_cat_code, prl_retail_price, prl_line_needs_serials, prl_chrg_type_ship_indx, prl_claim_type_flag, prl_attached_wc_seq_no, prl_attached_claim_type, prl_already_issued, prl_returned_part_flag, prl_prev_qty_shipped, prl_prev_qty_still_on_bo, prl_prev_qty_credited) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44) params: $1= 00001126 $2=001 $3= 00001126 $4=BAR-000006 $5=BAR $6=PL-1 $7= $8= $9= 0001.000 $10= 000016.50 $11= 000015.00000 $12=00050.00 $13=EA $14=000006 $15=BAR-000006 $16=N $17= $18= $19= $20={" 0001.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"} $21={" 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"} $22={" 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"} $23={"","","","","","","","",""} $24={"","","","","","","","",""} $25={"","","","","","","","",""} $26=N $27= $28=2 $29=A $30= $31= $32= $33= $34= 000019.80 $35= $36=0 $37= $38=000 $39= $40= $41= $42={" 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"} $43={" 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"} $44={" 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"," 0000.00"}
> Rod Taylor <pg@rbt.ca> writes: > > Is there any chance it is related to the 8.0 problem I reported on > > Wednesday? > > Too soon to tell ... though one would like to think we don't have more > than one bug in that area ;-). > > If either of you can come up with even a low-probability test case, > it'd be a great help. > > regards, tom lane I spend several hours yesterday trying all different manner of things and came up with zero. Eventually I got to the point where I felt like I was wasting my time. ISTM there is likely some type of trigger which will be difficult to catch via simulation. Our production servers are running dual opterons and my dev box isn't, so it's impossible for me to to a pefect simiulation :(. While I was initially extremely concerned about this, an internal audit confirmed there was no actual data loss...the proper row appeared to be visible in all cases, so I classify this as more of correctible annoyance, fwiw. Talked with Rod a bit off-list and the best course of action is to implement a nightly process to check for things and work from there.I'll keep you posted. Merlin
On 11/16/05, Rod Taylor <pg@rbt.ca> wrote: > On Mon, 2005-11-14 at 14:51 -0500, Tom Lane wrote: > > Merlin Moncure <mmoncure@gmail.com> writes: > > > esp=# select prl_combined_key, prl_seq_no, xmin, xmax, lastmod from > > > parts_order_line_file where prl_combined_key = ' 00136860' and > > > prl_seq_no in (20, 23); > > > prl_combined_key | prl_seq_no | xmin | xmax | lastmod > > > ------------------+------------+-----------+------+------------------------- > > > 00136860 | 20 | 584527952 | 0 | 2005-09-15 11:17:17.062 > > > 00136860 | 20 | 584412245 | 0 | 2005-09-15 09:31:35.381 > > > 00136860 | 23 | 584527961 | 0 | 2005-09-15 11:17:17.187 > > > 00136860 | 23 | 584415243 | 0 | 2005-09-15 09:32:18.898 > > > > OK, so the fact that they all have xmax=0 proves that none are UPDATEd > > versions of others, which leaves us with the presumption that there was > > an outright failure of duplicate-key detection during INSERT :-( > > I realize this doesn't help much but I have found some recently which > are from updates. The duplicates were on a table which rarely has an > insert or delete (maybe one of each per day, but tens of thousands of > updates). > > Vacuum every 30 minutes. I can't prove it (yet) but looking at the phantom rows strongly indicates the same: xmin | xmax | id | lastmod | prl_combined_key | prl_seq_no -----------+------+----------+-------------------------+------------------+-----------584412869 | 0 | 15077217 | 2005-09-1509:31:48.163 | 00136860 | 19584527952 | 0 | 15082475 | 2005-09-15 11:17:17.062 | 00136860 | 20584412245 | 0 | 15077227 | 2005-09-15 09:31:35.381 | 00136860 | 20584527955 | 0 | 15082479| 2005-09-15 11:17:17.109 | 00136860 | 21 The way my application works, rows are only ever inserted at the end of the sequence number range. If a row is deleted, the rows are resequenced in place by updating the enitre record minus the p-key. Seq# 20 and 21 have xmin and ID very close suggesting phantom was generated during resequencing operation. This is all updates of course, except for the highest seq# which is deleted. Can confirm that as of yet but am putting auditing controls in which will catch it next time. Merlin