Thread: problems with table corruption continued
Okay, here's a follow-up to my previous messages "ACK table corrupted, unique index violated."

I've been trying to clean up the corruption that I mentioned earlier. I felt most comfortable shutting down all my application servers, restarting postgres, doing a dump of my database, and rebuilding it with a pginit and complete reload. So far so good. Then I went to fix one of the corrupted tables and had another strange experience. I'm still looking into other possibilities such as a hardware failure, but I thought this might be interesting or helpful in the context of my previous post. Basically, the table with duplicate oid/id values now has unique oids from the reload, so I'm going to delete the duplicate rows and recreate the unique index on the identity column.

basement=# select count(*),developer_aka_id from developer_aka group by developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
     2 |             9789
     2 |            10025
     2 |            40869
(3 rows)

basement=# select oid,* from developer_aka where developer_aka_id in (9789,10025,40869);
  oid  | developer_id | developer_aka_id |    first_name     | last_name
-------+--------------+------------------+-------------------+-----------
 48390 |         1916 |             9789 | Chris             | Smith
 48402 |        35682 |            40869 | Donald "Squirral" | Fisk
 48425 |         4209 |            10025 | Mike              | Glosecki
 48426 |         1916 |             9789 | Chris             | Smith
 48427 |        35682 |            40869 | Donald "Squirral" | Fisk
 48428 |         4209 |            10025 | Mike              | Glosecki
(6 rows)

basement=# delete from developer_aka where oid in (48390,48402,48425);
DELETE 3
basement=# select count(*),developer_aka_id from developer_aka group by developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
(0 rows)

basement=# create unique index developer_aka_pkey on developer_aka(developer_aka_id);
CREATE
basement=# VACUUM ANALYZE developer_aka;
ERROR:  Cannot insert a duplicate key into unique index developer_aka_pkey
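The manual cleanup above picks survivors by eyeballing oids. A more general form of the same idea, sketched here under the assumption that the table still carries oids (the default for user tables in 7.1) and that keeping the lowest oid of each duplicate group is acceptable:

    -- Delete every row for which a lower-oid row with the same
    -- developer_aka_id exists; exactly one row per id survives.
    DELETE FROM developer_aka
     WHERE EXISTS (SELECT 1
                     FROM developer_aka d2
                    WHERE d2.developer_aka_id = developer_aka.developer_aka_id
                      AND d2.oid < developer_aka.oid);

Either way, the unique index should only be recreated once the duplicate check above returns zero rows.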
Tom & All:

I've been looking into this problem some more and I've been able to consistently reproduce the error. I've tested it on two different machines, both running 7.1.3 on an i686-pc-linux-gnu configuration. One machine is RH7.2 and the other is RH6.2. I still haven't been able to reproduce the problem with corrupted indexes/tables (NUMBER OF TUPLES IS NOT THE SAME AS HEAP -- duplicate rows with same oid & pkey), but I'm hopeful that these two problems are related. Also, I wanted to let you know that the same two tables became corrupted again on 12-15-2001, after my 12-12-2001 reload.

I've attached three files: a typescript and two sql files. I found that if the commands were combined into a single file and run in a single psql session, the error does not occur -- so it's important to follow the commands exactly as they are in the typescript. If for some reason the errors aren't reproducible on your machines, let me know and we will try to find out what's unique about my setup.

Thanks.
"Brian Hirt" <bhirt@mobygames.com> writes: > I've attached three files, a typescript, and two sql files. I found that if > the commands were combined into a single file and run in a single pgsql > session, the error does not occur -- so it's important to follow the > commands exactly like they are in the typescript. If for some reason the > errors aren't reproducable on your machines, let me know and we will try to > find out what's unique about my setup. Indeed, it reproduces just fine here --- not only that, but I get an Assert failure when I try it in current sources. Many, many thanks! I shall start digging forthwith... regards, tom lane
Great. I'm also trying to create a reproducible test case for the original problem I reported with duplicate rows/oids/pkeys. Maybe both problems are a result of the same bug; I don't know.

BTW, if you remove the index "developer_aka_search_idx", which is built on a plpgsql function, the problem is not reproducible. But you've probably figured that out by now.
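The immediate workaround implied here is simply to drop the functional index until the underlying bug is understood. The index name is taken from Brian's messages, so adjust for your own schema:

    -- Removing the function-backed index avoids the VACUUM interaction
    -- that Tom diagnoses next.
    DROP INDEX developer_aka_search_idx;

The plain unique index on developer_aka_id is not the problem; it is the function-backed index that triggers it.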
"Brian Hirt" <bhirt@mobygames.com> writes: > [ example case that creates a duplicate tuple ] Got it. The problem arises in this section of vacuum.c, near the bottom of repair_frag: if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED)) { if ((TransactionId)tuple.t_data->t_cmin != myXID) elog(ERROR, "Invalid XID in t_cmin (3)"); if (tuple.t_data->t_infomask & HEAP_MOVED_OFF) { itemid->lp_flags &= ~LP_USED; num_tuples++; } else elog(ERROR,"HEAP_MOVED_OFF was expected (2)"); } This is trying to get rid of the original copy of a tuple that's been moved to another page. The problem is that your index function causes a table scan, which means that by the time control gets here, someone else has looked at this tuple and marked it good --- so the initial test of HEAP_XMIN_COMMITTED fails, and the tuple is never removed! I would say that it's incorrect for vacuum.c to assume that HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN tuples during the course of vacuum's processing; after all, the xmin definitely does refer to a committed xact, and we can't realistically assume that we know what processing will be induced by user-defined index functions. Vadim, what do you think? How should we fix this? In the meantime, Brian, I think you ought to get rid of your index CREATE INDEX "developer_aka_search_idx" on "developer_aka" using btree ( developer_aka_search_name ("developer_aka_id")"varchar_ops" ); I do not think this index is well-defined: an index function ought to have the property that its output depends solely on its input and cannot change over time. This function cannot make that claim. (In 7.2, you can't even create the index unless you mark the function iscachable, which is really a lie.) I'm not even real sure what you expect the index to do for you... I do not know whether this effect explains all the reports of duplicate tuples we've seen in the last few weeks. We need to ask whether the other complainants were using index functions that tried to do table scans. regards, tom lane
"Brian Hirt" <bhirt@mobygames.com> writes: > Great, I'm also trying to create a reproducable test case for the original > problem i reported with duplicate rows/oids/pkeys. Maybe both problems are > a result of the same bug; i don't know. Were the duplicate rows all in tables that had functional indexes based on functions similar to developer_aka_search_name? The problem we're seeing here seems to be due to VACUUM not being able to cope with the side effects of the SELECT inside the index function. regards, tom lane
Yes, both tables had similar functions, and the corruption was limited to only those two tables.
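For anyone auditing a database for the same exposure, the system catalogs can list every functional index together with the function behind it. A sketch against the 7.1-era catalog, where pg_index.indproc holds the function OID for functional indexes and is zero otherwise (verify the column against your own pg_index before relying on this):

    -- List functional indexes and the functions they call.
    SELECT c.relname AS table_name,
           i.relname AS index_name,
           p.proname AS function_name
      FROM pg_index x, pg_class c, pg_class i, pg_proc p
     WHERE x.indproc <> 0
       AND c.oid = x.indrelid
       AND i.oid = x.indexrelid
       AND p.oid = x.indproc;

Any function this turns up that performs its own SELECTs is a candidate for the corruption pattern discussed in this thread.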
Tom,

I'm glad you found the cause. I had already deleted the index a few days back, after I strongly suspected it was related to the problem. The reason I created the index was to help locate a record based on the name of a person with an index lookup instead of a sequential scan on a table with several hundred thousand rows. I was trying to avoid adding additional computed fields to the tables and maintaining them with triggers, then indexing and searching on those fields. The index function seemed like an elegant solution to the problem, although at the time I was completely unaware of 'iscachable'.

Do you think this might also explain the following errors I was seeing?

NOTICE:  Cannot rename init file
/moby/pgsql/base/156130/pg_internal.init.19833 to
/moby/pgsql/base/156130/pg_internal.init: No such file or directory

--brian
> This is trying to get rid of the original copy of a tuple that's been
> moved to another page.  The problem is that your index function causes
> a table scan, which means that by the time control gets here, someone
> else has looked at this tuple and marked it good --- so the initial
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> test of HEAP_XMIN_COMMITTED fails, and the tuple is never removed!
>
> I would say that it's incorrect for vacuum.c to assume that
> HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
> tuples during the course of vacuum's processing; after all, the xmin
> definitely does refer to a committed xact, and we can't realistically
> assume that we know what processing will be induced by user-defined
> index functions.  Vadim, what do you think?  How should we fix this?

But it's not correct for the table scan to mark the tuple as good,
either. It looks like we have to add checks for the case
TransactionIdIsCurrentTransactionId(tuple->t_cmin), when HEAP_MOVED_OFF
or HEAP_MOVED_IN is set in t_infomask, to all the HeapTupleSatisfies*
routines in tqual.c, as we already do in HeapTupleSatisfiesDirty --
note the comments about unique btrees there.

Vadim
"Brian Hirt" <bhirt@mobygames.com> writes: > I was trying to avoid adding additional computed fields to the tables and > maintaining them with triggers, indexing and searching on them. The index > function seemed like an elegant solution to the problem Understood, but can you write the index function in a way that avoids having it do a SELECT to get at data that it hasn't been passed? I'm wondering if you can't define the function as justf(first_name, last_name) = upper(first_name || ' ' || last_name) and create the index on f(first_name, last_name). You haven't shown us the queries you expect the index to be helpful for, so maybe this is not workable... regards, tom lane
"Brian Hirt" <bhirt@mobygames.com> writes: > Do you think this might also explain the following errors i was seeing? > NOTICE: Cannot rename init file > /moby/pgsql/base/156130/pg_internal.init.19833 to > /moby/pgsql/base/156130/pg_internal.init: No such file or directory No, that error is not coming from anywhere near VACUUM; it's from relcache startup (see write_irels in src/backend/utils/cache/relcache.c). The rename source file has just been created in the very same routine, so it's difficult to see how one would get a "No such file" failure from rename(). It is possible that several backends create temp init files at the same time and all try to rename their own temp files into place as the pg_internal.init file. However, that should work: the rename man page says The rename() system call causes the source file to be renamed to target. If target exists, it is first removed. Both source and target must be of the same type (that is, either directories or nondirectories), and must reside onthe same file system. If target can be created or if it existed before the call, rename() guarantees that an instance of target will exist,even if the system crashes in the midst of the operation. so we should end up with the extra copies deleted and just one init file remaining after the slowest backend renames its copy into place. Do you by chance have your database directory mounted via NFS, or some other strange filesystem where the normal semantics of concurrent rename might not work quite right? FWIW, I believe this condition is pretty harmless, but it would be nice to understand exactly why you're seeing the message. regards, tom lane
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> I would say that it's incorrect for vacuum.c to assume that >> HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN >> tuples during the course of vacuum's processing; after all, the xmin >> definitely does refer to a committed xact, and we can't realistically >> assume that we know what processing will be induced by user-defined >> index functions. Vadim, what do you think? How should we fix this? > But it's incorrect for table scan to mark tuple as good neither. Oh, that makes sense. > Looks like we have to add checks for the case > TransactionIdIsCurrentTransactionId(tuple->t_cmin) when > there is HEAP_MOVED_OFF or HEAP_MOVED_IN in t_infomask to > all HeapTupleSatisfies* in tqual.c as we do in > HeapTupleSatisfiesDirty - note comments about uniq btree-s there. Sounds like a plan. Do you want to work on this, or shall I? regards, tom lane
Tom,

I probably could have written the function like you suggest in this email, and I probably will look into doing so. This was the first time I tried creating an index function, and one of the few times I've used plpgsql. In general I have very little experience with doing this kind of stuff in postgres (I try to stick to standard SQL as much as possible), and it looks like I stumbled onto this problem because of a bad design decision on my part and a lack of understanding of how index functions work. Thanks for the suggestion.
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: >> Sounds like a plan. Do you want to work on this, or shall I? > Sorry, I'm not able to do this fast enough for current pre-release stage -:( Okay. I'll work on a patch and send it to you for review, if you like. regards, tom lane
Well, when my application starts, about 100 backend database connections start up at the same time, so this fits the circumstances you describe. However, I'm just using a standard ext2 filesystem on a 2.2 Linux kernel. It's good to know that this error should be harmless.
> > It looks like we have to add checks for the case
> > TransactionIdIsCurrentTransactionId(tuple->t_cmin), when HEAP_MOVED_OFF
> > or HEAP_MOVED_IN is set in t_infomask, to all the HeapTupleSatisfies*
> > routines in tqual.c, as we already do in HeapTupleSatisfiesDirty --
> > note the comments about unique btrees there.
>
> Sounds like a plan.  Do you want to work on this, or shall I?

Sorry, I'm not able to do this fast enough for the current pre-release
stage -:(

Vadim
> -----Original Message-----
> From: Tom Lane
>
> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> I would say that it's incorrect for vacuum.c to assume that
> >> HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
> >> tuples during the course of vacuum's processing; after all, the xmin
> >> definitely does refer to a committed xact, and we can't realistically
> >> assume that we know what processing will be induced by user-defined
> >> index functions.  Vadim, what do you think?  How should we fix this?
>
> > But it's not correct for the table scan to mark the tuple as good,
> > either.
>
> Oh, that makes sense.
>
> > It looks like we have to add checks for the case
> > TransactionIdIsCurrentTransactionId(tuple->t_cmin), when HEAP_MOVED_OFF
> > or HEAP_MOVED_IN is set in t_infomask, to all the HeapTupleSatisfies*
> > routines in tqual.c, as we already do in HeapTupleSatisfiesDirty --
> > note the comments about unique btrees there.

Should the change be TransactionIdIsInProgress(tuple->t_cmin)?

The cause of Brian's issue was exactly what I was afraid of. VACUUM is
guarded by AccessExclusiveLock, but IMHO we shouldn't rely on that too
heavily.

regards,
Hiroshi Inoue
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Should the change be TransactionIdIsInProgress(tuple->t_cmin) ? I'd be willing to doif (tuple->t_cmin is my XID) do something;Assert(!TransactionIdIsInProgress(tuple->t_cmin)); if that makes you feel better. But anything that's scanning a table exclusive-locked by another transaction is broken. (BTW: contrib/pgstattuple is thus broken. Will fix.) regards, tom lane
Tom Lane wrote:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Should the change be TransactionIdIsInProgress(tuple->t_cmin)?
>
> I'd be willing to do
>     if (tuple->t_cmin is my XID)
>         do something;
>     Assert(!TransactionIdIsInProgress(tuple->t_cmin));
> if that makes you feel better.

An Assert is useless in hard-to-reproduce cases. I've thought about
proposing this change many times but always gave up, and my decision
seems to have been right, because even after this issue I see only an
*Assert*.

> But anything that's scanning
> a table exclusive-locked by another transaction is broken.
> (BTW: contrib/pgstattuple is thus broken.  Will fix.)

It seems dangerous to me to rely only on developers' carefulness.
There are some places where relations are opened with NoLock; we had
better change them. I once examined them, but AFAIR there are few
cases where relations are really opened with NoLock -- in most cases
they were already opened previously with other lock modes. I don't
remember whether there was a really dangerous place (one that scans a
relation opened with NoLock).

regards,
Hiroshi Inoue
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> I don't remember whether there was a really dangerous place (one that
> scans a relation opened with NoLock).

I have just looked for that, and the only such place is in the new
contrib module pgstattuple.  This is clearly broken, since there is
nothing stopping someone from (eg) dropping the relation while
pgstattuple is scanning it.  I think it should get AccessShareLock on
the relation.

The ri_triggers code has a lot of places that open things NoLock,
but it only looks into the relcache entry and doesn't try to scan
the relation.  Nonetheless that code bothers me; we could be using
an obsolete relcache entry if someone has just committed an ALTER
TABLE on the relation.  Some of the cases may be safe because a lock
is held higher up (eg, on the table from which the trigger was fired)
but I doubt they all are.

			regards, tom lane
On Tue, 18 Dec 2001, Tom Lane wrote: > The ri_triggers code has a lot of places that open things NoLock, > but it only looks into the relcache entry and doesn't try to scan > the relation. Nonetheless that code bothers me; we could be using > an obsolete relcache entry if someone has just committed an ALTER > TABLE on the relation. Some of the cases may be safe because a lock > is held higher up (eg, on the table from which the trigger was fired) > but I doubt they all are. Probably not, since it looks like that's being done for the other table of the constraint (not the one on which the trigger was fired).
Stephan Szabo wrote:
>
> On Tue, 18 Dec 2001, Tom Lane wrote:
>
> > The ri_triggers code has a lot of places that open things NoLock,
> > but it only looks into the relcache entry and doesn't try to scan
> > the relation.  Nonetheless that code bothers me; we could be using
> > an obsolete relcache entry if someone has just committed an ALTER
> > TABLE on the relation.  Some of the cases may be safe because a lock
> > is held higher up (eg, on the table from which the trigger was fired)
> > but I doubt they all are.
>
> Probably not, since it looks like that's being done for the other table
> of the constraint (not the one on which the trigger was fired).

If a lock is held already, acquiring an AccessShareLock would cause no
additional conflict. I don't see any reason to walk a tightrope with
NoLock intentionally.

regards,
Hiroshi Inoue
On Wed, 19 Dec 2001, Hiroshi Inoue wrote:

> If a lock is held already, acquiring an AccessShareLock would cause no
> additional conflict. I don't see any reason to walk a tightrope with
> NoLock intentionally.

I don't know why NoLock was used there; I was just pointing out that the
odds of a lock being held higher up are probably low.
Okay, I've applied the attached patch to tqual.c.  This brings all the
variants of HeapTupleSatisfies up to speed: I have compared them
carefully and they now all make equivalent series of tests on the tuple
status (though they may interpret the results differently).  I decided
that Hiroshi had a good point about testing TransactionIdIsInProgress,
so the code now does that before risking a change to t_infomask, even
in the VACUUM cases.

			regards, tom lane

*** src/backend/utils/time/tqual.c.orig	Mon Oct 29 15:31:08 2001
--- src/backend/utils/time/tqual.c	Wed Dec 19 12:09:32 2001
***************
*** 31,37 ****
  /*
   * HeapTupleSatisfiesItself
   *		True iff heap tuple is valid for "itself."
!  *		"{it}self" means valid as of everything that's happened
   *		in the current transaction, _including_ the current command.
   *
   * Note:
--- 31,37 ----
  /*
   * HeapTupleSatisfiesItself
   *		True iff heap tuple is valid for "itself."
!  *		"itself" means valid as of everything that's happened
   *		in the current transaction, _including_ the current command.
   *
   * Note:
***************
*** 53,59 ****
  bool
  HeapTupleSatisfiesItself(HeapTupleHeader tuple)
  {
- 
  	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
  	{
  		if (tuple->t_infomask & HEAP_XMIN_INVALID)
--- 53,58 ----
***************
*** 61,86 ****
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
  			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return true;
  			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return true;
  			return false;
  		}
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
--- 60,102 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
  			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return true;
+ 
+ 			Assert(TransactionIdIsCurrentTransactionId(tuple->t_xmax));
+ 
  			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return true;
+ 
  			return false;
  		}
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
***************
*** 89,97 ****
  			tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
! 	/* the tuple was inserted validly */
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return true;
--- 105,115 ----
  			tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
! 	/* by here, the inserting transaction has committed */
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return true;
***************
*** 117,123 ****
  		return true;
  	}
  
! 	/* by here, deleting transaction has committed */
  	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
--- 135,141 ----
  		return true;
  	}
  
! 	/* xmax transaction committed */
  	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
***************
*** 177,194 ****
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
--- 195,225 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
***************
*** 215,221 ****
  			tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
--- 246,253 ----
  			tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
***************
*** 257,348 ****
  }
  
  int
! HeapTupleSatisfiesUpdate(HeapTuple tuple)
  {
! 	HeapTupleHeader th = tuple->t_data;
  
  	if (AMI_OVERRIDE)
  		return HeapTupleMayBeUpdated;
  
! 	if (!(th->t_infomask & HEAP_XMIN_COMMITTED))
  	{
! 		if (th->t_infomask & HEAP_XMIN_INVALID)		/* xid invalid or aborted */
  			return HeapTupleInvisible;
  
! 		if (th->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) th->t_cmin))
! 			{
! 				th->t_infomask |= HEAP_XMIN_INVALID;
  				return HeapTupleInvisible;
  			}
  		}
! 		else if (th->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) th->t_cmin))
  			{
! 				th->t_infomask |= HEAP_XMIN_INVALID;
! 				return HeapTupleInvisible;
  			}
  		}
! 		else if (TransactionIdIsCurrentTransactionId(th->t_xmin))
  		{
! 			if (CommandIdGEScanCommandId(th->t_cmin))
  				return HeapTupleInvisible;	/* inserted after scan
  											 * started */
  
! 			if (th->t_infomask & HEAP_XMAX_INVALID)		/* xid invalid */
  				return HeapTupleMayBeUpdated;
  
! 			Assert(TransactionIdIsCurrentTransactionId(th->t_xmax));
  
! 			if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return HeapTupleMayBeUpdated;
  
! 			if (CommandIdGEScanCommandId(th->t_cmax))
  				return HeapTupleSelfUpdated;	/* updated after scan
  												 * started */
  			else
  				return HeapTupleInvisible;		/* updated before scan
  												 * started */
  		}
! 		else if (!TransactionIdDidCommit(th->t_xmin))
  		{
! 			if (TransactionIdDidAbort(th->t_xmin))
! 				th->t_infomask |= HEAP_XMIN_INVALID;	/* aborted */
  			return HeapTupleInvisible;
  		}
! 		th->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
  
! 	if (th->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return HeapTupleMayBeUpdated;
  
! 	if (th->t_infomask & HEAP_XMAX_COMMITTED)
  	{
! 		if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
  		return HeapTupleUpdated;	/* updated by other */
  	}
  
! 	if (TransactionIdIsCurrentTransactionId(th->t_xmax))
  	{
! 		if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
! 		if (CommandIdGEScanCommandId(th->t_cmax))
  			return HeapTupleSelfUpdated;	/* updated after scan
  											 * started */
  		else
  			return HeapTupleInvisible;	/* updated before scan started */
  	}
  
! 	if (!TransactionIdDidCommit(th->t_xmax))
  	{
! 		if (TransactionIdDidAbort(th->t_xmax))
  		{
! 			th->t_infomask |= HEAP_XMAX_INVALID;	/* aborted */
  			return HeapTupleMayBeUpdated;
  		}
  		/* running xact */
--- 289,394 ----
  }
  
  int
! HeapTupleSatisfiesUpdate(HeapTuple htuple)
  {
! 	HeapTupleHeader tuple = htuple->t_data;
  
  	if (AMI_OVERRIDE)
  		return HeapTupleMayBeUpdated;
  
! 	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
  	{
! 		if (tuple->t_infomask & HEAP_XMIN_INVALID)
  			return HeapTupleInvisible;
  
! 		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return HeapTupleInvisible;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return HeapTupleInvisible;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
! 		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return HeapTupleInvisible;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return HeapTupleInvisible;
! 				}
  			}
  		}
! 		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
! 			if (CommandIdGEScanCommandId(tuple->t_cmin))
  				return HeapTupleInvisible;	/* inserted after scan
  											 * started */
  
! 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return HeapTupleMayBeUpdated;
  
! 			Assert(TransactionIdIsCurrentTransactionId(tuple->t_xmax));
  
! 			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return HeapTupleMayBeUpdated;
  
! 			if (CommandIdGEScanCommandId(tuple->t_cmax))
  				return HeapTupleSelfUpdated;	/* updated after scan
  												 * started */
  			else
  				return HeapTupleInvisible;		/* updated before scan
  												 * started */
  		}
! 		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
! 			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return HeapTupleInvisible;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
  
! 	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return HeapTupleMayBeUpdated;
  
! 	if (tuple->t_infomask & HEAP_XMAX_COMMITTED)
  	{
! 		if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
  		return HeapTupleUpdated;	/* updated by other */
  	}
  
! 	if (TransactionIdIsCurrentTransactionId(tuple->t_xmax))
  	{
! 		if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
! 		if (CommandIdGEScanCommandId(tuple->t_cmax))
  			return HeapTupleSelfUpdated;	/* updated after scan
  											 * started */
  		else
  			return HeapTupleInvisible;	/* updated before scan started */
  	}
  
! 	if (!TransactionIdDidCommit(tuple->t_xmax))
  	{
! 		if (TransactionIdDidAbort(tuple->t_xmax))
  		{
! 			tuple->t_infomask |= HEAP_XMAX_INVALID;		/* aborted */
  			return HeapTupleMayBeUpdated;
  		}
  		/* running xact */
***************
*** 350,358 ****
  	}
  
  	/* xmax transaction committed */
! 	th->t_infomask |= HEAP_XMAX_COMMITTED;
  
! 	if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  		return HeapTupleMayBeUpdated;
  
  	return HeapTupleUpdated;	/* updated by other */
--- 396,404 ----
  	}
  
  	/* xmax transaction committed */
! 	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
! 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  		return HeapTupleMayBeUpdated;
  
  	return HeapTupleUpdated;	/* updated by other */
***************
*** 374,396 ****
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
- 			/*
- 			 * HeapTupleSatisfiesDirty is used by unique btree-s and so
- 			 * may be used while vacuuming.
- 			 */
  			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
- 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
  			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
  				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  				else
--- 420,443 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
  			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
! 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
! 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
  			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
+ 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 					return false;
  				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  				else
***************
*** 416,425 ****
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  				return false;
  			}
  			SnapshotDirty->xmin = tuple->t_xmin;
  			return true;		/* in insertion by other */
  		}
  		else
--- 463,473 ----
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  			SnapshotDirty->xmin = tuple->t_xmin;
+ 			/* XXX shouldn't we fall through to look at xmax? */
  			return true;		/* in insertion by other */
  		}
  		else
***************
*** 474,479 ****
--- 522,528 ----
  	if (AMI_OVERRIDE)
  		return true;
  
+ 	/* XXX this is horribly ugly: */
  	if (ReferentialIntegritySnapshotOverride)
  		return HeapTupleSatisfiesNow(tuple);
  
***************
*** 484,501 ****
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
--- 533,563 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
***************
*** 519,528 ****
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;		/* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/*
--- 581,591 ----
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/*
***************
*** 623,646 ****
  		return HEAPTUPLE_DEAD;
  	else if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
  		if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  		{
  			tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return HEAPTUPLE_DEAD;
  		}
- 		/* Assume we can only get here if previous VACUUM aborted, */
- 		/* ie, it couldn't still be in progress */
  		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	else if (tuple->t_infomask & HEAP_MOVED_IN)
  	{
! 		if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  		{
- 			/* Assume we can only get here if previous VACUUM aborted */
  			tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return HEAPTUPLE_DEAD;
  		}
- 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	else if (TransactionIdIsInProgress(tuple->t_xmin))
  		return HEAPTUPLE_INSERT_IN_PROGRESS;
--- 686,715 ----
  		return HEAPTUPLE_DEAD;
  	else if (tuple->t_infomask & HEAP_MOVED_OFF)
  	{
+ 		if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
+ 			return HEAPTUPLE_DELETE_IN_PROGRESS;
+ 		if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			return HEAPTUPLE_DELETE_IN_PROGRESS;
  		if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  		{
  			tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return HEAPTUPLE_DEAD;
  		}
  		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	else if (tuple->t_infomask & HEAP_MOVED_IN)
  	{
! 		if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
! 			return HEAPTUPLE_INSERT_IN_PROGRESS;
! 		if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 			return HEAPTUPLE_INSERT_IN_PROGRESS;
! 		if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 		else
  		{
  			tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return HEAPTUPLE_DEAD;
  		}
  	}
  	else if (TransactionIdIsInProgress(tuple->t_xmin))
  		return HEAPTUPLE_INSERT_IN_PROGRESS;
***************
*** 671,676 ****
--- 740,751 ----
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)
  		return HEAPTUPLE_LIVE;
  
+ 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
+ 	{
+ 		/* "deleting" xact really only marked it for update */
+ 		return HEAPTUPLE_LIVE;
+ 	}
+ 
  	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
  	{
  		if (TransactionIdIsInProgress(tuple->t_xmax))
***************
*** 698,709 ****
  	/*
  	 * Deleter committed, but check special cases.
  	 */
- 
- 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
- 	{
- 		/* "deleting" xact really only marked it for update */
- 		return HEAPTUPLE_LIVE;
- 	}
  
  	if (TransactionIdEquals(tuple->t_xmin, tuple->t_xmax))
  	{
--- 773,778 ----