Thread: foreign key locks
Hi,

This is v12 of the foreign key locks patch. I gave a talk about this patch at PGCon:
http://www.pgcon.org/2012/schedule/events/483.en.html
There is an article there that explains this patch.

I described the original goal of this in the thread starting here:
http://archives.postgresql.org/message-id/1290721684-sup-3951@alvh.no-ip.org
The patch itself was first submitted here:
http://archives.postgresql.org/message-id/1320343602-sup-2290@alvh.no-ip.org
There were many problems that we had to shake off during the 9.2 development cycle; this new version covers most of them.

There is a github clone here:
http://github.com/alvherre/postgres  branch fklocks
In the "compare" view, changes in this submission that weren't present in v11 start with commit 19dc85f1, here:
https://github.com/alvherre/postgres/compare/master...fklocks

I know of one significant issue left, a possible cause of nasty bugs, which is the way this patch interacts with EvalPlanQual. I said erroneously in my PGCon talk that we handle this by shutting off EPQ recursion -- this was wrong. What I actually did was stop heap_lock_tuple from following the update chain on its own, letting it do so only when called from high-level callers. Others such as EPQ, which do their own update-chain walking, do not request that the lock follow the chain. I believe this is correct. This was changed in this commit:
https://github.com/alvherre/postgres/commit/e00e6827c55128ed737172a6dd35c581d437f969

There is also the matter of a performance regression we measured in stock pgbench. I have profiled this and traced it to GetMultiXactMembers, and AFAICS the way to fix it is to improve multixact.c's cache of old multis. Right now we ditch the cache at end of transaction, which was already considered wrong in comments in the code but never fixed. I am going to make the cache entries live until the oldest Xid in each multi is below its visibility horizon (so RecentXmin for multis that only contain locks, and something like FreezeLimit for multis that contain updates).

I apologize for the size of this patch. I am not really sure that there are pieces to split out for separate review.

--
Álvaro Herrera <alvherre@alvh.no-ip.org>
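[Editor's sketch] The retention rule described above -- keep a cached multi alive until its oldest member xid falls below the applicable horizon -- can be sketched as follows. All names here (MultiCacheEntry, cache_entry_still_needed) are invented for illustration and are not the patch's identifiers; the wraparound-aware comparison follows the usual transam.c idiom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Illustrative stand-in for a multixact cache entry */
typedef struct MultiCacheEntry
{
    TransactionId oldest_xid;   /* oldest member xid in the multi */
    bool          has_update;   /* does the multi contain an updater xid? */
} MultiCacheEntry;

/* Wraparound-aware "a precedes b" test, in the style of TransactionIdPrecedes */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Locker-only multis become uninteresting once RecentXmin passes their
 * oldest member; multis carrying an update must instead survive until the
 * freeze horizon, since the update xid matters for visibility much longer.
 */
static bool
cache_entry_still_needed(const MultiCacheEntry *e,
                         TransactionId recent_xmin,
                         TransactionId freeze_limit)
{
    TransactionId horizon = e->has_update ? freeze_limit : recent_xmin;

    return !xid_precedes(e->oldest_xid, horizon);
}
```

Under this rule an entry is discarded lazily, rather than wholesale at end of transaction as the code currently does.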
Attachment
On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hi,
>
> This is v12 of the foreign key locks patch.

Hi Álvaro,

Just noticed that this patch needs a rebase because of the refactoring Tom did in ri_triggers.c.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
Jaime Casanova <jaime@2ndquadrant.com> writes:
> On Thu, Jun 14, 2012 at 11:41 AM, Alvaro Herrera
> <alvherre@alvh.no-ip.org> wrote:
> > This is v12 of the foreign key locks patch.

> Just noticed that this patch needs a rebase because of the refactoring
> Tom did in ri_triggers.c

Hold on a bit before you work on that code --- I've got one more bit of hacking I want to try before walking away from it. I did some oprofile work on Dean's example from
<CAEZATCWm8M00RA814o4DC2cD_aj44gQLb0tDdxMHA312qg7HCQ@mail.gmail.com>
and noticed that it looks like ri_FetchConstraintInfo is eating a noticeable fraction of the runtime, which is happening because it is called to deconstruct the relevant pg_constraint row for each tuple we consider firing the trigger for (and then again, if we do fire the trigger). I'm thinking it'd be worth maintaining a backend-local cache for the deconstructed data, and am going to go try that.

			regards, tom lane
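[Editor's sketch] The backend-local cache Tom describes could look roughly like this. This is a deliberately simplified sketch -- a fixed-size array with no eviction or invalidation, and invented names throughout -- whereas a real implementation would want a hash table keyed by constraint OID plus syscache-invalidation handling:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;

/* Illustrative stand-in for the deconstructed pg_constraint data */
typedef struct RIConstraintInfo
{
    Oid constraint_oid;
    int nkeys;              /* e.g. number of FK columns */
} RIConstraintInfo;

static RIConstraintInfo cache[16];      /* no eviction: sketch only */
static int cache_used = 0;
static int deconstruct_calls = 0;       /* counts the expensive path */

/* Stands in for pulling apart the pg_constraint row (the costly step) */
static RIConstraintInfo
deconstruct_constraint(Oid oid)
{
    RIConstraintInfo info;

    deconstruct_calls++;
    info.constraint_oid = oid;
    info.nkeys = 1;
    return info;
}

/* Deconstruct the row only on first use; afterwards serve from the cache */
static const RIConstraintInfo *
ri_fetch_constraint_info(Oid oid)
{
    int i;

    for (i = 0; i < cache_used; i++)
        if (cache[i].constraint_oid == oid)
            return &cache[i];

    cache[cache_used] = deconstruct_constraint(oid);
    return &cache[cache_used++];
}
```

The payoff is that the expensive deconstruction runs once per constraint per backend instead of once per candidate tuple.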
Excerpts from Tom Lane's message of mié jun 20 12:54:24 -0400 2012:
> Jaime Casanova <jaime@2ndquadrant.com> writes:
> > Just noticed that this patch needs a rebase because of the refactoring
> > Tom did in ri_triggers.c
>
> Hold on a bit before you work on that code --- I've got one more bit of
> hacking I want to try before walking away from it. I did some oprofile
> work on Dean's example and noticed that it looks like
> ri_FetchConstraintInfo is eating a noticeable fraction of the runtime,
> which is happening because it is called to deconstruct the relevant
> pg_constraint row for each tuple we consider firing the trigger for (and
> then again, if we do fire the trigger). I'm thinking it'd be worth
> maintaining a backend-local cache for the deconstructed data, and am
> going to go try that.

I had already merged when I got your email, but merging that additional change did not cause any conflicts. So here's the rebased version.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
Alvaro Herrera wrote:

> So here's the rebased version.

I found a couple of problems on `make check-world`. Attached is a patch to fix one of them. The other is in pg_upgrade, pasted below.

+ pg_upgrade -d /home/kevin/pg/master/contrib/pg_upgrade/tmp_check/data.old -D /home/kevin/pg/master/contrib/pg_upgrade/tmp_check/data -b /home/kevin/pg/master/contrib/pg_upgrade/tmp_check/install//home/kevin/pg/master/Debug/bin -B /home/kevin/pg/master/contrib/pg_upgrade/tmp_check/install//home/kevin/pg/master/Debug/bin
Performing Consistency Checks
-----------------------------
Checking current, bin, and data directories                 ok
Checking cluster versions                                   ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating catalog dump                                       ok
Checking for presence of required libraries                 ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID for new cluster                 ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster
sh: /home/kevin/pg/master/contrib/pg_upgrade/tmp_check/install//home/kevin/pg/master/Debug/bin: Permission denied
*failure*

Consult the last few lines of ""%s/pg_resetxlog" -O %u -m %u,%u "%s" > /dev/null" for
the probable cause of the failure.
Failure, exiting

make[2]: *** [check] Error 1
make[2]: Leaving directory `/home/kevin/pg/master/contrib/pg_upgrade'
make[1]: *** [check-pg_upgrade-recurse] Error 2
make[1]: Leaving directory `/home/kevin/pg/master/contrib'
make: *** [check-world-contrib-recurse] Error 2

I'm marking it as "Waiting for Author" since this seems like a "must fix", but I'll keep looking at it.

-Kevin
Attachment
Excerpts from Kevin Grittner's message of sáb jun 23 18:38:10 -0400 2012:
> Alvaro Herrera wrote:
>
> > So here's the rebased version.
>
> I found a couple problems on `make check-world`. Attached is a patch
> to fix one of them. The other is on pg_upgrade, pasted below.

Ah, I mismerged pg_upgrade. The fix is trivial, AFAICS -- a function call is missing a log file name that should be specified. Strange that the compiler didn't warn me of a broken printf format specifier. However, pg_upgrade itself has been broken by an unrelated commit, and that fix is not so trivial, so I'm stuck waiting for that to be fixed. (We really need automated pg_upgrade testing.)

Thanks for the patch for the other problem; I'll push it out.

> I'm marking it as "Waiting for Author" since this seems like a "must
> fix", but I'll keep looking at it,

Great, thanks.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alvaro Herrera's message of dom jun 24 23:24:47 -0400 2012:
> Excerpts from Kevin Grittner's message of sáb jun 23 18:38:10 -0400 2012:
> > Alvaro Herrera wrote:
> >
> > > So here's the rebased version.
> >
> > I found a couple problems on `make check-world`. Attached is a patch
> > to fix one of them. The other is on pg_upgrade, pasted below.
>
> Ah, I mismerged pg_upgrade. The fix is trivial, AFAICS -- a function
> call is missing a log file name to be specified. Strange that the
> compiler didn't warn me of a broken printf format specifier. However,
> pg_upgrade itself has been broken by an unrelated commit and that fix is
> not so trivial, so I'm stuck waiting for that to be fixed. (We really
> need automated pg_upgrade testing.)

Well, I ended up pushing the pg_upgrade fix anyway, even though I couldn't test it, because I had already committed it to my tree before your patch. If it still doesn't work after the other guys fix the outstanding problem, I'll just push a new patch.

In the meantime, here's fklocks v14, which also merges to new master, as there were several other conflicts. It passes make installcheck-world now.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
Alvaro Herrera wrote:

> here's fklocks v14, which also merges to new master as there were
> several other conflicts. It passes make installcheck-world now.

Recent commits broke it again, so here's a rebased v15. (No changes other than attempting to merge your work with the pg_controldata and pg_resetxlog changes and wrapping that FIXME in a comment.)

Using that patch, pg_upgrade completes, but pg_dumpall fails:

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
    analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
    delete_old_cluster.sh

+ pg_ctl start -l /home/kevin/pg/master/contrib/pg_upgrade/log/postmaster2.log -w
waiting for server to start.... done
server started
+ pg_dumpall
pg_dump: Dumping the contents of table "hslot" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR:  MultiXactId 115 does no longer exist -- apparent wraparound
pg_dump: The command was: COPY public.hslot (slotname, hubname, slotno, slotlink) TO stdout;
pg_dumpall: pg_dump failed on database "regression", exiting
+ pg_dumpall2_status=1
+ pg_ctl -m fast stop
waiting for server to shut down.... done
server stopped
+ [ -n 1 ]
+ echo pg_dumpall of post-upgrade database cluster failed
pg_dumpall of post-upgrade database cluster failed
+ exit 1
make[2]: *** [check] Error 1
make[2]: Leaving directory `/home/kevin/pg/master/contrib/pg_upgrade'
make[1]: *** [check-pg_upgrade-recurse] Error 2
make[1]: Leaving directory `/home/kevin/pg/master/contrib'
make: *** [check-world-contrib-recurse] Error 2

Did I merge it wrong?

-Kevin
Attachment
Excerpts from Kevin Grittner's message of mié jun 27 08:40:58 -0400 2012:
> Alvaro Herrera wrote:
>
> > here's fklocks v14, which also merges to new master as there were
> > several other conflicts. It passes make installcheck-world now.
>
> Recent commits broke it again, so here's a rebased v15. (No changes
> other than attempting to merge your work with the pg_controldata and
> pg_resetxlog changes and wrapping that FIXME in a comment.)

Oooh, so sorry -- I was supposed to have git-stashed that before producing the patch. This change cannot have caused the pg_dumpall problem below, though, because it only touched the multixact cache code.

> Using that patch, pg_upgrade completes, but pg_dumpall fails:

Hmm, something changed enough that it requires more than just a code merge. I'll look into it. Thanks for the merge.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Kevin Grittner's message of mié jun 27 08:40:58 -0400 2012:
> Alvaro Herrera wrote:
>
> > here's fklocks v14, which also merges to new master as there were
> > several other conflicts. It passes make installcheck-world now.
>
> Recent commits broke it again, so here's a rebased v15. (No changes
> other than attempting to merge your work with the pg_controldata and
> pg_resetxlog changes and wrapping that FIXME in a comment.)

Here's v16, merged to latest stuff, before attempting to fix this:

> Using that patch, pg_upgrade completes, but pg_dumpall fails:

> + pg_ctl start -l
> /home/kevin/pg/master/contrib/pg_upgrade/log/postmaster2.log -w
> waiting for server to start.... done
> server started
> + pg_dumpall
> pg_dump: Dumping the contents of table "hslot" failed: PQgetResult()
> failed.
> pg_dump: Error message from server: ERROR:  MultiXactId 115 does no
> longer exist -- apparent wraparound

I was only testing migration from an old version into the patched one, not same-version upgrades. I think I see what's happening here.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Alvaro Herrera's message of vie jun 29 19:17:02 -0400 2012:
> Excerpts from Kevin Grittner's message of mié jun 27 08:40:58 -0400 2012:
> > Recent commits broke it again, so here's a rebased v15.
>
> Here's v16, merged to latest stuff,

Really attached now. BTW you can get this on github:
https://github.com/alvherre/postgres/tree/fklocks

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
Excerpts from Alvaro Herrera's message of vie jun 29 19:17:02 -0400 2012:

> I was only testing migrating from an old version into patched, not
> same-version upgrades.
>
> I think I see what's happening here.

Okay, I have pushed the fix to github -- as I suspected, code-wise the fix was simple. I will post an updated consolidated patch later today. make check of pg_upgrade now passes for me.

It would be nice if "make installcheck" notified me that some modules' tests are being skipped ...

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Updated patch attached.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
Excerpts from Alvaro Herrera's message of jue jul 05 18:44:11 -0400 2012:
>
> Updated patch attached.

Patch rebased to latest sources. This is also on Github:
https://github.com/alvherre/postgres/tree/fklocks

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Patch v19 contains some tweaks. Most notably:

1. If an Xid requests a lock A, and then a lock B which is stronger than A, record only lock B and forget lock A. This is important for performance, because EvalPlanQual obtains ForUpdate locks on the tuples that it chases on an update chain. If EPQ was called by an update or delete, a MultiXact was previously being created.

   In a stock pgbench run this was causing lots of multis to be created, even when there are no FKs. This was most likely involved in the 9% performance decrease that was previously reported.

2. Save a few TransactionIdIsInProgress calls in MultiXactIdWait. We were calling it to determine how many transactions were still running, but not all callers were interested in that info. TransactionIdIsInProgress is not particularly cheap, so it seems good to skip it when possible. I didn't try to measure the difference, however.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
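[Editor's sketch] Tweak 1 above boils down to: when the same Xid already holds one tuple lock and requests another, only the stronger of the two needs to be recorded, because a transaction's stronger lock subsumes its own weaker ones -- so a single-xid xmax suffices and no MultiXact is created. The enum names below are illustrative (and use the pre-rename mode names from this point in the thread), not the patch's exact identifiers.

```c
#include <assert.h>

/* Tuple lock modes in increasing strength; names are illustrative only */
typedef enum TupleLockMode
{
    LOCK_KEY_SHARE = 0,
    LOCK_SHARE = 1,
    LOCK_FOR_UPDATE = 2,
    LOCK_KEY_UPDATE = 3
} TupleLockMode;

/*
 * When one transaction already holds "held" on a tuple and requests
 * "requested", record only the stronger mode: the weaker lock adds no
 * information, and collapsing to one mode avoids creating a multixact
 * just to remember both.
 */
static TupleLockMode
strongest_own_lock(TupleLockMode held, TupleLockMode requested)
{
    return (requested > held) ? requested : held;
}
```

With this rule, EPQ's ForUpdate lock followed by the actual update leaves just the update's lock mode recorded, which is what removes the pgbench multixact traffic described above.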
Attachment
On Wed, Aug 22, 2012 at 5:31 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Patch v19 contains some tweaks. Most notably,
>
> 1. if an Xid requests a lock A, and then a lock B which is stronger than
> A, then record only lock B and forget lock A. This is important for
> performance, because EvalPlanQual obtains ForUpdate locks on the tuples
> that it chases on an update chain. If EPQ was called by an update or
> delete, previously a MultiXact was being created.
>
> In a stock pgbench run this was causing lots of multis to be created,
> even when there are no FKs.
>
> This was most likely involved in the 9% performance decrease that was
> previously reported.

Ah-ha! Neat. I'll try to find some time to re-benchmark this during the next CF, unless you did that already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from Alvaro Herrera's message of mié ago 22 17:31:28 -0400 2012:
>
> Patch v19 contains some tweaks. Most notably,

v20 is merged to latest master; the only change other than the automatic merge is an update for the pg_upgrade API fixes.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
v21 is merged to latest master.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Friday, August 31, 2012 06:59:51 AM Alvaro Herrera wrote:
> v21 is merged to latest master.

Ok, I am starting to look at this.

(working with a git merge of alvherre/fklocks into today's master)

In a very first pass, as somebody who hasn't followed the discussions in the past, I took notice of the following things:

* compiles and survives make check

* README.tuplock jumps in quite fast without any introduction

* the reasons for the redesign aren't really included in the patch but just in the - very nice - pgcon paper

* "This is a bit of a performance regression, but since we expect FOR SHARE locks to be seldom used, we don't feel this is a serious problem." makes me a bit uneasy; will look at performance separately

* Should RelationGetIndexattrBitmap check whether a unique index is referenced from a pg_constraint row? Currently we pay part of the overhead independent of the existence of foreign keys.

* mode == LockTupleKeyShare, == LockTupleShare, == LockTupleUpdate in heap_lock_tuple's BeingUpdated branch look like they should be an if/else if

* I don't really like HEAP_UPDATE_KEY_REVOKED as a name; what about HEAP_KEYS_UPDATED (akin to HEAP_HOT_UPDATED)?

* how can we get infomask2 & HEAP_UPDATE_KEY_REVOKED && infomask & HEAP_XMAX_LOCK_ONLY?

* heap_lock_tuple with mode == LockTupleKeyShare && nowait looks like it would wait anyway in heap_lock_updated_tuple_rec

* rename HeapSatisfiesHOTUpdate, adjust comments?

* I wonder whether HeapSatisfiesHOTUpdate could be made to compute the result for key_attrs and hot_attrs at the same time; often enough they will do the same thing on the same column

* you didn't start it, but I find that all those l$i jump labels make the code harder to follow

* shouldn't XmaxGetUpdateXid rather be called MultiXactIdGetUpdateXid or such?

* /*
   * If the existing lock mode is identical to or weaker than the new
   * one, we can act as though there is no existing lock and have the
   * upper block handle this.
   */
  "block"?

* I am not yet sure whether the (xmax == add_to_xmax) optimization in compute_new_xmax_infomask is actually safe

* ConditionalMultiXactIdWait and MultiXactIdWait should probably rely on a common implementation

* I am not that convinced that moving the waiting functions away from multixact.c is a good idea, but I guess the required knowledge about lock modes makes it hard not to do so

* Haven't looked yet, but I have a slightly uneasy feeling about Hot Standby interactions. Did you look at that?

* Hm?
  HeapTupleSatisfiesVacuum
  #if 0
  ResetMultiHintBit(tuple, buffer, xmax, true);
  #endif

This obviously is not a real review, but just learning what the problem & patch actually try to do. This is quite a bit to take in ;). I will let it sink in and try to have a look at the architecture and performance next.

Greetings,

Andres

.oO(and people call catalog timetravel complicated)

--
Andres Freund                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On Friday, August 31, 2012 06:59:51 AM Alvaro Herrera wrote:
> > v21 is merged to latest master.
>
> Ok, I am starting to look at this.
>
> (working with a git merge of alvherre/fklocks into todays master)
>
> In a very first pass as somebody who hasn't followed the discussions in the
> past I took notice of the following things:
>
> * compiles and survives make check

Please verify src/test/isolation's make installcheck as well.

> * README.tuplock jumps in quite fast without any introduction

Hmm .. I'll give that a second look. Maybe some material from the pgcon paper should be added as an introduction.

> * "This is a bit of a performance regression, but since we expect FOR SHARE
> locks to be seldom used, we don't feel this is a serious problem." makes me a
> bit uneasy, will look at performance separately

Thanks. Keep in mind, though, that the bit we really want good performance for is FOR KEY SHARE, which is going to be used in foreign keys. FOR SHARE is staying just for hypothetical backwards compatibility. That said, I just found that people are using it -- see for example http://stackoverflow.com/a/3243083

> * Should RelationGetIndexattrBitmap check whether a unique index is referenced
> from a pg_constraint row? Currently we pay part of the overhead independent
> from the existance of foreign keys.

Noah also suggested something more or less along the same lines. My answer is that I don't want to do it currently; maybe we can improve on what indexes we use later, but since this can be seen as a modularity violation, I prefer to keep it as simple as possible.

> * mode == LockTupleKeyShare, == LockTupleShare, == LockTupleUpdate in
> heap_lock_tuple's BeingUpdated branch look like they should be an if/else if

Hm, will look at that.

> * I don't really like HEAP_UPDATE_KEY_REVOKED as a name, what about
> HEAP_KEYS_UPDATED (akin to HEAP_HOT_UPDATED)

Thanks.

> * how can we get infomask2 & HEAP_UPDATE_KEY_REVOKED && infomask &
> HEAP_XMAX_LOCK_ONLY?

SELECT FOR KEY UPDATE? This didn't exist initially, but previous reviewers (Noah) felt that it made sense to have it for consistency. It makes the lock conflict table make more sense, anyway.

> * heap_lock_tuple with mode == LockTupleKeyShare && nowait looks like it would
> wait anyway in heap_lock_updated_tuple_rec

Oops.

> * rename HeapSatisfiesHOTUpdate, adjust comments?

Yeah.

> * I wonder whether HeapSatisfiesHOTUpdate could be made to compute the result
> for key_attrs and hot_attrs at the same time, often enough they will do the
> same thing on the same column

Hmm, I will look at that.

> * you didn't start it but I find that all those l$i jump labels make the code
> harder to follow

You mean you suggest renaming them? That can be arranged.

> * shouldn't XmaxGetUpdateXid be rather called MultiXactIdGetUpdateXid or such?

Yes.

> /*
>  * If the existing lock mode is identical to or weaker than the new
>  * one, we can act as though there is no existing lock and have the
>  * upper block handle this.
>  */
> "block"?

Yeah, as in "the if {} block above". Since this is not clear, maybe it needs rewording.

> * I am not yet sure whether the (xmax == add_to_xmax) optimization in
> compute_new_xmax_infomask is actually safe

Will think about it and add a/some comment(s).

> * ConditionalMultiXactIdWait and MultiXactIdWait should probably rely on a
> common implementation

Will look at that.

> * I am not that convinced that moving the waiting functions away from
> multixact.c is a good idea, but I guess the required knowledge about lockmodes
> makes it hard not to do so

Yeah. The line between multixact.c and heapam.c is a zigzag currently, I think. Maybe it needs to be clearer; I think the multixact lock modes really belong in heapam.c, and they are just uint32 to multixact.c. I wasn't brave enough to attempt this; maybe I should.

> * Haven't looked yet, but I have a slightly uneasy feeling about Hot Standby
> interactions. Did you look at that?
>
> * Hm?
> HeapTupleSatisfiesVacuum
> #if 0
> ResetMultiHintBit(tuple, buffer, xmax, true);
> #endif

Yeah. I had a couple of ResetMultiHintBit calls sprinkled in some of the tqual routines, with the idea that if they saw multis that were no longer live, the multis could be removed and replaced by just the remaining updater. However, this doesn't really work, because it's not safe to change the Xmax when not holding exclusive lock, and the tqual routines don't know whether they hold it. So we're stuck with keeping the multi in there, even when we know it is no longer interesting. This is a bit sad, because removing them would be a performance benefit; but it's not strictly necessary for correctness, so we can leave the optimization for a later phase. I left those #ifdefed-out calls in place so that it's easy to see where the reset needs to happen.

> This obviously is not a real review, but just learning what the problem & patch
> actually try to do. This is quite a bit to take in ;). I will let it sink in
> and try to have a look at the architecture and performance next.

Well, the patch is large and complex, so any review is obviously going to take a long time.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
> Andres Freund wrote:
> > * heap_lock_tuple with mode == LockTupleKeyShare && nowait looks like it would
> > wait anyway in heap_lock_updated_tuple_rec
>
> Oops.

I took a stab at fixing this. However, it is not easy.

First you need a way to reproduce the problem, which involves setting breakpoints in GDB. (Since a REPEATABLE READ transaction will fail to follow an update chain due to "tuple concurrently updated", you need to use a READ COMMITTED transaction; but obviously the window in which to insert the bunch of updates is really short. Hence I set a breakpoint at the end of GetSnapshotData, had a SELECT FOR KEY SHARE NOWAIT get stuck in it, and then launched a couple of updates in another session.) I was able to reproduce the undesirable wait.

I quickly patched heap_lock_updated_tuple to pass down the nowait flag, but that code is actually not reached, because the update chain is first followed by EvalPlanQual in ExecLockRows, not by heap_lock_updated_tuple directly. And EPQ does not have the nowait behavior, so it still blocks. Maybe what we need to do is prevent ExecLockRows from following the update chain altogether -- after all, if heap_lock_tuple is going to do it by itself, maybe it's wholly redundant. Not really sure what's the best way to approach this.

At this stage I'm inclined to ignore the problem, unless some EPQ expert shows up and tells me that (1) it's okay to patch EPQ in that way, or (2) we should hack ExecLockRows (and remove EPQ?).

I pushed (to github) patches for a couple of other issues you raised. Some others still need a bit more of my attention.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Here is version 22 of this patch. This version contains fixes to issues reported by Andres, as well as a rebase to latest master.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Oct 18, 2012 at 04:58:20PM -0300, Alvaro Herrera wrote:
> Here is version 22 of this patch. This version contains fixes to issues
> reported by Andres, as well as a rebase to latest master.

I scanned this for obvious signs of work left to do. It contains numerous XXX and FIXME comments. Many highlight micro-optimization opportunities and the like; those can stay. Others preclude commit, either highlighting an unsolved problem or wrongly highlighting a non-problem:

> +	/*
> +	 * XXX we do not lock this tuple here; the theory is that it's sufficient
> +	 * with the buffer lock we're about to grab. Any other code must be able
> +	 * to cope with tuple lock specifics changing while they don't hold buffer
> +	 * lock anyway.
> +	 */

>   * so they will be uninteresting by the time our next transaction starts.
>   * (XXX not clear that this is correct --- other members of the MultiXact
>   * could hang around longer than we did. However, it's not clear what a
> ! * better policy for flushing old cache entries would be.) FIXME actually
> ! * this is plain wrong now that multixact's may contain update Xids.

> !	nmembers = GetMultiXactIdMembers(multi, &members, true);
> !	/*
> !	 * XXX we don't have the infomask to run the required consistency check
> !	 * here; the required notational overhead seems excessive.
> !	 */

>  	/* We assume the cache entries are sorted */
> !	/* XXX we assume the unused bits in "status" are zeroed */
> !	if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0)

> ! * XXX do we have any issues with needing to checkpoint here?
>   */
> ! static void
> ! TruncateMultiXact(void)
>   {

> +	/* FIXME what should we initialize this to? */
> +	newFrozenMulti = ReadNextMultiXactId();

> + * FIXME -- the XMAX_IS_MULTI test is a bit wrong .. it's possible to
> + * have tuples with that bit set that are dead. However, if that's
> + * changed, the RawXmax() call below should probably be researched as well.
>   */
> 	if (tuple->t_infomask &
> !		(HEAP_XMAX_INVALID | HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_IS_MULTI))
> 		return false;
On Thursday, October 18, 2012 09:58:20 PM Alvaro Herrera wrote:
> Here is version 22 of this patch. This version contains fixes to issues
> reported by Andres, as well as a rebase to latest master.

Ok, now that pgconf.eu has ended I am starting to do a real review:

* Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 and earlier you could be sure that if you FOR UPDATE'ed a row you could delete it. Unless I am missing something, now this will not block somebody else acquiring a FOR KEY SHARE lock, so this guarantee is gone. This seems likely to introduce subtle problems in user applications.

  I propose renaming FOR UPDATE => FOR NO KEY UPDATE, FOR KEY UPDATE => FOR UPDATE or similar (or PREVENT KEY UPDATE?). That keeps the old meaning of FOR UPDATE.

  You write "SELECT FOR UPDATE is a standards-compliant exclusive lock". I didn't really find all that much about the semantics of FOR UPDATE on cursors in the standard other than "The operations of update and delete are permitted for updatable cursors, subject to constraining Access Rules.".

* I would welcome adding some explanatory comments about the point of TupleLockExtraInfo and MultiXactStatusLock at their respective definitions.

* Why do we have the HEAP_XMAX_IS_MULTI && HEAP_XMAX_LOCK_ONLY case?

* I think some of the longer comments could use the attention of a native speaker; unfortunately that's not me.

* I am still uncomfortable with the FOR SHARE deoptimization. I have seen people lock larger parts of their table to make some READ COMMITTED things actually safe. Is there any problem retaining the non-XMAX_IS_MULTI behaviour other than space in the infomask? That seems solvable by something like

  #define HEAP_XMAX_SHR_LOCK      0x0010
  #define HEAP_XMAX_EXCL_LOCK     0x0040
  #define HEAP_XMAX_KEYSHR_LOCK   (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK)

  and somewhat more complex expressions when testing the locks ((infomask & HEAP_XMAX_KEYSHR_LOCK) == HEAP_XMAX_KEYSHR_LOCK, etc).

* In heap_lock_tuple's XMAX_IS_MULTI case

  for (i = 0; i < nmembers; i++)
  {
      if (TransactionIdIsCurrentTransactionId(members[i].xid))
      {
          LockTupleMode membermode;

          membermode = TUPLOCK_from_mxstatus(members[i].status);

          if (membermode > mode)
          {
              if (have_tuple_lock)
                  UnlockTupleTuplock(relation, tid, mode);

              pfree(members);
              return HeapTupleMayBeUpdated;
          }
      }
  }

  why is it membermode > mode and not membermode >= mode?

* Is the case of a non-key update followed by a key update actually handled when doing a heap_lock_tuple with mode = LockTupleKeyShare and follow_updates = false? I don't really see how, so it seems to deserve at least a comment. But then I don't yet understand why follow_updates makes sense.

* In heap_lock_tuple with mode == LockTupleUpdate && infomask & HEAP_XMAX_IS_MULTI, we're leaking members when doing goto l3. Probably not relevant, but given the code tries to be careful everywhere else... We might also leak in the members == 0 case there; not sure yet.

Ok, this is at about 35% of the diff in my second pass, but I just arrived back in Berlin, and this seems useful enough on its own...

Andres

--
Andres Freund                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
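[Editor's sketch] Andres's infomask suggestion above can be demonstrated concretely. The bit values come from his message; the helper names are made up for illustration. Because the key-share pattern reuses both single-bit values, a plain "infomask & flag" test is no longer sufficient -- each mode must be tested for an exact match against the two-bit field:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values from Andres's sketch; illustrative, not committed code */
#define HEAP_XMAX_SHR_LOCK     0x0010
#define HEAP_XMAX_EXCL_LOCK    0x0040
#define HEAP_XMAX_KEYSHR_LOCK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK)

/* KEYSHR holds only when both bits are set */
static bool
xmax_is_keyshr_locked(uint16_t infomask)
{
    return (infomask & HEAP_XMAX_KEYSHR_LOCK) == HEAP_XMAX_KEYSHR_LOCK;
}

/* SHR holds only when the SHR bit is set and the EXCL bit is clear */
static bool
xmax_is_shr_locked(uint16_t infomask)
{
    return (infomask & HEAP_XMAX_KEYSHR_LOCK) == HEAP_XMAX_SHR_LOCK;
}

/* EXCL holds only when the EXCL bit is set and the SHR bit is clear */
static bool
xmax_is_excl_locked(uint16_t infomask)
{
    return (infomask & HEAP_XMAX_KEYSHR_LOCK) == HEAP_XMAX_EXCL_LOCK;
}
```

This encodes three lock modes in two infomask bits at the cost of the slightly more complex test expressions the message describes.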
On 27 October 2012 00:06, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, October 18, 2012 09:58:20 PM Alvaro Herrera wrote: >> Here is version 22 of this patch. This version contains fixes to issues >> reported by Andres, as well as a rebase to latest master. > > Ok, I now that pgconf.eu has ended I am starting to do a real review: > > * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 and earlier > you could be sure that if you FOR UPDATE'ed a row you could delete it. Unless > I miss something now this will not block somebody else acquiring a FOR KEY > SHARE lock, so this guarantee is gone. Yes, that is exactly the point of this. > This seems likely to introduce subtle problems in user applications. Yes, it could. So we need some good docs explaining this. Which applications use FOR UPDATE? Any analysis of particular situations would be very helpful in doing the correct thing here. I think introducing FOR DELETE would be much clearer as an addition/synonym for FOR KEY UPDATE. > I propose renaming FOR UPDATE => FOR NO KEY UPDATE, FOR KEY UPDATE => FOR > UPDATE or similar (or PREVENT KEY UPDATE?). That keeps the old meaning of FOR > UPDATE. Which is essentially unwinding the feature, to some extent. Does FOR UPDATE throw an error if the user later issues an UPDATE of the PK or a DELETE? That sequence of actions would cause lock escalation in the application, which could also lead to deadlock/contention. This sounds like we need a GUC or table-level default to control whether FOR UPDATE means FOR UPDATE or FOR DELETE. More thought/input required on this point, it seems important. > You write "SELECT FOR UPDATE is a standards-compliant exclusive lock". I > didn't really find all that much about the semantics of FOR UPDATE on cursors > in the standard other than "The operations of update and delete are permitted > for updatable cursors, subject to constraining Access Rules.". 
> * I think some of the longer comments could use the attention of a native > speaker, unfortunately that's not me. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sunday, October 28, 2012 11:47:16 PM Simon Riggs wrote: > On 27 October 2012 00:06, Andres Freund <andres@2ndquadrant.com> wrote: > > On Thursday, October 18, 2012 09:58:20 PM Alvaro Herrera wrote: > >> Here is version 22 of this patch. This version contains fixes to issues > >> reported by Andres, as well as a rebase to latest master. > > > > Ok, I now that pgconf.eu has ended I am starting to do a real review: > > > > * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 and > > earlier you could be sure that if you FOR UPDATE'ed a row you could > > delete it. Unless I miss something now this will not block somebody else > > acquiring a FOR KEY SHARE lock, so this guarantee is gone. > > Yes, that is exactly the point of this. Yes, sure. The point is the introduction of a weaker lock level which can be used by the ri triggers. I don't see any imperative that the semantics of the old lock level need to be redefined. That just seems dangerous to me. We need to take care to reduce the complications of upgrades, not introduce changes that require complex code reviews. > > This seems likely to introduce subtle problems in user applications. > > Yes, it could. So we need some good docs on explaining this. > > Which applications use FOR UPDATE? Any that want to protect themselves against deadlocks and/or visibility problems due to READ COMMITTED. That's quite a bunch. > Any analysis of particular situations would be very helpful in doing > the correct thing here. Usual things include:

* avoiding problems due to lock upgrades in combination with foreign keys (as far as I can see some of those still persist)
* preventing rows from being deleted by other transactions
* preparing for an update if computation of the new values takes some time
* guaranteeing order of locking to make sure rows are DELETEd/UPDATEd in the same order (still no ORDER BY in UPDATE/DELETE)
...

> I think introducing FOR DELETE would be much clearer as an addition/ > synonym for FOR KEY UPDATE. Hm. 
Not really convinced. For me that seems to just make the overall situation even more complex. > > I propose renaming FOR UPDATE => FOR NO KEY UPDATE, FOR KEY UPDATE => FOR > > UPDATE or similar (or PREVENT KEY UPDATE?). That keeps the old meaning of > > FOR UPDATE. > > Which is essentially unwinding the feature, to some extent. Why? The overall features available are just the same; the only thing is that existing semantics aren't changed. > Does FOR UPDATE throw an error if the user later issues an UPDATE of > the PK or a DELETE? That sequence of actions would cause lock > escalation in the application, which could also lead to > deadlock/contention. Unless I miss something it precisely does *not* result in lock escalation with the 9.2 semantics, but it *would* with fklocks applied. That's exactly my point. > This sounds like we need a GUC or table-level default to control > whether FOR UPDATE means FOR UPDATE or FOR DELETE I don't like adding a new guc for something that should be solvable with some creative naming. If a new user doesn't get a bit more concurrency due to a manually issued 9.2 FOR UPDATE implicitly being converted into a FOR NO KEY UPDATE, it's not too bad. The code needs to be checked for whether that's valid anyway. The reverse is not true... > More thought/input required on this point, it seems important. Yep, more input welcome. Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 29 October 2012 12:27, Andres Freund <andres@2ndquadrant.com> wrote: >> This sounds like we need a GUC or table-level default to control >> whether FOR UPDATE means FOR UPDATE or FOR DELETE > > I don't like adding a new guc for something that should be solveable with some > creative naming. If a new user doesn't get a bit more concurrency due to > manually issued 9.2 FOR UPDATE implicitly being converted into a FOR NO KEY > UPDATE its not too bad. The code needs to be checked whether thats valid > anyway. The reverse is not true... The main point here is that introducing new non-optional behaviour is a compatibility problem and we shouldn't rush that. If we decide to change FOR UPDATE to mean FOR NO KEY UPDATE, that needs separate consideration and shouldn't happen until sometime after the feature goes in (months, or perhaps releases). We're introducing a new feature that will allow us to avoid lock problems, by taking the current FOR UPDATE behaviour and splitting it into two options: one weaker and one stronger. We need explicit names for both of those options. The most obvious naming is

FOR [[NO] KEY] UPDATE

Don't much like that, but it's pretty unambiguous and fairly neat. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > On Sunday, October 28, 2012 11:47:16 PM Simon Riggs wrote: >> Andres Freund <andres@2ndquadrant.com> wrote: >>> * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 >>> and earlier you could be sure that if you FOR UPDATE'ed a row you >>> could delete it. Unless I miss something now this will not block >>> somebody else acquiring a FOR KEY SHARE lock, so this guarantee >>> is gone. >> >> Yes, that is exactly the point of this. > > The point is the introduction of a weaker lock level which can be > used by the ri triggers. I don't see any imperative that the > semantics of the old lock level need to be redefined. That just > seems dangerous to me. I agree with Andres here -- the new lock level is needed within RI triggers, and we might as well expose it for application programmer use (with proper documentation), but there's no reason to break existing application code by making the semantics of SELECT FOR UPDATE less strict than they were before. Let's keep that term meaning exactly the same thing if we can, and use new names for the new levels. We already present a barrier to people migrating from Oracle because SELECT FOR UPDATE in PostgreSQL is weaker than it is in Oracle (where it is just as strong as if an actual UPDATE were done -- write conflicts and all); let's not increase the barrier by making SELECT FOR UPDATE even weaker. -Kevin
On Mon, Oct 29, 2012 at 08:18:33AM -0400, Kevin Grittner wrote: > Andres Freund wrote: > > On Sunday, October 28, 2012 11:47:16 PM Simon Riggs wrote: > >> Andres Freund <andres@2ndquadrant.com> wrote: > > >>> * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 > >>> and earlier you could be sure that if you FOR UPDATE'ed a row you > >>> could delete it. Unless I miss something now this will not block > >>> somebody else acquiring a FOR KEY SHARE lock, so this guarantee > >>> is gone. > >> > >> Yes, that is exactly the point of this. > > > > The point is the introduction of a weaker lock level which can be > > used by the ri triggers. I don't see any imperative that the > > semantics of the old lock level need to be redefined. That just > > seems dangerous to me. > > I agree with Andres here -- the new lock level is needed within RI > triggers, and we might as well expose it for application programmer > use (with proper documentations), but there's no reason to break > existing application code by making the semantics of SELECT FOR > UPDATE less strict than they were before. Let's keep that term > meaning exactly the same thing if we can, and use new names for the > new levels. +1. I had not considered this angle during previous reviews, but I agree with Andres's position. Since this patch does not strengthen the strongest tuple lock relative to its PostgreSQL 9.2 counterpart, that lock type should continue to use the syntax "FOR UPDATE". It will come to mean "the lock type sufficient for all possible UPDATEs of the row".
On 31 October 2012 19:35, Noah Misch <noah@leadboat.com> wrote: > On Mon, Oct 29, 2012 at 08:18:33AM -0400, Kevin Grittner wrote: >> Andres Freund wrote: >> > On Sunday, October 28, 2012 11:47:16 PM Simon Riggs wrote: >> >> Andres Freund <andres@2ndquadrant.com> wrote: >> >> >>> * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 >> >>> and earlier you could be sure that if you FOR UPDATE'ed a row you >> >>> could delete it. Unless I miss something now this will not block >> >>> somebody else acquiring a FOR KEY SHARE lock, so this guarantee >> >>> is gone. >> >> >> >> Yes, that is exactly the point of this. >> > >> > The point is the introduction of a weaker lock level which can be >> > used by the ri triggers. I don't see any imperative that the >> > semantics of the old lock level need to be redefined. That just >> > seems dangerous to me. >> >> I agree with Andres here -- the new lock level is needed within RI >> triggers, and we might as well expose it for application programmer >> use (with proper documentations), but there's no reason to break >> existing application code by making the semantics of SELECT FOR >> UPDATE less strict than they were before. Let's keep that term >> meaning exactly the same thing if we can, and use new names for the >> new levels. > > +1. I had not considered this angle during previous reviews, but I agree with > Andres's position. Since this patch does not strengthen the strongest tuple > lock relative to its PostgreSQL 9.2 counterpart, that lock type should > continue to use the syntax "FOR UPDATE". It will come to mean "the lock type > sufficient for all possible UPDATEs of the row". So we have syntax

FOR NON KEY UPDATE
FOR KEY UPDATE

KEY is the default, so FOR UPDATE is a synonym of FOR KEY UPDATE -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote: > On 31 October 2012 19:35, Noah Misch <noah@leadboat.com> wrote: > > On Mon, Oct 29, 2012 at 08:18:33AM -0400, Kevin Grittner wrote: > >> Andres Freund wrote: > >> > The point is the introduction of a weaker lock level which can be > >> > used by the ri triggers. I don't see any imperative that the > >> > semantics of the old lock level need to be redefined. That just > >> > seems dangerous to me. > >> > >> I agree with Andres here -- the new lock level is needed within RI > >> triggers, and we might as well expose it for application programmer > >> use (with proper documentations), but there's no reason to break > >> existing application code by making the semantics of SELECT FOR > >> UPDATE less strict than they were before. Let's keep that term > >> meaning exactly the same thing if we can, and use new names for the > >> new levels. > > > > +1. I had not considered this angle during previous reviews, but I agree with > > Andres's position. Since this patch does not strengthen the strongest tuple > > lock relative to its PostgreSQL 9.2 counterpart, that lock type should > > continue to use the syntax "FOR UPDATE". It will come to mean "the lock type > > sufficient for all possible UPDATEs of the row". Yeah, I agree with this too. > So we have syntax > > FOR NON KEY UPDATE > FOR KEY UPDATE > > KEY is the default, so FOR UPDATE is a synonym of FOR KEY UPDATE Not really sure about the proposed syntax, but yes clearly we need some other syntax to mean "FOR NON KEY UPDATE". I would rather keep FOR UPDATE to mean what I currently call FOR KEY UPDATE. More proposals for the other (weaker) lock level welcome (but if you love FOR NON KEY UPDATE, please chime in too) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Oct 31, 2012 at 05:22:10PM -0300, Alvaro Herrera wrote: > Simon Riggs wrote: > > On 31 October 2012 19:35, Noah Misch <noah@leadboat.com> wrote: > > > +1. I had not considered this angle during previous reviews, but I agree with > > > Andres's position. Since this patch does not strengthen the strongest tuple > > > lock relative to its PostgreSQL 9.2 counterpart, that lock type should > > > continue to use the syntax "FOR UPDATE". It will come to mean "the lock type > > > sufficient for all possible UPDATEs of the row". > > Yeah, I agree with this too. > > > So we have syntax > > > > FOR NON KEY UPDATE > > FOR KEY UPDATE > > > > KEY is the default, so FOR UPDATE is a synonym of FOR KEY UPDATE > > Not really sure about the proposed syntax, but yes clearly we need some > other syntax to mean "FOR NON KEY UPDATE". I would rather keep FOR > UPDATE to mean what I currently call FOR KEY UPDATE. More proposals for > the other (weaker) lock level welcome (but if you love FOR NON KEY > UPDATE, please chime in too) Agree on having "FOR UPDATE" without any "FOR KEY UPDATE" synonym. For the weaker lock, I mildly preferred the proposal of "FOR NO KEY UPDATE". NON KEY captures the idea better in English, but NO is close enough and already part of the SQL lexicon.
FWIW I have gotten a lot of feedback about this patch, and since I don't have time right now to produce an updated version, I'm going to close this as Returned with Feedback, and submit an updated version to the upcoming commitfest. This patch still needs much more review -- for example, as far as I know, the multixact.c changes have gone largely unreviewed; they have changed somewhat since Noah reviewed them (back in version 15 or so, I think). Of course, to me it all makes sense, but then I'm only its author. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "Everybody understands Mickey Mouse. Few understand Hermann Hesse. Hardly anybody understands Einstein. And nobody understands Emperor Norton."
On Monday, November 05, 2012 02:37:15 PM Alvaro Herrera wrote: > FWIW I have gotten a lot of feedback about this patch, and since I don't > have time right now to produce an updated version, that I'm going to > close this as Returned with Feedback, and submit an updated version to > the upcoming commitfest. > > This patch still needs much more review -- for example, as far as I > know, the multixact.c changes have gone largely unreviewed; they have > changed somewhat since Noah reviewed them (back in version 15 or so, I > think). Of course, to me it all makes sense, but then I'm only its > author. There also hasn't been any recent performance testing. I am not sure if I can get the time to do so before the next commitfest... Andres -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> FOR NON KEY UPDATE >> FOR KEY UPDATE >> >> KEY is the default, so FOR UPDATE is a synonym of FOR KEY UPDATE > > Not really sure about the proposed syntax, but yes clearly we need some > other syntax to mean "FOR NON KEY UPDATE". I would rather keep FOR > UPDATE to mean what I currently call FOR KEY UPDATE. More proposals for > the other (weaker) lock level welcome (but if you love FOR NON KEY > UPDATE, please chime in too)

FOR ANY UPDATE, synonym of FOR UPDATE
FOR KEY UPDATE, optimized version, when it applies to your case

I also tend to think that we should better not change the current meaning of FOR UPDATE and have it default to FOR ANY UPDATE. Unless it's easy to upgrade from ANY to KEY, and do that automatically at the right time, but I fear there lie dragons (or something). Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Noah Misch wrote: > On Wed, Oct 31, 2012 at 05:22:10PM -0300, Alvaro Herrera wrote: > > Not really sure about the proposed syntax, but yes clearly we need some > > other syntax to mean "FOR NON KEY UPDATE". I would rather keep FOR > > UPDATE to mean what I currently call FOR KEY UPDATE. More proposals for > > the other (weaker) lock level welcome (but if you love FOR NON KEY > > UPDATE, please chime in too) > > Agree on having "FOR UPDATE" without any "FOR KEY UPDATE" synonym. For the > weaker lock, I mildly preferred the proposal of "FOR NO KEY UPDATE". NON KEY > captures the idea better in English, but NO is close enough and already part > of the SQL lexicon. This is the proposal I like best; however there is an asymmetry, because the locking options now are

FOR KEY SHARE
FOR SHARE
FOR NO KEY UPDATE
FOR UPDATE

I used to have comments such as

/* Currently, SELECT ... FOR [KEY] UPDATE/SHARE requires UPDATE privileges */
#define ACL_SELECT_FOR_UPDATE	ACL_UPDATE

but now they are slightly incorrect because the NO is not illustrated. I guess I could use SELECT ... FOR [NO KEY] UPDATE/SHARE but this leaves out the "FOR KEY SHARE" case (and can be thought to introduce a FOR NO KEY SHARE case). And getting much more verbose than that is probably not warranted. In some places I would like to use a phrase like "the locking clause", but I'm not sure that it's clear enough. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Here's version 23 of this patch, with fixes for the below comments and a merge to master. Andres Freund wrote: > On Thursday, October 18, 2012 09:58:20 PM Alvaro Herrera wrote: > > Here is version 22 of this patch. This version contains fixes to issues > > reported by Andres, as well as a rebase to latest master. > > Ok, I now that pgconf.eu has ended I am starting to do a real review: > > * Is it ok to make FOR UPDATE somewhat weaker than before? In 9.2 and earlier > you could be sure that if you FOR UPDATE'ed a row you could delete it. Unless > I miss something now this will not block somebody else acquiring a FOR KEY > SHARE lock, so this guarantee is gone. > This seems likely to introduce subtle problems in user applications. > > I propose renaming FOR UPDATE => FOR NO KEY UPDATE, FOR KEY UPDATE => FOR > UPDATE or similar (or PREVENT KEY UPDATE?). That keeps the old meaning of FOR > UPDATE. Okay, in subsequent discussion everyone agreed that this is necessary; thanks for noticing the problem. Instead of your suggestion, however, I used Noah's; luckily I just noticed that it was identical to yours. So apparently there is also agreement on this point. I have updated the code, comments and docs (and renamed some other stuff to match). > You write "SELECT FOR UPDATE is a standards-compliant exclusive lock". I > didn't really find all that much about the semantics of FOR UPDATE on cursors > in the standard other than "The operations of update and delete are permitted > for updatable cursors, subject to constraining Access Rules.". I don't really know where that comes from, but it's a statement I grabbed from the docs somewhere. > * I would welcome adding some explanatory comments about the point of > TupleLockExtraInfo and MultiXactStatusLock at the respective definition. Done. If that's not enough, I would welcome suggestions or specific questions needing clarification. > * Why do we have the HEAP_XMAX_IS_MULTI && HEAP_XMAX_LOCK_ONLY case? For pg_upgrade. 
HEAP_XMAX_LOCK_ONLY is the value previously used by HEAP_XMAX_SHARED_LOCK, so a database containing such values that is upgraded will contain that bit pattern. > * I think some of the longer comments could use the attention of a native > speaker, unfortunately thats not me. Sorry about that. > * I am still uncomfortable with the FOR SHARE deoptimization. I have seen > people lock larger parts of their table to make some READ COMMITTED things > actually safe. I haven't changed this yet. There are bits available in infomask2 that we could use, if we agree that it is necessary; or, as you say, we could use two bits to distinguish three different states, but we need to ensure that pg_upgraded databases will work sanely. > * In heap_lock_tuple's XMAX_IS_MULTI case > > [snip] > > why is it membermode > mode and not membermode >= mode? Uh, that's a bug. Fixed. As noticed in the comment above that snippet, there was a deadlock possible here. Maybe I should add a test to ensure this doesn't happen. > * Is the case of a a non-key-update followed by a key-update actually handled > when doing a heap_lock_tuple with mode = LockTupleKeyShare and follow_updates > = false? I don't really see how, so it seems to deserve at least a comment. I wasn't able to figure out what you think is the problem. > But then I don't yet understand why follow_update makes sense. Basically heap_lock_tuple can be told by the caller to follow the update chain to lock subsequent tuples, or not. Mainly, EvalPlanQual does not want this to happen, because it does its own update-chain walking. In all honesty, it is not clear to me that this is the proper solution to the problem. Maybe some hacking around ExecLockRows is necessary. > * In heap_lock_tuple with mode == LockTupleUpdate && infomask & > HEAP_XMAX_IS_MULTI, were leaking members when doing goto l3. Probably not > relevant, but given the code tries to be careful everywhere else... Right, plugged. 
> We might also leak in the members == 0 case there, not sure yet. Hm, I don't think GetMultiXactIdMembers really considers the case when 0 members are to be returned. Maybe that needs some more tweaking. > Ok, this is at about 35% of the diff in my second pass, but I just arrived back > in Berlin, and this seems useful enough on its own... Many thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote: > > * In heap_lock_tuple's XMAX_IS_MULTI case > > > > [snip] > > > > why is it membermode > mode and not membermode >= mode? > > Uh, that's a bug. Fixed. As noticed in the comment above that snippet, > there was a deadlock possible here. Maybe I should add a test to ensure > this doesn't happen. Done: https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 (I don't think this is worth a v24 submission). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 14, 2012 at 01:27:26PM -0300, Alvaro Herrera wrote: > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 > > (I don't think this is worth a v24 submission). Are you aware of any defects in or unanswered questions of this version that would stall your commit thereof?
On 2012-11-14 13:27:26 -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > > > * In heap_lock_tuple's XMAX_IS_MULTI case > > > > > > [snip] > > > > > > why is it membermode > mode and not membermode >= mode? > > > > Uh, that's a bug. Fixed. As noticed in the comment above that snippet, > > there was a deadlock possible here. Maybe I should add a test to ensure > > this doesn't happen. > > Done: > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 > > (I don't think this is worth a v24 submission). I have started doing some performance testing and I fear I was right in being suspicious about the performance difference for FOR SHARE locks. Tested with

pgbench -p 5442 -U andres \
	-c 30 -j 30 \
	-M prepared -f ~/tmp/postgres-fklocks/select-for-share.sql \
	-T 20 postgres

on a pgbench -i -s 10 database, where select-for-share.sql is:

BEGIN;
\set naccounts 1000000
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid FOR SHARE;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid FOR SHARE;
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid FOR SHARE;
COMMIT;

which very roughly resembles workloads I have seen in reality (locking some records you rely on while you do some work). With 52b4729fcfc20f056f17531a6670d8c4b9696c39 (alvherre/fklocks) vs 273986bf0d39e5166eb15ba42ebff4803e23a688 (latest merged master) I get

tps = 8986.133496 (excluding connections establishing)
vs
tps = 25307.861193 (excluding connections establishing)

That's nearly a factor of three, which seems too big to be acceptable to me. So I really think we need to bring FOR SHARE locks back as a flag. I have done some benchmarking of other cases (plain pgbench, pgbench with foreign keys, large insertions, large amounts of FOR SHARE locks) and haven't found anything really outstanding so far. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch wrote: > On Wed, Nov 14, 2012 at 01:27:26PM -0300, Alvaro Herrera wrote: > > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 > > > > (I don't think this is worth a v24 submission). > > Are you aware of any defects in or unanswered questions of this version that > would stall your commit thereof? Yeah, I am revisiting the list of XXX/FIXME comments you pointed out previously. And I would still like someone with EPQ understanding to review the ExecLockRows / EvalPlanQual / heap_lock_tuple interactions. Andres is on the verge of convincing me that we need to support singleton FOR SHARE without multixacts due to performance concerns. It would be useful for more people to chime in here: is FOR SHARE an important case to cater for? I wonder if using FOR KEY SHARE (keep performance characteristics, but would need to revise application code) would satisfy Andres' users, for example. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-16 13:17:47 -0300, Alvaro Herrera wrote: > Noah Misch wrote: > > On Wed, Nov 14, 2012 at 01:27:26PM -0300, Alvaro Herrera wrote: > > > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 > > > > > > (I don't think this is worth a v24 submission). > > > > Are you aware of any defects in or unanswered questions of this version that > > would stall your commit thereof? > > Yeah, I am revisiting the list of XXX/FIXME comments you pointed out > previously. > > And I would still like someone with EPQ understanding to review the > ExecLockRows / EvalPlanQual / heap_lock_tuple interactions. I am in the process of looking at those atm, but we need somebody that actually understands the intricacies here... > Andres is on the verge of convincing me that we need to support > singleton FOR SHARE without multixacts due to performance concerns. I don't really see any arguments against doing so. We aren't in that big a shortage of flags, and if we need more than available I think we can free some (e.g. XMAX/XMIN_INVALID). > It > would be useful for more people to chime in here: is FOR SHARE an > important case to cater for? I wonder if using FOR KEY SHARE (keep > performance characteristics, but would need to revise application code) > would satisfy Andres' users, for example. It definitely wouldn't help in the cases I have seen because the point is to protect against actual content changes of the rows, not just the keys. Note that you actually need to use explicit FOR SHARE/UPDATE for correctness purposes in many scenarios unless you're running in 9.1+ serializable mode. And that cannot be used in some cases (try queuing for example) because the rollback rates would be excessive. Even if applications could be fixed, requiring changes to applications' locking behaviour, which is possibly far from trivial, seems like too big an upgrade barrier. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 16, 2012 at 05:31:12PM +0100, Andres Freund wrote: > On 2012-11-16 13:17:47 -0300, Alvaro Herrera wrote: > > Andres is on the verge of convincing me that we need to support > > singleton FOR SHARE without multixacts due to performance concerns. > > I don't really see any arguments against doing so. We aren't in a that > big shortage of flags and if we need more than available I think we can > free some (e.g. XMAX/XMIN_INVALID). The patch currently leaves two unallocated bits. Reclaiming currently-used bits means a binary compatibility break. Until we plan out such a thing, reclaimable bits are not as handy as never-allocated bits. I don't think we should lightly expend one of the final two. > > It > > would be useful for more people to chime in here: is FOR SHARE an > > important case to cater for? I wonder if using FOR KEY SHARE (keep > > performance characteristics, but would need to revise application code) > > would satisfy Andres' users, for example. > > It definitely wouldn't help in the cases I have seen because the point > is to protect against actual content changes of the rows, not just the > keys. > Note that you actually need to use explicit FOR SHARE/UPDATE for > correctness purposes in many scenarios unless youre running in 9.1+ > serializable mode. And that cannot be used in some cases (try queuing > for example) because the rollback rates would be excessive. I agree that tripling FOR SHARE cost is risky. Where is the added cost concentrated? Perchance that multiple belies optimization opportunities. Thanks, nm
On 2012-11-16 22:31:51 -0500, Noah Misch wrote: > On Fri, Nov 16, 2012 at 05:31:12PM +0100, Andres Freund wrote: > > On 2012-11-16 13:17:47 -0300, Alvaro Herrera wrote: > > > Andres is on the verge of convincing me that we need to support > > > singleton FOR SHARE without multixacts due to performance concerns. > > > > I don't really see any arguments against doing so. We aren't in a that > > big shortage of flags and if we need more than available I think we can > > free some (e.g. XMAX/XMIN_INVALID). > > The patch currently leaves two unallocated bits. Reclaiming currently-used > bits means a binary compatibility break. Until we plan out such a thing, > reclaimable bits are not as handy as never-allocated bits. I don't think we > should lightly expend one of the final two. Not sure what you mean with a binary compatibility break? pg_upgrade'ability? What I previously suggested somewhere was:

#define HEAP_XMAX_SHR_LOCK     0x0010
#define HEAP_XMAX_EXCL_LOCK    0x0040
#define HEAP_XMAX_KEYSHR_LOCK  (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK)

/* Different from HEAP_LOCK_BITS because it doesn't include LOCK_ONLY */
#define HEAP_LOCK_MASK         (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK)

#define HEAP_XMAX_IS_SHR_LOCKED(tup) \
	(((tup)->t_infomask & HEAP_LOCK_MASK) == HEAP_XMAX_SHR_LOCK)
#define HEAP_XMAX_IS_EXCL_LOCKED(tup) \
	(((tup)->t_infomask & HEAP_LOCK_MASK) == HEAP_XMAX_EXCL_LOCK)
#define HEAP_XMAX_IS_KEYSHR_LOCKED(tup) \
	(((tup)->t_infomask & HEAP_LOCK_MASK) == HEAP_XMAX_KEYSHR_LOCK)

It makes checking for locks slightly more complicated, but it's not too bad... > > > It > > > would be useful for more people to chime in here: is FOR SHARE an > > > important case to cater for? I wonder if using FOR KEY SHARE (keep > > > performance characteristics, but would need to revise application code) > > > would satisfy Andres' users, for example. 
> > > > It definitely wouldn't help in the cases I have seen because the point > > is to protect against actual content changes of the rows, not just the > > keys. > > Note that you actually need to use explicit FOR SHARE/UPDATE for > > correctness purposes in many scenarios unless youre running in 9.1+ > > serializable mode. And that cannot be used in some cases (try queuing > > for example) because the rollback rates would be excessive. > > I agree that tripling FOR SHARE cost is risky. Where is the added cost > concentrated? Perchance that multiple belies optimization opportunities. Good question, let me play a bit. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 17, 2012 at 03:20:20PM +0100, Andres Freund wrote: > On 2012-11-16 22:31:51 -0500, Noah Misch wrote: > > On Fri, Nov 16, 2012 at 05:31:12PM +0100, Andres Freund wrote: > > > On 2012-11-16 13:17:47 -0300, Alvaro Herrera wrote: > > > > Andres is on the verge of convincing me that we need to support > > > > singleton FOR SHARE without multixacts due to performance concerns. > > > > > > I don't really see any arguments against doing so. We aren't in a that > > > big shortage of flags and if we need more than available I think we can > > > free some (e.g. XMAX/XMIN_INVALID). > > > > The patch currently leaves two unallocated bits. Reclaiming currently-used > > bits means a binary compatibility break. Until we plan out such a thing, > > reclaimable bits are not as handy as never-allocated bits. I don't think we > > should lightly expend one of the final two. > > Not sure what you mean with a binary compatibility break? > pg_upgrade'ability? Yes. If we decide HEAP_XMIN_INVALID isn't helping, we can stop adding it to tuples anytime. Old tuples may continue to carry the bit, with no particular deadline for their demise. To reuse that bit in the mean time, we'll need to prove that no tuple writable by PostgreSQL 8.3+ will get an unacceptable interpretation under the new meaning of the bit. Alternately, build the mechanism to prove that all such old bits are gone before using the bit in the new way. This keeps HEAP_MOVED_IN and HEAP_MOVED_OFF unavailable today. 
> What I previously suggested somewhere was:
>
> #define HEAP_XMAX_SHR_LOCK 0x0010
> #define HEAP_XMAX_EXCL_LOCK 0x0040
> #define HEAP_XMAX_KEYSHR_LOCK (HEAP_XMAX_SHR_LOCK|HEAP_XMAX_EXCL_LOCK)
> /*
> * Different from _LOCK_BITS because it doesn't include LOCK_ONLY
> */
> #define HEAP_LOCK_MASK (HEAP_XMAX_SHR_LOCK|HEAP_XMAX_EXCL_LOCK)
>
> #define HEAP_XMAX_IS_SHR_LOCKED(tup) \
> (((tup)->t_infomask & HEAP_LOCK_BITS) == HEAP_XMAX_SHR_LOCK)
> #define HEAP_XMAX_IS_EXCL_LOCKED(tup) \
> (((tup)->t_infomask & HEAP_LOCK_BITS) == HEAP_XMAX_EXCL_LOCK)
> #define HEAP_XMAX_IS_KEYSHR_LOCKED(tup) \
> (((tup)->t_infomask & HEAP_LOCK_BITS) == HEAP_XMAX_KEYSHR_LOCK)
>
> It makes checking for locks sightly more more complicated, but its not
> too bad...

Agreed; that seems reasonable.
> > I agree that tripling FOR SHARE cost is risky. Where is the added cost
> > concentrated? Perchance that multiple belies optimization opportunities.
>
> Good question, let me play a bit.

Ok, I benchmarked around and from what I see there is no single easy target. The biggest culprits I could find are:

1. higher amount of XLogInsert calls per transaction (visible in pgbench -t instead of -T mode while watching the WAL volume)
2. Memory allocations in GetMultiXactIdMembers
3. Memory allocations in mXactCachePut
   a) cache entry itself
   b) the cache context
4. More lwlock acquisitions

We can possibly optimize a bit with 2) by using a static buffer for common member sizes, but that's not going to buy us too much...

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 17, 2012 at 05:07:18PM +0100, Andres Freund wrote:
> > > I agree that tripling FOR SHARE cost is risky. Where is the added cost
> > > concentrated? Perchance that multiple belies optimization opportunities.
> >
> > Good question, let me play a bit.
>
> Ok, I benchmarked around and from what I see there is no single easy
> target.
> The biggest culprits I could find are:
> 1. higher amount of XLogInsert calls per transaction (visible
>    in pgbench -t instead of -T mode while watching the WAL volume)
> 2. Memory allocations in GetMultiXactIdMembers
> 3. Memory allocations in mXactCachePut
>    a) cache entry itself
>    b) the cache context
> 4. More lwlocks acquisitions
>
> We can possibly optimize a bit with 2) by using a static buffer for
> common member sizes, but thats not going to buy us too much...

In that case, +1 for your proposal to prop up FOR SHARE.
On 2012-11-14 13:27:26 -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > > > * In heap_lock_tuple's XMAX_IS_MULTI case > > > > > > [snip] > > > > > > why is it membermode > mode and not membermode >= mode? > > > > Uh, that's a bug. Fixed. As noticed in the comment above that snippet, > > there was a deadlock possible here. Maybe I should add a test to ensure > > this doesn't happen. > > Done: > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770

Some more review bits, based on ffd6250d1d393f2ecb9bfc55c2c6f715dcece557:

- if oldestMultiXactId + db is set and then that database is dropped we seem
  to have a problem because MultiXactAdvanceOldest won't overwrite those
  values. Should probably use SetMultiXactIdLimit directly.

- what stops multixacts only being filled out (i.e. RecordNewMultiXact())
  *after* the XLogInsert() *and* after a MultiXactGetCheckptMulti()? Afaics
  MultiXactGenLock is not held in CreateMultiXactId(). If we crash in that
  moment we lose the multixact data, which now means potential data loss...

- multixact member group data crossing 512-byte sector boundaries makes me
  uneasy (as it's 5 bytes). I don't really see a scenario where it's
  dangerous, but ... Does anybody see a problem here?

- there are quite some places that do

	multiStopLimit = multiWrapLimit - 100;
	if (multiStopLimit < FirstMultiXactId)
		multiStopLimit -= FirstMultiXactId;

  perhaps MultiXactIdAdvance and MultiXactIdRetreat macros are in order?
- I find using a default: clause in switches with enum types where everything
  is expected to be handled, like the following, a bad idea; this way the
  compiler won't warn you if you've missed cases, which makes
  changing/extending code harder:

	switch (rc->strength)
	{
		case LCS_FORNOKEYUPDATE:
			newrc->markType = ROW_MARK_EXCLUSIVE;
			break;
		case LCS_FORSHARE:
			newrc->markType = ROW_MARK_SHARE;
			break;
		case LCS_FORKEYSHARE:
			newrc->markType = ROW_MARK_KEYSHARE;
			break;
		case LCS_FORUPDATE:
			newrc->markType = ROW_MARK_KEYEXCLUSIVE;
			break;
		default:
			elog(ERROR, "unsupported rowmark type %d", rc->strength);
	}

-
	#if 0
		/*
		 * The idea here is to remove the IS_MULTI bit, and replace the
		 * xmax with the updater's Xid.  However, we can't really do it:
		 * modifying the Xmax is not allowed under our buffer locking
		 * rules, unless we have an exclusive lock; but we don't know that
		 * we have it.  So the multi needs to remain in place :-(
		 */
		ResetMultiHintBit(tuple, buffer, xmax, true);
	#endif

  Three things:
  - HeapTupleSatisfiesUpdate is actually always called exclusively locked ;)
  - Extending something like LWLockHeldByMe to also return the current
    lockmode doesn't sound hard
  - we seem to be using #ifdef NOT_YET for such cases

- Using a separate production for the lockmode seems to be nicer imo, not
  really important though

	for_locking_item:
			FOR UPDATE locked_rels_list opt_nowait ...
			| FOR NO KEY UPDATE locked_rels_list opt_nowait ...
			| FOR SHARE locked_rels_list opt_nowait ...
			| FOR KEY SHARE locked_rels_list opt_nowait
		;

- not really padding, MultiXactStatus is 4 bytes...

	/*
	 * XXX Note: there's a lot of padding space in MultiXactMember.  We could
	 * find a more compact representation of this Xlog record -- perhaps all the
	 * status flags in one XLogRecData, then all the xids in another one?  Not
	 * clear that it's worth the trouble though.
	 */

- why

	#define SizeOfMultiXactCreate (offsetof(xl_multixact_create, nmembers) + sizeof(int32))

  and not

	#define SizeOfMultiXactCreate offsetof(xl_multixact_create, members)

- starting a critical section in GetNewMultiXactId but not ending it is ugly, but not new

Greetings,

Andres

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-14 13:27:26 -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > > > * In heap_lock_tuple's XMAX_IS_MULTI case > > > > > > [snip] > > > > > > why is it membermode > mode and not membermode >= mode? > > > > Uh, that's a bug. Fixed. As noticed in the comment above that snippet, > > there was a deadlock possible here. Maybe I should add a test to ensure > > this doesn't happen. > > Done: > https://github.com/alvherre/postgres/commit/df2847e38198e99f57e52490e1e9391ebb70d770 > > (I don't think this is worth a v24 submission).

One more observation:

	/*
	 * Get and lock the updated version of the row; if fail, return NULL.
	 */
-	copyTuple = EvalPlanQualFetch(estate, relation, LockTupleExclusive,
+	copyTuple = EvalPlanQualFetch(estate, relation, LockTupleNoKeyExclusive,

That doesn't seem to be correct to me. Why is it ok to acquire a potentially too low lock level here?

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> - I find using a default: clause in switches with enum types where everything
>   is expected to be handled like the following a bad idea, this way the
>   compiler won't warn you if youve missed case's which makes changing/extending code harder:
>   switch (rc->strength)
>   {

I eliminated some of these default clauses, but the compiler is not happy about not having some of them, so they remain.

> -
> #if 0
> 	/*
> 	 * The idea here is to remove the IS_MULTI bit, and replace the
> 	 * xmax with the updater's Xid.  However, we can't really do it:
> 	 * modifying the Xmax is not allowed under our buffer locking
> 	 * rules, unless we have an exclusive lock; but we don't know that
> 	 * we have it.  So the multi needs to remain in place :-(
> 	 */
> 	ResetMultiHintBit(tuple, buffer, xmax, true);
> #endif
>
> Three things:
> - HeapTupleSatisfiesUpdate is actually always called exclusively locked ;)
> - Extending something like LWLockHeldByMe to also return the current
>   lockmode doesn't sound hard
> - we seem to be using #ifdef NOT_YET for such cases

After spending some time trying to make this work usefully, I observed that it's pointless, at least if we apply it only in HeapTupleSatisfiesUpdate, because the IS_MULTI bit is going to be removed by compute_new_xmax_infomask if the multi is gone. Something like this would be useful if we could do it in other places; say when we're merely scanning a page without the specific intention of modifying that particular tuple. Maybe in heap_prune_page, for example. ISTM we can see that as an optimization we can add later.
For the record, the implementation of ResetMultiHintBit looks like this:

/*
 * When a tuple is found to be marked by a MultiXactId, all whose members are
 * completed transactions, the HEAP_XMAX_IS_MULTI bit can be removed.
 * Additionally, at the request of caller, we can set the Xmax to the given
 * Xid, and set HEAP_XMAX_COMMITTED.
 *
 * Note that per buffer access rules, the caller to this function must hold
 * exclusive content lock on the affected buffer.
 */
static inline void
ResetMultiHintBit(HeapTupleHeader tuple, Buffer buffer,
				  TransactionId xid, bool mark_committed)
{
	Assert(LWLockModeHeld((&BufferDescriptors[buffer])->content_lock == LW_EXCLUSIVE));
	Assert(tuple->t_infomask & HEAP_XMAX_IS_MULTI);
	Assert(!(tuple->t_infomask & HEAP_XMAX_INVALID));
	Assert(!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)));

	tuple->t_infomask &= ~HEAP_XMAX_BITS;
	HeapTupleHeaderSetXmax(tuple, xid);
	if (!TransactionIdIsValid(xid))
		tuple->t_infomask |= HEAP_XMAX_INVALID;
	/* note that HEAP_KEYS_UPDATED persists, if set */
	if (mark_committed)
		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED, xid);
	else
		SetBufferCommitInfoNeedsSave(buffer);
}

(This is removed by commit 52f86f82fb3e01a6ed0b8106bac20f319bb3ad0a in my github tree, but that commit might be lost in the future, hence this paste.)

Some of your other observations have been fixed by commits that I have pushed to my github tree.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-20 17:36:14 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > > - I find using a default: clause in switches with enum types where everything > > is expected to be handled like the following a bad idea, this way the > > compiler won't warn you if youve missed case's which makes changing/extending code harder: > > switch (rc->strength) > > { > > I eliminated some of these default clauses, but the compiler is not > happy about not having some of them, so they remain. You have removed the ones that looked removable to me... > > > - > > #if 0 > > /* > > * The idea here is to remove the IS_MULTI bit, and replace the > > * xmax with the updater's Xid. However, we can't really do it: > > * modifying the Xmax is not allowed under our buffer locking > > * rules, unless we have an exclusive lock; but we don't know that > > * we have it. So the multi needs to remain in place :-( > > */ > > ResetMultiHintBit(tuple, buffer, xmax, true); > > #endif > > > > Three things: > > - HeapTupleSatisfiesUpdate is actually always called exclusively locked ;) > > - Extending something like LWLockHeldByMe to also return the current > > lockmode doesn't sound hard > > - we seem to be using #ifdef NOT_YET for such cases > > After spending some time trying to make this work usefully, I observed > that it's pointless, at least if we apply it only in > HeapTupleSatisfiesUpdate, because the IS_MULTI bit is going to be > removed by compute_new_xmax_infomask if the multi is gone. Something > like this would be useful if we could do it in other places; say when > we're merely scanning page without the specific intention of modifying > that particular tuple. Maybe in heap_prune_page, for example. ISTM we > can see that as an optimization we can add later. heap_prune_page sounds like a good fit, yes. One other place would be when following hot chains, but the locking would be far more critical in that path, so its less likely to be beneficial. Pushing it off sounds good to me. 
> Some of your other observations have been fixed by commits that I have
> pushed to my github tree.

A short repetition of the previous pgbench run of many SELECT FOR SHARE's:

10s test:
master:  22673 24637 23874 25527
fklocks: 24835 24601 24606 24868

60s test:
master:  32609 33087
fklocks: 33350 33359

Very nice!

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Here's version 24. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera wrote:
> Here's version 24.

Old review emails still contain some items that didn't lead to any changes. I tried to keep close track of those. To that list I add a couple of things of my own. Here they are, for those following along. I welcome suggestions.

- Fix the multixact cache in multixact.c -- see mXactCacheEnt.

- ResetHintBitMulti() was removed; need to remove old XMAX_IS_MULTI somewhere;
  perhaps during HOT page pruning?

- EvalPlanQual and ExecLockRows maybe need a different overall locking
  strategy. Right now follow_updates=false is used to avoid deadlocks.

- Ensure multixact.c behaves sanely on wraparound.

- Is the case of a non-key update followed by a key update actually handled
  when doing a heap_lock_tuple with mode = LockTupleKeyShare and
  follow_updates = false? I don't really see how, so it seems to deserve at
  least a comment.

- if oldestMultiXactId + db is set and then that database is dropped we seem
  to have a problem because MultiXactAdvanceOldest won't overwrite those
  values. Should probably use SetMultiXactIdLimit directly.

- what stops multixacts only being filled out (i.e. RecordNewMultiXact())
  *after* the XLogInsert() *and* after a MultiXactGetCheckptMulti()? Afaics
  MultiXactGenLock is not held in CreateMultiXactId(). If we crash in that
  moment we lose the multixact data, which now means potential data loss...

- multixact member group data crossing 512-byte sector boundaries makes me
  uneasy (as it's 5 bytes). I don't really see a scenario where it's
  dangerous, but ... Does anybody see a problem here?

- there are quite some places that do

	multiStopLimit = multiWrapLimit - 100;
	if (multiStopLimit < FirstMultiXactId)
		multiStopLimit -= FirstMultiXactId;

  perhaps MultiXactIdAdvance and MultiXactIdRetreat macros are in order?

- not really padding, MultiXactStatus is 4 bytes...

	/*
	 * XXX Note: there's a lot of padding space in MultiXactMember.
	 * We could find a more compact representation of this Xlog record --
	 * perhaps all the status flags in one XLogRecData, then all the xids in
	 * another one?  Not clear that it's worth the trouble though.
	 */

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote: > Here's version 24. This no longer applies after the rmgr rm_desc patch. -Kevin
Kevin Grittner wrote: > Alvaro Herrera wrote: > >> Here's version 24. > > This no longer applies after the rmgr rm_desc patch. I took a shot at merging those so I could review the patch against HEAD; attached in hopes that it will be useful. Hopefully this isn't too far off from what . -Kevin
Attachment
Kevin Grittner wrote: > Kevin Grittner wrote: > > Alvaro Herrera wrote: > > > >> Here's version 24. > > > > This no longer applies after the rmgr rm_desc patch. > > I took a shot at merging those so I could review the patch against > HEAD; attached in hopes that it will be useful. > > Hopefully this isn't too far off from what . Uh, sorry for remaining silent -- I'm about to post an updated patch, having just pushed my merge to github. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote: > Kevin Grittner wrote: > > Kevin Grittner wrote: > > > Alvaro Herrera wrote: > > > > > >> Here's version 24. > > > > > > This no longer applies after the rmgr rm_desc patch. > > > > I took a shot at merging those so I could review the patch against > > HEAD; attached in hopes that it will be useful. > > > > Hopefully this isn't too far off from what . > > Uh, sorry for remaining silent -- I'm about to post an updated patch, > having just pushed my merge to github. Here it is. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2012-11-27 12:07:34 -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Here's version 24.
>
> Old review emails still contains some items that didn't lead to any
> changes. I tried to keep close track of those. To that list I add a
> couple of things of my own. Here they are, for those following along.
> I welcome suggestions.
>
> - EvalPlanQual and ExecLockRows maybe need a different overall locking
>   strategy. Right now follow_updates=false is used to avoid deadlocks.

I think this really might need some work. Afaics the EPQ code now needs to actually determine what lock level needs to be passed to EvalPlanQualFetch via EvalPlanQual, otherwise some of the locking issues that triggered all this remain. That sucks a bit from a modularity perspective, but I don't see another way.

Greetings,

Andres

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, November 29, 2012 17:16, Alvaro Herrera wrote: > Here it is. > > fklocks-26.patch.gz This applies today after removing, not only the infamous catversion.h chunk, but also a file_fdw chunk. (It's not really a problem.) I also wondered if anyone has any perl/python/bash programs handy that uses these new options. I can hack something together myself, but I just thought I'd ask. Thanks, Erik Rijkers
On 2012-12-22 10:53:47 +0100, Erik Rijkers wrote: > On Thu, November 29, 2012 17:16, Alvaro Herrera wrote: > > Here it is. > > > > fklocks-26.patch.gz > > This applies today after removing, not only the infamous catversion.h chunk, but also a file_fdw > chunk. (It's not really a problem.) > > I also wondered if anyone has any perl/python/bash programs handy that uses these new options. I > can hack something together myself, but I just thought I'd ask. I have mostly done code review and some very targeted testing (pgbench, gdb + psql). So no ready scripts, sorry. There are imo some unfinished pieces (the whole EPQ integration) that will require significant retesting once Alvaro has time to work on this again.. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Erik Rijkers wrote:
> On Thu, November 29, 2012 17:16, Alvaro Herrera wrote:
> > Here it is.
> >
> > fklocks-26.patch.gz
>
> This applies today after removing, not only the infamous catversion.h chunk, but also a file_fdw
> chunk. (It's not really a problem.)

FWIW, while it's probably not interesting to you, it must be noted that the catversion.h number to use must match a contrib/pg_upgrade/pg_upgrade.h symbol.

As far as file_fdw hunks, I only see this one:

--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -191,7 +191,7 @@
 ERROR:  cannot change foreign table "agg_csv"
 DELETE FROM agg_csv WHERE a = 100;
 ERROR:  cannot change foreign table "agg_csv"
 SELECT * FROM agg_csv FOR UPDATE OF agg_csv;
-ERROR:  SELECT FOR UPDATE/SHARE cannot be used with foreign table "agg_csv"
+ERROR:  SELECT FOR UPDATE/SHARE/KEY UPDATE/KEY SHARE cannot be used with foreign table "agg_csv"
 LINE 1: SELECT * FROM agg_csv FOR UPDATE OF agg_csv;
                       ^
 -- but this should be ignored

Surely that needs to be patched still? And this hunk applies without any trouble, so I don't see why you had to remove it.

Anyway, here's v27, which is the same code merged to latest master.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Here's version 28 of this patch. The only substantive change here from v26 is that I've made GetTupleForTrigger() use either LockTupleExclusive or LockTupleNoKeyExclusive, depending on whether the key columns are being modified by the update. This needs to go through EvalPlanQual, so that function is now getting the lock mode as a parameter instead of hardcoded LockTupleExclusive. (All other users of GetTupleForTrigger still use LockTupleExclusive, so there's no change for anybody other than FOR EACH ROW BEFORE UPDATE triggers). At this point, I think I've done all I wanted to do in connection with this patch, and I intend to commit. I think this is a good time to get it committed, so that people can play with it more extensively and report any breakage I might have caused before we even get into beta. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-01-10 18:00:40 -0300, Alvaro Herrera wrote: > Here's version 28 of this patch. The only substantive change here from > v26 is that I've made GetTupleForTrigger() use either LockTupleExclusive > or LockTupleNoKeyExclusive, depending on whether the key columns are > being modified by the update. This needs to go through EvalPlanQual, so > that function is now getting the lock mode as a parameter instead of > hardcoded LockTupleExclusive. (All other users of GetTupleForTrigger > still use LockTupleExclusive, so there's no change for anybody other > than FOR EACH ROW BEFORE UPDATE triggers). Is that enough in case of a originally non-key update in read committed mode that turns into a key update due to a concurrent key update? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > On 2013-01-10 18:00:40 -0300, Alvaro Herrera wrote: > > Here's version 28 of this patch. The only substantive change here from > > v26 is that I've made GetTupleForTrigger() use either LockTupleExclusive > > or LockTupleNoKeyExclusive, depending on whether the key columns are > > being modified by the update. This needs to go through EvalPlanQual, so > > that function is now getting the lock mode as a parameter instead of > > hardcoded LockTupleExclusive. (All other users of GetTupleForTrigger > > still use LockTupleExclusive, so there's no change for anybody other > > than FOR EACH ROW BEFORE UPDATE triggers). > > Is that enough in case of a originally non-key update in read committed > mode that turns into a key update due to a concurrent key update? Hm, let me try to work through your example. You say that a transaction T1 does a non-key update, and is working through the BEFORE UPDATE trigger; then transaction T2 does a key update and changes the key underneath T1? So at that point T1 becomes a key update, because it's now using the original key values which are no longer the key? I don't think this can really happen, because T2 (which is requesting TupleLockExclusive) would block on the lock that the trigger is grabbing (TupleLockNoKeyExclusive) on the tuple. So T2 would sleep until T1 is committed. Now, maybe you meant that the BEFORE UPDATE trigger changes the key value but the user-supplied UPDATE query does not. So the trigger turns the no-key update into a key update. What would happen here is that GetTupleForTrigger would acquire TupleLockNoKeyExclusive on the tuple, and later heap_update would acquire TupleLockExclusive. So there is lock escalation happening. This could cause a deadlock against someone else grabbing a TupleLockKeyShare on the tuple. I think the answer is "don't do that", i.e. don't update the key columns in a BEFORE UPDATE trigger. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-11 12:11:47 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2013-01-10 18:00:40 -0300, Alvaro Herrera wrote:
> > > Here's version 28 of this patch. The only substantive change here from
> > > v26 is that I've made GetTupleForTrigger() use either LockTupleExclusive
> > > or LockTupleNoKeyExclusive, depending on whether the key columns are
> > > being modified by the update. This needs to go through EvalPlanQual, so
> > > that function is now getting the lock mode as a parameter instead of
> > > hardcoded LockTupleExclusive. (All other users of GetTupleForTrigger
> > > still use LockTupleExclusive, so there's no change for anybody other
> > > than FOR EACH ROW BEFORE UPDATE triggers).
> >
> > Is that enough in case of a originally non-key update in read committed
> > mode that turns into a key update due to a concurrent key update?
>
> Hm, let me try to work through your example. You say that a transaction
> T1 does a non-key update, and is working through the BEFORE UPDATE
> trigger; then transaction T2 does a key update and changes the key
> underneath T1? So at that point T1 becomes a key update, because it's
> now using the original key values which are no longer the key?
>
> I don't think this can really happen, because T2 (which is requesting
> TupleLockExclusive) would block on the lock that the trigger is grabbing
> (TupleLockNoKeyExclusive) on the tuple. So T2 would sleep until T1 is
> committed.

No, I was thinking about an update without triggers present.

T0: CREATE TABLE tbl(id serial pk, name text unique, data text);
T1: BEGIN; -- read committed
T1: UPDATE tbl SET name = 'foo' WHERE name = 'blarg'; /* key update of row id = 1 */
T2: BEGIN; -- read committed
T2: UPDATE tbl SET name = 'blarg', data = 'blarg' WHERE id = 1; /* no key update, waiting */
T1: COMMIT;
T2: /* UPDATE follows to updated row, due to the changed name it's a key update now */

Does that make sense?
Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
>
> No, I was thinking about an update without triggers present.
>
> T0: CREATE TABLE tbl(id serial pk, name text unique, data text);
> T1: BEGIN; -- read committed
> T1: UPDATE tbl SET name = 'foo' WHERE name = 'blarg'; /* key update of row id = 1 */
> T2: BEGIN; -- read committed
> T2: UPDATE tbl SET name = 'blarg', data = 'blarg' WHERE id = 1; /* no key update, waiting */
> T1: COMMIT;
> T2: /* UPDATE follows to updated row, due to the changed name its a key update now */
>
> Does that make sense?

So I guess your question is "is T2 now holding a TupleLockExclusive lock?" To answer it, I turned your example into an isolationtester spec:

setup
{
  CREATE TABLE tbl(id serial primary key, name text unique, data text);
  INSERT INTO tbl VALUES (1, 'blarg', 'no data');
}

teardown
{
  DROP TABLE tbl;
}

session "s1"
step "s1b"	{ BEGIN; }
step "s1u"	{ UPDATE tbl SET name = 'foo' WHERE name = 'blarg'; }
step "s1c"	{ COMMIT; }

session "s2"
step "s2b"	{ BEGIN; }
step "s2u"	{ UPDATE tbl SET name = 'blarg', data = 'blarg' WHERE id = 1; }
step "s2c"	{ COMMIT; }

session "s3"
step "s3l"	{ SELECT * FROM tbl FOR KEY SHARE; }

permutation "s1b" "s1u" "s2b" "s2u" "s1c" "s3l" "s2c"

And the results:

Parsed test spec with 3 sessions

starting permutation: s1b s1u s2b s2u s1c s3l s2c
step s1b: BEGIN;
step s1u: UPDATE tbl SET name = 'foo' WHERE name = 'blarg';
step s2b: BEGIN;
step s2u: UPDATE tbl SET name = 'blarg', data = 'blarg' WHERE id = 1; <waiting ...>
step s1c: COMMIT;
step s2u: <... completed>
step s3l: SELECT * FROM tbl FOR KEY SHARE; <waiting ...>
step s2c: COMMIT;
step s3l: <... completed>
id             name           data

1              blarg          blarg

So session 3 is correctly waiting for session 2 to finish before being able to grab its FOR KEY SHARE lock, indicating that session 2 is holding a FOR UPDATE lock. Good.

If I change session 1 to update the data column instead of name, session 3 no longer needs to wait for session 2, meaning session 2 now only grabs a FOR NO KEY UPDATE lock. Also good.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-01-11 13:10:49 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> >
> > No, I was thinking about an update without triggers present.
> >
> > T0: CREATE TABLE tbl(id serial pk, name text unique, data text);
> > T1: BEGIN; -- read committed
> > T1: UPDATE tbl SET name = 'foo' WHERE name = 'blarg'; /* key update of row id = 1 */
> > T2: BEGIN; -- read committed
> > T2: UPDATE tbl SET name = 'blarg', data = 'blarg' WHERE id = 1; /* no key update, waiting */
> > T1: COMMIT;
> > T2: /* UPDATE follows to updated row, due to the changed name its a key update now */
> >
> > Does that make sense?
>
> So I guess your question is "is T2 now holding a TupleLockExclusive
> lock?" To answer it, I turned your example into a isolationtester spec:

Great! I reread the code and it does make sense the way it's implemented now. I misremembered something...

I vote for adding that spectest including some appropriate permutations.

FWIW: Looks good to me. It could use another pair of eyes, but I guess that will have to come by being used.

Greetings,

Andres Freund

-- 
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
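If the spec from the previous message does get added, a couple of extra permutations along these lines (same step names, merely reordered; a sketch only, not run against isolationtester) would also exercise the orderings where the FOR KEY SHARE lock is attempted only after both updaters commit, or where session 2 updates first:

```
permutation "s1b" "s1u" "s2b" "s2u" "s1c" "s2c" "s3l"
permutation "s2b" "s2u" "s1b" "s1u" "s2c" "s1c" "s3l"
```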
Alvaro Herrera wrote:
> Here's version 28 of this patch. The only substantive change here from
> v26 is that I've made GetTupleForTrigger() use either LockTupleExclusive
> or LockTupleNoKeyExclusive, depending on whether the key columns are
> being modified by the update. This needs to go through EvalPlanQual, so
> that function is now getting the lock mode as a parameter instead of
> hardcoded LockTupleExclusive. (All other users of GetTupleForTrigger
> still use LockTupleExclusive, so there's no change for anybody other
> than FOR EACH ROW BEFORE UPDATE triggers).
>
> At this point, I think I've done all I wanted to do in connection with
> this patch, and I intend to commit. I think this is a good time to get
> it committed, so that people can play with it more extensively and
> report any breakage I might have caused before we even get into beta.

While trying this out this morning I noticed a bug in the XLog code: trying to lock the updated version of a tuple, when the page that contains the updated version requires a backup block, causes this:

PANIC: invalid xlog record length 0

The reason is that there is an (unknown to me) rule that there must be some data not associated with a buffer:

	/*
	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
	 * error checking in ReadRecord.  This means that all callers of
	 * XLogInsert must supply at least some not-in-a-buffer data. [...]
	 */

This seems pretty strange to me. And having the rule spelled out only in a comment within XLogInsert, not at its top nor near the XLogRecData struct definition, seems stranger still. I wonder how come every PG hacker except me knows this.

For the curious, I attach an isolationtester spec file I used to reproduce the problem (and verify the fix).
To fix the crash I needed to do this:

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 99fced1..9762890 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4838,7 +4838,7 @@ l4:
 	{
 		xl_heap_lock_updated xlrec;
 		XLogRecPtr	recptr;
-		XLogRecData rdata;
+		XLogRecData rdata[2];
 		Page		page = BufferGetPage(buf);
 
 		xlrec.target.node = rel->rd_node;
@@ -4846,13 +4846,18 @@ l4:
 		xlrec.xmax = new_xmax;
 		xlrec.infobits_set = compute_infobits(new_infomask, new_infomask2);
 
-		rdata.data = (char *) &xlrec;
-		rdata.len = SizeOfHeapLockUpdated;
-		rdata.buffer = buf;
-		rdata.buffer_std = true;
-		rdata.next = NULL;
+		rdata[0].data = (char *) &xlrec;
+		rdata[0].len = SizeOfHeapLockUpdated;
+		rdata[0].buffer = InvalidBuffer;
+		rdata[0].next = &(rdata[1]);
+
+		rdata[1].data = NULL;
+		rdata[1].len = 0;
+		rdata[1].buffer = buf;
+		rdata[1].buffer_std = true;
+		rdata[1].next = NULL;
 
-		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_LOCK_UPDATED, &rdata);
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_LOCK_UPDATED, rdata);
 
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);

-- 
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 18 January 2013 20:02, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> XLOG_HEAP2_LOCK_UPDATED

Every xlog record needs to know where to put the block.  Compare with XLOG_HEAP_LOCK:

	xlrec.target.node = relation->rd_node;

-- 
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> The reason is that there is an (unknown to me) rule that there must be
> some data not associated with a buffer:

> /*
>  * NOTE: We disallow len == 0 because it provides a useful bit of extra
>  * error checking in ReadRecord.  This means that all callers of
>  * XLogInsert must supply at least some not-in-a-buffer data. [...]
>  */

> This seems pretty strange to me.  And having the rule be spelled out
> only in a comment within XLogInsert and not at its top, and not near
> the XLogRecData struct definition either, seems pretty strange to me.
> I wonder how come every PG hacker except me knows this.

I doubt it ever came up before.  What use is logging only the content of a buffer page?  Surely you'd need to know, for example, which relation and page number it is from.

			regards, tom lane
On 2013-01-18 15:37:47 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > The reason is that there is an (unknown to me) rule that there must be
> > some data not associated with a buffer:
>
> > /*
> >  * NOTE: We disallow len == 0 because it provides a useful bit of extra
> >  * error checking in ReadRecord.  This means that all callers of
> >  * XLogInsert must supply at least some not-in-a-buffer data. [...]
> >  */
>
> > This seems pretty strange to me.  And having the rule be spelled out
> > only in a comment within XLogInsert and not at its top, and not near
> > the XLogRecData struct definition either, seems pretty strange to me.
> > I wonder how come every PG hacker except me knows this.
>
> I doubt it ever came up before.  What use is logging only the content of
> a buffer page?  Surely you'd need to know, for example, which relation
> and page number it is from.

It only got to be a length of 0 because the data got removed due to a logged full-page write.  And the backup block contains the data about which blocks it is logging in itself.

I wonder if the check shouldn't just check write_len instead, directly below the loop that adds backup blocks.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-01-18 15:37:47 -0500, Tom Lane wrote:
>> I doubt it ever came up before.  What use is logging only the content of
>> a buffer page?  Surely you'd need to know, for example, which relation
>> and page number it is from.

> It only got to be a length of 0 because the data got removed due to
> a logged full-page write.  And the backup block contains the data about
> which blocks it is logging in itself.

And if the full-page-image case *hadn't* been invoked, what then?  I still don't see a very good argument for xlog records with no fixed data.

> I wonder if the check shouldn't just check write_len instead, directly
> below the loop that adds backup blocks.

We're not changing this unless you can convince me that the read-side error check mentioned in the comment is useless.

			regards, tom lane
On 2013-01-18 16:01:18 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-01-18 15:37:47 -0500, Tom Lane wrote:
> >> I doubt it ever came up before.  What use is logging only the content of
> >> a buffer page?  Surely you'd need to know, for example, which relation
> >> and page number it is from.
>
> > It only got to be a length of 0 because the data got removed due to
> > a logged full-page write.  And the backup block contains the data about
> > which blocks it is logging in itself.
>
> And if the full-page-image case *hadn't* been invoked, what then?  I
> still don't see a very good argument for xlog records with no fixed
> data.

In that case data would have been logged.  The code in question was:

	xl_heap_lock_updated xlrec;

	xlrec.target.node = rel->rd_node;
	...
	xlrec.xmax = new_xmax;

-	rdata.data = (char *) &xlrec;
-	rdata.len = SizeOfHeapLockUpdated;
-	rdata.buffer = buf;
-	rdata.buffer_std = true;
-	rdata.next = NULL;
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_LOCK_UPDATED, &rdata);

Other WAL-logging code (and fklocks now as well) just puts those into two XLogRecData chunks to avoid the issue.

> > I wonder if the check shouldn't just check write_len instead, directly
> > below the loop that adds backup blocks.
>
> We're not changing this unless you can convince me that the read-side
> error check mentioned in the comment is useless.

You're right, the read-side check is worth quite a bit.  I think I am retracting my suggestion.  I guess the amount of extra data that's uselessly logged, although never used in the redo functions, doesn't really amount to anything significant in comparison to the backup block data.

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 18 January 2013 21:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-01-18 15:37:47 -0500, Tom Lane wrote:
>>> I doubt it ever came up before.  What use is logging only the content of
>>> a buffer page?  Surely you'd need to know, for example, which relation
>>> and page number it is from.
>
>> It only got to be a length of 0 because the data got removed due to
>> a logged full-page write.  And the backup block contains the data about
>> which blocks it is logging in itself.
>
> And if the full-page-image case *hadn't* been invoked, what then?  I
> still don't see a very good argument for xlog records with no fixed
> data.
>
>> I wonder if the check shouldn't just check write_len instead, directly
>> below the loop that adds backup blocks.
>
> We're not changing this unless you can convince me that the read-side
> error check mentioned in the comment is useless.

There's some confusion here, I think.  Alvaro wasn't proposing a WAL record that had no fixed data.

The current way XLogRecData works is that if you put data and buffer together in the same chunk, then we optimise the data away if we take a backup block.  Alvaro chose what looks like the right way to do this when you have simple data: use one XLogRecData chunk and let the data part get optimised away.  But doing that results in a WAL record that has a backup block plus data of zero length, which then fails the later check.

All current code gets around this by including multiple XLogRecData chunks, which results in including additional data that is superfluous when the backup block is present.  Alvaro was questioning this; I didn't understand that either when he said it, but I do now.

The zero-length check should stay, definitely.

What this looks like is that further compression of the WAL is possible, but given it's alongside backup blocks, that's on the order of a 1% saving, so probably isn't worth considering right now.
The way to do that would be to include a small token to show the record has been optimised, rather than it being zero length.

-- 
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I just pushed this patch to the master branch.  There was a corresponding catversion bump and pg_control version bump.  I have verified that "make check-world" passes on my machine, as well as the isolation tests and pg_upgrade.

Tom Lane said at one point "this is too complex to maintain".  Several times during the development I feared he might well be right.  I am sure he will be discovering many oversights and poor design choices when he gets around to reviewing the code; hopefully some extra effort will be all that's needed to make the whole thing work sanely and not eat anyone's data.  I just hope that nothing so serious comes up that the patch will need to be reverted completely.

This patch is the result of the work of many people.  I am not allowed to mention the patch sponsors in the commit message, so I'm doing it here: first and foremost I need to thank my previous and current employers, Command Prompt and 2ndQuadrant -- they were extremely kind in allowing me to work on this for days on end (and not all of it was supported by financial sponsors).

Joel Jacobson started the whole effort by posting a screencast of a problem his company was having; I hope they found a workaround in the meantime, because his post was in mid 2010.  The key idea of this patch's design came from Simon Riggs; Noah Misch provided additional design advice that allowed the project to run to completion.  Noah and Andres Freund spent considerable time reviewing early versions of this patch; they uncovered many embarrassing bugs in my implementation.  Kevin Grittner, Robert Haas, and others provided useful comments as well.  Noah Misch, Andres Freund, Marti Raudsepp and Alexander Shulgin also provided bits of code.  Any bugs that remain are, of course, my own fault only.

Financial support came from
* Command Prompt Inc. of Washington, US
* 2ndQuadrant Ltd. of United Kingdom
* Trustly (then Glue Finance) of Sweden
* Novozymes A/S of Denmark
* MailerMailer LLC of Maryland, US
* PostgreSQL Experts, Inc. of California, US

My sincerest thanks to everyone.  I seriously hope that no patch of mine ever becomes this monstrous again.

-- 
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services