Thread: Hot Standby 0.2.1
OK, here is the latest version of the Hot Standby patchset. This is about version 30+ by now, but we should regard this as 0.2.1 Patch against CVS HEAD (now): clean apply, compile, no known bugs. OVERVIEW You can download PDF versions of the fine manual is here http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf http://wiki.postgresql.org/images/0/01/Hot_Standby_Recovery_Functions.pdf Also available via the project Wiki, which is here http://wiki.postgresql.org/wiki/Hot_Standby Patch should be attached to this email. If problems, get this and other versions from Wiki please. No offlist comments, questions etc please. PATCH VERSIONING & STATUS 0 meaning its not fully released yet 2 meaning this is a major new re-write 1 meaning this is the first release Patch is still in testing and will be for next few days at least. Released now only because I promised to do so. Is this ready for commit? Yes, it is in the shape I want it to be in, but also, No, I can't say it's been through a wide enough range of tests as yet to be considered immediately ready for commit. Further bug fixing and minor cosmetic development will take place via my GIT repo, uploaded soon. Patch included here to meet deadlines and code inclusion. BSD. PATCH SUMMARY * 76 files changed, 5160 insertions(+), 59 deletions(-), 1251 mods(!) * 7 files with more than 100 lines changed procarray.c (1200+ additions) xlog.c (600+ additions) xact.c (500+ additions) inval.c (650 additions) lock.c heapam.c nbtxlog.c * 29 files with 10 or fewer lines changed * Applies cleanly to CVS HEAD as of now CHANGES The rough changes since version 1 series of patches. * Full documentation included. Many, but not all nuances of SGML tagging followed, but sufficient aspects there to allow for proofreading before we do final changes. Some undocumented functions now correctly documented. Recovery functions now split into user and control functions in docs to make it clearer. * Starting conditions in GetRunningTransactionData() are now much stricter and holds more lwlocks. There are few cases where any not-found xids are allowed during xid processing, so code is more robust. Please check for race conditions. * GetRunningTransactionData() now handles initialisation of AccessExclusiveLocks correctly. Locks are counted in a low-contention approach that avoids taking holding lock partition locks, if possible. * Start-up conditions now recoded to allow faster start in cases where many subtransactions are present. Recovery connections are only enabled when the snapshot is valid. * max_connections needs to be correctly set or HS will not allow connections. Once snapshots are enabled they will continue to be available always. * RecordKnownAssignedTransactions() now contains a test for xid wraparound threat which invokes conflict processing should that occur. * Boolean states now clarified and corrected. Hot Standby can be turned off completely if not required or if problems effect production. That causes many changes but there is no change in the intention of those sections of code. * UnobservedXids processing follows Heikki's proposal, but has been renamed to KnownAssignedXids. It has also been modularised and completely re-written using a hash table approach. So far it has been much more stable than the previous sorted array coding, which I am happy to see in the shredder for all the problems it caused. Fully detailed comments all through. * All record types now respect max_standby_delay. * All deferred conflict processing has been removed - conflict processing itself is still enabled. * A few other functions have been renamed and/or moved around to rationalise their exact purpose/position within their modules. * Prepared transactions holding AccessExclusiveLocks at the end of recovery are now handled. * Hash indexes are now safely handled. That was removed at request, but we need it to avoid silent data loss for queries near those types of index. * Hint bits are now set, in appropriate circumstances. * Flat file logic removed * Large swathes of unused code removed. * All code comments addressed and/or re-explained in more detail * CHECKPOINT now works during recovery but performs restartpoint instead. * Tweaked max_standby_delay code to avoid long duration waits. Added dynamic function to control delay during recovery. Added code for stats collection and ps display. Set default to sensible production values. RECENT BUGS * Found and fixed missing relcache init file invalidation * Found and fixed more serious VACUUM FULL-related weirdness <sigh> * Recently discovered bug has resulted in changes in XidInMVCCSnapshot(). Heikki's earlier approach did not correctly allow for the maximum size of a snapshot. The simplicity of Heikki's proposal is good, but hid a flaw in where snapshots would put their xids. I've fully solved the problem though I expect further discussion. I've looked through every change and verified it, but fixing all the bugs means there's areas of new code added in last few days. I accept that any bugs herein are my responsibility. -- Simon Riggs www.2ndQuadrant.com
Attachment
On Tue, Sep 15, 2009 at 10:41:59PM +0100, Simon Riggs wrote: > > OK, here is the latest version of the Hot Standby patchset. This is > about version 30+ by now, but we should regard this as 0.2.1 Patch > against CVS HEAD (now): clean apply, compile, no known bugs. Kudos!!!! Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
--On 15. September 2009 22:41:59 +0100 Simon Riggs <simon@2ndQuadrant.com> wrote: > http://wiki.postgresql.org/images/0/01/Hot_Standby_Recovery_Functions.pdf This doesn't work for me, it seems the correct link is <http://wiki.postgresql.org/images/1/10/Hot_Standby_Recovery_Functions.pdf> ? -- Thanks Bernd
All, Now that Simon has submitted this, can some of the heavy-hitters here review it? Heikki? Nobody's name is next to it. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote: > Now that Simon has submitted this, can some of the heavy-hitters here > review it? Heikki? > > Nobody's name is next to it. I don't think anyone is planning to ignore this patch, but it wasn't included in the first round of round-robin reviewing assignments because it wasn't submitted until the following day, after the announced deadline for submissions had already passed. So most people are probably busy with with some other patch at the moment, but that's a temporary phenomenon. This is a pretty small CommitFest, so there shouldn't be any shortage of reviewers, though Heikki's time may be stretched a little thin, since Streaming Replication is also in the queue, and he is working on index-only scans. That's really for him to comment on, though. ...Robert
Robert Haas wrote: > On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Now that Simon has submitted this, can some of the heavy-hitters here >> review it? Heikki? >> >> Nobody's name is next to it. > > I don't think anyone is planning to ignore this patch, but it wasn't > included in the first round of round-robin reviewing assignments > because it wasn't submitted until the following day, after the > announced deadline for submissions had already passed. So most people > are probably busy with with some other patch at the moment, but that's > a temporary phenomenon. Right, I've added myself as reviewer now. > This is a pretty small CommitFest, so there > shouldn't be any shortage of reviewers, though Heikki's time may be > stretched a little thin, since Streaming Replication is also in the > queue, and he is working on index-only scans. That's really for him > to comment on, though. I'm going to put the index-only scans aside for now to focus on hot standby and streaming replication. Both are big patches, so there's plenty of work in those two alone, and not only for me. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-09-17 at 09:54 +0300, Heikki Linnakangas wrote: > > This is a pretty small CommitFest, so there > > shouldn't be any shortage of reviewers, though Heikki's time may be > > stretched a little thin, since Streaming Replication is also in the > > queue, and he is working on index-only scans. That's really for him > > to comment on, though. > > I'm going to put the index-only scans aside for now to focus on hot > standby and streaming replication. Both are big patches, so there's > plenty of work in those two alone, and not only for me. That's very good of you, thanks. It was already clear to a few people that your time would bottleneck trying to review both at the same time. I personally wasn't expecting you to jump into immediate action on HS. We have time. -- Simon Riggs www.2ndQuadrant.com
On Thu, Sep 17, 2009 at 2:54 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Robert Haas wrote: >> On Wed, Sep 16, 2009 at 6:05 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> Now that Simon has submitted this, can some of the heavy-hitters here >>> review it? Heikki? >>> >>> Nobody's name is next to it. >> >> I don't think anyone is planning to ignore this patch, but it wasn't >> included in the first round of round-robin reviewing assignments >> because it wasn't submitted until the following day, after the >> announced deadline for submissions had already passed. So most people >> are probably busy with with some other patch at the moment, but that's >> a temporary phenomenon. > > Right, I've added myself as reviewer now. > >> This is a pretty small CommitFest, so there >> shouldn't be any shortage of reviewers, though Heikki's time may be >> stretched a little thin, since Streaming Replication is also in the >> queue, and he is working on index-only scans. That's really for him >> to comment on, though. > > I'm going to put the index-only scans aside for now to focus on hot > standby and streaming replication. Both are big patches, so there's > plenty of work in those two alone, and not only for me. What is the best way to attack this? Should I keep reviewing index-only scans so that you have feedback for when you get back to it, or should I move on to something else? If something else, does it make more sense for me to look at HS since I did a bit of work with a previous version, or would it be better for me to just pick one of the other patches from the CommitFest and work on that? Also, stepping back from me personally, should we try to assign some additional reviewers to these patches? Is there some way we can divide up review tasks among multiple people so that we're not repeating each others work? Thoughts appreciated, from Heikki, Simon, or others. ...Robert
On Thu, 2009-09-17 at 19:01 -0400, Robert Haas wrote: > > > > I'm going to put the index-only scans aside for now to focus on hot > > standby and streaming replication. Both are big patches, so there's > > plenty of work in those two alone, and not only for me. > > What is the best way to attack this? Should I keep reviewing > index-only scans so that you have feedback for when you get back to > it, or should I move on to something else? If something else, does it > make more sense for me to look at HS since I did a bit of work with a > previous version, or would it be better for me to just pick one of the > other patches from the CommitFest and work on that? > > Also, stepping back from me personally, should we try to assign some > additional reviewers to these patches? Is there some way we can > divide up review tasks among multiple people so that we're not > repeating each others work? > > Thoughts appreciated, from Heikki, Simon, or others. I think this is a great opportunity to widen the pool of people contributing to reviews. I suggest the creation of a second group of people, performing round-robin testing of patches. These people would be able to verify * documentation matches implemented features (does it do what it says on the tin?) * usability of explicit features (do the features work well?) * usability gap of unimplemented features (what else do we need?) * are there any bugs? These questions are often quickly answered for smaller patches, but HS's scope mean that such a task properly executed could take a full week, if not longer. Second group of people are just as skilled Postgres people as reviewers, in some cases more so, apart from they have less detailed knowledge of the codebase. We have many such people and it would be good to encourage them to perform thorough reviews rather than "tire kicking". I'm not sure that Heikki needs additional reviewers. He now has significant knowledge of the patch and is good at focusing on key aspects of the internals. Other code reviewers are welcome, of course. -- Simon Riggs www.2ndQuadrant.com
Robert Haas <robertmhaas@gmail.com> writes: > Also, stepping back from me personally, should we try to assign some > additional reviewers to these patches? Is there some way we can > divide up review tasks among multiple people so that we're not > repeating each others work? > > Thoughts appreciated, from Heikki, Simon, or others. How about this proposal: http://archives.postgresql.org/pgsql-hackers/2009-08/msg00764.php Regards, -- dim
On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Hi Simon,
Is there a reason that you remove the WAL_DEBUG shown below?
*************** begin:;
*** 899,923 ****
FIN_CRC32(rdata_crc);
record->xl_crc = rdata_crc;
- #ifdef WAL_DEBUG
- if (XLOG_DEBUG)
- {
- StringInfoData buf;
-
- initStringInfo(&buf);
- appendStringInfo(&buf, "INSERT @ %X/%X: ",
- RecPtr.xlogid, RecPtr.xrecoff);
- xlog_outrec(&buf, record);
- if (rdata->data != NULL)
- {
- appendStringInfo(&buf, " - ");
- RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
- }
- elog(LOG, "%s", buf.data);
- pfree(buf.data);
- }
- #endif
-
/* Record begin of record in appropriate places */
ProcLastRecPtr = RecPtr;
Insert->PrevRecord = RecPtr;
--- 947,952 ----
Thanks,
OK, here is the latest version of the Hot Standby patchset. This is
about version 30+ by now, but we should regard this as 0.2.1
Patch against CVS HEAD (now): clean apply, compile, no known bugs.
Hi Simon,
Is there a reason that you remove the WAL_DEBUG shown below?
*************** begin:;
*** 899,923 ****
FIN_CRC32(rdata_crc);
record->xl_crc = rdata_crc;
- #ifdef WAL_DEBUG
- if (XLOG_DEBUG)
- {
- StringInfoData buf;
-
- initStringInfo(&buf);
- appendStringInfo(&buf, "INSERT @ %X/%X: ",
- RecPtr.xlogid, RecPtr.xrecoff);
- xlog_outrec(&buf, record);
- if (rdata->data != NULL)
- {
- appendStringInfo(&buf, " - ");
- RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
- }
- elog(LOG, "%s", buf.data);
- pfree(buf.data);
- }
- #endif
-
/* Record begin of record in appropriate places */
ProcLastRecPtr = RecPtr;
Insert->PrevRecord = RecPtr;
--- 947,952 ----
Jeff
On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote: > On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> > wrote: > > OK, here is the latest version of the Hot Standby patchset. > This is > about version 30+ by now, but we should regard this as 0.2.1 > Patch against CVS HEAD (now): clean apply, compile, no known > bugs. > Is there a reason that you remove the WAL_DEBUG shown below? WAL_DEBUG is not removed by the patch, though that section of code is removed, as you observe. I recall an earlier bug report by me/conversation on hackers about how that section of code was irrecoverably broken, since it's calling an rmgr routine while not in recovery and also assuming the data is fully accessible at that point, which it is not. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote: >> Is there a reason that you remove the WAL_DEBUG shown below? > WAL_DEBUG is not removed by the patch, though that section of code is > removed, as you observe. I recall an earlier bug report by > me/conversation on hackers about how that section of code was > irrecoverably broken, since it's calling an rmgr routine while not in > recovery and also assuming the data is fully accessible at that point, > which it is not. Wouldn't it be sufficient to remove the rm_desc() call? I agree that that's broken, but the rest doesn't seem to be. regards, tom lane
On Fri, 2009-09-18 at 11:14 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Fri, 2009-09-18 at 07:23 -0700, Jeff Janes wrote: > >> Is there a reason that you remove the WAL_DEBUG shown below? > > > WAL_DEBUG is not removed by the patch, though that section of code is > > removed, as you observe. I recall an earlier bug report by > > me/conversation on hackers about how that section of code was > > irrecoverably broken, since it's calling an rmgr routine while not in > > recovery and also assuming the data is fully accessible at that point, > > which it is not. > > Wouldn't it be sufficient to remove the rm_desc() call? I agree > that that's broken, but the rest doesn't seem to be. That would make sense also. Previous action just because that was earlier consensus. Will change. -- Simon Riggs www.2ndQuadrant.com
I want to help on this area, but I need a mentor for this. For example, Heikki will be a excellent mentor for me. Following the theme, I think that we have to wide all questions for the process of the acceptance of a patch on the sameway that you Simon. We could write new requirements with all these ideas. Don´t you think? Regards "The hurry is enemy of the success: for that reason.......Be patient" Ing. Marcos L. Ortiz Valmaseda Línea Soporte y Despliegue Centro de Tecnologías de Almacenamiento y Análisis de Datos (CENTALAD) Linux User # 418229 PostgreSQL User http://www.postgresql.org http://www.planetpostgresql.org/ http://www.postgresql-es.org/ ----- Mensaje original ----- De: "Simon Riggs" <simon@2ndQuadrant.com> Para: "Robert Haas" <robertmhaas@gmail.com> CC: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Josh Berkus" <josh@agliodbs.com>, pgsql-hackers@postgresql.org Enviados: Jueves, 17 de Septiembre 2009 20:53:24 GMT -10:00 Hawai Asunto: Re: [HACKERS] Hot Standby 0.2.1 On Thu, 2009-09-17 at 19:01 -0400, Robert Haas wrote: > > > > I'm going to put the index-only scans aside for now to focus on hot > > standby and streaming replication. Both are big patches, so there's > > plenty of work in those two alone, and not only for me. > > What is the best way to attack this? Should I keep reviewing > index-only scans so that you have feedback for when you get back to > it, or should I move on to something else? If something else, does it > make more sense for me to look at HS since I did a bit of work with a > previous version, or would it be better for me to just pick one of the > other patches from the CommitFest and work on that? > > Also, stepping back from me personally, should we try to assign some > additional reviewers to these patches? Is there some way we can > divide up review tasks among multiple people so that we're not > repeating each others work? > > Thoughts appreciated, from Heikki, Simon, or others. I think this is a great opportunity to widen the pool of people contributing to reviews. I suggest the creation of a second group of people, performing round-robin testing of patches. These people would be able to verify * documentation matches implemented features (does it do what it says on the tin?) * usability of explicit features (do the features work well?) * usability gap of unimplemented features (what else do we need?) * are there any bugs? These questions are often quickly answered for smaller patches, but HS's scope mean that such a task properly executed could take a full week, if not longer. Second group of people are just as skilled Postgres people as reviewers, in some cases more so, apart from they have less detailed knowledge of the codebase. We have many such people and it would be good to encourage them to perform thorough reviews rather than "tire kicking". I'm not sure that Heikki needs additional reviewers. He now has significant knowledge of the patch and is good at focusing on key aspects of the internals. Other code reviewers are welcome, of course. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Simon Riggs wrote: > OK, here is the latest version of the Hot Standby patchset. This is > about version 30+ by now, but we should regard this as 0.2.1 > Patch against CVS HEAD (now): clean apply, compile, no known bugs. Thanks! Attached is some minor comment and fixes, and some dead code removal. Also available in my git repository, branch 'hs-riggs'. The documentation talks about setting and checking default_transaction_read_only, but I think it doesn't say anything about transaction_read_only, which I find odd. This in particular: > Users will be able to tell whether their session is read-only by > + issuing SHOW default_transaction_read_only seems misleading, as you might have default_transaction_read_only=on, but still be able to do "SET transaction_read_only", so the *session* isn't necessarily read-only. The only bug I've found is this that we seem to be missing conflict resolution for GiST index tuples deleted by the kill_prior_tuples mechanism. Unless I'm missing something, we need similar handling there that we have in b-tree. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index 91917cf..2257ec6 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -2099,9 +2099,9 @@ if (!triggered) <para> In recovery, transactions will not be permitted to take any lock higher - other than AccessShareLock or AccessExclusiveLock. In addition, - transactions may never assign a TransactionId and may never write WAL. - The LOCK TABLE command by default applies an AccessExclusiveLock. + than AccessShareLock. In addition, transactions may never assign a + TransactionId and may never write WAL. + The LOCK TABLE command by default applies an AccessExclusiveLock. Any LOCK TABLE command that runs on the standby and requests a specific lock type other than AccessShareLock will be rejected. </para> @@ -2168,8 +2168,8 @@ if (!triggered) <para> An example of the above would be an Administrator on Primary server - runs a DROP TABLE command that refers to a table currently in use by - a User query on the standby server. + runs a DROP TABLE command on a table that's currently being queried + in the standby server. </para> <para> @@ -2198,9 +2198,9 @@ if (!triggered) <para> We have a number of choices for resolving query conflicts. The default is that we wait and hope the query completes. If the recovery is not paused, - then the server will wait automatically until the server the lag between + then the server will wait automatically until the lag between primary and standby is at most max_standby_delay seconds. Once that grace - period expires, we then take one of the following actions: + period expires, we take one of the following actions: <itemizedlist> <listitem> @@ -2213,7 +2213,7 @@ if (!triggered) <para> If the conflict is caused by cleanup records we tell the standby query that a conflict has occurred and that it must cancel itself to avoid the - risk that it attempts to silently fails to read relevant data because + risk that it silently fails to read relevant data because that data has been removed. (This is very similar to the much feared error message "snapshot too old"). </para> @@ -2222,7 +2222,7 @@ if (!triggered) Note also that this means that idle-in-transaction sessions are never canceled except by locks. Users should be clear that tables that are regularly and heavily updated on primary server will quickly cause - cancellation of any longer running queries made against those tables. + cancellation of any longer running queries in the standby. </para> <para> @@ -2235,7 +2235,7 @@ if (!triggered) </para> <para> - Other remdial actions exist if the number of cancelations is unacceptable. + Other remedial actions exist if the number of cancelations is unacceptable. The first option is to connect to primary server and keep a query active for as long as we need to run queries on the standby. This guarantees that a WAL cleanup record is never generated and we don't ever get query @@ -2283,7 +2283,7 @@ if (!triggered) <title>Administrator's Overview</title> <para> - If there is a recovery.conf file present then the will start in Hot Standby + If there is a recovery.conf file present the server will start in Hot Standby mode by default, though this can be disabled by setting "recovery_connections = off" in recovery.conf. The server may take some time to enable recovery connections since the server must first complete @@ -2329,7 +2329,7 @@ LOG: database system is ready to accept read only connections A set of functions allow superusers to control the flow of recovery are described in <xref linkend="functions-recovery-control-table">. These functions allow you to pause and continue recovery, as well - as dynamically set new recovery targets wile recovery progresses. + as dynamically set new recovery targets while recovery progresses. Note that when a server is paused the apparent delay between primary and standby will continue to increase. </para> @@ -2354,7 +2354,7 @@ LOG: database system is ready to accept read only connections </para> <para> - The following types of administrator command will not be accepted + The following types of administrator command are not accepted during recovery mode <itemizedlist> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 06243c0..b38b344 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -372,9 +372,9 @@ SET ENABLE_SEQSCAN TO OFF; </para> <para> - When running a standby server, it is strongly recommended that you - set this parameter to be the same or higher than the master server. - Otherwise, queries on the standby server may fail. + When running a standby server, you must set this parameter to the + same or higher value than on the master server. Otherwise, queries on + will not be allowed in the standby server. </para> </listitem> </varlistentry> diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index 94b7202..c6612b8 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -622,7 +622,7 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record) uint8 info = record->xl_info & ~XLR_INFO_MASK; /* - * GIN indexes do not require any conflict processing. XXX really? + * GIN indexes do not require any conflict processing. */ if (InHotStandby) RecordKnownAssignedTransactionIds(record->xl_xid); diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c index 3e5f3b6..8ed49e0 100644 --- a/src/backend/access/gist/gistxlog.c +++ b/src/backend/access/gist/gistxlog.c @@ -397,7 +397,8 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record) MemoryContext oldCxt; /* - * GIST indexes do not require any conflict processing. XXX really? + * GIST indexes do not require any conflict processing. XXX what about + * killed tuples? */ if (InHotStandby) RecordKnownAssignedTransactionIds(record->xl_xid); diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index c58d2fc..d7dfbdd 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -410,7 +410,7 @@ situation the locking requirements can be relaxed and we do not need double locking during block splits. Each WAL record makes changes to a single level of the btree using the correct locking sequence and so is safe for concurrent readers. Some readers may observe a block split -in progress as they descend the tree, but they will simple move right +in progress as they descend the tree, but they will simpye move right onto the correct page. During recovery all index scans start with ignore_killed_tuples = false @@ -419,8 +419,8 @@ on the standby server can be older than the oldest xmin on the master server, which means tuples can be marked as killed even when they are still visible on the standby. We don't WAL log tuple killed bits, but they can still appear in the standby because of full page writes. So -we must always ignore them and that means it's not worth setting them -either. +we must always ignore them in standby, and that means it's not worth +setting them either. Other Things That Are Handy to Know ----------------------------------- diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index eefa888..7871524 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -721,6 +721,7 @@ _bt_delitems(Relation rel, Buffer buf, * We would like to set an accurate latestRemovedXid, but there * is no easy way of obtaining a useful value. So we use the * probably far too conservative value of RecentGlobalXmin instead. + * XXX: this comment is bogus? We don't use RecentGlobalXmin */ xlrec_delete.latestRemovedXid = InvalidTransactionId; rdata[0].data = (char *) &xlrec_delete; diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index 864c41c..6b962b6 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -23,7 +23,7 @@ #include "miscadmin.h" /* - * We must keep track of expected insertions due to page spl its, and apply + * We must keep track of expected insertions due to page splits, and apply * them manually if they are not seen in the WAL log during replay. This * makes it safe for page insertion to be a multiple-WAL-action process. * @@ -503,8 +503,8 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) } /* - * We need to take a cleanup lock to apply these changes. - * See nbtree/README for details. + * Like in btvacuumpage(), we need to take a cleanup lock on every leaf + * page. See nbtree/README for details. */ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); if (!BufferIsValid(buffer)) @@ -805,11 +805,11 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) /* * Btree delete records can conflict with standby queries. You might - * think that vacuum records would conflict as well, but they don't. - * XLOG_HEAP2_CLEANUP_INFO records provide the highest xid cleaned - * by the vacuum of the heap and so we can resolve any conflicts just - * once when that arrives. After that any we know that no conflicts exist - * from individual btree vacuum records on that index. + * think that vacuum records would conflict as well, but we've handled + * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid + * cleaned by the vacuum of the heap and so we can resolve any conflicts + * just once when that arrives. After that any we know that no conflicts + * exist from individual btree vacuum records on that index. */ if (InHotStandby) { diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index fc7ecfd..46b48f0 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -195,8 +195,7 @@ they first do something that requires one --- typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned. If a subtransaction requires an XID, we always first assign one to its parent. This maintains the invariant that child transactions have XIDs later -than their parents, which is assumed in a number of places. In 8.5 onwards, -some corner cases exist that require XID assignment to be WAL logged. +than their parents, which is assumed in a number of places. The subsidiary actions of obtaining a lock on the XID and and entering it into pg_subtrans and PG_PROC are done at the time it is assigned. diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 28d1cf0..84c0d54 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -376,7 +376,6 @@ ProcArrayClearTransaction(PGPROC *proc) proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; proc->recoveryConflictMode = 0; - proc->recoveryConflictLSN = InvalidXLogRecPtr; /* redundant, but just in case */ proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK; @@ -1171,7 +1170,7 @@ void GetRunningTransactionData(void) { ProcArrayStruct *arrayP = procArray; - static RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData; + RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData; RunningXact *rxact; TransactionId *subxip; TransactionId latestRunningXid = InvalidTransactionId; @@ -1274,14 +1273,8 @@ GetRunningTransactionData(void) numHeldLocks += proc->numHeldLocks; /* - * Save subtransaction XIDs. - * - * The other backend can add more subxids concurrently, but cannot - * remove any. Hence it's important to fetch nxids just once. Should - * be safe to use memcpy, though. (We needn't worry about missing any - * xids added concurrently, because they must postdate xmax.) - * - * Again, our own XIDs *are* included in the snapshot. + * Save subtransaction XIDs. Other backends can't add or remove entries + * while we're holding XidGenLock. */ nxids = proc->subxids.nxids; @@ -1536,41 +1529,6 @@ BackendPidGetProc(int pid) } /* - * BackendXidGetProc -- get a backend's PGPROC given its XID - * - * Returns NULL if not found. Note that it is up to the caller to be - * sure that the question remains meaningful for long enough for the - * answer to be used ... - */ -PGPROC * -BackendXidGetProc(TransactionId xid) -{ - PGPROC *result = NULL; - ProcArrayStruct *arrayP = procArray; - int index; - - if (xid == InvalidTransactionId) /* never match invalid xid */ - return 0; - - LWLockAcquire(ProcArrayLock, LW_SHARED); - - for (index = 0; index < arrayP->numProcs; index++) - { - PGPROC *proc = arrayP->procs[index]; - - if (proc->xid == xid) - { - result = proc; - break; - } - } - - LWLockRelease(ProcArrayLock); - - return result; -} - -/* * BackendXidGetPid -- get a backend's pid given its XID * * Returns 0 if not found or it's a prepared transaction. Note that diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c index d05e7a2..48d6553 100644 --- a/src/backend/utils/cache/inval.c +++ b/src/backend/utils/cache/inval.c @@ -1586,15 +1586,11 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, /* * Issue orders for the proc to read next time it receives SIGINT * Note that this is an atomic change and requires no locking. + * + * XXX: Huh? */ if (proc->recoveryConflictMode < cancel_mode) - { - if (cancel_mode == ERROR && - XLByteLT(proc->recoveryConflictLSN, conflict_lsn)) - proc->recoveryConflictLSN = conflict_lsn; - proc->recoveryConflictMode = cancel_mode; - } /* * Do we expect it to talk? No, Mr. Bond, we expect it to die. diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index bfcb75c..91b4e23 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -536,10 +536,6 @@ typedef BTScanOpaqueData *BTScanOpaque; #define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT) #define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT) -/* XXX probably needs new RMgr call to do this cleanly */ -extern bool btree_is_cleanup_record(uint8 info); -extern bool btree_needs_cleanup_lock(uint8 info); - /* * prototypes for functions in nbtree.c (external entry points for btree) */ diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index f674547..895532d 100644 --- a/src/include/storage/lock.h +++ b/src/include/storage/lock.h @@ -294,7 +294,6 @@ typedef struct LOCKTAG * nRequested -- total requested locks of all types. * granted -- count of each lock type currently granted on the lock. * nGranted -- total granted locks of all types. - * xid -- xid of current/only lock holder for use by GetLockStatusData() * * Note: these counts count 1 for each backend. Internally to a backend, * there may be multiple grabs on a particular lock, but this is not reflected @@ -375,6 +374,11 @@ typedef struct PROCLOCK #define PROCLOCK_LOCKMETHOD(proclock) \ LOCK_LOCKMETHOD(*((proclock).tag.myLock)) +/* + * Does acquisition of this lock need to be replayed in a standby server? + * Only AccessExclusiveLocks can conflict with lock types that read-only + * transactions can acquire in a standby server. + */ #define IsProcLockLoggable(proclock) \ ((proclock->holdMask & LOCKBIT_ON(AccessExclusiveLock)) && \ (proclock->tag.myLock)->tag.locktag_type == LOCKTAG_RELATION) diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 5729482..0df2529 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -99,12 +99,9 @@ struct PGPROC int numHeldLocks; /* Number of AccessExclusiveLocks held by * current backend. */ /* - * InHotStandby mode, the lsn of the first conflict, if any. - * Any block seen with changes made after this lsn will trigger - * query cancelation. Always set recoveryConflictCancelMode after - * setting conflictLSN so we can check this without spin locking. + * While in hot standby mode, setting recoveryConflictMode instructs + * the backend to commit suicide. */ - XLogRecPtr recoveryConflictLSN; int recoveryConflictMode; /* Info about LWLock the process is currently waiting for, if any. */ diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index b882795..36d5d52 100644 --- a/src/include/storage/procarray.h +++ b/src/include/storage/procarray.h @@ -54,7 +54,6 @@ extern int GetTransactionsInCommit(TransactionId **xids_p); extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids); extern PGPROC *BackendPidGetProc(int pid); -extern PGPROC *BackendXidGetProc(TransactionId xid); extern int BackendXidGetPid(TransactionId xid); extern bool IsBackendPid(int pid); diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h index 92feaf8..67e8112 100644 --- a/src/include/utils/snapshot.h +++ b/src/include/utils/snapshot.h @@ -17,6 +17,7 @@ #include "storage/buf.h" #include "storage/sinval.h" + typedef struct SnapshotData *Snapshot; #define InvalidSnapshot ((Snapshot) NULL) @@ -51,7 +52,8 @@ typedef struct SnapshotData /* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */ int32 subxcnt; /* # of xact ids in subxip[], -1 if overflow */ TransactionId *subxip; /* array of subxact IDs in progress */ - bool takenDuringRecovery; /* Recovery-shaped snapshot? */ + bool takenDuringRecovery; /* recovery-shaped snapshot? */ + /* * note: all ids in subxip[] are >= xmin, but we don't bother filtering * out any that are >= xmax
On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote: > The only bug I've found ! > is this that we seem to be missing conflict > resolution for GiST index tuples deleted by the kill_prior_tuples > mechanism. Unless I'm missing something, we need similar handling there > that we have in b-tree. OK, I agree with that. Straightforward change. Thanks very much. I marked the comment to indicate that the handling for GIST and GIN indexes looked dubious to me also. I had the earlier "it is safe" comments but that was before we looked at the kill prior tuples issue. Re-reading code for GIN also, I note that there isn't any further work because we don't kill prior tuples ever. Also, there is no special handling of the GIN b-tree posting tree because VACUUM scans that in logical sequence, rather than the physical sequence in nbtree. -- Simon Riggs www.2ndQuadrant.com
On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote: > The documentation talks about setting and checking > default_transaction_read_only, but I think it doesn't say anything > about > transaction_read_only, which I find odd. This in particular: > > > Users will be able to tell whether their session is read-only by > > + issuing SHOW default_transaction_read_only > > seems misleading, as you might have default_transaction_read_only=on, > but still be able to do "SET transaction_read_only", so the *session* > isn't necessarily read-only. Yes, clearly missing a check there. Those two operations should be blocked at higher level, using PreventCommandDuringRecovery() and I confess that I thought they already were. Doesn't crash because of the other checks in place, but gives wrong error message. Thanks for penetration testing the patch. -- Simon Riggs www.2ndQuadrant.com
On Mon, 2009-09-21 at 14:01 +0100, Simon Riggs wrote: > On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote: > > is this that we seem to be missing conflict > > resolution for GiST index tuples deleted by the kill_prior_tuples > > mechanism. Unless I'm missing something, we need similar handling there > > that we have in b-tree. > > OK, I agree with that. Straightforward change. Thanks very much. > > I marked the comment to indicate that the handling for GIST and GIN > indexes looked dubious to me also. I had the earlier "it is safe" > comments but that was before we looked at the kill prior tuples issue. ISTM I looked at this too quickly. kill_prior_tuple is only ever set by these lines, after scan starts: if (!scan->xactStartedInRecovery) scan->kill_prior_tuple = scan->xs_hot_dead; which is set in indexam.c, not within any particular am. So the coding, as submitted, covers all index types, current and future. AFAICS there is no bug, unless you have a test case or can explain further? Worth raising as a query because it forced me to re-check how GIST and GIN work and am happy again now. -- Simon Riggs www.2ndQuadrant.com
On Mon, Sep 21, 2009 at 9:01 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote: > >> The only bug I've found > > ! Yeah, wow. ...Robert
Simon Riggs wrote: > > OK, here is the latest version of the Hot Standby patchset. This is > about version 30+ by now, but we should regard this as 0.2.1 > Patch against CVS HEAD (now): clean apply, compile, no known bugs. Wow, great! Simon has allowed us to pass a great milestone in Postgres development. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > OK, here is the latest version of the Hot Standby patchset. This is > about version 30+ by now, but we should regard this as 0.2.1 > Patch against CVS HEAD (now): clean apply, compile, no known bugs. > > OVERVIEW > > You can download PDF versions of the fine manual is here > http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf From this doc: "In recovery, transactions will not be permitted to take any lock higher other than AccessShareLock or AccessExclusiveLock. In addition, transactions may never assign a TransactionId and may never write WAL. The LOCK TABLE command by default applies an AccessExclusiveLock. Any LOCK TABLE command that runs on the standby and requests a specific lock type other than AccessShareLock will be rejected." The first sentence seems to say that clients on the stand-by can take ACCESS EXCLUSIVE, while the last sentence seems to say that they cannot do so. I did a little experiment on a hot standby instance. I expected that either I would be denied the lock altogether, or the lock would cause WAL replay to be paused until either I committed or was forcibly canceled. But neither happened, I was granted the lock but WAL replay continued anyway. jjanes=# begin; BEGIN jjanes=# lock table pgbench_history in access exclusive mode; LOCK TABLE jjanes=# select count(*) from pgbench_history; count -------- 519104 (1 row) jjanes=# select count(*) from pgbench_history; count -------- 527814 (1 row) Is this the expected behavior? Thanks, Jeff
Simon Riggs wrote: > On Mon, 2009-09-21 at 14:01 +0100, Simon Riggs wrote: >> On Mon, 2009-09-21 at 13:50 +0300, Heikki Linnakangas wrote: > >>> is this that we seem to be missing conflict >>> resolution for GiST index tuples deleted by the kill_prior_tuples >>> mechanism. Unless I'm missing something, we need similar handling there >>> that we have in b-tree. >> OK, I agree with that. Straightforward change. Thanks very much. >> >> I marked the comment to indicate that the handling for GIST and GIN >> indexes looked dubious to me also. I had the earlier "it is safe" >> comments but that was before we looked at the kill prior tuples issue. > > ISTM I looked at this too quickly. > > kill_prior_tuple is only ever set by these lines, after scan starts: > > if (!scan->xactStartedInRecovery) > scan->kill_prior_tuple = scan->xs_hot_dead; > > which is set in indexam.c, not within any particular am. So the coding, > as submitted, covers all index types, current and future. That just stops more tuples from being killed in the standby. I was thinking that we need similar conflict resolution in GiST that we do with b-tree delete records, to stop killed tuples from being deleted while they might still be needed in the standby. But looking closer at GiST, it seems that GiST doesn't actually do that; killed tuples are not removed at page splits, but only by VACUUM. So that's not an issue after all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote: > On Tue, Sep 15, 2009 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > OK, here is the latest version of the Hot Standby patchset. This is > > about version 30+ by now, but we should regard this as 0.2.1 > > Patch against CVS HEAD (now): clean apply, compile, no known bugs. > > > > OVERVIEW > > > > You can download PDF versions of the fine manual is here > > http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf > > > >From this doc: > > "In recovery, transactions will not be permitted to take any lock > higher other than > AccessShareLock or AccessExclusiveLock. In addition, transactions may never > assign a TransactionId and may never write WAL. The LOCK TABLE command by > default applies an AccessExclusiveLock. Any LOCK TABLE command that runs on > the standby and requests a specific lock type other than AccessShareLock will be > rejected." > > The first sentence seems to say that clients on the stand-by can take > ACCESS EXCLUSIVE, while the last sentence seems to say that they > cannot do so. You are right to pick up that discrepancy, as Heikki did also. The root cause of that discrepancy is a change in patch behaviour between January and now that I will use your post to highlight and discuss, if you don't mind. (and yes, the docs need to be corrected) Initially, it seemed that it was certain that a read-only backend could not take an AccessExclusiveLock. On further thought, there is no particular reason to block AccessExclusiveLocks themselves, just that most things you might do while holding one are banned. But the lock itself is fine. (Any challenge on that?) AccessExclusiveLocks can be used to serialize the actions of other backends. That is a very common use case, so my concern was that LOCK TABLE would be a no-op unless we allowed AccessExclusiveLock, so the patch does allow it. > I did a little experiment on a hot standby instance. I expected that > either I would be denied the lock altogether, or the lock would cause > WAL replay to be paused until either I committed or was forcibly > canceled. But neither happened, I was granted the lock but WAL replay > continued anyway. > > jjanes=# begin; > BEGIN > jjanes=# lock table pgbench_history in access exclusive mode; > LOCK TABLE > jjanes=# select count(*) from pgbench_history; > count > -------- > 519104 > (1 row) > > jjanes=# select count(*) from pgbench_history; > count > -------- > 527814 > (1 row) > > Is this the expected behavior? By me, yes. WAL replay does not require a table lock to progress. Any changes are protected with block-level locks. It does acquire a table lock and cancel conflicting queries when it is about to replay something that would cause a query to explode, such as dropping a table, as explained in docs. So this is not a bug. The explanation of how the above sequence of events occurs is that the backend acquires AccessExclusiveLock - please check on other session in pg_locks. WAL replay continues by the Startup process, inserting further rows into the pgbench_history table as a series of transactions. The second select takes a later snapshot than the first and so sees more data than the first select, hence a larger count. (And I am pleased to see that recovery is progressing quickly even while your queries run). So not a bug, but just one of many possible behaviours we could enforce. 1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay 2. Allow AccessExclusiveLocks but have them pause WAL replay 3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a no-op because it will not be able to serialize anything) So the patch originally implemented (3) but now implements (1). I would say that (2) is very undesirable because it puts WAL replay in the control of non-superusers. That could mean LOCK TABLE implicitly alters the high availability of the standby, and might even be used maliciously to do that. I'm open to views on whether we should use (1) or (3). Comments? Implementing either is no problem and we have a straight choice. We may even wish to review that again later from additional feedback. (Jeff, you have also helped me understand that there is a bug in the way serializable transactions are cancelled, which is easily corrected. Thanks for that unexpected windfall, but it is otherwise unrelated to your comments.) -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote: >> jjanes=# begin; >> BEGIN >> jjanes=# lock table pgbench_history in access exclusive mode; >> LOCK TABLE >> jjanes=# select count(*) from pgbench_history; >> count >> -------- >> 519104 >> (1 row) >> >> jjanes=# select count(*) from pgbench_history; >> count >> -------- >> 527814 >> (1 row) >> >> Is this the expected behavior? > > By me, yes. WAL replay does not require a table lock to progress. Any > changes are protected with block-level locks. It does acquire a table > lock and cancel conflicting queries when it is about to replay something > that would cause a query to explode, such as dropping a table, as > explained in docs. That is rather surprising. You can't get that result in a normal server, which I think is enough of a reason to disallow it. If you do LOCK TABLE ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your nose. > So not a bug, but just one of many possible behaviours we could enforce. > 1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay > 2. Allow AccessExclusiveLocks but have them pause WAL replay > 3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a > no-op because it will not be able to serialize anything) > > So the patch originally implemented (3) but now implements (1). > > I would say that (2) is very undesirable because it puts WAL replay in > the control of non-superusers. That could mean LOCK TABLE implicitly > alters the high availability of the standby, and might even be used > maliciously to do that. I don't see a problem with (2) as long as the locker is kicked out after max_standby_delay, like a long-running query. That's what I would expect. I'm fine with (3) as well, but (1) does seem rather suprising behavior. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-09-22 at 11:04 +0300, Heikki Linnakangas wrote: > > > > By me, yes. WAL replay does not require a table lock to progress. Any > > changes are protected with block-level locks. It does acquire a table > > lock and cancel conflicting queries when it is about to replay something > > that would cause a query to explode, such as dropping a table, as > > explained in docs. > > That is rather surprising. You can't get that result in a normal server, > which I think is enough of a reason to disallow it. If you do LOCK TABLE > ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your > nose. OK, "normality" is a reasonable argument against. So (1) is only a partial implementation of serializing the table. > > So not a bug, but just one of many possible behaviours we could enforce. > > 1. Allow AccessExclusiveLocks yet they do not interrupt WAL replay > > 2. Allow AccessExclusiveLocks but have them pause WAL replay > > 3. Disallow AccessExclusiveLocks (and so LOCK TABLE is effectively a > > no-op because it will not be able to serialize anything) > > > > So the patch originally implemented (3) but now implements (1). > > > > I would say that (2) is very undesirable because it puts WAL replay in > > the control of non-superusers. That could mean LOCK TABLE implicitly > > alters the high availability of the standby, and might even be used > > maliciously to do that. > > I don't see a problem with (2) as long as the locker is kicked out after > max_standby_delay, like a long-running query. That's what I would > expect. I'm fine with (3) as well, but (1) does seem rather suprising > behavior. (2) gives other problems because it would force us to check for conflicting locks for each heap & index WAL record to ensure that the lock was honoured. We could optimize that but it's still going to cost. I'd rather leave things at (3) for now and wait for further feedback. "Start strict, relax later". -- Simon Riggs www.2ndQuadrant.com
In testing, it looks like there's still something wrong with the subtransaction handling. I created a test function to create a large number of subtransactions: CREATE LANGUAGE plpgsql; CREATE TABLE bar (id int4); CREATE OR REPLACE FUNCTION subxids (n integer) RETURNS void LANGUAGE plpgsql AS $$ BEGIN IF n <= 0 THEN RETURN; END IF; INSERT INTO bar VALUES (n); PERFORM subxids(n - 1); RETURN; EXCEPTION WHEN raise_exception THEN NULL; END; $$; And used that to created 100 nested subtransactions in the primary server: SELECT subxids(100); I got this in the standby: ... LOG: REDO @ 0/A000EA8; LSN 0/A000FCC: prev 0/A000E6C; xid 2662; len 264: Transaction - xid assignment xtop 2599 nsubxids 64; 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 0 LOG: extend subtrans xid 2600 page 1 last_page 0 CONTEXT: rmgr 1 redo xid assignment xtop 2599 nsubxids 64; 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 0 TRAP: FailedAssertion("!(((newestXact) >= ((TransactionId) 3)))", File: "subtrans.c", Line: 299) LOG: startup process (PID 28594) was terminated by signal 6: Aborted LOG: terminating any other active server processes Apparently an InvalidXid is sneaking into the unreported xid array collected in the primary. I actually bumped into this while I was testing a simpler implementation of the logic to collect the unreported xids in the primary. As the patch stands, we keep track of how many of the childXids at each subtransaction level have already been reported, but it occurs to me that it would be a lot simpler to populate a separate array of unreported xids on-the-fly, as xids are assigned. With the simpler implementation I bumped into another issue (see below), which is why I wanted to test if it worked without the simplification. So with the simpler logic, I had this problem. When I do this in the primary: postgres=# SELECT subxids(10000); ERROR: stack depth limit exceeded HINT: Increase the configuration parameter "max_stack_depth", after ensuring the platform's stack depth limit is adequate. CONTEXT: SQL statement "SELECT subxids( $1 - 1)" PL/pgSQL function "subxids" line 4 at PERFORM SQL statement "SELECT subxids( $1 - 1)" ... I get this in the standby: ... LOG: REDO @ 0/1000B6C4; LSN 0/1000B6F0: prev 0/1000B698; xid 4325; len 16: Transaction - abort: 2009-09-22 12:45:00.938243+03 LOG: record known xact 4325 latestObservedXid 4334 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938243+03 LOG: remove KnownAssignedXid 4325 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938243+03 LOG: REDO @ 0/1000B6F0; LSN 0/1000B71C: prev 0/1000B6C4; xid 4324; len 16: Transaction - abort: 2009-09-22 12:45:00.938438+03 LOG: record known xact 4324 latestObservedXid 4334 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938438+03 LOG: remove KnownAssignedXid 4324 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938438+03 LOG: REDO @ 0/1000B71C; LSN 0/1000B748: prev 0/1000B6F0; xid 4323; len 16: Transaction - abort: 2009-09-22 12:45:00.938632+03 LOG: record known xact 4323 latestObservedXid 4334 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03 LOG: remove KnownAssignedXid 4323 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03 LOG: 1 KnownAssignedXids 3619 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03 FATAL: cannot remove KnownAssignedXid 4323 CONTEXT: rmgr 1 redo abort: 2009-09-22 12:45:00.938632+03 LOG: startup process (PID 31284) exited with exit code 1 LOG: terminating any other active server processes It looks like the standby tries to remove XID 4323 from the known-assigned hash table, but it's not there because it was removed and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I guess we should just not throw an error in that case, but is there a way we could catch that narrow case and still keep the check in KnownAssignedXidsRemove()? It seems like the check might help catch other bugs, so I'd rather keep it if possible. I've pushed the simplified code to my git repository if you want to take a look. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote: > In testing, it looks like there's still something wrong with the > subtransaction handling. I created a test function to create a large > number of subtransactions: OK, looking at this now. Thanks for the report. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote: > It looks like the standby tries to remove XID 4323 from the > known-assigned hash table, but it's not there because it was removed > and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I > guess we should just not throw an error in that case, but is there a > way we could catch that narrow case and still keep the check in > KnownAssignedXidsRemove()? It seems like the check might help catch > other bugs, so I'd rather keep it if possible. Yes, a certain test is important and I'm glad we've (almost) achieved it. ISTM we can say if not in KnownAssignedXids then check pg_subtrans. If not in either then report error. May need to add info to the abort record to check xtop also. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote: > In testing, it looks like there's still something wrong with the > subtransaction handling. I created a test function to create a large > number of subtransactions: > > CREATE LANGUAGE plpgsql; > CREATE TABLE bar (id int4); > > CREATE OR REPLACE FUNCTION subxids (n integer) RETURNS void LANGUAGE > plpgsql AS $$ > BEGIN > IF n <= 0 THEN RETURN; END IF; > INSERT INTO bar VALUES (n); > PERFORM subxids(n - 1); > RETURN; > EXCEPTION WHEN raise_exception THEN NULL; END; > $$; > > And used that to created 100 nested subtransactions in the primary server: > > SELECT subxids(100); > > > I got this in the standby: > > ... > LOG: REDO @ 0/A000EA8; LSN 0/A000FCC: prev 0/A000E6C; xid 2662; len > 264: Transaction - xid assignment xtop 2599 nsubxids 64; 2600 2601 2602 > 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 > 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 > 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 > 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 > 2659 2660 2661 2662 0 > LOG: extend subtrans xid 2600 page 1 last_page 0 > CONTEXT: rmgr 1 redo xid assignment xtop 2599 nsubxids 64; 2600 2601 > 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 > 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 > 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 > 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 > 2658 2659 2660 2661 2662 0 > TRAP: FailedAssertion("!(((newestXact) >= ((TransactionId) 3)))", File: > "subtrans.c", Line: 299) > LOG: startup process (PID 28594) was terminated by signal 6: Aborted > LOG: terminating any other active server processes > > Apparently an InvalidXid is sneaking into the unreported xid array > collected in the primary. > > I actually bumped into this while I was testing a simpler implementation > of the logic to collect the unreported xids in the primary. As the patch > stands, we keep track of how many of the childXids at each > subtransaction level have already been reported, but it occurs to me > that it would be a lot simpler to populate a separate array of > unreported xids on-the-fly, as xids are assigned. With the simpler > implementation I bumped into another issue (see below), which is why I > wanted to test if it worked without the simplification. The bug seems an off-by-one error and would seem easily corrected in any case; there is no fundamental problem revealed. I prefer your suggested approach, since it avoids that rather complex looking code that did the grovelling thru the nested xact states. I'll get back to you with more on this tomorrow. -- Simon Riggs www.2ndQuadrant.com
Heikki Linnakangas escribió: > Simon Riggs wrote: > > On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote: > >> jjanes=# begin; > >> BEGIN > >> jjanes=# lock table pgbench_history in access exclusive mode; > >> LOCK TABLE > >> jjanes=# select count(*) from pgbench_history; > >> count > >> -------- > >> 519104 > >> (1 row) > >> > >> jjanes=# select count(*) from pgbench_history; > >> count > >> -------- > >> 527814 > >> (1 row) > >> > >> Is this the expected behavior? > > > > By me, yes. WAL replay does not require a table lock to progress. Any > > changes are protected with block-level locks. It does acquire a table > > lock and cancel conflicting queries when it is about to replay something > > that would cause a query to explode, such as dropping a table, as > > explained in docs. > > That is rather surprising. You can't get that result in a normal server, > which I think is enough of a reason to disallow it. If you do LOCK TABLE > ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your > nose. I think the fallout from that argument is that WAL replay should hold table-level locks (otherwise the workaround to this problem is too special-casey). I'm not sure about that. If I really wanted to get consistent results, I'd use a serializable transaction. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Simon Riggs wrote: > > OK, here is the latest version of the Hot Standby patchset. This is > about version 30+ by now, but we should regard this as 0.2.1 > Patch against CVS HEAD (now): clean apply, compile, no known bugs. > > OVERVIEW Anyone who is interested in how the hot standby behaves should read the following excellent PDF Simon produced. It goes into great detail of the slave's read-only transactions and how the standby behaves during continuous slave recovery: > You can download PDF versions of the fine manual is here > http://wiki.postgresql.org/images/0/01/Hot_Standby_main.pdf The function call docs are at: > http://wiki.postgresql.org/images/1/10/Hot_Standby_Recovery_Functions.pdf -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Looking at the way cache invalidations are handled in two-phase transactions, it would be simpler if we store the shared cache invalidation messages in the twophase state file header, like we store deleted relations and subxids. This allows them to be copied to the COMMIT_PREPARED WAL record, so that we don't need treat twophase commits differently than regular ones in xact_redo_commit. As the patch stands, the new xactGetCommittedInvalidationMessages/MakeSharedInvalidMessagesArray mechanism is duplicated functionality with AtPrepare_Inval/-PersistInvalidationMessage - both materialize the pending shared invalidation messages so that they can be written to disk. I did that in my git branch. I wonder if we might have alignment issues with the SharedInvalidationMessages being stored in WAL records, following the subxids. All the data stored in that record have 4-byte alignment at the moment, but if SharedInvalidationMessage ever needs 8-byte alignment, we would have trouble. Probably not worth changing code, it's highly unlikely that SharedInvalidationMessage will ever need 8-byte alignment, but perhaps a comment somewhere would be in order. I note that we don't emit RunningXacts after a shutdown checkpoint. So if recovery starts at a shutdown checkpoint, we don't let read-only backends in until the first online checkpoint. Could we treat a shutdown checkpoint as a snapshot with no transactions running? Or do prepared transactions screw that up? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
The logic in the lock manager to track the number of held AccessExclusiveLocks (with ProcArrayIncrementNumHeldLocks and ProcArrayDecrementNumHeldLocks) seems to be broken. I added an Assertion into ProcArrayDecrementNumHeldLocks: --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -1401,6 +1401,7 @@ ProcArrayIncrementNumHeldLocks(PGPROC *proc)voidProcArrayDecrementNumHeldLocks(PGPROC *proc){ + Assert(proc->numHeldLocks > 0); proc->numHeldLocks--;} This tripped the assertion: postgres=# CREATE TABLE foo (id int4 primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo" server closed the connection unexpectedlyThis probably means the server terminated abnormallybefore or while processing therequest. Making matters worse, the primary server refuses to startup up after that, tripping the assertion again in crash recovery: $ bin/postmaster -D data LOG: database system was interrupted while in recovery at 2009-09-23 11:56:15 EEST HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/32000070 LOG: REDO @ 0/32000070; LSN 0/320000AC: prev 0/32000020; xid 0; len 32: Heap2 - clean: rel 1663/11562/1249; blk 32 remxid 4352 LOG: consistent recovery state reached LOG: REDO @ 0/320000AC; LSN 0/320000CC: prev 0/32000070; xid 0; len 4: XLOG - nextOid: 24600 LOG: REDO @ 0/320000CC; LSN 0/320000F4: prev 0/320000AC; xid 0; len 12: Storage - file create: base/11562/16408 LOG: REDO @ 0/320000F4; LSN 0/3200011C: prev 0/320000CC; xid 4364; len 12: Relation - exclusive relation lock: xid 4364 db 11562 rel 16408 LOG: REDO @ 0/3200011C; LSN 0/320001D8: prev 0/320000F4; xid 4364; len 159: Heap - insert: rel 1663/11562/1259; tid 5/4 ... LOG: REDO @ 0/32004754; LSN 0/32004878: prev 0/320046A8; xid 4364; len 264: Transaction - commit: 2009-09-23 11:55:51.888398+03; 15 inval msgs:catcache id38 catcache id37 catcache id38 catcache id37 catcache id38 catcache id37 catcache id7 catcache id6 catcache id26 smgr relcache smgr relcache smgr relcache TRAP: FailedAssertion("!(proc->numHeldLocks > 0)", File: "procarray.c", Line: 1404) LOG: startup process (PID 27430) was terminated by signal 6: Aborted LOG: aborting startup due to startup process failure I'm sure that's just a simple bug somewhere, but it highlights that we need be careful to avoid putting any extra work into the normal recovery path. Otherwise bugs in hot standby related code can cause crash recovery to fail. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote: > Looking at the way cache invalidations are handled in two-phase > transactions, it would be simpler if we store the shared cache > invalidation messages in the twophase state file header, like we store > deleted relations and subxids. This allows them to be copied to the > COMMIT_PREPARED WAL record, so that we don't need treat twophase commits > differently than regular ones in xact_redo_commit. As the patch stands, > the new > xactGetCommittedInvalidationMessages/MakeSharedInvalidMessagesArray > mechanism is duplicated functionality with > AtPrepare_Inval/-PersistInvalidationMessage - both materialize the > pending shared invalidation messages so that they can be written to > disk. I did that in my git branch. We could, but the prepared transaction path already contains special case code anyway, so we aren't reducing number of test cases required. This looks like a possible area for refactoring, but I don't see the need for pre-factoring. i.e. don't rework before commit, rework after. > I wonder if we might have alignment issues with the > SharedInvalidationMessages being stored in WAL records, following the > subxids. All the data stored in that record have 4-byte alignment at the > moment, but if SharedInvalidationMessage ever needs 8-byte alignment, we > would have trouble. Probably not worth changing code, it's highly > unlikely that SharedInvalidationMessage will ever need 8-byte alignment, > but perhaps a comment somewhere would be in order. It's a possible source of bugs, but there are no issues there AFAICS. The technique of multiple arrays on a WAL record wasn't invented by this patch. > I note that we don't emit RunningXacts after a shutdown checkpoint. So > if recovery starts at a shutdown checkpoint, we don't let read-only > backends in until the first online checkpoint. Could we treat a shutdown > checkpoint as a snapshot with no transactions running? Or do prepared > transactions screw that up? We could, but I see no requirement for starting HS from a backup taken on a shutdown database. It's just another special case to test and since we already have significant number of important test cases I'd say add this later. That seems to have reflected all of your points on this post, though thanks for the comments. I'm keen to reduce complexity in areas that have caused bugs, but for stuff that is working I am tempted to leave alone on such a big patch. Anything we can do to avoid re-testing sections of code and/or reduce the number of tests required is going to increase stability. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote: > it highlights that we > need be careful to avoid putting any extra work into the normal > recovery > path. Otherwise bugs in hot standby related code can cause crash > recovery to fail. Excellent point. I will put in additional protective code there. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote: > seems to be broken Agreed. Patch withdrawn for correction and rework. Nothing serious, but not much point doing further testing to all current issues resolved. Tracking of issues raised and later solved via Wiki. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote: >> I note that we don't emit RunningXacts after a shutdown checkpoint. So >> if recovery starts at a shutdown checkpoint, we don't let read-only >> backends in until the first online checkpoint. Could we treat a shutdown >> checkpoint as a snapshot with no transactions running? Or do prepared >> transactions screw that up? > > We could, but I see no requirement for starting HS from a backup taken > on a shutdown database. It's just another special case to test and since > we already have significant number of important test cases I'd say add > this later. There's also a related issue that if a backend holding AccessExclusiveLock crashes without writing an abort WAL record, the lock is never released in the standby. We handle the expiration of xids at replay of running-xacts records, but AFAICS we don't do that for locks. It shouldn't be much code to clear those states at shutdown checkpoint, just a few lines to call the right functions methinks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Simon Riggs wrote: >> On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote: >>> I note that we don't emit RunningXacts after a shutdown checkpoint. So >>> if recovery starts at a shutdown checkpoint, we don't let read-only >>> backends in until the first online checkpoint. Could we treat a shutdown >>> checkpoint as a snapshot with no transactions running? Or do prepared >>> transactions screw that up? >> We could, but I see no requirement for starting HS from a backup taken >> on a shutdown database. It's just another special case to test and since >> we already have significant number of important test cases I'd say add >> this later. > > There's also a related issue that if a backend holding > AccessExclusiveLock crashes without writing an abort WAL record, the > lock is never released in the standby. We handle the expiration of xids > at replay of running-xacts records, but AFAICS we don't do that for locks. Ah, scratch that, I now see that we do call XactClearRecoveryTransactions() when we see a shutdown checkpoint, which clears all recovery locks. But doesn't that prematurely clear all locks belonging to prepared transactions as well? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 22, 2009 at 10:02 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Heikki Linnakangas escribió: >> Simon Riggs wrote: >> > On Mon, 2009-09-21 at 19:42 -0700, Jeff Janes wrote: >> >> jjanes=# begin; >> >> BEGIN >> >> jjanes=# lock table pgbench_history in access exclusive mode; >> >> LOCK TABLE >> >> jjanes=# select count(*) from pgbench_history; >> >> count >> >> -------- >> >> 519104 >> >> (1 row) >> >> >> >> jjanes=# select count(*) from pgbench_history; >> >> count >> >> -------- >> >> 527814 >> >> (1 row) >> >> >> >> Is this the expected behavior? >> > >> > By me, yes. WAL replay does not require a table lock to progress. Any >> > changes are protected with block-level locks. It does acquire a table >> > lock and cancel conflicting queries when it is about to replay something >> > that would cause a query to explode, such as dropping a table, as >> > explained in docs. >> >> That is rather surprising. You can't get that result in a normal server, >> which I think is enough of a reason to disallow it. If you do LOCK TABLE >> ACCESS EXCLUSIVE, you wouldn't expect the contents to change under your >> nose. Right. I'd rather be denied the lock than have it granted to me but not do the same thing it does on a primary---prevent all changes to the locked table. > I think the fallout from that argument is that WAL replay should hold > table-level locks (otherwise the workaround to this problem is too > special-casey). I'm not sure about that. If I really wanted to get > consistent results, I'd use a serializable transaction. Unfortunately, isolation level "serializable" is not truly serializable. Usually it is good enough, but when it isn't good enough and you need an explicit table lock (a very rare but not nonexistent situation), I think it should either lock the table in the manner it would do on the primary, or throw an error. I think that silently changing the behavior between primary and standby is not a good thing. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > Unfortunately, isolation level "serializable" is not truly > serializable. Usually it is good enough, but when it isn't good > enough and you need an explicit table lock (a very rare but not > nonexistent situation), I think it should either lock the table in the > manner it would do on the primary, or throw an error. I think that > silently changing the behavior between primary and standby is not a > good thing. +1 --- this proposal made me acutely uncomfortable, too. regards, tom lane
Simon Riggs wrote: > On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote: >> seems to be broken > > Agreed. Looking at the relation lock stuff a bit more... When someone tries to acquire an AccessExclusiveLock, but can't get it immediately, we sleep while holding RecoveryInfoLock. That doesn't affect normal queries, but it will in turn block any attempt to get a running-xacts snapshot, and will thus block checkpoint from finishing. It could take a very long time until you get an AccessExclusiveLock on a busy table, so that seems pretty bad. I think we need to think a bit harder about that locking. The problem becomes a lot easier if we accept that it's OK to have a lock included in the running-xacts snapshot and also appear in a XLOG_RELATION_LOCK record later. The standby should handle that gracefully already. If we just remove RecoveryInfoLock, that can happen, but it still won't be possible for a lock to be missed out which is what we really care about. Rather than keep the numHeldLocks counters per-proc in proc array, I think it would be simpler to have a single (or one per lock partition) counter in shared memory in lock.c. It's just an optimization to make it faster to find out that there is no loggable AccessExclusiveLocks in the system, so it really rather belongs into the lock manager. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon, > Patch withdrawn for correction and rework. Nothing serious, but not much > point doing further testing to all current issues resolved. :-( Good thing we went for 4 CFs. Is there a GIT branch of Simon's current working version up somewhere? -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus wrote: >> Patch withdrawn for correction and rework. Nothing serious, but not much >> point doing further testing to all current issues resolved. > > :-( > > Good thing we went for 4 CFs. I think we should try to hammer this in in this commitfest. None of the issues found this far are too serious, nothing that requires major rewrites. > Is there a GIT branch of Simon's current working version up somewhere? I have my git repository at git.postgresql.org, branch 'hs-riggs'. I've pushed a number of small changes there over Simon's patch. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki, > I think we should try to hammer this in in this commitfest. None of the > issues found this far are too serious, nothing that requires major rewrites. It would certainly be valuable to get users testing it starting with Alpha2 instead of waiting 2 months. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Heikki Linnakangas wrote: > The problem becomes a lot easier if we accept that it's OK to have a > lock included in the running-xacts snapshot and also appear in a > XLOG_RELATION_LOCK record later. The standby should handle that > gracefully already. If we just remove RecoveryInfoLock, that can happen, > but it still won't be possible for a lock to be missed out which is what > we really care about. I see the problem with that now. Without the lock, it's possible that the XLOG_RELATION_LOCK WAL record is written before the XLOG_RUNNING_XACTS record. If the lock is not included in the snapshot, we want the lock WAL record to be after the snapshot record. So i guess we'll need the RecoveryInfoLock. But we don't need to hold itacross the wait. I think it's enough to acquire itjust before writing the WAL record in LockAcquire. That will ensure that the WAL record isn't written until the snapshot is completely finished. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Looking at the startup sequence now. I see that you modified ExtendSUBTRANS so that it doesn't wipe out previously set values if it's called with out-of-order xids. I guess that works, although I think it can leave pages unzeroed if it's called with a large enough gap between xids, so that the previous one was on e.g page 10 and the next one on page 12. Page 11 would not be zeroed, AFAICS. Not sure if such big gaps in the xid sequence are possible, but seems so if you have a very large max_connections setting and a lot of subtransactions. Nevertheless, I'd feel better if we kept ExtendSUBTRANS unmodified, and instead arranged things so that ExtendSUBTRANS is always called in xid-order. We can call it from RecordKnownAssignedTransactionIds, which handles the out-of-order problem for other purposes already. I think we have similar problem with clog. ExtendCLOG is currently not called during recovery, so isn't it possible for the problem with subtransaction commits that's described in the comments in StartupCLOG to arise during hot standby? Again, I think we should call ExtendCLOG() in RecordKnownAssignedTransactionIds. RecordKnownAssignedTransactionIds is the hot standby version of GetNewTransactionId(), so to speak. If we go with that, I think we'll need to call StartupSUBTRANS/CLOG earlier in the startup sequence too, when we establish the startup snapshot and let backends in. Thoughts? Other stuff: - I removed the new DBConsistentUpToLSN() function and moved its functionality into XLogNeedsFlush(). Since XLogFlush() updates minRecoveryPoint instead of flushing WAL during recovery, it makes sense for XLogNeedsFlush() to correspondingly check if minRecoveryPoint needs to be updated when in recovery. That's a better definition for the call in bufmgr.c too. - I went ahead with the changes with RecoveryInfoLock and tracking the number of held AccessExclusive locks in lmgr.c instead of proc array. Can we find a better name for "loggable locks"? It means locks that need to be WAL logged when acquired, for hot standby, and "loggable lock" doesn't sound right for that. "Loggable" implies that it can be logged, but it really means that it *must* be logged. Keep an eye on my git repository for latest changes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-09-24 at 13:33 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > The problem becomes a lot easier if we accept that it's OK to have a > > lock included in the running-xacts snapshot and also appear in a > > XLOG_RELATION_LOCK record later. The standby should handle that > > gracefully already. If we just remove RecoveryInfoLock, that can happen, > > but it still won't be possible for a lock to be missed out which is what > > we really care about. > > I see the problem with that now. Without the lock, it's possible that > the XLOG_RELATION_LOCK WAL record is written before the > XLOG_RUNNING_XACTS record. If the lock is not included in the snapshot, > we want the lock WAL record to be after the snapshot record. > > So i guess we'll need the RecoveryInfoLock. But we don't need to hold it > across the wait. I think it's enough to acquire it just before writing > the WAL record in LockAcquire. That will ensure that the WAL record > isn't written until the snapshot is completely finished. I will think some more on that. I remember thinking there was a reason why that wasn't enough, possibly to do with no-wait locks which I remember caused me to change that code. -- Simon Riggs www.2ndQuadrant.com
On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote: > Rather than keep the numHeldLocks counters per-proc in proc array, I > think it would be simpler to have a single (or one per lock partition) > counter in shared memory in lock.c. It's just an optimization to make it > faster to find out that there is no loggable AccessExclusiveLocks in the > system, so it really rather belongs into the lock manager. What lock would protect that value? The whole purpose is to avoid taking the LockMgrLocks and to give something that is accessible by the locks already held by GetRunningTransactionData(). -- Simon Riggs www.2ndQuadrant.com
On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Simon Riggs wrote: > >> On Wed, 2009-09-23 at 11:13 +0300, Heikki Linnakangas wrote: > >>> I note that we don't emit RunningXacts after a shutdown checkpoint. So > >>> if recovery starts at a shutdown checkpoint, we don't let read-only > >>> backends in until the first online checkpoint. Could we treat a shutdown > >>> checkpoint as a snapshot with no transactions running? Or do prepared > >>> transactions screw that up? > >> We could, but I see no requirement for starting HS from a backup taken > >> on a shutdown database. It's just another special case to test and since > >> we already have significant number of important test cases I'd say add > >> this later. > > > > There's also a related issue that if a backend holding > > AccessExclusiveLock crashes without writing an abort WAL record, the > > lock is never released in the standby. We handle the expiration of xids > > at replay of running-xacts records, but AFAICS we don't do that for locks. > > Ah, scratch that, I now see that we do call > XactClearRecoveryTransactions() when we see a shutdown checkpoint, which > clears all recovery locks. But doesn't that prematurely clear all locks > belonging to prepared transactions as well? Much better to read your second post(s). :-) Yes, you have found a(nother) issue. This was the first one that gave me pause to think of the answer. The locks currently aren't tracked as to whether they are 2PC or not, so we would need to store that info also so that we can selectively release locks later. Question: is it possible to do a fast shutdown when we have a prepared transaction? Would it be better to take a different approach there for prepared transactions? It seems strange to write a shutdown checkpoint when the system isn't yet "clean". -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote: > >> Rather than keep the numHeldLocks counters per-proc in proc array, I >> think it would be simpler to have a single (or one per lock partition) >> counter in shared memory in lock.c. It's just an optimization to make it >> faster to find out that there is no loggable AccessExclusiveLocks in the >> system, so it really rather belongs into the lock manager. > > What lock would protect that value? The whole purpose is to avoid taking > the LockMgrLocks and to give something that is accessible by the locks > already held by GetRunningTransactionData(). The lock partition lock (so we really need one counter per partition, a single counter would need additional locking). We're already holding that in LockAcquire/LockRelease when we need to increment/decrement the counter. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-09-25 at 11:49 +0300, Heikki Linnakangas wrote: > Looking at the startup sequence now. > > I see that you modified ExtendSUBTRANS so that it doesn't wipe out > previously set values if it's called with out-of-order xids. I guess > that works, although I think it can leave pages unzeroed if it's called > with a large enough gap between xids, so that the previous one was on > e.g page 10 and the next one on page 12. Page 11 would not be zeroed, > AFAICS. Not sure if such big gaps in the xid sequence are possible, but > seems so if you have a very large max_connections setting and a lot of > subtransactions. Yeh, it happens and I've seen it - which is why the code is there. > Nevertheless, I'd feel better if we kept ExtendSUBTRANS unmodified, and > instead arranged things so that ExtendSUBTRANS is always called in > xid-order. We can call it from RecordKnownAssignedTransactionIds, which > handles the out-of-order problem for other purposes already. > > I think we have similar problem with clog. ExtendCLOG is currently not > called during recovery, so isn't it possible for the problem with > subtransaction commits that's described in the comments in StartupCLOG > to arise during hot standby? Again, I think we should call ExtendCLOG() > in RecordKnownAssignedTransactionIds. RecordKnownAssignedTransactionIds > is the hot standby version of GetNewTransactionId(), so to speak. OK. We record xids in sequence, so it is now a much more natural place to do this. Zeroing them makes them dirty, unfortunately, but that is another issue. RecordKnownAssignedTransactionIds() only works once the snapshot has been initialised. That could leave a gap, so we will need to run it continuously if InHotStandby. Which may have knock-on changes also. > If we go with that, I think we'll need to call StartupSUBTRANS/CLOG > earlier in the startup sequence too, when we establish the startup > snapshot and let backends in. Yes, I think an earlier patch had that, but it turns out that there really isn't anything for those functions to do. Really we could rename those functions EndOfRecoverySUBTRANS and EndOfRecoveryCLOG to illustrate their role better. > - I removed the new DBConsistentUpToLSN() function and moved its > functionality into XLogNeedsFlush(). Since XLogFlush() updates > minRecoveryPoint instead of flushing WAL during recovery, it makes sense > for XLogNeedsFlush() to correspondingly check if minRecoveryPoint needs > to be updated when in recovery. That's a better definition for the call > in bufmgr.c too. > > - I went ahead with the changes with RecoveryInfoLock and tracking the > number of held AccessExclusive locks in lmgr.c instead of proc array. > > Can we find a better name for "loggable locks"? It means locks that need > to be WAL logged when acquired, for hot standby, and "loggable lock" > doesn't sound right for that. "Loggable" implies that it can be logged, > but it really means that it *must* be logged. The distinction of "loggable" is somewhat false since we rely on the fact that only one person is holding it. So we may as well just call em what they are: AccessExclusiveLocks. > Keep an eye on my git repository for latest changes. No, I'm not doing that. If you want to submit changes, please do so to the list or just mention what needs work and I will do it. Having multiple versions of a patch isn't helpful, as I have already said. I have already been burned multiple times by accepting trial code and you yourself have re-written particular parts multiple times. I am very, very grateful for your reviews and detailed suggestions, so this comment is only about coherency and not tripping each other up. (If you want to "editorialize", it needs to be just prior to commit, but making changes to a patch just prior to commit has historically shown to introduce bugs where there weren't any before). There's enough changes already to demand a full re-test, so everything discovered so far plus testing is about 2 weeks work. I will commit things onto git as agreed as I complete coding but that won't imply full testing. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote: >> XactClearRecoveryTransactions() when we see a shutdown checkpoint, which >> clears all recovery locks. But doesn't that prematurely clear all locks >> belonging to prepared transactions as well? > > Much better to read your second post(s). :-) > > Yes, you have found a(nother) issue. This was the first one that gave me > pause to think of the answer. The locks currently aren't tracked as to > whether they are 2PC or not, so we would need to store that info also so > that we can selectively release locks later. > > Question: is it possible to do a fast shutdown when we have a prepared > transaction? Yes. > Would it be better to take a different approach there for > prepared transactions? It seems strange to write a shutdown checkpoint > when the system isn't yet "clean". Hmm, I guess you could define prepared transactions as active backends from the shutdown point of view, and wait for them to finish. I can see one problem, though: Once you issue shutdown, fast or smart, we no longer accept new connections. So you can't connect to issue the ROLLBACK/COMMIT PREPARED anymore. Anyway, it would be a change from the current behavior, so it would be better to cope with prepared transactions in the standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-09-25 at 13:14 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2009-09-23 at 19:07 +0300, Heikki Linnakangas wrote: > > > >> Rather than keep the numHeldLocks counters per-proc in proc array, I > >> think it would be simpler to have a single (or one per lock partition) > >> counter in shared memory in lock.c. It's just an optimization to make it > >> faster to find out that there is no loggable AccessExclusiveLocks in the > >> system, so it really rather belongs into the lock manager. > > > > What lock would protect that value? The whole purpose is to avoid taking > > the LockMgrLocks and to give something that is accessible by the locks > > already held by GetRunningTransactionData(). > > The lock partition lock (so we really need one counter per partition, a > single counter would need additional locking). We're already holding > that in LockAcquire/LockRelease when we need to increment/decrement the > counter. Again: The whole purpose is to avoid taking those locks. Why would we put something behind a lock we are trying to avoid taking? -- Simon Riggs www.2ndQuadrant.com
On Fri, 2009-09-25 at 13:23 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2009-09-23 at 17:45 +0300, Heikki Linnakangas wrote: > >> XactClearRecoveryTransactions() when we see a shutdown checkpoint, which > >> clears all recovery locks. But doesn't that prematurely clear all locks > >> belonging to prepared transactions as well? > > > > Much better to read your second post(s). :-) > > > > Yes, you have found a(nother) issue. This was the first one that gave me > > pause to think of the answer. The locks currently aren't tracked as to > > whether they are 2PC or not, so we would need to store that info also so > > that we can selectively release locks later. > > > > Question: is it possible to do a fast shutdown when we have a prepared > > transaction? > > Yes. > > > Would it be better to take a different approach there for > > prepared transactions? It seems strange to write a shutdown checkpoint > > when the system isn't yet "clean". > > Hmm, I guess you could define prepared transactions as active backends > from the shutdown point of view, and wait for them to finish. I can see > one problem, though: Once you issue shutdown, fast or smart, we no > longer accept new connections. So you can't connect to issue the > ROLLBACK/COMMIT PREPARED anymore. Anyway, it would be a change from the > current behavior, so it would be better to cope with prepared > transactions in the standby. Definitely need to cope with them for Hot Standby. My point was general one to say that behaviour is very non-useful for users with prepared transactions. It just causes manual effort by a DBA each time the system is shutdown. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > Definitely need to cope with them for Hot Standby. My point was general > one to say that behaviour is very non-useful for users with prepared > transactions. It just causes manual effort by a DBA each time the system > is shutdown. The transaction manager is supposed to reconnect and finish any in-doubt transactions, eventually. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
In XidInMVCCSnapshot: > if (snapshot->takenDuringRecovery) > { > /* > * If the snapshot contains full subxact data, the fastest way to check > * things is just to compare the given XID against both subxact XIDs and > * top-level XIDs. If the snapshot overflowed, we have to use pg_subtrans > * to convert a subxact XID to its parent XID, but then we need only look > * at top-level XIDs not subxacts. > */ ... > } > else > { > int32 j; > > /* > * > * In recovery we store all xids in the subxact array because this > * is by far the bigger array and we mostly don't know which xids > * are top-level and which are subxacts. The xip array is empty. > * > * We start by searching subtrans, if we overflowed. > */ ... > } Hang on, isn't this 180 degrees backwards? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-09-25 at 15:50 +0300, Heikki Linnakangas wrote: > Hang on, isn't this 180 degrees backwards? A witty riposte escapes me; yes, the if test is !correct. What I find amazing is that it passed the test where I put it doing make installcheck in an infinite loop for a long time. I guess that means the regression tests hardly touch the concurrency code at all, which now I think about it makes sense but I still find that very worrying. :-( -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > What I find amazing is that it passed the test where I put it doing make > installcheck in an infinite loop for a long time. I guess that means the > regression tests hardly touch the concurrency code at all, which now I > think about it makes sense but I still find that very worrying. :-( Try installcheck-parallel, which should be a bit better. It's probably not yet good enough though because it always runs the same tests concurrently. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Fri, 2009-09-25 at 16:30 -0400, Alvaro Herrera wrote: > Simon Riggs wrote: > > > What I find amazing is that it passed the test where I put it doing make > > installcheck in an infinite loop for a long time. I guess that means the > > regression tests hardly touch the concurrency code at all, which now I > > think about it makes sense but I still find that very worrying. :-( > > Try installcheck-parallel, which should be a bit better. It's probably > not yet good enough though because it always runs the same tests > concurrently. Good tip, thanks. -- Simon Riggs www.2ndQuadrant.com
Alvaro Herrera wrote: > Simon Riggs wrote: > > >> What I find amazing is that it passed the test where I put it doing make >> installcheck in an infinite loop for a long time. I guess that means the >> regression tests hardly touch the concurrency code at all, which now I >> think about it makes sense but I still find that very worrying. :-( >> > > Try installcheck-parallel, which should be a bit better. It's probably > not yet good enough though because it always runs the same tests > concurrently. > > It is also quite easy to set up your own schedule. cheers andrew
Andrew Dunstan wrote: > > Alvaro Herrera wrote: > > >Try installcheck-parallel, which should be a bit better. It's probably > >not yet good enough though because it always runs the same tests > >concurrently. > > It is also quite easy to set up your own schedule. ... except you have to be careful with hidden test dependencies :-( -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
The locking in smgr_redo_commit and smgr_redo_abort doesn't look right. First of all, smgr_redo_abort is not holding XidGenLock and ProcArrayLock while modifying ShmemVariableCache->nextXid and ShmemVariableCache->latestCompletedXid, respectively, like smgr_redo_commit is. Attached patch fixes that. But I think there's a little race condition in there anyway, as they remove the finished xids from known-assigned-xids list by calling ExpireTreeKnownAssignedTransactionIds, and ShmemVariableCache->latestCompletedXid is updated only after that. We're not holding ProcArrayLock between those two steps, like we do during normal transaction commit. I'm not sure what kind of issues that can lead to, but it seems like it can lead to broken snapshots if someone calls GetSnapshotData() between those steps. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com >From 9d6ef0a3eb901b1d2182b3f2efd5c75d52e88cba Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas <heikki@enterprisedb.com> Date: Sun, 27 Sep 2009 14:51:43 +0300 Subject: [PATCH] Make locking in smgr_redo_abort reflect that in smgr_redo_commit. --- src/backend/access/transam/xact.c | 22 ++++++++++++++++------ 1 files changed, 16 insertions(+), 6 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index a696857..92a7c43 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -4502,10 +4502,7 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, /* * Release locks, if any. We do this for both two phase and normal * one phase transactions. In effect we are ignoring the prepare - * phase and just going straight to lock release. This explains - * why the twophase_postcommit_standby_callbacks[] do not invoke - * a special routine to handle locks - that is performed here - * instead. + * phase and just going straight to lock release. */ RelationReleaseRecoveryLockTree(xid, xlrec->nsubxacts, sub_xids); } @@ -4593,12 +4590,25 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid) } /* Make sure nextXid is beyond any XID mentioned in the record */ + /* We don't expect anyone else to modify nextXid, hence we + * don't need to hold a lock while checking this. We still acquire + * the lock to modify it, though. + */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { + LWLockAcquire(XidGenLock, LW_EXCLUSIVE); ShmemVariableCache->nextXid = max_xid; - ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid; - TransactionIdAdvance(ShmemVariableCache->nextXid); + LWLockRelease(XidGenLock); + } + + /* Same here, don't use lock to test, but need one to modify */ + if (TransactionIdFollowsOrEquals(max_xid, + ShmemVariableCache->latestCompletedXid)) + { + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + ShmemVariableCache->latestCompletedXid = max_xid; + LWLockRelease(ProcArrayLock); } /* Make sure files supposed to be dropped are dropped */ -- 1.6.3.3
TransactionIdIsInProgress() doesn't consult the known-assigned-xids structure. That's a problem: in the standby, TransactionIdIsInProgress() can return false for a transaction that is still running in the master. HeapTupleSatisfies* functions can incorrectly set HEAP_XMIN/XMAX_INVALID hint bits for xids that commit later on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 2009-09-27 at 15:35 +0300, Heikki Linnakangas wrote: > TransactionIdIsInProgress() doesn't consult the known-assigned-xids > structure. That's a problem: in the standby, TransactionIdIsInProgress() > can return false for a transaction that is still running in the master. > HeapTupleSatisfies* functions can incorrectly set HEAP_XMIN/XMAX_INVALID > hint bits for xids that commit later on. Agreed. Will fix. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2009-09-27 at 14:59 +0300, Heikki Linnakangas wrote: > The locking in smgr_redo_commit and smgr_redo_abort doesn't look right. > First of all, smgr_redo_abort is not holding XidGenLock and > ProcArrayLock while modifying ShmemVariableCache->nextXid and > ShmemVariableCache->latestCompletedXid, respectively, like > smgr_redo_commit is. Attached patch fixes that. > > But I think there's a little race condition in there anyway, as they > remove the finished xids from known-assigned-xids list by calling > ExpireTreeKnownAssignedTransactionIds, and > ShmemVariableCache->latestCompletedXid is updated only after that. We're > not holding ProcArrayLock between those two steps, like we do during > normal transaction commit. I'm not sure what kind of issues that can > lead to, but it seems like it can lead to broken snapshots if someone > calls GetSnapshotData() between those steps. I confess I didn't know what you were talking about when you wrote this and was expecting you to have a coffee and retract. I realise now you meant xact_redo_commit() rather than smgr_ and it makes sense at last. I've just committed about half of your patch exactly, but not all. I've avoided making the changes to ShmemVariableCache->latestCompletedXid directly and moved the update to ExpireTreeKnownAssignedTransactionIds() which already had access to max_xid while holding ProcArrayLock. Hopefully that resolves your comment about race condition. Also, I noticed that you removed the line TransactionIdAdvance(ShmemVariableCache->nextXid); in xact_redo_abort(). That looks like an error to me, since this follows neither the comment nor how it is coded in xact_redo_commit(). Let me know if there was some other reason for that line removal and I'll make the change again. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote: > It looks like the standby tries to remove XID 4323 from the > known-assigned hash table, but it's not there because it was removed > and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I > guess we should just not throw an error in that case, but is there a > way we could catch that narrow case and still keep the check in > KnownAssignedXidsRemove()? It seems like the check might help catch > other bugs, so I'd rather keep it if possible. Fix attached. -- Simon Riggs www.2ndQuadrant.com
Attachment
On Wed, 2009-09-23 at 12:07 +0300, Heikki Linnakangas wrote: > we need be careful to avoid putting any extra work into the normal > recovery path. Otherwise bugs in hot standby related code can cause > crash recovery to fail. Re-checked code and found a couple of additional places that needed tests if (InHotStandby) -- Simon Riggs www.2ndQuadrant.com
Attachment
Simon Riggs wrote: > On Tue, 2009-09-22 at 12:53 +0300, Heikki Linnakangas wrote: >> It looks like the standby tries to remove XID 4323 from the >> known-assigned hash table, but it's not there because it was removed >> and set in pg_subtrans by an XLOG_XACT_ASSIGNMENT record earlier. I >> guess we should just not throw an error in that case, but is there a >> way we could catch that narrow case and still keep the check in >> KnownAssignedXidsRemove()? It seems like the check might help catch >> other bugs, so I'd rather keep it if possible. > > Fix attached. I fixed that on Friday already, in a slightly different manner. Do you see a problem with that approach? I've made public the version I'm working on. That's the version I'm ultimately going to commit. It would be a lot more helpful if you provided these patches over that version. Otherwise I have to refactor them over that codebase, possibly introducing new bugs. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > I've made public the version I'm working on. That's the version I'm > ultimately going to commit. It would be a lot more helpful if you > provided these patches over that version. Otherwise I have to refactor > them over that codebase, possibly introducing new bugs. Actually, it makes most sense if you push your changes directly to my git repository. I just granted you write access to it. Please do these changes against the version currently there, and push any fixes there directly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2009-10-07 at 23:19 -0400, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > I've made public the version I'm working on. That's the version I'm > > ultimately going to commit. It would be a lot more helpful if you > > provided these patches over that version. Otherwise I have to refactor > > them over that codebase, possibly introducing new bugs. > > Actually, it makes most sense if you push your changes directly to my > git repository. I just granted you write access to it. Please do these > changes against the version currently there, and push any fixes there > directly. OK, that makes sense. Thank you. I would still like you to make a clear statement that the contents of that repository are BSD licenced open source contributions. I have a contractual responsibility to ensure the code I write is licenced in that way. I have already mentioned I'm not looking at it yet for that reason, so I am unaware of any changes not posted to the list. You have posted patches that I have said I don't agree with. My name is going to be on this when it goes in, so I don't think it makes any sense to force that commit to include changes I don't agree with. I cannot prevent you making changes afterwards, nor would I wish to. I'd like you to respond sensibly to comments on those. We should work together on a consensus basis, especially since I know you have not fully tested your changes (either). Your error rate might be lower than mine, but it is non-zero. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > I would still like you to make a clear statement that the contents of > that repository are BSD licenced open source contributions. Ok. All the content in the repository at git://git.postgresql.org/git/users/heikki/postgres.git is released under the same BSD license as PostgreSQL. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-10-08 at 07:33 -0400, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I would still like you to make a clear statement that the contents of > > that repository are BSD licenced open source contributions. > > Ok. All the content in the repository at > git://git.postgresql.org/git/users/heikki/postgres.git is released under > the same BSD license as PostgreSQL. Thank you. I presume that the copyright is novated also. I'll get cracking on some changes. -- Simon Riggs www.2ndQuadrant.com
On Thu, 2009-10-08 at 12:49 +0100, Simon Riggs wrote: > I'll get cracking on some changes. This will probably be next week now, just in case you're wondering when I'll start adding patches. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > You have posted patches that I have said I don't agree with. My name is > going to be on this when it goes in, so I don't think it makes any sense > to force that commit to include changes I don't agree with. I cannot > prevent you making changes afterwards, nor would I wish to. I'd like you > to respond sensibly to comments on those. We should work together on a > consensus basis, especially since I know you have not fully tested your > changes (either). Your error rate might be lower than mine, but it is > non-zero. The commit message and release notes mention might have just Simon's name, or multiple people. The hot patch commit is going to have multiple people involved before it is committed, so if Simon is worried that the patch will have ideas in it he does not agree with, perhaps we can make sure the commit and release note items include Heikki's name as well. Normally if a committer makes signficant changes to a patch, the committer's name is also added to the commmit message, and I suggest we do the same thing here with hot standby. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Oct 9, 2009, at 1:21 PM, Bruce Momjian <bruce@momjian.us> wrote: > Simon Riggs wrote: >> You have posted patches that I have said I don't agree with. My >> name is >> going to be on this when it goes in, so I don't think it makes any >> sense >> to force that commit to include changes I don't agree with. I cannot >> prevent you making changes afterwards, nor would I wish to. I'd >> like you >> to respond sensibly to comments on those. We should work together >> on a >> consensus basis, especially since I know you have not fully tested >> your >> changes (either). Your error rate might be lower than mine, but it is >> non-zero. > > The commit message and release notes mention might have just Simon's > name, or multiple people. > > The hot patch commit is going to have multiple people involved > before it > is committed, so if Simon is worried that the patch will have ideas in > it he does not agree with, perhaps we can make sure the commit and > release note items include Heikki's name as well. Normally if a > committer makes signficant changes to a patch, the committer's name is > also added to the commmit message, and I suggest we do the same thing > here with hot standby. I think this is a weakness of our current style of heavy-weight commits. I don't have a great suggestion for fixing it, though. Even if we move to git, a major feature like this has such a complex development history that I'm queasy about slurping it in unsquashed. But at least for simple features I think that there would be a value in separating the patch author's work from the committer's adjustments. I realize (now) that this would complicate the release note generation process somewhat, based on our current process, and there might be other downsides as well. All the same, I think it has enough value to make it worth thinking about whether there's some way to make it work. ...Robert
Robert Haas wrote: > But at least for simple features I think that there would be a value > in separating the patch author's work from the committer's adjustments. > > That is just going to make life harder for committers. There are plenty of things with my name on them that are not exactly what I submitted. I think that's true of just about everybody. Mostly things changed hae improved, but not always. I don't think we should be too proprietary about patches. As far as I'm concerned, credit goes to the submitter and blame if any to the committer. cheers andrew
On Fri, 2009-10-09 at 14:05 -0400, Andrew Dunstan wrote: > > Robert Haas wrote: > > But at least for simple features I think that there would be a value > > in separating the patch author's work from the committer's adjustments. > > > > > > That is just going to make life harder for committers. > > There are plenty of things with my name on them that are not exactly > what I submitted. I think that's true of just about everybody. Mostly > things changed hae improved, but not always. I don't think we should be > too proprietary about patches. As far as I'm concerned, credit goes to > the submitter and blame if any to the committer. +1 > > cheers > > andrew > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander
Andrew Dunstan wrote: > > > Robert Haas wrote: > > But at least for simple features I think that there would be a value > > in separating the patch author's work from the committer's adjustments. > > > > > > That is just going to make life harder for committers. > > There are plenty of things with my name on them that are not exactly > what I submitted. I think that's true of just about everybody. Mostly > things changed hae improved, but not always. I don't think we should be > too proprietary about patches. As far as I'm concerned, credit goes to > the submitter and blame if any to the committer. Agreed. Simon is right that if only his name is on the commit, there is an assumption that the committer made no changes, or only cosmetic ones. For hot standby, I think the committer is making significant changes (that could lead to bugs) and hence the committer's name should be in the commit. Sometimes we say "adjusted by" committer, but in this case I think Heikki is doing more than adjustmants --- nothing wrong with that --- it should just be documented in the commit message. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-10-09 at 14:05 -0400, Andrew Dunstan wrote: > > Robert Haas wrote: > > But at least for simple features I think that there would be a value > > in separating the patch author's work from the committer's adjustments. > > > > > > That is just going to make life harder for committers. > > There are plenty of things with my name on them that are not exactly > what I submitted. I think that's true of just about everybody. Mostly > things changed hae improved, but not always. I don't think we should be > too proprietary about patches. As far as I'm concerned, credit goes to > the submitter and blame if any to the committer. +1 > > cheers > > andrew > -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering If the world pushes look it in the eye and GRR. Then push back harder. - Salamander