Thread: Hot Standby on git
Just a note to say that the Hot Standby patch is now in a git repository:

  git://git.postgresql.org/git/users/simon/postgres
  Branch name: hot_standby

The complete contents of that repository are BSD-licensed contributions
to the PostgreSQL project. Any further changes to it will be by agreement
here on hackers. From now on, I will be submitting each individual change
as a patch-on-patch to allow people to see and discuss them and to
confirm them as open source contributions. I ask anybody else who is
interested to do the same so that we can work together. All contributions
welcome.

My record of agreed changes is here:
http://wiki.postgresql.org/wiki/Hot_Standby#Remaining_Work_Items

You'll notice that I've already completed 8 changes (10 commits); those
are all fairly minor changes, so they are submitted here as a combined
patch. There are 9 pending changes so far, none of which appear to be
major obstacles to resolve. Many thanks to Heikki for a thorough review,
which identified nearly all of those change requests.

I estimate that making the remaining changes noted on the Wiki and fully
testing them will take at least 2 weeks. Gabriele Bartolini is assisting
in this area, though neither of us is able to work full time on this. We
still have ample time to complete the project in this release.

Many thanks to Magnus and Aidan for helping me resolve my git-wrestling
contest, and apologies for the delay while that bout happened.

--
Simon Riggs
www.2ndQuadrant.com
Attachment
On Sat, Sep 26, 2009 at 5:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Just a note to say that Hot Standby patch is now on git repository
> git://git.postgresql.org/git/users/simon/postgres
> Branch name: hot_standby

Awesome! Thanks for taking the time to get this set up.

> The complete contents of that repository are BSD licenced contributions
> to the PostgreSQL project.

Excellent...

> Any further changes to that will be by agreement here on hackers. From
> now, I will be submitting each individual change as patch-on-patch to
> allow people to see and discuss them and to confirm them as open source
> contributions.

Sounds good.

> I request anybody else interested to do the same to allow
> us to work together. All contributions welcome.
>
> My record of agreed changes is here
> http://wiki.postgresql.org/wiki/Hot_Standby#Remaining_Work_Items
>
> You'll notice that I've already completed 8 changes (10 commits); those
> are all fairly minor changes, so submitted here as a combined patch.
> There are 9 pending changes, so far, none of which appear to be major
> obstacles to resolve. Many thanks to Heikki for a thorough review which
> has identified nearly all of those change requests.
>
> I estimate that making the remaining changes noted on the Wiki and fully
> testing them will take at least 2 weeks. Gabriele Bartolini is assisting
> in this area, though neither of us are able to work full time on this.
> We still have ample time to complete the project in this release.

Sounds like you are making rapid progress. If you think there's something
useful I could do, let me know and I'll take a look.

...Robert
On Sat, 2009-09-26 at 09:29 -0400, Robert Haas wrote:
> > I estimate that making the remaining changes noted on the Wiki and fully
> > testing them will take at least 2 weeks. Gabriele Bartolini is assisting
> > in this area, though neither of us are able to work full time on this.
> > We still have ample time to complete the project in this release.
>
> Sounds like you are making rapid progress.

Only because Heikki has put so much effort into review, and because most
of the remaining issues are coding related rather than theoretical. We
are moving along and are in no way threatened by time.

> If you think there's something useful I could do, let me know and I'll
> take a look.

I feel like I need a better way of unit testing new code. Some of the
code in the patch is there to handle corner cases, so recreating them is
fairly hard. There is a nagging feeling that I am missing some knowledge
here, and I would welcome some insight, or research, into better ways of
doing general-case unit testing.

--
Simon Riggs
www.2ndQuadrant.com
On 09/26/2009 10:04 AM, Simon Riggs wrote:
>> If you think there's something useful I could do, let me know and I'll
>> take a look.
>
> I feel like I need a better way of unit testing new code. Some of the
> code in the patch is to handle corner cases, so recreating them is
> fairly hard. It is a nagging feeling that I am missing some knowledge
> here and would welcome some insight, or research, into better ways of
> doing general case unit testing.

You might try to steal ideas from EasyMock / PowerMock - but I'm not sure
how well the ideas map to C.

Generally it means allowing the functions to be called from a "mock"
environment, where subroutine calls that might be made are stubbed out to
return sample data that simulates your scenario. Object-oriented
languages, where every object provides an interface and most object
methods can be overridden, are better suited to performing this sort of
test.

I rarely see this sort of stuff in FOSS projects, and never that I can
remember in FOSS C projects. It's not easy, though.

I assume you are doing it by changing the code right now - commenting out
lines, replacing them with others, etc.?

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
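To make the idea concrete in plain C, here is a minimal, self-contained
sketch of the stub/mock approach Mark describes; all of the names
(fetch_fn, lookup_greeting, failing_fetch) are hypothetical and not from
the PostgreSQL tree. The code under test takes its dependency through a
function pointer, so a test can substitute a stub that forces the corner
case:

#include <assert.h>

/* Production code: takes its "fetch" routine through a pointer so a
 * test can substitute its own implementation. */
typedef int (*fetch_fn) (const char *key, char *buf, int buflen);

static int
lookup_greeting(fetch_fn fetch, char *out, int outlen)
{
    if (fetch("greeting", out, outlen) < 0)
        return -1;              /* propagate the failure */
    return 0;
}

/* Test code: a mock fetch that simulates the corner case we want. */
static int
failing_fetch(const char *key, char *buf, int buflen)
{
    (void) key; (void) buf; (void) buflen;
    return -1;                  /* pretend the backing store is broken */
}

int
main(void)
{
    char    buf[32];

    /* The hard-to-reproduce corner case is now trivial to trigger. */
    assert(lookup_greeting(failing_fetch, buf, sizeof(buf)) == -1);
    return 0;
}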
On Sat, Sep 26, 2009 at 10:45:17AM -0400, Mark Mielke wrote:
> On 09/26/2009 10:04 AM, Simon Riggs wrote:
> >> If you think there's
> >> something useful I could do, let me know and I'll take a look.
> > I feel like I need a better way of unit testing new code. Some of the
> > code in the patch is to handle corner cases, so recreating them is
> > fairly hard. It is a nagging feeling that I am missing some knowledge
> > here and would welcome some insight, or research, into better ways of
> > doing general case unit testing.
>
> You might try and steal ideas from EasyMock / PowerMock - but not
> sure how well the ideas map to C.
>
> Generally it means allowing the functions to be called from a "mock"
> environment, where subroutine calls that might be called are stubbed
> out to return sample data that would simulate your scenario. Object
> oriented languages that require every object to provide an interface
> where most object methods can be overridden are more ideal for
> performing this sort of test.
>
> I rarely ever see this sort of stuff in FOSS projects, and never
> that I can remember in FOSS C projects. It's not easy, though.
>
> I assume you are doing it through code changing right now.
> Commenting out lines, replacing them with others, etc?
>
> Cheers,
> mark
>
> --
> Mark Mielke <mark@mielke.cc>

There are a variety of projects dedicated to creating C unit test
frameworks. I don't have a lot of experience with them, but I have heard
good things about check and cunit. Here's a link I found with a longer
list of frameworks: http://www.opensourcetesting.org/unit_c.php

--
--Dan
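For reference, a minimal test using the Check framework Dan mentions
might look roughly like this (this assumes libcheck is installed; the
suite, test case, and test names are made up for illustration):

#include <check.h>
#include <stdlib.h>

START_TEST(test_addition)
{
    ck_assert_int_eq(2 + 2, 4);     /* the assertion under test */
}
END_TEST

int
main(void)
{
    Suite   *s = suite_create("example");
    TCase   *tc = tcase_create("core");
    SRunner *sr;
    int      failed;

    tcase_add_test(tc, test_addition);
    suite_add_tcase(s, tc);

    sr = srunner_create(s);
    srunner_run_all(sr, CK_NORMAL); /* prints a pass/fail summary */
    failed = srunner_ntests_failed(sr);
    srunner_free(sr);
    return (failed == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}

On most systems this builds with something along the lines of
gcc test.c $(pkg-config --cflags --libs check).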
> I feel like I need a better way of unit testing new code. Some of the > code in the patch is to handle corner cases, so recreating them is > fairly hard. It is a nagging feeling that I am missing some knowledge > here and would welcome some insight, or research, into better ways of > doing general case unit testing. There's always pgtap. Whenever we find a new corner case, we add it to the development test suite. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On 09/26/2009 02:28 PM, Dan Colish wrote:
> There are a variety of projects dedicated to creating C unit test
> frameworks. I don't have a lot of experience with them, but I have heard
> good things about check and cunit. Here's a link I found with a longer
> list of frameworks. http://www.opensourcetesting.org/unit_c.php

Looking at check and cunit, I don't see what sort of mock function
facility they would provide. One part of unit testing is arranging for
functions to be called, tested, and the results reported on. This can
take you a certain amount of the way. "Pure" functions, for example, that
always generate the same output for the same input parameters, are
perfect for this situation. Perhaps test how a qsort() or bsearch() call
behaves under various scenarios?

Most real-life code gets a little more complicated. For example, what if
we want to simulate a network failure or an "out of disk space"
condition? What if we want to test what happens when the Y2038 date is
reached? This requires either complex test case setup that is difficult
to run reproducibly, or another approach - "mock". It means doing things
like overriding the write() function, making it return successfully N
times and then fail on the (N+1)th call with ENOSPC. It means overriding
the gettimeofday() function to return a time in the future. A major
benefit of this sort of testing is that it should not require source
changes in order to perform the test.

This sort of stuff is a LOT easier to do in OO languages. I see it done
in Java a lot. I can't remember ever having seen it done in C; I think
it's just too hard compared to the value obtained from the effort. The
list above does show a few attempts - CMock sticks out as one example. It
looks more complicated, though: it takes a .h file and generates stubs
for you to fill in? That could be difficult to manage for a large project
with thousands of unit tests or more. OO is easier because you can
override *only* particular methods, and you can safely call the super
method that you override to provide the underlying behaviour in the
success cases.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
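One way to get exactly that kind of fault injection in C without editing
the code under test is the GNU linker's --wrap option. The sketch below
is illustrative only (the counter and failure threshold are invented for
the example): when the test binary is linked with -Wl,--wrap=write, calls
to write() from the linked objects are redirected to __wrap_write(),
while __real_write() still reaches the libc implementation.

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Provided by the linker when built with: -Wl,--wrap=write */
ssize_t __real_write(int fd, const void *buf, size_t count);

static int successful_writes_left = 3;     /* fail on the 4th call */

ssize_t
__wrap_write(int fd, const void *buf, size_t count)
{
    if (successful_writes_left-- <= 0)
    {
        errno = ENOSPC;                     /* simulate "out of disk space" */
        return -1;
    }
    return __real_write(fd, buf, count);    /* normal behaviour until then */
}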
On Sep 26, 2009, at 12:33 PM, Josh Berkus wrote:
> There's always pgtap. Whenever we find a new corner case, we add it to
> the development test suite.

Also, for C TAP, there's libtap
(http://jc.ngo.org.uk/trac-bin/trac.cgi/wiki/LibTap). You can then use
`prove`, which you likely already have on your system, to harness the
test output.

Best,

David
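TAP output is simple enough that a C test can also emit it directly, with
no library at all; a toy sketch (the test names are placeholders):

#include <stdio.h>

static int test_no = 0;

/* Print one TAP result line: "ok N - name" or "not ok N - name". */
static void
ok(int pass, const char *name)
{
    printf("%sok %d - %s\n", pass ? "" : "not ", ++test_no, name);
}

int
main(void)
{
    printf("1..2\n");                       /* the plan: two tests follow */
    ok(2 + 2 == 4, "arithmetic works");
    ok(1 == 1, "placeholder for a recovery corner case");
    return 0;
}

Any TAP harness, prove included, can then consume that output.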
Mark Mielke wrote:
> Most real life code gets a little more complicated. For example,
> what if we want to simulate a network failure or "out of disk space"
> condition? What if we want to test out what happens when the Y2038
> date is reached? This requires either complex test case setup that
> is difficult to run reproducibly, or another approach - "mock". It
> means doing things like overriding the write() method, and making it
> return successful N times, and then failing on the (N+1)th time with
> ENOSPC.

I remember a kernel simulator based on User-Mode Linux called UMLsim,
with which you could do stuff like this. It's dead at this point though,
and I don't know if there's any replacement.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Per Simon's request, for the benefit of the archive, here's all the changes I've done on the patch since he posted the initial version for review for this commitfest as incremental patches. This is extracted from my git repository at git://git.postgresql.org/git/users/heikki/postgres.git. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Heikki Linnakangas wrote: > Per Simon's request, for the benefit of the archive, here's all the > changes I've done on the patch since he posted the initial version for > review for this commitfest as incremental patches. This is extracted > from my git repository at > git://git.postgresql.org/git/users/heikki/postgres.git. Further fixes extracted from above repository attached.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Looking at the changes to StartupMultiXact, you're changing the locking
so that both MultiXactOffsetControlLock and MultiXactMemberControlLock
are acquired first before changing anything. Why? Looking at the other
functions in that file, all others that access both files are happy to
acquire one lock at a time, like StartupMultiXact does without the patch.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Regarding this in InitStandbyDelayTimers:

+	/*
+	 * If replication delay is enormously huge, just treat that as
+	 * zero and work up from there. This prevents us from acting
+	 * foolishly when replaying old log files.
+	 */
+	if (*currentDelay_ms < 0)
+		*currentDelay_ms = 0;
+

So we're treating restoring from an old backup the same as an up-to-date
standby server. If you're restoring from, say, a month-old base backup
with a WAL archive up to the present day, and have max_standby_delay set
to, say, 5 seconds, the server will wait for that 5 seconds on each
conflicting query before killing it. It keeps doing that until it reaches
the point in the archive where the delay is less than INT_MAX/1000
seconds; at that point it switches into "oh my goodness, we've fallen
badly behind, let's try to catch up ASAP and kill any queries that get in
the way" mode.

That's pretty surprising behavior, and not documented either. I propose
we simply remove the above check (fixing the rest of the code so that you
don't hit integer overflows) and always respect max_standby_delay.

BTW, I wonder if we should warn or something if we find that the
timestamps in the archive are in the future? IOW, if either the master's
or the standby's clock is not set correctly.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
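To spell out the arithmetic behind the INT_MAX/1000 flip point, here is a
small self-contained illustration (not code from the patch): once the
standby is more than roughly 24.8 days behind, the delay expressed in
milliseconds no longer fits in a 32-bit int, goes negative on typical
platforms, and the quoted check then clamps it to zero, i.e. "fully
caught up".

#include <limits.h>
#include <stdio.h>

int
main(void)
{
    long long   seconds_behind = 30LL * 24 * 60 * 60;  /* a month-old backup */
    long long   true_delay_ms = seconds_behind * 1000; /* 2,592,000,000 ms */
    int         delay_ms = (int) true_delay_ms;        /* doesn't fit in a
                                                         * 32-bit int: wraps
                                                         * negative on typical
                                                         * platforms */

    printf("true delay %lld ms, stored as int: %d ms\n",
           true_delay_ms, delay_ms);

    if (delay_ms < 0)
        delay_ms = 0;           /* the check under discussion: the standby
                                 * now looks fully caught up */

    printf("clamped delay %d ms; flip point is INT_MAX/1000 = %d seconds\n",
           delay_ms, INT_MAX / 1000);
    return 0;
}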
+	/*
+	 * If our initial RunningXactData had an overflowed snapshot then we
+	 * knew we were missing some subxids from our snapshot. We can use
+	 * this data as an initial snapshot, but we cannot yet mark it valid.
+	 * We know that the missing subxids are equal to or earlier than
+	 * LatestRunningXid. After we initialise we continue to apply changes
+	 * during recovery, so once the oldestRunningXid is later than the
+	 * initLatestRunningXid we can now prove that we no longer have
+	 * missing information and can mark the snapshot as valid.
+	 */
+	if (initRunningXactData && !recoverySnapshotValid)
+	{
+		if (TransactionIdPrecedes(initLatestRunningXid, xlrec->oldestRunningXid)
+		{
+			recoverySnapshotValid = true;
+			elog(trace_recovery(DEBUG2),
+				 "running xact data now proven complete");
+			elog(trace_recovery(DEBUG2),
+				 "recovery snapshots are now enabled");
+		}
+		return;
+	}
+

When GetRunningXactData() calculates latestRunningXid in the master,
which is stored in initLatestRunningXid in the standby, it only looks at
xids and subxids present in the procarray. It doesn't take into account
overflowed subxids. I think we could declare a recovery snapshot "proven
complete" too early because of that.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, 2009-09-30 at 18:45 +0300, Heikki Linnakangas wrote:
> Regarding this in InitStandbyDelayTimers:
> +	/*
> +	 * If replication delay is enormously huge, just treat that as
> +	 * zero and work up from there. This prevents us from acting
> +	 * foolishly when replaying old log files.
> +	 */
> +	if (*currentDelay_ms < 0)
> +		*currentDelay_ms = 0;
> +
>
> So we're treating restoring from an old backup the same as an up-to-date
> standby server. If you're restoring from say a month old base backup
> with WAL archive up to present day, and have max_standby_delay set to
> say 5 seconds, the server will wait for that 5 seconds on each
> conflicting query before killing it. Until it reaches the point in the
> archive where the delay is less than INT_MAX/1000 seconds old: at that
> point it switches into "oh my goodness, we've fallen badly behind, let's
> try to catch up ASAP and kill any queries that get into the way" mode.
> That's pretty surprising behavior, and not documented either. I propose
> we simply remove the above check (fixing the rest of the code so that
> you don't hit integer overflows), and always respect max_standby_delay.

Agreed. I will document the recommendation to set max_standby_delay = 0
if performing an archive recovery (and explain why).

> BTW, I wonder if should warn or something if we find that the timestamps
> in the archive are in the future? IOW, if either the master's or the
> standby's clock is not set correctly.

Something similar was just spotted by a client. You can set a
recovery_target_timestamp that is before the pg_stop_recovery()
timestamp and it doesn't complain. Will fix.

Not sure if I like the sound of a system moaning at me about the clock
settings. Perhaps just once when it starts, when we read the control
file.

--
Simon Riggs
www.2ndQuadrant.com
On Thu, 2009-10-01 at 14:29 +0300, Heikki Linnakangas wrote:
> +	/*
> +	 * If our initial RunningXactData had an overflowed snapshot then we
> +	 * knew we were missing some subxids from our snapshot. We can use
> +	 * this data as an initial snapshot, but we cannot yet mark it valid.
> +	 * We know that the missing subxids are equal to or earlier than
> +	 * LatestRunningXid. After we initialise we continue to apply changes
> +	 * during recovery, so once the oldestRunningXid is later than the
> +	 * initLatestRunningXid we can now prove that we no longer have
> +	 * missing information and can mark the snapshot as valid.
> +	 */
> +	if (initRunningXactData && !recoverySnapshotValid)
> +	{
> +		if (TransactionIdPrecedes(initLatestRunningXid,
> 			xlrec->oldestRunningXid)
> +		{
> +			recoverySnapshotValid = true;
> +			elog(trace_recovery(DEBUG2),
> +				 "running xact data now proven complete");
> +			elog(trace_recovery(DEBUG2),
> +				 "recovery snapshots are now enabled");
> +		}
> +		return;
> +	}
> +
>
> When GetRunningXactData() calculates latestRunningXid in the master,
> which is stored in initLatestRunningXid in the standby, it only looks at
> xids and subxids present in the procarray. It doesn't take into account
> overflowed subxids. I think we could declare a recovery snapshot "proven
> complete" too early because of that.

Hmm, yes. ISTM that I'm still calculating latestRunningXid the old way
while assuming it is calculated the new way. The new way is just to grab
nextXid since we have XidGenLock and do TransactionIdRetreat() on it.

--
Simon Riggs
www.2ndQuadrant.com
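For readers following along, the "new way" Simon describes would look
roughly like the following in-tree fragment (a sketch, not the actual
patch text; it assumes the caller already holds XidGenLock):

	TransactionId	latestRunningXid;

	/*
	 * The latest xid that could possibly be running is the one just
	 * before the next xid to be assigned.
	 */
	latestRunningXid = ShmemVariableCache->nextXid;
	TransactionIdRetreat(latestRunningXid);		/* step back one, skipping
												 * special xids on wraparound */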
On Wed, 2009-09-30 at 09:33 +0300, Heikki Linnakangas wrote:
> Looking at the changes to StartupMultiXact, you're changing the locking
> so that both MultiXactOffsetControlLock and MultiXactMemberControlLock
> are acquire first before changing anything. Why? Looking at the other
> functions in that file, all others that access both files are happy to
> acquire one lock at a time, like StartupMultiXact does without the patch.

I think those changes are just paranoia from early versions of the patch.

We now know for certain that MultiXact isn't used at the time
StartupMultiXact() is called. The same isn't true for StartupClog() and
StartupSubtrans(), which need to cope with concurrent callers.

Will remove the changes and document that nothing touches it when it
runs.

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote:
> On Wed, 2009-09-30 at 09:33 +0300, Heikki Linnakangas wrote:
>
>> Looking at the changes to StartupMultiXact, you're changing the locking
>> so that both MultiXactOffsetControlLock and MultiXactMemberControlLock
>> are acquire first before changing anything. Why? Looking at the other
>> functions in that file, all others that access both files are happy to
>> acquire one lock at a time, like StartupMultiXact does without the patch.
>
> I think those changes are just paranoia from early versions of patch.
>
> We now know for certain that MultiXact isn't used at the time
> StartupMultiXact() is called. The same isn't true for StartupClog() and
> StartupSubtrans() which need to cope with concurrent callers.
>
> Will remove changes and document that nothing touching it when it runs.

Thanks, I reverted that in my working version already. A comment patch is
welcome if you feel it's needed.

Attached is a new batch of changes I've been doing since the last batch.
These are again extracted from my git repository. It includes some of the
add-on patches from your repository; the rest I believe I had already
done myself earlier, or they are not necessary anymore for other reasons.

Could you look into these two TODO items you listed on the wiki page:

- Correct SET default_transaction_read_only and SET transaction_read_only
  (Heikki 21/9 Hackers)
- Shutdown checkpoints must not clear locks for prepared transactions
  (Heikki 23/9)

And if you could, please review the changes I've been doing, just to make
sure I haven't inadvertently introduced new bugs. That has happened
before, as you've rightfully reminded me :-).

There's also the issue that you can't go into hot standby mode after a
shutdown checkpoint. I think that really should be fixed; it's just weird
from a usability point of view if it doesn't work.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Attachment
Simon Riggs wrote: > Hmm, yes. ISTM that I'm still calculating latestRunningXid the old way > while assuming it is calculated the new way. The new way is just to grab > nextXid since we have XidGenLock and do TransactionIdRetreat() on it. Ok, good, that's what I thought too. I'll fix that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2009-10-01 at 18:47 +0300, Heikki Linnakangas wrote:
> Could you look into these two TODO items you listed on the wiki page:

Unless we agree otherwise, if it's listed on the Wiki page then I will
work on it. Maybe not as soon as you might like, but I am working through
the list. 5 new changes pushed just minutes ago, sans full testing.

Yes, will review your changes also.

--
Simon Riggs
www.2ndQuadrant.com
On Thu, 2009-10-01 at 18:48 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > Hmm, yes. ISTM that I'm still calculating latestRunningXid the old way > > while assuming it is calculated the new way. The new way is just to grab > > nextXid since we have XidGenLock and do TransactionIdRetreat() on it. > > Ok, good, that's what I thought too. I'll fix that. OK, not working on that, so go ahead. Thanks. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
> @@ -7061,6 +7061,15 @@ ShutdownXLOG(int code, Datum arg)
>  	else
>  	{
>  		/*
> +		 * Take a snapshot of running transactions and write this to WAL.
> +		 * This allows us to reconstruct the state of running transactions
> +		 * during archive recovery, if required. We do this even if we are
> +		 * not archiving, to allow a cold physical backup of the server to
> +		 * be useful as a read only standby.
> +		 */
> +		GetRunningTransactionData();
> +
> +		/*
>  		 * If archiving is enabled, rotate the last XLOG file so that all the
>  		 * remaining records are archived (postmaster wakes up the archiver
>  		 * process one more time at the end of shutdown). The checkpoint
>

I don't think this will do any good where it's placed. The checkpoint
that follows will have its redo-pointer beyond the running-xacts record,
so WAL replay will never see it.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote: > I will docuemnt the recommendation to set max_standby_delay = 0 if > performing an archive recovery (and explain why). Hmm, not sure if that's such a good piece of advice either. It will mean waiting for queries forever, which probably isn't what you want if you're performing archive recovery. Or maybe it is? Maybe -1? I guess it depends on the situation... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-10-02 at 10:04 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > @@ -7061,6 +7061,15 @@ ShutdownXLOG(int code, Datum arg)
> >  	else
> >  	{
> >  		/*
> > +		 * Take a snapshot of running transactions and write this to WAL.
> > +		 * This allows us to reconstruct the state of running transactions
> > +		 * during archive recovery, if required. We do this even if we are
> > +		 * not archiving, to allow a cold physical backup of the server to
> > +		 * be useful as a read only standby.
> > +		 */
> > +		GetRunningTransactionData();
> > +
> > +		/*
> >  		 * If archiving is enabled, rotate the last XLOG file so that all the
> >  		 * remaining records are archived (postmaster wakes up the archiver
> >  		 * process one more time at the end of shutdown). The checkpoint
> >
>
> I don't think this will do any good where it's placed. The checkpoint
> that follows will have its redo-pointer beyond the running-xacts record,
> so WAL replay will never see it.

Perhaps we need two entries then to cover multiple use cases?

The placement of this was specifically chosen so that it is the last
entry before the log switch, so that the runningxact record would be
archived.

Yes, we also need one after the shutdown checkpoint to cover the case
where the whole data directory is copied after shutdown. The comments
matched the latter case but the position addressed the first case, so it
looks like I was confused as to which case I was addressing.

Have updated code to do both. See what you think. Thanks.

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2009-10-02 at 10:04 +0300, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> @@ -7061,6 +7061,15 @@ ShutdownXLOG(int code, Datum arg)
>>>  	else
>>>  	{
>>>  		/*
>>> +		 * Take a snapshot of running transactions and write this to WAL.
>>> +		 * This allows us to reconstruct the state of running transactions
>>> +		 * during archive recovery, if required. We do this even if we are
>>> +		 * not archiving, to allow a cold physical backup of the server to
>>> +		 * be useful as a read only standby.
>>> +		 */
>>> +		GetRunningTransactionData();
>>> +
>>> +		/*
>>>  		 * If archiving is enabled, rotate the last XLOG file so that all the
>>>  		 * remaining records are archived (postmaster wakes up the archiver
>>>  		 * process one more time at the end of shutdown). The checkpoint
>>>
>> I don't think this will do any good where it's placed. The checkpoint
>> that follows will have its redo-pointer beyond the running-xacts record,
>> so WAL replay will never see it.
>
> Perhaps we need two entries then to cover multiple use cases?
>
> The placement of this was specifically chosen so that it is the last
> entry before the log switch, so that the runningxact record would be
> archived.
>
> Yes, we also need one after the shutdown checkpoint to cover the case
> where the whole data directory is copied after shutdown. The comments
> matched the latter case but the position addressed the first case, so it
> looks like I was confused as to which case I was addressing.
>
> Have updated code to do both. See what you think. Thanks.

It seems dangerous to write a WAL record after the shutdown checkpoint.
It will be overwritten by subsequent startup, which is a recipe for
trouble.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2009-10-02 at 10:32 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I will docuemnt the recommendation to set max_standby_delay = 0 if > > performing an archive recovery (and explain why). > > Hmm, not sure if that's such a good piece of advice either. It will mean > waiting for queries forever, which probably isn't what you want if > you're performing archive recovery. Or maybe it is? Maybe -1? I guess it > depends on the situation... That assumes that the purpose of the archive recovery is more important than running queries. As you say, it would mean always waiting. But the beauty is that you *can* run queries to determine when to stop, so having them cancelled defeats that purpose. -- Simon Riggs www.2ndQuadrant.com
On Fri, 2009-10-02 at 10:43 +0300, Heikki Linnakangas wrote:
> It seems dangerous to write a WAL record after the shutdown checkpoint.
> It will be overwritten by subsequent startup, which is a recipe for trouble.

I've said it's a corner case and not worth spending time on. I'm putting
it in at your request. If it's not correct before and not correct after,
where exactly do you want it?

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote: > On Fri, 2009-10-02 at 10:43 +0300, Heikki Linnakangas wrote: > >> It seems dangerous to write a WAL record after the shutdown checkpoint. >> It will be overwritten by subsequent startup, which is a recipe for trouble. > > I've said its a corner case and not worth spending time on. I'm putting > it in at your request. If it's not correct before and not correct after, > where exactly do you want it? I don't know. Perhaps it should go between the REDO pointer of the shutdown checkpoint and the checkpoint record itself. Or maybe the information should be included in the checkpoint record itself. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-10-02 at 11:26 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2009-10-02 at 10:43 +0300, Heikki Linnakangas wrote:
> >
> >> It seems dangerous to write a WAL record after the shutdown checkpoint.
> >> It will be overwritten by subsequent startup, which is a recipe for trouble.
> >
> > I've said its a corner case and not worth spending time on. I'm putting
> > it in at your request. If it's not correct before and not correct after,
> > where exactly do you want it?
>
> I don't know. Perhaps it should go between the REDO pointer of the
> shutdown checkpoint and the checkpoint record itself.

That seems like the minimally invasive approach and would appear to work
for both cases. Will do.

--
Simon Riggs
www.2ndQuadrant.com
On Fri, 2009-10-02 at 11:26 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2009-10-02 at 10:43 +0300, Heikki Linnakangas wrote:
> >
> >> It seems dangerous to write a WAL record after the shutdown checkpoint.
> >> It will be overwritten by subsequent startup, which is a recipe for trouble.
> >
> > I've said its a corner case and not worth spending time on. I'm putting
> > it in at your request. If it's not correct before and not correct after,
> > where exactly do you want it?
>
> I don't know. Perhaps it should go between the REDO pointer of the
> shutdown checkpoint and the checkpoint record itself. Or maybe the
> information should be included in the checkpoint record itself.

I've implemented this, but it requires us to remove two checks - one at
shutdown and one at startup on a shutdown checkpoint. I'm not happy doing
that and would like to put them back.

I'd rather just skip this for now. It's a minor case anyway, and there's
nothing stopping anyone from writing their own RunningXactData records
with a function, if it is needed. I can add a function for that.

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote: > I'd rather just skip this for now. It's a minor case anyway and there's > nothing stopping writing their own RunningXactData records with a > function, if it is needed. I can add a function for that. That won't help. There's no way to have it in a right place wrt. the shutdown checkpoint if it's triggered by a user-callable function. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-10-02 at 13:52 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I'd rather just skip this for now. It's a minor case anyway and there's > > nothing stopping writing their own RunningXactData records with a > > function, if it is needed. I can add a function for that. > > That won't help. There's no way to have it in a right place wrt. the > shutdown checkpoint if it's triggered by a user-callable function. I notice that you avoid saying "yes, I agree we should remove the two checks". I will add code to make a shutdown checkpoint be a valid starting place for Hot Standby, as long as there are no in-doubt prepared transactions. That way we know there are no xids still running and no locks, without needing to write a record to say so. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > I will add code to make a shutdown checkpoint be a valid starting place > for Hot Standby, as long as there are no in-doubt prepared transactions. > That way we know there are no xids still running and no locks, without > needing to write a record to say so. Ok, I can live with that, and should be dead simple to implement. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-10-02 at 18:04 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I will add code to make a shutdown checkpoint be a valid starting place > > for Hot Standby, as long as there are no in-doubt prepared transactions. > > That way we know there are no xids still running and no locks, without > > needing to write a record to say so. > > Ok, I can live with that, and should be dead simple to implement. First cut changes made, for discussion. -- Simon Riggs www.2ndQuadrant.com
On Mon, 2009-09-28 at 11:25 +0300, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Per Simon's request, for the benefit of the archive, here's all the > > changes I've done on the patch since he posted the initial version for > > review for this commitfest as incremental patches. This is extracted > > from my git repository at > > git://git.postgresql.org/git/users/heikki/postgres.git. > > Further fixes extracted from above repository attached.. I've applied changes on all these patches apart from 0006-... which has some dependencies on earlier work I'm still looking at. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2009-09-27 at 13:57 +0300, Heikki Linnakangas wrote: > Per Simon's request, for the benefit of the archive, here's all the > changes I've done on the patch since he posted the initial version for > review for this commitfest as incremental patches. This is extracted > from my git repository at > git://git.postgresql.org/git/users/heikki/postgres.git. I'm working my way through these changes now. 1, 2, 15 and 16 applied. We discussed briefly your change 0011-Replace-per-proc-counters-of-loggable-locks-with-per.patch. I don't see how that helps at all. The objective of lock counters was to know if we can skip acquiring an LWlock on all lock partitions. This change keeps the lock counters yet acquires the locks we were trying to avoid. This change needs some justification since it is not a bug fix. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > I don't see how that helps at all. The objective of lock counters was to > know if we can skip acquiring an LWlock on all lock partitions. This > change keeps the lock counters yet acquires the locks we were trying to > avoid. This change needs some justification since it is not a bug fix. [ scratches head ... ] Why is hot standby messing with this sort of thing at all? It sounds like a performance optimization that should be considered separately, and *later*. regards, tom lane
On Mon, 2009-10-05 at 10:19 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > I don't see how that helps at all. The objective of lock counters was to
> > know if we can skip acquiring an LWlock on all lock partitions. This
> > change keeps the lock counters yet acquires the locks we were trying to
> > avoid. This change needs some justification since it is not a bug fix.
>
> [ scratches head ... ] Why is hot standby messing with this sort of
> thing at all? It sounds like a performance optimization that should
> be considered separately, and *later*.

Possibly. We have 3 suggested approaches:

* Avoid taking LockPartition locks while we get info for Hot Standby
  during normal running, by means of a ref counting scheme (Simon)
* Take the locks and implement a ref counting scheme (Heikki)
* Take the locks, worry later (Tom)

The middle ground seems pointless to me. I'm happy to go with simple
lock-everything-for-now, but it's pretty clear it's going to be an
annoying performance hit. If we do that, we should put in a parameter to
turn it on/off so that those who will never use Hot Standby can avoid
this completely.

I'll wait for Heikki's thoughts before implementing anything.

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs wrote: > We discussed briefly your change > 0011-Replace-per-proc-counters-of-loggable-locks-with-per.patch. > > I don't see how that helps at all. The objective of lock counters was to > know if we can skip acquiring an LWlock on all lock partitions. This > change keeps the lock counters yet acquires the locks we were trying to > avoid. This change needs some justification since it is not a bug fix. Well, the original code was buggy. But more to the point, it's a lot simpler this way, I don't see any reason why the counters should be per-process, meaning that they need to be exposed in the pgproc structs or procarray.c. The point is to avoid the seqscan of the lock hash table. I presumed that's the expensive part in GetRunningTransactionLocks(). Tom Lane wrote: > [ scratches head ... ] Why is hot standby messing with this sort of > thing at all? It sounds like a performance optimization that should > be considered separately, and *later*. Yeah, I too considered just ripping it out. Simon is worried that locking all the lock partitions and scanning the locks table can take a long time. We do that in the master, while holding both ProcArrayLock and XidGenLock in exclusive mode (hmm, why is shared not enough?), so there is some grounds for worry. OTOH, it's only done once per checkpoint. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Mon, 2009-09-28 at 11:25 +0300, Heikki Linnakangas wrote: >> Heikki Linnakangas wrote: >>> Per Simon's request, for the benefit of the archive, here's all the >>> changes I've done on the patch since he posted the initial version for >>> review for this commitfest as incremental patches. This is extracted >>> from my git repository at >>> git://git.postgresql.org/git/users/heikki/postgres.git. >> Further fixes extracted from above repository attached.. > > I've applied changes on all these patches apart from 0006-... which has > some dependencies on earlier work I'm still looking at. Simon, you don't need to apply those patches. Just review them, and post comments or subsequent patches on top of the repository. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2009-10-05 at 18:30 -0400, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Mon, 2009-09-28 at 11:25 +0300, Heikki Linnakangas wrote: > >> Heikki Linnakangas wrote: > >>> Per Simon's request, for the benefit of the archive, here's all the > >>> changes I've done on the patch since he posted the initial version for > >>> review for this commitfest as incremental patches. This is extracted > >>> from my git repository at > >>> git://git.postgresql.org/git/users/heikki/postgres.git. > >> Further fixes extracted from above repository attached.. > > > > I've applied changes on all these patches apart from 0006-... which has > > some dependencies on earlier work I'm still looking at. > > Simon, you don't need to apply those patches. Just review them, and post > comments or subsequent patches on top of the repository. I've applied them to the repository. -- Simon Riggs www.2ndQuadrant.com
On Tue, 2009-10-06 at 01:10 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > We discussed briefly your change
> > 0011-Replace-per-proc-counters-of-loggable-locks-with-per.patch.
> >
> > I don't see how that helps at all. The objective of lock counters was to
> > know if we can skip acquiring an LWlock on all lock partitions. This
> > change keeps the lock counters yet acquires the locks we were trying to
> > avoid. This change needs some justification since it is not a bug fix.
>
> Well, the original code was buggy.

One bug was found, yes, but the submitted changes are not just a bug fix;
they alter the whole basic meaning of that code.

> But more to the point, it's a lot
> simpler this way, I don't see any reason why the counters should be
> per-process, meaning that they need to be exposed in the pgproc structs
> or procarray.c.

There was a very clear reason for doing it that way. Putting the counters
on the PGPROC structs allows us to check the counts while we are
performing the main sweep of the procarray *and*, if the count is zero,
to *completely avoid* taking the locks. By making the lock counters
private to each backend, the counters themselves do not need to be
protected by a lock when updated, and the fetch can be done atomically,
as we do with xid.

With respect, I have explained this on list at least twice previously, as
well as in code comments.

> The point is to avoid the seqscan of the lock hash table. I presumed
> that's the expensive part in GetRunningTransactionLocks().
>
> Tom Lane wrote:
> > [ scratches head ... ] Why is hot standby messing with this sort of
> > thing at all? It sounds like a performance optimization that should
> > be considered separately, and *later*.
>
> Yeah, I too considered just ripping it out. Simon is worried that
> locking all the lock partitions and scanning the locks table can take a
> long time. We do that in the master, while holding both ProcArrayLock
> and XidGenLock in exclusive mode (hmm, why is shared not enough?), so
> there is some grounds for worry. OTOH, it's only done once per checkpoint.

I could live with ripping it out, but what we have now doesn't make sense
to me.

--
Simon Riggs
www.2ndQuadrant.com
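For readers following the argument, the fast path Simon is defending
looks roughly like the fragment below. This is an illustrative sketch
only: the field name numHeldAELocks is hypothetical and the surrounding
code is simplified, but it shows why per-backend counters let the
snapshot code skip the lock partitions entirely when no backend holds a
loggable lock.

	/*
	 * Each backend keeps a private count of its loggable
	 * (AccessExclusive) locks in its own PGPROC entry.  Only the owning
	 * backend writes it, so no lock is needed to update it, and it can
	 * be read during the procarray sweep we already perform under
	 * ProcArrayLock.
	 */
	int		loggable_locks = 0;
	int		index;

	for (index = 0; index < arrayP->numProcs; index++)
		loggable_locks += arrayP->procs[index]->numHeldAELocks;	/* hypothetical */

	if (loggable_locks == 0)
		return;		/* nothing to log: skip the lock partitions entirely */

	/* Otherwise, fall back to locking and scanning the lock table. */
	for (index = 0; index < NUM_LOCK_PARTITIONS; index++)
		LWLockAcquire(FirstLockMgrLock + index, LW_SHARED);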
On Sun, 2009-09-27 at 13:57 +0300, Heikki Linnakangas wrote:
> Per Simon's request, for the benefit of the archive, here's all the
> changes I've done on the patch since he posted the initial version for
> review for this commitfest as incremental patches. This is extracted
> from my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git.

There were 16 change patches included in this post. I have applied 14 of
them, almost all without any changes to comments or code. I've fixed 2
bugs and made changes where XXX comments were left in the code.

That leaves 0011-locks..., discussed on another part of this thread, and
patch 0005-Include-information..., which I'm discussing here.

It's a huge set of changes, all of which is just refactoring; none of it
is a bug fix of any kind. The refactoring does sound reasonable, but is
really fairly minor. I feel we should defer this change to the future to
allow us to stabilise the current patch. I do understand the need for
refactoring, but if we refactor everything touched by Hot Standby then we
will simply touch more code and trigger more bugs, etc. This is
especially true with prepared transactions, which aren't well understood
and not covered by as many tests.

---
 src/backend/access/transam/twophase.c       |   96 +++++++++++++++++-----------
 src/backend/access/transam/twophase_rmgr.c  |   13 ----
 src/backend/access/transam/xact.c           |   64 +++++++------------
 src/backend/access/transam/xlog.c           |    5 +-
 src/backend/commands/discard.c              |    3 +-
 src/backend/commands/sequence.c             |    3 +
 src/backend/storage/lmgr/lock.c             |    5 +-
 src/backend/tcop/postgres.c                 |    1 +
 src/backend/utils/cache/inval.c             |   90 ++------------------------
 src/include/access/subtrans.h               |    3 -
 src/include/access/twophase.h               |    2 -
 src/include/access/twophase_rmgr.h          |    6 +-
 src/include/access/xact.h                   |    6 --
 src/include/miscadmin.h                     |    7 --
 src/include/storage/proc.h                  |    7 +-
 src/include/utils/inval.h                   |    7 ++
 16 files changed, 108 insertions(+), 210 deletions(-)

--
Simon Riggs
www.2ndQuadrant.com
On Thu, 2009-10-01 at 18:47 +0300, Heikki Linnakangas wrote:
> And if you could please review the changes I've been doing, just to
> make sure I haven't inadvertently introduced new bugs. That has
> happened before, as you've rightfully reminded me :-).

You posted 17 patches here. I've reviewed/applied patches 1, 3, 4, 5, 6,
7, 9, 10, 13, 14, 15. That leaves me with some form of issue on 2, 5, 8,
11, 12, 16 and 17. Sounds like a lot, but out of 41 patches in total to
date, I have as yet unresolved issues with 9.

Patch 0017 has significant changes to it, so I'm posting patches here for
further discussion. The main line of thought is that I agree with the
changes you wanted to make, and I've added a few extra things.

Commit message from the repo:

Apply 0017-Revert-changes-to-subtrans.c-and-slru.c.-Instead-cal.patch but
with heavy modifications to fix a number of bugs and make associated
changes.

First, StartupSubtrans() positioned itself at oldestXid, so that when
later running transactions completed they could find no page for them to
update and crashed. Second, ExtendClog() expected to be able to write WAL
during recovery and so crashed after 32768 xids.

This patch also extends the patch to cover the recently added support for
starting Hot Standby from a shutdown checkpoint, which causes some
refactoring. Various comments reworded, including allowing a lock
overflow to cause a PENDING state, just as we do with subxid overflow.

Another bug was also found, in that failing to make subtrans entries from
the initial snapshot could lead to later abort records hanging because
the topxid was not set. Code is now similar in all code paths.

Sounds like a lot of changes, but they are mostly subtle changes rather
than lengthy ones. It seems highly likely that you'll find an error in my
changes to your changes also, but they do pass initial testing.

--
Simon Riggs
www.2ndQuadrant.com
Attachment
On Fri, 2009-10-02 at 10:32 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > I will docuemnt the recommendation to set max_standby_delay = 0 if > > performing an archive recovery (and explain why). > > Hmm, not sure if that's such a good piece of advice either. It will mean > waiting for queries forever, which probably isn't what you want if > you're performing archive recovery. Or maybe it is? Maybe -1? I guess it > depends on the situation... I've changed that to -1 now, which was what I meant. 0 is likely to be a very wrong setting during archive recovery. -- Simon Riggs www.2ndQuadrant.com
While playing with conflict resolution, I bumped into this:

postgres=# begin ISOLATION LEVEL SERIALIZABLE;
BEGIN
postgres=# SELECT * FROM foo;
 id | data
----+------
 12 |
(1 row)

postgres=# SELECT * FROM foo;
 id | data
----+------
 12 |
(1 row)

postgres=# SELECT * FROM foo;
 id | data
----+------
 12 |
(1 row)

postgres=# SELECT * FROM foo;
 id | data
----+------
 12 |
(1 row)

postgres=# SELECT * FROM foo;
 id | data
----+------
 12 |
(1 row)

postgres=# SELECT * FROM foo;
ERROR: canceling statement due to conflict with recovery
postgres=# SELECT * FROM foo;
 id | data
----+------
 13 |
(1 row)

postgres=# SELECT * FROM foo;
 id | data
----+------
 13 |
(1 row)

postgres=# begin ISOLATION LEVEL SERIALIZABLE;
 id | data
----+------
 13 |
(1 row)

postgres=# SELECT * FROM foo;
BEGIN
postgres=# SELECT * FROM foo;
 id | data
----+------
 13 |
(1 row)

The backend and the frontend seem to go out of sync when a conflict
happens in idle-in-transaction mode.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote: >> Tom Lane wrote: >>> [ scratches head ... ] Why is hot standby messing with this sort of >>> thing at all? It sounds like a performance optimization that should >>> be considered separately, and *later*. >> Yeah, I too considered just ripping it out. Simon is worried that >> locking all the lock partitions and scanning the locks table can take a >> long time. We do that in the master, while holding both ProcArrayLock >> and XidGenLock in exclusive mode (hmm, why is shared not enough?), so >> there is some grounds for worry. OTOH, it's only done once per checkpoint. > > I could live with ripping it out, but what we have now doesn't make > sense, to me. Ok, let's just rip it out for now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com