Thread: cheaper snapshots
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether we could do something involving WAL properties --- the
> current tuple visibility logic was designed before WAL existed, so it's
> not exploiting that resource at all.  I'm imagining that the kernel of a
> snapshot is just a WAL position, ie the end of WAL as of the time you
> take the snapshot (easy to get in O(1) time).  Visibility tests then
> reduce to "did this transaction commit with a WAL record located before
> the specified position?".  You'd need some index datastructure that made
> it reasonably cheap to find out the commit locations of recently
> committed transactions, where "recent" means "back to recentGlobalXmin".
> That seems possibly do-able, though I don't have a concrete design in
> mind.

I was mulling this idea over some more (the same ideas keep floating back to the top...).  I don't think an LSN can actually work, because there's no guarantee that the order in which the WAL records are emitted is the same order in which the effects of the transactions become visible to new snapshots.  For example:

1. Transaction A inserts its commit record, flushes WAL, and begins waiting for sync rep.
2. A moment later, transaction B sets synchronous_commit=off, inserts its commit record, requests a background WAL flush, and removes itself from the ProcArray.
3. Transaction C takes a snapshot.

Sync rep doesn't create this problem; there's a race anyway.  The order of acquisition for WALInsertLock needn't match that for ProcArrayLock.  This has the more-than-slightly-odd characteristic that you could end up with a snapshot on the master that can see A but not B and a snapshot on the slave that can see B but not A.

But having said that an LSN can't work, I don't see why we can't just use a 64-bit counter.  In fact, the predicate locking code already does something much like this, using an SLRU, for serializable transactions only.
In more detail, what I'm imagining is an array with 4 billion entries, one per XID, probably broken up into files of say 16MB each with 2 million entries per file.  Each entry is a 64-bit value.  It is 0 if the XID has not yet started, is still running, or has aborted.  Otherwise, it is the commit sequence number of the transaction.  For reasons I'll explain below, I'm imagining starting the commit sequence number counter at some very large value and having it count down from there.  So the basic operations are:

- To take a snapshot, you just read the counter.
- To commit a transaction which has an XID, you read the counter, stamp all your XIDs with that value, and decrement the counter.
- To find out whether an XID is visible to your snapshot, you look up the XID in the array and get the counter value.  If the value you read is greater than your snapshot value, it's visible.  If it's less, it's not.

Now, is this algorithm any good, and how little locking can we get away with?  It seems to me that if we used an SLRU to store the array, the lock contention would be even worse than it is under our current system, wherein everybody fights over ProcArrayLock.  A system like this is going to involve lots and lots of probes into the array (even if we build a per-backend cache of some kind) and an SLRU will require at least one LWLock acquire and release per probe.  Some kind of locking is pretty much unavoidable, because you have to worry about pages getting evicted from shared memory.  However, what if we used a set of files (like SLRU) but mapped them separately into each backend's address space?  I think this would allow both loads and stores from the array to be done unlocked.

One fly in the ointment is that 8-byte stores are apparently done as two 4-byte stores on some platforms.  But if the counter runs backward, I think even that is OK.  If you happen to read an 8-byte value as it's being written, you'll get 4 bytes of the intended value and 4 bytes of zeros.
The value will therefore appear to be less than what it should be.  However, if the value was in the midst of being written, then it's still in the midst of committing, which means that that XID wasn't going to be visible anyway.  Accidentally reading a smaller value doesn't change the answer.

Committing will require a lock on the counter.  Taking a snapshot can be done unlocked if (1) 8-byte reads are atomic and either (2a) the architecture has strong memory ordering (no store/store reordering) or (2b) you insert a memory fence between stamping the XIDs and decrementing the counter.  Otherwise, taking a snapshot will also require a lock on the counter.

Once a particular XID precedes RecentGlobalXmin, you no longer care about the associated counter value.  You just need to know that it committed; the order no longer matters.  So after a crash, assuming that you have the CLOG bits available, you can just throw away all the array contents and start the counter over at the highest possible value.  And, as RecentGlobalXmin advances, you can prune away (or recycle) files that are no longer needed.  In fact, you could put all of these files on a ramfs, or maybe use shared memory segments for them.

All that having been said, even if I haven't made any severe conceptual errors in the above, I'm not sure how well it will work in practice.  On the plus side, taking a snapshot becomes O(1) rather than O(MaxBackends) - that's good.  On the further plus side, you can check both whether an XID has committed and whether it's visible to your snapshot in a single, atomic action with no lock - that seems really good.  On the minus side, checking an xid against your snapshot now has less locality of reference.  And, rolling over into a new segment of the array is going to require everyone to map it, and maybe cause some disk I/O as a new file gets created.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 3:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> All that having been said, even if I haven't made any severe
> conceptual errors in the above, I'm not sure how well it will work in
> practice.  On the plus side, taking a snapshot becomes O(1) rather
> than O(MaxBackends) - that's good.  On the further plus side, you can
> check both whether an XID has committed and whether it's visible to
> your snapshot in a single, atomic action with no lock - that seems
> really good.  On the minus side, checking an xid against your snapshot
> now has less locality of reference.  And, rolling over into a new
> segment of the array is going to require everyone to map it, and maybe
> cause some disk I/O as a new file gets created.

Sounds like the right set of thoughts to be having.

If you do this, you must cover subtransactions and Hot Standby.  Work in this area takes longer than you think when you take the complexities into account, as you must.

I think you should take the premise of making snapshots O(1) and look at all the ways of doing that.  If you grab too early at a solution you may grab the wrong one.

For example, another approach would be to use a shared hash table.  Snapshots are O(1), committing is O(k), using the snapshot is O(logN).  N can be kept small by regularly pruning the hash table.  If we crash we lose the hash table - no matter.  (I'm not suggesting this is better, just a different approach that should be judged against the others.)

What I'm not sure about, in any of these ideas, is how to derive a snapshot xmin.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Jul 28, 2011, at 04:51, Robert Haas wrote:
> One fly in the ointment is that 8-byte
> stores are apparently done as two 4-byte stores on some platforms.
> But if the counter runs backward, I think even that is OK.  If you
> happen to read an 8-byte value as it's being written, you'll get 4
> bytes of the intended value and 4 bytes of zeros.  The value will
> therefore appear to be less than what it should be.  However, if the
> value was in the midst of being written, then it's still in the midst
> of committing, which means that that XID wasn't going to be visible
> anyway.  Accidentally reading a smaller value doesn't change the
> answer.

That only works if the update of the most-significant word is guaranteed to be visible before the update to the least-significant one.  Which I think you can only enforce if you update the words individually (and use a fence on e.g. PPC32).  Otherwise you're at the mercy of the compiler.

Otherwise, the following might happen (with a 2-byte value instead of an 8-byte one, and the assumption that 1-byte stores are atomic while 2-byte ones aren't -- just to keep the numbers smaller.  The machine is assumed to be big-endian).

The counter is at 0xff00.  Backend 1 decrements, i.e. does
  (1) STORE [counter+1], 0xff
  (2) STORE [counter], 0x00

Backend 2 reads
  (1') LOAD [counter+1]
  (2') LOAD [counter]

If the sequence of events is (1), (1'), (2'), (2), backend 2 will read 0xffff, which is higher than it should be.

But we could simply use a spin-lock to protect the read on machines where we don't know for sure that 64-bit reads and writes are atomic.  That'll only really hurt on machines with 16+ cores or so, and the number of architectures which support that isn't that high anyway.  If we supported spinlock-less operation on SPARC, x86-64, PPC64 and maybe Itanium, would we miss any important one?

best regards,
Florian Pflug
On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wonder whether we could do something involving WAL properties --- the
> > current tuple visibility logic was designed before WAL existed, so it's
> > not exploiting that resource at all.  I'm imagining that the kernel of a
> > snapshot is just a WAL position, ie the end of WAL as of the time you
> > take the snapshot (easy to get in O(1) time).  Visibility tests then
> > reduce to "did this transaction commit with a WAL record located before
> > the specified position?".

Why not just cache a "reference snapshot" near the WAL writer, and maybe also save it at some interval in WAL in case you ever need to restore an old snapshot at some WAL position for things like time travel?

It may be cheaper lock-wise not to update the reference snapshot at each commit, but to keep the latest saved snapshot and a chain of transactions committed / aborted since.  This means that when reading the snapshot you read the current "saved snapshot" and then apply the list of commits.  When moving to a new saved snapshot you really generate a new one and keep the old snapshot + commit chain around for a little while for those who may still be processing it.  Seems like this is something that can be done with no locking.

> > You'd need some index datastructure that made
> > it reasonably cheap to find out the commit locations of recently
> > committed transactions, where "recent" means "back to recentGlobalXmin".
> > That seems possibly do-able, though I don't have a concrete design in
> > mind.

Snapshot + chain of commits is likely as cheap as it gets, unless you additionally cache the commits in a tighter data structure.  This is because you will need them all anyway to compute the difference from the reference snapshot.

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 3:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Sounds like the right set of thoughts to be having.

Thanks.

> If you do this, you must cover subtransactions and Hot Standby.  Work
> in this area takes longer than you think when you take the
> complexities into account, as you must.

Right.  This would replace the KnownAssignedXids stuff (a non-trivial project, I am sure).

> I think you should take the premise of making snapshots O(1) and look
> at all the ways of doing that.  If you grab too early at a solution you
> may grab the wrong one.

Yeah, I'm just brainstorming at this point.  This is, I think, the best of the ideas I've come up with so far, but it's definitely not the only approach.

> For example, another approach would be to use a shared hash table.
> Snapshots are O(1), committing is O(k), using the snapshot is O(logN).
> N can be kept small by regularly pruning the hash table.  If we crash
> we lose the hash table - no matter.  (I'm not suggesting this is
> better, just a different approach that should be judged against the
> others.)

Sorry, I'm having a hard time understanding what you are describing here.  What would the keys and values in this hash table be, and what do k and N refer to?

> What I'm not sure about, in any of these ideas, is how to derive a snapshot xmin.

That is a problem.  If we have to scan the ProcArray every time we take a snapshot just to derive an xmin, we are kind of hosed.  One thought I had is that we might be able to use a sort of sloppy xmin.  In other words, we keep a cached xmin, and have some heuristic where we occasionally try to update it.  A snapshot with a too-old xmin isn't wrong, just possibly slower.  But if xmin is only slightly stale and xids can be tested relatively quickly, it might not matter very much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 4:16 AM, Florian Pflug <fgp@phlo.org> wrote:
> That only works if the update of the most-significant word is guaranteed
> to be visible before the update to the least-significant one.  Which
> I think you can only enforce if you update the words individually
> (and use a fence on e.g. PPC32).  Otherwise you're at the mercy of the
> compiler.
>
> Otherwise, the following might happen (with a 2-byte value instead of an
> 8-byte one, and the assumption that 1-byte stores are atomic while 2-byte
> ones aren't.  Just to keep the numbers smaller.  The machine is assumed
> to be big-endian.)
>
> The counter is at 0xff00
> Backend 1 decrements, i.e. does
> (1) STORE [counter+1], 0xff
> (2) STORE [counter], 0x00
>
> Backend 2 reads
> (1') LOAD [counter+1]
> (2') LOAD [counter]
>
> If the sequence of events is (1), (1'), (2'), (2), backend 2 will read
> 0xffff which is higher than it should be.

You're confusing two different things - I agree that you need a spinlock around reading the counter, unless 8-byte loads and stores are atomic.  What I'm saying can be done without a lock is reading the commit-order value for a given XID.  If that's in the middle of being updated, then the old value was zero, so the scenario you describe can't occur.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> Why not just cache a "reference snapshot" near the WAL writer and maybe
> also save it at some interval in WAL in case you ever need to restore an
> old snapshot at some WAL position for things like time travel.
>
> It may be cheaper lock-wise not to update the reference snapshot at each
> commit, but to keep the latest saved snapshot and a chain of transactions
> committed / aborted since.  This means that when reading the snapshot you
> read the current "saved snapshot" and then apply the list of commits.

Yeah, interesting idea.  I thought about that.  You'd need not only the list of commits but also the list of XIDs that had been published, since the commits have to be removed from the snapshot and the newly-published XIDs have to be added to it (in case they commit later while the snapshot is still in use).

You can imagine doing this with a pair of buffers.  You write a snapshot into the beginning of the first buffer and then write each XID that is published or commits into the next slot in the array.  When the buffer is filled up, the next process that wants to publish an XID or commit scans through the array and constructs a new snapshot that compacts away all the begin/commit pairs and writes it into the second buffer, and all new snapshots are taken there.  When that buffer fills up you flip back to the first one.

Of course, you need some kind of synchronization to make sure that you don't flip back to the first buffer while some laggard is still using it to construct a snapshot that he started taking before you flipped to the second one, but maybe that could be made light-weight enough not to matter.

I am somewhat concerned that this approach might lead to a lot of contention over the snapshot buffers.  In particular, the fact that you have to touch shared cache lines both to advertise a new XID and when it gets committed seems less than ideal.  One thing that's kind of interesting about the "commit sequence number" approach is that - as far as I can tell - it doesn't require new XIDs to be advertised anywhere at all.  You don't have to worry about overflowing the subxids[] array because it goes away altogether.  The commit sequence number itself is going to be a contention hotspot, but at least it's small and fixed-size.

Another concern I have with this approach is - how large do you make the buffers?  If you make them too small, then you're going to have to regenerate the snapshot frequently, which will lead to the same sort of lock contention we have today - no one can commit while the snapshot is being regenerated.  On the other hand, if you make them too big, then deriving a snapshot gets slow.  Maybe there's some way to make it work, but I'm afraid it might end up being yet another arcane thing the tuning of which will become a black art among hackers...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 09:38 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 6:50 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> > Why not just cache a "reference snapshot" near the WAL writer and maybe
> > also save it at some interval in WAL in case you ever need to restore an
> > old snapshot at some WAL position for things like time travel.
> >
> > It may be cheaper lock-wise not to update the reference snapshot at each
> > commit, but to keep the latest saved snapshot and a chain of transactions
> > committed / aborted since.  This means that when reading the snapshot you
> > read the current "saved snapshot" and then apply the list of commits.
>
> Yeah, interesting idea.  I thought about that.  You'd need not only
> the list of commits but also the list of XIDs that had been published,
> since the commits have to be removed from the snapshot and the
> newly-published XIDs have to be added to it (in case they commit later
> while the snapshot is still in use).
>
> You can imagine doing this with a pair of buffers.  You write a
> snapshot into the beginning of the first buffer and then write each
> XID that is published or commits into the next slot in the array.
> When the buffer is filled up, the next process that wants to publish
> an XID or commit scans through the array and constructs a new snapshot
> that compacts away all the begin/commit pairs and writes it into the
> second buffer, and all new snapshots are taken there.  When that
> buffer fills up you flip back to the first one.  Of course, you need
> some kind of synchronization to make sure that you don't flip back to
> the first buffer while some laggard is still using it to construct a
> snapshot that he started taking before you flipped to the second one,
> but maybe that could be made light-weight enough not to matter.
>
> I am somewhat concerned that this approach might lead to a lot of
> contention over the snapshot buffers.

My hope was that this contention would be the same as simply writing the WAL buffers currently, and thus largely hidden by the current WAL writing sync mechanism.  It really covers just the part which writes commit records to WAL, as non-commit WAL records don't participate in snapshot updates.

Writing WAL is already a single point which needs locks or some other kind of synchronization.  This will stay with us at least until we start supporting multiple WAL streams, and even then we will need some synchronisation between those.

> In particular, the fact that
> you have to touch shared cache lines both to advertise a new XID and
> when it gets committed seems less than ideal.

Every commit record writer should do this as part of writing the commit record.  And as you mostly want the latest snapshot anyway, why not just update the snapshot as part of the commit/abort?  Do we need the ability for fast "recent snapshots" at all?

> One thing that's kind
> of interesting about the "commit sequence number" approach is that -
> as far as I can tell - it doesn't require new XIDs to be advertised
> anywhere at all.  You don't have to worry about overflowing the
> subxids[] array because it goes away altogether.
> The commit sequence
> number itself is going to be a contention hotspot, but at least it's
> small and fixed-size.
>
> Another concern I have with this approach is - how large do you make
> the buffers?  If you make them too small, then you're going to have to
> regenerate the snapshot frequently, which will lead to the same sort
> of lock contention we have today - no one can commit while the
> snapshot is being regenerated.  On the other hand, if you make them
> too big, then deriving a snapshot gets slow.  Maybe there's some way
> to make it work, but I'm afraid it might end up being yet another
> arcane thing the tuning of which will become a black art among
> hackers...

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> My hope was that this contention would be the same as simply writing
> the WAL buffers currently, and thus largely hidden by the current WAL
> writing sync mechanism.
>
> It really covers just the part which writes commit records to WAL, as
> non-commit WAL records don't participate in snapshot updates.

I'm confused by this, because I don't think any of this can be done when we insert the commit record into the WAL stream.  It has to be done later, at the time we currently remove ourselves from the ProcArray.  Those things need not happen in the same order, as I noted in my original post.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> My hope was that this contention would be the same as simply writing
>> the WAL buffers currently, and thus largely hidden by the current WAL
>> writing sync mechanism.
>>
>> It really covers just the part which writes commit records to WAL, as
>> non-commit WAL records don't participate in snapshot updates.

> I'm confused by this, because I don't think any of this can be done
> when we insert the commit record into the WAL stream.  It has to be
> done later, at the time we currently remove ourselves from the
> ProcArray.  Those things need not happen in the same order, as I noted
> in my original post.

But should we rethink that?  Your point that hot standby transactions on a slave could see snapshots that were impossible on the parent was disturbing.  Should we look for a way to tie "transaction becomes visible" to its creation of a commit WAL record?  I think the fact that they are not an indivisible operation is an implementation artifact, and not a particularly nice one.

			regards, tom lane
On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:17 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> > My hope was that this contention would be the same as simply writing
> > the WAL buffers currently, and thus largely hidden by the current WAL
> > writing sync mechanism.
> >
> > It really covers just the part which writes commit records to WAL, as
> > non-commit WAL records don't participate in snapshot updates.
>
> I'm confused by this, because I don't think any of this can be done
> when we insert the commit record into the WAL stream.  It has to be
> done later, at the time we currently remove ourselves from the
> ProcArray.  Those things need not happen in the same order, as I noted
> in my original post.

The update to the stored snapshot needs to happen at the moment when the WAL record is considered to be "on stable storage", so the "current snapshot" update presumably can be done by the same process which forces it to stable storage, with the same contention pattern that applies to writing WAL records, no?

If the problem is with a backend which requested an "async commit", then it is free to apply its additional local commit changes from its own memory if the global latest snapshot disagrees with it.

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> On Thu, 2011-07-28 at 10:23 -0400, Robert Haas wrote:
>> I'm confused by this, because I don't think any of this can be done
>> when we insert the commit record into the WAL stream.

> The update to the stored snapshot needs to happen at the moment when the WAL
> record is considered to be "on stable storage", so the "current
> snapshot" update presumably can be done by the same process which forces
> it to stable storage, with the same contention pattern that applies to
> writing WAL records, no?

No.  There is no reason to tie this to fsyncing WAL.  For purposes of other currently-running transactions, the commit can be considered to occur at the instant the commit record is inserted into WAL buffers.  If we crash before that makes it to disk, no problem, because nothing those other transactions did will have made it to disk either.  The advantage of defining it that way is you don't have weirdly different behaviors for sync and async transactions.

			regards, tom lane
On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I'm confused by this, because I don't think any of this can be done
>> when we insert the commit record into the WAL stream.  It has to be
>> done later, at the time we currently remove ourselves from the
>> ProcArray.  Those things need not happen in the same order, as I noted
>> in my original post.
>
> But should we rethink that?  Your point that hot standby transactions on
> a slave could see snapshots that were impossible on the parent was
> disturbing.  Should we look for a way to tie "transaction becomes
> visible" to its creation of a commit WAL record?  I think the fact that
> they are not an indivisible operation is an implementation artifact, and
> not a particularly nice one.

Well, I agree with you that it isn't especially nice, but it seems like a fairly intractable problem.  Currently, the standby has no way of knowing in what order the transactions became visible on the master.  Unless we want to allow only SR and not log shipping, the only way to communicate that information would be to WAL-log it.  Aside from the expense, what do we do if XLogInsert() fails, given that we've already committed?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 10:45 -0400, Tom Lane wrote:
> No.  There is no reason to tie this to fsyncing WAL.  For purposes of
> other currently-running transactions, the commit can be considered to
> occur at the instant the commit record is inserted into WAL buffers.
> If we crash before that makes it to disk, no problem, because nothing
> those other transactions did will have made it to disk either.

Agreed.  Actually figured it out right after pushing send :)

> The
> advantage of defining it that way is you don't have weirdly different
> behaviors for sync and async transactions.

My main point was that we already do synchronization when writing WAL, so why not piggyback on this to also update the latest snapshot.

--
-------
Hannu Krosing
PostgreSQL (Infinite) Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> My main point was that we already do synchronization when writing WAL,
> so why not piggyback on this to also update the latest snapshot.

Well, one problem is that it would break sync rep.  Another problem is that pretty much the last thing I want to do is push more work under WALInsertLock.  Based on the testing I've done so far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock are the main bottlenecks here.  I'm focusing on ProcArrayLock and CLogControlLock right now, but I am pretty well convinced that WALInsertLock is going to be the hardest nut to crack, so putting anything more under there seems like it's going in the wrong direction.  IMHO, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 17:10 +0200, Hannu Krosing wrote:
> My main point was that we already do synchronization when writing WAL,
> so why not piggyback on this to also update the latest snapshot.

So the basic design could be a "sparse snapshot", consisting of

    xmin, xmax, running_txids[numbackends]

where each backend manages its own slot in running_txids - sets a txid when acquiring one and nulls it at commit, possibly advancing xmin if xmin == mytxid.

As an xmin update requires a full scan of running_txids, it is also a good time to update xmax - there is no need to advance xmax when "inserting" your next txid, so you don't need to lock anything at insert time.  The valid xmax is still computed when getting the snapshot.  Hmm, probably no need to store xmin and xmax at all.
It needs some further analysis to figure out whether doing it this way without any locks can produce any relevantly bad snapshots. Maybe you still need one spinlock plus a memcpy of running_txids to local memory to get a snapshot. Also, as the running_txids array is global, it may need to be made even sparser to minimise cache-line collisions; that is a tuning decision between cache conflicts and the speed of the memcpy. > > > -- > ------- > Hannu Krosing > PostgreSQL (Infinite) Scalability and Performance Consultant > PG Admin Book: http://www.2ndQuadrant.com/books/ > >
On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > My main point was, that we already do synchronization when writing wal, > > why not piggyback on this to also update latest snapshot . > > Well, one problem is that it would break sync rep. Can you elaborate - in what way does it "break" sync rep? > Another problem is that pretty much the last thing I want to do is > push more work under WALInsertLock. Based on the testing I've done so > far, it seems like WALInsertLock, ProcArrayLock, and CLogControlLock > are the main bottlenecks here. I'm focusing on ProcArrayLock and > CLogControlLock right now, but I am pretty well convinced that > WALInsertLock is going to be the hardest nut to crack, so putting > anything more under there seems like it's going in the wrong > direction. Probably it is not just WALInsertLock, but the fact that we have just one WAL. It can become a bottleneck once we have a significant number of processors fighting to write into a single WAL. > IMHO, anyway. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> But should we rethink that? Your point that hot standby transactions on >> a slave could see snapshots that were impossible on the parent was >> disturbing. Should we look for a way to tie "transaction becomes >> visible" to its creation of a commit WAL record? I think the fact that >> they are not an indivisible operation is an implementation artifact, and >> not a particularly nice one. > Well, I agree with you that it isn't especially nice, but it seems > like a fairly intractable problem. Currently, the standby has no way > of knowing in what order the transactions became visible on the > master. Right, but if the visibility order were *defined* as the order in which commit records appear in WAL, that problem neatly goes away. It's only because we have the implementation artifact that "set my xid to 0 in the ProcArray" is decoupled from inserting the commit record that there's any difference. regards, tom lane
On Wed, 2011-07-27 at 22:51 -0400, Robert Haas wrote: > On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wonder whether we could do something involving WAL properties --- the > > current tuple visibility logic was designed before WAL existed, so it's > > not exploiting that resource at all. I'm imagining that the kernel of a > > snapshot is just a WAL position, ie the end of WAL as of the time you > > take the snapshot (easy to get in O(1) time). Visibility tests then > > reduce to "did this transaction commit with a WAL record located before > > the specified position?". You'd need some index datastructure that made > > it reasonably cheap to find out the commit locations of recently > > committed transactions, where "recent" means "back to recentGlobalXmin". > > That seems possibly do-able, though I don't have a concrete design in > > mind. > > I was mulling this idea over some more (the same ideas keep floating > back to the top...). I don't think an LSN can actually work, because > there's no guarantee that the order in which the WAL records are > emitted is the same order in which the effects of the transactions > become visible to new snapshots. For example: > > 1. Transaction A inserts its commit record, flushes WAL, and begins > waiting for sync rep. > 2. A moment later, transaction B sets synchronous_commit=off, inserts > its commit record, requests a background WAL flush, and removes itself > from the ProcArray. > 3. Transaction C takes a snapshot. It is transaction A here which is acting badly - it should also remove itself from the ProcArray right after it inserts its commit record, as for everybody else except the client app of transaction A it is committed at this point. It just can't report back to the client before getting confirmation that it has actually been sync-repped (or locally written to stable storage).
At least at the point of consistent snapshots, the right sequence should be:

1) insert the commit record into WAL
2) remove yourself from the ProcArray (or use some other means to declare that your transaction is no longer running)
3) if so configured, wait for the WAL flush to stable storage and/or the Sync Rep confirmation

Based on this, let me suggest a simple snapshot cache mechanism.

A simple snapshot cache mechanism
=================================

Have an array of running transactions, with one slot per backend:

txid running_transactions[max_connections];

There are exactly 3 operations on this array.

1. insert the backend's running transaction id
----------------------------------------------

This is done at the moment of acquiring your transaction id from the system, and is synchronized by the same mechanism as getting the transaction id:

running_transactions[my_backend] = current_transaction_id

2. remove the backend's running transaction id
----------------------------------------------

This is done at the moment of committing or aborting the transaction, again synchronized by the write-commit-record mechanism:

running_transactions[my_backend] = NULL

This should be the first thing done after inserting the WAL commit record.

3. getting a snapshot
---------------------

memcpy() running_transactions to local memory, then construct a snapshot.

It may be that you need to protect all 3 operations with a single spinlock; if so, then I'd propose the same spinlock used when getting your transaction id (and placing the array near where the latest transaction id is stored, so they share a cache line). But it is also possible that you can get logically consistent snapshots by protecting only some ops. For example, if you protect only insert and get-snapshot, then the worst that can happen is that you get a snapshot that is a few commits older than what you'd get with full locking, and that may well be OK for all real uses.
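The three operations above can be sketched in C. This is a toy model of the proposal, not PostgreSQL code: the names (running_transactions, MAX_CONNECTIONS, get_snapshot) are illustrative, and a pthread mutex stands in for the single spinlock being discussed:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define MAX_CONNECTIONS 8
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Shared state: one slot per backend, plus one lock guarding all
 * three operations (the follow-up mail notes that locking only
 * insert and get-snapshot is not enough). */
static TransactionId running_transactions[MAX_CONNECTIONS];
static pthread_mutex_t snap_lock = PTHREAD_MUTEX_INITIALIZER;

/* 1. Record our xid at the moment we acquire it. */
void my_xid_insert(int my_backend, TransactionId xid)
{
    pthread_mutex_lock(&snap_lock);
    running_transactions[my_backend] = xid;
    pthread_mutex_unlock(&snap_lock);
}

/* 2. Clear our slot - first thing after inserting the commit record. */
void my_xid_remove(int my_backend)
{
    pthread_mutex_lock(&snap_lock);
    running_transactions[my_backend] = InvalidTransactionId;
    pthread_mutex_unlock(&snap_lock);
}

/* 3. Snapshot: memcpy the whole array to backend-local memory under
 * the same lock; xmin/xmax can then be derived with no lock held. */
void get_snapshot(TransactionId *dest)
{
    pthread_mutex_lock(&snap_lock);
    memcpy(dest, running_transactions, sizeof(running_transactions));
    pthread_mutex_unlock(&snap_lock);
}
```

The point of the design is that the critical section is a single fixed-size memcpy, rather than the per-entry scan of the ProcArray.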
-- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, 2011-07-28 at 11:57 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> But should we rethink that? Your point that hot standby transactions on > >> a slave could see snapshots that were impossible on the parent was > >> disturbing. Should we look for a way to tie "transaction becomes > >> visible" to its creation of a commit WAL record? I think the fact that > >> they are not an indivisible operation is an implementation artifact, and > >> not a particularly nice one. > > > Well, I agree with you that it isn't especially nice, but it seems > > like a fairly intractable problem. Currently, the standby has no way > > of knowing in what order the transactions became visible on the > > master. > > Right, but if the visibility order were *defined* as the order in which > commit records appear in WAL, that problem neatly goes away. It's only > because we have the implementation artifact that "set my xid to 0 in the > ProcArray" is decoupled from inserting the commit record that there's > any difference. Yes, as I explained in another e-mail, the _only_ one for whom the transaction is not yet committed is the waiting backend itself. For all others it should show as committed the moment the WAL record is written. It's kind of a "local 2-phase commit" thing :) > > regards, tom lane >
On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote: > But it is also possible, that you can get logically consistent snapshots > by protecting only some ops. for example, if you protect only insert and > get snapshot, then the worst that can happen is that you get a snapshot > that is a few commits older than what youd get with full locking and it > may well be ok for all real uses. Thinking more of it, we should lock commit/remove_txid and get_snapshot: having a few more running backends does not make a difference, but seeing commits in the wrong order may. This will cause contention between commit and get_snapshot, but hopefully less than the current ProcArray manipulation, as there is just one simple C array to lock and copy. -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, 2011-07-28 at 18:48 +0200, Hannu Krosing wrote: > On Thu, 2011-07-28 at 18:05 +0200, Hannu Krosing wrote: > > > But it is also possible, that you can get logically consistent snapshots > > by protecting only some ops. for example, if you protect only insert and > > get snapshot, then the worst that can happen is that you get a snapshot > > that is a few commits older than what youd get with full locking and it > > may well be ok for all real uses. > > Thinking more of it, we should lock commit/remove_txid and get_snapshot > > having a few more running backends does not make a difference, but > seeing commits in wrong order may. Sorry, that is not true, as this may advance xmax to include some running transactions which were missed during the memcpy. So we still need some mechanism to either synchronize the copy with both inserts and removes, or make it atomic even in the presence of multiple CPUs. > this will cause contention between commit and get_snapshot, but > hopefully less than current ProcArray manipulation, as there is just one > simple C array to lock and copy. > -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> But should we rethink that? Your point that hot standby transactions on >>> a slave could see snapshots that were impossible on the parent was >>> disturbing. Should we look for a way to tie "transaction becomes >>> visible" to its creation of a commit WAL record? I think the fact that >>> they are not an indivisible operation is an implementation artifact, and >>> not a particularly nice one. > >> Well, I agree with you that it isn't especially nice, but it seems >> like a fairly intractable problem. Currently, the standby has no way >> of knowing in what order the transactions became visible on the >> master. > > Right, but if the visibility order were *defined* as the order in which > commit records appear in WAL, that problem neatly goes away. It's only > because we have the implementation artifact that "set my xid to 0 in the > ProcArray" is decoupled from inserting the commit record that there's > any difference. Hmm, interesting idea. However, consider the scenario where some transactions are using synchronous_commit or synchronous replication, and others are not. If a transaction that needs to wait (either just for WAL flush, or for WAL flush and synchronous replication) inserts its commit record, and then another transaction with synchronous_commit=off comes along and inserts its commit record, the second transaction will have to block until the first transaction is done waiting. We can't make either transaction visible without making both visible, and we certainly can't acknowledge the second transaction to the client until we've made it visible. I'm not going to say that's so horrible we shouldn't even consider it, but it doesn't seem great, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
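The ordering constraint Robert describes can be modeled in a few lines of C. This is a hypothetical sketch, not anything in PostgreSQL: if visibility is *defined* by commit-record order in WAL, the visibility horizon can never advance past the oldest commit record still waiting on flush or sync rep, so an async commit inserted behind it cannot be made visible (or acknowledged) either:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define NO_PENDING UINT64_MAX

/* All commit records at or before this WAL position are visible. */
static XLogRecPtr visible_up_to = 0;

/* Hypothetical bookkeeping: LSN of the oldest commit record still
 * waiting on WAL flush and/or sync rep, or NO_PENDING if none. */
static XLogRecPtr oldest_pending_sync = NO_PENDING;

/* Advance the visibility horizon toward end-of-WAL, but never past a
 * commit record that is still waiting for durability. */
void advance_visibility(XLogRecPtr end_of_wal)
{
    XLogRecPtr limit = end_of_wal;

    if (oldest_pending_sync != NO_PENDING && oldest_pending_sync <= limit)
        limit = oldest_pending_sync - 1;    /* stop just short of it */
    if (limit > visible_up_to)
        visible_up_to = limit;
}

/* A commit with synchronous_commit=off can be acknowledged to its
 * client only once its commit record is visible -- i.e. once every
 * earlier sync commit has finished waiting. */
bool can_ack_commit(XLogRecPtr commit_lsn)
{
    return commit_lsn <= visible_up_to;
}
```

With a sync commit waiting at LSN 100 and an async commit record at LSN 200, can_ack_commit(200) stays false until the sync wait completes, which is exactly the blocking Robert is worried about.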
On Thu, Jul 28, 2011 at 11:36 AM, Hannu Krosing <hannu@krosing.net> wrote: > On Thu, 2011-07-28 at 11:15 -0400, Robert Haas wrote: >> On Thu, Jul 28, 2011 at 11:10 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> > My main point was, that we already do synchronization when writing wal, >> > why not piggyback on this to also update latest snapshot . >> >> Well, one problem is that it would break sync rep. > > Can you elaborate, in what way it "breaks" sync rep ? Well, the point of synchronous replication is that the local machine doesn't see the effects of the transaction until it's been replicated. Therefore, no one can be relying on data that might disappear in the event the system is crushed by a falling meteor. It would be easy, technically speaking, to remove the transaction from the ProcArray and *then* wait for synchronous replication, but that would offer a much weaker guarantee than what the current version provides. We would still guarantee that the commit wouldn't be acknowledged to the client which submitted it until it was replicated, but we would no longer be able to guarantee that no one else relied on data written by the transaction prior to successful replication. For example, consider this series of events: 1. User asks ATM "what is my balance?". ATM inquires of database, which says $500. 2. User deposits a check for $100. ATM does an UPDATE to add $100 to balance and issues a COMMIT. But the master has become disconnected from the synchronous standby, so the sync rep wait hangs. 3. ATM eventually times out and tells user "sorry, i can't complete your transaction right now". 4. User wants to know whether their check got deposited, so they walk into the bank and ask a teller to check their balance. Teller's computer connects to the database and gets $600. User is happy and leaves. 5. Master dies. Failover. 6. User's balance is now back to $500. When the user finds out much later, they say "wtf? you told me before it was $600!".
Right now, when using synchronous replication, this series of events CANNOT HAPPEN. If some other transaction interrogates the state of the database and sees the results of some transaction, it is an ironclad guarantee that the transaction has been replicated. If we start making transactions visible when their WAL record is flushed or - worse - when it's inserted, then those guarantees go away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Thu, Jul 28, 2011 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> But should we rethink that? Your point that hot standby transactions on > >>> a slave could see snapshots that were impossible on the parent was > >>> disturbing. Should we look for a way to tie "transaction becomes > >>> visible" to its creation of a commit WAL record? I think the fact that > >>> they are not an indivisible operation is an implementation artifact, and > >>> not a particularly nice one. > > > >> Well, I agree with you that it isn't especially nice, but it seems > >> like a fairly intractable problem. Currently, the standby has no way > >> of knowing in what order the transactions became visible on the > >> master. > > > > Right, but if the visibility order were *defined* as the order in which > > commit records appear in WAL, that problem neatly goes away. It's only > > because we have the implementation artifact that "set my xid to 0 in the > > ProcArray" is decoupled from inserting the commit record that there's > > any difference. > > Hmm, interesting idea. However, consider the scenario where some > transactions are using synchronous_commit or synchronous replication, > and others are not. If a transaction that needs to wait (either just > for WAL flush, or for WAL flush and synchronous replication) inserts > its commit record, and then another transaction with > synchronous_commit=off comes along and inserts its commit record, the > second transaction will have to block until the first transaction is > done waiting. What is the current behavior when synchronous replication fails (say the slave breaks down) - will the transaction be rolled back at some point, or will it wait indefinitely, that is, until a new slave is installed?
Or will the sync rep transaction commit when archive_command returns true after copying the WAL segment containing this commit? > We can't make either transaction visible without making > both visible, and we certainly can't acknowledge the second > transaction to the client until we've made it visible. I'm not going > to say that's so horrible we shouldn't even consider it, but it > doesn't seem great, either. Maybe this is why other databases don't offer per-backend async commit? -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@krosing.net> writes: > So the basic design could be "a sparse snapshot", consisting of 'xmin, > xmax, running_txids[numbackends] where each backend manages its own slot > in running_txids - sets a txid when aquiring one and nulls it at commit, > possibly advancing xmin if xmin==mytxid. How is that different from what we're doing now? Basically, what you're describing is pulling the xids out of the ProcArray and moving them into a separate data structure. That could be a win I guess if non-snapshot-related reasons to take ProcArrayLock represent enough of the contention to be worth separating out, but I suspect they don't. In particular, the data structure you describe above *cannot* be run lock-free, because it doesn't provide any consistency guarantees without a lock. You need everyone to have the same ideas about commit order, and random backends independently changing array elements without locks won't guarantee that. regards, tom lane
On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote: > On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: > > > Hmm, interesting idea. However, consider the scenario where some > > transactions are using synchronous_commit or synchronous replication, > > and others are not. If a transaction that needs to wait (either just > > for WAL flush, or for WAL flush and synchronous replication) inserts > > its commit record, and then another transaction with > > synchronous_commit=off comes along and inserts its commit record, the > > second transaction will have to block until the first transaction is > > done waiting. > > What is the current behavior when the synchronous replication fails (say > the slave breaks down) - will the transaction be rolled back at some > point or will it wait indefinitely , that is until a new slave is > installed ? More importantly, if the master crashes after the commit is written to WAL, will the transaction be rolled back after recovery, based on the fact that no confirmation from the synchronous slave was received? > Or will the sync rep transaction commit when archive_command returns > true after copying the WAL segment containing this commit ? > > > We can't make either transaction visible without making > > both visible, and we certainly can't acknowledge the second > > transaction to the client until we've made it visible. I'm not going > > to say that's so horrible we shouldn't even consider it, but it > > doesn't seem great, either. > > Maybe this is why other databases don't offer per backend async commit ? > -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@2ndQuadrant.com> writes: > On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: >> We can't make either transaction visible without making >> both visible, and we certainly can't acknowledge the second >> transaction to the client until we've made it visible. I'm not going >> to say that's so horrible we shouldn't even consider it, but it >> doesn't seem great, either. > Maybe this is why other databases don't offer per backend async commit ? Yeah, I've always thought that feature wasn't as simple as it appeared. It got in only because it was claimed to be cost-free, and it's now obvious that it isn't. regards, tom lane
On Thu, 2011-07-28 at 15:42 -0400, Tom Lane wrote: > Hannu Krosing <hannu@2ndQuadrant.com> writes: > > On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: > >> We can't make either transaction visible without making > >> both visible, and we certainly can't acknowledge the second > >> transaction to the client until we've made it visible. I'm not going > >> to say that's so horrible we shouldn't even consider it, but it > >> doesn't seem great, either. > > > Maybe this is why other databases don't offer per backend async commit ? > > Yeah, I've always thought that feature wasn't as simple as it appeared. > It got in only because it was claimed to be cost-free, and it's now > obvious that it isn't. I still think it is cost-free if you get the semantics of the COMMIT contract right. (Of course it is not cost-free as in not wasting developers' time in discussions ;) ) I'm still with you in claiming that a transaction should be visible to other backends as committed as soon as the WAL record is inserted. The main thing to keep in mind is that getting back a positive commit confirmation really means (depending on various sync settings) that your transaction is on stable storage. BUT, _not_ getting back confirmation on commit does not guarantee that it is not committed, just that you need to check. It may well be that it was committed, written to stable storage _and_ also sync-repped, but then the confirmation did not come back to you due to some network outage. Or your client computer crashed. Or your child spilled black paint over the monitor. Or a thousand other reasons. Async commit has the contract that you are ready to check the few latest commits after a crash. But I still think that it is the right semantics to make your commit visible to others, even before you have gotten back the confirmation yourself. ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, 2011-07-28 at 15:38 -0400, Tom Lane wrote: > Hannu Krosing <hannu@krosing.net> writes: > > So the basic design could be "a sparse snapshot", consisting of 'xmin, > > xmax, running_txids[numbackends] where each backend manages its own slot > > in running_txids - sets a txid when aquiring one and nulls it at commit, > > possibly advancing xmin if xmin==mytxid. > > How is that different from what we're doing now? Basically, what you're > describing is pulling the xids out of the ProcArray and moving them into > a separate data structure. That could be a win I guess if non-snapshot- > related reasons to take ProcArrayLock represent enough of the contention > to be worth separating out, but I suspect they don't. The idea was to make the txid array small enough to be able to memcpy it to backend-local memory fast. But I agree it takes testing to see whether it is an overall win. > In particular, > the data structure you describe above *cannot* be run lock-free, because > it doesn't provide any consistency guarantees without a lock. You need > everyone to have the same ideas about commit order, and random backends > independently changing array elements without locks won't guarantee > that. > > regards, tom lane > -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
Hannu Krosing <hannu@2ndQuadrant.com> wrote: > but I still think that it is right semantics to make your commit > visible to others, even before you have gotten back the > confirmation yourself. Possibly. That combined with building snapshots based on the order of WAL entries of commit records certainly has several appealing aspects. It is hard to get over the fact that you lose an existing guarantee, though: right now, if you have one synchronous replica, you can never see a transaction's work on the master and then *not* see it on the slave -- the slave always has first visibility. I don't see how such a guarantee can exist in *either* direction with the semantics you describe. After seeing a transaction's work on one system it would always be unknown whether it was visible on the other. There are situations where that is OK as long as each copy has a sane order of visibility, but there are situations where losing that guarantee might matter. On the bright side, it means that transactions would become visible on the replica in the same order as on the master, and that blocking would be reduced. -Kevin
On Thu, Jul 28, 2011 at 3:32 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> Hmm, interesting idea. However, consider the scenario where some >> transactions are using synchronous_commit or synchronous replication, >> and others are not. If a transaction that needs to wait (either just >> for WAL flush, or for WAL flush and synchronous replication) inserts >> its commit record, and then another transaction with >> synchronous_commit=off comes along and inserts its commit record, the >> second transaction will have to block until the first transaction is >> done waiting. > > What is the current behavior when the synchronous replication fails (say > the slave breaks down) - will the transaction be rolled back at some > point or will it wait indefinitely , that is until a new slave is > installed ? It will wait forever, unless you shut down the database or hit ^C. >> We can't make either transaction visible without making >> both visible, and we certainly can't acknowledge the second >> transaction to the client until we've made it visible. I'm not going >> to say that's so horrible we shouldn't even consider it, but it >> doesn't seem great, either. > > Maybe this is why other databases don't offer per backend async commit ? Yeah, possibly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote: >> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: >> >> > Hmm, interesting idea. However, consider the scenario where some >> > transactions are using synchronous_commit or synchronous replication, >> > and others are not. If a transaction that needs to wait (either just >> > for WAL flush, or for WAL flush and synchronous replication) inserts >> > its commit record, and then another transaction with >> > synchronous_commit=off comes along and inserts its commit record, the >> > second transaction will have to block until the first transaction is >> > done waiting. >> >> What is the current behavior when the synchronous replication fails (say >> the slave breaks down) - will the transaction be rolled back at some >> point or will it wait indefinitely , that is until a new slave is >> installed ? > > More importantly, if the master crashes after the commit is written to > WAL, will the transaction be rolled back after recovery based on the > fact that no confirmation from synchronous slave is received ? No. You can't roll back a transaction once it's committed - ever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 4:12 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Hannu Krosing <hannu@2ndQuadrant.com> wrote: >> but I still think that it is right semantics to make your commit >> visible to others, even before you have gotten back the >> confirmation yourself. > > Possibly. That combined with building snapshots based on the order > of WAL entries of commit records certainly has several appealing > aspects. It is hard to get over the fact that you lose an existing > guarantee, though: right now, if you have one synchronous replica, > you can never see a transaction's work on the master and then *not* > see it on the slave -- the slave always has first visibility. I > don't see how such a guarantee can exist in *either* direction with > the semantics you describe. After seeing a transaction's work on > one system it would always be unknown whether it was visible on the > other. There are situations where that is OK as long as each copy > has a sane order of visibility, but there are situations where > losing that guarantee might matter. > > On the bright side, it means that transactions would become visible > on the replica in the same order as on the master, and that blocking > would be reduced. Having transactions become visible in the same order on the master and the standby is very appealing, but I'm pretty well convinced that allowing commits to become visible before they've been durably committed is throwing the "D" in ACID out the window. If synchronous_commit is off, sure, but otherwise... ...Robert
On Thu, 2011-07-28 at 16:20 -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 3:40 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > On Thu, 2011-07-28 at 21:32 +0200, Hannu Krosing wrote: > >> On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: > >> > >> > Hmm, interesting idea. However, consider the scenario where some > >> > transactions are using synchronous_commit or synchronous replication, > >> > and others are not. If a transaction that needs to wait (either just > >> > for WAL flush, or for WAL flush and synchronous replication) inserts > >> > its commit record, and then another transaction with > >> > synchronous_commit=off comes along and inserts its commit record, the > >> > second transaction will have to block until the first transaction is > >> > done waiting. > >> > >> What is the current behavior when the synchronous replication fails (say > >> the slave breaks down) - will the transaction be rolled back at some > >> point or will it wait indefinitely , that is until a new slave is > >> installed ? > > > > More importantly, if the master crashes after the commit is written to > > WAL, will the transaction be rolled back after recovery based on the > > fact that no confirmation from synchronous slave is received ? > > No. You can't roll back a transaction once it's committed - ever. So in the case of a stuck slave, the syncrep transaction is committed after a crash, but is not committed before the crash happens? Or will the replay process get stuck again during recovery? > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing <hannu@krosing.net> wrote: > so in case of stuck slave the syncrep transcation is committed after > crash, but is not committed before the crash happens ? Yep. > ow will the replay process get stuc gaian during recovery ? Nope. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > Having transactions become visible in the same order on the master > and the standby is very appealing, but I'm pretty well convinced > that allowing commits to become visible before they've been > durably committed is throwing the "D" an ACID out the window. If > synchronous_commit is off, sure, but otherwise... It has been durably committed on the master, but not on the supposedly synchronous copy; so it's not so much throwing out the "D" in "ACID" as throwing out the "synchronous" in "synchronous replication". :-( Unless I'm missing something we have a choice to make -- I see four possibilities (already mentioned on this thread, I think): (1) Transactions are visible on the master which won't necessarily be there if a meteor takes out the master and you need to resume operations on the replica. (2) An asynchronous commit must block behind any pending synchronous commits if synchronous replication is in use. (3) Transactions become visible on the replica in a different order than they became visible on the master. (4) We communicate acceptable snapshots to the replica to make the order of visibility match the master even when that doesn't match the order that transactions returned from commit. I don't see how we can accept (1) and call it synchronous replication. I'm pretty dubious about (3), because we don't even have Snapshot Isolation on the replica, really. Is (3) where we're currently at? An advantage of (4) is that on the replica we would get the same SI behavior at Repeatable Read that exists on the master, and we could even use the same mechanism for SSI to provide Serializable isolation on the replica. I (predictably) like (4) -- even though it's a lot of work.... -Kevin
On Thu, 2011-07-28 at 14:27 -0400, Robert Haas wrote: > > Right, but if the visibility order were *defined* as the order in which > > commit records appear in WAL, that problem neatly goes away. It's only > > because we have the implementation artifact that "set my xid to 0 in the > > ProcArray" is decoupled from inserting the commit record that there's > > any difference. > > Hmm, interesting idea. However, consider the scenario where some > transactions are using synchronous_commit or synchronous replication, > and others are not. If a transaction that needs to wait (either just > for WAL flush, or for WAL flush and synchronous replication) inserts > its commit record, and then another transaction with > synchronous_commit=off comes along and inserts its commit record, the > second transaction will have to block until the first transaction is > done waiting. We can't make either transaction visible without making > both visible, and we certainly can't acknowledge the second > transaction to the client until we've made it visible. I'm not going > to say that's so horrible we shouldn't even consider it, but it > doesn't seem great, either. I'm trying to follow along here. Wouldn't the same issue exist if one transaction is waiting for sync rep (synchronous_commit=on), and another is waiting for just a WAL flush (synchronous_commit=local)? I don't think that synchronous_commit=off is required. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> wrote: > Wouldn't the same issue exist if one transaction is waiting for > sync rep (synchronous_commit=on), and another is waiting for just > a WAL flush (synchronous_commit=local)? I don't think that a > synchronous_commit=off is required. I think you're right -- basically, to make visibility atomic with commit and allow a fast snapshot build based on that order, any new commit request would need to block behind any pending request, regardless of that setting. At least, no way around that is apparent to me. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > to make visibility atomic with commit I meant: to make visibility atomic with WAL-write of the commit record -Kevin
----- Quote from Hannu Krosing (hannu@2ndQuadrant.com), on 28.07.2011 at 22:40 ----- >> Maybe this is why other databases don't offer per backend async commit ? Isn't Oracle's COMMIT WRITE NOWAIT; basically the same - ad hoc async commit? Though their idea of a backend does not map exactly to PostgreSQL's. The closest thing is per-session async commit: ALTER SESSION SET COMMIT_WRITE='NOWAIT'; Best regards -- Luben Karavelov
On Thu, Jul 28, 2011 at 4:54 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> Having transactions become visible in the same order on the master >> and the standby is very appealing, but I'm pretty well convinced >> that allowing commits to become visible before they've been >> durably committed is throwing the "D" in ACID out the window. If >> synchronous_commit is off, sure, but otherwise... > > It has been durably committed on the master, but not on the > supposedly synchronous copy; so it's not so much throwing out the "D" > in "ACID" as throwing out the "synchronous" in "synchronous > replication". :-( Well, depends. Currently, the sequence of events is: 1. Insert commit record. 2. Flush commit record, if synchronous_commit in {local, on}. 3. Wait for synchronous replication, if synchronous_commit = on and synchronous_standby_names is non-empty. 4. Make transaction visible. If you move (4) before (3), you're throwing out the synchronous in synchronous replication. If you move (4) before (2), you're throwing out the D in ACID. > Unless I'm missing something we have a choice to make -- I see four > possibilities (already mentioned on this thread, I think): > > (1) Transactions are visible on the master which won't necessarily > be there if a meteor takes out the master and you need to resume > operations on the replica. > > (2) An asynchronous commit must block behind any pending > synchronous commits if synchronous replication is in use. Well, again, there are three levels: (A) synchronous_commit=off. No waiting! (B) synchronous_commit=local transactions, and synchronous_commit=on transactions when sync rep is not in use. Wait for xlog flush. (C) synchronous_commit=on transactions when sync rep IS in use. Wait for xlog flush and replication. Under your option #2, if a type-A transaction commits after a type-B transaction, it will need to wait for the type-B transaction's xlog flush.
If a type-A transaction commits after a type-C transaction, it will need to wait for the type-C transaction to flush xlog and replicate. And if a type-B transaction commits after a type-C transaction, there's no additional waiting for xlog flush, because the type-B transaction would have to wait for that anyway. But it will also have to wait for the preceding type-C transaction to replicate. So basically, you can't be more asynchronous than the guy in front of you. Aside from the fact that this behavior isn't too hot from a user perspective, it might lead to some pretty complicated locking. Every time a transaction finishes xlog flush or sync rep, it's got to go release the transactions that piled up behind it - but not too many, just up to the next one that still needs to wait on some higher LSN. > (3) Transactions become visible on the replica in a different order > than they became visible on the master. > > (4) We communicate acceptable snapshots to the replica to make the > order of visibility match the master even when that > doesn't match the order that transactions returned from commit. > > I don't see how we can accept (1) and call it synchronous > replication. I'm pretty dubious about (3), because we don't even > have Snapshot Isolation on the replica, really. Is (3) where we're > currently at? An advantage of (4) is that on the replica we would > get the same SI behavior at Repeatable Read that exists on the > master, and we could even use the same mechanism for SSI to provide > Serializable isolation on the replica. > > I (predictably) like (4) -- even though it's a lot of work.... I think that (4), beyond being a lot of work, will also have pretty terrible performance. You're basically talking about emitting two WAL records for every commit instead of one. That's not going to be awesome. It might be OK for small or relatively lightly loaded systems, or those with "big" transactions.
But for something like pgbench or DBT-2, I think it's going to be a big problem. WAL is already a major bottleneck for us; we need to find a way to make it less of one, not more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
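Robert's "you can't be more asynchronous than the guy in front of you" constraint can be sketched as a commit-order queue. This is a hypothetical illustration, not PostgreSQL internals: the transaction names, LSNs, and the `release_visible` helper are all invented for the example.

```python
from collections import deque

def release_visible(queue, flushed_lsn, replicated_lsn):
    """Release transactions from the head of the commit-order queue whose
    durability requirement is met; stop at the first one still waiting."""
    released = []
    while queue:
        xid, lsn, level = queue[0]
        if level == "A":                  # synchronous_commit=off: no wait of its own
            ok = True
        elif level == "B":                # wait for local xlog flush
            ok = lsn <= flushed_lsn
        else:                             # "C": wait for flush and replication
            ok = lsn <= flushed_lsn and lsn <= replicated_lsn
        if not ok:
            break                         # everyone behind it keeps waiting too
        queue.popleft()
        released.append(xid)
    return released

# A sync-rep transaction (C) at the head blocks an async (A) and a
# local-flush (B) transaction behind it, even though their own
# requirements are already satisfied:
q = deque([("T1", 100, "C"), ("T2", 110, "A"), ("T3", 120, "B")])
assert release_visible(q, flushed_lsn=120, replicated_lsn=90) == []
# Once replication catches up, all three are released together:
assert release_visible(q, flushed_lsn=120, replicated_lsn=120) == ["T1", "T2", "T3"]
```

This also shows the release step Robert describes: each flush or sync-rep completion drains the queue only up to the next transaction still waiting on a higher LSN.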
On Thu, 2011-07-28 at 16:42 -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 4:36 PM, Hannu Krosing <hannu@krosing.net> wrote: > > so in case of a stuck slave the syncrep transaction is committed after > > crash, but is not committed before the crash happens ? > > Yep. > > > how will the replay process get stuck again during recovery ? > > Nope. Are you sure ? I mean the case when a stuck master comes up but the slave is still not functional. How does this behavior currently fit in with ACID and sync guarantees ? > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > (4) We communicate acceptable snapshots to the replica to make the > order of visibility match the master even when that > doesn't match the order that transactions returned from commit. I wonder if some interpretation of two-phase commit could make Robert's original suggestion implement this. On the master the commit sequence would look something like: 1. Insert commit record to the WAL 2. Wait for replication 3. Get a commit seq nr and mark XIDs visible 4. WAL log the seq nr 5. Return success to client When replaying: * When replaying commit record, do everything but make the tx visible. * When replaying the commit sequence number if there is a gap between last visible commit and current: insert the commit sequence nr. into the list of waiting commits. else: mark current and all directly following waiting tx's visible This would give consistent visibility order on master and slave. Robert is right that this would undesirably increase WAL traffic. Delaying this traffic would undesirably increase replay lag between master and slave. But it seems to me that this could be an optional WAL level on top of hot_standby that would only be enabled if consistent visibility on slaves is desired. -- Ants Aasma
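Ants's replay-side rule can be sketched as follows. This is a hypothetical illustration of the gap-handling logic, not actual recovery code; `make_visible` and its arguments are invented names. A replayed commit becomes visible only when its sequence number directly follows the last visible one; otherwise it waits, and the commit that eventually fills the gap drains the waiters behind it.

```python
def make_visible(seq, last_visible, waiting, visible):
    """Replay of a commit-sequence-number record `seq`; returns the new
    last visible sequence number."""
    if seq != last_visible + 1:
        # Gap: some earlier commit's sequence number hasn't replayed yet.
        waiting.add(seq)
        return last_visible
    # No gap: make this commit visible, then drain any waiters that
    # directly follow it.
    visible.append(seq)
    last_visible = seq
    while last_visible + 1 in waiting:
        waiting.remove(last_visible + 1)
        last_visible += 1
        visible.append(last_visible)
    return last_visible

# Sequence-number records replay out of order: 1, 3, 4, 2.
last, waiting, visible = 0, set(), []
for s in [1, 3, 4, 2]:
    last = make_visible(s, last, waiting, visible)
assert visible == [1, 2, 3, 4]  # visible strictly in commit-sequence order
```

Whatever order the records arrive in, transactions become visible on the slave in exactly the commit-sequence order established on the master, which is the consistency property (4) asks for.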
On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, again, there are three levels: > > (A) synchronous_commit=off. No waiting! > (B) synchronous_commit=local transactions, and synchronous_commit=on > transactions when sync rep is not in use. Wait for xlog flush. > (C) synchronous_commit=on transactions when sync rep IS in use. Wait > for xlog flush and replication. ... > So basically, you can't be more asynchronous than the guy in front of > you. (A) still gives a guarantee - transactions that begin after the commit returns see the committed transaction. A weaker variant would say that if the commit returns, and the server doesn't crash in the meantime, the commit would at some point become visible. Maybe even that transactions that begin after the commit returns become visible after that commit. -- Ants Aasma
On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> (4) We communicate acceptable snapshots to the replica to make the >> order of visibility match the master even when that >> doesn't match the order that transactions returned from commit. > > I wonder if some interpretation of two-phase commit could make Robert's > original suggestion implement this. > > On the master the commit sequence would look something like: > 1. Insert commit record to the WAL > 2. Wait for replication > 3. Get a commit seq nr and mark XIDs visible > 4. WAL log the seq nr > 5. Return success to client > > When replaying: > * When replaying commit record, do everything but make > the tx visible. > * When replaying the commit sequence number > if there is a gap between last visible commit and current: > insert the commit sequence nr. into the list of waiting commits. > else: > mark current and all directly following waiting tx's visible > > This would give consistent visibility order on master and slave. Robert > is right that this would undesirably increase WAL traffic. Delaying this > traffic would undesirably increase replay lag between master and slave. > But it seems to me that this could be an optional WAL level on top of > hot_standby that would only be enabled if consistent visibility on > slaves is desired. I think you nailed it. An additional point to think about: if we were willing to insist on streaming replication, we could send the commit sequence numbers via a side channel rather than writing them to WAL, which would be a lot cheaper. That might even be a reasonable thing to do, because if you're doing log shipping, this is all going to be super-not-real-time anyway. OTOH, I know we don't want to make WAL shipping anything less than a first class citizen, so maybe not.
At any rate, we may be getting a little sidetracked here from the original point of the thread, which was how to make snapshot-taking cheaper. Maybe there's some tie-in to when transactions become visible, but I think it's pretty weak. The existing system could be hacked up to avoid making transactions visible out of LSN order, and the system I proposed could make them visible either in LSN order or do the same thing we do now. They are basically independent problems, AFAICS. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 28, 2011 at 8:12 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Fri, Jul 29, 2011 at 2:20 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, again, there are three levels: >> >> (A) synchronous_commit=off. No waiting! >> (B) synchronous_commit=local transactions, and synchronous_commit=on >> transactions when sync rep is not in use. Wait for xlog flush. >> (C) synchronous_commit=on transactions when sync rep IS in use. Wait >> for xlog flush and replication. > ... >> So basically, you can't be more asynchronous than the guy in front of >> you. > > (A) still gives a guarantee - transactions that begin after the commit > returns see > the committed transaction. A weaker variant would say that if the commit > returns, and the server doesn't crash in the meantime, the commit would at > some point become visible. Maybe even that transactions that begin after the > commit returns become visible after that commit. Yeah, you could do that. But that's such a weak guarantee that I'm not sure it has much practical utility. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: >> (4) We communicate acceptable snapshots to the replica to make >> the order of visibility match the master even when >> that doesn't match the order that transactions returned from >> commit. >> I (predictably) like (4) -- even though it's a lot of work.... > > I think that (4), beyond being a lot of work, will also have > pretty terrible performance. You're basically talking about > emitting two WAL records for every commit instead of one. Well, I can think of a great many other ways this could be done, each with its own trade-offs of various types of overhead against how close the replica is to current. At one extreme you could do what you describe; at the other you could generate a new snapshot on the replica once every few minutes. Then there are more clever ways; in discussions a few months ago I suggested that adding two new bit flags to the commit record would suffice, and I don't remember anyone blowing holes in that idea. Of course, that was to achieve serializable behavior on the replica, based on some assumption that the current hot standby already supported repeatable read. We might need another bit or two to solve the problems with that which have surfaced on this thread. -Kevin
On Thu, 2011-07-28 at 20:14 -0400, Robert Haas wrote: > On Thu, Jul 28, 2011 at 7:54 PM, Ants Aasma <ants.aasma@eesti.ee> wrote: > > On Thu, Jul 28, 2011 at 11:54 PM, Kevin Grittner > > <Kevin.Grittner@wicourts.gov> wrote: > >> (4) We communicate acceptable snapshots to the replica to make the > >> order of visibility match the master even when that > >> doesn't match the order that transactions returned from commit. > > > > I wonder if some interpretation of two-phase commit could make Robert's > > original suggestion implement this. > > > > On the master the commit sequence would look something like: > > 1. Insert commit record to the WAL > > 2. Wait for replication > > 3. Get a commit seq nr and mark XIDs visible > > 4. WAL log the seq nr > > 5. Return success to client > > > > When replaying: > > * When replaying commit record, do everything but make > > the tx visible. > > * When replaying the commit sequence number > > if there is a gap between last visible commit and current: > > insert the commit sequence nr. into the list of waiting commits. > > else: > > mark current and all directly following waiting tx's visible > > > > This would give consistent visibility order on master and slave. Robert > > is right that this would undesirably increase WAL traffic. Delaying this > > traffic would undesirably increase replay lag between master and slave. > > But it seems to me that this could be an optional WAL level on top of > > hot_standby that would only be enabled if consistent visibility on > > slaves is desired. > > I think you nailed it. Agreed, this would keep current semantics on master and same visibility order on master and slave. > An additional point to think about: if we were willing to insist on > streaming replication, we could send the commit sequence numbers via a > side channel rather than writing them to WAL, which would be a lot > cheaper. Why do you think that side channel is cheaper than main WAL ?
How would you handle synchronising the two ? > That might even be a reasonable thing to do, because if > you're doing log shipping, this is all going to be super-not-real-time > anyway. But perhaps you still may want to preserve visibility order to be able to do PITR to exact transaction "commit", no ? > OTOH, I know we don't want to make WAL shipping anything less > than a first class citizen, so maybe not. > > At any rate, we may be getting a little sidetracked here from the > original point of the thread, which was how to make snapshot-taking > cheaper. Maybe there's some tie-in to when transactions become > visible, but I think it's pretty weak. The existing system could be > hacked up to avoid making transactions visible out of LSN order, and > the system I proposed could make them visible either in LSN order or > do the same thing we do now. They are basically independent problems, > AFAICS. Agreed. -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> An additional point to think about: if we were willing to insist on >> streaming replication, we could send the commit sequence numbers via a >> side channel rather than writing them to WAL, which would be a lot >> cheaper. > > Why do you think that side channel is cheaper than main WAL ? You don't have to flush it to disk, and you can use some other lock that isn't as highly contended as WALInsertLock to synchronize it. >> That might even be a reasonable thing to do, because if >> you're doing log shipping, this is all going to be super-not-real-time >> anyway. > > But perhaps you still may want to preserve visibility order to be able > to do PITR to exact transaction "commit", no ? Maybe. In practice, I suspect most people won't be willing to pay the price a feature like this would exact. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 2011-07-29 at 10:23 -0400, Robert Haas wrote: > On Fri, Jul 29, 2011 at 10:20 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > >> An additional point to think about: if we were willing to insist on > >> streaming replication, we could send the commit sequence numbers via a > >> side channel rather than writing them to WAL, which would be a lot > >> cheaper. > > > > Why do you think that side channel is cheaper than main WAL ? > > You don't have to flush it to disk, You can probably write the "i became visible" WAL record without forcing a flush and still get the same visibility order. > and you can use some other lock > that isn't as highly contended as WALInsertLock to synchronize it. but you will need to synchronise it with WAL replay on slave anyway. It seems easiest to just insert it in the WAL stream and be done with it. > >> That might even be a reasonable thing to do, because if > >> you're doing log shipping, this is all going to be super-not-real-time > >> anyway. > > > > But perhaps you still may want to preserve visibility order to be able > > to do PITR to exact transaction "commit", no ? > > Maybe. In practice, I suspect most people won't be willing to pay the > price a feature like this would exact. Unless we find some really bad problems with different visibility orders on master and slave(s) you are probably right. -- ------- Hannu Krosing PostgreSQL Infinite Scalability and Performance Consultant PG Admin Book: http://www.2ndQuadrant.com/books/
On Thu, Jul 28, 2011 at 8:32 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > Maybe this is why other databases don't offer per backend async commit ? Oracle has async commit but very few people know about it. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services