Thread: WALInsertLock contention
I've been thinking about the problem of $SUBJECT, and while I know it's too early to think seriously about any 9.2 development, I want to get my thoughts down in writing while they're fresh in my head.

It seems to me that there are two basic approaches to this problem. We could either split up the WAL stream into several streams, say one per database or one per tablespace or something of that sort, or we could keep it as a single stream but try not to do so much locking whilst in the process of getting it out the door. Or we could try to do both, and maybe ultimately we'll need to. However, if the second one is practical, it's got two major advantages: it'll probably be a lot less invasive, and it won't add any extra fsync traffic. In thinking about how we might accomplish the goal of reducing lock contention, it occurred to me there's probably no need for the final WAL stream to reflect the exact order in which WAL is generated.

For example, suppose transaction T1 inserts a tuple into table A; transaction T2 inserts a tuple into table B; T1 commits; T2 commits. The commit records need to be in the right order, and all the actions that are part of a given transaction need to precede the associated commit record, but, for example, I don't think it would matter if you emitted the commit record for T1 before T2's insert into B. Or you could switch the order in which you logged the inserts, since they're not touching the same buffers.

So here's the basic idea. Each backend, if it so desires, is permitted to maintain a per-backend WAL buffer. Per-backend WAL buffers live in shared memory and can be accessed by any backend, but the idea is that most of the time only one backend will be accessing them, so that the locks won't be heavily contended. Any WAL written to a per-backend WAL buffer will eventually be transferred into the main WAL buffers, and flushed.
When a process writes to a per-backend WAL buffer, it writes (1) the actual WAL data and (2) the list of buffers affected. Those buffers are stamped with a fake LSN that points back to the per-backend WAL buffer, and they can't be written until the WAL has been moved from the per-backend WAL buffers to the main WAL buffers.

So, if a buffer with a fake LSN needs to be (a) written back to the OS or (b) modified by a backend other than the one that owns the fake LSN, this triggers a flush of the per-backend WAL buffers to the main WAL buffers. When this happens, all the affected buffers get stamped with a real LSN and the entries are discarded from the per-backend WAL buffers. Such a flush would also be needed when a backend commits or otherwise needs an XLOG flush, or when there's no more per-backend buffer space. In theory, all of this taken together should mean that WAL gets pushed out in larger chunks: a transaction that does three inserts and commits should only need to grab WALInsertLock once, instead of once per heap insert, once per index insert, and again for the commit, though it'll have to write a bigger chunk of data when it does get the lock. It'll have to repeatedly grab the lock on its per-backend WAL buffer, but ideally that's uncontended.

A further refinement would be to try to jigger things so that as a backend fills up per-backend WAL buffers, it somehow throws them over the fence to one of the background processes to write out. For short-running transactions, that won't really make any difference, since the commit will force the per-backend buffers out to the main buffers anyway. But for long-running transactions it seems like it could be quite useful; in essence, the task of assembling the final WAL stream from the WAL output of individual backends becomes a background activity, and ideally the background process doing the work is the only one touching the cache lines being shuffled around.
Of course, to make this work, backends would need a steady supply of available per-backend WAL buffers. Maybe shared buffers could be used for this purpose, with the buffer header being marked in some special way to indicate that this is what the buffer's being used for.

One not-so-good property of this algorithm is that the operation of moving per-backend WAL into the main WAL buffers requires relocking all the buffers whose fake LSNs now need to be changed to "real" LSNs. That could possibly be problematic from a performance standpoint, and there are deadlock risks to worry about too.

Any thoughts? Other ideas?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I've been thinking about the problem of $SUBJECT, and while I know > it's too early to think seriously about any 9.2 development, I want to > get my thoughts down in writing while they're fresh in my head. > > It seems to me that there are two basic approaches to this problem. > We could either split up the WAL stream into several streams, say one > per database or one per tablespace or something of that sort, or we > could keep it as a single stream but try not to do so much locking > whilst in the process of getting it out the door. Or we could try to > do both, and maybe ultimately we'll need to. However, if the second > one is practical, it's got two major advantages: it'll probably be a > lot less invasive, and it won't add any extra fsync traffic. In > thinking about how we might accomplish the goal of reducing lock > contention, it occurred to me there's probably no need for the final > WAL stream to reflect the exact order in which WAL is generated. > > For example, suppose transaction T1 inserts a tuple into table A; > transaction T2 inserts a tuple into table B; T1 commits; T2 commits. > The commit records need to be in the right order, and all the actions > that are part of a given transaction need to precede the associated > commit record, but, for example, I don't think it would matter if you > emitted the commit record for T1 before T2's insert into B. Or you > could switch the order in which you logged the inserts, since they're > not touching the same buffers. > > So here's the basic idea. Each backend, if it so desires, is > permitted to maintain a per-backend WAL buffer. Per-backend WAL > buffers live in shared memory and can be accessed by any backend, but > the idea is that most of the time only one backend will be accessing > them, so that the locks won't be heavily contended. Any WAL written > to a per-backend WAL buffer will eventually be transferred into the > main WAL buffers, and flushed. 
When a process writes to a per-backend > WAL buffer, it writes (1) the actual WAL data and (2) the list of > buffers affected. Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers. > > So, if a buffer with a fake LSN needs to be (a) written back to the OS > or (b) modified by a backend other than the one that owns the fake > LSN, this triggers a flush of the per-backend WAL buffers to the main > WAL buffers. When this happens, all the affected buffers get stamped > with a real LSN and the entries are discarded from the per-backend WAL > buffers. Such a flush would also be needed when a backend commits or > otherwise needs an XLOG flush, or when there's no more per-backend > buffer space. In theory, all of this taken together should mean that > WAL gets pushed out in larger chunks: a transaction that does three > inserts and commits should only need to grab WALInsertLock once, > instead of once per heap insert, once per index insert, and again for > the commit, though it'll have to write a bigger chunk of data when it > does get the lock. It'll have to repeatedly grab the lock on its > per-backend WAL buffer, but ideally that's uncontended. > > A further refinement would be to try to jigger things so that as a > backend fills up per-backend WAL buffers, it somehow throws them over > the fence to one of the background processes to write out. For > short-running transactions, that won't really make any difference, > since the commit will force the per-backend buffers out to the main > buffers anyway. 
But for long-running transactions it seems like it > could be quite useful; in essence, the task of assembling the final > WAL stream from the WAL output of individual backends becomes a > background activity, and ideally the background process doing the work > is the only one touching the cache lines being shuffled around. Of > course, to make this work, backends would need a steady supply of > available per-backend WAL buffers. Maybe shared buffers could be used > for this purpose, with the buffer header being marked in some special > way to indicate that this is what the buffer's being used for. > > One not-so-good property of this algorithm is that the operation of > moving per-backend WAL into the main WAL buffers requires relocking > all the buffers whose fake LSNs now need to be changed to "real" LSNs. > That could possibly be problematic from a performance standpoint, and > there are deadlock risks to worry about too. > > Any thoughts? Other ideas?

I vaguely recall that UNISYS used to present patches to reduce the WAL buffer lock contention and enhanced the CPU scalability limit from 12 to 16 or so (if my memory serves). Your second idea is somewhat related to the patches? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Wed, Feb 16, 2011 at 11:13 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > I vaguely recall that UNISYS used to present patches to reduce the WAL > buffer lock contention and enhanced the CPU scalability limit from 12 > to 16 or so(if my memory serves). Your second idea is somewhat related > to the patches? Not sure. Do you have a link to the archives, or any idea when this discussion occurred/what the subject line was? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Feb 16, 2011 at 11:13 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > > I vaguely recall that UNISYS used to present patches to reduce the WAL > > buffer lock contention and enhanced the CPU scalability limit from 12 > > to 16 or so(if my memory serves). Your second idea is somewhat related > > to the patches? > > Not sure. Do you have a link to the archives, or any idea when this > discussion occurred/what the subject line was? They presented at PgCon a couple of years in a row, iirc.. http://www.pgcon.org/2007/schedule/events/16.en.html I thought there was another one but I'm not finding it atm.. Thanks, Stephen
>> Not sure. Do you have a link to the archives, or any idea when this >> discussion occurred/what the subject line was? > > They presented at PgCon a couple of years in a row, iirc.. > > http://www.pgcon.org/2007/schedule/events/16.en.html

Yes, this one. On page 18, they talked about their customized version of PostgreSQL called "Postgres 8.2.4-uis":

Change WALInsertLock access
- Using SpinLockAcquire() as WALInsertLock locked most of time
- Considering a queue mechanism for WALInsertLock

I'm not sure if they brought their patches to public or not though... -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Wed, Feb 16, 2011 at 11:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I've been thinking about the problem of $SUBJECT, and while I know > it's too early to think seriously about any 9.2 development, I want to > get my thoughts down in writing while they're fresh in my head. > > It seems to me that there are two basic approaches to this problem. > We could either split up the WAL stream into several streams, say one > per database or one per tablespace or something of that sort, or we > could keep it as a single stream but try not to do so much locking > whilst in the process of getting it out the door. Or we could try to > do both, and maybe ultimately we'll need to. However, if the second > one is practical, it's got two major advantages: it'll probably be a > lot less invasive, and it won't add any extra fsync traffic. In > thinking about how we might accomplish the goal of reducing lock > contention, it occurred to me there's probably no need for the final > WAL stream to reflect the exact order in which WAL is generated. > > For example, suppose transaction T1 inserts a tuple into table A; > transaction T2 inserts a tuple into table B; T1 commits; T2 commits. > The commit records need to be in the right order, and all the actions > that are part of a given transaction need to precede the associated > commit record, but, for example, I don't think it would matter if you > emitted the commit record for T1 before T2's insert into B. Or you > could switch the order in which you logged the inserts, since they're > not touching the same buffers. > > So here's the basic idea. Each backend, if it so desires, is > permitted to maintain a per-backend WAL buffer. Per-backend WAL > buffers live in shared memory and can be accessed by any backend, but > the idea is that most of the time only one backend will be accessing > them, so that the locks won't be heavily contended. 
Any WAL written > to a per-backend WAL buffer will eventually be transferred into the > main WAL buffers, and flushed. When a process writes to a per-backend > WAL buffer, it writes (1) the actual WAL data and (2) the list of > buffers affected. Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers. > > So, if a buffer with a fake LSN needs to be (a) written back to the OS > or (b) modified by a backend other than the one that owns the fake > LSN, this triggers a flush of the per-backend WAL buffers to the main > WAL buffers. When this happens, all the affected buffers get stamped > with a real LSN and the entries are discarded from the per-backend WAL > buffers. Such a flush would also be needed when a backend commits or > otherwise needs an XLOG flush, or when there's no more per-backend > buffer space. In theory, all of this taken together should mean that > WAL gets pushed out in larger chunks: a transaction that does three > inserts and commits should only need to grab WALInsertLock once, > instead of once per heap insert, once per index insert, and again for > the commit, though it'll have to write a bigger chunk of data when it > does get the lock. It'll have to repeatedly grab the lock on its > per-backend WAL buffer, but ideally that's uncontended.

There's probably an obvious explanation that I'm not seeing, but if you're not delegating the work of writing the buffers out to someone else, why do you need to lock the per backend buffer at all? That is, why does it have to be in shared memory?
Suppose the following are true:

*) Writing qualifying data (non commit, non switch)
*) There is room left in whatever you are copying to

Then you could trylock WALInsertLock, and if you fail to get it, just copy the qualifying data into a private buffer and punt...otherwise just do the current behavior. When you *do* get a lock, either because you got lucky or because you had to wait anyways, you write out the data you previously staged, fixing up the LSNs as you go. Even if you do have to write it to shared memory, I think your idea is a winner -- probably a fair amount of work can get done before ultimately being forced to wait...maybe enough to change the scaling dynamics.

> A further refinement would be to try to jigger things so that as a > backend fills up per-backend WAL buffers, it somehow throws them over > the fence to one of the background processes to write out. For > short-running transactions, that won't really make any difference, > since the commit will force the per-backend buffers out to the main > buffers anyway. But for long-running transactions it seems like it > could be quite useful; in essence, the task of assembling the final > WAL stream from the WAL output of individual backends becomes a > background activity, and ideally the background process doing the work > is the only one touching the cache lines being shuffled around. Of > course, to make this work, backends would need a steady supply of > available per-backend WAL buffers. Maybe shared buffers could be used > for this purpose, with the buffer header being marked in some special > way to indicate that this is what the buffer's being used for.

That seems complicated -- plus I think the key is to distribute as much of the work as possible. Why would the forward lateral to the background processor not require a similar lock to WALInsertLock? merlin
On Wed, Jun 8, 2011 at 1:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > There's probably an obvious explanation that I'm not seeing, ... Yep. :-) > but if > you're not delegating the work of writing the buffers out to someone > else, why do you need to lock the per backend buffer at all? That is, > why does it have to be in shared memory? Suppose that if the > following are true: > *) Writing qualifying data (non commit, non switch) > *) There is room left in whatever you are copying to > you could trylock WalInsertLock, and if failing to get it, just copy > qualifying data into a private buffer and punt if the following are > true...otherwise just do the current behavior. And here it is: Writing a buffer requires a write & fsync of WAL through the buffer LSN. If the WAL for the buffers were completely inaccessible to other backends, then those buffers would be pinned in shared memory. Which would make things very difficult at buffer eviction time, or for checkpoints. At any rate, even if it were possible to make it work, it'd be a misplaced optimization. It isn't touching shared memory - or even touching the LWLock - that's expensive; it's the LWLock contention that kills you, either because stuff blocks, or just because the CPUs burn a lot of cycles fighting over cache lines. An LWLock that is typically taken by only one backend at a time is pretty cheap. I suppose I couldn't afford to be so blasé if we were trying to scale to 2048-core systems where even inserting a memory barrier is expensive enough to worry about, but we've got a ways to go before we need to start worrying about that. [...snip...] >> A further refinement would be to try to jigger things so that as a >> backend fills up per-backend WAL buffers, it somehow throws them over >> the fence to one of the background processes to write out. 
For >> short-running transactions, that won't really make any difference, >> since the commit will force the per-backend buffers out to the main >> buffers anyway. But for long-running transactions it seems like it >> could be quite useful; in essence, the task of assembling the final >> WAL stream from the WAL output of individual backends becomes a >> background activity, and ideally the background process doing the work >> is the only one touching the cache lines being shuffled around. Of >> course, to make this work, backends would need a steady supply of >> available per-backend WAL buffers. Maybe shared buffers could be used >> for this purpose, with the buffer header being marked in some special >> way to indicate that this is what the buffer's being used for. > > That seems complicated -- plus I think the key is to distribute as > much of the work as possible. Why would the forward lateral to the > background processor not require a similar lock to WalInsertLock? Well, that's the problem. It would. Now, in an ideal world, you might still hope to get some benefit: only the background writer would typically be writing to the real WAL stream, so that's not contended. And the contention between the background writer and the individual backends is only two-way. There's no single point where you have every process on the system piling on to a single lock. But I'm not sure we can really make it work well enough to do more than nibble around at the edges of the problem. Consider: INSERT INTO foo VALUES (1,2,3); This is going to generate XLOG_HEAP_INSERT followed by XLOG_XACT_COMMIT. And now it wants to flush WAL. So now you're pretty much forced to have it go perform the serialization operation itself, and you're right back in contention soup. 
Batching two records together and inserting them in one operation is presumably going to be more efficient than inserting them one at a time, but not all that much more efficient; and there are bookkeeping and memory bandwidth costs to get there. If we are dealing with long-running transactions, or asynchronous commit, then this approach might have legs -- but I suspect that in real life most transactions are small, and the default configuration is synchronous_commit=on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 7:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 1:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> There's probably an obvious explanation that I'm not seeing, ... > > Yep. :-) > >> but if >> you're not delegating the work of writing the buffers out to someone >> else, why do you need to lock the per backend buffer at all? That is, >> why does it have to be in shared memory? Suppose that if the >> following are true: >> *) Writing qualifying data (non commit, non switch) >> *) There is room left in whatever you are copying to >> you could trylock WalInsertLock, and if failing to get it, just copy >> qualifying data into a private buffer and punt if the following are >> true...otherwise just do the current behavior. > > And here it is: Writing a buffer requires a write & fsync of WAL > through the buffer LSN. If the WAL for the buffers were completely > inaccessible to other backends, then those buffers would be pinned in > shared memory. Which would make things very difficult at buffer > eviction time, or for checkpoints.

Well, (bear with me here) I'm not giving up that easy. Pinning a judiciously small number of buffers in shared memory so you can reduce congestion on the insert lock might be an acceptable trade-off in high contention scenarios...in fact I assumed that was the whole point of your original idea, which I still think has tremendous potential. Obviously, you wouldn't want more than a very small percentage of shared buffers overall (say 1-10% max) to be pinned in this way. The trylock is an attempt to cap the downside case so that you aren't unnecessarily pinning buffers in say, long running i/o bound transactions where insert lock contention is low. Maybe you could experiment with very small private insert buffer sizes (say 64 kb) that would hopefully provide some of the benefits (if there are in fact any) and mitigate potential costs.
Another tweak you could make is that, once you've trylocked and failed to acquire within a transaction, you always punt from there on in until you fill up or need to block per ordering requirements. Or maybe the whole thing doesn't help at all...just trying to understand the problem better. > At any rate, even if it were possible to make it work, it'd be a > misplaced optimization. It isn't touching shared memory - or even > touching the LWLock - that's expensive; it's the LWLock contention > that kills you, either because stuff blocks, or just because the CPUs > burn a lot of cycles fighting over cache lines. An LWLock that is > typically taken by only one backend at a time is pretty cheap. I > suppose I couldn't afford to be so blasé if we were trying to scale to > 2048-core systems where even inserting a memory barrier is expensive > enough to worry about, but we've got a ways to go before we need to > start worrying about that. Right -- although it isn't so much of an optimization (although you still want to do everything reasonable to keep work under the lock as light as possible, and shm->shm copy is going to be slower than mem->shm) as a simplification trade-off. You don't have to worry about deadlocks messing around with your per backend buffers during your internal 'flush', and it's generally just easier messing around with private memory (less code, less locking, less everything). One point i'm missing though. Getting back to your original idea, how does writing to shmem prevent you from having to keep buffers pinned? I'm reading your comment here: "Those buffers are stamped with a fake LSN that points back to the per-backend WAL buffer, and they can't be written until the WAL has been moved from the per-backend WAL buffers to the main WAL buffers." That suggests to me that you have to keep them pinned anyways. I'm still a bit fuzzy on how the per-backend buffers being in shm conveys any advantage.
IOW, (trying not to be obtuse) under what circumstances would backend A want to read from or (especially) write to backend B's wal buffer? merlin
On Wed, Jun 8, 2011 at 10:18 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > One point i'm missing though. Getting back to your original idea, how > does writing to shmem prevent you from having to keep buffers pinned? > I'm reading your comment here: > "Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers." > > That suggests to me that you have to keep them pinned anyways. I'm > still a bit fuzzy on how the per-backend buffers being in shm conveys > any advantage. IOW, (trying not to be obtuse) under what > circumstances would backend A want to read from or (especially) write > to backend B's wal buffer? If backend A needs to evict a buffer with a fake LSN, it can go find the WAL that needs to be serialized, do that, flush WAL, and then evict the buffer. IOW, backend A's private WAL buffer will not be completely private. Only A will write to the buffer, but we don't know who will remove WAL from the buffer and insert it into the main stream. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun 8, 2011, at 10:15 AM, Robert Haas wrote: >> That suggests to me that you have to keep them pinned anyways. I'm >> still a bit fuzzy on how the per-backend buffers being in shm conveys >> any advantage. IOW, (trying not to be obtuse) under what >> circumstances would backend A want to read from or (especially) write >> to backend B's wal buffer? > > If backend A needs to evict a buffer with a fake LSN, it can go find > the WAL that needs to be serialized, do that, flush WAL, and then > evict the buffer. Isn't the only time that you'd need to evict if you ran out of buffers? If the buffer was truly private, would that still be an issue? Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Wed, Jun 8, 2011 at 6:49 PM, Jim Nasby <jim@nasby.net> wrote: >> If backend A needs to evict a buffer with a fake LSN, it can go find >> the WAL that needs to be serialized, do that, flush WAL, and then >> evict the buffer. > > Isn't the only time that you'd need to evict if you ran out of buffers? Sure, but that happens all the time. See pg_stat_bgwriter.buffers_backend. > If the buffer was truly private, would that still be an issue? I'm not sure if you mean make the buffer private or make the WAL storage arena private, but I'm pretty well convinced that neither one can work. > Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... Maybe... but I hope not. I just found an academic paper on this subject, about which I will post shortly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 10:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 6:49 PM, Jim Nasby <jim@nasby.net> wrote: >>> If backend A needs to evict a buffer with a fake LSN, it can go find >>> the WAL that needs to be serialized, do that, flush WAL, and then >>> evict the buffer. >> >> Isn't the only time that you'd need to evict if you ran out of buffers? > > Sure, but that happens all the time. See pg_stat_bgwriter.buffers_backend. > >> If the buffer was truly private, would that still be an issue? > > I'm not sure if you mean make the buffer private or make the WAL > storage arena private, but I'm pretty well convinced that neither one > can work. You're probably right. I think though there is enough hypothetical upside to the private buffer case that it should be attempted just to see what breaks. The major tricky bit is dealing with the new pin/unpin mechanics. I'd like to give it the 'college try'. (being typically vain and attention seeking, this is right up my alley) :-D. >> Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... If this were an easy way out, all high-performance file systems would have multiple journals which you could write to concurrently (which they don't afaik). > Maybe... but I hope not. I just found an academic paper on this > subject, about which I will post shortly. I'm thinking that as long as your transactions have to be rigidly ordered you have a fundamental bottleneck you can't really work around. One way to maybe get around this is to try and work out on the fly if transaction 'A' is functionally independent of transaction 'B' -- maybe then you could try and write them concurrently to pre-allocated space on the log, or to separate logs maintained for that purpose. Good luck with that...even if you could somehow get it to work, you would still have degenerate cases (like, 99% of real world cases) to contend with.
Another thing you could try is to keep separate logs for rigidly ordered data (commits, xlog switch, etc) and non rigidly ordered data (everything else). On the non rigidly ordered side, you can pre-allocate log space and write to it. This is more or less a third potential route (#1 and #2 being the shared/private buffer approaches) of leveraging the fact that some of the data does not have to be rigidly ordered. Ultimately though even that could only get you so far, because it incurs other costs and even contention on the lock for inserting the commit records could start to bottleneck you. If something useful squirts out of academia, I'd sure like to see it :-). merlin
On Wed, Jun 8, 2011 at 11:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > You're probably right. I think though there is enough hypothetical > upside to the private buffer case that it should be attempted just to > see what breaks. The major tricky bit is dealing with the new > pin/unpin mechanics. I'd like to give it the 'college try'. (being > typically vain and attention seeking, this is right up my alley) :-D. Well, I think it's fairly clear what will break: - If you make the data-file buffer completely private, then what will happen when some other backend needs to read or write that buffer? - If you make the XLOG spool private, you will not be able to checkpoint. But I just work here. Feel free to hit your head on that brick wall all you like. If you manage to make a hole (in the wall, not your head), I'll be as happy as anyone to climb through...! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 11:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> You're probably right. I think though there is enough hypothetical >> upside to the private buffer case that it should be attempted just to >> see what breaks. The major tricky bit is dealing with the new >> pin/unpin mechanics. I'd like to give it the 'college try'. (being >> typically vain and attention seeking, this is right up my alley) :-D. > > Well, I think it's fairly clear what will break: > > - If you make the data-file buffer completely private, then what will > happen when some other backend needs to read or write that buffer? The private wal buffer? The whole point (maybe impossible) is to try and engineer it so that the other backends *never* have to read or write it -- from their point of view, it hasn't happened yet (even though it has been written into some heap buffers). Since all data action on ongoing transactions can happen at any time, moving WAL inserts into the private buffer delays their entry into the log so you can avoid taking locks for pre-commit heap activity. Doing this allows those backends to pretend they actually did write data out to the log without breaking the 'wal before data' rule, which is effected by keeping the pin on pages with your magic LSN (which I'm starting to wonder if it should be a flag like BM_DEFERRED_WAL). We essentially are moving xlog activity as far ahead in time as possible (although in a very limited time space) in order to combine locks and hopefully gain efficiency. It all comes down to which rules you can bend and which you can break. The heap pages that have been marked this way may or may not have to be off limits from the backend other than the one that did the marking, and if they have to be off limits logically, there may be no realistic path to make them so. I just don't know...I'm learning as I go.
At the end of the day, it's all coming off as pretty fragile if it even works, but it's fun to think about. Anyways, I'm inclined to experiment. > - If you make the XLOG spool private, you will not be able to checkpoint. Correct -- but I don't think this problem is intractable, and is really a secondary issue vs making sure the wal/heap/mvcc/backend interactions 'work'. The intent here is to spool only a relatively small amount of uncommitted transaction data for a short period of time, like 5-10 seconds. Maybe you bite the bullet and tell everyone to flush private WAL at checkpoint time via signal or something. Maybe you bend some of the rules on checkpoints. merlin
On Wed, Jun 8, 2011 at 11:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > The heap pages that have been marked this way may or may not have to > be off limits from the backend other than the one that did the > marking, and if they have to be off limits logically, there may be no > realistic path to make them so. After some more thought, plus a bit of off-list coaching from Haas, I see now the whole approach is basically a non-starter due to the above. Heap pages *are* off limits, because once deferred they can't be scribbled on and committed by other transactions -- that would violate the 'wal before data' rule. To make it 'work', you'd have to implement shared memory machinery to do cooperative flushing as suggested upthread (complex, nasty) or simply block on deferred pages...which would be a deadlock factory. Oh well. :( merlin