Thread: CLOG extension
Currently, the following can happen:

1. A backend needs a new transaction, so it calls GetNewTransactionId(). It acquires XidGenLock and then calls ExtendCLOG().
2. ExtendCLOG() decides that a new CLOG page is needed, so it acquires CLogControlLock and then calls ZeroCLOGPage().
3. ZeroCLOGPage() calls WriteZeroPageXlogRec(), which calls XLogInsert().
4. XLogInsert() acquires WALInsertLock and then calls AdvanceXLInsertBuffer().
5. AdvanceXLInsertBuffer() sees that WAL buffers may be full and acquires WALWriteLock to check, and possibly to write WAL if the buffers are in fact full.

At this point, we have a single backend simultaneously holding XidGenLock, CLogControlLock, WALInsertLock, and WALWriteLock, which from a concurrency standpoint is, at the risk of considerable understatement, not so great. The situation is no better if (as seems to be more typical) we block waiting for WALWriteLock rather than actually holding it ourselves: either way, nobody can perform any WAL-logged operation, get an XID, or consult CLOG - so all write activity is blocked, and read activity will block as well, as soon as it hits an unhinted tuple.

This leads to a couple of questions.

First, do we really need to WAL-log CLOG extension at all? Perhaps recovery should simply extend CLOG when it hits a commit or abort record that references a page that doesn't exist yet.

Second, is there any harm in pre-extending CLOG? Currently, we don't extend CLOG until we get to the point where the XID we're allocating is on a page that doesn't exist yet, so no further XIDs can be assigned until the extension is complete. We could avoid that by extending a page in advance. Right now, whenever a backend rolls onto a new CLOG page, it must first create it. What we could do instead is try to stay one page ahead of whatever we're currently using: whenever a backend rolls onto a new CLOG page, it creates *the next page*. That way, it can release XidGenLock first and *then* call ExtendCLOG(). That allows all the other backends to continue allocating XIDs in parallel with the CLOG extension. In theory we could still get a pile-up if the entire page's worth of XIDs gets used up before we can finish the extension, but that should be pretty rare. (Alternatively, we could introduce a separate background process to extend CLOG, and just have foreground processes kick it periodically. This currently seems like overkill to me.)

Third, assuming we do need to write WAL, can we somehow rejigger the logging so that we need not hold CLogControlLock while we're writing it, so that other backends can still do CLOG lookups during that time? Maybe when we take CLogControlLock and observe that extension is needed, we can release CLogControlLock, WAL-log the extension, and then retake CLogControlLock to do SimpleLruZeroPage(). We might need a separate CLogExtensionLock to make sure that two different backends aren't trying to do this dance at the same time, but that should be largely uncontended.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
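To pin down the pre-extension idea from the second question, here is a minimal sketch. The constants match clog.c's real definitions, but PreExtendCLOG() is a hypothetical helper and the function body is simplified pseudocode, not a drop-in replacement for the real GetNewTransactionId():

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4                    /* 8 / CLOG_BITS_PER_XACT */
#define CLOG_XACTS_PER_PAGE  (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32768 at 8K */

#define TransactionIdToPage(xid) ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)

TransactionId
GetNewTransactionId(void)
{
    TransactionId xid;

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
    xid = ShmemVariableCache->nextXid;
    TransactionIdAdvance(ShmemVariableCache->nextXid);
    LWLockRelease(XidGenLock);          /* released *before* any CLOG work */

    /*
     * If we just rolled onto a new page, zero the page *after* it.  The
     * current page already exists, so other backends can keep allocating
     * XIDs from it while the extension proceeds.
     */
    if (xid % CLOG_XACTS_PER_PAGE == 0)
        PreExtendCLOG(TransactionIdToPage(xid) + 1);    /* hypothetical */

    return xid;
}

The pile-up case mentioned above corresponds to all 32,768 XIDs on the current page being consumed before PreExtendCLOG() completes.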
Robert Haas <robertmhaas@gmail.com> writes:
> [ CLOG extension is horrid for concurrency ]

Yeah. When that code was designed, a page's worth of transactions seemed like a lot so we didn't worry too much about performance glitches when we crossed a page boundary. It's time to do something about it though.

The idea of extending CLOG in advance, so that the work doesn't have to be done with quite so many locks held, sounds like a plan to me. The one thing I'd worry about is that extension has to interact with freezing of very old XIDs and subsequent removal of old clog pages; make sure that pages will get removed before they could possibly get created again.

> First, do we really need to WAL-log CLOG extension at all? Perhaps
> recovery should simply extend CLOG when it hits a commit or abort
> record that references a page that doesn't exist yet.

Maybe, but see above. I'd be particularly worried about this in a hot standby situation, as you would then end up with HS queries seeing XIDs (in tuples) for which there was no clog page yet. I'm inclined to think it's better to continue to WAL-log it, but try to arrange to do that without holding the other locks that are now involved.

			regards, tom lane
On Thu, May 3, 2012 at 5:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> [ CLOG extension is horrid for concurrency ]
>
> Yeah. When that code was designed, a page's worth of transactions
> seemed like a lot so we didn't worry too much about performance glitches
> when we crossed a page boundary. It's time to do something about it
> though.
>
> The idea of extending CLOG in advance, so that the work doesn't have to
> be done with quite so many locks held, sounds like a plan to me. The
> one thing I'd worry about is that extension has to interact with
> freezing of very old XIDs and subsequent removal of old clog pages;
> make sure that pages will get removed before they could possibly
> get created again.
>
>> First, do we really need to WAL-log CLOG extension at all? Perhaps
>> recovery should simply extend CLOG when it hits a commit or abort
>> record that references a page that doesn't exist yet.
>
> Maybe, but see above. I'd be particularly worried about this in a hot
> standby situation, as you would then end up with HS queries seeing XIDs
> (in tuples) for which there was no clog page yet. I'm inclined to think
> it's better to continue to WAL-log it, but try to arrange to do that
> without holding the other locks that are now involved.

Why not switch to 1 WAL record per file, rather than 1 per page? (32 pages, IIRC.)

We can then have the whole new file written as zeroes by a background process, which needn't do that while holding the XidGenLock.

My earlier patch to do background flushing from bgwriter can be extended to do that.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
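Simon's recollection is right: SLRU segment files hold 32 pages. As a quick back-of-the-envelope check of how much one WAL record per file would cover (a standalone program; the constants mirror those in slru.h and clog.c with 8K blocks):

#include <stdio.h>

/* Constants as in PostgreSQL's slru.h / clog.c, assuming BLCKSZ = 8192. */
#define BLCKSZ                 8192
#define CLOG_XACTS_PER_BYTE    4        /* two status bits per transaction */
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT 32

int
main(void)
{
    printf("XIDs per CLOG page:    %d\n", CLOG_XACTS_PER_PAGE);    /* 32768 */
    printf("XIDs per segment file: %d\n",
           CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);          /* 1048576 */
    return 0;
}

So one record per file would cover roughly a million XIDs, versus 32,768 for the current record-per-page scheme.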
On Thu, May 3, 2012 at 1:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Why not switch to 1 WAL record per file, rather than 1 per page? (32
> pages, IIRC.)
>
> We can then have the whole new file written as zeroes by a background
> process, which needn't do that while holding the XidGenLock.

I thought about doing a single record covering a larger number of pages, but that would be an even bigger hit if it were ever to occur in the foreground path, so you'd want to be very sure that the background process was going to absorb all the work. And if the background process is going to absorb all the work, then I'm not sure it matters very much whether we emit one xlog record or 32. After all, it's pretty low volume compared to all the other xlog traffic. Maybe there's some room for optimization here, but it doesn't seem like the first thing to pursue.

Doing it in a background process, though, may make sense. What I'm a little worried about is that - on a busy system - we've only got about 2 seconds to complete each CLOG extension, and we must do an fsync in order to get there. And the fsync can easily take a good chunk of (or even more than) that two seconds. So it's possible that saddling the bgwriter with this responsibility would be putting too many eggs in one basket. We might find that under the high-load scenarios where this is supposed to help, bgwriter is already too busy doing other things, and it doesn't get around to extending CLOG quickly enough. Or, conversely, we might find that it does get around to extending CLOG quickly enough, but consequently fails to carry out its regular duties. We could of course add a NEW background process just for this purpose, but it'd be nicer if we didn't have to go that far.

> My earlier patch to do background flushing from bgwriter can be
> extended to do that.

I've just been looking at that patch again since, as we discussed before, commit 3ae5133b1cf478d516666f2003bc68ba0edb84c7 fixed a problem in this area, and it may be that we can now show a benefit of this approach where we couldn't before. I think it's separate from what we're discussing here, so let me write more about that on another thread after I poke at it a little more.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 3, 2012 at 2:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Doing it in a background process, though, may make sense. What I'm a
> little worried about is that - on a busy system - we've only got about
> 2 seconds to complete each CLOG extension, and we must do an fsync in
> order to get there.

Scratch that - we don't routinely need to do an fsync, though we can end up backed up behind one if wal_buffers are full. I'm still more interested in the do-it-a-page-in-advance idea discussed upthread, but this might be viable as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 3, 2012 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 3, 2012 at 1:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Why not switch to 1 WAL record per file, rather than 1 per page? (32
>> pages, IIRC.)
>>
>> We can then have the whole new file written as zeroes by a background
>> process, which needn't do that while holding the XidGenLock.
>
> I thought about doing a single record covering a larger number of
> pages, but that would be an even bigger hit if it were ever to occur
> in the foreground path, so you'd want to be very sure that the
> background process was going to absorb all the work. And if the
> background process is going to absorb all the work, then I'm not sure
> it matters very much whether we emit one xlog record or 32. After all,
> it's pretty low volume compared to all the other xlog traffic. Maybe
> there's some room for optimization here, but it doesn't seem like the
> first thing to pursue.
>
> Doing it in a background process, though, may make sense. What I'm a
> little worried about is that - on a busy system - we've only got about
> 2 seconds to complete each CLOG extension, and we must do an fsync in
> order to get there. And the fsync can easily take a good chunk of (or
> even more than) that two seconds. So it's possible that saddling the
> bgwriter with this responsibility would be putting too many eggs in
> one basket. We might find that under the high-load scenarios where
> this is supposed to help, bgwriter is already too busy doing other
> things, and it doesn't get around to extending CLOG quickly enough.
> Or, conversely, we might find that it does get around to extending
> CLOG quickly enough, but consequently fails to carry out its regular
> duties. We could of course add a NEW background process just for this
> purpose, but it'd be nicer if we didn't have to go that far.

Your two paragraphs have roughly opposite arguments...

Doing it every 32 pages would give you 30 seconds to complete the fsync, if you kicked it off when halfway through the previous file - at current maximum rates. So there is utility in doing it in larger chunks. If it is too slow, we would just wait for the sync like we do now.

I think we need another background process, since we have both cleaning and pre-allocating tasks to perform.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
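The 30-second window follows directly from the segment arithmetic, and it squares with the two-seconds-per-page figure above. A quick sanity check (standalone; the XID consumption rate is back-calculated from Simon's numbers, not measured):

#include <stdio.h>

int
main(void)
{
    double page = 32768;                 /* XIDs per CLOG page */
    double half_segment = 16 * page;     /* half a 32-page file: 524288 XIDs */

    /* Simon: half a segment lasts ~30 s at current maximum rates. */
    double rate = half_segment / 30.0;

    printf("implied rate: ~%.0f XIDs/sec\n", rate);         /* ~17476 */
    printf("seconds per page at that rate: %.1f\n",
           page / rate);                                    /* ~1.9 */
    return 0;
}

So both figures imply an XID consumption rate somewhere around 17,000 per second.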
On Thu, May 3, 2012 at 3:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Your two paragraphs have roughly opposite arguments...
>
> Doing it every 32 pages would give you 30 seconds to complete the
> fsync, if you kicked it off when halfway through the previous file -
> at current maximum rates. So there is utility in doing it in larger
> chunks.

Maybe, but I'd like to try changing one thing at a time. If we change too much at once, it's likely to be hard to figure out where the improvement is coming from. Moving the task to a background process is one improvement; doing it in larger chunks is another. Those deserve independent testing.

> If it is too slow, we would just wait for the sync like we do now.
>
> I think we need another background process, since we have both cleaning
> and pre-allocating tasks to perform.

Possibly. I have some fear of ending up with too many background processes, but we may need them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 3, 2012 at 1:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Possibly. I have some fear of ending up with too many background
> processes, but we may need them.

I sort of care about this, but only on systems that are not very busy and could otherwise get by with fewer resources -- for example, it'd be nice to turn off autovacuum and the stats collector if it really doesn't have to be around. Perhaps a Nap Commander[0] process or procedure (if baked into postmaster, to optimize to one process from two) would do the trick?

This may be related to some of the nascent work mentioned recently on allowing for backend daemons, primarily for event scheduling. Said Nap Commander could also possibly help with wakeups.

[0]: Credit to Will Leinweber for the memorable name.

--
fdr
Excerpts from Daniel Farina's message of jue may 03 17:04:03 -0400 2012:
> On Thu, May 3, 2012 at 1:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Possibly. I have some fear of ending up with too many background
>> processes, but we may need them.
>
> I sort of care about this, but only on systems that are not very busy
> and could otherwise get by with fewer resources -- for example, it'd
> be nice to turn off autovacuum and the stats collector if it really
> doesn't have to be around. Perhaps a Nap Commander[0] process or
> procedure (if baked into postmaster, to optimize to one process from
> two) would do the trick?

I'm not sure I see the point in worrying about this at all. I mean, a process doing nothing does not waste much resources, does it? Other than keeping a PID that you can't use for other stuff.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, May 3, 2012 at 2:26 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> I'm not sure I see the point in worrying about this at all. I mean, a
> process doing nothing does not waste much resources, does it? Other
> than keeping a PID that you can't use for other stuff.

Not much, but we do have an interest in very thinly sliced database clusters.

--
fdr
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Daniel Farina's message of jue may 03 17:04:03 -0400 2012:
>> I sort of care about this, but only on systems that are not very busy
>> and could otherwise get by with fewer resources -- for example, it'd
>> be nice to turn off autovacuum and the stats collector if it really
>> doesn't have to be around. Perhaps a Nap Commander[0] process or
>> procedure (if baked into postmaster, to optimize to one process from
>> two) would do the trick?

> I'm not sure I see the point in worrying about this at all. I mean, a
> process doing nothing does not waste much resources, does it? Other
> than keeping a PID that you can't use for other stuff.

Even more to the point, killing a process and then relaunching it whenever there's something for it to do seems likely to consume *more* resources than just letting it sit. (So long as it's only just sitting, of course. Processes with periodic-wakeup logic are another matter.)

Note that I'm not particularly in favor of having Yet Another process just to manage clog extension; the incremental complexity seems way more than anyone has shown to be justified. But the "resources" argument against it seems pretty weak.

			regards, tom lane
On Thu, May 3, 2012 at 2:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Daniel Farina's message of jue may 03 17:04:03 -0400 2012:
>>> I sort of care about this, but only on systems that are not very busy
>>> and could otherwise get by with fewer resources -- for example, it'd
>>> be nice to turn off autovacuum and the stats collector if it really
>>> doesn't have to be around. Perhaps a Nap Commander[0] process or
>>> procedure (if baked into postmaster, to optimize to one process from
>>> two) would do the trick?
>
>> I'm not sure I see the point in worrying about this at all. I mean, a
>> process doing nothing does not waste much resources, does it? Other
>> than keeping a PID that you can't use for other stuff.
>
> Even more to the point, killing a process and then relaunching it
> whenever there's something for it to do seems likely to consume *more*
> resources than just letting it sit. (So long as it's only just sitting,
> of course. Processes with periodic-wakeup logic are another matter.)
>
> Note that I'm not particularly in favor of having Yet Another process
> just to manage clog extension; the incremental complexity seems way
> more than anyone has shown to be justified. But the "resources"
> argument against it seems pretty weak.

I agree with that; another incremental process addition doesn't really cause concern for me. Rather, I meant to suggest that the only optimization that could really have an effect for me is going from N background processes to, say, 1, or 0 (excluding postmaster) on idle databases. Four to five or five to seven won't really be a big change. And, as per my last thread about lock shmem sizing and how it gets involved with hot standby, I have much more serious problems to worry about anyway.

I do seem to recall that I measured the number of dirty pages for an idle Postgres database at maybe about a megabyte (a few processes times a couple hundred K or so). Ideally, I'd really like to be able to run a functional Postgres cluster in 10MB or less, although getting the most out of even 100MB would be a big step forward for now.

--
fdr
On Thu, May 3, 2012 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 3, 2012 at 3:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Your two paragraphs have roughly opposite arguments...
>>
>> Doing it every 32 pages would give you 30 seconds to complete the
>> fsync, if you kicked it off when halfway through the previous file -
>> at current maximum rates. So there is utility in doing it in larger
>> chunks.
>
> Maybe, but I'd like to try changing one thing at a time. If we change
> too much at once, it's likely to be hard to figure out where the
> improvement is coming from. Moving the task to a background process
> is one improvement; doing it in larger chunks is another. Those
> deserve independent testing.

You gave a good argument why background pre-allocation wouldn't work very well if we do it a page at a time. I believe you.

If we do it a file at a time, we can just write the file without pulling it in page by page through the SLRU; as long as we write the WAL for it first, we don't need to fsync either of them.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
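A minimal sketch of that scheme, assuming a hypothetical WriteZeroSegmentXlogRec() record type and a hypothetical background-process caller, with error handling omitted (nothing like this exists in the tree; it just pins down the idea):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ                 8192
#define SLRU_PAGES_PER_SEGMENT 32

/*
 * Pre-allocate an entire 32-page pg_clog segment from a background
 * process.  The WAL record covering the whole file is emitted first;
 * since the file's initial contents are all zeroes and redo could
 * recreate them from that record, neither write needs an immediate
 * fsync in the foreground path.
 */
static void
PreAllocateClogSegment(int segno)
{
    char    path[64];
    char    zeroes[BLCKSZ] = {0};
    int     fd;
    int     i;

    WriteZeroSegmentXlogRec(segno);     /* hypothetical WAL record */

    /* SLRU segment file names are four hex digits, e.g. pg_clog/0001 */
    snprintf(path, sizeof(path), "pg_clog/%04X", segno);
    fd = open(path, O_RDWR | O_CREAT, 0600);

    for (i = 0; i < SLRU_PAGES_PER_SEGMENT; i++)
        (void) write(fd, zeroes, BLCKSZ);   /* error handling omitted */

    close(fd);
}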
On Fri, May 4, 2012 at 3:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, May 3, 2012 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, May 3, 2012 at 3:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Your two paragraphs have roughly opposite arguments...
>>>
>>> Doing it every 32 pages would give you 30 seconds to complete the
>>> fsync, if you kicked it off when halfway through the previous file -
>>> at current maximum rates. So there is utility in doing it in larger
>>> chunks.
>>
>> Maybe, but I'd like to try changing one thing at a time. If we change
>> too much at once, it's likely to be hard to figure out where the
>> improvement is coming from. Moving the task to a background process
>> is one improvement; doing it in larger chunks is another. Those
>> deserve independent testing.
>
> You gave a good argument why background pre-allocation wouldn't work
> very well if we do it a page at a time. I believe you.

Your confidence is sort of gratifying, but in this case I believe it's misplaced. On more careful analysis, it seems that ExtendCLOG() does just two things: (1) evict a CLOG buffer and replace it with a zero'd page representing the new page, and (2) write an XLOG record for the change. Apparently, "extending" CLOG doesn't actually involve extending anything on disk at all. We rely on the future buffer eviction to do that, which is surprisingly different from the way relation extension is handled.

So CLOG extension is normally fast, but occasionally something goes wrong. So far I see two ways that can happen: (1) the WAL insertion stalls because wal_buffers are full, and we're forced to wait for WAL to be written (and perhaps fsync'd, since both are covered by the same lock), or (2) the page we choose to evict happens to be dirty, and we have to write+fsync it before repurposing it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
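For reference, a condensed paraphrase of ExtendCLOG() from src/backend/access/transam/clog.c as it stood at the time (the comments tying it to the two failure modes are added, not from the source):

void
ExtendCLOG(TransactionId newestXact)
{
    int     pageno;

    /*
     * No work except at the first XID of a page.  (Beware that just
     * after wraparound, page zero's first XID is
     * FirstNormalTransactionId.)
     */
    if (TransactionIdToPgIndex(newestXact) != 0 &&
        !TransactionIdEquals(newestXact, FirstNormalTransactionId))
        return;

    pageno = TransactionIdToPage(newestXact);

    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

    /*
     * ZeroCLOGPage() calls SimpleLruZeroPage(), which picks a victim
     * buffer -- writing and fsyncing it first if it is dirty (failure
     * mode 2) -- and zeroes it in memory; then WriteZeroPageXlogRec(),
     * whose XLogInsert() can stall on full wal_buffers (failure mode 1).
     * No clog file is touched in the common path.
     */
    ZeroCLOGPage(pageno, true);

    LWLockRelease(CLogControlLock);
}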
On 4 May 2012 13:59, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 4, 2012 at 3:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Thu, May 3, 2012 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, May 3, 2012 at 3:20 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Your two paragraphs have roughly opposite arguments...
>>>>
>>>> Doing it every 32 pages would give you 30 seconds to complete the
>>>> fsync, if you kicked it off when halfway through the previous file -
>>>> at current maximum rates. So there is utility in doing it in larger
>>>> chunks.
>>>
>>> Maybe, but I'd like to try changing one thing at a time. If we change
>>> too much at once, it's likely to be hard to figure out where the
>>> improvement is coming from. Moving the task to a background process
>>> is one improvement; doing it in larger chunks is another. Those
>>> deserve independent testing.
>>
>> You gave a good argument why background pre-allocation wouldn't work
>> very well if we do it a page at a time. I believe you.
>
> Your confidence is sort of gratifying, but in this case I believe it's
> misplaced. On more careful analysis, it seems that ExtendCLOG() does
> just two things: (1) evict a CLOG buffer and replace it with a zero'd
> page representing the new page, and (2) write an XLOG record for the
> change. Apparently, "extending" CLOG doesn't actually involve
> extending anything on disk at all. We rely on the future buffer
> eviction to do that, which is surprisingly different from the way
> relation extension is handled.
>
> So CLOG extension is normally fast, but occasionally something goes
> wrong.

I don't agree that it's normally fast.

WALInsert contention is high, so there is usually a long queue. As we've discussed, this can be done offline, and so (2) can be completely avoided in the main line. Considering that all new xids wait for this action, any wait at all is bad and takes time to drain once it clears.

Evicting a CLOG page has a cost, because the tail is almost always dirty when we switch pages.

Doing both of those will ensure that switching to a new page requires zero wait time.

So you have the solution. Not sure what else you're looking for.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 4, 2012 at 9:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 4 May 2012 13:59, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, May 4, 2012 at 3:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> You gave a good argument why background pre-allocation wouldn't work
>>> very well if we do it a page at a time. I believe you.
>>
>> Your confidence is sort of gratifying, but in this case I believe it's
>> misplaced. On more careful analysis, it seems that ExtendCLOG() does
>> just two things: (1) evict a CLOG buffer and replace it with a zero'd
>> page representing the new page, and (2) write an XLOG record for the
>> change. Apparently, "extending" CLOG doesn't actually involve
>> extending anything on disk at all. We rely on the future buffer
>> eviction to do that, which is surprisingly different from the way
>> relation extension is handled.
>>
>> So CLOG extension is normally fast, but occasionally something goes
>> wrong.
>
> I don't agree that it's normally fast.
>
> WALInsert contention is high, so there is usually a long queue. As
> we've discussed, this can be done offline, and so (2) can be completely
> avoided in the main line. Considering that all new xids wait for this
> action, any wait at all is bad and takes time to drain once it clears.
>
> Evicting a CLOG page has a cost, because the tail is almost always
> dirty when we switch pages.
>
> Doing both of those will ensure that switching to a new page requires
> zero wait time.
>
> So you have the solution. Not sure what else you're looking for.

Nothing, really. I was just mooting some ideas before I went and started coding, to see what people thought. I've got your opinion and Tom's, and of course my own, so now I'm off to test some different approaches. At the moment I'm running a battery of tests on background-writing CLOG, which I will post about when they are complete, and I intend to play around with some of the ideas from this thread as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company