Thread: RC2 and open issues
We are now packaging RC2. If nothing comes up after RC2 is released, we can move to final release. The open items list is attached.

The doc changes can be easily completed before final. The only code issue left is with bgwriter. We always knew we needed to find better defaults for its parameters, but we are only now finding more fundamental issues.

I think the summary I have seen recently pegs it right --- our use of % of dirty buffers requires a scan of the entire buffer cache, and the current delay of bgwriter is too high, but we can't lower it because the buffer cache scan will become too expensive if done too frequently.

I think the ideal solution would be to remove bgwriter_percent or change it to be a percentage of all buffers, not just dirty buffers, so we don't have to scan the entire list. If we set the new value to 10% with a delay of 1 second, and the bgwriter remembers the place it stopped scanning the buffer cache, you will clean out the buffer cache completely every 10 seconds.

Right now it seems no one can find proper values. We were clear that this was an issue but it is bad news that we are only addressing it during RC.

The 8.1 solution is to have some feedback system so writes by individual backends cause the bgwriter to work more frequently.

The big question is what to do during RC2. Do we just leave it as suboptimal knowing we will revisit it in 8.1, or try an incremental solution for 8.0 that might work better? We have to decide now.

---------------------------------------------------------------------------

PostgreSQL 8.0 Open Items
=========================

Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems.

Changes
-------
* change bgwriter buffer scan behavior?
* adjust bgwriter defaults

Documentation
-------------
* synchronize supported encodings and docs
* improve external interfaces documentation section
* manual pages

Fixed Since Last Beta
---------------------

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
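A minimal sketch of the arithmetic behind the "10% with a delay of 1 second" idea, assuming the cache can be modelled as a flat array of dirty flags. `SweepState`, `sweep_chunk`, and `sweep_pass` are hypothetical names for illustration only, not PostgreSQL code. With a remembered start position, ten passes at 10% each touch every buffer exactly once, which is where the "every 10 seconds" figure comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scan state: remembers where the previous pass stopped. */
typedef struct
{
    size_t nbuffers;    /* total number of buffers in the cache */
    size_t next;        /* buffer index at which the next pass starts */
} SweepState;

/* Buffers examined per pass: the given percent of the cache, rounded up. */
size_t
sweep_chunk(size_t nbuffers, unsigned percent)
{
    assert(nbuffers > 0 && percent >= 1 && percent <= 100);
    return (nbuffers * percent + 99) / 100;
}

/*
 * One pass: examine the next chunk of buffers, "write" (clear) the dirty
 * ones, and remember where we stopped.  Returns the number of pages written.
 */
size_t
sweep_pass(SweepState *st, unsigned char *dirty, unsigned percent)
{
    size_t chunk = sweep_chunk(st->nbuffers, percent);
    size_t written = 0;

    for (size_t i = 0; i < chunk; i++)
    {
        size_t idx = (st->next + i) % st->nbuffers;

        if (dirty[idx])
        {
            dirty[idx] = 0;     /* stands in for the actual page write */
            written++;
        }
    }
    st->next = (st->next + chunk) % st->nbuffers;
    return written;
}
```

Calling `sweep_pass` once per delay interval with `percent = 10` would wrap back to the starting buffer after ten calls. Note that this scans by buffer index and ignores how recently each page was used.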
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think the ideal solution would be to remove bgwriter_percent or change
> it to be a percentage of all buffers, not just dirty buffers, so we
> don't have to scan the entire list. If we set the new value to 10% with
> a delay of 1 second, and the bgwriter remembers the place it stopped
> scanning the buffer cache, you will clean out the buffer cache
> completely every 10 seconds.

But we don't *want* it to clean out the buffer cache completely. There's no point in writing a "hot" page every few seconds. So I don't think I believe in remembering where we stopped anyway.

I think there's a reasonable case to be made for redefining bgwriter_percent as the max percent of the total buffer list to scan (not the max percent of the list to return --- Jan correctly pointed out that the latter is useless). Then we could modify StrategyDirtyBufferList so that the percent and maxpages parameters are passed in, so it can stop as soon as either one is satisfied. This would be a fairly small/safe code change and I wouldn't have a problem doing it even at this late stage of the cycle.

However ... we would have to crank up the default bgwriter_percent, and I don't know if we have any better idea what to set it to after such a change than we do now ...

regards, tom lane
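The "stop as soon as either one is satisfied" rule can be sketched as follows. This is only an illustration under stated assumptions: `collect_dirty_bounded` is a hypothetical stand-in, not the real StrategyDirtyBufferList() or its signature, and the buffer list is modelled as an array of dirty flags ordered from the LRU end.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the proposed behaviour.
 * dirty_by_lru_pos[0] is the LRU end of the list.  Scans at most
 * scan_percent of the list and stops early once maxpages dirty entries
 * have been collected.  Returns the number collected and stores their
 * list positions in out_pos (which must have room for maxpages entries).
 */
size_t
collect_dirty_bounded(const unsigned char *dirty_by_lru_pos, size_t nbuffers,
                      unsigned scan_percent, size_t maxpages, size_t *out_pos)
{
    assert(scan_percent <= 100);

    size_t scan_limit = nbuffers * scan_percent / 100;
    size_t found = 0;

    for (size_t pos = 0; pos < scan_limit && found < maxpages; pos++)
    {
        if (dirty_by_lru_pos[pos])
            out_pos[found++] = pos;
    }
    return found;
}
```

With this definition the percent caps the scanning work and maxpages caps the I/O, so the cost of one call no longer depends on the total number of dirty buffers.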
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think the ideal solution would be to remove bgwriter_percent or change
> > it to be a percentage of all buffers, not just dirty buffers, so we
> > don't have to scan the entire list. If we set the new value to 10% with
> > a delay of 1 second, and the bgwriter remembers the place it stopped
> > scanning the buffer cache, you will clean out the buffer cache
> > completely every 10 seconds.
>
> But we don't *want* it to clean out the buffer cache completely.

You are only cleaning out in pieces over a 10 second period so it is getting dirty. You are not scanning the entire buffer at one time.

> There's no point in writing a "hot" page every few seconds. So I don't
> think I believe in remembering where we stopped anyway.

I was thinking if you are doing this scanning every X milliseconds then after a while the front of the buffer cache will be mostly clean and the end will be dirty so you will always be going over the same early ones to get to the later dirty ones. Remembering the location gives the scan more uniform coverage of the buffer cache.

You need a "clock sweep" like BSD uses (and probably others).

> I think there's a reasonable case to be made for redefining
> bgwriter_percent as the max percent of the total buffer list to scan
> (not the max percent of the list to return --- Jan correctly pointed out
> that the latter is useless). Then we could modify
> StrategyDirtyBufferList so that the percent and maxpages parameters are
> passed in, so it can stop as soon as either one is satisfied. This
> would be a fairly small/safe code change and I wouldn't have a problem
> doing it even at this late stage of the cycle.
>
> However ... we would have to crank up the default bgwriter_percent,
> and I don't know if we have any better idea what to set it to after
> such a change than we do now ...

Once we make the change we will have to get our testers working on it. We need those figures to change over time based on backends doing writes, but that isn't going to happen for 8.0.

-- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > You need a "clock sweep" like BSD uses (and probably others). No, that's *fundamentally* wrong. The reason we are going to the trouble of maintaining a complicated cache algorithm like ARC is so that we can tell the heavily used pages from the lesser used ones. To throw away that knowledge in favor of doing I/O with a plain clock sweep algorithm is just wrong. What's more, I don't even understand what clock sweep would mean given that the ordering of the list is constantly changing. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am confused. If we change the percentage to be X% of the entire > buffer cache, and we set it to 1%, and we exit when either the dirty > pages or % are reached, don't we end up just scanning the first 1% of > the cache over and over again? Exactly. But 1% would be uselessly small with this definition. Offhand I'd think something like 50% might be a starting point; maybe even more. What that says is that a page isn't a candidate to be written out by the bgwriter until it's fallen halfway down the LRU list. regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am confused. If we change the percentage to be X% of the entire
> > buffer cache, and we set it to 1%, and we exit when either the dirty
> > pages or % are reached, don't we end up just scanning the first 1% of
> > the cache over and over again?
>
> Exactly. But 1% would be uselessly small with this definition. Offhand
> I'd think something like 50% might be a starting point; maybe even more.
> What that says is that a page isn't a candidate to be written out by the
> bgwriter until it's fallen halfway down the LRU list.

So we are not scanning by buffer address but using the LRU list? Are we sure they are mostly dirty?

-- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Exactly. But 1% would be uselessly small with this definition. Offhand >> I'd think something like 50% might be a starting point; maybe even more. >> What that says is that a page isn't a candidate to be written out by the >> bgwriter until it's fallen halfway down the LRU list. > So we are not scanning by buffer address but using the LRU list? Are we > sure they are mostly dirty? No. The entire point is to keep the LRU end of the list mostly clean. Now that you mention it, it might be interesting to try the approach of doing a clock scan on the buffer array and ignoring the ARC lists entirely. That would be a fundamentally different way of envisioning what the bgwriter is supposed to do, though. I think the main reason Jan didn't try that was he wanted to be sure the LRU page was usually clean so that backends would seldom end up doing writes for themselves when they needed to get a free buffer. Maybe we need a hybrid approach: clean a few percent of the LRU end of the ARC list in order to keep backends from blocking on writes, plus run a clock scan to keep checkpoints from having to do much. But that's way beyond what we have time for in the 8.0 cycle. regards, tom lane
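A rough sketch of what the hybrid approach might look like, assuming a simplified pool with one dirty flag per buffer, an array giving LRU order, and a clock hand. All names here (`Pool`, `clean_lru_tail`, `clock_scan_step`, `hybrid_round`) are hypothetical and do not correspond to anything in the PostgreSQL source; the real ARC lists (T1/T2) are collapsed into a single ordering for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer pool; none of these names are PostgreSQL's. */
typedef struct
{
    size_t         nbuffers;
    unsigned char *dirty;       /* dirty[b] is nonzero if buffer b is dirty */
    const size_t  *lru_order;   /* lru_order[0] is the least recently used buffer */
    size_t         clock_hand;  /* next buffer the clock scan will visit */
} Pool;

/* Track 1: clean dirty buffers near the LRU end so backends find clean victims. */
size_t
clean_lru_tail(Pool *p, size_t scan_limit, size_t max_writes)
{
    size_t written = 0;

    for (size_t pos = 0;
         pos < scan_limit && pos < p->nbuffers && written < max_writes;
         pos++)
    {
        size_t b = p->lru_order[pos];

        if (p->dirty[b])
        {
            p->dirty[b] = 0;    /* stands in for writing the page */
            written++;
        }
    }
    return written;
}

/* Track 2: slow clock scan over the buffer array to shrink the next checkpoint. */
size_t
clock_scan_step(Pool *p, size_t step)
{
    size_t written = 0;

    assert(p->nbuffers > 0);
    for (size_t i = 0; i < step; i++)
    {
        size_t b = p->clock_hand;

        p->clock_hand = (p->clock_hand + 1) % p->nbuffers;
        if (p->dirty[b])
        {
            p->dirty[b] = 0;
            written++;
        }
    }
    return written;
}

/* One bgwriter round does a little of each, LRU tail first. */
size_t
hybrid_round(Pool *p, size_t lru_scan_limit, size_t lru_max_writes,
             size_t clock_step)
{
    size_t written = clean_lru_tail(p, lru_scan_limit, lru_max_writes);

    written += clock_scan_step(p, clock_step);
    return written;
}
```

The two tracks are tuned independently: the LRU limits bound how much work is spent protecting backends from dirty victims, while the clock step controls how quickly the whole array gets covered between checkpoints.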
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Exactly. But 1% would be uselessly small with this definition. Offhand
> >> I'd think something like 50% might be a starting point; maybe even more.
> >> What that says is that a page isn't a candidate to be written out by the
> >> bgwriter until it's fallen halfway down the LRU list.
>
> > So we are not scanning by buffer address but using the LRU list? Are we
> > sure they are mostly dirty?
>
> No. The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely. That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though. I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.
>
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus run
> a clock scan to keep checkpoints from having to do much. But that's way
> beyond what we have time for in the 8.0 cycle.

OK, so we scan from the end of the LRU. If we scan X% and find _no_ dirty buffers perhaps we should start where we left off last time.

If we don't start where we left off, I am thinking if you do a lot of writes then do nothing, the next checkpoint would be huge because a lot of the LRU will be dirty because the bgwriter never got to it.

-- Bruce Momjian
On Mon, 20 Dec 2004, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Exactly. But 1% would be uselessly small with this definition. Offhand
> >> I'd think something like 50% might be a starting point; maybe even more.
> >> What that says is that a page isn't a candidate to be written out by the
> >> bgwriter until it's fallen halfway down the LRU list.
>
> > So we are not scanning by buffer address but using the LRU list? Are we
> > sure they are mostly dirty?
>
> No. The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely. That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though. I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.

Neil and I spoke with Jan briefly last week and he mentioned a few different approaches he'd been tossing over. Firstly, for alternative runs, start X% on from the LRU, so that we aren't scanning clean buffers all the time. Secondly, follow something like the approach you've mentioned above but remember the offset. So, if we're scanning 10%, after 10 runs we will have written out all buffers.

I was also thinking of benchmarking the effect of changing the algorithm in StrategyDirtyBufferList(): currently, for each iteration of the loop we read a buffer from each of T1 and T2. I was wondering what effect reading T1 first then T2 and vice versa would have on performance. I haven't thought about this too hard, though, so it might be wrong-headed.

>
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus run
> a clock scan to keep checkpoints from having to do much. But that's way
> beyond what we have time for in the 8.0 cycle.

Definitely.

>
> regards, tom lane

Thanks,

Gavin
Gavin Sherry wrote:
> Neil and I spoke with Jan briefly last week and he mentioned a few
> different approaches he'd been tossing over. Firstly, for alternative
> runs, start X% on from the LRU, so that we aren't scanning clean buffers
> all the time. Secondly, follow something like the approach you've
> mentioned above but remember the offset. So, if we're scanning 10%, after
> 10 runs we will have written out all buffers.
>
> I was also thinking of benchmarking the effect of changing the algorithm
> in StrategyDirtyBufferList(): currently, for each iteration of the loop we
> read a buffer from each of T1 and T2. I was wondering what effect reading
> T1 first then T2 and vice versa would have on performance. I haven't
> thought about this too hard, though, so it might be wrong headed.

So we are all thinking in the same direction. We might have only a few days to finalize this before final release.

-- Bruce Momjian
Gavin Sherry <swm@linuxworld.com.au> writes: > I was also thinking of benchmarking the effect of changing the algorithm > in StrategyDirtyBufferList(): currently, for each iteration of the loop we > read a buffer from each of T1 and T2. I was wondering what effect reading > T1 first then T2 and vice versa would have on performance. Looking at StrategyGetBuffer, it definitely seems like a good idea to try to keep the bottom end of both T1 and T2 lists clean. But we should work at T1 a bit harder. The insight I take away from today's discussion is that there are two separate goals here: try to keep backends that acquire a buffer via StrategyGetBuffer from being fed a dirty buffer they have to write, and try to keep the next upcoming checkpoint from having too much work to do. Those are both laudable goals but I hadn't really seen before that they may require different strategies to achieve. I'm liking the idea that bgwriter should alternate between doing writes in pursuit of the one goal and doing writes in pursuit of the other. regards, tom lane
> If we don't start where we left off, I am thinking if you do a lot of
> writes then do nothing, the next checkpoint would be huge because a lot
> of the LRU will be dirty because the bgwriter never got to it.

I think the problem is that we don't see whether a "read hot" page is also "write hot". We would want to write dirty "read hot" pages, but not "write hot" pages. It does not make sense to write a "write hot" page since it will be dirty again when the checkpoint comes.

Andreas
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > I was also thinking of benchmarking the effect of changing the algorithm
> > in StrategyDirtyBufferList(): currently, for each iteration of the loop we
> > read a buffer from each of T1 and T2. I was wondering what effect reading
> > T1 first then T2 and vice versa would have on performance.
>
> Looking at StrategyGetBuffer, it definitely seems like a good idea to
> try to keep the bottom end of both T1 and T2 lists clean. But we should
> work at T1 a bit harder.
>
> The insight I take away from today's discussion is that there are two
> separate goals here: try to keep backends that acquire a buffer via
> StrategyGetBuffer from being fed a dirty buffer they have to write,
> and try to keep the next upcoming checkpoint from having too much work
> to do. Those are both laudable goals but I hadn't really seen before
> that they may require different strategies to achieve. I'm liking the
> idea that bgwriter should alternate between doing writes in pursuit of
> the one goal and doing writes in pursuit of the other.

It seems we have added a new limitation to bgwriter by not doing a full scan. With a full scan we could easily grab the first X pages starting from the end of the LRU list and write them. By not scanning the full list we are opening the possibility of not seeing some of the front-most LRU dirty pages. And the full scan was removed so we can run bgwriter more frequently, but we might end up with other problems.

I have a new proposal. The idea is to cause bgwriter to increase its frequency based on how quickly it finds dirty pages.

First, we remove the GUC bgwriter_maxpages because I don't see a good way to set a default for that. A default value needs to be based on a percentage of the full buffer cache size.

Second, we make bgwriter_percent cause the bgwriter to stop its scan once it has found a number of dirty buffers that matches X% of the buffer cache size. So, if it is set to 5%, the bgwriter scan stops once it finds enough dirty buffers to equal 5% of the buffer cache size. Bgwriter continues to scan starting from the end of the LRU list, just like it does now.

Now, to control the bgwriter frequency we multiply the percent of the list it had to span by the bgwriter_delay value to determine when to run bgwriter next. For example, if you find enough dirty pages by looking at only 10% of the buffer cache you multiply 10% (0.10) * bgwriter_delay and that is when you run next. If you have to scan 50%, bgwriter runs next at 50% (0.50) * bgwriter_delay, and if it has to scan the entire list it is 100% (1.00) * bgwriter_delay.

What this does is to cause bgwriter to run more frequently when there are a lot of dirty buffers on the end of the LRU _and_ when the bgwriter scan will be quick. When there are few writes, bgwriter will run less frequently but will write dirty buffers nearer to the head of the LRU.

-- Bruce Momjian
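The delay rule in this proposal reduces to a single multiplication. A minimal sketch, assuming the LRU list is an array of dirty flags with position 0 at the LRU end; `scan_until_target` and `next_delay_ms` are hypothetical helper names, and the 200 ms used in the example below is just an illustrative value for bgwriter_delay.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan from the LRU end until `target` dirty buffers have been seen or the
 * list is exhausted.  Returns how many list entries were examined.
 */
size_t
scan_until_target(const unsigned char *dirty_by_lru_pos, size_t nbuffers,
                  size_t target)
{
    size_t found = 0;
    size_t pos = 0;

    while (pos < nbuffers && found < target)
    {
        if (dirty_by_lru_pos[pos])
            found++;
        pos++;
    }
    return pos;
}

/* Next sleep = (fraction of the list scanned) * bgwriter_delay, in ms. */
unsigned
next_delay_ms(size_t scanned, size_t nbuffers, unsigned bgwriter_delay_ms)
{
    assert(nbuffers > 0 && scanned <= nbuffers);
    return (unsigned) ((unsigned long long) bgwriter_delay_ms * scanned / nbuffers);
}
```

For example, `next_delay_ms(10, 100, 200)` gives 20 ms and `next_delay_ms(100, 100, 200)` gives the full 200 ms. A very short scan drives the delay toward zero, so an actual implementation would presumably need some lower bound; that bound is not part of the proposal as written.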
Bruce Momjian <pgman@candle.pha.pa.us> writes: > First, we remove the GUC bgwriter_maxpages because I don't see a good > way to set a default for that. A default value needs to be based on a > percentage of the full buffer cache size. This is nonsense. The admin knows what he set shared_buffers to, and so maxpages and percent of shared buffers are not really distinct ways of specifying things. The cases that make a percent spec useful are if (a) it is a percent of a non-constant number (eg, percent of total dirty pages as in the current code), or (b) it is defined in a way that lets it limit the amount of scanning work done (which it isn't useful for in the current code). But a maxpages spec is useful for (b) too. More to the point, maxpages is useful to set a hard limit on the amount of I/O generated by the bgwriter, and I think people will want to be able to do that. > Now, to control the bgwriter frequency we multiply the percent of the > list it had to span by the bgwriter_delay value to determine when to run > bgwriter next. I'm less than enthused about this. The idea of the bgwriter is to trickle out writes in a way that doesn't affect overall performance too much. Not to write everything in sight at any cost. I like the hybrid "keep the bottom of the ARC list clean, plus do a slow clock scan on the main buffer array" approach better. I can see that that directly impacts both of the goals that the bgwriter has. I don't see how a variable I/O rate really improves life on either score; it just makes things harder to predict. regards, tom lane
A quick $0.02 on how DB2 does this (at least in 7.x).

They used a combination of everything that's been discussed. The first priority of their background writer was to keep the LRU end of the cache free so individual backends would never have to wait to get a page. Then, they would look to pages that had been dirty for 'a long time', which was user configurable. Pages older than this setting were candidates to be written out even if they weren't close to LRU. Finally, I believe there were also settings for how often the writer would fire up, and how much work it would do at once.

I agree that the first priority should be to keep clean pages near LRU, but that you also don't want to get hammered at checkpoint time. I think what might be interesting to consider is keeping a list of dirty pages, which would remove the need to scan a very large buffer. Of course, in an environment with a heavy update load, it could be better to just scan the buffers, especially if you don't do a clock-sweep but instead look at where the last page you wrote out has ended up in the LRU list since you last ran, and start scanning from there (by definition everything after that page would have to be clean). Of course this is just conjecture on my part and would need testing to verify, and it's obviously beyond the scope of 8.0.

As for 8.0, I suspect at this point it's probably best to just go with whatever method has the smallest amount of code impact unless it's inherently broken.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > First, we remove the GUC bgwriter_maxpages because I don't see a good
> > way to set a default for that. A default value needs to be based on a
> > percentage of the full buffer cache size.
>
> This is nonsense. The admin knows what he set shared_buffers to, and so
> maxpages and percent of shared buffers are not really distinct ways of
> specifying things. The cases that make a percent spec useful are if
> (a) it is a percent of a non-constant number (eg, percent of total dirty
> pages as in the current code), or (b) it is defined in a way that lets
> it limit the amount of scanning work done (which it isn't useful for in
> the current code). But a maxpages spec is useful for (b) too. More to
> the point, maxpages is useful to set a hard limit on the amount of I/O
> generated by the bgwriter, and I think people will want to be able to do
> that.

I figured that if we specify a percentage users would not need to update this value regularly if they increase their shared buffers. I agree if you want to limit total I/O by the bgwriter an actual page count is better, but I assumed we were looking for bgwriter to do a certain percentage of total writes. If the system is doing a lot of writes then limiting the bgwriter doesn't help because then the backends are going to have to do the writes themselves.

> > Now, to control the bgwriter frequency we multiply the percent of the
> > list it had to span by the bgwriter_delay value to determine when to run
> > bgwriter next.
>
> I'm less than enthused about this. The idea of the bgwriter is to
> trickle out writes in a way that doesn't affect overall performance too
> much. Not to write everything in sight at any cost.

No question my idea makes tuning difficult. I was hoping it would be self-tuning but I am not sure.

> I like the hybrid "keep the bottom of the ARC list clean, plus do a slow
> clock scan on the main buffer array" approach better. I can see that
> that directly impacts both of the goals that the bgwriter has. I don't
> see how a variable I/O rate really improves life on either score; it
> just makes things harder to predict.

So what are we doing for 8.0?

-- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > So what are we doing for 8.0? Well, it looks like RC2 has already crashed and burned --- I can't imagine that Marc will let us release without an RC3 given what was committed today, never mind the btree bug that Mark Wong seems to have found. So maybe we should just bite the bullet and do something real about this. I'm willing to code up a proposed patch for the two-track idea I suggested, and if anyone else has a favorite maybe they could write something too. But do we have the resources to test such patches and make a decision in the next few days? At the moment my inclination is to sit on what we have. I've not seen any indication that 8.0 is really worse than earlier releases; the most you could argue against it is that it's not as much better as we hoped. That's not grounds to muck around at the RC3 stage. regards, tom lane
>At the moment my inclination is to sit on what we have. I've not seen
>any indication that 8.0 is really worse than earlier releases; the most
>you could argue against it is that it's not as much better as we hoped.
>That's not grounds to muck around at the RC3 stage.

If it is any help, CMD is basically dead right now and I expect it will be that way until the new year. 4 of my 5 C programmers are on vacation, but I do have one, plus a couple of non-C programmers.

We can't fix, but we can definitely help test.

Sincerely,

Joshua D. Drake

> regards, tom lane

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > So what are we doing for 8.0?
>
> Well, it looks like RC2 has already crashed and burned --- I can't
> imagine that Marc will let us release without an RC3 given what was
> committed today, never mind the btree bug that Mark Wong seems to have
> found. So maybe we should just bite the bullet and do something real
> about this.

Oh, is it that bad?

> I'm willing to code up a proposed patch for the two-track idea I
> suggested, and if anyone else has a favorite maybe they could write
> something too. But do we have the resources to test such patches and
> make a decision in the next few days?
>
> At the moment my inclination is to sit on what we have. I've not seen
> any indication that 8.0 is really worse than earlier releases; the most
> you could argue against it is that it's not as much better as we hoped.
> That's not grounds to muck around at the RC3 stage.

That was my question. It seems bgwriter is fine for low to medium traffic but doesn't handle high traffic, and increasing the scan rate makes things worse. I am fine with doing nothing, but if we are going to do something, I would like to do it now rather than later.

The only way I could see it being worse than pre-8.0 is that the bgwriter is doing fsync of all open files rather than using sync. Other than that, I think it should behave the same, or slightly better, right?

-- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > The only way I could see it being worse than pre-8.0 is that the > bgwriter is doing fsync of all open files rather than using sync. Other > than that, I think it should behave the same, or slightly better, > right? It's possible that there exist platforms on which this is a loss --- that is, the OS's handling of fsync is so inefficient that multiple fsync calls are worse than one sync call even though less I/O is forced. But I haven't seen any actual evidence of that; and if such platforms do exist I'm not sure I'd blink anyway. We are not required to optimize for brain-dead kernels. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Maybe we need a hybrid approach: clean a few percent of the LRU end of > the ARC list in order to keep backends from blocking on writes, plus run > a clock scan to keep checkpoints from having to do much. Well if you just keep note of when the last clock scan started then when you get to the end of the list you've _done_ a checkpoint. Put another way, we already have such a clock scan, it's called checkpoint. You could have checkpoint delay between each page write long enough to spread the checkpoint i/o out over a configurable amount of time -- say half the checkpoint interval -- and be done with that side of things. -- greg
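The pacing Greg describes is a simple division. A minimal sketch, assuming the number of pages to write is known up front; `checkpoint_write_delay_us` and its parameters are hypothetical names for illustration, not existing PostgreSQL settings.

```c
#include <assert.h>

/*
 * Microseconds to sleep between page writes so that `npages` writes are
 * spread over `spread_percent` of the checkpoint interval.
 */
unsigned long long
checkpoint_write_delay_us(unsigned interval_s, unsigned spread_percent,
                          unsigned long long npages)
{
    assert(npages > 0 && spread_percent <= 100);

    unsigned long long budget_us =
        (unsigned long long) interval_s * 1000000ULL * spread_percent / 100;

    return budget_us / npages;
}
```

With a 300 second interval, a 50% spread, and 1000 dirty pages, this yields 150000 microseconds between writes. The sketch ignores the time the writes themselves take, so the real spread would be somewhat longer than the budget.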
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > So what are we doing for 8.0?
>
> Well, it looks like RC2 has already crashed and burned --- I can't
> imagine that Marc will let us release without an RC3 given what was
> committed today, never mind the btree bug that Mark Wong seems to have
> found. So maybe we should just bite the bullet and do something real
> about this.
>
> I'm willing to code up a proposed patch for the two-track idea I
> suggested, and if anyone else has a favorite maybe they could write
> something too. But do we have the resources to test such patches and
> make a decision in the next few days?
>
> At the moment my inclination is to sit on what we have. I've not seen
> any indication that 8.0 is really worse than earlier releases; the most
> you could argue against it is that it's not as much better as we hoped.
> That's not grounds to muck around at the RC3 stage.

I remember the other difference between 8.0 and pre-8.0. When a backend has to write a block in 8.0, it does a write _plus_ fsync(), while in pre-8.0 it did only a write. There was a proposal to pass backend write information to the background writer so it would know to fsync at checkpoint, but it was decided that backend writing would be rare. I think we have to rethink that assumption.

-- Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I remember the other difference between 8.0 and pre-8.0. When a backend > has to write a block in 8.0, it does a write _plus_ fsync(), while in > pre-8.0 it did only a write. There was a proposal to pass backend write > information to the background writer so it would know to fsync at > checkpoint, but it was decided that backend writing would be rare. I > think we have to rethink that assumption. No, just read the code. The above assertions are all wet. regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I remember the other difference between 8.0 and pre-8.0. When a backend
> > has to write a block in 8.0, it does a write _plus_ fsync(), while in
> > pre-8.0 it did only a write. There was a proposal to pass backend write
> > information to the background writer so it would know to fsync at
> > checkpoint, but it was decided that backend writing would be rare. I
> > think we have to rethink that assumption.
>
> No, just read the code. The above assertions are all wet.

Oh, I forgot you added that array to pass fsync info. Shouldn't we send a log message when the array gets full in md.c:

    {
        if (ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
            return true;
    }

    if (FileSync(seg->mdfd_vfd) < 0)
        return false;

Seems that could fill up quickly. I see no checking for existing matching records in the array.

-- Bruce Momjian
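To illustrate the two concerns raised here (no duplicate check, and silent fallback when the array is full), this is a small sketch of a bounded request queue. It is an assumption-laden toy: `FsyncRequest`, `FsyncQueue`, and `queue_fsync_request` are hypothetical names, the capacity is arbitrary, and nothing here reflects how ForwardFsyncRequest is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4    /* tiny on purpose; purely illustrative */

/* Hypothetical request: identifies one segment of one relation. */
typedef struct
{
    unsigned rel;
    unsigned segno;
} FsyncRequest;

typedef struct
{
    FsyncRequest items[QUEUE_CAPACITY];
    size_t       count;
} FsyncQueue;

/*
 * Returns true if the request is (or already was) queued, false if the
 * queue is full and the caller must fsync the file itself.
 */
bool
queue_fsync_request(FsyncQueue *q, unsigned rel, unsigned segno)
{
    assert(q->count <= QUEUE_CAPACITY);

    for (size_t i = 0; i < q->count; i++)
    {
        if (q->items[i].rel == rel && q->items[i].segno == segno)
            return true;        /* duplicate: nothing new to remember */
    }
    if (q->count == QUEUE_CAPACITY)
        return false;           /* full: a natural place to emit a log message */

    q->items[q->count].rel = rel;
    q->items[q->count].segno = segno;
    q->count++;
    return true;
}
```

The linear duplicate scan keeps the sketch short but costs O(n) per request; whether that trade is worthwhile compared with letting duplicates consume slots would depend on the queue size and write pattern.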
On Wed, 2004-12-22 at 04:43, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > So what are we doing for 8.0? > > Well, it looks like RC2 has already crashed and burned --- I can't > imagine that Marc will let us release without an RC3 given what was > committed today, never mind the btree bug that Mark Wong seems to have > found. So maybe we should just bite the bullet and do something real > about this. > > I'm willing to code up a proposed patch for the two-track idea I > suggested, and if anyone else has a favorite maybe they could write > something too. But do we have the resources to test such patches and > make a decision in the next few days? > > At the moment my inclination is to sit on what we have. I've not seen > any indication that 8.0 is really worse than earlier releases; the most > you could argue against it is that it's not as much better as we hoped. > That's not grounds to muck around at the RC3 stage. Agreed, if somewhat reluctantly. We may have the time to test, but it is clear that we do not have the time to validate those tests, then discuss and agree on the results. Time to go with what we have. [Mark's possible bug seems a higher priority for me.] -- Best Regards, Simon Riggs
On Mon, Dec 20, 2004 at 11:20:46PM -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Exactly. But 1% would be uselessly small with this definition. Offhand > >> I'd think something like 50% might be a starting point; maybe even more. > >> What that says is that a page isn't a candidate to be written out by the > >> bgwriter until it's fallen halfway down the LRU list. > > > So we are not scanning by buffer address but using the LRU list? Are we > > sure they are mostly dirty? > > No. The entire point is to keep the LRU end of the list mostly clean. > > Now that you mention it, it might be interesting to try the approach of > doing a clock scan on the buffer array and ignoring the ARC lists > entirely. That would be a fundamentally different way of envisioning > what the bgwriter is supposed to do, though. I think the main reason > Jan didn't try that was he wanted to be sure the LRU page was usually > clean so that backends would seldom end up doing writes for themselves > when they needed to get a free buffer. > > Maybe we need a hybrid approach: clean a few percent of the LRU end of > the ARC list in order to keep backends from blocking on writes, plus run > a clock scan to keep checkpoints from having to do much. But that's way > beyond what we have time for in the 8.0 cycle. > > regards, tom lane > I have not had a chance to investigate, but there is a modification of the ARC cache strategy called CAR that replaces the LRU linked lists with the clock approximation to the LRU lists. This algorithm is virtually identical to the current ARC but reduces the contention at the MRU end of the lists. This may dovetail nicely into your idea of a "clock" bgwriter functionality as well as help with the cache-line performance problem. Yours, Ken Marshall
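The "clock scan on the buffer array" that Tom describes can be sketched as a toy model: a hand walks the array in a circle, each round scans a bounded number of buffers, and the hand's position is remembered between rounds. This is an illustration only; the class `ClockSweepWriter` and the `max_scan` parameter are hypothetical, and it ignores the ARC lists, locking, and real I/O.

```python
from typing import List


class ClockSweepWriter:
    """Toy model: dirty[i] is True when buffer i still needs writing."""

    def __init__(self, dirty: List[bool]) -> None:
        self.dirty = dirty
        self.hand = 0  # index where the next round resumes

    def run_round(self, max_scan: int) -> List[int]:
        """Scan up to max_scan buffers from the hand and clean the dirty ones.

        Returns the indexes of the buffers "written" in this round.
        """
        written: List[int] = []
        n = len(self.dirty)
        for _ in range(min(max_scan, n)):
            if self.dirty[self.hand]:
                self.dirty[self.hand] = False  # stands in for a write()
                written.append(self.hand)
            self.hand = (self.hand + 1) % n  # wrap around like a clock
        return written
```

Because the hand keeps its place, the cost of a round depends on `max_scan` and not on the size of the buffer array, which is the property the thread is after when it complains about scanning the whole cache.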
Greg Stark wrote: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > > > Maybe we need a hybrid approach: clean a few percent of the LRU end of > > the ARC list in order to keep backends from blocking on writes, plus run > > a clock scan to keep checkpoints from having to do much. > > Well if you just keep note of when the last clock scan started then when you > get to the end of the list you've _done_ a checkpoint. > > Put another way, we already have such a clock scan, it's called checkpoint. > You could have checkpoint delay between each page write long enough to spread > the checkpoint i/o out over a configurable amount of time -- say half the > checkpoint interval -- and be done with that side of things. But don't you have to keep the WAL files around longer then. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Greg Stark wrote: >> Put another way, we already have such a clock scan, it's called checkpoint. >> You could have checkpoint delay between each page write long enough to spread >> the checkpoint i/o out over a configurable amount of time -- say half the >> checkpoint interval -- and be done with that side of things. > But don't you have to keep the WAL files around longer then. Yeah, but do you care? It seems like what Greg is suggesting is a "checkpoint slowdown" knob comparable to the "vacuum slowdown" feature that Jan added for 8.0. It strikes me as not necessarily a bad idea. Suppose that you run a checkpoint every 5 minutes, and with the knob you slow down the checkpoint to extend over say 3 minutes on average, rather than the normal blast-it-out-as-fast-as-possible. Then you'll be keeping an average of 8 minutes worth of WAL files instead of 5. Not exactly a killer objection. Shutdown checkpoints would still need to go as fast as possible, so we might need two separate code paths; or maybe we could just change the delay setting locally during a shutdown. One issue is that while we can regulate the rate at which we issue write()s, we still have to issue fsync()s at the end, and we can't control what happens in response to those. It's quite possible that all the I/O would happen in response to the fsync()s anyway, in which case the whole exercise would be a waste of time. regards, tom lane
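Tom's WAL estimate is simple addition: the WAL that must be kept covers one checkpoint interval plus the time the slowed-down checkpoint takes to finish. A minimal sketch of that arithmetic, with a hypothetical function name and no claim about how any real server accounts for WAL segments:

```python
def wal_minutes_retained(checkpoint_interval_min: float,
                         checkpoint_duration_min: float) -> float:
    """Additive estimate from the message above: interval plus checkpoint duration."""
    if checkpoint_interval_min <= 0 or checkpoint_duration_min < 0:
        raise ValueError("interval must be positive and duration non-negative")
    return checkpoint_interval_min + checkpoint_duration_min
```

For the figures in the message, `wal_minutes_retained(5, 3)` gives 8, compared with 5 when the checkpoint is treated as instantaneous.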
Tom Lane <tgl@sss.pgh.pa.us> writes: > Suppose that you run a checkpoint every 5 minutes, and with the knob > you slow down the checkpoint to extend over say 3 minutes on average, > rather than the normal blast-it-out-as-fast-as-possible. Then you'll > be keeping an average of 8 minutes worth of WAL files instead of 5. > Not exactly a killer objection. Right. I was thinking that the goal would be to spread the checkpoint out over exactly the checkpoint interval, minus some safety factor. So if it has some estimate of the total number of dirty buffers that need flushing it could just divide the checkpoint interval by that and calculate the delay needed to finish in some fraction of the checkpoint interval, 60% seems like a reasonable guess. > One issue is that while we can regulate the rate at which we issue > write()s, we still have to issue fsync()s at the end, and we can't > control what happens in response to those. It's quite possible that > all the I/O would happen in response to the fsync()s anyway, in which > case the whole exercise would be a waste of time. Well you could fsync earlier as well, say just before whenever you sleep. Obviously the delay on the checkpoint process doesn't matter to performance if it's about to sleep. It could end up scheduling i/o earlier than necessary and cause redundant seeks but then I guess that's an inherent tension between trying to spread out the i/o evenly and trying to get the ideal ordering of i/o. -- greg
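Greg's pacing rule can be written down directly: take the fraction of the checkpoint interval you are willing to spend, and divide it by the estimated number of dirty buffers to get the sleep between writes. The sketch below assumes a hypothetical function name and treats the 60% figure as a default argument; it says nothing about the fsync timing question raised in the same message.

```python
def delay_between_writes(checkpoint_interval_s: float,
                         dirty_buffers: int,
                         target_fraction: float = 0.6) -> float:
    """Seconds to sleep after each page write so the checkpoint finishes
    in roughly target_fraction of the checkpoint interval."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    if dirty_buffers <= 0:
        return 0.0  # nothing to write, so nothing to pace
    return checkpoint_interval_s * target_fraction / dirty_buffers
```

As an example, a 300-second interval with an estimated 1800 dirty buffers and the 60% target works out to about 0.1 seconds between writes.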
Simon Riggs wrote:
> On Wed, 2004-12-22 at 04:43, Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > So what are we doing for 8.0?
> >
> > Well, it looks like RC2 has already crashed and burned --- I can't
> > imagine that Marc will let us release without an RC3 given what was
> > committed today, never mind the btree bug that Mark Wong seems to have
> > found.  So maybe we should just bite the bullet and do something real
> > about this.
> >
> > I'm willing to code up a proposed patch for the two-track idea I
> > suggested, and if anyone else has a favorite maybe they could write
> > something too.  But do we have the resources to test such patches and
> > make a decision in the next few days?
> >
> > At the moment my inclination is to sit on what we have.  I've not seen
> > any indication that 8.0 is really worse than earlier releases; the most
> > you could argue against it is that it's not as much better as we hoped.
> > That's not grounds to muck around at the RC3 stage.
>
> Agreed, if somewhat reluctantly.
>
> We may have the time to test, but it is clear that we do not have the
> time to validate those tests, then discuss and agree on the results.
>
> Time to go with what we have.

I ran some tests last week and can report results similar on Tom's test:

        pgbench -i -s 10 bench
        pgbench -c 10 -t 10000 bench

The tests were on a machine with a single SCSI drive that doesn't lie
about fsync.  I found 7.4.X got around 75tps while 8.0 got 100tps, very
similar to the 65/107 numbers Tom had.

First, I am confused why we have such a large improvement in 8.0.  Does
anyone know?  This is a pretty long test so a 33-50% increase is a big
jump.

Second, I added a little code to my local copy to check if the
pendingOpsTable overflows and register_dirty_segment() must have a local
backend do an fsync().  I found one pgbench test had 54 local fsyncs,
but the next test had none, and 54 isn't a very large number.  Should we
emit a server log message when this happens so users can reduce the
bgwriter delay?

It seems having the backend do the writes is not so bad (same as 7.4.X)
and our only big problem with the current bgwriter is the inability to
reduce checkpoint load for busy servers.

Should we consider at least adjusting the meaning of bgwriter_percent?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> I ran some tests last week and can report results similar on Tom's test: > > pgbench -i -s 10 bench > pgbench -c 10 -t 10000 bench > > The tests were on a machine with a single SCSI drive that doesn't lie > about fsync. I found 7.4.X got around 75tps while 8.0 got 100tps, very > similar to the 65/107 numbers Tom had. You do realize, that pgbench result comparisons are about as useful as a fork for eating soup? On another note, how do you know for sure, that your drive does not lie about fsync? Did you run the tests with fsync turned off vs fsync on? > First, I am confused why we have such a large improvement in 8.0. Does > anyone know? This is a pretty long test so a 33-50% increase is a big > jump. bgwriter is responsible I imagine,... I experienced the same improvement in an early 7.5, just after the bgwriter was added. (tho my results was about 4-5 times higher in terms of tps rates, hehe) ... John
Greg Stark wrote:
>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Suppose that you run a checkpoint every 5 minutes, and with the knob
> > you slow down the checkpoint to extend over say 3 minutes on average,
> > rather than the normal blast-it-out-as-fast-as-possible.  Then you'll
> > be keeping an average of 8 minutes worth of WAL files instead of 5.
> > Not exactly a killer objection.
>
> Right. I was thinking that the goal would be to spread the checkpoint out over
> exactly the checkpoint interval, minus some safety factor. So if it has some
> estimate of the total number of dirty buffers that need flushing it could just
> divide the checkpoint interval by that and calculate the delay needed to
> finish in some fraction of the checkpoint interval, 60% seems like a
> reasonable guess.
>
> > One issue is that while we can regulate the rate at which we issue
> > write()s, we still have to issue fsync()s at the end, and we can't
> > control what happens in response to those.  It's quite possible that
> > all the I/O would happen in response to the fsync()s anyway, in which
> > case the whole exercise would be a waste of time.
>
> Well you could fsync earlier as well, say just before whenever you sleep.
> Obviously the delay on the checkpoint process doesn't matter to performance if
> it's about to sleep. It could end up scheduling i/o earlier than necessary and
> cause redundant seeks but then I guess that's an inherent tension between
> trying to spread out the i/o evenly and trying to get the ideal ordering of
> i/o.

It certainly is an interesting idea to have the checkpoint span a longer
time period.  We couldn't do that with sync, but now that we fsync each
file it is possible.

It would be easy to do this if we didn't also need the fsync.  The
original idea was that we would write() the dirty buffers long before
the checkpoint, and the kernel would write many of these dirty buffers
before we got to checkpoint time.  We could go with the checkpoint clock
sweep idea, but then we would not just be issuing write()s; we would be
doing write/fsync a lot more often.  I can't think of a way this would
be a win.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Added to TODO:

* Improve the background writer

  Allow the background writer to more efficiently write dirty buffers
  from the end of the LRU cache and use a clock sweep algorithm to
  write other dirty buffers to reduce checkpoint I/O

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
John Hansen wrote: > > I ran some tests last week and can report results similar on Tom's test: > > > > pgbench -i -s 10 bench > > pgbench -c 10 -t 10000 bench > > > > The tests were on a machine with a single SCSI drive that doesn't lie > > about fsync. I found 7.4.X got around 75tps while 8.0 got 100tps, very > > similar to the 65/107 numbers Tom had. > > You do realize, that pgbench result comparisons are about as useful as a > fork for eating soup? > > On another note, how do you know for sure, that your drive does not lie > about fsync? > > Did you run the tests with fsync turned off vs fsync on? I just tried and got 115tps with fsync off vs 100 with fsync on, so fsync is certainly doing something. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > John Hansen wrote: >> On another note, how do you know for sure, that your drive does not lie >> about fsync? > I just tried and got 115tps with fsync off vs 100 with fsync on, so > fsync is certainly doing something. [ raised eyebrow... ] Something is wrong with that. I'd expect a *much* higher difference. It's difficult to credit a tps rate higher than your disk's RPM rating with fsync on, but most modern CPUs can do a lot better than that with fsync off. If you have a 7200 RPM drive then I'd believe the 100 figure, but not the other ... regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > John Hansen wrote:
> >> On another note, how do you know for sure, that your drive does not lie
> >> about fsync?
>
> > I just tried and got 115tps with fsync off vs 100 with fsync on, so
> > fsync is certainly doing something.
>
> [ raised eyebrow... ]  Something is wrong with that.  I'd expect a
> *much* higher difference.  It's difficult to credit a tps rate higher
> than your disk's RPM rating with fsync on, but most modern CPUs can do
> a lot better than that with fsync off.  If you have a 7200 RPM drive
> then I'd believe the 100 figure, but not the other ...

I think it is a 10k RPM drive, Seagate Cheetah ST336607LW.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
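One way to read Tom's rule of thumb (this is an interpretation, not something the thread spells out) is that a commit which truly waits for the disk cannot complete faster than about once per platter rotation, so the rotation rate per second bounds the commit rate of a single stream of commits. The sketch below uses a hypothetical function name and ignores anything that lets several commits share one rotation.

```python
def max_synced_commits_per_second(rpm: float) -> float:
    """Rough ceiling: one fully synced commit per disk rotation."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return rpm / 60.0  # rotations per minute -> rotations per second
```

A 7200 RPM drive gives 120 rotations per second, which is why the 100 tps figure looks plausible with fsync on; a 10k RPM drive gives roughly 167.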
On Tue, 2004-12-28 at 07:23, John Hansen wrote:
> > I ran some tests last week and can report results similar on Tom's test:
> >
> > pgbench -i -s 10 bench
> > pgbench -c 10 -t 10000 bench
> >
> > The tests were on a machine with a single SCSI drive that doesn't lie
> > about fsync.  I found 7.4.X got around 75tps while 8.0 got 100tps, very
> > similar to the 65/107 numbers Tom had.
>
> You do realize, that pgbench result comparisons are about as useful as a
> fork for eating soup?

I'd have to agree. I find it hard to get comparable results on my test
server, let alone discuss other people's findings.

The only tests I have reasonable faith in these days are those performed
to a rigorous test method, which is also published, visible and
challengeable. OSDL is the nearest thing we have to that.

-- 
Best Regards, Simon Riggs
> > > I ran some tests last week and can report results similar on Tom's test: > > > > > > pgbench -i -s 10 bench > > > pgbench -c 10 -t 10000 bench > > > don't you have to specify the scaling factor for the benchmark as well? as in pgbench -c 10 -t 10000 -s 10 bench ? > I just tried and got 115tps with fsync off vs 100 with fsync on, so > fsync is certainly doing something. well, I usually get results that differ by that much from run to run. Probably you ran in to more checkpoints on the second test. Also, did you reinitialize the bench database with pgbench -i ? ... John
John Hansen wrote: > > > > I ran some tests last week and can report results similar on Tom's test: > > > > > > > > pgbench -i -s 10 bench > > > > pgbench -c 10 -t 10000 bench > > > > > > don't you have to specify the scaling factor for the benchmark as well? > as in pgbench -c 10 -t 10000 -s 10 bench ? > > > I just tried and got 115tps with fsync off vs 100 with fsync on, so > > fsync is certainly doing something. > > well, I usually get results that differ by that much from run to run. > Probably you ran in to more checkpoints on the second test. > > Also, did you reinitialize the bench database with pgbench -i ? I destroyed the database and recreated it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: >>well, I usually get results that differ by that much from run to run. >>Probably you ran in to more checkpoints on the second test. >> >>Also, did you reinitialize the bench database with pgbench -i ? >> >> > >I destroyed the database and recreated it. > > The only way I managed to control the variability in Pgbench was to *reboot the machine* and recreate the database for each test. In addition it seems that using a larger scale factor (e.g 200) helped as well. Having said that, on FreeBSD 5.3 with hw.ata.wc=0 (i.e no write cache) my results for s=200, t=10000 and c=4 were 49 (+/- 0.5) tps for both 7.4.6 and 8.0.0RC1 - no measurable difference. If I reduced the number of transactions to t=1000, then 7.4.6 jumped ahead by about 10 tps. Bruce - are you able to try s=200? It would be interesting to see what your setup does. regards Mark
[I know I'm late and this has already been discussed by Richard, Tom,
et al., but ...]

On Tue, 21 Dec 2004 16:17:17 -0600, "Jim C. Nasby" <decibel@decibel.org> wrote:
>look at where the last page you wrote out has ended up in the LRU list
>since you last ran, and start scanning from there (by definition
>everything after that page would have to be clean).

This is a bit oversimplified, because that page will be moved to the
start of the list when it is accessed the next time.

    A = B = C = D = E = F = G = H = I = J = K = L = m = n = o = p = q
                                                    ^
would become

    M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
    ^

(a-z ... known to be clean, A-Z ... possibly dirty)

But with a bit of cooperation from the backends this could be made to
work.  Whenever a backend takes the page which is the start of the clean
tail out of the list (most probably to insert it into another list or to
re-insert it at the start of the same list) the clean tail pointer is
advanced to the next list element, if any.  So we would get

    M = A = B = C = D = E = F = G = H = I = J = K = L = n = o = p = q
                                                        ^

As a little improvement the clean tail could be prevented from shrinking
unnecessarily fast by moving the pointer to the previous list element if
this is found to be clean:

    M = A = B = C = D = E = F = G = H = I = J = K = l = n = o = p = q
                                                    ^

Maybe this approach could serve both goals, (1) keeping a number of
clean pages at the LRU end of the list and (2) writing out other dirty
pages if there's not much to do near the end of the list.

But ...

On Tue, 21 Dec 2004 10:26:48 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Also, the cntxDirty mechanism allows a block to be dirtied without
>changing the ARC state at all.

... which might kill this proposal anyway.

Servus
 Manfred
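Manfred's clean-tail bookkeeping can be modelled in a few lines. The sketch below is a toy, with a hypothetical class `LruList` and a single-step version of the "move back if the previous page is clean" refinement; it keeps page names fixed (so the touched page stays `m` even after it becomes possibly dirty) and does not model the cntxDirty problem mentioned at the end of the message.

```python
from typing import Dict, List, Optional


class LruList:
    """Toy LRU list; index 0 is the most recently used page."""

    def __init__(self, pages: List[str], dirty: Dict[str, bool],
                 tail_start: Optional[str]) -> None:
        self.pages = list(pages)
        self.dirty = dict(dirty)
        self.tail_start = tail_start  # first page of the known-clean tail

    def touch(self, page: str, makes_dirty: bool = True) -> None:
        """A backend accesses `page`, which moves it to the front of the list."""
        idx = self.pages.index(page)
        if page == self.tail_start:
            # Advance the pointer to the next list element, if any.
            has_next = idx + 1 < len(self.pages)
            self.tail_start = self.pages[idx + 1] if has_next else None
            # Refinement: step back once if the previous page is clean.
            if idx > 0 and not self.dirty[self.pages[idx - 1]]:
                self.tail_start = self.pages[idx - 1]
        self.pages.pop(idx)
        self.pages.insert(0, page)
        if makes_dirty:
            self.dirty[page] = True

    def clean_tail(self) -> List[str]:
        """Pages from the pointer to the LRU end of the list."""
        if self.tail_start is None:
            return []
        return self.pages[self.pages.index(self.tail_start):]
```

Touching a page that is not the start of the clean tail leaves the pointer alone in this model, which matches the description only as far as the message goes; what should happen when a page in the middle of the tail is dirtied is left open there as well.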
On Mon, 2004-12-27 at 22:21, Bruce Momjian wrote:
> Should we consider at least adjusting the meaning of bgwriter_percent?

Yes. As things stand, this is the only change that seems safe.

Here's a very short patch that implements this change within BufferSync
in bufmgr.c

- No algorithm changes
- No error message changes
- The only change is that the call to StrategyDirtyBufferList is made
  using the maximum number of buffers that will be cleaned, rather than
  uselessly trawling through all of shared_buffers

This changes the meaning of bgwriter_percent from "percent of dirty
buffers" to "percent of shared_buffers". The default settings of 1% of
1000 buffers give up to 10 dirty block writes every 250ms

Benefit: allows performance tuning by increasing the options for setting
bgwriter_delay, which would otherwise have an ineffectually high minimum
setting

Risk: low

1-line doc patch to follow, if this is approved.

-- 
Best Regards, Simon Riggs
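The arithmetic behind "1% of 1000 buffers gives up to 10 dirty block writes every 250ms" can be checked with a short sketch. The function names are hypothetical, rounding up is an assumption (the message does not say how fractions are handled), and any interaction with bgwriter_maxpages is deliberately left out.

```python
def pages_per_round(shared_buffers: int, bgwriter_percent: int) -> int:
    """Upper bound on buffers cleaned per round when the percentage applies
    to all of shared_buffers. Rounding up is an assumption of this sketch."""
    return (shared_buffers * bgwriter_percent + 99) // 100


def max_writes_per_second(pages: int, bgwriter_delay_ms: int) -> float:
    """Upper bound on block writes per second for a given round size and delay."""
    if bgwriter_delay_ms <= 0:
        raise ValueError("bgwriter_delay_ms must be positive")
    return pages * 1000.0 / bgwriter_delay_ms
```

With the quoted defaults this gives 10 pages per round and at most 40 writes per second; halving the delay doubles the ceiling without making each round more expensive, which is the tuning freedom the patch is meant to restore.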
Simon Riggs wrote: > On Mon, 2004-12-27 at 22:21, Bruce Momjian wrote: > > Should we consider at least adjusting the meaning of bgwriter_percent? > > Yes. As things stand, this is the only change that seems safe. > > Here's a very short patch that implements this change within BufferSync > in bufmgr.c > > - No algorithm changes > - No error message changes > - Only change is the call to StrategyDirtyBufferList is made using the > maximum number of buffers that will be cleaned, rather than uselessly > trawling through all of shared_buffers > > This changes the meaning of bgwriter_percent from "percent of dirty > buffers" to "percent of shared_buffers". The default settings of 1% of > 1000 buffers gives up to 10 dirty block writes every 250ms > > Benefit: allows performance tuning by increases options for setting > bgwriter_delay which would otherwise have an ineffectually high minimum > setting > > Risk: low > > 1-line doc patch to follow, if this is approved. I am not objecting to the patch, but what value is there in having both bgwriter_percent and bgwriter_maxpages? Seems both are redundant and that one would be enough. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Fri, 2004-12-31 at 01:14, Bruce Momjian wrote: > Simon Riggs wrote: > > On Mon, 2004-12-27 at 22:21, Bruce Momjian wrote: > > > Should we consider at least adjusting the meaning of bgwriter_percent? > > > > Yes. As things stand, this is the only change that seems safe. > > > > Here's a very short patch that implements this change within BufferSync > > in bufmgr.c > > > > - No algorithm changes > > - No error message changes > > - Only change is the call to StrategyDirtyBufferList is made using the > > maximum number of buffers that will be cleaned, rather than uselessly > > trawling through all of shared_buffers > > > > This changes the meaning of bgwriter_percent from "percent of dirty > > buffers" to "percent of shared_buffers". The default settings of 1% of > > 1000 buffers gives up to 10 dirty block writes every 250ms > > > > Benefit: allows performance tuning by increases options for setting > > bgwriter_delay which would otherwise have an ineffectually high minimum > > setting > > > > Risk: low > > > > 1-line doc patch to follow, if this is approved. > > I am not objecting to the patch, but what value is there in having both > bgwriter_percent and bgwriter_maxpages? Seems both are redundant and > that one would be enough. In brief: i) for now: as little change as possible is good ii) the two parameters are OK iii) trying to decide an alternative takes time, which we do not have iv) what is presented here is simply a performance bug fix, not the best long term alternative... I'd like to move quickly: if we do this (or an alternative), it has to be done soon and it would be easy to discuss this until we run out of time. Could we vote: in RC3, or not? In more detail... The value of having both is: i) as little change as possible at this stage of RC - the main one ...which gives us stability ...and also avoids having to re-discuss what they *should* be ii) Having two isn't that bad. 
bgwriter_percent auto adjusts the length of the to-be-cleaned-list, so it is roughly useful anywhere between 500 and 10000 shared_buffers. That is IMHO slightly more useful than a hard definition set via bgwriter_maxpages, since that is likely to be set wrong anyway - but has some value as an outside limit on the number of pages. [You may wish to set shared_buffers > 10000 even on smaller servers, since many now have 2GB RAM and yet a relatively poor I/O subsystem. Having maxpages set separately allows the majority of people to set shared_buffers higher without swamping their I/O subsystems because they didn't know about the r8.0 bgwriter feature/parameters] iii) changing the parameters might tempt us towards changing the algorithm, which is not a topic we have reached agreement on iv) I see it as a goal to remove all of those parameters anyway, as well as explore some of the many options and ideas everybody has presented, so further change is likely at the next release whatever is done now. The patch is as simple as I can make it and yet remove the unnecessary performance effect in the existing code. Thanks to Neil and others for showing that this was possible...I see this patch as a team effort. I've already spoken against larger change and would do so again now: if we don't agree this change, then I would vote for no-change.... simply because this patch is minimal change. We *suspect* further change is beneficial but we have no evidence to support what that change should be, amongst the large range of possible solutions proposed. -- Best Regards, Simon Riggs
This change isn't going to make it for RC3, and it is probably not
something we want to rush.

I think there are a few issues involved:

	o  everyone agrees the current meaning of bgwriter_percent is
	   useless (percent of dirty buffers)
	o  removal of bgwriter_percent will cause problems because
	   postgresql.conf is only installed via initdb, so beta users
	   will have to have some workaround so their existing
	   postgresql.conf files work.
	o  bgwriter_percent and bgwriter_maxpages are duplicate for a
	   given number of buffers and it isn't clear which one takes
	   precedence.
	o  8.1 might use these variables with different meanings,
	   causing slight upgrade confusion.
	o  Another idea is for bgwriter_percent to control how much of
	   the buffer is scanned.

Tom feels bgwriter_maxpages is good because it allows the user to
specify the I/O traffic, while bgwriter_percent as total pages (not just
dirty ones) is perhaps easier to set a default (I/O load varies based on
buffer cache size) and perhaps easier to understand.

I am not sure what to suggest at this point but whatever solution we use
should take the above issues into account.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Sat, 2005-01-01 at 06:20, Bruce Momjian wrote:
> This change isn't going to make it for RC3, and it is probably not
> something we want to rush.

OK. Thank you.

> I think there are a few issues involved:
>
>   o everyone agrees the current meaning of bgwriter_percent is
>     useless (percent of dirty buffers)
>   o removal of bgwriter_percent will cause problems because
>     postgresql.conf is only installed via initdb, so beta users
>     will have to have some workaround so their existing
>     postgresql.conf files work.
>   o bgwriter_percent and bgwriter_maxpages are duplicate for a
>     given number of buffers and it isn't clear which one takes
>     precedence.
>   o 8.1 might use these variables with different meanings,
>     causing slight upgrade confusion.
>   o Another idea is for bgwriter_percent to control how much of
>     the buffer is scanned.

Agreed.

Would add as item #1: current behaviour of bgwriter causes sub-optimal
performance for 8.0, for systems with a high write workload, more CPUs
and higher shared_buffers.

> Tom feels bgwriter_maxpages is good because it allows the user to
> specify the I/O traffic, while bgwriter_percent as total pages (not just
> dirty ones) is perhaps easier to set a default (I/O load varies based on
> buffer cache size) and perhaps easier to understand.

Agreed.

> I am not sure what to suggest at this point but whatever solution we use
> should take the above issues into account.

Well, I think we're saying: it's not in 8.0 now, and we take our time to
consider patches for 8.1 and accept that the parameter names/meaning
will change in the next release.

The patch is there if that decision changes, but I'll say no more on it.

--
Best Regards, Simon Riggs
Simon Riggs wrote:
> Well, I think we're saying: it's not in 8.0 now, and we take our time to
> consider patches for 8.1 and accept the situation that the parameter
> names/meaning will change in next release.

I have no problem doing something for 8.0 if we can find something that
meets all the items I mentioned.

One idea would be to just remove bgwriter_percent. Beta/RC users would
still have it in their postgresql.conf, but it is commented out so it
should be OK. If they uncomment it their server would not start, but we
could just tell testers to remove it. I see that as better than having
conflicting parameters.

Another idea is to have bgwriter_percent be the percent of the buffer it
will scan. We could default that to 50% or 100%, but we then need to
make sure all beta/RC users update their postgresql.conf with the new
default because the commented-out default will not be correct.

At this point I see these as our only two viable options, aside from
doing nothing.

I realize our current behavior requires a full scan of the buffer cache,
but how often is the bgwriter_maxpages limit met? If it is not, a full
scan is done anyway, right? It seems the only way to really add
functionality is to change bgwriter_percent to control how much of the
buffer is scanned.

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us
On Sat, 2005-01-01 at 17:01, Bruce Momjian wrote:
> Simon Riggs wrote:
> > Well, I think we're saying: it's not in 8.0 now, and we take our time to
> > consider patches for 8.1 and accept the situation that the parameter
> > names/meaning will change in next release.
>
> I have no problem doing something for 8.0 if we can find something that
> meets all the items I mentioned.
>
> One idea would be to just remove bgwriter_percent. Beta/RC users would
> still have it in their postgresql.conf, but it is commented out so it
> should be OK. If they uncomment it their server would not start but we
> could just tell testers to remove it. I see that as better than having
> conflicting parameters.

Can't say I like that at first thought. I'll think some more though...

> Another idea is to have bgwriter_percent be the percent of the buffer it
> will scan.

Hmmm....well that was my original suggestion (bg2.patch on 12 Dec)
(...though with a bug, as Neil pointed out)

> We could default that to 50% or 100%, but we then need to
> make sure all beta/RC users update their postgresql.conf with the new
> default because the commented-out default will not be correct.

...we just differ/ed on what the default should be...

> At this point I see these as our only two viable options, aside from
> doing nothing.
>
> I realize our current behavior requires a full scan of the buffer cache,
> but how often is the bgwriter_maxpages limit met? If it is not, a full
> scan is done anyway, right?

Well, if you have a very heavy read workload then that would be a
problem. I was more worried about concurrency in a heavy write
situation, but I can see your point, and agree.

(Idea #1 still suffers from this, so we should rule it out...)

> It seems the only way to really add
> functionality is to change bgwriter_percent to control how much of the
> buffer is scanned.

OK. I think you've persuaded me on idea #2, if I understand you right:

bgwriter_percent = 50 (default)
bgwriter_maxpages = 100 (default)

bgwriter_percent is the percentage of shared_buffers we scan, limited by
bgwriter_maxpages.

(I'll code it up in a couple of hours when the kids are in bed)

--
Best Regards, Simon Riggs
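A minimal sketch of how "idea #2" reads, for anyone trying to picture it: scan at most bgwriter_percent of the buffer list and stop early once bgwriter_maxpages dirty buffers have been collected. This is an illustration only, not the submitted patch. The function name and the list-of-flags model of the buffer cache are assumptions, and the real code walks the buffer replacement lists rather than a Python list.

```python
def pick_buffers_to_clean(dirty_flags, bgwriter_percent, bgwriter_maxpages):
    """Return (indexes of buffers to write, number of buffers scanned).

    dirty_flags is a hypothetical stand-in for the buffer list, ordered
    from the end the bgwriter scans first; True means the buffer is dirty.
    """
    # bgwriter_percent bounds how much of the list is looked at.
    scan_limit = len(dirty_flags) * bgwriter_percent // 100
    to_write = []
    scanned = 0
    for scanned, is_dirty in enumerate(dirty_flags[:scan_limit], start=1):
        if is_dirty:
            to_write.append(scanned - 1)
            # bgwriter_maxpages bounds how many pages are written per round.
            if len(to_write) >= bgwriter_maxpages:
                break
    return to_write, scanned


# Write-heavy case: every third buffer dirty, so maxpages stops the scan early.
busy = [i % 3 == 0 for i in range(1000)]
pages, scanned = pick_buffers_to_clean(busy, 50, 100)
print(len(pages), scanned)

# Read-heavy case: few dirty buffers, so the percent limit ends the scan.
quiet = [i % 50 == 0 for i in range(1000)]
pages, scanned = pick_buffers_to_clean(quiet, 50, 100)
print(len(pages), scanned)
```

The second case is the one raised in the exchange above: when maxpages is never reached, only the percent limit keeps the scan from covering the whole cache.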
Bruce Momjian <pgman@candle.pha.pa.us> writes: > o everyone agrees the current meaning of bgwriter_percent is > useless (percent of dirty buffers) Oh? It's not useless by any means; it's a perfectly reasonable and useful definition that happens to be expensive to implement. One of the questions that is not answered to my satisfaction is what is an adequate substitute that doesn't lose needed functionality. > o bgwriter_percent and bgwriter_maxpages are duplicate for a > given number of buffers and it isn't clear which one takes > precedence. Not unless the current definition of bgwriter_percent is changed. Please try to make sure that your summaries reduce confusion instead of increasing it. regards, tom lane
On Sat, 2005-01-01 at 17:47, Simon Riggs wrote:
> OK. I think you've persuaded me on idea #2, if I understand you right:
>
> bgwriter_percent = 50 (default)
> bgwriter_maxpages = 100 (default)
>
> percent is the number of shared_buffers we scan, limited by maxpages.
>
> (I'll code it up in a couple of hours when the kids are in bed)

Here's the basic patch - no changes to current default values or docs.

Not sure if this is still interesting or not...

--
Best Regards, Simon Riggs
Attachment
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >   o everyone agrees the current meaning of bgwriter_percent is
> >     useless (percent of dirty buffers)
>
> Oh?
>
> It's not useless by any means; it's a perfectly reasonable and useful
> definition that happens to be expensive to implement. One of the
> questions that is not answered to my satisfaction is what is an adequate
> substitute that doesn't lose needed functionality.

I remembered this statement:

> I think there's a reasonable case to be made for redefining
> bgwriter_percent as the max percent of the total buffer list to scan
> (not the max percent of the list to return --- Jan correctly pointed out
> that the latter is useless). Then we could modify
> StrategyDirtyBufferList so that the percent and maxpages parameters are
> passed in, so it can stop as soon as either one is satisfied. This
> would be a fairly small/safe code change and I wouldn't have a problem
> doing it even at this late stage of the cycle.

Referenced here:

    http://archives.postgresql.org/pgsql-hackers/2004-12/msg00703.php

But I now see that Jan was objecting to the idea of the previous patch,
where bgwriter_percent is a percent of all buffers to return, which we
just discussed as redundant.

> >   o bgwriter_percent and bgwriter_maxpages are duplicate for a
> >     given number of buffers and it isn't clear which one takes
> >     precedence.
>
> Not unless the current definition of bgwriter_percent is changed.
>
> Please try to make sure that your summaries reduce confusion instead
> of increasing it.

OK, whatever. My point is that many have criticized the current behavior
of bgwriter_percent and I haven't heard anyone defend it, including Jan.

What bothers me is that we have known bgwriter needs tuning for months
and I am not sure we are any closer to improving it.

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us
OK, we have a submitted patch that attempts to improve bgwriter by
making bgwriter_percent control what percentage of the buffer is
scanned.

The patch still needs doc changes and a change to the default value but
at this point we need a vote on the patch. Is it:

    * too late for 8.0
    * not the right improvement
    * to be applied with doc/default additions

Comments?

---------------------------------------------------------------------------

Simon Riggs wrote:
> Here's the basic patch - no changes to current default values or docs.
>
> Not sure if this is still interesting or not...
>
> --
> Best Regards, Simon Riggs

[ Attachment, skipping... ]

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, we have a submitted patch that attempts to improve bgwriter by > making bgwriter_percent control what percentage of the buffer is > scanned. > The patch still needs doc changes and a change to the default value but > at this point we need a vote on the patch. Is it: > * too late for 8.0 > * not the right improvement > * to be applied with doc/default additions My vote: too late for 8.0. There is no hard evidence that this is a useful improvement, and no time for such evidence to be obtained. regards, tom lane
On Mon, 3 Jan 2005, Bruce Momjian wrote: > > OK, we have a submitted patch that attempts to improve bgwriter by > making bgwriter_percent control what percentage of the buffer is > scanned. > > The patch still needs doc changes and a change to the default value but > at this point we need a vote on the patch. Is it: > > * too late for 8.0 Too late by at least 3 RCs ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Mon, 2005-01-03 at 20:09, Bruce Momjian wrote: > OK, we have a submitted patch that attempts to improve bgwriter by > making bgwriter_percent control what percentage of the buffer is > scanned. > > The patch still needs doc changes and a change to the default value but > at this point we need a vote on the patch. Is it: > > * too late for 8.0 > * not the right improvement > * to be applied with doc/default additions > > Comments? > > --------------------------------------------------------------------------- > > Simon Riggs wrote: > > On Sat, 2005-01-01 at 17:47, Simon Riggs wrote: > > > On Sat, 2005-01-01 at 17:01, Bruce Momjian wrote: > > > > Simon Riggs wrote: > > > > > > > > > Well, I think we're saying: its not in 8.0 now, and we take our time to > > > > > consider patches for 8.1 and accept the situation that the parameter > > > > > names/meaning will change in next release. > > > > I hear veto ... so the above situation stands then: 8.1 it is. Not unhappy...I want this thing released as much as the next man... -- Best Regards, Simon Riggs
Simon Riggs wrote: > On Mon, 2005-01-03 at 20:09, Bruce Momjian wrote: > > OK, we have a submitted patch that attempts to improve bgwriter by > > making bgwriter_percent control what percentage of the buffer is > > scanned. > > > > The patch still needs doc changes and a change to the default value but > > at this point we need a vote on the patch. Is it: > > > > * too late for 8.0 > > * not the right improvement > > * to be applied with doc/default additions > > > > Comments? > > > > --------------------------------------------------------------------------- > > > > Simon Riggs wrote: > > > On Sat, 2005-01-01 at 17:47, Simon Riggs wrote: > > > > On Sat, 2005-01-01 at 17:01, Bruce Momjian wrote: > > > > > Simon Riggs wrote: > > > > > > > > > > > Well, I think we're saying: its not in 8.0 now, and we take our time to > > > > > > consider patches for 8.1 and accept the situation that the parameter > > > > > > names/meaning will change in next release. > > > > > > > I hear veto ... so the above situation stands then: 8.1 it is. > > Not unhappy...I want this thing released as much as the next man... Well, we went through the process and that's the best we can do. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, 2005-01-03 at 23:03, Bruce Momjian wrote: > Simon Riggs wrote: > > On Mon, 2005-01-03 at 20:09, Bruce Momjian wrote: > > > OK, we have a submitted patch that attempts to improve bgwriter by > > > making bgwriter_percent control what percentage of the buffer is > > > scanned. > > > > > > The patch still needs doc changes and a change to the default value but > > > at this point we need a vote on the patch. Is it: > > > > > > * too late for 8.0 > > > * not the right improvement > > > * to be applied with doc/default additions > > > > > > Comments? > > > > > > --------------------------------------------------------------------------- > > > > > > Simon Riggs wrote: > > > > On Sat, 2005-01-01 at 17:47, Simon Riggs wrote: > > > > > On Sat, 2005-01-01 at 17:01, Bruce Momjian wrote: > > > > > > Simon Riggs wrote: > > > > > > > > > > > > > Well, I think we're saying: its not in 8.0 now, and we take our time to > > > > > > > consider patches for 8.1 and accept the situation that the parameter > > > > > > > names/meaning will change in next release. > > > > > > > > > > I hear veto ... so the above situation stands then: 8.1 it is. > > > > Not unhappy...I want this thing released as much as the next man... > > Well, we went through the process and that's the best we can do. Here's my bgwriter instrumentation patch, which gives info that could allow the bgwriter settings to be tuned. -- Best Regards, Simon Riggs
Attachment
Simon Riggs wrote: > Here's my bgwriter instrumentation patch, which gives info that could > allow the bgwriter settings to be tuned. Uh, what does this do exactly? Add additional logging output? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
This has been saved for the 8.1 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches2

---------------------------------------------------------------------------

Simon Riggs wrote:
> Here's the basic patch - no changes to current default values or docs.
>
> Not sure if this is still interesting or not...
>
> --
> Best Regards, Simon Riggs

[ Attachment, skipping... ]

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us
On Mon, 2005-01-03 at 19:14 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > Here's my bgwriter instrumentation patch, which gives info that could
> > allow the bgwriter settings to be tuned.
>
> Uh, what does this do exactly? Add additional logging output?

Produces output like this...

DEBUG:ARC T1target= 45 B1len= 4954 T1len= 40 T2len= 4960 B2len= 46
DEBUG:ARC total = 98% B1hit= 0% T1hit= 0% T2hit= 98% B2hit= 0%
DEBUG:ARC buffer dirty misses= 22% (wasted= 0); cleaned= 4494

when you have debug_shared_buffers (= n) set
and you have server messages DEBUG1 available.

The last line of log output has been replaced by this version.

--
Best Regards, Simon Riggs
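If the aim is to use these lines for tuning, they are easy to pull apart with a script. The sketch below parses the three sample lines shown above into a dictionary. The function name and the regular expression are illustrative assumptions based only on that sample, and the spacing in a real server log may differ from what survives in this archive.

```python
import re

# Sample lines copied from the message above.
SAMPLE = """\
DEBUG:ARC T1target= 45 B1len= 4954 T1len= 40 T2len= 4960 B2len= 46
DEBUG:ARC total = 98% B1hit= 0% T1hit= 0% T2hit= 98% B2hit= 0%
DEBUG:ARC buffer dirty misses= 22% (wasted= 0); cleaned= 4494
"""

# A field is a name (letters, digits, spaces), then "=", then an integer.
FIELD = re.compile(r"([A-Za-z][A-Za-z0-9 ]*?)\s*=\s*(\d+)")


def parse_arc_lines(text):
    """Collect every name/value pair from lines starting with DEBUG:ARC."""
    stats = {}
    for line in text.splitlines():
        if not line.startswith("DEBUG:ARC"):
            continue
        body = line[len("DEBUG:ARC"):]
        for name, value in FIELD.findall(body):
            stats[name] = int(value)
    return stats


stats = parse_arc_lines(SAMPLE)
print(stats["T2hit"], stats["buffer dirty misses"], stats["cleaned"])
```

Percent signs are dropped, so percentage fields come back as plain integers alongside the counts.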
Do we want to add this additional log info to CVS for 8.0?

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2005-01-03 at 19:14 -0500, Bruce Momjian wrote:
> > Simon Riggs wrote:
> > > Here's my bgwriter instrumentation patch, which gives info that could
> > > allow the bgwriter settings to be tuned.
> >
> > Uh, what does this do exactly? Add additional logging output?
>
> Produces output like this...
>
> DEBUG:ARC T1target= 45 B1len= 4954 T1len= 40 T2len= 4960 B2len= 46
> DEBUG:ARC total = 98% B1hit= 0% T1hit= 0% T2hit= 98% B2hit= 0%
> DEBUG:ARC buffer dirty misses= 22% (wasted= 0); cleaned= 4494
>
> when you have debug_shared_buffers (= n) set
> and you have server messages DEBUG1 available.
>
> The last line of log output has been replaced by this version.
>
> --
> Best Regards, Simon Riggs

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us
On Fri, 7 Jan 2005, Bruce Momjian wrote: > > Do we want to add this additional log infor to CVS for 8.0? No, unless we're looking for an RC5? > > --------------------------------------------------------------------------- > > Simon Riggs wrote: >> On Mon, 2005-01-03 at 19:14 -0500, Bruce Momjian wrote: >>> Simon Riggs wrote: >>>> Here's my bgwriter instrumentation patch, which gives info that could >>>> allow the bgwriter settings to be tuned. >>> >>> Uh, what does this do exactly? Add additional logging output? >>> >> >> Produces output like this... >> >> DEBUG:ARC T1target= 45 B1len= 4954 T1len= 40 T2len= 4960 B2len= 46 >> DEBUG:ARC total = 98% B1hit= 0% T1hit= 0% T2hit= 98% B2hit= 0% >> DEBUG:ARC buffer dirty misses= 22% (wasted= 0); cleaned= 4494 >> >> when you have debug_shared_buffers (= n) set >> and you have server messages DEBUG1 available. >> >> The last line of log output has been replaced by this version. >> >> -- >> Best Regards, Simon Riggs >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >> > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
"Marc G. Fournier" <scrappy@postgresql.org> writes: > On Fri, 7 Jan 2005, Bruce Momjian wrote: >> Do we want to add this additional log infor to CVS for 8.0? > No, unless we're looking for an RC5? I vote no as well. While it's probably not a dangerous change, the need for it has not been demonstrated. regards, tom lane
Tom Lane wrote: > "Marc G. Fournier" <scrappy@postgresql.org> writes: > > On Fri, 7 Jan 2005, Bruce Momjian wrote: > >> Do we want to add this additional log infor to CVS for 8.0? > > > No, unless we're looking for an RC5? > > I vote no as well. While it's probably not a dangerous change, the need > for it has not been demonstrated. OK, Simon, would you email me a copy of the patch again privately so I can put it in the 8.1 queue. I seem to have lost the email. Thanks. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Later version of this patch added to the patch queue.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sat, 2005-01-01 at 17:47, Simon Riggs wrote:
> > On Sat, 2005-01-01 at 17:01, Bruce Momjian wrote:
> > > Simon Riggs wrote:
> > >
> > > > Well, I think we're saying: it's not in 8.0 now, and we take our time to
> > > > consider patches for 8.1 and accept the situation that the parameter
> > > > names/meaning will change in next release.
> > >
> > > I have no problem doing something for 8.0 if we can find something that
> > > meets all the items I mentioned.
> > >
> > > One idea would be to just remove bgwriter_percent. Beta/RC users would
> > > still have it in their postgresql.conf, but it is commented out so it
> > > should be OK. If they uncomment it their server would not start but we
> > > could just tell testers to remove it. I see that as better than having
> > > conflicting parameters.
> >
> > Can't say I like that at first thought. I'll think some more though...
> >
> > > Another idea is to have bgwriter_percent be the percent of the buffer it
> > > will scan.
> >
> > Hmmm....well that was my original suggestion (bg2.patch on 12 Dec)
> > (...though with a bug, as Neil pointed out)
> >
> > > We could default that to 50% or 100%, but we then need to
> > > make sure all beta/RC users update their postgresql.conf with the new
> > > default because the commented-out default will not be correct.
> >
> > ...we just differ/ed on what the default should be...
> >
> > > At this point I see these as our only two viable options, aside from
> > > doing nothing.
> >
> > > I realize our current behavior requires a full scan of the buffer cache,
> > > but how often is the bgwriter_maxpages limit met? If it is not, a full
> > > scan is done anyway, right?
> >
> > Well, if you have a very heavy read workload then that would be a
> > problem. I was more worried about concurrency in a heavy write
> > situation, but I can see your point, and agree.
> >
> > (Idea #1 still suffers from this, so we should rule it out...)
> >
> > > It seems the only way to really add
> > > functionality is to change bgwriter_percent to control how much of the
> > > buffer is scanned.
> >
> > OK. I think you've persuaded me on idea #2, if I understand you right:
> >
> > bgwriter_percent = 50 (default)
> > bgwriter_maxpages = 100 (default)
> >
> > percent is the percentage of shared_buffers we scan, limited by maxpages.
> >
> > (I'll code it up in a couple of hours when the kids are in bed)
>
> Here's the basic patch - no changes to current default values or docs.
>
> Not sure if this is still interesting or not...
>
> --
> Best Regards, Simon Riggs

[ Attachment, skipping... ]

--
Bruce Momjian
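
For readers following the thread, the idea #2 semantics quoted above (scan a
percentage of shared_buffers per round, stop early once maxpages buffers have
been written) can be modeled in a few lines of Python. This is only a sketch
under stated assumptions, not Simon's patch: the class name BgWriterSketch,
its fields, and the list-of-flags buffer model are all hypothetical, and the
resume position (next_buf) follows the suggestion in the opening message that
the bgwriter remember where it stopped scanning. The real buffer manager is
written in C and is organized differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BgWriterSketch:
    """Toy model of one bgwriter round (hypothetical, not PostgreSQL code)."""

    dirty: List[bool]     # one dirty flag per shared buffer
    percent: int = 50     # share of the buffer cache scanned per round
    maxpages: int = 100   # upper bound on buffers written per round
    next_buf: int = 0     # assumed: where the previous round stopped

    def run_once(self) -> int:
        """Scan up to `percent` of the buffers; return how many were written."""
        nbuffers = len(self.dirty)
        to_scan = nbuffers * self.percent // 100
        written = 0
        for _ in range(to_scan):
            if written >= self.maxpages:
                break  # write limit reached, stop before scanning further
            i = self.next_buf
            self.next_buf = (self.next_buf + 1) % nbuffers
            if self.dirty[i]:
                self.dirty[i] = False  # stands in for writing the page out
                written += 1
        return written
```

In this model, 1000 buffers with percent = 10 and one round per second cover
the whole cache in ten rounds, which matches the "every 10 seconds" arithmetic
from the opening message. When every buffer is dirty, maxpages rather than
percent ends the round.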