Thread: Performance features the 4th
I've just uploaded

    http://developer.postgresql.org/~wieck/all_performance.v4.74.diff.gz

This patch contains the "still not yet ready" performance improvements discussed over the last couple of days.

_Shared buffer replacement_:

The buffer replacement strategy is a slightly modified version of ARC. The modifications are some specializations about CDB promotions. Since PostgreSQL always looks up a buffer multiple times when updating (first during the scan, then during heap_update() etc.), every updated block would jump right into the T2 (frequently accessed) queue. To prevent that, the Xid of the transaction that added a buffer to the T1 queue is remembered, and if a block is found in T1, that same transaction will not promote it into T2. This also affects blocks accessed like SELECT ... FOR UPDATE; UPDATE, as this is a usual strategy and does not mean that this particular datum is accessed frequently.

Blocks faulted in by vacuum are handled specially: they end up at the LRU end of the T1 queue, and when evicted from there their CDB gets destroyed instead of being added to the B1 queue, to prevent vacuum from polluting the cache's autotuning.

A GUC variable

    buffer_strategy_status_interval = 0 # 0-600 seconds

controls DEBUG1 messages every n seconds showing the current queue sizes and the cache hit rates during the last interval.

_Vacuum page delay_:

Tom Lane's napping during vacuums, with another tuning option. I replaced the usleep() call with a PG_DELAY(msec) macro in miscadmin.h, which uses select(2) instead. That should address the possible portability problems.

The config options

    vacuum_page_group_delay = 0 # 0-100 milliseconds
    vacuum_page_group_size = 10 # 1-1000 pages

control how many pages get vacuumed as a group and how long vacuum will nap between groups.

I think this can be improved further if vacuum gets feedback from the buffer manager on whether a page was found clean or already dirty in the cache, or had to be faulted in.
This, together with whether vacuum actually dirties the page or not, would result in a sort of "vacuum page cost" that is accumulated and controls how often to nap. Vacuuming a page that is found in the cache and has no dead tuples is cheap, but vacuuming a page that caused another dirty block to get evicted, then had to be read in, and finally ends up dirty because of dead tuples is expensive.

_Lazy checkpoint_:

This is the checkpoint process with the ability to schedule the buffer flushing over some time. Also, the buffers are written in an order given by the buffer replacement strategy. Currently that is a merged list of dirty buffers in the order of the T1 and T2 queues of ARC. Since buffers are replaced in that order, it causes backends to find clean buffers for eviction more often.

The config options

    lazy_checkpoint_time = 0 # 0-3600 seconds
    lazy_checkpoint_group_size = 50 # 10-1000 pages
    lazy_checkpoint_maxdelay = 500 # 100-1000 milliseconds

control how long the buffer flushing "should" take and how many dirty pages to write as a group before syncing and napping. The maxdelay parameter keeps really small amounts of changes from being spread out over that long.

The syncing is currently done in a new function in md.c, mdfsyncrecent(), called through the smgr. The intention is to maintain some LRU of written-to file descriptors and pg_fdatasync() them. I haven't found the right place for that yet, so it simply does a system-global sync(). My idea here is that it really does not matter how accurately the individual files are forced to disk during this; all we care about is causing the kernel to perform some physical writes while we're writing them out, and not letting the OS buffer those writes until we finish the checkpoint.

The lazy checkpoint configuration should only affect automatic checkpoints started by the postmaster because a checkpoint_timeout occurred. Actually, it seems to apply to manually started checkpoints as well.
BufferSync() monitors the time to finish, held in shared memory, so it would be relatively easy to hurry up a running lazy checkpoint by setting that to zero. It's just that the postmaster can't do that, because it does not have a PGPROC structure and therefore can't lock that shmem structure. This is a must-fix item, because being able to hurry up the checkpointer is critical at shutdown time.

_TODO_:

* Replace the global sync() in mdfsyncrecent(int max) with calls to pg_fdatasync().
* Add functionality to the postmaster to hurry up a running checkpoint at shutdown.
* Make sure that manual checkpoints are not affected by the lazy checkpoint config options and that they too hurry up a running one.
* Further improve vacuum's napping strategy depending on the actual I/O caused per page.

_NOTE_:

The core team is well aware of the high demand for these features. As things stand, however, it is impossible to get this functionality released in version 7.4. That does not mean that we have no chance to include some or all of the functionality in a subsequent 7.4.x release. But for that to happen, the TODOs mentioned above must get done first. Further, we need a good amount of evidence that these changes actually achieve the desired effect to a degree that justifies breaking our "no features in dot releases" rule. We also need a good amount of evidence that the features don't break anything or sacrifice stability, and that backward-compatible behaviour (where possible ... not possible with ARC vs. LRU) is the default.

I personally would like to see this work included in a 7.4.x release. But it requires people to actually run tests, stress some hardware, check platform portability and *give us feedback*, because this is what we get for the release candidates, and these improvements can under no circumstances have any lower quality than that.
If this goes into a 7.4.x release and there is any platform-dependent issue in it, it endangers the timely fix of other bugs for those platforms, and that's a no-go.

Happy testing

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
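The T1-to-T2 promotion rule described under "Shared buffer replacement" can be illustrated with a small sketch. This is not code from the patch: the struct, its field names and the function names are hypothetical, and only the rule itself (the transaction that put a buffer on T1 does not promote it) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;          /* stand-in for a transaction id */

typedef enum { SKETCH_T1, SKETCH_T2 } SketchQueue;

/* Hypothetical per-buffer directory entry; field names are illustrative. */
typedef struct
{
    SketchQueue queue;               /* ARC list the buffer is currently on */
    SketchXid   t1_xid;              /* transaction that put it on T1 */
} SketchCDB;

/* A buffer that is faulted in starts on T1 and remembers who loaded it. */
void
sketch_cdb_fault_in(SketchCDB *cdb, SketchXid xid)
{
    assert(cdb != NULL);
    cdb->queue = SKETCH_T1;
    cdb->t1_xid = xid;
}

/*
 * Cache hit.  Returns true if the buffer was promoted to T2.  A repeated
 * lookup by the transaction that loaded the buffer does not count as
 * evidence of frequent access, so it leaves the buffer on T1.
 */
bool
sketch_cdb_hit(SketchCDB *cdb, SketchXid xid)
{
    assert(cdb != NULL);
    if (cdb->queue == SKETCH_T1 && cdb->t1_xid != xid)
    {
        cdb->queue = SKETCH_T2;
        return true;
    }
    return false;
}
```

In this sketch a scan followed by heap_update() in one transaction produces two lookups with the same xid, so the block stays on T1; only a later hit from a different transaction moves it to T2.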
Jan Wieck wrote:
> _Vacuum page delay_:
>
> Tom Lane's napping during vacuums with another tuning option. I
> replaced the usleep() call with a PG_DELAY(msec) macro in miscadmin.h,
> which does use select(2) instead. That should address the possible
> portability problems.

What about skipping the delay if there are no outstanding disk operations? Then vacuum would get the full disk bandwidth if the system is idle.

--
    Manfred
Jan Wieck <JanWieck@Yahoo.com> writes:
> This patch contains the "still not yet ready" performance improvements
> discussed over the couple last days.

Cool stuff!

> The buffer replacement strategy is a slightly modified version of
> ARC.

BTW Jan, I got your message about taking a look at the ARC code; I'm really busy at the moment, but I'll definitely take a look at it when I get a chance.

> I personally would like to see this work included in a 7.4.x
> release.

Personally, I can't see any circumstance under which I would view this as appropriate for integration into the 7.4 branch -- the changes this patch introduces are pretty fundamental to the system; even with testing I'd rather not see a stable release series potentially destabilized. Furthermore, it's not as if these performance issues have been recently discovered: we've been aware of most of them for at least one or two prior releases (if not much longer).

-Neil
Manfred Spraul wrote:
> Jan Wieck wrote:
>
>> _Vacuum page delay_:
>>
>> Tom Lane's napping during vacuums with another tuning option. I
>> replaced the usleep() call with a PG_DELAY(msec) macro in miscadmin.h,
>> which does use select(2) instead. That should address the possible
>> portability problems.
>
> What about skipping the delay if there are no outstanding disk
> operations? Then vacuum would get the full disk bandwidth if the system
> is idle.

All we could do is monitor our own recent activity. I doubt that anything else would be portable. And on a dedicated DB server that is very close to the truth anyway.

How portable is getrusage()? Could the postmaster issue that frequently for RUSAGE_CHILDREN and leave the result somewhere in shared memory for whoever is concerned?

Jan
Neil Conway wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> This patch contains the "still not yet ready" performance improvements
>> discussed over the couple last days.
>
> Cool stuff!
>
>> The buffer replacement strategy is a slightly modified version of
>> ARC.
>
> BTW Jan, I got your message about taking a look at the ARC code; I'm
> really busy at the moment, but I'll definitely take a look at it when
> I get a chance.
>
>> I personally would like to see this work included in a 7.4.x
>> release.
>
> Personally, I can't see any circumstance under which I would view this
> as appropriate for integration into the 7.4 branch -- the changes this
> patch introduces are pretty fundamental to the system; even with
> testing I'd rather not see a stable release series potentially
> destabilized. Furthermore, it's not as if these performance issues
> have been recently discovered: we've been aware of most of them for at
> least one or two prior releases (if not much longer).

There are many aspects to this, and a full consensus will probably not be reachable.

As a matter of fact, people who have performance problems are likely to be the same people who have upgrade problems. And as Gaetano pointed out correctly, we will see wildforms with one or the other feature applied.

My opinion is that it is best for us as supporters, and for the reputation of PostgreSQL, to try to keep the number of wildforms as small as possible and to provide those features in the best possible quality.

Jan
Jan Wieck wrote:
> How portable is getrusage()? Could the postmaster issue that
> frequently for RUSAGE_CHILDREN and leave the result somewhere in the
> shared memory for whoever is concerned?

SVr4, BSD4.3, SUS2 and POSIX 1003.1, I believe.

I also believe there is a M$ dll available that gives that functionality (psapi.dll).

cheers

andrew
Andrew Dunstan wrote:
> Jan Wieck wrote:
>
>> How portable is getrusage()? Could the postmaster issue that
>> frequently for RUSAGE_CHILDREN and leave the result somewhere in the
>> shared memory for whoever is concerned?
>
> SVr4, BSD4.3, SUS2 and POSIX1003.1, I believe.
>
> I also believe there is a M$ dll available that gives that functionality
> (psapi.dll).

The question remains when it is updated; the manpage doesn't say. If the RUSAGE_CHILDREN information is updated only when the child exits, each backend has to do it.

Jan
On Wed, Nov 05, 2003 at 03:08:53PM -0500, Neil Conway wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > I personally would like to see this work included in a 7.4.x
> > release.
>
> Personally, I can't see any circumstance under which I would view this
> as appropriate for integration into the 7.4 branch -- the changes this

As unhappy as I am to say so, I agree strongly. Dot releases don't get anything like enough testing to make me comfortable with putting this kind of patch into such a release. I'm just a user though.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
Jan Wieck <JanWieck@Yahoo.com> writes:
> Manfred Spraul wrote:
>> What about skipping the delay if there are no outstanding disk
>> operations?

> How portable is getrusage()? Could the postmaster issue that frequently
> for RUSAGE_CHILDREN and leave the result somewhere in the shared memory
> for whoever is concerned?

How would that tell you about currently outstanding operations?

Manfred's idea is interesting but AFAICS completely unimplementable in any portable fashion. You'd have to have hooks into the kernel.

regards, tom lane
On Wed, Nov 05, 2003 at 03:49:54PM -0500, Jan Wieck wrote:
> Andrew Dunstan wrote:
>
> >Jan Wieck wrote:
> >
> >>How portable is getrusage()? Could the postmaster issue that
> >>frequently for RUSAGE_CHILDREN and leave the result somewhere in the
> >>shared memory for whoever is concerned?
> >
> >SVr4, BSD4.3, SUS2 and POSIX1003.1, I believe.
> >
> >I also believe there is a M$ dll available that gives that functionality
> >(psapi.dll).
>
> Remains the question when it is updated, the manpage doesn't tell. If
> the RUSAGE_CHILDREN information is updated only when the child exits,
> each backend has to do it.

"If the value of the who argument is RUSAGE_CHILDREN, information shall be returned about resources used by the terminated and waited-for children of the current process"

Kurt
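Given the wording Kurt quotes, the postmaster would see nothing about backends that are still running, so the remaining option Jan mentions is that each backend queries its own usage. A minimal sketch of that, assuming a hypothetical helper name and using only the ru_utime and ru_stime fields; where the result would be published in shared memory is left out. Note that this measures CPU time only and says nothing about disk operations that are currently outstanding, which is the objection Tom raises.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Hypothetical helper: CPU time (user + system) consumed so far by the
 * calling process, in microseconds.  RUSAGE_SELF reports on the caller
 * while it is still running, unlike RUSAGE_CHILDREN, which only covers
 * children that have terminated and been waited for.
 */
long long
sketch_self_cpu_usec(void)
{
    struct rusage ru;
    int           rc = getrusage(RUSAGE_SELF, &ru);

    /* Both arguments are valid here, so the call is not expected to fail. */
    assert(rc == 0);
    (void) rc;

    return ((long long) ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
        + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```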
Jan Wieck <JanWieck@Yahoo.com> writes:
> As a matter of fact, people who have performance problems are likely to
> be the same who have upgrade problems. And as Gaetano pointed out
> correctly, we will see wildforms with one or the other feature applied.

I'd believe that for patches of the size of my original VACUUM-delay hack (or even a production-grade version of same, which'd probably be 10x larger). The kind of wholesale rewrite you are currently proposing is much too large to consider folding back into 7.4.*, IMHO.

regards, tom lane
Tom Lane wrote:
>Manfred's idea is interesting but AFAICS completely unimplementable
>in any portable fashion. You'd have to have hooks into the kernel.

I was thinking of outstanding operations issued by postgres itself. I don't know enough about the buffer layer to say whether it's possible to keep a counter of the currently running read() and write() operations, or something similar.

--
    Manfred
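The counter Manfred has in mind might look roughly like the sketch below. Everything here is hypothetical (names, placement, the use of C11 atomics); a real version would have to live in shared memory and be called around the buffer manager's read() and write() calls. As Tom's objection implies, it would only see I/O issued by postgres itself, not writes the kernel is still flushing on its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter of read()/write() calls currently in progress. */
static atomic_int sketch_outstanding_io = 0;

/* Call immediately before issuing a read() or write(). */
void
sketch_io_begin(void)
{
    atomic_fetch_add(&sketch_outstanding_io, 1);
}

/* Call immediately after the read() or write() returns. */
void
sketch_io_end(void)
{
    int before = atomic_fetch_sub(&sketch_outstanding_io, 1);

    assert(before > 0);         /* every end must match a begin */
    (void) before;
}

/* Vacuum would nap only while some other I/O is in flight. */
bool
sketch_vacuum_should_nap(void)
{
    return atomic_load(&sketch_outstanding_io) > 0;
}
```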
Tom Lane wrote:

>Jan Wieck <JanWieck@Yahoo.com> writes:
>
>>As a matter of fact, people who have performance problems are likely to
>>be the same who have upgrade problems. And as Gaetano pointed out
>>correctly, we will see wildforms with one or the other feature applied.
>
>I'd believe that for patches of the size of my original VACUUM-delay
>hack (or even a production-grade version of same, which'd probably be
>10x larger). The kind of wholesale rewrite you are currently proposing
>is much too large to consider folding back into 7.4.*, IMHO.

Do people think that the VACUUM-delay patch would be useful enough on its own to consider working it into 7.4.1 or something? From the little feedback I have read on the VACUUM-delay patch used in isolation, it certainly does help. I would love to see it put into 7.4 somehow.

The far more rigorous changes that Jan is working on will be welcome improvements for 7.5.
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
> > As a matter of fact, people who have performance problems are likely to
> > be the same who have upgrade problems. And as Gaetano pointed out
> > correctly, we will see wildforms with one or the other feature applied.
>
> I'd believe that for patches of the size of my original VACUUM-delay
> hack (or even a production-grade version of same, which'd probably be
> 10x larger). The kind of wholesale rewrite you are currently proposing
> is much too large to consider folding back into 7.4.*, IMHO.

What Jan could do is to have a 7.4 patch available that people can test, and he can improve it during the 7.5 development cycle with feedback from users.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
A long time ago, in a galaxy far, far away, pgman@candle.pha.pa.us (Bruce Momjian) wrote:
> Tom Lane wrote:
>> Jan Wieck <JanWieck@Yahoo.com> writes:
>> > As a matter of fact, people who have performance problems are likely to
>> > be the same who have upgrade problems. And as Gaetano pointed out
>> > correctly, we will see wildforms with one or the other feature applied.
>>
>> I'd believe that for patches of the size of my original VACUUM-delay
>> hack (or even a production-grade version of same, which'd probably be
>> 10x larger). The kind of wholesale rewrite you are currently proposing
>> is much too large to consider folding back into 7.4.*, IMHO.
>
> What Jan could do is to have a 7.4 patch available that people can test,
> and he can improve it during the 7.5 development cycle with feedback
> from users.

The thing is, there are two patches that seem likely to be of interest:

 a) There's the ARC changes, which really feel like they are 7.5 development, not likely to be readily backportable;

 b) On the other hand, a "simple delay" on the VACUUM seems likely to be useful, and reasonably backportable.

And these are two quite different things, both of which may be worth having.

--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://www.ntlug.org/~cbbrowne/unix.html
If I could put Klein in a bottle...
Christopher Browne wrote:
> A long time ago, in a galaxy far, far away, pgman@candle.pha.pa.us (Bruce Momjian) wrote:
> > Tom Lane wrote:
> >> Jan Wieck <JanWieck@Yahoo.com> writes:
> >> > As a matter of fact, people who have performance problems are likely to
> >> > be the same who have upgrade problems. And as Gaetano pointed out
> >> > correctly, we will see wildforms with one or the other feature applied.
> >>
> >> I'd believe that for patches of the size of my original VACUUM-delay
> >> hack (or even a production-grade version of same, which'd probably be
> >> 10x larger). The kind of wholesale rewrite you are currently proposing
> >> is much too large to consider folding back into 7.4.*, IMHO.
> >
> > What Jan could do is to have a 7.4 patch available that people can test,
> > and he can improve it during the 7.5 development cycle with feedback
> > from users.
>
> The thing is, there are two patches that seem likely to be of
> interest:
>
> a) There's the ARC changes, which really feel like they are 7.5
> development, not likely to be readily backportable;
>
> b) On the other hand, a "simple delay" on the VACUUM seems likely
> to be useful, and reasonably backportable.
>
> And these are two quite different things, both of which may be worth
> having.

Yes, Tom has already said "b" is possible in a 7.4.X subrelease, but not for 7.4.0.

--
  Bruce Momjian
Christopher Browne wrote:
> A long time ago, in a galaxy far, far away, pgman@candle.pha.pa.us (Bruce Momjian) wrote:
>> Tom Lane wrote:
>>> Jan Wieck <JanWieck@Yahoo.com> writes:
>>> > As a matter of fact, people who have performance problems are likely to
>>> > be the same who have upgrade problems. And as Gaetano pointed out
>>> > correctly, we will see wildforms with one or the other feature applied.
>>>
>>> I'd believe that for patches of the size of my original VACUUM-delay
>>> hack (or even a production-grade version of same, which'd probably be
>>> 10x larger). The kind of wholesale rewrite you are currently proposing
>>> is much too large to consider folding back into 7.4.*, IMHO.
>>
>> What Jan could do is to have a 7.4 patch available that people can test,
>> and he can improve it during the 7.5 development cycle with feedback
>> from users.
>
> The thing is, there are two patches that seem likely to be of
> interest:
>
> a) There's the ARC changes, which really feel like they are 7.5
> development, not likely to be readily backportable;
>
> b) On the other hand, a "simple delay" on the VACUUM seems likely
> to be useful, and reasonably backportable.
>
> And these are two quite different things, both of which may be worth
> having.

I only need to know the three W's: when, what and where (when do people want what pieces of the stuff where?).

However, I have not seen much evidence yet that the vacuum delay alone does that much. In conjunction with putting vacuum-dirtied blocks at LRU instead of MRU maybe, but that's again another functional change.

So I am not sure what the outcome of that for 7.4 is. The general opinion is that the whole thing is too much. But nobody has done anything to show how the vacuum delay alone compares to that.

Jan
Jan Wieck <JanWieck@Yahoo.com> writes:
> However, I have not seen much evidence yet that the vacuum delay alone
> does that much.

Gaetano and a couple of other people did experiments that seemed to show it was useful. I think we'd want to change the shape of the knob per later suggestions (sleep 10 ms every N blocks, instead of N ms every block) but it did seem that there was useful bang for little buck there.

regards, tom lane
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> However, I have not seen much evidence yet that the vacuum delay alone
>> does that much.
>
> Gaetano and a couple of other people did experiments that seemed to show
> it was useful. I think we'd want to change the shape of the knob per
> later suggestions (sleep 10 ms every N blocks, instead of N ms every
> block) but it did seem that there was useful bang for little buck there.

I thought it was "sleep N ms every M blocks".

Have we seen any numbers? Anything at all? Something that gives us a clue by what factor one has to multiply the total time a "VACUUM ANALYZE" takes, to get what effect in return?

Jan
----- Original Message -----
From: "Jan Wieck" <JanWieck@Yahoo.com>
> Tom Lane wrote:
> > Gaetano and a couple of other people did experiments that seemed to show
> > it was useful. I think we'd want to change the shape of the knob per
> > later suggestions (sleep 10 ms every N blocks, instead of N ms every
> > block) but it did seem that there was useful bang for little buck there.
>
> I thought it was "sleep N ms every M blocks".
>
> Have we seen any numbers? Anything at all? Something that gives us a
> clue by what factor one has to multiply the total time a "VACUUM
> ANALYZE" takes, to get what effect in return?

I have some time on Sunday to do some testing. Is there a patch that I can apply that implements either of the two options (sleep 10 ms every M blocks, or sleep N ms every M blocks)?

I know Tom posted the original patch that slept N ms every 1 block (where N is > 10 due to OS limitations). Jan, can you post a patch that has just the sleep code in it? Or should it be easy enough for me to cull out of the larger patch you posted?
On Fri, 7 Nov 2003, Matthew T. O'Connor wrote:

> ----- Original Message -----
> From: "Jan Wieck" <JanWieck@Yahoo.com>
> > Tom Lane wrote:
> > > Gaetano and a couple of other people did experiments that seemed to show
> > > it was useful. I think we'd want to change the shape of the knob per
> > > later suggestions (sleep 10 ms every N blocks, instead of N ms every
> > > block) but it did seem that there was useful bang for little buck there.
> >
> > I thought it was "sleep N ms every M blocks".
> >
> > Have we seen any numbers? Anything at all? Something that gives us a
> > clue by what factor one has to multiply the total time a "VACUUM
> > ANALYZE" takes, to get what effect in return?
>
> I have some time on sunday to do some testing. Is there a patch that I can
> apply that implements either of the two options? (sleep 10ms every M blocks
> or sleep N ms every M blocks).
>
> I know Tom posted the original patch that sleept N ms every 1 block (where N
> is > 10 due to OS limitations). Jan can you post a patch that has just the
> sleep code in it? Or should it be easy enough for me to cull out of the
> larger patch you posted?

The reason for the change is that the minimum sleep period on many systems is 10 ms, which meant that vacuum was running 20X slower than normal. While it might be necessary in certain very I/O-starved situations to make it this slow, it would probably be better to be able to get a vacuum that ran at about 1/2 to 1/5 speed for most folks. So, since the delta can't be less than 10 ms on most systems, it's better to just leave it at a fixed amount and change the number of pages vacuumed per sleep.

I'm certainly gonna test the patch out too. We aren't really I/O bound, but it would be nice to have a database that only slowed down ~1% or so during vacuuming.
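The arithmetic behind Scott's numbers can be written down as a small sketch. The per-page cost is an assumption chosen for illustration (about 0.5 ms, which roughly reproduces the "20X slower" figure for a 10 ms nap after every page); it is not a measurement from this thread, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * Factor by which vacuum's total run time grows when it naps nap_usec
 * after every group of group_size pages, assuming each page costs
 * page_usec of work.  page_usec is an assumed figure, not a measurement.
 */
double
sketch_vacuum_slowdown(int group_size, double page_usec, double nap_usec)
{
    double work;

    assert(group_size > 0 && page_usec > 0.0 && nap_usec >= 0.0);

    work = group_size * page_usec;      /* time spent vacuuming one group */
    return (work + nap_usec) / work;    /* (work + nap) relative to work */
}
```

Under that assumed 0.5 ms per page and a fixed 10 ms nap, a group size of 1 gives a factor of 21, a group of 5 pages gives 5, and a group of 20 pages gives 2, which is the 1/5 to 1/2 speed range Scott mentions.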
Yes, I would like to see the vacuum delay patch go into 7.4.1 if possible. It's really useful. I don't think there is any major risk in adding the delay patch into a minor revision given the small amount of code change.

Stephen

""Matthew T. O'Connor"" <matthew@zeut.net> wrote in message news:3FA97470.3020803@zeut.net...
> Tom Lane wrote:
>
> >Jan Wieck <JanWieck@Yahoo.com> writes:
> >
> >>As a matter of fact, people who have performance problems are likely to
> >>be the same who have upgrade problems. And as Gaetano pointed out
> >>correctly, we will see wildforms with one or the other feature applied.
> >
> >I'd believe that for patches of the size of my original VACUUM-delay
> >hack (or even a production-grade version of same, which'd probably be
> >10x larger). The kind of wholesale rewrite you are currently proposing
> >is much too large to consider folding back into 7.4.*, IMHO.
>
> Do people think that the VACUUM-delay patch by itself, would be usefully
> enough on it's own to consider working it into 7.4.1 or something? From
> the little feedback I have read on the VACUUM-delay patch used in
> isolation, it certainly does help. I would love to see it put into 7.4
> somehow.
>
> The far more rigorous changes that Jan is working on, will be welcome
> improvements for 7.5.
Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>>However, I have not seen much evidence yet that the vacuum delay alone
>>does that much.
>
> Gaetano and a couple of other people did experiments that seemed to show
> it was useful. I think we'd want to change the shape of the knob per
> later suggestions (sleep 10 ms every N blocks, instead of N ms every
> block) but it did seem that there was useful bang for little buck there.

Right, I'd like to try the patch now: "sleep N ms every M blocks". Can you please post this patch?

BTW, I'll see if I'm able to apply it to a 7.3.X as well (our production DB).

Regards
Gaetano Mendola
scott.marlowe wrote:
> On Fri, 7 Nov 2003, Matthew T. O'Connor wrote:
>
>> ----- Original Message -----
>> From: "Jan Wieck" <JanWieck@Yahoo.com>
>> > Tom Lane wrote:
>> > > Gaetano and a couple of other people did experiments that seemed to show
>> > > it was useful. I think we'd want to change the shape of the knob per
>> > > later suggestions (sleep 10 ms every N blocks, instead of N ms every
>> > > block) but it did seem that there was useful bang for little buck there.
>> >
>> > I thought it was "sleep N ms every M blocks".
>> >
>> > Have we seen any numbers? Anything at all? Something that gives us a
>> > clue by what factor one has to multiply the total time a "VACUUM
>> > ANALYZE" takes, to get what effect in return?
>>
>> I have some time on sunday to do some testing. Is there a patch that I can
>> apply that implements either of the two options? (sleep 10ms every M blocks
>> or sleep N ms every M blocks).
>>
>> I know Tom posted the original patch that sleept N ms every 1 block (where N
>> is > 10 due to OS limitations). Jan can you post a patch that has just the
>> sleep code in it? Or should it be easy enough for me to cull out of the
>> larger patch you posted?
>
> The reason for the change is that the minumum sleep period on many systems
> is 10mS, which meant that vacuum was running 20X slower than normal.
> While it might be necessary in certain very I/O starved situations to make
> it this slow, it would probably be better to be able to get a vacuum that
> ran at about 1/2 to 1/5 speed for most folks. So, since the delta can't
> less than 10mS on most systems, it's better to just leave it at a fixed
> amount and change the number of pages vacuumed per sleep.

I disagree with that. If the number of pages is the only knob you have and the napping time is fixed, you can only lower the number of sequentially read pages to slow it down. That makes read-ahead absurd in an I/O-starved situation ...

I'll post a patch later that naps for M milliseconds every N pages, using two GUC variables and based on a select(2) call.

Jan
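A minimal sketch of the "nap for M milliseconds every N pages" scheme with a select(2)-based sleep follows. The names are hypothetical and this is not the body of the PG_DELAY macro, which is not shown in this thread; it only illustrates the counting logic and the usual way of sleeping with select(2) by passing no file descriptors and a timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical stand-ins for the two GUC variables and the page counter. */
int        sketch_group_size = 10;   /* vacuum N pages ... */
int        sketch_group_msec = 0;    /* ... then nap M msec (0 disables) */
static int sketch_group_count = 0;

/* Sleep for msec milliseconds using select(2) with no file descriptors. */
void
sketch_delay_msec(int msec)
{
    struct timeval tv;

    assert(msec >= 0);
    tv.tv_sec = msec / 1000;
    tv.tv_usec = (msec % 1000) * 1000;

    /* May return early if a signal arrives; acceptable for a nap. */
    (void) select(0, NULL, NULL, NULL, &tv);
}

/* Call once per vacuumed page.  Returns true if this call napped. */
bool
sketch_vacuum_page_done(void)
{
    if (sketch_group_msec <= 0)
        return false;                /* napping is switched off */
    if (++sketch_group_count < sketch_group_size)
        return false;                /* group not complete yet */

    sketch_delay_msec(sketch_group_msec);
    sketch_group_count = 0;
    return true;
}
```

With both knobs available, the nap length and the group size can be tuned independently, which is the point of Jan's objection to fixing the nap at 10 ms.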
Matthew T. O'Connor wrote:
> ----- Original Message -----
> From: "Jan Wieck" <JanWieck@Yahoo.com>
>> Tom Lane wrote:
>> > Gaetano and a couple of other people did experiments that seemed to show
>> > it was useful. I think we'd want to change the shape of the knob per
>> > later suggestions (sleep 10 ms every N blocks, instead of N ms every
>> > block) but it did seem that there was useful bang for little buck there.
>>
>> I thought it was "sleep N ms every M blocks".
>>
>> Have we seen any numbers? Anything at all? Something that gives us a
>> clue by what factor one has to multiply the total time a "VACUUM
>> ANALYZE" takes, to get what effect in return?
>
> I have some time on sunday to do some testing. Is there a patch that I can
> apply that implements either of the two options? (sleep 10ms every M blocks
> or sleep N ms every M blocks).
>
> I know Tom posted the original patch that sleept N ms every 1 block (where N
> is > 10 due to OS limitations). Jan can you post a patch that has just the
> sleep code in it? Or should it be easy enough for me to cull out of the
> larger patch you posted?

Sorry for the delay, had to finish some other concept yesterday (will be published soon).

The attached patch adds

    vacuum_group_delay_size = 10 (range 1-1000)
    vacuum_group_delay_msec = 0 (range 0-1000)

and does the sleeping via select(2). It does it only at the same places where Tom had done the usleep() in his hack, so I guess there is still some more to do besides the documentation, before it can be added to 7.4.1. But it should be enough to get some testing done.

Jan

Index: src/backend/access/nbtree/nbtree.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/access/nbtree/nbtree.c,v
retrieving revision 1.106
diff -c -b -r1.106 nbtree.c
*** src/backend/access/nbtree/nbtree.c 2003/09/29 23:40:26 1.106
--- src/backend/access/nbtree/nbtree.c 2003/11/09 23:39:36
***************
*** 27,32 ****
--- 27,40 ----
  #include "storage/smgr.h"
  
  
+ /*
+  * Variables for vacuum_group_delay option (in commands/vacuumlazy.c)
+  */
+ extern int vacuum_group_delay_size; /* vacuum N pages */
+ extern int vacuum_group_delay_msec; /* then sleep M msec */
+ extern int vacuum_group_delay_count;
+ 
+ 
  /* Working state for btbuild and its callback */
  typedef struct
  {
***************
*** 610,615 ****
--- 618,632 ----
  
          CHECK_FOR_INTERRUPTS();
  
+         if (vacuum_group_delay_msec > 0)
+         {
+             if (++vacuum_group_delay_count >= vacuum_group_delay_size)
+             {
+                 PG_DELAY(vacuum_group_delay_msec);
+                 vacuum_group_delay_count = 0;
+             }
+         }
+ 
          ndeletable = 0;
          page = BufferGetPage(buf);
          opaque = (BTPageOpaque) PageGetSpecialPointer(page);
***************
*** 736,741 ****
--- 753,769 ----
          Buffer buf;
          Page page;
          BTPageOpaque opaque;
+ 
+         CHECK_FOR_INTERRUPTS();
+ 
+         if (vacuum_group_delay_msec > 0)
+         {
+             if (++vacuum_group_delay_count >= vacuum_group_delay_size)
+             {
+                 PG_DELAY(vacuum_group_delay_msec);
+                 vacuum_group_delay_count = 0;
+             }
+         }
  
          buf = _bt_getbuf(rel, blkno, BT_READ);
          page = BufferGetPage(buf);
Index: src/backend/commands/vacuumlazy.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/commands/vacuumlazy.c,v
retrieving revision 1.32
diff -c -b -r1.32 vacuumlazy.c
*** src/backend/commands/vacuumlazy.c 2003/09/25 06:57:59 1.32
--- src/backend/commands/vacuumlazy.c 2003/11/09 23:40:13
***************
*** 88,93 ****
--- 88,100 ----
  static TransactionId OldestXmin;
  static TransactionId FreezeLimit;
  
+ /*
+  * Variables for vacuum_group_delay option (in commands/vacuumlazy.c)
+  */
+ int vacuum_group_delay_size = 10; /* vacuum N pages */
+ int vacuum_group_delay_msec = 0; /* then sleep M msec */
+ int vacuum_group_delay_count = 0;
+ 
  
  /* non-export function prototypes */
  static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
***************
*** 228,233 ****
--- 235,249 ----
  
          CHECK_FOR_INTERRUPTS();
  
+         if (vacuum_group_delay_msec > 0)
+         {
+             if (++vacuum_group_delay_count >= vacuum_group_delay_size)
+             {
+                 PG_DELAY(vacuum_group_delay_msec);
+                 vacuum_group_delay_count = 0;
+             }
+         }
+ 
          /*
           * If we are close to overrunning the available space for
           * dead-tuple TIDs, pause and do a cycle of vacuuming before we
***************
*** 469,474 ****
--- 485,499 ----
  
          CHECK_FOR_INTERRUPTS();
  
+         if (vacuum_group_delay_msec > 0)
+         {
+             if (++vacuum_group_delay_count >= vacuum_group_delay_size)
+             {
+                 PG_DELAY(vacuum_group_delay_msec);
+                 vacuum_group_delay_count = 0;
+             }
+         }
+ 
          tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
          buf = ReadBuffer(onerel, tblk);
          LockBufferForCleanup(buf);
***************
*** 799,804 ****
--- 824,838 ----
                  hastup;
  
          CHECK_FOR_INTERRUPTS();
+ 
+         if (vacuum_group_delay_msec > 0)
+         {
+             if (++vacuum_group_delay_count >= vacuum_group_delay_size)
+             {
+                 PG_DELAY(vacuum_group_delay_msec);
+                 vacuum_group_delay_count = 0;
+             }
+         }
  
          blkno--;
  
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/guc.c,v
retrieving revision 1.164.2.1
diff -c -b -r1.164.2.1 guc.c
*** src/backend/utils/misc/guc.c 2003/11/07 21:27:50 1.164.2.1
--- src/backend/utils/misc/guc.c 2003/11/09 23:27:49
***************
*** 73,78 ****
--- 73,80 ----
  extern int CommitDelay;
  extern int CommitSiblings;
  extern char *preload_libraries_string;
+ extern int vacuum_group_delay_size;
+ extern int vacuum_group_delay_msec;
  
  #ifdef HAVE_SYSLOG
extern char *Syslog_facility; *************** *** 1188,1193 **** --- 1190,1213 ---- }, &log_min_duration_statement, -1, -1, INT_MAX / 1000, NULL, NULL + }, + + { + {"vacuum_group_delay_msec", PGC_USERSET, RESOURCES, + gettext_noop("Sets VACUUM's delay in milliseconds between processing groups of pages."), + NULL + }, + &vacuum_group_delay_msec, + 0, 0, 1000, NULL, NULL + }, + + { + {"vacuum_group_delay_size", PGC_USERSET, RESOURCES, + gettext_noop("Sets VACUUM's group size for the vacuum_group_delay_msec option."), + NULL + }, + &vacuum_group_delay_size, + 10, 1, 1000, NULL, NULL }, /* End-of-list marker */ Index: src/backend/utils/misc/postgresql.conf.sample =================================================================== RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/postgresql.conf.sample,v retrieving revision 1.92 diff -c -b -r1.92 postgresql.conf.sample *** src/backend/utils/misc/postgresql.conf.sample 2003/10/08 03:49:38 1.92 --- src/backend/utils/misc/postgresql.conf.sample 2003/11/09 23:04:21 *************** *** 69,74 **** --- 69,79 ---- #max_files_per_process = 1000 # min 25 #preload_libraries = '' + # - Vacuum napping - + + #vacuum_group_delay_size = 10 # range 1-1000 pages ; vacuum this many pages + #vacuum_group_delay_msec = 0 # range 0-1000 msec ; then nap this long + #--------------------------------------------------------------------------- # WRITE AHEAD LOG Index: src/include/miscadmin.h =================================================================== RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/miscadmin.h,v retrieving revision 1.134 diff -c -b -r1.134 miscadmin.h *** src/include/miscadmin.h 2003/09/24 18:54:01 1.134 --- src/include/miscadmin.h 2003/11/09 23:02:03 *************** *** 96,101 **** --- 96,111 ---- CritSectionCount--; \ } while(0) + /* + * Macro using select(2) to nap for milliseconds + */ + #define PG_DELAY(_msec) \ + { \ + struct timeval _delay; \ + _delay.tv_sec = (_msec) / 1000; \ + 
_delay.tv_usec = ((_msec) % 1000) * 1000; \ + (void) select(0, NULL, NULL, NULL, &_delay);\ + } /***************************************************************************** * globals.h -- *
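For anyone who wants to try the napping logic outside the backend, here is a minimal standalone sketch of what the patch does at each CHECK_FOR_INTERRUPTS() site. It assumes a POSIX system where select(2) with no file descriptors acts as a plain timer. The names group_delay_size, group_delay_msec, group_delay_count, nap_msec() and group_delay_point() are hypothetical stand-ins, not identifiers from the patch. The nap is written as a function rather than a brace-only macro, so it stays safe inside an unbraced if/else.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical stand-ins for the patch's GUC-backed globals. */
int group_delay_size = 10;   /* process this many pages ... */
int group_delay_msec = 0;    /* ... then nap this long; 0 disables napping */
int group_delay_count = 0;   /* pages seen since the last nap */

/* Nap for msec milliseconds using select(2) with no file descriptors. */
void
nap_msec(int msec)
{
    struct timeval delay;

    assert(msec >= 0);
    delay.tv_sec = msec / 1000;
    delay.tv_usec = (msec % 1000) * 1000;
    (void) select(0, NULL, NULL, NULL, &delay);
}

/*
 * Call once per page processed. Returns 1 if this call napped, 0 otherwise.
 * As in the patch, the counter is only advanced while napping is enabled.
 */
int
group_delay_point(void)
{
    if (group_delay_msec <= 0)
        return 0;
    if (++group_delay_count < group_delay_size)
        return 0;
    nap_msec(group_delay_msec);
    group_delay_count = 0;
    return 1;
}
```

A page loop would call group_delay_point() once per iteration. With group_delay_size = 10 and group_delay_msec = 20, every tenth call naps for roughly 20 ms, subject to the timer granularity of the operating system.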
On Sun, 9 Nov 2003, Jan Wieck wrote:

> scott.marlowe wrote:
>
> > On Fri, 7 Nov 2003, Matthew T. O'Connor wrote:
> >
> >> ----- Original Message -----
> >> From: "Jan Wieck" <JanWieck@Yahoo.com>
> >> > Tom Lane wrote:
> >> > > Gaetano and a couple of other people did experiments that seemed to show
> >> > > it was useful. I think we'd want to change the shape of the knob per
> >> > > later suggestions (sleep 10 ms every N blocks, instead of N ms every
> >> > > block) but it did seem that there was useful bang for little buck there.
> >> >
> >> > I thought it was "sleep N ms every M blocks".
> >> >
> >> > Have we seen any numbers? Anything at all? Something that gives us a
> >> > clue by what factor one has to multiply the total time a "VACUUM
> >> > ANALYZE" takes, to get what effect in return?
> >>
> >> I have some time on Sunday to do some testing. Is there a patch that I can
> >> apply that implements either of the two options? (sleep 10ms every M blocks
> >> or sleep N ms every M blocks).
> >>
> >> I know Tom posted the original patch that slept N ms every 1 block (where N
> >> is > 10 due to OS limitations). Jan can you post a patch that has just the
> >> sleep code in it? Or should it be easy enough for me to cull out of the
> >> larger patch you posted?
> >
> > The reason for the change is that the minimum sleep period on many systems
> > is 10mS, which meant that vacuum was running 20X slower than normal.
> > While it might be necessary in certain very I/O starved situations to make
> > it this slow, it would probably be better to be able to get a vacuum that
> > ran at about 1/2 to 1/5 speed for most folks. So, since the delta can't
> > be less than 10mS on most systems, it's better to just leave it at a fixed
> > amount and change the number of pages vacuumed per sleep.
>
> I disagree with that. If you limit yourself to the number of pages being
> the only knob you have and set the napping time fixed, you can only
> lower the number of sequentially read pages to slow it down. Making read
> ahead absurd in an IO starved situation ...
>
> I'll post a patch doing
>
>     every N pages nap for M milliseconds
>
> using two GUC variables and based on a select(2) call later.

I didn't mean "fixed in the code"; I meant fixed in your setup. I.e., find
a delay (10 ms, 50, 100, etc.), then vary the number of pages processed at
a time until you start to notice the load, then back it off. You aren't
forced by the code to have one and only one delay value; you set it
yourself.
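To make the trade-off between the two knobs concrete, here is a rough back-of-the-envelope model in C. It is only a sketch: effective_nap_ms() and slowdown_factor() are hypothetical helpers, the model assumes every nap is rounded up to the timer tick, and the 0.5 ms per-page cost used in the note below is an assumed figure chosen only because it roughly reproduces the "20X slower" number quoted above. It is not a measurement.

```c
#include <assert.h>

/*
 * Round a requested nap up to the timer granularity (both in ms).
 * Assumption: the OS cannot sleep for less than one tick.
 */
int
effective_nap_ms(int nap_ms, int tick_ms)
{
    assert(nap_ms >= 0 && tick_ms > 0);
    return ((nap_ms + tick_ms - 1) / tick_ms) * tick_ms;
}

/*
 * Estimated run-time multiplier for a vacuum that processes group_size
 * pages at page_cost_ms each and then naps for nap_ms.
 * A result of 1.0 means no slowdown.
 */
double
slowdown_factor(double page_cost_ms, int group_size, int nap_ms, int tick_ms)
{
    double work_ms;

    assert(page_cost_ms > 0.0 && group_size > 0);
    work_ms = page_cost_ms * group_size;
    return (work_ms + effective_nap_ms(nap_ms, tick_ms)) / work_ms;
}
```

Under those assumptions, napping 10 ms after every page gives a factor of about 21, while napping 10 ms after every 20 pages gives about 2 and after every 5 pages about 5. That matches the "1/2 to 1/5 speed" range mentioned above. Real numbers depend on whether pages come from cache or disk, so the model is only useful for picking a starting point before testing.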