Thread: RE: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)
> Earlier, Vadim was talking about arranging to share fsyncs of the WAL > log file across transactions (after writing your commit record to the > log, sleep a few milliseconds to see if anyone else fsyncs before you > do; if not, issue the fsync yourself). That would offer less-than- > one-fsync-per-transaction performance without giving up any > guarantees. > If people feel a compulsion to have a tunable parameter, let 'em tune > the length of the pre-fsync sleep ... Already implemented (without ability to tune this parameter - xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so backend sleeps 1/200 sec before checking/forcing log fsync. Vadim
> > Earlier, Vadim was talking about arranging to share fsyncs of the WAL > > log file across transactions (after writing your commit record to the > > log, sleep a few milliseconds to see if anyone else fsyncs before you > > do; if not, issue the fsync yourself). That would offer less-than- > > one-fsync-per-transaction performance without giving up any > > guarantees. > > If people feel a compulsion to have a tunable parameter, let 'em tune > > the length of the pre-fsync sleep ... > > Already implemented (without ability to tune this parameter - > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so > backend sleeps 1/200 sec before checking/forcing log fsync. But it returns _completed_ to the client before sleeping, right? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL > > > log file across transactions (after writing your commit record to the > > > log, sleep a few milliseconds to see if anyone else fsyncs before you > > > do; if not, issue the fsync yourself). That would offer less-than- > > > one-fsync-per-transaction performance without giving up any > > > guarantees. > > > If people feel a compulsion to have a tunable parameter, let 'em tune > > > the length of the pre-fsync sleep ... > > > > Already implemented (without ability to tune this parameter - > > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so > > backend sleeps 1/200 sec before checking/forcing log fsync. > > But it returns _completed_ to the client before sleeping, right? No. Vadim
> > > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL > > > > log file across transactions (after writing your commit record to the > > > > log, sleep a few milliseconds to see if anyone else fsyncs before you > > > > do; if not, issue the fsync yourself). That would offer less-than- > > > > one-fsync-per-transaction performance without giving up any > > > > guarantees. > > > > If people feel a compulsion to have a tunable parameter, let 'em tune > > > > the length of the pre-fsync sleep ... > > > > > > Already implemented (without ability to tune this parameter - > > > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so > > > backend sleeps 1/200 sec before checking/forcing log fsync. > > > > But it returns _completed_ to the client before sleeping, right? > > No. Eww, so we have this 1/200 second delay for every transaction. Seems bad to me. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote: > > > > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL > > > > > log file across transactions (after writing your commit record to the > > > > > log, sleep a few milliseconds to see if anyone else fsyncs before you > > > > > do; if not, issue the fsync yourself). That would offer less-than- > > > > > one-fsync-per-transaction performance without giving up any > > > > > guarantees. > > > > > If people feel a compulsion to have a tunable parameter, let 'em tune > > > > > the length of the pre-fsync sleep ... > > > > > > > > Already implemented (without ability to tune this parameter - > > > > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so > > > > backend sleeps 1/200 sec before checking/forcing log fsync. > > > > > > But it returns _completed_ to the client before sleeping, right? > > > > No. > > Ewe, so we have this 1/200 second delay for every transaction. Seems > bad to me. I think as long as it becomes a tunable this isn't a bad idea at all. Fixing it at 1/200 isn't so great because people not wrapping large amounts of inserts/updates with transaction blocks will suffer. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."
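For a rough sense of the cost under discussion, here is a small sketch in C. It assumes CommitDelay is expressed in milliseconds (a value of 5 matches the 1/200 sec figure quoted above) and that a client issues one commit at a time and waits for each to finish. The function name is made up for illustration and is not part of xact.c; the time spent in fsync itself is ignored.

```c
#include <assert.h>

/*
 * Hypothetical helper: upper bound on commits per second for ONE backend
 * that sleeps delay_ms milliseconds before every commit's fsync check.
 * The fsync itself is not counted, so real throughput would be lower.
 */
int
max_commits_per_sec(int delay_ms)
{
    assert(delay_ms > 0);
    return 1000 / delay_ms;
}
```

With a delay of 5 this caps a single backend at 200 commits per second, which is why a stream of inserts that is not wrapped in a transaction block would feel the delay most.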
At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote: >* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote: >> Ewe, so we have this 1/200 second delay for every transaction. Seems >> bad to me. > >I think as long as it becomes a tunable this isn't a bad idea at >all. Fixing it at 1/200 isn't so great because people not wrapping >large amounts of inserts/updates with transaction blocks will >suffer. I think the default should probably be no delay, and the documentation on enabling this needs to be clear and obvious (i.e. hard to miss). - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
> At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote: > >* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote: > > >> Ewe, so we have this 1/200 second delay for every transaction. Seems > >> bad to me. > > > >I think as long as it becomes a tunable this isn't a bad idea at > >all. Fixing it at 1/200 isn't so great because people not wrapping > >large amounts of inserts/updates with transaction blocks will > >suffer. > > I think the default should probably be no delay, and the documentation > on enabling this needs to be clear and obvious (i.e. hard to miss). I just talked to Tom Lane about this. I think a sleep(0) just before the flush would be best. It would relinquish the CPU time slice if another process is ready to run. If no other backend is running, it probably just returns. If there is another one, it gives it a chance to complete. On return from sleep(0), the backend can check whether it still needs to flush. This would tend to bunch up flushers so they flush only once, while not delaying cases where only one backend is running. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: >> I think the default should probably be no delay, and the documentation >> on enabling this needs to be clear and obvious (i.e. hard to miss). > >I just talked to Tom Lane about this. I think a sleep(0) just before >the flush would be the best. It would reliquish the cpu slice if >another process is ready to run. If no other backend is running, it >probably just returns. If there is another one, it gives it a chance to >complete. On return from sleep(0), it can check if it still needs to >flush. This would tend to bunch up flushers so they flush only once, >while not delaying cases where only one backend is running. This sounds like an interesting approach, yes. - Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
* Don Baccus <dhogaza@pacifier.com> [001116 13:46]: > At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: > > >> I think the default should probably be no delay, and the documentation > >> on enabling this needs to be clear and obvious (i.e. hard to miss). > > > >I just talked to Tom Lane about this. I think a sleep(0) just before > >the flush would be the best. It would reliquish the cpu slice if > >another process is ready to run. If no other backend is running, it > >probably just returns. If there is another one, it gives it a chance to > >complete. On return from sleep(0), it can check if it still needs to > >flush. This would tend to bunch up flushers so they flush only once, > >while not delaying cases where only one backend is running. > > This sounds like an interesting approach, yes. Question: Is sleep(0) guaranteed to at least give up control? The way I read my UnixWare 7's man page, it might not, since alarm(0) just cancels the alarm... Larry > > > > - Don Baccus, Portland OR <dhogaza@pacifier.com> > Nature photos, on-line guides, Pacific Northwest > Rare Bird Alert Service and other goodies at > http://donb.photo.net. -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
> At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: > > >> I think the default should probably be no delay, and the documentation > >> on enabling this needs to be clear and obvious (i.e. hard to miss). > > > >I just talked to Tom Lane about this. I think a sleep(0) just before > >the flush would be the best. It would reliquish the cpu slice if > >another process is ready to run. If no other backend is running, it > >probably just returns. If there is another one, it gives it a chance to > >complete. On return from sleep(0), it can check if it still needs to > >flush. This would tend to bunch up flushers so they flush only once, > >while not delaying cases where only one backend is running. > > This sounds like an interesting approach, yes. In OS kernel design, you try to avoid process herding bottlenecks. Here, we want them herded, and giving up the CPU may be the best way to do it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]: > > > This sounds like an interesting approach, yes. > > Question: Is sleep(0) guaranteed to at least give up control? > > > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > > just cancels the alarm... > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel > call return. BUT, do we know for sure that sleep(0) is not optimized in the library to just return? -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 11:59] wrote: > > At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: > > > > >> I think the default should probably be no delay, and the documentation > > >> on enabling this needs to be clear and obvious (i.e. hard to miss). > > > > > >I just talked to Tom Lane about this. I think a sleep(0) just before > > >the flush would be the best. It would reliquish the cpu slice if > > >another process is ready to run. If no other backend is running, it > > >probably just returns. If there is another one, it gives it a chance to > > >complete. On return from sleep(0), it can check if it still needs to > > >flush. This would tend to bunch up flushers so they flush only once, > > >while not delaying cases where only one backend is running. > > > > This sounds like an interesting approach, yes. > > In OS kernel design, you try to avoid process herding bottlenecks. > Here, we want them herded, and giving up the CPU may be the best way to > do it. Yes, but if everyone yields you're back where you started, and with 128 or more backends, do you really want to cause possibly that many context switches per fsync? -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."
> * Don Baccus <dhogaza@pacifier.com> [001116 13:46]: > > At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote: > > > > >> I think the default should probably be no delay, and the documentation > > >> on enabling this needs to be clear and obvious (i.e. hard to miss). > > > > > >I just talked to Tom Lane about this. I think a sleep(0) just before > > >the flush would be the best. It would reliquish the cpu slice if > > >another process is ready to run. If no other backend is running, it > > >probably just returns. If there is another one, it gives it a chance to > > >complete. On return from sleep(0), it can check if it still needs to > > >flush. This would tend to bunch up flushers so they flush only once, > > >while not delaying cases where only one backend is running. > > > > This sounds like an interesting approach, yes. > Question: Is sleep(0) guaranteed to at least give up control? > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > just cancels the alarm... Well, it certainly is a kernel call, and most OSes re-evaluate scheduling on return from a kernel call. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > In OS kernel design, you try to avoid process herding bottlenecks. > > Here, we want them herded, and giving up the CPU may be the best way to > > do it. > > Yes, but if everyone yeilds you're back where you started, and with > 128 or more backends do you really want to cause possibly that many > context switches per fsync? You are going to make a kernel call and yield anyway in order to fsync, so why not try yielding first? If someone else does the fsync in the meantime, we don't need to do it. I am suggesting re-checking the need for fsync after the return from sleep(0). -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Larry Rosenman <ler@lerctr.org> [001116 12:09] wrote: > * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]: > > > > This sounds like an interesting approach, yes. > > > Question: Is sleep(0) guaranteed to at least give up control? > > > > > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > > > just cancels the alarm... > > > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel > > call return. > BUT, do we know for sure that sleep(0) is not optimized in the library > to just return? sleep(3) should conform to POSIX specification, if anyone has the reference they can check it to see what the effect of sleep(0) should be. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."
> * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]: > > > > This sounds like an interesting approach, yes. > > > Question: Is sleep(0) guaranteed to at least give up control? > > > > > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > > > just cancels the alarm... > > > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel > > call return. > BUT, do we know for sure that sleep(0) is not optimized in the library > to just return? We can only do our best here. I think guessing whether other backends are _about_ to commit is pretty shaky, and sleeping every time is a waste. This seems the cleanest. Funny you should mention the optimization. I just checked BSDI and saw:
	u_int
	sleep(secs)
		u_int secs;
	{
		struct timeval nt, ot;
		long diff;
		int rc;
		if (secs == 0)
			return (0);
So maybe we need another _fake_ kernel call, or a select/usleep with a very small value. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian writes: > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > > just cancels the alarm... > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel > call return. In glibc, sleep(0) just does "return 0;", so if the compiler has a good day the call will disappear completely. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Alfred Perlstein <bright@wintelcom.net> writes: > It might make more sense to keep a private copy of the last time > the file was modified per-backend by that particular backend and > a timestamp of the last fsync shared globally so one can forgo the > fsync if "it hasn't been dirtied by me since the last fsync" > This would provide a rendevous point for the fsync call although > cost more as one would need to periodically call gettimeofday to > set the modified by me timestamp as well as the post-fsync shared > timestamp. That's the hard way to do it. We just need to keep track of the endpoint of the log as of the last fsync. You need to fsync (after returning from sleep()) iff your commit record position > fsync endpoint. No need to ask the kernel for time-of-day. regards, tom lane
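The rule stated above (fsync only if your commit record lies beyond the last fsync endpoint) can be sketched in C as follows. This is a single-process illustration only: the `LogPos` and `LogState` types and the function names are assumptions invented for the example, the actual fsync call is replaced by a counter, and a real server would keep this state in shared memory under a lock.

```c
#include <assert.h>

typedef unsigned long LogPos;   /* illustrative stand-in for a log byte position */

/* Hypothetical bookkeeping, not taken from xlog.c. */
typedef struct
{
    LogPos written_upto;   /* end of log already handed to the kernel */
    LogPos flushed_upto;   /* end of log as of the last fsync */
    int    fsync_calls;    /* counts simulated fsyncs */
} LogState;

/* True iff a commit record ending at commit_end is not yet on disk. */
int
needs_fsync(const LogState *log, LogPos commit_end)
{
    return commit_end > log->flushed_upto;
}

/* Called by a backend after its pre-fsync sleep or yield. */
void
commit_flush(LogState *log, LogPos commit_end)
{
    assert(commit_end <= log->written_upto);
    if (!needs_fsync(log, commit_end))
        return;             /* someone else's fsync already covered us */
    /* A real backend would call fsync() on the log file here. */
    log->fsync_calls++;
    log->flushed_upto = log->written_upto;
}
```

If three commit records end at positions 100, 200 and 300 and the backend holding 300 flushes first, the other two find their positions at or below the endpoint and skip the fsync. No time-of-day call is involved, only a position comparison.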
* Tom Lane <tgl@sss.pgh.pa.us> [001116 13:31] wrote: > Alfred Perlstein <bright@wintelcom.net> writes: > > It might make more sense to keep a private copy of the last time > > the file was modified per-backend by that particular backend and > > a timestamp of the last fsync shared globally so one can forgo the > > fsync if "it hasn't been dirtied by me since the last fsync" > > This would provide a rendevous point for the fsync call although > > cost more as one would need to periodically call gettimeofday to > > set the modified by me timestamp as well as the post-fsync shared > > timestamp. > > That's the hard way to do it. We just need to keep track of the > endpoint of the log as of the last fsync. You need to fsync (after > returning from sleep()) iff your commit record position > fsync > endpoint. No need to ask the kernel for time-of-day. Well, that breaks when you move to an overwriting storage manager; however, if you use the OID instead, that optimization would survive the change to an overwriting storage manager, wouldn't it? -Alfred
* Bruce Momjian <pgman@candle.pha.pa.us> [001116 12:31] wrote: > > > In OS kernel design, you try to avoid process herding bottlenecks. > > > Here, we want them herded, and giving up the CPU may be the best way to > > > do it. > > > > Yes, but if everyone yeilds you're back where you started, and with > > 128 or more backends do you really want to cause possibly that many > > context switches per fsync? > > You are going to kernel call/yield anyway to fsync, so why not try and > if someone does the fsync, we don't need to do it. I am suggesting > re-checking the need for fsync after the return from sleep(0). It might make more sense for each backend to keep a private timestamp of the last time it modified the file, plus a globally shared timestamp of the last fsync, so that a backend can forgo the fsync if "it hasn't been dirtied by me since the last fsync". This would provide a rendezvous point for the fsync call, although it would cost more, since one would need to call gettimeofday periodically to set the modified-by-me timestamp as well as the post-fsync shared timestamp. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."
Alfred Perlstein <bright@wintelcom.net> writes: >> That's the hard way to do it. We just need to keep track of the >> endpoint of the log as of the last fsync. > Well that breaks when you move to a overwriting storage manager, No, because the log is just a series of records written sequentially --- it has nothing to do with storage management in data files. regards, tom lane
On Thu, 16 Nov 2000, Alfred Perlstein wrote: > * Larry Rosenman <ler@lerctr.org> [001116 12:09] wrote: > > * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]: > > > > > This sounds like an interesting approach, yes. > > > > Question: Is sleep(0) guaranteed to at least give up control? > > > > > > > > The way I read my UnixWare 7's man page, it might not, since alarm(0) > > > > just cancels the alarm... > > > > > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel > > > call return. > > BUT, do we know for sure that sleep(0) is not optimized in the library > > to just return? > > sleep(3) should conform to POSIX specification, if anyone has the > reference they can check it to see what the effect of sleep(0) > should be. Yes, but Posix also specifies sched_yield() which rather explicitly allows a process to yield its timeslice. No idea how well that is supported. > -- > -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] > "I have the heart of a child; I keep it in a jar on my desk." Tom
> > sleep(3) should conform to POSIX specification, if anyone has the > > reference they can check it to see what the effect of sleep(0) > > should be. > > Yes, but Posix also specifies sched_yield() which rather explicitly > allows a process to yield its timeslice. No idea how well that is > supported. I have it on BSDI. We could add a configure check, and use it if it is there. Another idea is to set a shared memory flag when a backend enters the 'commit' section of the transaction code. That way, a backend could check to see if another process is _about_ to commit, and wait. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
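A configure-guarded yield of the kind suggested above might look like the following C sketch. `HAVE_SCHED_YIELD` is a hypothetical macro name that a configure test would define, and `yield_cpu` is an invented helper. The fallback uses `select()` with a very small timeout, the alternative floated earlier in the thread, on the assumption that it reaches the kernel where `sleep(0)` may not.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#ifdef HAVE_SCHED_YIELD         /* hypothetical result of a configure check */
#include <sched.h>
#endif

/*
 * Give other runnable processes a chance before we check whether an
 * fsync is still needed.  Returns 0 on success, -1 on error.
 */
int
yield_cpu(void)
{
#ifdef HAVE_SCHED_YIELD
    return sched_yield();
#else
    struct timeval tv;

    tv.tv_sec = 0;
    tv.tv_usec = 1;             /* smallest non-zero timeout */
    /* With no descriptors, select() simply sleeps until the timeout. */
    return select(0, NULL, NULL, NULL, &tv);
#endif
}
```

Whether either path actually lets another backend run is up to the scheduler; the only guarantee assumed here is that the call returns and the caller then re-checks the need for fsync.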
* Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]: > > > sleep(3) should conform to POSIX specification, if anyone has the > > > reference they can check it to see what the effect of sleep(0) > > > should be. > > > > Yes, but Posix also specifies sched_yield() which rather explicitly > > allows a process to yield its timeslice. No idea how well that is > > supported. > > I have it on BSDI. We could add a configure check, and use it if it is > there. Another idea is to add a shared memory flag when someone enters > the 'commit' section of the transaction code. That way, a backend could > check to see if another process is _about_ to commit, and wait. On UnixWare, it requires the -Kthread or -Kpthread command, which then links in the threads library... I'm not sure that this is a good thing or not.... LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
> * Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]: > > > > sleep(3) should conform to POSIX specification, if anyone has the > > > > reference they can check it to see what the effect of sleep(0) > > > > should be. > > > > > > Yes, but Posix also specifies sched_yield() which rather explicitly > > > allows a process to yield its timeslice. No idea how well that is > > > supported. > > > > I have it on BSDI. We could add a configure check, and use it if it is > > there. Another idea is to add a shared memory flag when someone enters > > the 'commit' section of the transaction code. That way, a backend could > > check to see if another process is _about_ to commit, and wait. > On UnixWare, it requires the -Kthread or -Kpthread command, which then > links in the threads library... > > I'm not sure that this is a good thing or not.... I would hope it just calls the function, and does not bring in thread startup stuff. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
* Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:39]: > > * Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]: > > > > > sleep(3) should conform to POSIX specification, if anyone has the > > > > > reference they can check it to see what the effect of sleep(0) > > > > > should be. > > > > > > > > Yes, but Posix also specifies sched_yield() which rather explicitly > > > > allows a process to yield its timeslice. No idea how well that is > > > > supported. > > > > > > I have it on BSDI. We could add a configure check, and use it if it is > > > there. Another idea is to add a shared memory flag when someone enters > > > the 'commit' section of the transaction code. That way, a backend could > > > check to see if another process is _about_ to commit, and wait. > > On UnixWare, it requires the -Kthread or -Kpthread command, which then > > links in the threads library... > > > > I'm not sure that this is a good thing or not.... > > I would hope it just calls the function, and does not bring in thread > startup stuff. I suspect it DOES bring in the thread startup and all that implies... Tread lightly. The good news is UnixWare Threads are LWP's and the kernel is multithreaded... LER > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
> > sleep(3) should conform to POSIX specification, if anyone has the > > reference they can check it to see what the effect of sleep(0) > > should be. > > Yes, but Posix also specifies sched_yield() which rather explicitly > allows a process to yield its timeslice. No idea how well that is > supported. OK, I have a new idea. There are two parts to transaction commit. The first is writing all dirty buffers or log changes to the kernel, and the second is the fsync of the log file. I suggest having a per-backend shared memory byte that has the following values:
	START_LOG_WRITE
	WAIT_ON_FSYNC
	NOT_IN_COMMIT
	backend_number_doing_fsync
I suggest that when each backend starts a commit, it sets its byte to START_LOG_WRITE. When it gets ready to fsync, it checks all backends. If all are NOT_IN_COMMIT, it does fsync and continues. If one or more are in START_LOG_WRITE, it waits until no one is in START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the lowest backend in WAIT_ON_FSYNC, marks all others with its backend number, and does fsync. It then clears all backends with its number to NOT_IN_COMMIT. Other backends will see they are not the lowest WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT so they can then continue, knowing their data was synced. This allows a single backend not to sleep, and allows multiple backends to bunch up only when they are all about to commit. The reason backend numbers are written is so other backends entering the commit code will not interfere with the backends performing fsync. Comments? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
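The decision each backend makes under the proposal above can be sketched in C. This is only an illustration of the stated rules: it uses an `int` per backend rather than a byte, encodes the three named states as negative values so that any value >= 0 can stand for "backend number doing fsync", and invents the `Action` type and function names. It has no locking or atomic operations, which a real shared-memory version would need.

```c
#include <assert.h>

/* Illustrative encoding: states are negative, values >= 0 name the leader. */
enum { NOT_IN_COMMIT = -1, START_LOG_WRITE = -2, WAIT_ON_FSYNC = -3 };

typedef enum
{
    ACT_DONE,               /* our data is synced, continue */
    ACT_WAIT_FOR_WRITERS,   /* someone is still writing log records */
    ACT_LEAD_FSYNC,         /* we are the lowest waiter: do the fsync */
    ACT_WAIT_FOR_LEADER     /* another backend will fsync for us */
} Action;

/* Decide what backend `me` should do once it is ready to fsync. */
Action
next_action(const int *status, int n, int me)
{
    int i;

    assert(me >= 0 && me < n);
    if (status[me] == NOT_IN_COMMIT)
        return ACT_DONE;
    if (status[me] >= 0)
        return ACT_WAIT_FOR_LEADER;     /* already claimed by a leader */
    assert(status[me] == WAIT_ON_FSYNC);
    for (i = 0; i < n; i++)
        if (status[i] == START_LOG_WRITE)
            return ACT_WAIT_FOR_WRITERS;
    for (i = 0; i < me; i++)
        if (status[i] == WAIT_ON_FSYNC)
            return ACT_WAIT_FOR_LEADER; /* a lower-numbered waiter leads */
    return ACT_LEAD_FSYNC;
}

/* Leader marks every other waiter with its number; returns how many. */
int
leader_claim(int *status, int n, int me)
{
    int i, claimed = 0;

    for (i = 0; i < n; i++)
        if (i != me && status[i] == WAIT_ON_FSYNC)
        {
            status[i] = me;
            claimed++;
        }
    return claimed;
}

/* After the fsync, release the leader and every backend it claimed. */
void
leader_release(int *status, int n, int me)
{
    int i;

    for (i = 0; i < n; i++)
        if (i == me || status[i] == me)
            status[i] = NOT_IN_COMMIT;
}
```

A lone committer falls straight through to ACT_LEAD_FSYNC with nobody to claim, which matches the goal that a single backend never sleeps. How the waiting states actually block is left open here.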
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Other backend will see they are not the lowest > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT > so they can then continue, knowing their data was synced. How will they wait? Without a semaphore involved, your answer must be either "timed sleep" or "busy-wait loop", neither of which is attractive ... regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Other backend will see they are not the lowest > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT > > so they can then continue, knowing their data was synced. > > How will they wait? Without a semaphore involved, your answer must > be either "timed sleep" or "busy-wait loop", neither of which is > attractive ... Yes, either timed sleep or busy-wait. One nifty trick would be for each backend that is not going to do the fsync to just sleep with signals enabled, and for the fsyncing backend to signal the other backends to exit their sleep. That way, only one backend does the checking. This sleep thing was going to be a problem anyway with the old system. At least this way, they sleep/check only in cases where it is valuable. Can we use a semaphore for this system? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Other backend will see they are not the lowest > > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT > > > so they can then continue, knowing their data was synced. > > > > How will they wait? Without a semaphore involved, your answer must > > be either "timed sleep" or "busy-wait loop", neither of which is > > attractive ... > > Yes, either timed sleep or busy-wait. One nifty trick would be for each > backend that is not going to do the fsync to just sleep with signals > enabled, and for the fsyncing backend to signal the other backends to > exit their sleep. That way, only one backend does the checking. > > This sleep thing was going to be a problem anyway with the old system. > At least this way, they sleep/check only in cases where it is valuable. I have another idea. If a backend gets to the point where it needs an fsync, and there is another backend in START_LOG_WRITE, it can go into an interruptible sleep, knowing that the other backend will perform the fsync and wake it up. Therefore, there is no busy-wait or timed sleep. Of course, a backend must set its status to WAIT_ON_FSYNC before sleeping to avoid a race condition. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Tom Lane <tgl@sss.pgh.pa.us> [001117 23:21]: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Other backend will see they are not the lowest > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT > > so they can then continue, knowing their data was synced. > > How will they wait? Without a semaphore involved, your answer must > be either "timed sleep" or "busy-wait loop", neither of which is > attractive ... how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? > > regards, tom lane -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
> * Tom Lane <tgl@sss.pgh.pa.us> [001117 23:21]: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Other backend will see they are not the lowest > > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT > > > so they can then continue, knowing their data was synced. > > > > How will they wait? Without a semaphore involved, your answer must > > be either "timed sleep" or "busy-wait loop", neither of which is > > attractive ... > how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? Looks like a winner. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? > Looks like a winner. sigpause() is a BSD-ism, and not part of any recognized standard according to my HP man pages. How portable do you think it is? regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? > > > Looks like a winner. > > sigpause() is a BSD-ism, and not part of any recognized standard > according to my HP man pages. How portable do you think it is? Good point. I get on BSDI: The sigpause function call appeared in 4.2BSD and has been deprecated. The standard is sigsuspend: The sigsuspend function call conforms to IEEE Std 1003.1-1988 (``POSIX''). -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
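For reference, a wait built on the POSIX sigsuspend call could be sketched in C as below. The helper names and the `woken` flag are invented for the example, and the signal number is left as a parameter because which signal a backend could spare is a separate question. The signal is kept blocked except inside `sigsuspend()`, so a wakeup sent just before the wait is not lost.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t woken = 0;

static void
on_wakeup(int signo)
{
    (void) signo;
    woken = 1;
}

/*
 * Install the handler and block signo, so that it can only be delivered
 * while we sit in sigsuspend().  The previous mask is saved in *old_mask.
 */
int
wakeup_init(int signo, sigset_t *old_mask)
{
    struct sigaction sa;
    sigset_t block;

    assert(old_mask != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_wakeup;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;
    sigemptyset(&block);
    sigaddset(&block, signo);
    return sigprocmask(SIG_BLOCK, &block, old_mask);
}

/* Sleep until the handler has run, then reset the flag for the next wait. */
void
wait_for_wakeup(int signo, const sigset_t *old_mask)
{
    sigset_t wait_mask = *old_mask;

    sigdelset(&wait_mask, signo);       /* unblock only while suspended */
    while (!woken)
        sigsuspend(&wait_mask);
    woken = 0;
}
```

The fsyncing backend would send the chosen signal with `kill()` to each backend it had claimed; that side is not shown.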
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>>>> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ?

> The standard is sigsuspend:

OK, we can probably assume that at least one of sigsuspend or sigpause
is available everywhere.  Now all you need is a free signal number.
Unfortunately we're already using both SIGUSR1 and SIGUSR2.

                        regards, tom lane
Larry Rosenman writes:

> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ?

Both of these signals are already used.

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >>>>> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ?
>
> > The standard is sigsuspend:
>
> OK, we can probably assume that at least one of sigsuspend or sigpause
> is available everywhere.  Now all you need is a free signal number.
> Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Oh, I didn't want to hear that one.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane writes:

> OK, we can probably assume that at least one of sigsuspend or sigpause
> is available everywhere.

#ifdef HAVE_POSIX_SIGNALS should tell you.

> Now all you need is a free signal number.  Unfortunately we're already
> using both SIGUSR1 and SIGUSR2.

Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
move quickdie() to SIGQUIT, and then you've got SIGUSR1 free.

(That would even make sense in two ways: 1) SIGQUIT would actually cause
the guy to quit; 2) there is a correspondence between postmaster and
postgres signals.)

--
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/
> Tom Lane writes:
>
> > OK, we can probably assume that at least one of sigsuspend or sigpause
> > is available everywhere.
>
> #ifdef HAVE_POSIX_SIGNALS should tell you.
>
> > Now all you need is a free signal number.  Unfortunately we're already
> > using both SIGUSR1 and SIGUSR2.
>
> Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
> move quickdie() to SIGQUIT, and then you've got SIGUSR1 free.
>
> (That would even make sense in two ways: 1) SIGQUIT would actually cause
> the guy to quit; 2) there is a correspondence between postmaster and
> postgres signals.)

Good idea.  Of course, this assumes my idea was valid.  Was it?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Peter Eisentraut <peter_e@gmx.net> writes:
>> Now all you need is a free signal number.  Unfortunately we're already
>> using both SIGUSR1 and SIGUSR2.

> Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
> move quickdie() to SIGQUIT, and then you've got SIGUSR1 free.

> (That would even make sense in two ways: 1) SIGQUIT would actually cause
> the guy to quit; 2) there is a correspondence between postmaster and
> postgres signals.)

Seems like a plan.  The current definition of backend SIGQUIT is really
stupid anyway --- what's the value of forcing an error asynchronously?

Also, it always bothered me that the postmaster and backend signals
weren't consistent, so I'd be inclined to make this change even if we
end up not using SIGUSR1 for Bruce's idea ...

                        regards, tom lane
> There are two parts to transaction commit.  The first is writing all
> dirty buffers or log changes to the kernel, and second is fsync of the
  ^^^^^^^^^^^^
The backend doesn't write any dirty buffers to the kernel at commit time.

> log file.

The first part is writing the commit record into the WAL buffers in shmem.
This is what XLogInsert does. After that, XLogFlush is called to ensure
that the entire commit record is on disk. XLogFlush does *both* write() and
fsync() (a single slock is used for both writing and fsyncing), if it needs
to do either at all.

> I suggest having a per-backend shared memory byte that has the following
> values:
>
>       START_LOG_WRITE
>       WAIT_ON_FSYNC
>       NOT_IN_COMMIT
>       backend_number_doing_fsync
>
> I suggest that when each backend starts a commit, it sets its byte to
> START_LOG_WRITE.
  ^^^^^^^^^^^^^^^^^^^^^^^
Isn't START_COMMIT more meaningful?

> When it gets ready to fsync, it checks all backends.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^
What do you mean by this? The moment just after XLogInsert?

> If all are NOT_IN_COMMIT, it does fsync and continues.

1st edition:
> If one or more are in START_LOG_WRITE, it waits until no one is in
> START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
> lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> number, and does fsync.  It then clears all backends with its number to
> NOT_IN_COMMIT.  Other backend will see they are not the lowest
> WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> so they can then continue, knowing their data was synced.

2nd edition:
> I have another idea.  If a backend gets to the point that it needs
> fsync, and there is another backend in START_LOG_WRITE, it can go to an
> interruptible sleep, knowing another backend will perform the fsync and
> wake it up.  Therefore, there is no busy-wait or timed sleep.
>
> Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> race condition.

The 2nd edition is much better. But I'm not sure we really need
these per-backend bytes in shmem. Why not just have some counters?
We can use a semaphore to wake up all waiters at once.

> This allows a single backend not to sleep, and allows multiple backends
> to bunch up only when they are all about to commit.
>
> The reason backend numbers are written is so other backends entering the
> commit code will not interfere with the backends performing fsync.

A backend that has been woken up can check what has been written/fsynced
by calling XLogFlush.

Vadim
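To make the counters-plus-wake-everyone idea a bit more concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not the WAL code: log positions are plain integers, a threading.Condition stands in for the semaphore, do_fsync stands in for the write() plus fsync() that XLogFlush performs, and the names GroupCommitLog and run_backends are invented for the example.

```python
import threading
import time


class GroupCommitLog:
    """Toy shared log: `inserted` and `flushed` are the shared counters."""

    def __init__(self):
        self._cond = threading.Condition()
        self.inserted = 0      # highest position placed in the in-memory log
        self.flushed = 0       # highest position known to be on disk
        self.flushing = False  # True while some thread is doing the flush
        self.fsync_calls = 0   # number of successful physical flushes

    def insert(self):
        """Append a commit record and return its position (cf. XLogInsert)."""
        with self._cond:
            self.inserted += 1
            return self.inserted

    def flush(self, upto, do_fsync=lambda: None):
        """Return once `upto` is on disk (cf. XLogFlush).

        Returns True if this caller did the flush, False if it rode along
        on somebody else's.
        """
        with self._cond:
            # Sleep while another thread's flush is in progress, then re-check.
            while self.flushed < upto and self.flushing:
                self._cond.wait()
            if self.flushed >= upto:
                return False
            self.flushing = True
            target = self.inserted  # everything inserted so far rides along
        ok = False
        try:
            do_fsync()  # the slow part runs outside the lock
            ok = True
        finally:
            with self._cond:
                self.flushing = False
                if ok:
                    self.flushed = max(self.flushed, target)
                    self.fsync_calls += 1
                self._cond.notify_all()  # wake all waiters at once
        return True


def run_backends(n, fsync_delay=0.01):
    """Run n toy "backends" that each insert one commit record and flush it."""
    log = GroupCommitLog()

    def backend():
        pos = log.insert()
        log.flush(pos, do_fsync=lambda: time.sleep(fsync_delay))

    threads = [threading.Thread(target=backend) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

A lone backend never sleeps in this model; it only waits when another flush is already under way. How many flushes run_backends ends up issuing depends on thread timing, so the only safe claim is that it is somewhere between 1 and n.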
> > There are two parts to transaction commit.  The first is writing all
> > dirty buffers or log changes to the kernel, and second is fsync of the
>   ^^^^^^^^^^^^
> The backend doesn't write any dirty buffers to the kernel at commit time.

Yes, I suspected that.

> > log file.
>
> The first part is writing the commit record into the WAL buffers in shmem.
> This is what XLogInsert does. After that, XLogFlush is called to ensure
> that the entire commit record is on disk. XLogFlush does *both* write() and
> fsync() (a single slock is used for both writing and fsyncing), if it needs
> to do either at all.

Yes, I realize there are new steps in WAL.

> > I suggest having a per-backend shared memory byte that has the following
> > values:
> >
> >     START_LOG_WRITE
> >     WAIT_ON_FSYNC
> >     NOT_IN_COMMIT
> >     backend_number_doing_fsync
> >
> > I suggest that when each backend starts a commit, it sets its byte to
> > START_LOG_WRITE.
>   ^^^^^^^^^^^^^^^^^^^^^^^
> Isn't START_COMMIT more meaningful?

Yes.

> > When it gets ready to fsync, it checks all backends.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^
> What do you mean by this? The moment just after XLogInsert?

Just before it calls fsync().

> > If all are NOT_IN_COMMIT, it does fsync and continues.
>
> 1st edition:
> > If one or more are in START_LOG_WRITE, it waits until no one is in
> > START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
> > lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> > number, and does fsync.  It then clears all backends with its number to
> > NOT_IN_COMMIT.  Other backend will see they are not the lowest
> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > so they can then continue, knowing their data was synced.
>
> 2nd edition:
> > I have another idea.  If a backend gets to the point that it needs
> > fsync, and there is another backend in START_LOG_WRITE, it can go to an
> > interruptible sleep, knowing another backend will perform the fsync and
> > wake it up.  Therefore, there is no busy-wait or timed sleep.
> >
> > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> > race condition.
>
> The 2nd edition is much better. But I'm not sure we really need
> these per-backend bytes in shmem. Why not just have some counters?
> We can use a semaphore to wake up all waiters at once.

Yes, that is much better and clearer.  My idea was just to say, "if no
one is entering commit phase, do the commit.  If someone else is coming,
sleep and wait for them to do the fsync and wake me up with a signal."

> > This allows a single backend not to sleep, and allows multiple backends
> > to bunch up only when they are all about to commit.
> >
> > The reason backend numbers are written is so other backends entering the
> > commit code will not interfere with the backends performing fsync.
>
> A backend that has been woken up can check what has been written/fsynced
> by calling XLogFlush.

Seems that may not be needed anymore with a counter.  The only issue is
that other backends may enter commit while fsync() is happening.  The
process that did the fsync must be sure to wake up only the backends
that were waiting for it, and not other backends that may also be
doing fsync as a group while the first fsync was happening.  I leave
those details to people more experienced.  :-)

I am just glad people liked my idea.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
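One way to read the concern about late arrivals, sketched in Python under the counter idea: if the shared state records how far the log has been fsynced, it matters less who gets woken, because each backend can compare its own commit position against that counter and go back to sleep if it is not covered yet. The function name partition_waiters, the integer positions and the backend ids below are all made up for illustration.

```python
def partition_waiters(commit_positions, flushed_upto):
    """Split sleeping backends into those a finished fsync covered and the rest.

    commit_positions maps a (hypothetical) backend id to the log position of
    its commit record; flushed_upto is the position the finished fsync covered.
    """
    covered = sorted(
        b for b, pos in commit_positions.items() if pos <= flushed_upto
    )
    still_waiting = sorted(
        b for b, pos in commit_positions.items() if pos > flushed_upto
    )
    return covered, still_waiting


# Backends 1-3 wrote their commit records before the fsync began (it covered
# positions up to 30); backends 4 and 5 entered commit while it was running.
waiters = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}
covered, still_waiting = partition_waiters(waiters, flushed_upto=30)
```

Under these assumptions, backends 1 to 3 may report commit, while 4 and 5 have to wait for (or perform) the next fsync, even if they were woken along with everyone else.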
Added to TODO.detail and TODO list.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026