Thread: RE: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

RE: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

"Mikheev, Vadim"

Date:

15 November 2000, 14:05:52

> Earlier, Vadim was talking about arranging to share fsyncs of the WAL
> log file across transactions (after writing your commit record to the
> log, sleep a few milliseconds to see if anyone else fsyncs before you
> do; if not, issue the fsync yourself).  That would offer less-than-
> one-fsync-per-transaction performance without giving up any 
> guarantees.
> If people feel a compulsion to have a tunable parameter, let 'em tune
> the length of the pre-fsync sleep ...

Already implemented (without ability to tune this parameter - 
xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
backend sleeps 1/200 sec before checking/forcing log fsync.

Vadim

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 00:33:13

> > Earlier, Vadim was talking about arranging to share fsyncs of the WAL
> > log file across transactions (after writing your commit record to the
> > log, sleep a few milliseconds to see if anyone else fsyncs before you
> > do; if not, issue the fsync yourself).  That would offer less-than-
> > one-fsync-per-transaction performance without giving up any 
> > guarantees.
> > If people feel a compulsion to have a tunable parameter, let 'em tune
> > the length of the pre-fsync sleep ...
> 
> Already implemented (without ability to tune this parameter - 
> xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
> backend sleeps 1/200 sec before checking/forcing log fsync.

But it returns _completed_ to the client before sleeping, right?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

"Vadim Mikheev"

Date:

16 November 2000, 02:57:42

> > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL
> > > log file across transactions (after writing your commit record to the
> > > log, sleep a few milliseconds to see if anyone else fsyncs before you
> > > do; if not, issue the fsync yourself).  That would offer less-than-
> > > one-fsync-per-transaction performance without giving up any 
> > > guarantees.
> > > If people feel a compulsion to have a tunable parameter, let 'em tune
> > > the length of the pre-fsync sleep ...
> > 
> > Already implemented (without ability to tune this parameter - 
> > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
> > backend sleeps 1/200 sec before checking/forcing log fsync.
> 
> But it returns _completed_ to the client before sleeping, right?

No.

Vadim

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 12:03:34

> > > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL
> > > > log file across transactions (after writing your commit record to the
> > > > log, sleep a few milliseconds to see if anyone else fsyncs before you
> > > > do; if not, issue the fsync yourself).  That would offer less-than-
> > > > one-fsync-per-transaction performance without giving up any 
> > > > guarantees.
> > > > If people feel a compulsion to have a tunable parameter, let 'em tune
> > > > the length of the pre-fsync sleep ...
> > > 
> > > Already implemented (without ability to tune this parameter - 
> > > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
> > > backend sleeps 1/200 sec before checking/forcing log fsync.
> > 
> > But it returns _completed_ to the client before sleeping, right?
> 
> No.

Eww, so we have this 1/200 second delay for every transaction.  Seems
bad to me.


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Alfred Perlstein

Date:

16 November 2000, 12:52:07

* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:
> > > > > Earlier, Vadim was talking about arranging to share fsyncs of the WAL
> > > > > log file across transactions (after writing your commit record to the
> > > > > log, sleep a few milliseconds to see if anyone else fsyncs before you
> > > > > do; if not, issue the fsync yourself).  That would offer less-than-
> > > > > one-fsync-per-transaction performance without giving up any 
> > > > > guarantees.
> > > > > If people feel a compulsion to have a tunable parameter, let 'em tune
> > > > > the length of the pre-fsync sleep ...
> > > > 
> > > > Already implemented (without ability to tune this parameter - 
> > > > xact.c:CommitDelay, - yet). Currently CommitDelay is 5, so
> > > > backend sleeps 1/200 sec before checking/forcing log fsync.
> > > 
> > > But it returns _completed_ to the client before sleeping, right?
> > 
> > No.
> 
> Eww, so we have this 1/200 second delay for every transaction.  Seems
> bad to me.

I think as long as it becomes a tunable this isn't a bad idea at
all.  Fixing it at 1/200 isn't so great because people not wrapping
large amounts of inserts/updates with transaction blocks will
suffer.
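
To put rough numbers on that concern, here is a back-of-the-envelope
sketch in C.  It assumes only the 5 ms (1/200 sec) CommitDelay quoted
earlier in the thread; the function names are made up for illustration
and nothing here is taken from xact.c.

```c
#include <assert.h>

/* CommitDelay as described earlier in the thread: 5 ms, i.e. 1/200 sec. */
#define COMMIT_DELAY_MS 5

/*
 * Upper bound on commits per second for a single backend that sleeps
 * delay_ms on every commit (ignores all other commit costs).
 */
static long max_commits_per_sec(long delay_ms)
{
    assert(delay_ms > 0);
    return 1000 / delay_ms;
}

/* Minimum seconds spent sleeping across n_transactions separate commits. */
static double min_sleep_seconds(long n_transactions, long delay_ms)
{
    return (double) n_transactions * (double) delay_ms / 1000.0;
}
```

With these assumptions a lone backend tops out at 200 commits per
second, and 10,000 unwrapped single-row inserts spend at least 50
seconds asleep, while the same inserts in one transaction block pay
the delay once.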

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Don Baccus

Date:

16 November 2000, 13:13:18

At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote:
>* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:

>> Eww, so we have this 1/200 second delay for every transaction.  Seems
>> bad to me.
>
>I think as long as it becomes a tunable this isn't a bad idea at
>all.  Fixing it at 1/200 isn't so great because people not wrapping
>large amounts of inserts/updates with transaction blocks will
>suffer.

I think the default should probably be no delay, and the documentation
on enabling this needs to be clear and obvious (i.e. hard to miss).



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 14:15:15

> At 09:32 AM 11/16/00 -0800, Alfred Perlstein wrote:
> >* Bruce Momjian <pgman@candle.pha.pa.us> [001116 08:59] wrote:
> 
> >> Eww, so we have this 1/200 second delay for every transaction.  Seems
> >> bad to me.
> >
> >I think as long as it becomes a tunable this isn't a bad idea at
> >all.  Fixing it at 1/200 isn't so great because people not wrapping
> >large amounts of inserts/updates with transaction blocks will
> >suffer.
> 
> I think the default should probably be no delay, and the documentation
> on enabling this needs to be clear and obvious (i.e. hard to miss).

I just talked to Tom Lane about this.  I think a sleep(0) just before
the flush would be the best.  It would relinquish the cpu slice if
another process is ready to run.  If no other backend is running, it
probably just returns.  If there is another one, it gives it a chance to
complete.  On return from sleep(0), it can check if it still needs to
flush.  This would tend to bunch up flushers so they flush only once,
while not delaying cases where only one backend is running.


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Don Baccus

Date:

16 November 2000, 14:45:58

At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:

>> I think the default should probably be no delay, and the documentation
>> on enabling this needs to be clear and obvious (i.e. hard to miss).
>
>I just talked to Tom Lane about this.  I think a sleep(0) just before
>the flush would be the best.  It would relinquish the cpu slice if
>another process is ready to run.  If no other backend is running, it
>probably just returns.  If there is another one, it gives it a chance to
>complete.  On return from sleep(0), it can check if it still needs to
>flush.  This would tend to bunch up flushers so they flush only once,
>while not delaying cases where only one backend is running.

This sounds like an interesting approach, yes.




Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Larry Rosenman

Date:

16 November 2000, 14:55:13

* Don Baccus <dhogaza@pacifier.com> [001116 13:46]:
> At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
> 
> >> I think the default should probably be no delay, and the documentation
> >> on enabling this needs to be clear and obvious (i.e. hard to miss).
> >
> >I just talked to Tom Lane about this.  I think a sleep(0) just before
> >the flush would be the best.  It would relinquish the cpu slice if
> >another process is ready to run.  If no other backend is running, it
> >probably just returns.  If there is another one, it gives it a chance to
> >complete.  On return from sleep(0), it can check if it still needs to
> >flush.  This would tend to bunch up flushers so they flush only once,
> >while not delaying cases where only one backend is running.
> 
> This sounds like an interesting approach, yes.
Question: Is sleep(0) guaranteed to at least give up control? 

The way I read my UnixWare 7's man page, it might not, since alarm(0)
just cancels the alarm...

Larry
> 
> 
> 
> - Don Baccus, Portland OR <dhogaza@pacifier.com>
>   Nature photos, on-line guides, Pacific Northwest
>   Rare Bird Alert Service and other goodies at
>   http://donb.photo.net.
-- 
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 14:55:23

> At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
> 
> >> I think the default should probably be no delay, and the documentation
> >> on enabling this needs to be clear and obvious (i.e. hard to miss).
> >
> >I just talked to Tom Lane about this.  I think a sleep(0) just before
> >the flush would be the best.  It would relinquish the cpu slice if
> >another process is ready to run.  If no other backend is running, it
> >probably just returns.  If there is another one, it gives it a chance to
> >complete.  On return from sleep(0), it can check if it still needs to
> >flush.  This would tend to bunch up flushers so they flush only once,
> >while not delaying cases where only one backend is running.
> 
> This sounds like an interesting approach, yes.

In OS kernel design, you try to avoid process herding bottlenecks. 
Here, we want them herded, and giving up the CPU may be the best way to
do it.


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Larry Rosenman

Date:

16 November 2000, 15:05:37

* Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
> > > This sounds like an interesting approach, yes.
> > Question: Is sleep(0) guaranteed to at least give up control? 
> > 
> > The way I read my UnixWare 7's man page, it might not, since alarm(0)
> > just cancels the alarm...
> 
> Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
> call return.
BUT, do we know for sure that sleep(0) is not optimized in the library
to just return? 
-- 
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Alfred Perlstein

Date:

16 November 2000, 15:14:07

* Bruce Momjian <pgman@candle.pha.pa.us> [001116 11:59] wrote:
> > At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
> > 
> > >> I think the default should probably be no delay, and the documentation
> > >> on enabling this needs to be clear and obvious (i.e. hard to miss).
> > >
> > >I just talked to Tom Lane about this.  I think a sleep(0) just before
> > >the flush would be the best.  It would relinquish the cpu slice if
> > >another process is ready to run.  If no other backend is running, it
> > >probably just returns.  If there is another one, it gives it a chance to
> > >complete.  On return from sleep(0), it can check if it still needs to
> > >flush.  This would tend to bunch up flushers so they flush only once,
> > >while not delaying cases where only one backend is running.
> > 
> > This sounds like an interesting approach, yes.
> 
> In OS kernel design, you try to avoid process herding bottlenecks. 
> Here, we want them herded, and giving up the CPU may be the best way to
> do it.

Yes, but if everyone yields you're back where you started, and with
128 or more backends do you really want to cause possibly that many
context switches per fsync?

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 15:14:34

> * Don Baccus <dhogaza@pacifier.com> [001116 13:46]:
> > At 02:13 PM 11/16/00 -0500, Bruce Momjian wrote:
> > 
> > >> I think the default should probably be no delay, and the documentation
> > >> on enabling this needs to be clear and obvious (i.e. hard to miss).
> > >
> > >I just talked to Tom Lane about this.  I think a sleep(0) just before
> > >the flush would be the best.  It would relinquish the cpu slice if
> > >another process is ready to run.  If no other backend is running, it
> > >probably just returns.  If there is another one, it gives it a chance to
> > >complete.  On return from sleep(0), it can check if it still needs to
> > >flush.  This would tend to bunch up flushers so they flush only once,
> > >while not delaying cases where only one backend is running.
> > 
> > This sounds like an interesting approach, yes.
> Question: Is sleep(0) guaranteed to at least give up control? 
> 
> The way I read my UnixWare 7's man page, it might not, since alarm(0)
> just cancels the alarm...

Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
call return.


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 15:24:20

> > In OS kernel design, you try to avoid process herding bottlenecks. 
> > Here, we want them herded, and giving up the CPU may be the best way to
> > do it.
> 
> Yes, but if everyone yields you're back where you started, and with
> 128 or more backends do you really want to cause possibly that many
> context switches per fsync?

You are going to make a kernel call and yield anyway to fsync, so why not
yield first?  If someone else does the fsync in the meantime, we don't
need to do it.  I am suggesting re-checking the need for fsync after the
return from sleep(0).


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Alfred Perlstein

Date:

16 November 2000, 15:29:15

* Larry Rosenman <ler@lerctr.org> [001116 12:09] wrote:
> * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
> > > > This sounds like an interesting approach, yes.
> > > Question: Is sleep(0) guaranteed to at least give up control? 
> > > 
> > > The way I read my UnixWare 7's man page, it might not, since alarm(0)
> > > just cancels the alarm...
> > 
> > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
> > call return.
> BUT, do we know for sure that sleep(0) is not optimized in the library
> to just return? 

sleep(3) should conform to POSIX specification, if anyone has the
reference they can check it to see what the effect of sleep(0)
should be.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

16 November 2000, 15:37:12

> * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
> > > > This sounds like an interesting approach, yes.
> > > Question: Is sleep(0) guaranteed to at least give up control? 
> > > 
> > > The way I read my UnixWare 7's man page, it might not, since alarm(0)
> > > just cancels the alarm...
> > 
> > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
> > call return.
> BUT, do we know for sure that sleep(0) is not optimized in the library
> to just return? 

We can only do our best here. I think guessing whether other backends
are _about_ to commit is pretty shaky, and sleeping every time is a
waste.  This seems the cleanest.

Funny you should mention the optimization.  I just checked BSDI and saw:

    u_int
    sleep(secs)
        u_int secs;
    {
        struct timeval nt, ot;
        long diff;
        int rc;

        if (secs == 0)
            return (0);

So maybe we need another _fake_ kernel call, or a select/usleep with a
very small value.
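
A minimal sketch of the select() variant, assuming a POSIX system.
The function name short_sleep_usec is hypothetical.  select() with no
file descriptors and a small timeout is a long-standing way to sleep
for less than a second; the kernel may round the timeout up to its
clock tick, so the actual delay can be longer than requested.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Block for roughly usec microseconds by calling select() with no file
 * descriptors.  Returns 0 when the timeout expires, -1 on error (for
 * example, if a signal interrupts the call).
 */
static int short_sleep_usec(long usec)
{
    struct timeval tv;

    assert(usec >= 0);
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    return select(0, NULL, NULL, NULL, &tv);
}
```

A backend would call something like short_sleep_usec(1000) just before
re-checking whether it still needs to flush.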


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Peter Eisentraut

Date:

16 November 2000, 15:54:30

Bruce Momjian writes:

> > The way I read my UnixWare 7's man page, it might not, since alarm(0)
> > just cancels the alarm...
> 
> Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
> call return.

In glibc, sleep(0) just does "return 0;", so if the compiler has a good
day the call will disappear completely.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Tom Lane

Date:

16 November 2000, 16:33:28

Alfred Perlstein <bright@wintelcom.net> writes:
> It might make more sense to keep a private copy of the last time
> the file was modified per-backend by that particular backend and
> a timestamp of the last fsync shared globally so one can forgo the
> fsync if "it hasn't been dirtied by me since the last fsync"
> This would provide a rendezvous point for the fsync call although
> cost more as one would need to periodically call gettimeofday to
> set the modified by me timestamp as well as the post-fsync shared
> timestamp.

That's the hard way to do it.  We just need to keep track of the
endpoint of the log as of the last fsync.  You need to fsync (after
returning from sleep()) iff your commit record position > fsync
endpoint.  No need to ask the kernel for time-of-day.
        regards, tom lane
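
The endpoint comparison can be sketched as follows.  This is a
single-process illustration with hypothetical names (LogState,
needs_fsync, flush_if_needed), not the code in xlog.c.  It assumes a
log position is a plain 64-bit offset, that commit_pos is the position
just past the backend's commit record, and it leaves out the locking a
real shared-memory structure would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for state kept in shared memory. */
typedef struct
{
    uint64_t fsync_endpoint;   /* end of the log as of the last fsync */
    int      fsync_calls;      /* counts fsyncs, for illustration only */
} LogState;

/* True iff the commit record ending at commit_pos is not yet on disk. */
static bool needs_fsync(const LogState *log, uint64_t commit_pos)
{
    return commit_pos > log->fsync_endpoint;
}

/*
 * Called after the pre-fsync sleep.  log_end is the end of the log as
 * written so far; one fsync makes everything up to it durable.
 */
static void flush_if_needed(LogState *log, uint64_t commit_pos, uint64_t log_end)
{
    assert(commit_pos <= log_end);
    if (!needs_fsync(log, commit_pos))
        return;                      /* another backend's fsync covered us */
    /* a real backend would fsync() the log file here */
    log->fsync_calls++;
    log->fsync_endpoint = log_end;
}
```

If three backends wake with commit records ending at 120, 150 and 180
and the log is written through 180, the first one to run does the
fsync and the other two return without a kernel call.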

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Alfred Perlstein

Date:

16 November 2000, 16:38:16

* Tom Lane <tgl@sss.pgh.pa.us> [001116 13:31] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > It might make more sense to keep a private copy of the last time
> > the file was modified per-backend by that particular backend and
> > a timestamp of the last fsync shared globally so one can forgo the
> > fsync if "it hasn't been dirtied by me since the last fsync"
> > This would provide a rendezvous point for the fsync call although
> > cost more as one would need to periodically call gettimeofday to
> > set the modified by me timestamp as well as the post-fsync shared
> > timestamp.
> 
> That's the hard way to do it.  We just need to keep track of the
> endpoint of the log as of the last fsync.  You need to fsync (after
> returning from sleep()) iff your commit record position > fsync
> endpoint.  No need to ask the kernel for time-of-day.

Well, that breaks when you move to an overwriting storage manager.
However, if you use oid instead, wouldn't that optimization survive
the change to an overwriting storage manager?

-Alfred

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Alfred Perlstein

Date:

16 November 2000, 16:41:21

* Bruce Momjian <pgman@candle.pha.pa.us> [001116 12:31] wrote:
> > > In OS kernel design, you try to avoid process herding bottlenecks. 
> > > Here, we want them herded, and giving up the CPU may be the best way to
> > > do it.
> > 
> > Yes, but if everyone yields you're back where you started, and with
> > 128 or more backends do you really want to cause possibly that many
> > context switches per fsync?
> 
> You are going to kernel call/yield anyway to fsync, so why not try and
> if someone does the fsync, we don't need to do it.  I am suggesting
> re-checking the need for fsync after the return from sleep(0).

It might make more sense to keep a private copy of the last time
the file was modified per-backend by that particular backend and
a timestamp of the last fsync shared globally so one can forgo the
fsync if "it hasn't been dirtied by me since the last fsync"

This would provide a rendezvous point for the fsync call although
cost more as one would need to periodically call gettimeofday to
set the modified by me timestamp as well as the post-fsync shared
timestamp.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Tom Lane

Date:

16 November 2000, 16:55:19

Alfred Perlstein <bright@wintelcom.net> writes:
>> That's the hard way to do it.  We just need to keep track of the
>> endpoint of the log as of the last fsync.

> Well that breaks when you move to an overwriting storage manager,

No, because the log is just a series of records written sequentially ---
it has nothing to do with storage management in data files.
        regards, tom lane

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Tom Samplonius

Date:

17 November 2000, 00:28:13

On Thu, 16 Nov 2000, Alfred Perlstein wrote:

> * Larry Rosenman <ler@lerctr.org> [001116 12:09] wrote:
> > * Bruce Momjian <pgman@candle.pha.pa.us> [001116 14:02]:
> > > > > This sounds like an interesting approach, yes.
> > > > Question: Is sleep(0) guaranteed to at least give up control? 
> > > > 
> > > > The way I read my UnixWare 7's man page, it might not, since alarm(0)
> > > > just cancels the alarm...
> > > 
> > > Well, it certainly is a kernel call, and most OS's re-evaluate on kernel
> > > call return.
> > BUT, do we know for sure that sleep(0) is not optimized in the library
> > to just return? 
> 
> sleep(3) should conform to POSIX specification, if anyone has the
> reference they can check it to see what the effect of sleep(0)
> should be.
 Yes, but Posix also specifies sched_yield() which rather explicitly
allows a process to yield its timeslice.  No idea how well that is
supported.

> -- 
> -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> "I have the heart of a child; I keep it in a jar on my desk."

Tom

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

17 November 2000, 12:24:11

> > sleep(3) should conform to POSIX specification, if anyone has the
> > reference they can check it to see what the effect of sleep(0)
> > should be.
> 
>   Yes, but Posix also specifies sched_yield() which rather explicitly
> allows a process to yield its timeslice.  No idea how well that is
> supported.

I have it on BSDI.  We could add a configure check, and use it if it is
there.  Another idea is to add a shared memory flag when someone enters
the 'commit' section of the transaction code.  That way, a backend could
check to see if another process is _about_ to commit, and wait.
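
A rough sketch of the configure-check idea, under stated assumptions.
HAVE_SCHED_YIELD is a hypothetical macro that a configure test would
define; to keep the example self-contained it is derived here from the
POSIX feature macro _POSIX_PRIORITY_SCHEDULING instead.  The name
yield_cpu and the select() fallback are likewise only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Stand-in for a configure test: trust the POSIX feature macro. */
#if defined(_POSIX_PRIORITY_SCHEDULING) && _POSIX_PRIORITY_SCHEDULING > 0
#include <sched.h>
#define HAVE_SCHED_YIELD 1
#endif

/*
 * Give other runnable backends a chance before we check whether the
 * log still needs flushing.  Returns 0 on success.
 */
static int yield_cpu(void)
{
#ifdef HAVE_SCHED_YIELD
    return sched_yield();            /* explicit yield where available */
#else
    struct timeval tv = {0, 1};      /* 1 usec; the kernel may round up */

    return select(0, NULL, NULL, NULL, &tv);
#endif
}
```

Whether linking sched_yield() drags in extra libraries on a given
platform is a separate question that a configure test would also have
to answer.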


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Larry Rosenman

Date:

17 November 2000, 12:46:20

* Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]:
> > > sleep(3) should conform to POSIX specification, if anyone has the
> > > reference they can check it to see what the effect of sleep(0)
> > > should be.
> > 
> >   Yes, but Posix also specifies sched_yield() which rather explicitly
> > allows a process to yield its timeslice.  No idea how well that is
> > supported.
> 
> I have it on BSDI.  We could add a configure check, and use it if it is
> there.  Another idea is to add a shared memory flag when someone enters
> the 'commit' section of the transaction code.  That way, a backend could
> check to see if another process is _about_ to commit, and wait.
On UnixWare, sched_yield() requires the -Kthread or -Kpthread compiler
option, which then links in the threads library...

I'm not sure whether this is a good thing or not....

LER

-- 
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Bruce Momjian

Date:

17 November 2000, 13:06:00

> * Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]:
> > > > sleep(3) should conform to POSIX specification, if anyone has the
> > > > reference they can check it to see what the effect of sleep(0)
> > > > should be.
> > > 
> > >   Yes, but Posix also specifies sched_yield() which rather explicitly
> > > allows a process to yield its timeslice.  No idea how well that is
> > > supported.
> > 
> > I have it on BSDI.  We could add a configure check, and use it if it is
> > there.  Another idea is to add a shared memory flag when someone enters
> > the 'commit' section of the transaction code.  That way, a backend could
> > check to see if another process is _about_ to commit, and wait.
> On UnixWare, it requires the -Kthread or -Kpthread command, which then
> links in the threads library...
> 
> I'm not sure that this is a good thing or not....

I would hope it just calls the function, and does not bring in thread
startup stuff.


Re: RE: [COMMITTERS] pgsql/src/backend/access/transam ( xact.c xlog.c)

From

Larry Rosenman

Date:

17 November 2000, 13:06:07

* Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:39]:
> > * Bruce Momjian <pgman@candle.pha.pa.us> [001117 11:23]:
> > > > > sleep(3) should conform to POSIX specification, if anyone has the
> > > > > reference they can check it to see what the effect of sleep(0)
> > > > > should be.
> > > > 
> > > >   Yes, but Posix also specifies sched_yield() which rather explicitly
> > > > allows a process to yield its timeslice.  No idea how well that is
> > > > supported.
> > > 
> > > I have it on BSDI.  We could add a configure check, and use it if it is
> > > there.  Another idea is to add a shared memory flag when someone enters
> > > the 'commit' section of the transaction code.  That way, a backend could
> > > check to see if another process is _about_ to commit, and wait.
> > On UnixWare, it requires the -Kthread or -Kpthread command, which then
> > links in the threads library...
> > 
> > I'm not sure that this is a good thing or not....
> 
> I would hope it just calls the function, and does not bring in thread
> startup stuff.
I suspect it DOES bring in the thread startup and all that implies...

Tread lightly.  The good news is UnixWare Threads are LWP's and the
kernel is multithreaded...

LER

> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
-- 
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 00:05:20

> > sleep(3) should conform to POSIX specification, if anyone has the
> > reference they can check it to see what the effect of sleep(0)
> > should be.
> 
>   Yes, but Posix also specifies sched_yield() which rather explicitly
> allows a process to yield its timeslice.  No idea how well that is
> supported.

OK, I have a new idea.

There are two parts to transaction commit.  The first is writing all
dirty buffers or log changes to the kernel, and second is fsync of the
log file.

I suggest having a per-backend shared memory byte that has the following
values:

    START_LOG_WRITE
    WAIT_ON_FSYNC
    NOT_IN_COMMIT
    backend_number_doing_fsync

I suggest that when each backend starts a commit, it sets its byte to
START_LOG_WRITE.  When it gets ready to fsync, it checks all backends. 
If all are NOT_IN_COMMIT, it does fsync and continues.

If one or more are in START_LOG_WRITE, it waits until no one is in
START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
lowest backend in WAIT_ON_FSYNC, marks all others with its backend
number, and does fsync.  It then clears all backends with its number to
NOT_IN_COMMIT.  Other backends will see they are not the lowest
WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
so they can then continue, knowing their data was synced.

This allows a single backend not to sleep, and allows multiple backends
to bunch up only when they are all about to commit.

The reason backend numbers are written is so other backends entering the
commit code will not interfere with the backends performing fsync.

Comments?
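
The proposal can be sketched as a single-process simulation over an
array of per-backend bytes.  Everything beyond the four state names is
an assumption made for illustration: the numeric encoding, the
MAX_BACKENDS limit, the helper names, and the choice to let the leader
mark its own byte too.  There is no locking here, and how a waiting
backend actually waits is left open.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BACKENDS 8

/* Byte values; anything below MAX_BACKENDS is the number of the
 * backend currently doing the fsync on this backend's behalf. */
enum
{
    NOT_IN_COMMIT   = 255,
    START_LOG_WRITE = 254,
    WAIT_ON_FSYNC   = 253
};

typedef unsigned char CommitByte;

/* True if any backend is still writing its log records. */
static bool any_in_log_write(const CommitByte *st, int n)
{
    for (int i = 0; i < n; i++)
        if (st[i] == START_LOG_WRITE)
            return true;
    return false;
}

/* Lowest-numbered backend in WAIT_ON_FSYNC, or -1 if there is none. */
static int lowest_waiter(const CommitByte *st, int n)
{
    for (int i = 0; i < n; i++)
        if (st[i] == WAIT_ON_FSYNC)
            return i;
    return -1;
}

/*
 * One attempt by backend me to act as fsync leader.  Returns true if
 * it performed the fsync for the group, false if it must keep waiting
 * or has already been released by another leader.
 */
static bool try_lead_fsync(CommitByte *st, int n, int me, int *fsync_calls)
{
    assert(n <= MAX_BACKENDS && me >= 0 && me < n);
    if (st[me] != WAIT_ON_FSYNC)
        return false;               /* claimed or released by a leader */
    if (any_in_log_write(st, n))
        return false;               /* let writers reach the fsync point */
    if (lowest_waiter(st, n) != me)
        return false;               /* a lower-numbered backend leads */

    for (int i = 0; i < n; i++)     /* claim the whole waiting group */
        if (st[i] == WAIT_ON_FSYNC)
            st[i] = (CommitByte) me;
    (*fsync_calls)++;               /* a real backend would fsync() here */
    for (int i = 0; i < n; i++)     /* release everyone we claimed */
        if (st[i] == (CommitByte) me)
            st[i] = NOT_IN_COMMIT;
    return true;
}
```

With backends 1 and 3 waiting and backend 2 still writing, nobody
leads; once backend 2 reaches WAIT_ON_FSYNC, backend 1 does a single
fsync and all three end up in NOT_IN_COMMIT.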


Re: WAL fsync scheduling

From

Tom Lane

Date:

18 November 2000, 00:17:24

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Other backends will see they are not the lowest
> WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> so they can then continue, knowing their data was synced.

How will they wait?  Without a semaphore involved, your answer must
be either "timed sleep" or "busy-wait loop", neither of which is
attractive ...
        regards, tom lane

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 01:04:51

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Other backends will see they are not the lowest
> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > so they can then continue, knowing their data was synced.
> 
> How will they wait?  Without a semaphore involved, your answer must
> be either "timed sleep" or "busy-wait loop", neither of which is
> attractive ...

Yes, either timed sleep or busy-wait.  One nifty trick would be for each
backend that is not going to do the fsync to just sleep with signals
enabled, and for the fsyncing backend to signal the other backends to
exit their sleep.  That way, only one backend does the checking.

This sleep thing was going to be a problem anyway with the old system. 
At least this way, they sleep/check only in cases where it is valuable.

Can we use a semaphore for this system?

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 09:36:12

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Other backends will see they are not the lowest
> > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > > so they can then continue, knowing their data was synced.
> > 
> > How will they wait?  Without a semaphore involved, your answer must
> > be either "timed sleep" or "busy-wait loop", neither of which is
> > attractive ...
> 
> Yes, either timed sleep or busy-wait.  One nifty trick would be for each
> backend that is not going to do the fsync to just sleep with signals
> enabled, and for the fsyncing backend to signal the other backends to
> exit their sleep.  That way, only one backend does the checking.
> 
> This sleep thing was going to be a problem anyway with the old system. 
> At least this way, they sleep/check only in cases where it is valuable.

I have another idea.  If a backend gets to the point that it needs
fsync, and there is another backend in START_LOG_WRITE, it can go to an
interruptible sleep, knowing another backend will perform the fsync and
wake it up.  Therefore, there is no busy-wait or timed sleep.

Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
race condition.

Re: WAL fsync scheduling

From

Larry Rosenman

Date:

18 November 2000, 10:38:37

* Tom Lane <tgl@sss.pgh.pa.us> [001117 23:21]:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Other backends will see they are not the lowest
> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > so they can then continue, knowing their data was synced.
> 
> How will they wait?  Without a semaphore involved, your answer must
> be either "timed sleep" or "busy-wait loop", neither of which is
> attractive ...
how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 

> 
>             regards, tom lane
-- 
Larry Rosenman                      http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice) Internet: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 12:49:50

> * Tom Lane <tgl@sss.pgh.pa.us> [001117 23:21]:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Other backends will see they are not the lowest
> > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > > so they can then continue, knowing their data was synced.
> > 
> > How will they wait?  Without a semaphore involved, your answer must
> > be either "timed sleep" or "busy-wait loop", neither of which is
> > attractive ...
> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 

Looks like a winner.

Re: WAL fsync scheduling

From

Tom Lane

Date:

18 November 2000, 13:10:21

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 

> Looks like a winner.

sigpause() is a BSD-ism, and not part of any recognized standard
according to my HP man pages.  How portable do you think it is?
        regards, tom lane

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 13:13:09

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 
> 
> > Looks like a winner.
> 
> sigpause() is a BSD-ism, and not part of any recognized standard
> according to my HP man pages.  How portable do you think it is?

Good point.  I get on BSDI:
    The sigpause function call appeared in 4.2BSD and has been deprecated.

The standard is sigsuspend:
    The sigsuspend function call conforms to IEEE Std1003.1-1988 (``POSIX'').
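
A rough sketch of how a waiting backend might sleep with sigsuspend()
without losing a wakeup that arrives early.  The helper names
(wakeup_install, wakeup_prepare, wakeup_wait) are made up for illustration,
and SIGUSR1 is used only because it was the signal suggested above; which
signal would really be available is a separate question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t wakeup_seen = 0;

/* Handler only records that the wakeup arrived. */
static void
on_wakeup(int signo)
{
    (void) signo;
    wakeup_seen = 1;
}

/* Install the handler once at startup; returns 0 on success. */
int
wakeup_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_wakeup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Block SIGUSR1 before announcing that we are waiting, so a signal sent
 * between the announcement and the sleep stays pending instead of being
 * lost.  The previous mask is saved for wakeup_wait().
 */
int
wakeup_prepare(sigset_t *saved)
{
    sigset_t    block;
    int         rc;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    rc = sigprocmask(SIG_BLOCK, &block, saved);
    wakeup_seen = 0;
    return rc;
}

/* Sleep until the handler has run, then restore the saved mask. */
int
wakeup_wait(const sigset_t *saved)
{
    sigset_t    waitmask;

    assert(saved != NULL);
    waitmask = *saved;
    sigdelset(&waitmask, SIGUSR1);  /* make sure it is deliverable */

    while (!wakeup_seen)
        sigsuspend(&waitmask);      /* atomically unblocks and sleeps */

    sigprocmask(SIG_SETMASK, saved, NULL);
    return wakeup_seen;
}
```

The point of the prepare/wait split is that sigsuspend() swaps the mask and
sleeps in one step, so there is no window in which the wakeup can slip past
between the check and the sleep.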

Re: WAL fsync scheduling

From

Tom Lane

Date:

18 November 2000, 13:21:27

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>>>> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 

> The standard is sigsuspend:

OK, we can probably assume that at least one of sigsuspend or sigpause
is available everywhere.  Now all you need is a free signal number.
Unfortunately we're already using both SIGUSR1 and SIGUSR2.
        regards, tom lane

Re: WAL fsync scheduling

From

Peter Eisentraut

Date:

18 November 2000, 13:22:35

Larry Rosenman writes:

> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 

Both of these signals are already used.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 13:22:40

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >>>>> how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up ? 
> 
> > The standard is sigsuspend:
> 
> OK, we can probably assume that at least one of sigsuspend or sigpause
> is available everywhere.  Now all you need is a free signal number.
> Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Oh, I didn't want to hear that one.

Re: WAL fsync scheduling

From

Peter Eisentraut

Date:

18 November 2000, 13:47:50

Tom Lane writes:

> OK, we can probably assume that at least one of sigsuspend or sigpause
> is available everywhere.

#ifdef HAVE_POSIX_SIGNALS should tell you.

> Now all you need is a free signal number. Unfortunately we're already
> using both SIGUSR1 and SIGUSR2.

Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
move quickdie() to SIGQUIT, and you got SIGUSR1 free.

(That would even make sense in two ways:  1) SIGQUIT would actually cause
the guy to quit; 2) there is a correspondence between postmaster and
postgres signals.)

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

18 November 2000, 13:58:12

> Tom Lane writes:
> 
> > OK, we can probably assume that at least one of sigsuspend or sigpause
> > is available everywhere.
> 
> #ifdef HAVE_POSIX_SIGNALS should tell you.
> 
> > Now all you need is a free signal number. Unfortunately we're already
> > using both SIGUSR1 and SIGUSR2.
> 
> Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
> move quickdie() to SIGQUIT, and you got SIGUSR1 free.
> 
> (That would even make sense in two ways:  1) SIGQUIT would actually cause
> the guy to quit; 2) there is a correspondence between postmaster and
> postgres signals.)

Good idea.

Of course, this assumes my idea was valid.  Was it?


Re: WAL fsync scheduling

From

Tom Lane

Date:

18 November 2000, 14:32:40

Peter Eisentraut <peter_e@gmx.net> writes:
>> Now all you need is a free signal number. Unfortunately we're already
>> using both SIGUSR1 and SIGUSR2.

> Maybe you could dump the old meaning of SIGQUIT (externally invoked error),
> move quickdie() to SIGQUIT, and you got SIGUSR1 free.

> (That would even make sense in two ways:  1) SIGQUIT would actually cause
> the guy to quit; 2) there is a correspondence between postmaster and
> postgres signals.)

Seems like a plan.  The current definition of backend SIGQUIT is really
stupid anyway --- what's the value of forcing an error asynchronously?

Also, it always bothered me that the postmaster and backend signals
weren't consistent, so I'd be inclined to make this change even if we
end up not using SIGUSR1 for Bruce's idea ...
        regards, tom lane

Re: WAL fsync scheduling

From

"Vadim Mikheev"

Date:

19 November 2000, 14:17:53

> There are two parts to transaction commit.  The first is writing all
> dirty buffers or log changes to the kernel, and second is fsync of the
  ^^^^^^^^^^^^
A backend doesn't write any dirty buffers to the kernel at commit time.

> log file.

The first part is writing commit record into WAL buffers in shmem.
This is what XLogInsert does.  After that XLogFlush is called to ensure
that the entire commit record is on disk. XLogFlush does *both* write() and
fsync() (single slock is used for both writing and fsyncing) if it needs to
do it at all.

> I suggest having a per-backend shared memory byte that has the following
> values:
> 
> START_LOG_WRITE
> WAIT_ON_FSYNC
> NOT_IN_COMMIT
> backend_number_doing_fsync
> 
> I suggest that when each backend starts a commit, it sets its byte to
> START_LOG_WRITE.  ^^^^^^^^^^^^^^^^^^^^^^^
Isn't START_COMMIT more meaningful?

> When it gets ready to fsync, it checks all backends.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^
What do you mean by this? The moment just after XLogInsert?

> If all are NOT_IN_COMMIT, it does fsync and continues.

1st edition:
> If one or more are in START_LOG_WRITE, it waits until no one is in
> START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
> lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> number, and does fsync.  It then clears all backends with its number to
> NOT_IN_COMMIT.  Other backends will see they are not the lowest
> WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> so they can then continue, knowing their data was synced.

2nd edition:
> I have another idea.  If a backend gets to the point that it needs
> fsync, and there is another backend in START_LOG_WRITE, it can go to an
> interruptible sleep, knowing another backend will perform the fsync and
> wake it up.  Therefore, there is no busy-wait or timed sleep.
> 
> Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> race condition.

The 2nd edition is much better. But I'm not sure we really need these
per-backend bytes in shmem. Why not just have some counters?
We can use a semaphore to wake up all waiters at once.

> This allows a single backend not to sleep, and allows multiple backends
> to bunch up only when they are all about to commit.
> 
> The reason backend numbers are written is so other backends entering the
> commit code will not interfere with the backends performing fsync.

A woken-up backend can check what's written/fsynced by calling XLogFlush.
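
The re-check can be pictured with a single "flushed up to" position: each
backend remembers where its commit record ends and, once woken, only issues
an fsync if that position is not yet covered.  The sketch below is a toy
model under that assumption.  LogPos, LogModel, model_insert and model_flush
are hypothetical names, not the real XLogInsert/XLogFlush interfaces, and the
fsync is only counted, not performed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t LogPos;    /* hypothetical stand-in for a WAL position */

typedef struct
{
    LogPos      inserted_upto;  /* end of the last record put in the buffers */
    LogPos      flushed_upto;   /* everything before this is on disk */
    unsigned    fsync_calls;    /* how many simulated fsyncs were issued */
} LogModel;

/* Append a record of "len" bytes and return the position where it ends. */
LogPos
model_insert(LogModel *log, LogPos len)
{
    log->inserted_upto += len;
    return log->inserted_upto;
}

/*
 * Make sure the log is on disk up to "upto".  Returns 1 if this call had
 * to do the (simulated) write + fsync, 0 if an earlier flush by some other
 * backend already covered the position.
 */
int
model_flush(LogModel *log, LogPos upto)
{
    assert(upto <= log->inserted_upto);

    if (upto <= log->flushed_upto)
        return 0;               /* someone else already synced our record */

    /* real code would write() and fsync() here; assume it covers
     * everything inserted so far, not just our own record */
    log->flushed_upto = log->inserted_upto;
    log->fsync_calls++;
    return 1;
}
```

With a check like this, a backend that is woken by mistake, or that arrived
while another group's fsync was in progress, does no harm: it either finds
its record already covered or performs the next flush itself.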

Vadim

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

19 November 2000, 14:26:52

> > There are two parts to transaction commit.  The first is writing all
> > dirty buffers or log changes to the kernel, and second is fsync of the
>    ^^^^^^^^^^^^
> A backend doesn't write any dirty buffers to the kernel at commit time.

Yes, I suspected that.

> 
> > log file.
> 
> The first part is writing commit record into WAL buffers in shmem.
> This is what XLogInsert does.  After that XLogFlush is called to ensure
> that the entire commit record is on disk. XLogFlush does *both* write() and
> fsync() (single slock is used for both writing and fsyncing) if it needs to
> do it at all.

Yes, I realize there are new steps in WAL.

> 
> > I suggest having a per-backend shared memory byte that has the following
> > values:
> > 
> > START_LOG_WRITE
> > WAIT_ON_FSYNC
> > NOT_IN_COMMIT
> > backend_number_doing_fsync
> > 
> > I suggest that when each backend starts a commit, it sets its byte to
> > START_LOG_WRITE. 
>   ^^^^^^^^^^^^^^^^^^^^^^^
> Isn't START_COMMIT more meaningful?

Yes.

> 
> > When it gets ready to fsync, it checks all backends. 
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^
> What do you mean by this? The moment just after XLogInsert?

Just before it calls fsync().

> 
> > If all are NOT_IN_COMMIT, it does fsync and continues.
> 
> 1st edition:
> > If one or more are in START_LOG_WRITE, it waits until no one is in
> > START_LOG_WRITE.  It then checks all WAIT_ON_FSYNC, and if it is the
> > lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> > number, and does fsync.  It then clears all backends with its number to
> > NOT_IN_COMMIT.  Other backends will see they are not the lowest
> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > so they can then continue, knowing their data was synced.
> 
> 2nd edition:
> > I have another idea.  If a backend gets to the point that it needs
> > fsync, and there is another backend in START_LOG_WRITE, it can go to an
> > interruptible sleep, knowing another backend will perform the fsync and
> > wake it up.  Therefore, there is no busy-wait or timed sleep.
> > 
> > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> > race condition.
> 
> The 2nd edition is much better. But I'm not sure we really need these
> per-backend bytes in shmem. Why not just have some counters?
> We can use a semaphore to wake up all waiters at once.

Yes, that is much better and clearer.  My idea was just to say, "if no
one is entering commit phase, do the commit.  If someone else is coming,
sleep and wait for them to do the fsync and wake me up with a signal."

> 
> > This allows a single backend not to sleep, and allows multiple backends
> > to bunch up only when they are all about to commit.
> > 
> > The reason backend numbers are written is so other backends entering the
> > commit code will not interfere with the backends performing fsync.
> 
> A woken-up backend can check what's written/fsynced by calling XLogFlush.

Seems that may not be needed anymore with a counter.  The only issue is
that other backends may enter commit while fsync() is happening.  The
process that did the fsync must be sure to wake up only the backends
that were waiting for it, and not other backends that may also be
doing fsync as a group while the first fsync was happening.  I leave
those details to people more experienced.  :-)

I am just glad people liked my idea.

Re: WAL fsync scheduling

From

Bruce Momjian

Date:

24 January 2001, 10:59:48

Added to TODO.detail and TODO list.
