Thread: Potential Large Performance Gain in WAL synching
I've been looking at the TODO lists and caching issues and think there may be
a way to greatly improve the performance of the WAL. I've made the following
assumptions based on my reading of the manual and the WAL archives since
about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure
   that the D in ACID is satisfied.

2) The wait on fsync is the biggest time cost for inserts or updates.

3) fsync itself probably increases contention for file i/o on the same file,
   since some OS file system cache structures must be locked as part of
   fsync. Depending on the file system this could be a significant choke on
   total i/o throughput.

The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded. I'm not
familiar with the exact WAL implementation in PostgreSQL but am familiar
with others, including ARIES II; however, it seems that it comes down to
making sure that the write to the WAL log has been positively written to
disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
the log writing using aio_write and block until the last log aio request has
completed, using aio_waitcomplete. The call to aio_waitcomplete won't return
until the log record has been written to the disk. Opening with O_DSYNC
ensures that when i/o completes the write has been written to the disk, and
aio_write with O_APPEND-opened files ensures that writes append in the order
they are received; hence when the aio_write for the last log entry of a
transaction completes, the transaction can be sure that its log records are
in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.

2) Allow transactions to complete and do work while other threads are
   waiting on the completion of the log write.

3) Obviate the need for commit_delay, since there is no blocking and the
   file system and the disk controller can put multiple writes to the log
   together as the drive is waiting for the end of the log file to come
   under one of the heads.

Here are the relevant TODOs:

  * Delay fsync() when other backends are about to commit too [fsync]
  * Determine optimal commit_delay value
  * Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
  * Allow multiple blocks to be written to WAL with one write()

Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

------------------------------------------------------------------
Galt Capital                          http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549                  voice: 340.776.0144
Charlotte Amalie, St. Thomas          fax:   340.776.0244
United States Virgin Islands 00801    cell:  340.643.5368
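For concreteness, here is a minimal sketch of the call pattern this
proposes, assuming POSIX AIO; it substitutes the portable aio_suspend()
for FreeBSD's aio_waitcomplete(), the wal_* names are illustrative rather
than anything in PostgreSQL, and error handling is elided:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    static int wal_fd;

    /* Open the log so every completed write is durable (O_DSYNC) and
     * appended in queue order (O_APPEND). */
    void
    wal_open(const char *path)
    {
        wal_fd = open(path, O_WRONLY | O_APPEND | O_DSYNC);
    }

    /* Queue one log record; the backend does not block here. */
    void
    wal_write_async(struct aiocb *cb, const void *rec, size_t len)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = wal_fd;
        cb->aio_buf = (volatile void *) rec;  /* must stay valid until done */
        cb->aio_nbytes = len;
        aio_write(cb);
    }

    /* At commit: block only until the final (commit) record's write has
     * completed -- with O_DSYNC, completion means "on the platter". */
    int
    wal_commit_wait(struct aiocb *commit_cb)
    {
        const struct aiocb *list[1] = { commit_cb };

        while (aio_error(commit_cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        return (aio_return(commit_cb) >= 0) ? 0 : -1;
    }

Only wal_commit_wait() blocks, and only on the transaction's own last
record; everything earlier has already been queued to the device.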
"Curtis Faith" <curtis@galtair.com> writes: > So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log > and then use aio_write for all log writes? We already offer an O_DSYNC option. It's not obvious to me what aio_write brings to the table (aside from loss of portability). You still have to wait for the final write to complete, no? > 2) Allow transactions to complete and do work while other threads are > waiting on the completion of the log write. I'm missing something. There is no useful work that a transaction can do between writing its commit record and reporting completion, is there? It has to wait for that record to hit disk. regards, tom lane
tom lane replies:
> "Curtis Faith" <curtis@galtair.com> writes:
> > So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL
> > log and then use aio_write for all log writes?
>
> We already offer an O_DSYNC option. It's not obvious to me what
> aio_write brings to the table (aside from loss of portability).
> You still have to wait for the final write to complete, no?

Well, for starters, by the time the write which includes the commit log
entry is written, much of the log for the transaction will already be on
disk, or in a controller on its way.

I don't see any O_NONBLOCK or O_NDELAY references in the sources, so it
looks like the log writes are blocking. If I read correctly, XLogInsert
calls XLogWrite, which calls write, which blocks. If these assumptions are
correct, there should be some significant gain here, but I won't know how
much until I try to change it.

This issue only affects the speed of a given back-end's transaction
processing capability. The REAL issue, and the one that will greatly
affect total system throughput, is that of contention on the file locks.
Since fsync needs to obtain a write lock on the file descriptor, as do the
write calls which originate from XLogWrite as the writes are written to
the disk, other back-ends will block while another transaction is
committing if the log cache fills to the point where their XLogInsert
results in an XLogWrite call to flush the log cache. I'd guess this means
that one won't gain much by adding other back-end processes past three or
four if there are a lot of inserts or updates.

The method I propose does not result in any blocking because of writes,
other than the final commit's write, and it has the very significant
advantage of allowing other transactions (from other back-ends) to
continue until they enter commit (and block waiting for their final commit
write to complete).

> > 2) Allow transactions to complete and do work while other threads are
> > waiting on the completion of the log write.
>
> I'm missing something. There is no useful work that a transaction can
> do between writing its commit record and reporting completion, is there?
> It has to wait for that record to hit disk.

The key here is that a thread that has not committed, and therefore is not
blocking, can do work while "other threads" (should have said back-ends or
processes) are waiting on their commit writes.

- Curtis

P.S. If I am right in my assumptions about the way the current system
works, I'll bet the change would speed up inserts in Shridhar's huge
database test by at least a factor of two or three, perhaps even an order
of magnitude. :-)
Curtis Faith wrote:
> The method I propose does not result in any blocking because of writes
> other than the final commit's write and it has the very significant
> advantage of allowing other transactions (from other back-ends) to
> continue until they enter commit (and blocking waiting for their final
> commit write to complete).
>
> > > 2) Allow transactions to complete and do work while other threads
> > > are waiting on the completion of the log write.
> >
> > I'm missing something. There is no useful work that a transaction can
> > do between writing its commit record and reporting completion, is
> > there? It has to wait for that record to hit disk.
>
> The key here is that a thread that has not committed and therefore is
> not blocking can do work while "other threads" (should have said
> back-ends or processes) are waiting on their commit writes.

I may be missing something here, but other backends don't block while one
writes to WAL. Remember, we are process based, not thread based, so the
write() call only blocks the one session. If you had threads, and you did
a write() call that blocked other threads, I can see where your idea would
be good, and where async i/o becomes an advantage.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
"Curtis Faith" <curtis@galtair.com> writes: > The REAL issue and the one that will greatly affect total system > throughput is that of contention on the file locks. Since fsynch needs to > obtain a write lock on the file descriptor, as does the write calls which > originate from XLogWrite as the writes are written to the disk, other > back-ends will block while another transaction is committing if the > log cache fills to the point where their XLogInsert results in a > XLogWrite call to flush the log cache. But that's exactly *why* we have a log cache: to ensure we can buffer a reasonable amount of log data between XLogFlush calls. If the above scenario is really causing a problem, doesn't that just mean you need to increase wal_buffers? regards, tom lane
Bruce Momjian wrote:
> I may be missing something here, but other backends don't block while
> one writes to WAL.

I don't think they'll block until they get to the fsync or XLogWrite call
while another transaction is fsync'ing.

I'm no Unix filesystem expert, but I don't see how the OS can handle
multiple writes and fsyncs to the same file descriptors without blocking
other processes from writing at the same time. It may be that there are
some clever data structures they use, but I've not seen huge praise for
most of the file systems. A well-written file system could minimize this
contention, but I'll bet it's there with most of the ones that PostgreSQL
most commonly runs on.

I'll have to write a test and see if there really is a problem.

- Curtis
I wrote:
> > The REAL issue and the one that will greatly affect total system
> > throughput is that of contention on the file locks. Since fsync needs
> > to obtain a write lock on the file descriptor, as do the write calls
> > which originate from XLogWrite as the writes are written to the disk,
> > other back-ends will block while another transaction is committing if
> > the log cache fills to the point where their XLogInsert results in an
> > XLogWrite call to flush the log cache.

tom lane wrote:
> But that's exactly *why* we have a log cache: to ensure we can buffer a
> reasonable amount of log data between XLogFlush calls. If the above
> scenario is really causing a problem, doesn't that just mean you need
> to increase wal_buffers?

Well, in cases where there are a lot of small transactions, the contention
will not be on the XLogWrite calls from caches getting full but on the
XLogWrite calls from transaction commits, which will happen very
frequently. I think this will have a detrimental effect on very high
update frequency performance.

So while larger WAL caches will help in the case of cache flushing because
the cache is full, I don't think they will make any difference for the
potentially more common case of transaction commits.

- Curtis
Curtis Faith wrote:
> Bruce Momjian wrote:
> > I may be missing something here, but other backends don't block while
> > one writes to WAL.
>
> I don't think they'll block until they get to the fsync or XLogWrite
> call while another transaction is fsync'ing.
>
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without blocking
> other processes from writing at the same time. It may be that there are
> some clever data structures they use but I've not seen huge praise for
> most of the file systems. A well-written file system could minimize this
> contention but I'll bet it's there with most of the ones that PostgreSQL
> most commonly runs on.
>
> I'll have to write a test and see if there really is a problem.

Yes, I can see some contention, but what does aio solve?

--
  Bruce Momjian
I wrote:
> > I'm no Unix filesystem expert but I don't see how the OS can handle
> > multiple writes and fsyncs to the same file descriptors without
> > blocking other processes from writing at the same time. It may be that
> > there are some clever data structures they use but I've not seen huge
> > praise for most of the file systems. A well-written file system could
> > minimize this contention but I'll bet it's there with most of the ones
> > that PostgreSQL most commonly runs on.
> >
> > I'll have to write a test and see if there really is a problem.

Bruce Momjian wrote:
> Yes, I can see some contention, but what does aio solve?

Well, theoretically, aio lets the file system handle the writes without
requiring any locks to be held by the processes issuing those writes. The
disk i/o scheduler can therefore issue the writes using spinlocks or
something very fast, since it controls the timing of each of the actual
writes. In some systems this is handled by the kernel and can be very
fast.

I suspect that with large RAID controllers or intelligent disk systems
like EMC this is even more important, because they should be able to
handle a much higher level of concurrent i/o.

Now whether or not the common file systems handle this well, I can't say.
Take a look at some comments on how Oracle uses asynchronous I/O:

http://www.ixora.com.au/notes/redo_write_multiplexing.htm
http://www.ixora.com.au/notes/asynchronous_io.htm
http://www.ixora.com.au/notes/raw_asynchronous_io.htm

It seems that OS support for this will likely increase and that this issue
will become more and more important as users contemplate SMP systems or if
threading is added to certain PostgreSQL subsystems.

It might be easier for me to implement the change I propose and then see
what kind of difference it makes. I wanted to run the idea past this group
first. We can all postulate whether or not it will work but we won't know
unless we try it.

My real issue is one of what happens in the event that it does work. I've
had very good luck implementing this sort of thing for other systems, but
I don't yet know the range of i/o requests that PostgreSQL makes. Assuming
we can demonstrate no detrimental effects on system reliability, and that
the change is implemented in such a way that it can be turned on or off
easily, will a 50% or better increase in speed for updates justify the
sort of change I am proposing? 20%? 10%?

- Curtis
Curtis Faith wrote:
> > Yes, I can see some contention, but what does aio solve?
>
> Well, theoretically, aio lets the file system handle the writes without
> requiring any locks to be held by the processes issuing those writes.
> The disk i/o scheduler can therefore issue the writes using spinlocks or
> something very fast, since it controls the timing of each of the actual
> writes. In some systems this is handled by the kernel and can be very
> fast.

I am again confused. When we do write(), we don't have to lock anything,
do we? (Multiple processes can write() to the same file just fine.) We do
block the current process, but we have nothing else to do until we know it
is written/fsync'ed. Does aio more easily allow the kernel to order those
writes? Is that the issue? Well, certainly the kernel already orders the
writes. Just because we write() doesn't mean it goes to disk; only fsync()
or the kernel do that.

> I suspect that with large RAID controllers or intelligent disk systems
> like EMC this is even more important because they should be able to
> handle a much higher level of concurrent i/o.
>
> Now whether or not the common file systems handle this well, I can't
> say. Take a look at some comments on how Oracle uses asynchronous I/O:
>
> http://www.ixora.com.au/notes/redo_write_multiplexing.htm
> http://www.ixora.com.au/notes/asynchronous_io.htm
> http://www.ixora.com.au/notes/raw_asynchronous_io.htm

Yes, but Oracle is threaded, right? So, yes, they clearly could win with
it. I read the second URL and it said we could issue separate writes and
have them be done in an optimal order. However, we use the file system,
not raw devices, so don't we already have that in the kernel with fsync()?

> It seems that OS support for this will likely increase and that this
> issue will become more and more important as users contemplate SMP
> systems or if threading is added to certain PostgreSQL subsystems.

Probably. Having seen the Informix 5/7 debacle, I don't want to fall into
the trap where we add stuff that just makes things faster on SMP/threaded
systems when it makes our code _slower_ on single CPU systems, which is
exactly what Informix did in Informix 7, and we know how that ended (lost
customers, bought by IBM). I don't think that's going to happen to us, but
I thought I would mention it.

> Assuming we can demonstrate no detrimental effects on system reliability
> and that the change is implemented in such a way that it can be turned
> on or off easily, will a 50% or better increase in speed for updates
> justify the sort of change I am proposing? 20%? 10%?

Yea, let's see what boost we get, and the size of the patch, and we can
review it. It is certainly worth researching.

--
  Bruce Momjian
Bruce Momjian wrote:
> I am again confused. When we do write(), we don't have to lock anything,
> do we? (Multiple processes can write() to the same file just fine.) We
> do block the current process, but we have nothing else to do until we
> know it is written/fsync'ed. Does aio more easily allow the kernel to
> order those writes? Is that the issue? Well, certainly the kernel
> already orders the writes. Just because we write() doesn't mean it goes
> to disk; only fsync() or the kernel do that.

"We" don't have to lock anything, but most file systems can't process
fsync's simultaneously with other writes, so those writes block because
the file system grabs its own internal locks. The fsync call is more
contentious than typical writes because its duration is usually longer, so
it holds the locks longer, over more pages and structures.

That is the real issue: the contention caused by fsync'ing very
frequently, which blocks other writers and readers. For the buffer
manager, the blocking of readers is probably even more problematic when
the cache is a small percentage (say < 10% to 15%) of the total database
size, because most leaf node accesses will result in a read. Each of these
reads will have to wait on the fsync as well. Again, a very well-written
file system probably can minimize this, but I've not seen any.

Further comment on:

> We do block the current process, but we have nothing else to do
> until we know it is written/fsync'ed.

Writing out a bunch of calls at the end, after having consumed a lot of
CPU cycles, and then waiting is not as efficient as writing them out while
those CPU cycles are being used. We are currently wasting the time it
takes for a given process to write. The thinking probably has been that
this is no big deal because other processes, say B, C and D, can use the
CPU cycles while process A blocks. This is true UNLESS the other processes
are blocking on reads or writes caused by process A doing the final writes
and fsync.

> Yes, but Oracle is threaded, right? So, yes, they clearly could win with
> it. I read the second URL and it said we could issue separate writes
> and have them be done in an optimal order. However, we use the file
> system, not raw devices, so don't we already have that in the kernel
> with fsync()?

Whether by threads or multiple processes, there is the same contention on
the file through multiple writers. The file system can decide to reorder
writes before they start but not after. If a write comes after an fsync
starts, it will have to wait on that fsync. Likewise, a given process's
writes can NEVER be reordered if they are submitted synchronously, as is
done in the calls to flush the log as well as the dirty pages in the
buffer in the current code.

> Probably. Having seen the Informix 5/7 debacle, I don't want to fall
> into the trap where we add stuff that just makes things faster on
> SMP/threaded systems when it makes our code _slower_ on single CPU
> systems, which is exactly what Informix did in Informix 7, and we know
> how that ended (lost customers, bought by IBM). I don't think that's
> going to happen to us, but I thought I would mention it.

Yes, I hate "improvements" that make things worse for most people. Any
changes I'd contemplate would simply be another configuration-driven
optimization that could be turned off very easily.

- Curtis
"Curtis Faith" <curtis@galtair.com> writes: > ... most file systems can't process fsync's > simultaneous with other writes, so those writes block because the file > system grabs its own internal locks. Oh? That would be a serious problem, but I've never heard that asserted before. Please provide some evidence. On a filesystem that does have that kind of problem, can't you avoid it just by using O_DSYNC on the WAL files? Then there's no need to call fsync() at all, except during checkpoints (which actually issue sync() not fsync(), anyway). > Whether by threads or multiple processes, there is the same contention on > the file through multiple writers. The file system can decide to reorder > writes before they start but not after. If a write comes after a > fsync starts it will have to wait on that fsync. AFAICS we cannot allow the filesystem to reorder writes of WAL blocks, on safety grounds (we want to be sure we have a consistent WAL up to the end of what we've written). Even if we can allow some reordering when a single transaction puts out a large volume of WAL data, I fail to see where any large gain is going to come from. We're going to be issuing those writes sequentially and that ought to match the disk layout about as well as can be hoped anyway. > Likewise a given process's writes can NEVER be reordered if they are > submitted synchronously, as is done in the calls to flush the log as > well as the dirty pages in the buffer in the current code. We do not fsync buffer pages; in fact a transaction commit doesn't write buffer pages at all. I think the above is just a misunderstanding of what's really happening. We have synchronous WAL writing, agreed, but we want that AFAICS. Data block writes are asynchronous (between checkpoints, anyway). There is one thing in the current WAL code that I don't like: if the WAL buffers fill up then everybody who would like to make WAL entries is forced to wait while some space is freed, which means a write, which is synchronous if you are using O_DSYNC. It would be nice to have a background process whose only task is to issue write()s as soon as WAL pages are filled, thus reducing the probability that foreground processes have to wait for WAL writes (when they're not committing that is). But this could be done portably with one more postmaster child process; I see no real need to dabble in aio_write. regards, tom lane
> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.
>
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.
>
> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files?

To make this competitive, the WAL writes would need to be improved to do
more than one block (up to 256k or 512k per write) with one write call (if
that much is to be written for this tx to be able to commit). This should
actually not be too difficult, since the WAL buffer is already contiguous
memory.

If that is done, then I bet O_DSYNC will beat any other config we
currently have. With this, a separate disk for WAL, and large
transactions, you should be able to see your disks hit the max IO figures
they are capable of :-)

Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > To make this competitive, the WAL writes would need to be improved to > do more than one block (up to 256k or 512k per write) with one write call > (if that much is to be written for this tx to be able to commit). > This should actually not be too difficult since the WAL buffer is already > contiguous memory. Hmmm ... if you were willing to dedicate a half meg or meg of shared memory for WAL buffers, that's doable. I was originally thinking of having the (still hypothetical) background process wake up every time a WAL page was completed and available to write. But it could be set up so that there is some "slop", and it only wakes up when the number of writable pages exceeds N, for some N that's still well less than the number of buffers. Then it could write up to N sequential pages in a single write(). However, this would only be a win if you had few and large transactions. Any COMMIT will force a write of whatever we have so far, so the idea of writing hundreds of K per WAL write can only work if it's hundreds of K between commit records. Is that a common scenario? I doubt it. If you try to set it up that way, then it's more likely that what will happen is the background process seldom awakens at all, and each committer effectively becomes responsible for writing all the WAL traffic since the last commit. Wouldn't that lose compared to someone else having written the previous WAL pages in background? We could certainly build the code to support this, though, and then experiment with different values of N. If it turns out N==1 is best after all, I don't think we'd have wasted much code. regards, tom lane
I wrote:
> > ... most file systems can't process fsync's
> > simultaneous with other writes, so those writes block because the file
> > system grabs its own internal locks.

tom lane replies:
> Oh? That would be a serious problem, but I've never heard that asserted
> before. Please provide some evidence.

Well, I'm basing this on past empirical testing and having read some man
pages that describe fsync under this exact scenario. I'll have to write a
test to prove this one way or another. I'll also try to look into the
linux/BSD source for the common file systems used for PostgreSQL.

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

No, they're not exactly the same thing. Consider:

    Process A                     File System
    ---------                     -----------
    Writes index buffer           .idling...
    Writes entry to log cache     .
    Writes another index buffer   .
    Writes another log entry      .
    Writes tuple buffer           .
    Writes another log entry      .
    Index scan                    .
    Large table sort              .
    Writes tuple buffer           .
    Writes another log entry      .
    Writes                        .
    Writes another index buffer   .
    Writes another log entry      .
    Writes another index buffer   .
    Writes another log entry      .
    Index scan                    .
    Large table sort              .
    Commit                        .
    File Write Log Entry          .
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    File Write Log Entry          .idling...
    .idling...                    Write to cache
    Write Commit Log Entry        .idling...
    .idling...                    Write to cache
    Call fsync                    .idling...
    .idling...                    Write all buffers to device.
    .DONE.

In this case, Process A is waiting for all the buffers to be written at
the end of the transaction. With asynchronous I/O this becomes:

    Process A                     File System
    ---------                     -----------
    Writes index buffer           .idling...
    Writes entry to log cache     Queue up write - move head to cylinder
    Writes another index buffer   Write log entry to media
    Writes another log entry      Immediate write to cylinder since head
                                  is still there
    Writes tuple buffer           .
    Writes another log entry      Queue up write - move head to cylinder
    Index scan                    .busy with scan...
    Large table sort              Write log entry to media
    Writes tuple buffer           .
    Writes another log entry      Queue up write - move head to cylinder
    Writes                        .
    Writes another index buffer   Write log entry to media
    Writes another log entry      Queue up write - move head to cylinder
    Writes another index buffer   .
    Writes another log entry      Write log entry to media
    Index scan                    .
    Large table sort              Write log entry to media
    Commit                        .
    Write Commit Log Entry        Immediate write to cylinder since head
                                  is still there
    .DONE.

Effectively the real work of writing the cache is done while the CPU for
the process is busy doing index scans, sorts, etc. With the WAL log on
another device and SCSI I/O, the log writing should almost always be done
except for the final commit write.

> > Whether by threads or multiple processes, there is the same contention
> > on the file through multiple writers. The file system can decide to
> > reorder writes before they start but not after. If a write comes after
> > a fsync starts it will have to wait on that fsync.
>
> AFAICS we cannot allow the filesystem to reorder writes of WAL blocks,
> on safety grounds (we want to be sure we have a consistent WAL up to the
> end of what we've written). Even if we can allow some reordering when a
> single transaction puts out a large volume of WAL data, I fail to see
> where any large gain is going to come from. We're going to be issuing
> those writes sequentially and that ought to match the disk layout about
> as well as can be hoped anyway.

My comment was applying to reads and writes of other processes, not the
WAL log. In my original email, recall I mentioned using the O_APPEND open
flag, which will ensure that all log entries are done sequentially.

> > Likewise a given process's writes can NEVER be reordered if they are
> > submitted synchronously, as is done in the calls to flush the log as
> > well as the dirty pages in the buffer in the current code.
>
> We do not fsync buffer pages; in fact a transaction commit doesn't write
> buffer pages at all. I think the above is just a misunderstanding of
> what's really happening. We have synchronous WAL writing, agreed, but
> we want that AFAICS. Data block writes are asynchronous (between
> checkpoints, anyway).

Hmm, I keep hearing that buffer block writes are asynchronous but I don't
read that in the code at all. There are simple "write" calls with files
that are not opened with O_NOBLOCK, so they'll be done synchronously. The
code for this is relatively straightforward (once you get past the storage
manager abstraction), so I don't see what I might be missing.

It's true that data blocks are not required to be written before the
transaction commits, so they are in some sense asynchronous to the
transactions. However, they still later on block the process that is
requesting a new block when a block in the cache happens to be dirty,
forcing a write of that block.

It looks to me like BufferAlloc will simply result in a call to
BufferReplace > smgrblindwrt > write for md storage manager objects. This
means that a process will block while the write of dirty cache buffers
takes place. I'm happy to be wrong on this but I don't see any hard
evidence of asynch file calls anywhere in the code. Unless I am missing
something, this is a huuuuge problem.

> There is one thing in the current WAL code that I don't like: if the WAL
> buffers fill up then everybody who would like to make WAL entries is
> forced to wait while some space is freed, which means a write, which is
> synchronous if you are using O_DSYNC. It would be nice to have a
> background process whose only task is to issue write()s as soon as WAL
> pages are filled, thus reducing the probability that foreground
> processes have to wait for WAL writes (when they're not committing that
> is). But this could be done portably with one more postmaster child
> process; I see no real need to dabble in aio_write.

Hmm, well, another process writing the log would accomplish the same
thing, but isn't that what a file system is? ISTM that aio_write is quite
a bit easier and higher performance. This is especially true for those
OS's which have KAIO support.

- Curtis
"Curtis Faith" <curtis@galtair.com> writes: > It looks to me like BufferAlloc will simply result in a call to > BufferReplace > smgrblindwrt > write for md storage manager objects. > > This means that a process will block while the write of dirty cache > buffers takes place. I think Tom was suggesting that when a buffer is written out, the write() call only pushes the data down into the filesystem's buffer -- which is free to then write the actual blocks to disk whenever it chooses to. In other words, the write() returns, the backend process can continue with what it was doing, and at some later time the blocks that we flushed from the Postgres buffer will actually be written to disk. So in some sense of the word, that I/O is asynchronous. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
fsync exclusive lock evidence WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
After some research I still hold that fsync blocks, at least on FreeBSD.
Am I missing something? Here's the evidence:

Code from /usr/src/sys/syscalls/vfs_syscalls:

    int
    fsync(p, uap)
        struct proc *p;
        struct fsync_args /* { syscallarg(int) fd; } */ *uap;
    {
        register struct vnode *vp;
        struct file *fp;
        vm_object_t obj;
        int error;

        if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
            return (error);
        vp = (struct vnode *) fp->f_data;
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
        if (VOP_GETVOBJECT(vp, &obj) == 0)
            vm_object_page_clean(obj, 0, 0, 0);
        if ((error = VOP_FSYNC(vp, fp->f_cred, MNT_WAIT, p)) == 0 &&
            vp->v_mount && (vp->v_mount->mnt_flag & MNT_SOFTDEP) &&
            bioops.io_fsync)
            error = (*bioops.io_fsync)(vp);
        VOP_UNLOCK(vp, 0, p);
        return (error);
    }

Notice the calls to:

    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
    ...
    VOP_UNLOCK(vp, 0, p);

surrounding the call to VOP_FSYNC.

From the man pages for VOP_UNLOCK:

    HEADER STUFF .....

    int VOP_LOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_UNLOCK(struct vnode *vp, int flags, struct proc *p);
    int VOP_ISLOCKED(struct vnode *vp, struct proc *p);
    int vn_lock(struct vnode *vp, int flags, struct proc *p);

    DESCRIPTION
    These calls are used to serialize access to the filesystem, such as
    to prevent two writes to the same file from happening at the same
    time.

    The arguments are:

    vp      the vnode being locked or unlocked
    flags   One of the lock request types:

            LK_SHARED       Shared lock
            LK_EXCLUSIVE    Exclusive lock
            LK_UPGRADE      Shared-to-exclusive upgrade
            LK_EXCLUPGRADE  First shared-to-exclusive upgrade
            LK_DOWNGRADE    Exclusive-to-shared downgrade
            LK_RELEASE      Release any type of lock
            LK_DRAIN        Wait for all lock activity to end

            The lock type may be or'ed with these lock flags:

            LK_NOWAIT       Do not sleep to wait for lock
            LK_SLEEPFAIL    Sleep, then return failure
            LK_CANRECURSE   Allow recursive exclusive lock
            LK_REENABLE     Lock is to be reenabled after drain
            LK_NOPAUSE      No spinloop

            The lock type may be or'ed with these control flags:

            LK_INTERLOCK    Specify when the caller already has a simple
                            lock (VOP_LOCK will unlock the simple lock
                            after getting the lock)
            LK_RETRY        Retry until locked
            LK_NOOBJ        Don't create object

    p       process context to use for the locks

    Kernel code should use vn_lock() to lock a vnode rather than calling
    VOP_LOCK() directly.
> Hmmm ... if you were willing to dedicate a half meg or meg of shared
> memory for WAL buffers, that's doable.

Yup, configuring Informix to three 2 Mb buffers (LOGBUF 2048) here.

> However, this would only be a win if you had few and large transactions.
> Any COMMIT will force a write of whatever we have so far, so the idea of
> writing hundreds of K per WAL write can only work if it's hundreds of K
> between commit records. Is that a common scenario? I doubt it.

It should help most for data loading, or mass updating, yes.

Andreas
On Fri, 2002-10-04 at 18:03, Neil Conway wrote:
> "Curtis Faith" <curtis@galtair.com> writes:
> > It looks to me like BufferAlloc will simply result in a call to
> > BufferReplace > smgrblindwrt > write for md storage manager objects.
> >
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to. In other words, the write() returns, the backend process
> can continue with what it was doing, and at some later time the blocks
> that we flushed from the Postgres buffer will actually be written to
> disk. So in some sense of the word, that I/O is asynchronous.

Isn't that true only as long as there is buffer space available? When
there isn't buffer space available, it seems the window for blocking comes
into play. So I guess you could say it is optimally asynchronous and worst
case synchronous. I think the worst case situation is the one he's trying
to address. At least that's how I interpret it.

Greg
Curtis Faith writes:
> I'm no Unix filesystem expert but I don't see how the OS can handle
> multiple writes and fsyncs to the same file descriptors without
> blocking other processes from writing at the same time.

Why not? Other than the necessary synchronisation for attributes such as
file size and modification times, multiple processes can readily write to
different areas of the same file at the "same" time. fsync() may not
return until after the buffers it schedules are written, but it doesn't
have to block subsequent writes to different buffers in the file either.
(Note too Tom Lane's responses about when fsync() is used and not used.)

> I'll have to write a test and see if there really is a problem.

Please do. I expect you'll find things aren't as bad as you fear.

In another posting, you write:
> Hmm, I keep hearing that buffer block writes are asynchronous but I
> don't read that in the code at all. There are simple "write" calls with
> files that are not opened with O_NOBLOCK, so they'll be done
> synchronously. The code for this is relatively straightforward (once
> you get past the storage manager abstraction) so I don't see what I
> might be missing.

There is a confusion of terminology here: the write() is synchronous from
the point of the application only in that the data is copied into kernel
buffers (or pages remapped, or whatever) before the system call returns.
For files opened with O_DSYNC the write() would wait for the data to be
written to disk. Thus O_DSYNC is "synchronous" I/O, but there is no
equivalently easy name for the regular "flush to disk after write()
returns" that the Unix kernel has done ~forever.

The asynchronous I/O that you mention ("aio") is a third thing, different
from both regular write() and write() with O_DSYNC. I understand that with
aio the data is not even transferred to the kernel before the aio_write()
call returns, but I've never programmed with aio and am not 100% sure how
it works.

Regards,

Giles
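A small sketch of the three behaviors distinguished here, assuming a POSIX
system; the completion check busy-polls only to keep the example short
(aio_suspend() would be used in practice), and offsets are left at zero:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    void
    three_io_styles(const char *path, const void *buf, size_t len)
    {
        /* 1. Plain write(): returns once the data is in kernel buffers;
         *    the kernel flushes it to disk at some later time. */
        int fd = open(path, O_WRONLY);
        write(fd, buf, len);

        /* 2. O_DSYNC write(): returns only after the data is on disk. */
        int fd_sync = open(path, O_WRONLY | O_DSYNC);
        write(fd_sync, buf, len);

        /* 3. POSIX aio: aio_write() returns before the data has even been
         *    handed to the kernel; completion is observed separately. */
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = (volatile void *) buf;
        cb.aio_nbytes = len;
        aio_write(&cb);
        while (aio_error(&cb) == EINPROGRESS)
            ;                        /* busy-poll for brevity only */

        close(fd);
        close(fd_sync);
    }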
Neil Conway <neilc@samurai.com> writes:
> "Curtis Faith" <curtis@galtair.com> writes:
> > It looks to me like BufferAlloc will simply result in a call to
> > BufferReplace > smgrblindwrt > write for md storage manager objects.
> >
> > This means that a process will block while the write of dirty cache
> > buffers takes place.
>
> I think Tom was suggesting that when a buffer is written out, the
> write() call only pushes the data down into the filesystem's buffer --
> which is free to then write the actual blocks to disk whenever it
> chooses to.

Exactly --- in all Unix systems that I know of, a write() is asynchronous
unless one takes special pains (like opening the file with O_SYNC).
Pushing the data from userspace to the kernel disk buffers does not count
as I/O in my mind.

I am quite concerned about Curtis' worries about fsync, though. There's
not any fundamental reason for fsync to block other operations, but that
doesn't mean that it's been implemented reasonably everywhere :-(. We need
to take a look at that.

			regards, tom lane
"Curtis Faith" <curtis@galtair.com> writes: > After some research I still hold that fsync blocks, at least on > FreeBSD. Am I missing something? > Here's the evidence: > [ much snipped ] > vp = (struct vnode *)fp->f_data; > vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p); Hm, I take it a "vnode" is what's usually called an inode, ie the unique identification data for a specific disk file? This is kind of ugly in general terms but I'm not sure that it really hurts Postgres. In our present scheme, the only files we ever fsync() are WAL log files, not data files. And in normal operation there is only one WAL writer at a time, and *no* WAL readers. So an exclusive kernel-level lock on a WAL file while we fsync really shouldn't create any problem for us. (Unless this indirectly blocks other operations that I'm missing?) As I commented before, I think we could do with an extra process to issue WAL writes in places where they're not in the critical path for a foreground process. But that seems to be orthogonal from this issue. regards, tom lane
Tom Lane wrote:
> "Curtis Faith" <curtis@galtair.com> writes:
> > After some research I still hold that fsync blocks, at least on
> > FreeBSD. Am I missing something?
> >
> > Here's the evidence:
> > [ much snipped ]
> > 	vp = (struct vnode *) fp->f_data;
> > 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, p);
>
> Hm, I take it a "vnode" is what's usually called an inode, ie the unique
> identification data for a specific disk file?

Yes, a Virtual Inode. I think it is virtual because it is used for NFS,
where the handle really isn't an inode.

> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I think the small issue is:

    proc1           proc2
    write
    fsync           write
                    fsync

Proc2 has to wait for the fsync, but the write is so short compared to the
fsync, I don't see an issue. Now, if someone would come up with code that
did only one fsync for the above case, that would be a big win.

--
  Bruce Momjian
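For what it's worth, a sketch of what "only one fsync for the above case"
could look like: a group commit in which whichever process flushes covers
every committer already queued behind it. The pthread primitives below are
stand-ins for PostgreSQL's shared memory and LWLocks, and the log
positions are plain longs for brevity:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed_cv = PTHREAD_COND_INITIALIZER;
    static long requested_lsn = 0;  /* highest commit record anyone needs */
    static long flushed_lsn   = 0;  /* known durable up to here */
    static int  flush_in_progress = 0;

    /* Called by each committer after its commit record is write()n. */
    void
    commit_flush(int wal_fd, long my_lsn)
    {
        pthread_mutex_lock(&flush_lock);
        if (my_lsn > requested_lsn)
            requested_lsn = my_lsn;

        while (flushed_lsn < my_lsn)
        {
            if (!flush_in_progress)
            {
                /* Become the flusher: one fsync covers every committer
                 * whose record was already in the file. */
                long target = requested_lsn;

                flush_in_progress = 1;
                pthread_mutex_unlock(&flush_lock);
                fsync(wal_fd);
                pthread_mutex_lock(&flush_lock);
                flush_in_progress = 0;
                if (target > flushed_lsn)
                    flushed_lsn = target;
                pthread_cond_broadcast(&flushed_cv);
            }
            else
            {
                /* Someone else is flushing; their fsync covers us too. */
                pthread_cond_wait(&flushed_cv, &flush_lock);
            }
        }
        pthread_mutex_unlock(&flush_lock);
    }

A committer arriving while a flush is in progress simply waits; when the
broadcast comes, it either finds its record covered or becomes the next
flusher, so two committers in Bruce's diagram share a single fsync.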
Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
It appears the fsync problem is pervasive. Here's Linux 2.4.19's version
from fs/buffer.c:

    lock->    down(&inode->i_sem);
              ret = filemap_fdatasync(inode->i_mapping);
              err = file->f_op->fsync(file, dentry, 1);
              if (err && !ret)
                  ret = err;
              err = filemap_fdatawait(inode->i_mapping);
              if (err && !ret)
                  ret = err;
    unlock->  up(&inode->i_sem);

But this is probably not a big factor, as you outline below, because the
WALWriteLock is causing the same kind of contention.

tom lane wrote:
> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I hope you're right, but I see some very similar contention problems in
the case of many small transactions because of the WALWriteLock.

Assume Transaction A, which writes a lot of buffers and XLog entries, so
the Commit forces a relatively lengthy fsync. Transactions B - E block not
on the kernel lock from fsync but on the WALWriteLock. When A finishes the
fsync and subsequently releases the WALWriteLock, B unblocks and gets the
WALWriteLock for the fsync of its flush. C blocks on the WALWriteLock
waiting to write its XLOG_XACT_COMMIT. B releases, and now C writes its
XLOG_XACT_COMMIT.

There now seems to be a lot of contention on the WALWriteLock. This is a
shame for a system that has no locking at the logical level and therefore
seems like it could be very, very fast and offer incredible concurrency.

> As I commented before, I think we could do with an extra process to
> issue WAL writes in places where they're not in the critical path for
> a foreground process. But that seems to be orthogonal from this issue.

It's only orthogonal to the fsync-specific contention issue. We now have
to worry about WALWriteLock semantics causing the same contention. Your
idea of a separate LogWriter process could very nicely solve this problem,
and accomplish a few other things at the same time, if we make a few
enhancements:

Back-end servers would not issue fsync calls. They would simply block
waiting until the LogWriter had written their record to the disk, i.e.
until the sync'd block # was greater than the block that contained the
XLOG_XACT_COMMIT record. The LogWriter could wake up committed back-ends
after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
would issue writes of the optimal size when enough data was present, or of
smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to allow
the LogWriter to write to disk while WALWriteLocks are acquired by
back-end servers. WALWriteLocks would only be held for the brief time
needed to copy the entries into the log buffer. The LogWriter would only
need to grab a lock to determine the current end of the log buffer. Since
it would be writing blocks that occur earlier in the cache than the
XLogInsert log writers, it won't need to grab a WALWriteLock before
writing the cache buffers.

Many transactions would commit on the same fsync (now really a write with
O_DSYNC) and we would get optimal write throughput for the log system.

This would handle all the issues I had, and it doesn't sound like a huge
change. In fact, it ends up being almost semantically identical to the
aio_write suggestion I made originally, except the LogWriter is doing the
background writing instead of the OS, and we don't have to worry about aio
implementations and portability.

- Curtis
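A sketch of the protocol as proposed; the shared positions and the
sleep/wakeup primitives are invented for illustration, and the buffer is
treated as linear rather than circular:

    #include <unistd.h>

    typedef struct WalShared
    {
        long insert_pos;    /* end of data copied into the WAL buffer */
        long synced_pos;    /* end of data known durable on disk */
    } WalShared;

    /* Hypothetical primitives standing in for real IPC. */
    extern void wait_for_logwriter(WalShared *ws);
    extern void wake_committed_backends(WalShared *ws);
    extern int  timeout_elapsed(void);

    /* Backend at commit: its record is already copied into the buffer
     * (brief lock); now just sleep until the LogWriter syncs past it. */
    void
    backend_commit_wait(WalShared *ws, long my_commit_pos)
    {
        while (ws->synced_pos < my_commit_pos)
            wait_for_logwriter(ws);
    }

    /* LogWriter: the file is opened O_DSYNC | O_APPEND, so a returned
     * write() means durable.  Write optimal-size chunks, or whatever is
     * pending once enough time has passed, then wake satisfied backends. */
    void
    logwriter_cycle(int wal_fd, WalShared *ws, char *buf, long optimal_size)
    {
        long pending = ws->insert_pos - ws->synced_pos;

        if (pending >= optimal_size || (pending > 0 && timeout_elapsed()))
        {
            long n = pending < optimal_size ? pending : optimal_size;

            write(wal_fd, buf + ws->synced_pos, (size_t) n);
            ws->synced_pos += n;
            wake_committed_backends(ws);
        }
    }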
tgl@sss.pgh.pa.us (Tom Lane) writes:

[snip]

> On a filesystem that does have that kind of problem, can't you avoid it
> just by using O_DSYNC on the WAL files? Then there's no need to call
> fsync() at all, except during checkpoints (which actually issue sync()
> not fsync(), anyway).

This comment on using sync() instead of fsync() makes me slightly worried,
since sync() doesn't in any way guarantee that all data is written
immediately. E.g. on *BSD with softupdates, it doesn't even guarantee that
data is written within some deterministic time as far as I know (*).

With a quick check of the code I found

    /*
     *  mdsync() -- Sync storage.
     */
    int
    mdsync()
    {
        sync();
        if (IsUnderPostmaster)
            sleep(2);
        sync();
        return SM_SUCCESS;
    }

which is ugly (imho) even if sync() starts an immediate and complete file
system flush (which I don't think it does with softupdates). It seems to
be used only by

    /* ------------------------------------------------
     *  FlushBufferPool
     *
     *  Flush all dirty blocks in buffer pool to disk
     *  at the checkpoint time
     * ------------------------------------------------
     */
    void
    FlushBufferPool(void)
    {
        BufferSync();
        smgrsync();     /* calls mdsync() */
    }

so the question that remains is what kinds of guarantees FlushBufferPool()
really expects and needs from smgrsync(). If smgrsync() is called to make
up for lack of fsync() calls in BufferSync(), I'm getting really worried
:-)

_
Mats Lofkvist
mal@algonet.se

(*) See for example
    http://groups.google.com/groups?th=bfc8a0dc5373ed6e
Curtis Faith wrote:
> Back-end servers would not issue fsync calls. They would simply block
> waiting until the LogWriter had written their record to the disk, i.e.
> until the sync'd block # was greater than the block that contained the
> XLOG_XACT_COMMIT record. The LogWriter could wake up committed
> back-ends after its log write returns.
>
> The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter
> would issue writes of the optimal size when enough data was present or
> of smaller chunks if enough time had elapsed since the last write.

So every backend is going to wait around until its fsync gets done by the
LogWriter process? How is that a win? This is just another version of our
GUC parameters:

    #commit_delay = 0           # range 0-100000, in microseconds
    #commit_siblings = 5        # range 1-1000

which attempt to delay fsync if other backends are nearing commit. Pushing
things out to another process isn't a win; figuring out if someone else is
coming for commit is. Remember, write() is fast, fsync is slow.

--
  Bruce Momjian
Bruce Momjian kirjutas L, 05.10.2002 kell 13:49:
> Curtis Faith wrote:
> > Back-end servers would not issue fsync calls. They would simply block
> > waiting until the LogWriter had written their record to the disk, i.e.
> > until the sync'd block # was greater than the block that contained the
> > XLOG_XACT_COMMIT record. The LogWriter could wake up committed
> > back-ends after its log write returns.
> >
> > The log file would be opened O_DSYNC, O_APPEND every time. The
> > LogWriter would issue writes of the optimal size when enough data was
> > present or of smaller chunks if enough time had elapsed since the
> > last write.
>
> So every backend is going to wait around until its fsync gets done by
> the LogWriter process? How is that a win? This is just another version
> of our GUC parameters:
>
> 	#commit_delay = 0		# range 0-100000, in microseconds
> 	#commit_siblings = 5		# range 1-1000
>
> which attempt to delay fsync if other backends are nearing commit.
> Pushing things out to another process isn't a win; figuring out if
> someone else is coming for commit is.

Exactly. If I understand correctly what Curtis is proposing, you don't
have to figure it out under his scheme - you just issue a WALWait command
and the WAL-writing process notifies you when your transaction's WAL is in
safe storage.

If the other committer was able to get his WALWait in before the actual
write took place, it will be notified too; if not, it will be notified
about 1/166th sec later (for a 10K rpm disk), when its write is done on
the next rev of the disk platters.

The writer process should just issue a continuous stream of aio_write()s
while there are any waiters, and keep track of which waiters are safe to
continue - thus no guessing of who's gonna commit. If supported by the
platform, this should use zero-copy writes - it should be safe because WAL
is append-only.

-----------
Hannu
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
Bruce Momjian wrote:
> So every backend is going to wait around until its fsync gets done by
> the LogWriter process? How is that a win? This is just another version
> of our GUC parameters:
>
> 	#commit_delay = 0		# range 0-100000, in microseconds
> 	#commit_siblings = 5		# range 1-1000
>
> which attempt to delay fsync if other backends are nearing commit.
> Pushing things out to another process isn't a win; figuring out if
> someone else is coming for commit is.

It's not the same at all. My proposal makes two extremely important
changes from a performance perspective.

1) WALWriteLocks are never held by processes for lengthy transactions,
only for long enough to copy the log entry into the buffer. This means
real work can be done by other processes while a transaction is waiting
for its commit to finish. I'm sure that blocking on XLogInsert because
another transaction is performing an fsync is extremely common with
frequent-update scenarios.

2) The log is written using optimal write sizes, which is much better than
a user-defined guess at the number of microseconds to delay the fsync. We
should be able to get the bottleneck to be the maximum write throughput of
the disk with the modifications to Tom Lane's scheme I proposed.

> Remember, write() is fast, fsync is slow.

Okay, it's clear I missed the point about Unix write earlier :-)

However, it's not just saving fsyncs that we need to worry about. It's the
unnecessary blocking of other processes that are simply trying to append
some log records in the course of whatever updating or inserting they are
doing. They may be a long way from commit. fsync being slow is the whole
reason for not wanting to have exclusive locks held for the duration of an
fsync.

On an SMP machine this change alone would probably speed things up by an
order of magnitude (assuming there aren't any other similar locks causing
the same problem).

- Curtis
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: Tom Lane
"Curtis Faith" <curtis@galtair.com> writes: > Assume Transaction A which writes a lot of buffers and XLog entries, > so the Commit forces a relatively lengthy fsynch. > Transactions B - E block not on the kernel lock from fsync but on > the WALWriteLock. You are confusing WALWriteLock with WALInsertLock. A transaction-committing flush operation only holds the former. XLogInsert only needs the latter --- at least as long as it doesn't need to write. Thus, given adequate space in the WAL buffers, transactions B-E do not get blocked by someone else who is writing/syncing in order to commit. Now, as the code stands at the moment there is no event other than commit or full-buffers that prompts a write; that means that we are likely to run into the full-buffer case more often than is good for performance. But a background writer task would fix that. > Back-end servers would not issue fsync calls. They would simply block > waiting until the LogWriter had written their record to the disk, i.e. > until the sync'd block # was greater than the block that contained the > XLOG_XACT_COMMIT record. The LogWriter could wake up committed back- > ends after its log write returns. This will pessimize performance except in the case where WAL traffic is very heavy, because it means you don't commit until the block containing your commit record is filled. What if you are the only active backend? My view of this is that backends would wait for the background writer only when they encounter a full-buffer situation, or indirectly when they are trying to do a commit write and the background guy has the WALWriteLock. The latter serialization is unavoidable: in that scenario, the background guy is writing/flushing an earlier page of the WAL log, and we *must* have that down to disk before we can declare our transaction committed. So any scheme that tries to eliminate the serialization of WAL writes will fail. I do not, however, see any value in forcing all the WAL writes to be done by a single process; which is essentially what you're saying we should do. That just adds extra process-switch overhead that we don't really need. > The log file would be opened O_DSYNC, O_APPEND every time. Keep in mind that we support platforms without O_DSYNC. I am not sure whether there are any that don't have O_SYNC either, but I am fairly sure that we measured O_SYNC to be slower than fsync()s on some platforms. > The nice part is that the WALWriteLock semantics could be changed to > allow the LogWriter to write to disk while WALWriteLocks are acquired > by back-end servers. As I said, we already have that; you are confusing WALWriteLock with WALInsertLock. > Many transactions would commit on the same fsync (now really a write > with O_DSYNC) and we would get optimal write throughput for the log > system. How are you going to avoid pessimizing the few-transactions case? regards, tom lane
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Curtis Faith" <curtis@galtair.com> writes: > > The log file would be opened O_DSYNC, O_APPEND every time. > > Keep in mind that we support platforms without O_DSYNC. I am not > sure whether there are any that don't have O_SYNC either, but I am > fairly sure that we measured O_SYNC to be slower than fsync()s on > some platforms. And don't we preallocate WAL files anyway? So O_APPEND would be irrelevant? -Doug
Hannu Krosing <hannu@tm.ee> writes: > The writer process should just issue a continuous stream of > aio_write()'s while there are any waiters and keep track of which waiters > are safe to continue - thus no guessing of who's gonna commit. This recipe sounds like "eat I/O bandwidth whether we need it or not". It might be optimal in the case where activity is so heavy that we do actually need a WAL write on every disk revolution, but in any scenario where we're not maxing out the WAL disk's bandwidth, it will hurt performance. In particular, it would seriously degrade performance if the WAL file isn't on its own spindle but has to share bandwidth with data file access. What we really want, of course, is "write on every revolution where there's something worth writing" --- either we've filled a WAL block or there is a commit pending. But that just gets us back into the same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. I don't see how an extra process makes that problem any easier. BTW, it would seem to me that aio_write() buys nothing over plain write() in terms of ability to gang writes. If we issue the write at time T and it completes at T+X, we really know nothing about exactly when in that interval the data was read out of our WAL buffers. We cannot assume that commit records that were stored into the WAL buffer during that interval got written to disk. The only safe assumption is that only records that were in the buffer at time T are down to disk; and that means that late arrivals lose. You can't issue aio_write immediately after the previous one completes and expect that this optimizes performance --- you have to delay it as long as you possibly can in hopes that more commit records arrive. So it comes down to being the same problem. regards, tom lane
Mats Lofkvist <mal@algonet.se> writes: > [ mdsync is ugly and not completely reliable ] Yup, it is. Do you have a better solution? fsync is not the answer, since the checkpoint process has no way to know what files may have been touched since the last checkpoint ... and even if it could find that out, a string of retail fsync calls would kill performance, cf. Curtis Faith's complaint. In practice I am not sure there is a problem. The local man page for sync() says The writing, although scheduled, is not necessarily complete upon return from sync. Now if "scheduled" means "will occur before any subsequently-commanded write occurs" then we're fine. I don't know if that's true though ... regards, tom lane
Re: Use of sync() [was Re: Potential Large Performance Gain in WAL synching]
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > In practice I am not sure there is a problem. The local man page for > sync() says > > The writing, although scheduled, is not necessarily complete upon > return from sync. > > Now if "scheduled" means "will occur before any subsequently-commanded > write occurs" then we're fine. I don't know if that's true though ... In my understanding, it means "all currently dirty blocks in the file cache are queued to the disk driver". The queued writes will eventually complete, but not necessarily before sync() returns. I don't think subsequent write()s will block, unless the system is low on buffers and has to wait until dirty blocks are freed by the driver. -Doug
Re: Proposed LogWriter Scheme, WAS: Potential Large Performance Gain in WAL synching
From: "Curtis Faith"
> You are confusing WALWriteLock with WALInsertLock. A > transaction-committing flush operation only holds the former. > XLogInsert only needs the latter --- at least as long as it > doesn't need to write. Well, that makes things better than I thought. We still end up with a disk write for each transaction though, and I don't see how this can ever get better than (Disk RPM)/60 transactions per second, since commit fsyncs are serialized: a 10,000 RPM drive, for example, caps out at about 166 serialized commits per second. Every fsync will have to wait almost a full revolution to reach the end of the log. As a practical matter then everyone will use commit_delay to improve this. > This will pessimize performance except in the case where WAL traffic > is very heavy, because it means you don't commit until the block > containing your commit record is filled. What if you are the only > active backend? We could handle this using a mechanism analogous to the current commit delay. If there are more than commit_siblings other processes running then do the write automatically after the commit_delay interval. This would make things no more pessimistic than the current implementation but provide the additional benefit of allowing the LogWriter to write in optimal sizes if there are many transactions. The commit_delay method won't be as good in many cases. Consider an update scenario where a larger commit delay gives better throughput. A given transaction will flush after the commit_delay interval (microseconds, per the GUC). The delay is very unlikely to result in a scenario where the dirty log buffers are the optimal size. As a practical matter I think this would tend to make the writes larger than they would otherwise have been, and this would unnecessarily delay the commit on the transaction. > I do not, however, see any > value in forcing all the WAL writes to be done by a single process, > which is essentially what you're saying we should do. That just adds > extra process-switch overhead that we don't really need. I don't think that an fsync will ever NOT cause the process to get switched out, so I don't see how another process doing the write would result in more overhead. The fsync'ing process will block on the fsync, so there will always be at least one process switch (probably many) while waiting for the fsync to complete, since we are talking many milliseconds for the fsync in every case. > > The log file would be opened O_DSYNC, O_APPEND every time. > > Keep in mind that we support platforms without O_DSYNC. I am not > sure whether there are any that don't have O_SYNC either, but I am > fairly sure that we measured O_SYNC to be slower than fsync()s on > some platforms. Well, there is no reason that the LogWriter couldn't be doing fsyncs instead of O_DSYNC writes in those cases. I'd leave this switchable using the current flags. Just change the semantics a bit. - Curtis
Doug McNaught <doug@wireboard.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> In practice I am not sure there is a problem. The local man page for >> sync() says >> >> The writing, although scheduled, is not necessarily complete upon >> return from sync. >> >> Now if "scheduled" means "will occur before any subsequently-commanded >> write occurs" then we're fine. I don't know if that's true though ... > In my understanding, it means "all currently dirty blocks in the file > cache are queued to the disk driver". The queued writes will > eventually complete, but not necessarily before sync() returns. I > don't think subsequent write()s will block, unless the system is low > on buffers and has to wait until dirty blocks are freed by the driver. We don't need later write()s to block. We only need them to not hit disk before the sync-queued writes hit disk. So I guess the question boils down to what "queued to the disk driver" means --- has the order of writes been determined at that point? regards, tom lane
> In particular, it would seriously degrade performance if the WAL file > isn't on its own spindle but has to share bandwidth with > data file access. If the OS is stupid I could see this happening. But if there are buffers and some sort of elevator algorithm the I/O won't happen at bad times. I agree with you though that writing for every single insert probably does not make sense. There should be some blocking of writes. The optimal size would have to be derived empirically. > What we really want, of course, is "write on every revolution where > there's something worth writing" --- either we've filled a WAL block > or there is a commit pending. But that just gets us back into the > same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. > I don't see how an extra process makes that problem any easier. The whole point of the extra process handling all the writes is so that it can write on every revolution, if there is something to write. It doesn't need to care if more commits will arrive soon. > BTW, it would seem to me that aio_write() buys nothing over plain write() > in terms of ability to gang writes. If we issue the write at time T > and it completes at T+X, we really know nothing about exactly when in > that interval the data was read out of our WAL buffers. We cannot > assume that commit records that were stored into the WAL buffer during > that interval got written to disk. Why would we need to make that assumption? The only thing we'd need to know is that a given write succeeded, meaning that commits before that write are done. The advantage to aio_write in this scenario is when writes cross track boundaries or when the head is in the wrong spot. If we write in reasonable blocks with aio_write the write might get to the disk before the head passes the location for the write. Consider a scenario where: Head is at file offset 10,000. Log contains blocks 12,000 - 12,500 ..time passes.. Head is now at 12,050 Commit occurs writing block 12,501 In the aio_write case the write would already have been done for blocks 12,000 to 12,050 and would be queued up for some additional blocks up to potentially 12,500. So the write for the commit could occur without an additional rotation delay. We are talking roughly 6 to 17 milliseconds of delay for this rotation on a single disk (one revolution at 10,000 down to 3,600 RPM). I don't know how often this happens in actual practice but it might occur as often as every other time. - Curtis
Curtis Faith wrote: > The advantage to aio_write in this scenario is when writes cross track > boundaries or when the head is in the wrong spot. If we write > in reasonable blocks with aio_write the write might get to the disk > before the head passes the location for the write. > > Consider a scenario where: > > Head is at file offset 10,000. > > Log contains blocks 12,000 - 12,500 > > ..time passes.. > > Head is now at 12,050 > > Commit occurs writing block 12,501 > > In the aio_write case the write would already have been done for blocks > 12,000 to 12,050 and would be queued up for some additional blocks up to > potentially 12,500. So the write for the commit could occur without an > additional rotation delay. We are talking roughly 6 to 17 milliseconds > of delay for this rotation on a single disk. I don't know how often this > happens in actual practice but it might occur as often as every other > time. So, you are saying that we may get back aio confirmation quicker than if we issued our own write/fsync because the OS was able to slip our flush to disk in as part of someone else's or a general fsync? I don't buy that because it is possible our write() gets in as part of someone else's fsync and our fsync becomes a no-op, meaning there aren't any dirty buffers for that file. Isn't that also possible? Also, remember the kernel doesn't know where the platter rotation is either. Only the SCSI drive can reorder the requests to match this. The OS can group based on head location, but it doesn't know much about the platter location, and it doesn't even know where the head is. Also, does aio return info when the data is in the kernel buffers or when it is actually on the disk? Simply, aio allows us to do the write and get notification when it is complete. I don't see how that helps us, and I don't see any other advantages to aio. To use aio, we need to find something that _can't_ be solved with more traditional Unix APIs, and I haven't seen that yet. This aio thing is getting out of hand. It's like we have a hammer, and everything looks like a nail, or a use for aio. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> So, you are saying that we may get back aio confirmation quicker than if > we issued our own write/fsync because the OS was able to slip our flush > to disk in as part of someone else's or a general fsync? > > I don't buy that because it is possible our write() gets in as part of > someone else's fsync and our fsync becomes a no-op, meaning there aren't > any dirty buffers for that file. Isn't that also possible? Separate out the two concepts: 1) Writing of incomplete transactions at the block level by a background LogWriter. I think it doesn't matter whether the write is aio_write or write; writing blocks when we get them should provide the benefit I outlined. Waiting till fsync could miss the opportunity to write before the head passes the end of the last durable write, because the drive buffers might empty, causing up to a full rotation's delay. 2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes whether the copies to the file system buffers happen synchronously or not. This is not a big difference but it seems to me that the OS might be able to avoid some context switches by grouping the copying in the case of aio_write. I've heard anecdotal reports that this is significantly faster for some things but I don't know for certain. > > Also, remember the kernel doesn't know where the platter rotation is > either. Only the SCSI drive can reorder the requests to match this. The > OS can group based on head location, but it doesn't know much about the > platter location, and it doesn't even know where the head is. The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to cause a rotational latency. It's not so much a matter of reordering as it is of getting the data into the SCSI drive before the head passes the last write's position. If the SCSI drive's buffers are kept full it can continue writing at its full throughput. If the writes stop and the buffers empty, it will need to wait up to a full rotation before it gets to the end of the log again. > Also, does aio return info when the data is in the kernel buffers or > when it is actually on the disk? > > Simply, aio allows us to do the write and get notification when it is > complete. I don't see how that helps us, and I don't see any other > advantages to aio. To use aio, we need to find something that _can't_ > be solved with more traditional Unix APIs, and I haven't seen that yet. > > This aio thing is getting out of hand. It's like we have a hammer, and > everything looks like a nail, or a use for aio. Yes, while I think it's probably worth doing and faster, it won't help as much as just keeping the drive buffers full, even if that's by using write calls. I still don't understand the opposition to aio_write. Could we just have the configuration setup determine whether one or the other is used? I don't see why we wouldn't use the faster calls if they were present and reliable on a given system. - Curtis
Curtis Faith wrote: > > So, you are saying that we may get back aio confirmation quicker than if > > we issued our own write/fsync because the OS was able to slip our flush > > to disk in as part of someone else's or a general fsync? > > > > I don't buy that because it is possible our write() gets in as part of > > someone else's fsync and our fsync becomes a no-op, meaning there aren't > > any dirty buffers for that file. Isn't that also possible? > > Separate out the two concepts: > > 1) Writing of incomplete transactions at the block level by a > background LogWriter. > > I think it doesn't matter whether the write is aio_write or > write; writing blocks when we get them should provide the benefit > I outlined. > > Waiting till fsync could miss the opportunity to write before the > head passes the end of the last durable write, because the drive > buffers might empty, causing up to a full rotation's delay. No question about that! The sooner we can get stuff to the WAL buffers, the more likely we will get some other transaction to do our fsync work. Any ideas on how we can do that? > 2) aio_write vs. normal write. > > Since, as you and others have pointed out, aio_write and write are both > asynchronous, the issue becomes whether the copies to the > file system buffers happen synchronously or not. > > This is not a big difference but it seems to me that the OS might be > able to avoid some context switches by grouping the copying in the case > of aio_write. I've heard anecdotal reports that this is significantly > faster for some things but I don't know for certain. I suppose it is possible, but because we spend so much time in fsync, we want to focus on that. People have recommended mmap of the WAL file, and that seems like a much more direct way to handle it than aio. However, we can't control when the stuff gets sent to disk with mmap'ed WAL, or should I say we can't write to it and withhold writes to the disk file with mmap, so we would need some intermediate step, and then again, it just becomes more steps, and extra steps slow things down too. > > This aio thing is getting out of hand. It's like we have a hammer, and > > everything looks like a nail, or a use for aio. > > Yes, while I think it's probably worth doing and faster, it won't help as > much as just keeping the drive buffers full, even if that's by using write > calls. > I still don't understand the opposition to aio_write. Could we just have > the configuration setup determine whether one or the other is used? I > don't see why we wouldn't use the faster calls if they were present and > reliable on a given system. We hesitate to add code relying on new features unless it is a significant win, and in the aio case, we would have different WAL disk write models for with/without aio, so it clearly could be two code paths, and with two code paths, we can't as easily improve or optimize. If we get a 2% boost out of some feature, but it later discourages us from adding a 5% optimization, it is a loss. And, in most cases, the 2% optimization is for a few platforms, while the 5% optimization is for all. This code is +15 years old, so we are looking way down the road, not just for today's hot feature. For example, Tom just improved DISTINCT by 25% by optimizing some of the sorting and function call handling. If we had more complex threaded sort code, that may not have been possible, or it may have been possible for him to optimize only one of the code paths.
I can't tell you how many aio/mmap/fancy feature discussions we have had, and we obviously discuss them, but in the end, they end up being of questionable value for the risk/complexity; but, we keep talking, hoping we are wrong or some good ideas come out of it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Sat, 2002-10-05 at 20:32, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > The writer process should just issue a continuous stream of > > aio_write()'s while there are any waiters and keep track of which waiters > > are safe to continue - thus no guessing of who's gonna commit. > > This recipe sounds like "eat I/O bandwidth whether we need it or not". > It might be optimal in the case where activity is so heavy that we > do actually need a WAL write on every disk revolution, but in any > scenario where we're not maxing out the WAL disk's bandwidth, it will > hurt performance. In particular, it would seriously degrade performance > if the WAL file isn't on its own spindle but has to share bandwidth with > data file access. > > What we really want, of course, is "write on every revolution where > there's something worth writing" --- either we've filled a WAL block > or there is a commit pending. That's what I meant by "while there are any waiters". > But that just gets us back into the > same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon. > I don't see how an extra process makes that problem any easier. I still think that we could get gang writes automatically, if we just ask for an aio_write at the completion of each WAL file page and keep track of those that are written. We could also keep track of the write position inside the WAL page for: 1. the end of the last write() of each process, and 2. the WAL file's write position at each aio_write(). Then we can safely(?) assume that each backend needs only its own write()s to be on disk before it can consider the trx committed. If the fsync()-like request comes in at a time when the aio_write for that process's last position has completed, we can let that process continue without even a context switch. In the above scenario I assume that the kernel can do the right thing by doing multiple aio_write requests for the same page in one sweep and not doing one physical write for each aio_write. > BTW, it would seem to me that aio_write() buys nothing over plain write() > in terms of ability to gang writes. If we issue the write at time T > and it completes at T+X, we really know nothing about exactly when in > that interval the data was read out of our WAL buffers. Yes, most likely. If we do several writes of the same pages they will hit physical disk at the same physical write. > We cannot > assume that commit records that were stored into the WAL buffer during > that interval got written to disk. The only safe assumption is that > only records that were in the buffer at time T are down to disk; and > that means that late arrivals lose. I assume that if each commit record issues an aio_write, then all of those which actually reached the disk will be notified. IOW the first aio_write orders the write, but all the latecomers which arrive before the actual write will also get written and notified. > You can't issue aio_write > immediately after the previous one completes and expect that this > optimizes performance --- you have to delay it as long as you possibly > can in hopes that more commit records arrive.
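A sketch of the bookkeeping this implies (the type and field names are hypothetical, purely for illustration):

#include <sys/types.h>

#define MAX_BACKENDS 64   /* arbitrary bound for the sketch */

/* Hypothetical flush-tracking state, shared by all backends. */
typedef struct
{
    off_t backend_write_end[MAX_BACKENDS]; /* 1. end of each backend's last write()   */
    off_t aio_write_pos;                   /* 2. WAL position at the last aio_write() */
    off_t flushed_pos;                     /* highest position known durable          */
} WalFlushState;

/* A backend may report commit as soon as everything it inserted is
 * durable - no fsync() of its own and, with luck, no context switch. */
static int commit_is_durable(const WalFlushState *s, int backend)
{
    return s->flushed_pos >= s->backend_write_end[backend];
}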
I guess we have quite different cases for different hardware configurations - if we have a separate disk subsystem for WAL, we may want to keep the log flowing to disk as fast as it is ready, including writing the last, partial page as often as new writes to it are done - as we possibly can't write more than ~ 250 times/sec (with 15K drives, no battery RAM) we will always have at least two context switches between writes (for a 500Hz context switch clock), and many more if processes background themselves while waiting for small transactions to commit. > So it comes down to being the same problem. Or its solution ;) as instead of predicting we just write all the data in the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period. --------------- Hannu
> No question about that! The sooner we can get stuff to the WAL buffers, > the more likely we will get some other transaction to do our fsync work. > Any ideas on how we can do that? More like the sooner we get stuff out of the WAL buffers and into the disk's buffers, whether by write or aio_write. It doesn't do any good to have information in the XLog unless it gets written to the disk buffers before they empty. > We hesitate to add code relying on new features unless it is a > significant win, and in the aio case, we would have different WAL disk > write models for with/without aio, so it clearly could be two code > paths, and with two code paths, we can't as easily improve or optimize. > If we get a 2% boost out of some feature, but it later discourages us > from adding a 5% optimization, it is a loss. And, in most cases, the 2% > optimization is for a few platforms, while the 5% optimization is for > all. This code is +15 years old, so we are looking way down the road, > not just for today's hot feature. I'll just have to implement it and see if it's as easy and isolated as I think it might be, and whether it would allow the same algorithm for aio_write or write. > I can't tell you how many aio/mmap/fancy feature discussions we have > had, and we obviously discuss them, but in the end, they end up being of > questionable value for the risk/complexity; but, we keep talking, > hoping we are wrong or some good ideas come out of it. I'm all in favor of keeping clean designs. I'm very pleased with how easy PostgreSQL is to read and understand given how much it does. - Curtis
Curtis Faith wrote: > > No question about that! The sooner we can get stuff to the WAL buffers, > > the more likely we will get some other transaction to do our fsync work. > > Any ideas on how we can do that? > > More like the sooner we get stuff out of the WAL buffers and into the > disk's buffers, whether by write or aio_write. Does aio_write just write, or write _and_ fsync()? > It doesn't do any good to have information in the XLog unless it > gets written to the disk buffers before they empty. Just for clarification, we have two issues in this thread: (1) WAL memory buffers fill up, forcing WAL writes; (2) multiple commits at the same time force too many fsync's. I just wanted to throw that out. > > I can't tell you how many aio/mmap/fancy feature discussions we have > > had, and we obviously discuss them, but in the end, they end up being of > > questionable value for the risk/complexity; but, we keep talking, > > hoping we are wrong or some good ideas come out of it. > > I'm all in favor of keeping clean designs. I'm very pleased with how > easy PostgreSQL is to read and understand given how much it does. Glad you see the situation we are in. ;-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Hannu Krosing <hannu@tm.ee> writes: > Or its solution ;) as instead of predicting we just write all the data > in the log that is ready to be written. If we postpone writing, there will > be hiccups when we suddenly discover that we need to write a whole lot > of pages (fsync()) after idling the disk for some period. This part is exactly the same point that I've been proposing to solve with a background writer process. We don't need aio_write for that. The background writer can handle pushing completed WAL pages out to disk. The sticky part is trying to gang the writes for multiple transactions whose COMMIT records would fit into the same WAL page, and that WAL page isn't full yet. The rest of what you wrote seems like wishful thinking about how aio_write might behave :-(. I have no faith in it. regards, tom lane
It seems that the Hackers list isn't in the subscribe/unsubscribe list at http://developer.postgresql.org/mailsub.php - just an FYI. -Mitch Computers are like Air Conditioners, they don't work when you open Windows.
On Sun, 2002-10-06 at 04:03, Tom Lane wrote: > Hannu Krosing <hannu@tm.ee> writes: > > Or its solution ;) as instead of predicting we just write all the data > > in the log that is ready to be written. If we postpone writing, there will > > be hiccups when we suddenly discover that we need to write a whole lot > > of pages (fsync()) after idling the disk for some period. > > This part is exactly the same point that I've been proposing to solve > with a background writer process. We don't need aio_write for that. > The background writer can handle pushing completed WAL pages out to > disk. The sticky part is trying to gang the writes for multiple > transactions whose COMMIT records would fit into the same WAL page, > and that WAL page isn't full yet. I just hoped that the kernel could be used as the background writer process, and in the process also solve the multiple-commits-on-the-same-page problem. > The rest of what you wrote seems like wishful thinking about how > aio_write might behave :-(. I have no faith in it. Yeah, and the fact that there are several slightly different implementations of AIO even on Linux alone does not help. I have to test the SGI KAIO implementation for conformance with my wishful thinking ;) Perhaps you could ask around about AIO in RedHat Advanced Server (is it the same AIO as SGI, and how does it behave in the "multiple writes on the same page" case) since you may have better links to RedHat? -------------- Hannu
On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: > > 2) aio_write vs. normal write. > > Since, as you and others have pointed out, aio_write and write are both > asynchronous, the issue becomes whether the copies to the > file system buffers happen synchronously or not. Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer-poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another with a slight bias toward aio_write. In less than ideal situations where buffer space is at a premium, I think aio_write will get the leg up. > The kernel doesn't need to know anything about platter rotation. It > just needs to keep the disk write buffers full enough not to cause > a rotational latency. Which is why in a buffer-poor environment, aio_write is generally preferred, as the write is still queued even if the buffer is full. That means the process will be ready to begin placing writes into the buffer, all without having to wait. On the other hand, when using write, the process must wait. In a worst-case scenario, it seems that aio_write does get a win. I personally would at least like to see an aio implementation and would even be willing to help benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion? Greg
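To illustrate the distinction Greg is drawing, compare the two submission paths (a sketch only, error handling omitted; whether write() actually blocks depends entirely on the kernel's buffer state):

#include <aio.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Plain write(): if the kernel has no free buffer to copy into,
 * the caller sleeps until one is reclaimed. */
void submit_blocking(int fd, void *buf, size_t len)
{
    (void) write(fd, buf, len);  /* may block in a buffer-poor system */
}

/* aio_write(): the request is queued and the call returns at once;
 * the copy happens later, as buffer space becomes available. */
void submit_queued(int fd, struct aiocb *cb, void *buf, size_t len)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    (void) aio_write(cb);        /* returns without waiting */
}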
Greg Copeland <greg@CopelandConsulting.Net> writes: > I personally would at least like to see an aio implementation and would > be willing to even help benchmark it to benchmark/validate any returns > in performance. Surely if testing reflected a performance boost it > would be considered for baseline inclusion? It'd be considered, but whether it'd be accepted would have to depend on the size of the performance boost, its portability (how many platforms/scenarios do you actually get a boost for), and the extent of bloat/uglification of the code. I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. regards, tom lane
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: > I can't personally get excited about something that only helps if your > server is starved for RAM --- who runs servers that aren't fat on RAM > anymore? But give it a shot if you like. Perhaps your analysis is > pessimistic. I do suspect my analysis is somewhat pessimistic too, but to what degree I have no idea. You make a good case on your memory argument but please allow me to kick it around further. I don't find it far-fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O bound systems or where sporadically other types of non-database file activity may occur. Now, while I continue to assure myself that it is not far-fetched, I honestly have no idea how often this type of situation will typically occur. Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allow it to be somewhat more "self-tuning" from a potential performance perspective. Greg
Re: Use of sync() [was Re: Potential Large Performance Gain in WAL synching]
From: Doug McNaught
Tom Lane <tgl@sss.pgh.pa.us> writes: > Doug McNaught <doug@wireboard.com> writes: > > In my understanding, it means "all currently dirty blocks in the file > > cache are queued to the disk driver". The queued writes will > > eventually complete, but not necessarily before sync() returns. I > > don't think subsequent write()s will block, unless the system is low > > on buffers and has to wait until dirty blocks are freed by the driver. > > We don't need later write()s to block. We only need them to not hit > disk before the sync-queued writes hit disk. So I guess the question > boils down to what "queued to the disk driver" means --- has the order > of writes been determined at that point? It's certainly possible that new write(s) get put into the queue alongside old ones--I think the Linux block layer tries to do this when it can, for one. According to the manpage, Linux used to wait until everything was written to return from sync(), though I don't *think* it does anymore. But that's not mandated by the specs. So I don't think we can rely on such behavior (not reordering writes across a sync()), though it will probably happen in practice a lot of the time. AFAIK there isn't anything better than sync() + sleep() as far as the specs go. Yes, it kinda sucks. ;) -Doug
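For what it's worth, the portable fallback Doug alludes to is literally this (a sketch; the two-second figure is an arbitrary guess, and nothing in the specs guarantees it is long enough):

#include <unistd.h>

/* Checkpoint-style flush when per-file fsync isn't practical:
 * schedule every dirty buffer in the system, then give the driver
 * time to retire the queued writes. */
void checkpoint_sync_fallback(void)
{
    sync();     /* schedule all dirty blocks for writing */
    sleep(2);   /* hope the queued writes have hit disk  */
}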
On 6 Oct 2002, Greg Copeland wrote: > On Sat, 2002-10-05 at 14:46, Curtis Faith wrote: > > > > 2) aio_write vs. normal write. > > > > Since, as you and others have pointed out, aio_write and write are both > > asynchronous, the issue becomes whether the copies to the > > file system buffers happen synchronously or not. > > Actually, I believe that write will be *mostly* asynchronous while > aio_write will always be asynchronous. In a buffer-poor environment, I > believe write will degrade into a synchronous operation. In an ideal > situation, I think they will prove to be on par with one another with a > slight bias toward aio_write. In less than ideal situations where > buffer space is at a premium, I think aio_write will get the leg up. I browsed the web and came across this piece of text regarding a Linux-KAIO patch by Silicon Graphics... "The asynchronous I/O (AIO) facility implements interfaces defined by the POSIX standard, although it has not been through formal compliance certification. This version of AIO is implemented with support from kernel modifications, and hence will be called KAIO to distinguish it from AIO facilities available from newer versions of glibc/librt. Because of the kernel support, KAIO is able to perform split-phase I/O to maximize concurrency of I/O at the device. With split-phase I/O, the initiating request (such as an aio_read) truly queues the I/O at the device as the first phase of the I/O request; a second phase of the I/O request, performed as part of the I/O completion, propagates results of the request. The results may include the contents of the I/O buffer on a read, the number of bytes read or written, and any error status. Preliminary experience with KAIO has shown over 35% improvement in database performance tests. Unit tests (which only perform I/O) using KAIO and Raw I/O have been successful in achieving 93% saturation with 12 disks hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these encouraging results are a direct result of implementing a significant part of KAIO in the kernel using split-phase I/O while avoiding or minimizing the use of any globally contended locks." Well... > In a worst-case scenario, it seems that aio_write does get a win. > > I personally would at least like to see an aio implementation and would > even be willing to help benchmark/validate any returns > in performance. Surely if testing reflected a performance boost it > would be considered for baseline inclusion?
On Mon, 2002-10-07 at 10:38, Antti Haapala wrote: > I browsed the web and came across this piece of text regarding a Linux-KAIO > patch by Silicon Graphics... > Ya, I have read this before. The problem here is that I'm not aware of which AIO implementation on Linux is the forerunner, nor do I have any idea how its implementation or performance details differ from those of other implementations on other platforms. I know there are at least two aio efforts underway for Linux. There could yet be others. Attempting to cite specifics that only pertain to Linux, and then only with a specific implementation which may or may not be in general use, is questionable. Because of this I simply left it as saying that I believe my analysis is pessimistic. Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? > > Preliminary experience with KAIO has shown over 35% improvement in > database performance tests. Unit tests (which only perform I/O) using KAIO > and Raw I/O have been successful in achieving 93% saturation with 12 disks > hung off 2 X 40 MB/s Ultra-Wide SCSI channels. We believe that these > encouraging results are a direct result of implementing a significant > part of KAIO in the kernel using split-phase I/O while avoiding or > minimizing the use of any globally contended locks." The problem here is, I have no idea what they are comparing to (worst-case reads/writes, which we know PostgreSQL *mostly* isn't suffering from). If we assume that PostgreSQL's read/write operations are somewhat optimized (as it currently sounds like they are), I'd seriously doubt we'd see that big of a difference. On the other hand, I'm hoping that if an aio postgresql implementation does get done we'll see something like a 5%-10% performance boost. Even still, I have nothing to pin that on other than hope. If we do see a notable performance increase for Linux, I have no idea what it will do for other platforms. Then, there are all of the issues that Tom brought up about bloat/uglification and maintainability. So, while I certainly do keep those remarks in mind, I think it's best to simply encourage the effort (or something like it) and help determine where we really sit by means of empirical evidence. Greg
Greg Copeland <greg@CopelandConsulting.Net> writes: > Ya, I have read this before. The problem here is that I'm not aware of > which AIO implementation on Linux is the forerunner, nor do I have any > idea how its implementation or performance details differ from those of > other implementations on other platforms. The implementation of AIO in 2.5 is the one by Ben LaHaise (not SGI). Not sure what the performance is like -- although it's been merged into 2.5 already, so someone can do some benchmarking. Can anyone suggest a good test? Keep in mind that glibc has had a user-space implementation for a little while (although I'd guess the performance to be unimpressive), so AIO would not be *that* kernel-version specific. > Anyone have any idea if Red Hat's Advanced Server uses KAIO or what? RH AS uses Ben LaHaise's implementation of AIO, I believe. Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> On Sun, 2002-10-06 at 11:46, Tom Lane wrote: > > I can't personally get excited about something that only helps if your > > server is starved for RAM --- who runs servers that aren't fat on RAM > > anymore? But give it a shot if you like. Perhaps your analysis is > > pessimistic. > > <snipped> I don't find it far-fetched to > imagine situations where people may commit large amounts of memory for > the database yet marginally starve available memory for file system > buffers. Especially so on heavily I/O bound systems or where sporadically > other types of non-database file activity may occur. > > <snipped> Of course, that opens the door for simply adding more memory > and/or slightly reducing the amount of memory available to the database > (thus making it available elsewhere). Now, after all that's said and > done, having something like aio in use would seemingly allow it to be > somewhat more "self-tuning" from a potential performance perspective. Good points. Now for some surprising news (at least it surprised me). I researched the file system source on my system (FreeBSD 4.6) and found that the behavior was optimized for non-database access, to eliminate unnecessary writes when temp files are created and deleted rapidly. It was not optimized to get data to the disk in the most efficient manner. The syncer on FreeBSD appears to place dirtied filesystem buffers into work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer processes one of the queues and increments a counter syncer_delayno. On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd of the writes that were buffered are processed. If the syncer gets behind and the writes for a given second take more than one second to process, the syncer does not wait but begins processing the next queue. AFAICT this means that there is no opportunity to have writes combined by the disk, since they are processed in buckets based on the time the writes came in. Also, it seems very likely that many installations won't have enough buffers for 30 seconds' worth of changes, and that there would be some level of SYNCHRONOUS writing because of this delay and the syncer process getting backed up. This might happen once per second as the buffers get full and the syncer has not yet started for that second interval. Linux might handle this better. I saw some emails exchanged a year or so ago about starting writes immediately in a low-priority way, but I'm not sure if those patches got applied to the Linux kernel or not. The source I had access to seems to do something analogous to FreeBSD but using fixed percentages of the dirty blocks or a minimum number of blocks. They appear to be handled in LRU order, however. On-disk caches are much, much larger these days, so it seems that some way of getting the data out sooner would result in better write performance for the cache. My newer drive is a 10K RPM IBM Ultrastar SCSI and it has a 4M cache. I don't see these caches getting smaller over time, so not letting the disk see writes will become more and more of a performance drain. - Curtis
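A simplified model of the syncer behavior described above (illustrative only; the real logic lives in FreeBSD's kernel sources, and flush_bucket here is an invented stand-in):

#define SYNCER_MAXDELAY 32

extern void flush_bucket(int bucket);  /* invented: write out one bucket */

/* Dirty buffers are hashed into SYNCER_MAXDELAY time buckets; one
 * bucket is flushed per second, so a freshly dirtied buffer can sit
 * for up to ~30 seconds before the syncer pushes it to disk. */
static int syncer_delayno = 0;

void syncer_tick(void)
{
    flush_bucket(syncer_delayno);  /* 1/32nd of the buffered writes */
    syncer_delayno = (syncer_delayno + 1) % SYNCER_MAXDELAY;
}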
Curtis Faith wrote: > Good points. > > Now for some surprising news (at least it surprised me). > > I researched the file system source on my system (FreeBSD 4.6) and found > that the behavior was optimized for non-database access, to eliminate > unnecessary writes when temp files are created and deleted rapidly. It was > not optimized to get data to the disk in the most efficient manner. > > The syncer on FreeBSD appears to place dirtied filesystem buffers into > work queues that range from 1 to SYNCER_MAXDELAY. Each second the syncer > processes one of the queues and increments a counter syncer_delayno. > > On my system the setting for SYNCER_MAXDELAY is 32. So each second 1/32nd > of the writes that were buffered are processed. If the syncer gets behind > and the writes for a given second take more than one second to process, the > syncer does not wait but begins processing the next queue. > > AFAICT this means that there is no opportunity to have writes combined by > the disk, since they are processed in buckets based on the time the writes > came in. This is the trickle syncer. It prevents bursts of disk activity every 30 seconds. It is for non-fsync writes, of course, and I assume if the kernel buffers get low, it starts to flush faster. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> This is the trickle syncer. It prevents bursts of disk activity every > 30 seconds. It is for non-fsync writes, of course, and I assume if the > kernel buffers get low, it starts to flush faster. AFAICT, the syncer only speeds up when virtual memory paging fills the buffers past a threshold, and even in that event the speedup is only a factor of two. I can't find any provision for speeding up flushing of the dirty buffers when they fill for normal file system writes, so I don't think that happens. - Curtis
On Mon, 2002-10-07 at 15:28, Bruce Momjian wrote: > This is the trickle syncer. It prevents bursts of disk activity every > 30 seconds. It is for non-fsync writes, of course, and I assume if the > kernel buffers get low, it starts to flush faster. Doesn't this also increase the likelihood that people will be running in a buffer-poor environment more frequently than I previously asserted, especially in very heavily I/O bound systems? Unless I'm mistaken, that opens the door for a general case of why an aio implementation should be looked into. Also, on a side note, IIRC, Linux kernel 2.5.x has a new priority elevator which is said to be MUCH better at saturating disks than ever before. Once 2.6 (or whatever its number will be) is released, it may not be as much of a problem as it seems to be for FreeBSD (I think that's the one you're using). Greg
On Mon, 2002-10-07 at 21:35, Neil Conway wrote: > Greg Copeland <greg@CopelandConsulting.Net> writes: > > Ya, I have read this before. The problem here is that I'm not aware of > > which AIO implementation on Linux is the forerunner, nor do I have any > > idea how its implementation or performance details differ from those of > > other implementations on other platforms. > > The implementation of AIO in 2.5 is the one by Ben LaHaise (not > SGI). Not sure what the performance is like -- although it's been > merged into 2.5 already, so someone can do some benchmarking. Can > anyone suggest a good test? What would be really interesting is to aio_write small chunks to the same 8k page from multiple threads/processes and then wait for the page to be written to disk. Then check how many backends get their wait back from the same write. The docs for the POSIX aio_xxx functions are at: http://www.opengroup.org/onlinepubs/007904975/functions/aio_write.html ---------------- Hannu
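A self-contained sketch of such a test might look like the following (assumes POSIX AIO is available, often linked with -lrt; the chunk size and count are arbitrary choices):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NCHUNKS 8
#define CHUNK   1024            /* 8 x 1 KB chunks land in one 8 KB page */

int main(void)
{
    int fd = open("aio_test.dat", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    struct aiocb cb[NCHUNKS];
    const struct aiocb *list[NCHUNKS];
    static char buf[NCHUNKS][CHUNK];
    int i, done = 0;

    /* Queue several writes into the same 8 KB page, as several
     * backends would, and see how they complete. */
    for (i = 0; i < NCHUNKS; i++)
    {
        memset(&cb[i], 0, sizeof(cb[i]));
        memset(buf[i], 'a' + i, CHUNK);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = CHUNK;
        cb[i].aio_offset = (off_t) i * CHUNK;
        aio_write(&cb[i]);
        list[i] = &cb[i];
    }

    /* Wait for all of them; the interesting question is whether the
     * kernel satisfies several requests with one physical write. */
    while (done < NCHUNKS)
    {
        aio_suspend(list, NCHUNKS, NULL);
        done = 0;
        for (i = 0; i < NCHUNKS; i++)
            if (aio_error(&cb[i]) != EINPROGRESS)
                done++;
    }
    printf("all %d chunks written\n", NCHUNKS);
    close(fd);
    return 0;
}

Timing the completion loop, and counting how many requests finish together, would show whether multiple aio_writes to one page get merged into a single physical write.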
Greg Copeland <greg@CopelandConsulting.Net> writes: > Doesn't this also increase the likelihood that people will be running in > a buffer-poor environment more frequently than I previously asserted, > especially in very heavily I/O bound systems? Unless I'm mistaken, that > opens the door for a general case of why an aio implementation should be > looked into. Well, at least for *this specific situation*, it doesn't really change anything -- since FreeBSD doesn't implement POSIX AIO as far as I know, we can't use that as an alternative. However, I'd suspect that the FreeBSD kernel allows for some way to tune the behavior of the syncer. If that's the case, we could do some research into what settings are more appropriate for FreeBSD, and recommend those in the docs. I don't run FreeBSD, however -- would someone like to volunteer to take a look at this? BTW Curtis, did you happen to check whether this behavior has been changed in FreeBSD 5.0? > Also, on a side note, IIRC, Linux kernel 2.5.x has a new priority > elevator which is said to be MUCH better at saturating disks than ever > before. Yeah, there are lots of new & interesting features for database systems in the new kernel -- I'm looking forward to when 2.6 is widely deployed... Cheers, Neil -- Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
> Greg Copeland <greg@CopelandConsulting.Net> writes: > > Doesn't this also increase the likelihood that people will be > > running in a buffer-poor environment more frequently than I > > previously asserted, especially in very heavily I/O bound > > systems? Unless I'm mistaken, that opens the door for a > > general case of why an aio implementation should be looked into. Neil Conway replies: > Well, at least for *this specific situation*, it doesn't really change > anything -- since FreeBSD doesn't implement POSIX AIO as far as I > know, we can't use that as an alternative. I haven't tried it yet but there does seem to be an aio implementation that conforms to POSIX in FreeBSD 4.6.2. It's part of the kernel and can be found in: /usr/src/sys/kern/vfs_aio.c > However, I'd suspect that the FreeBSD kernel allows for some way to > tune the behavior of the syncer. If that's the case, we could do some > research into what settings are more appropriate for FreeBSD, and > recommend those in the docs. I don't run FreeBSD, however -- would > someone like to volunteer to take a look at this? I didn't see anything obvious in the docs but I still believe there's some way to tune it. I'll let everyone know if I find some better settings. > BTW Curtis, did you happen to check whether this behavior has been > changed in FreeBSD 5.0? I haven't checked but I will.
Curtis Faith wrote: > > This is the trickle syncer. It prevents bursts of disk activity every > > 30 seconds. It is for non-fsync writes, of course, and I assume if the > > kernel buffers get low, it starts to flush faster. > > AFAICT, the syncer only speeds up when virtual memory paging fills the > buffers past a threshold, and even in that event the speedup is only a > factor of two. > > I can't find any provision for speeding up flushing of the dirty buffers > when they fill for normal file system writes, so I don't think that > happens. So you think if I try to write a 1 gig file, it will write enough to fill up the buffers, then wait while the sync'er writes out a few blocks every second, free up some buffers, then write some more? Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when it can't get a buffer, it will async write a dirty buffer to disk. As far as this AIO conversation is concerned, I want to see someone come up with some performance improvement that we can only do with AIO. Unless I see it, I am not interested in pursuing this thread. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> So you think if I try to write a 1 gig file, it will write enough to > fill up the buffers, then wait while the sync'er writes out a few blocks > every second, free up some buffers, then write some more? > > Take a look at vfs_bio::getnewbuf() on *BSD and you will see that when > it can't get a buffer, it will async write a dirty buffer to disk. We've addressed this scenario before; if I recall, the point Greg made earlier is that buffers getting full means writes become synchronous. What I was trying to point out was that it is very likely that the buffers will fill even for large buffers, and that the writes are going to be driven out not by efficient ganging but by something approaching LRU flushing, with an occasional once-a-second slightly more efficient write of 1/32nd of the buffers. Once the buffers get full, all subsequent writes turn into synchronous writes, since even if the kernel writes asynchronously (meaning it can do other work), the writing process can't complete; it has to wait until the buffer has been flushed and is free for the copy. So the relatively poor implementation (for database inserts at least) of the syncer mechanism will cost a lot of performance if we get to this synchronous write mode due to a full buffer. It appears this scenario is much more likely than I had thought. Do you not think this is a potential performance problem to be explored? I'm only pursuing this as hard as I am because I feel like it's deja vu all over again. I've done this before and found a huge improvement (12X to 20X for bulk inserts). I'm not necessarily expecting that level of improvement here but my gut tells me there is more here than seems obvious on the surface. > As far as this AIO conversation is concerned, I want to see someone come > up with some performance improvement that we can only do with AIO. > Unless I see it, I am not interested in pursuing this thread. If I come up with something via aio that helps, I'd be more than happy if someone else points out a non-aio way to accomplish the same thing. I'm by no means married to any particular solution; I care about getting problems solved. And I'll stop trying to sell anyone on aio. - Curtis
"Curtis Faith" <curtis@galtair.com> writes: > Do you not think this is a potential performance problem to be explored? I agree that there's a problem if the kernel runs short of buffer space. I am not sure whether that's really an issue in practical situations, nor whether we can do much about it at the application level if it is --- but by all means look for solutions if you are concerned. (This is, BTW, one of the reasons for discouraging people from pushing Postgres' shared buffer cache up to a large fraction of total RAM; starving the kernel of disk buffers is just plain not a good idea.) regards, tom lane
Bruce, Are there remarks along these lines in the performance tuning section of the docs? Based on what's coming out of this, it would seem that stressing the importance of leaving a notable (rule of thumb here?) amount of memory for general OS/kernel needs is warranted. Greg On Tue, 2002-10-08 at 09:50, Tom Lane wrote: > (This is, BTW, one of the reasons for discouraging people from pushing > Postgres' shared buffer cache up to a large fraction of total RAM; > starving the kernel of disk buffers is just plain not a good idea.)