Thread: Analysis of ganged WAL writes

Analysis of ganged WAL writes

From
Tom Lane
Date:
I do not think the situation for ganging of multiple commit-record
writes is quite as dire as has been painted.  There is a simple error
in the current code that is easily corrected: in XLogFlush(), the
wait to acquire WALWriteLock should occur before, not after, we try
to acquire WALInsertLock and advance our local copy of the write
request pointer.  (To be exact, xlog.c lines 1255-1269 in CVS tip
ought to be moved down to before line 1275, inside the "if" that
tests whether we are going to call XLogWrite.)

Given that change, what will happen during heavy commit activity
is like this:

1. Transaction A is ready to commit.  It calls XLogInsert to insert
its commit record into the WAL buffers (thereby transiently acquiring
WALInsertLock) and then it calls XLogFlush to write and sync the
log through the commit record.  XLogFlush acquires WALWriteLock and
begins issuing the needed I/O request(s).

2. Transaction B is ready to commit.  It gets through XLogInsert
and then blocks on WALWriteLock inside XLogFlush.

3. Transactions C, D, E likewise insert their commit records
and then block on WALWriteLock.

4. Eventually, transaction A finishes its I/O, advances the "known
flushed" pointer past its own commit record, and releases the
WALWriteLock.

5. Transaction B now acquires WALWriteLock.  Given the code change I
recommend, it will choose to flush the WAL *through the last queued
commit record as of this instant*, not the WAL endpoint as of when it
started to wait.  Therefore, this WAL write will handle all of the
so-far-queued commits.

6. More transactions F, G, H, ... arrive to be committed.  They will
likewise insert their COMMIT records into the buffer and block on
WALWriteLock.

7. When B finishes its write and releases WALWriteLock, it will have
set the "known flushed" pointer past E's commit record.  Therefore,
transactions C, D, E will fall through their tests without calling
XLogWrite at all.  When F gets the lock, it will conclude that it
should write the data queued up to that time, and so it will handle
the commit records for G, H, etc.  (The fact that lwlock.c will release
waiters in order of arrival is important here --- we want C, D, E to
get out of the queue before F decides it needs to write.)


It seems to me that this behavior will provide fairly effective
ganging of COMMIT flushes under load.  And it's self-tuning; no need
to fiddle with weird parameters like commit_siblings.  We automatically
gang as many COMMITs as arrive during the time it takes to write and
flush the previous gang of COMMITs.
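For illustration, here is a small self-contained pthreads sketch of that behavior (invented names, simulated I/O via usleep; a toy model, not the actual xlog.c code).  Each committer inserts its record under an "insert" lock and then contends for a "write" lock; whoever gets the write lock re-reads the insert endpoint and flushes everything queued so far, and the committers behind it find their records already flushed and skip the write entirely.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NCLIENTS 8

static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static long insert_end = 0;        /* log endpoint as of the last insert */
static long flushed_to = 0;        /* "known flushed" pointer */
static int physical_writes = 0;    /* simulated I/O operations issued */

static void *
committer(void *arg)
{
    long my_record;

    (void) arg;

    /* insert our commit record (transiently holding the insert lock) */
    pthread_mutex_lock(&insert_lock);
    my_record = ++insert_end;
    pthread_mutex_unlock(&insert_lock);

    /* flush the log through our commit record */
    pthread_mutex_lock(&write_lock);
    if (flushed_to < my_record)
    {
        long target;

        /* re-read the insert endpoint *after* getting the write lock,
         * so this one write covers everyone queued behind us as well */
        pthread_mutex_lock(&insert_lock);
        target = insert_end;
        pthread_mutex_unlock(&insert_lock);

        usleep(8000);          /* pretend one disk revolution (~8 ms) */
        flushed_to = target;
        physical_writes++;
    }
    pthread_mutex_unlock(&write_lock);
    return NULL;
}

int
main(void)
{
    pthread_t threads[NCLIENTS];
    int i;

    for (i = 0; i < NCLIENTS; i++)
        pthread_create(&threads[i], NULL, committer, NULL);
    for (i = 0; i < NCLIENTS; i++)
        pthread_join(threads[i], NULL);

    printf("%d commits flushed with %d physical writes\n",
           NCLIENTS, physical_writes);
    return 0;
}

Built with "gcc demo.c -lpthread", it typically reports far fewer physical writes than commits.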

Comments?
        regards, tom lane


Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
I said:
> There is a simple error
> in the current code that is easily corrected: in XLogFlush(), the
> wait to acquire WALWriteLock should occur before, not after, we try
> to acquire WALInsertLock and advance our local copy of the write
> request pointer.  (To be exact, xlog.c lines 1255-1269 in CVS tip
> ought to be moved down to before line 1275, inside the "if" that
> tests whether we are going to call XLogWrite.)

That patch was not quite right, as it didn't actually flush the
later-arriving data.   The correct patch is

*** src/backend/access/transam/xlog.c.orig    Thu Sep 26 18:58:33 2002
--- src/backend/access/transam/xlog.c    Sun Oct  6 18:45:57 2002
***************
*** 1252,1279 ****
      /* done already? */
      if (!XLByteLE(record, LogwrtResult.Flush))
      {
-         /* if something was added to log cache then try to flush this too */
-         if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
-         {
-             XLogCtlInsert *Insert = &XLogCtl->Insert;
-             uint32        freespace = INSERT_FREESPACE(Insert);
-
-             if (freespace < SizeOfXLogRecord)    /* buffer is full */
-                 WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-             else
-             {
-                 WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-                 WriteRqstPtr.xrecoff -= freespace;
-             }
-             LWLockRelease(WALInsertLock);
-         }
          /* now wait for the write lock */
          LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
          LogwrtResult = XLogCtl->Write.LogwrtResult;
          if (!XLByteLE(record, LogwrtResult.Flush))
          {
!             WriteRqst.Write = WriteRqstPtr;
!             WriteRqst.Flush = record;
              XLogWrite(WriteRqst);
          }
          LWLockRelease(WALWriteLock);
--- 1252,1284 ----
      /* done already? */
      if (!XLByteLE(record, LogwrtResult.Flush))
      {
          /* now wait for the write lock */
          LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
          LogwrtResult = XLogCtl->Write.LogwrtResult;
          if (!XLByteLE(record, LogwrtResult.Flush))
          {
!             /* try to write/flush later additions to XLOG as well */
!             if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
!             {
!                 XLogCtlInsert *Insert = &XLogCtl->Insert;
!                 uint32        freespace = INSERT_FREESPACE(Insert);
!
!                 if (freespace < SizeOfXLogRecord)    /* buffer is full */
!                     WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
!                 else
!                 {
!                     WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
!                     WriteRqstPtr.xrecoff -= freespace;
!                 }
!                 LWLockRelease(WALInsertLock);
!                 WriteRqst.Write = WriteRqstPtr;
!                 WriteRqst.Flush = WriteRqstPtr;
!             }
!             else
!             {
!                 WriteRqst.Write = WriteRqstPtr;
!                 WriteRqst.Flush = record;
!             }
              XLogWrite(WriteRqst);
          }
          LWLockRelease(WALWriteLock);


To test this, I made a modified version of pgbench in which each
transaction consists of a simple
    insert into table_NNN values(0);
where each client thread has a separate insertion target table.
This is about the simplest transaction I could think of that would
generate a WAL record each time.

Running this modified pgbench with postmaster parameters
    postmaster -i -N 120 -B 1000 --wal_buffers=250
and all other configuration settings at default, CVS tip code gives me
a pretty consistent 115-118 transactions per second for anywhere from
1 to 100 pgbench client threads.  This is exactly what I expected,
since the database (including WAL file) is on a 7200 RPM SCSI drive.
The theoretical maximum rate of sync'd writes to the WAL file is
therefore 120 per second (one per disk revolution), but we lose a little
because once in awhile the disk has to seek to a data file.

Inserting the above patch, and keeping all else the same, I get:

$ mybench -c 1 -t 10000 bench1
number of clients: 1
number of transactions per client: 10000
number of transactions actually processed: 10000/10000
tps = 116.694205 (including connections establishing)
tps = 116.722648 (excluding connections establishing)

$ mybench -c 5 -t 2000 -S -n bench1
number of clients: 5
number of transactions per client: 2000
number of transactions actually processed: 10000/10000
tps = 282.808341 (including connections establishing)
tps = 283.656898 (excluding connections establishing)

$ mybench -c 10 -t 1000 bench1
number of clients: 10
number of transactions per client: 1000
number of transactions actually processed: 10000/10000
tps = 443.131083 (including connections establishing)
tps = 447.406534 (excluding connections establishing)

$ mybench -c 50 -t 200 bench1
number of clients: 50
number of transactions per client: 200
number of transactions actually processed: 10000/10000
tps = 416.154173 (including connections establishing)
tps = 436.748642 (excluding connections establishing)

$ mybench -c 100 -t 100 bench1
number of clients: 100
number of transactions per client: 100
number of transactions actually processed: 10000/10000
tps = 336.449110 (including connections establishing)
tps = 405.174237 (excluding connections establishing)

CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
to <10% idle at 10 or more.
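As a rough sanity check, dividing those tps figures by the 120 revolutions per second the drive can do gives the implied number of commit records ganged into each WAL write (just arithmetic on the numbers above):

#include <stdio.h>

int
main(void)
{
    /* tps figures (excluding connection setup) from the runs above */
    const double rev_per_sec = 7200.0 / 60.0;    /* 120 sync'd writes/sec */
    const int clients[] = {1, 5, 10, 50, 100};
    const double tps[] = {116.7, 283.7, 447.4, 436.7, 405.2};
    int i;

    for (i = 0; i < 5; i++)
        printf("%3d clients: %6.1f tps  =>  %.2f commits per revolution\n",
               clients[i], tps[i], tps[i] / rev_per_sec);
    return 0;
}

That works out to roughly 1, 2.4, 3.7, 3.6, and 3.4 commits ganged per revolution.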

So this does seem to be a nice win, and unless I hear objections
I will apply it ...
        regards, tom lane


Re: Analysis of ganged WAL writes

From
Greg Copeland
Date:
On Sun, 2002-10-06 at 18:07, Tom Lane wrote:
>
> CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
> to <10% idle at 10 or more.
>
> So this does seem to be a nice win, and unless I hear objections
> I will apply it ...
>

Wow Tom!  That's wonderful!  On the other hand, maybe people needed the
extra idle CPU time that was provided by the unpatched code.  ;)

Greg


Re: Analysis of ganged WAL writes

From
Rod Taylor
Date:
On Sun, 2002-10-06 at 19:35, Greg Copeland wrote:
> On Sun, 2002-10-06 at 18:07, Tom Lane wrote:
> > 
> > CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
> > to <10% idle at 10 or more.
> > 
> > So this does seem to be a nice win, and unless I hear objections
> > I will apply it ...
> > 
> 
> Wow Tom!  That's wonderful!  On the other hand, maybe people needed the
> extra idle CPU time that was provided by the unpatched code.  ;)

Naw.  Distributed.net finally got through RC5-64.  Lots of CPU to spare
now.

--  Rod Taylor



Re: Analysis of ganged WAL writes

From
Hannu Krosing
Date:
Tom Lane wrote on Mon, 07.10.2002 at 01:07:
> 
> To test this, I made a modified version of pgbench in which each
> transaction consists of a simple
>     insert into table_NNN values(0);
> where each client thread has a separate insertion target table.
> This is about the simplest transaction I could think of that would
> generate a WAL record each time.
> 
> Running this modified pgbench with postmaster parameters
>     postmaster -i -N 120 -B 1000 --wal_buffers=250
> and all other configuration settings at default, CVS tip code gives me
> a pretty consistent 115-118 transactions per second for anywhere from
> 1 to 100 pgbench client threads.  This is exactly what I expected,
> since the database (including WAL file) is on a 7200 RPM SCSI drive.
> The theoretical maximum rate of sync'd writes to the WAL file is
> therefore 120 per second (one per disk revolution), but we lose a little
> because once in awhile the disk has to seek to a data file.
> 
> Inserting the above patch, and keeping all else the same, I get:
> 
> $ mybench -c 1 -t 10000 bench1
> number of clients: 1
> number of transactions per client: 10000
> number of transactions actually processed: 10000/10000
> tps = 116.694205 (including connections establishing)
> tps = 116.722648 (excluding connections establishing)
> 
> $ mybench -c 5 -t 2000 -S -n bench1
> number of clients: 5
> number of transactions per client: 2000
> number of transactions actually processed: 10000/10000
> tps = 282.808341 (including connections establishing)
> tps = 283.656898 (excluding connections establishing)

in an ideal world this would be 5*120=600 tps. 

Have you any good ideas what holds it back for the other 300 tps?

If it has CPU utilisation of only 50% then there must be still some
moderate lock contention. 

btw, what is the number for 1-5-10 clients with fsync off ? 

> $ mybench -c 10 -t 1000 bench1
> number of clients: 10
> number of transactions per client: 1000
> number of transactions actually processed: 10000/10000
> tps = 443.131083 (including connections establishing)
> tps = 447.406534 (excluding connections establishing)
> 
> CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients
> to <10% idle at 10 or more.
> 
> So this does seem to be a nice win, and unless I hear objections
> I will apply it ...

3x speedup is not just nice, it's great ;)

--------------
Hannu





Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> in an ideal world this would be 5*120=600 tps. 
> Have you any good ideas what holds it back for the other 300 tps?

Well, recall that the CPU usage was about 20% in the single-client test.
(The reason I needed a variant version of pgbench is that this machine
is too slow to do more than 120 TPC-B transactions per second anyway.)

That says that the best possible throughput on this test scenario is 5
transactions per disk rotation --- the CPU is just not capable of doing
more.  I am actually getting about 4 xact/rotation for 10 or more
clients (in fact it seems to reach that plateau at 8 clients, and be
close to it at 7).  I'm inclined to think that the fact that it's 4 not
5 is just a matter of "not quite there" --- there's some additional CPU
overhead due to lock contention, etc, and any slowdown at all will cause
it to miss making 5.  The 20% CPU figure was approximate to begin with,
anyway.

The other interesting question is why we're not able to saturate the
machine with only 4 or 5 clients.  I think pgbench itself is probably
to blame for that: it can't keep all its backend threads constantly
busy ... especially not when several of them report back transaction
completion at essentially the same instant, as will happen under
ganged-commit conditions.  There will be intervals where multiple
backends are waiting for pgbench to send a new command.  That delay
in starting a new command cycle is probably enough for them to "miss the
bus" of getting included in the next commit write.

That's just a guess though; I don't have tools that would let me see
exactly what's happening.  Anyone else want to reproduce the test on
a different system and see what it does?

> If it has CPU utilisation of only 50% then there must be still some
> moderate lock contention. 

No, that's I/O wait I think, forced by the quantization of the number
of transactions that get committed per rotation.

> btw, what is the number for 1-5-10 clients with fsync off ? 

About 640 tps at 1 and 5, trailing off to 615 at 10, and down to 450
at 100 clients (now that must be lock contention...)
        regards, tom lane


Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
I wrote:
> That says that the best possible throughput on this test scenario is 5
> transactions per disk rotation --- the CPU is just not capable of doing
> more.  I am actually getting about 4 xact/rotation for 10 or more
> clients (in fact it seems to reach that plateau at 8 clients, and be
> close to it at 7).

After further thought I understand why it takes 8 clients to reach full
throughput in this scenario.  Assume that we have enough CPU oomph so
that we can process four transactions, but not five, in the time needed
for one revolution of the WAL disk.  If we have five active clients
then the behavior will be like this:

1. Backend A becomes ready to commit.  It locks WALWriteLock and issues
a write/flush that will only cover its own commit record.  Assume that
it has to wait one full disk revolution for the write to complete (this
will be the steady-state situation).

2. While A is waiting, there is enough time for B, C, D, and E to run
their transactions and become ready to commit.  All eventually block on
WALWriteLock.

3. When A finishes its write and releases WALWriteLock, B will acquire
the lock and initiate a write that (with my patch) will cover C, D, and
E's commit records as well as its own.

4. While B is waiting for the disk to spin, A receives a new transaction
from its client, processes it, and becomes ready to commit.  It blocks
on WALWriteLock.

5. When B releases the lock, C, D, E acquire it and quickly fall
through, seeing that they need do no work.  Then A acquires the lock.
GOTO step 1.

So with five active threads, we alternate between committing one
transaction and four transactions on odd and even disk revolutions.

It's pretty easy to see that with six or seven active threads, we
will alternate between committing two or three transactions and
committing four.  Only when we get to eight threads do we have enough
backends to ensure that four transactions are available to commit on
every disk revolution.  This must be so because the backends that are
released at the end of any given disk revolution will not be able to
participate in the next group commit, if there is already at least
one backend ready to commit.
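For illustration, here is a toy standalone model of that steady state (the four-transactions-per-revolution CPU budget and the rule that only backends already queued at the start of a revolution get ganged into its write are taken from the description above; everything else is invented):

#include <stdio.h>

#define CPU_TX_PER_REV 4    /* transactions the CPU can prepare per revolution */
#define REVS 1000

int
main(void)
{
    int nclients;

    for (nclients = 2; nclients <= 10; nclients++)
    {
        int queued = 1;                /* waiting on WALWriteLock */
        int working = nclients - 1;    /* preparing their next commit */
        long total = 0;
        int rev;

        for (rev = 0; rev < REVS; rev++)
        {
            int commits = queued;    /* one ganged write per revolution */
            int ready = (working < CPU_TX_PER_REV) ? working : CPU_TX_PER_REV;

            total += commits;
            queued = ready;          /* these queue too late for this write */
            working = commits + (working - ready);
        }
        printf("%2d clients: %.2f commits/rev (~%.0f tps at 120 rev/sec)\n",
               nclients, (double) total / REVS, 120.0 * total / REVS);
    }
    return 0;
}

It settles at 2.5 commits/rev for five clients (alternating 1 and 4), 3.0 and 3.5 for six and seven, and 4.0 from eight clients up, which is broadly consistent with the tps numbers measured earlier.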

So this solution isn't perfect; it would still be nice to have a way to
delay initiation of the WAL write until "just before" the disk is ready
to accept it.  I dunno any good way to do that, though.

I went ahead and committed the patch for 7.3, since it's simple and does
offer some performance improvement.  But maybe we can think of something
better later on...
        regards, tom lane


Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
Tom, first of all, excellent job improving the current algorithm. I'm glad
you looked at the WALCommitLock code.

> This must be so because the backends that are
> released at the end of any given disk revolution will not be able to
> participate in the next group commit, if there is already at least
> one backend ready to commit.

This is the major reason for my original suggestion about using aio_write.
The writes don't block each other and there is no need for a kernel level
exclusive locking call like fsync or fdatasync.

Even the theoretical limit you mention of one transaction per revolution
per committing process seems like a significant bottleneck.

Is committing 1 and 4 transactions on every revolution good? It's certainly
better than 1 per revolution.

However, what if we could have done 3 transactions per process in the time
it took for a single revolution?

Then we are looking at (1 + 4)/ 2 = 2.5 transactions per revolution versus
the theoretical maximum of (3 * 5) = 15 transactions per revolution if we
can figure out a way to do non-blocking writes that we can guarantee are on
the disk platter so we can return from commit.

Setting aside whether or not aio is viable, do you not agree that
eliminating the blocking would result in potentially a 6X improvement for
the 5-process case?

>
> So this solution isn't perfect; it would still be nice to have a way to
> delay initiation of the WAL write until "just before" the disk is ready
> to accept it.  I dunno any good way to do that, though.

I still think that it would be much faster to just keep writing the WAL log
blocks when they fill up and have a separate process wake the committing
process when the write completes. This would eliminate WAL writing as a
bottleneck.

I have yet to hear anyone say that this can't be done, only that we might
not want to do it because the code might not be clean.

I'm generally only happy when I can finally remove a bottleneck completely,
but speeding one up by 3X like you have done is pretty damn cool for a day
or two's work.

- Curtis



Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
"Curtis Faith" <curtis@galtair.com> writes:
> Even the theoretical limit you mention of one transaction per revolution
> per committing process seems like a significant bottleneck.

Well, too bad.  If you haven't gotten your commit record down to disk,
then *you have not committed*.  This is not negotiable.  (If you think
it is, then turn off fsync and quit worrying ;-))

An application that is willing to have multiple transactions in flight
at the same time can open up multiple backend connections to issue those
transactions, and thereby perhaps beat the theoretical limit.  But for
serial transactions, there is not anything we can do to beat that limit.
(At least not with the log structure we have now.  One could imagine
dropping a commit record into the nearest one of multiple buckets that
are carefully scattered around the disk.  But exploiting that would take
near-perfect knowledge about disk head positioning; it's even harder to
solve than the problem we're considering now.)

> I still think that it would be much faster to just keep writing the WAL log
> blocks when they fill up and have a separate process wake the committing
> process when the write completes. This would eliminate WAL writing as a
> bottleneck.

You're failing to distinguish total throughput to the WAL drive from
response time seen by any one transaction.  Yes, a policy of writing
each WAL block once when it fills would maximize potential throughput,
but it would also mean a potentially very large delay for a transaction
waiting to commit.  The lower the system load, the worse the performance
on that scale.

The scheme we now have (with my recent patch) essentially says that the
commit delay seen by any one transaction is at most two disk rotations.
Unfortunately it's also at least one rotation :-(, except in the case
where there is no contention, ie, no already-scheduled WAL write when
the transaction reaches the commit stage.  It would be nice to be able
to say "at most one disk rotation" instead --- but I don't see how to
do that in the absence of detailed information about disk head position.

Something I was toying with this afternoon: assume we have a background
process responsible for all WAL writes --- not only filled buffers, but
the currently active buffer.  It periodically checks to see if there
are unwritten commit records in the active buffer, and if so schedules
a write for them.  If this could be done during each disk rotation,
"just before" the disk reaches the active WAL log block, we'd have an
ideal solution.  And it would not be too hard for such a process to
determine the right time: it could measure the drive rotational speed
by observing the completion times of successive writes to the same
sector, and it wouldn't take much logic to empirically find the latest
time at which a write can be issued and have a good probability of
hitting the disk on time.  (At least, this would work pretty well given
a dedicated WAL drive, else there'd be too much interference from other
I/O requests.)
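For what it's worth, the measurement part could be prototyped with something as simple as the standalone sketch below (the file path is just a placeholder, and it assumes the drive really waits for the media on each O_SYNC write):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/wal_drive/rotation_probe";
    char buf[512];
    struct timespec t0, t1;
    int fd, i;

    memset(buf, 0, sizeof(buf));
    fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* rewrite the same sector repeatedly; each write has to wait for the
     * platter to come around again, so the elapsed times should cluster
     * near one rotation */
    for (i = 0; i < 20; i++)
    {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        {
            perror("pwrite");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("write %2d: %.2f ms\n", i,
               (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_nsec - t0.tv_nsec) / 1.0e6);
    }
    close(fd);
    return 0;
}

On a drive that honors O_SYNC the printed deltas should hover around one rotation time (about 8.3 ms at 7200 RPM); on a drive that lies about write completion they will be much smaller, which is itself a useful thing to know.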

However, this whole scheme falls down on the same problem we've run into
before: user processes can't schedule themselves with millisecond
accuracy.  The writer process might be able to determine the ideal time
to wake up and make the check, but it can't get the Unix kernel to
dispatch it then, at least not on most Unixen.  The typical scheduling
slop is one time slice, which is comparable to if not more than the
disk rotation time.

ISTM aio_write only improves the picture if there's some magic in-kernel
processing that makes this same kind of judgment as to when to issue the
"ganged" write for real, and is able to do it on time because it's in
the kernel.  I haven't heard anything to make me think that that feature
actually exists.  AFAIK the kernel isn't much more enlightened about
physical head positions than we are.
        regards, tom lane


Re: Analysis of ganged WAL writes

From
Hannu Krosing
Date:
On Tue, 2002-10-08 at 00:12, Curtis Faith wrote:
> Tom, first of all, excellent job improving the current algorithm. I'm glad
> you looked at the WALCommitLock code.
> 
> > This must be so because the backends that are
> > released at the end of any given disk revolution will not be able to
> > participate in the next group commit, if there is already at least
> > one backend ready to commit.
> 
> This is the major reason for my original suggestion about using aio_write.
> The writes don't block each other and there is no need for a kernel level
> exclusive locking call like fsync or fdatasync.
> 
> Even the theoretical limit you mention of one transaction per revolution
> per committing process seems like a significant bottleneck.
> 
> Is committing 1 and 4 transactions on every revolution good? It's certainly
> better than 1 per revolution.

Of course committing all 5 at each rev would be better ;)

> However, what if we could have done 3 transactions per process in the time
> it took for a single revolution?

I may be missing something obvious, but I don't see a way to get more
than 1 trx/process/revolution, as each previous transaction in that
process must be written to disk before the next can start, and the only
way it can be written to the disk is when the disk heads are on the
right place and that happens exactly once per revolution. 

In theory we could devise some clever page interleave scheme that would
allow us to go like this: fill one page - write page to disk, commit
trx's - fill the page in next 1/3 of rev - write next page to disk ... ,
but this will work only for some limited set of WAL page sizes.

It could be possible to get near 5 trx/rev for 5 backends if we do the
following (A-E are backends from Tom's explanation):

1. write the page for A's trx to its proper pos P (where P is the page
number)

2. if, after the sync for A returns, we already have more transactions
waiting for write()+sync() of the same page, immediately write the
_same_ page to pos P+N (where N is a tunable parameter). If N is small
enough then P+N will be on the same cylinder for most cases and thus
will get transactions B-E also committed on the same rev.

3. make sure that the last version will also be written to its proper
place before the end of the log overwrites P+N. (This may be tricky.)

4. When restoring from WAL, always check for a page at EndPos+N for a
possible newer version of last page.

This scheme requires page numbers+page versions to be stored in each
page and could get us near 1 trx/backend/rev performance, but it's hard
to tell if it is really useful in real life.

This could also possibly be extended to more than one "end page" and
more than one "continuation end page copy" to get better than 1
trx/backend/rev.
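Just to make the bookkeeping concrete, a hypothetical layout for such pages might look like the sketch below (invented names and sizes; the real XLOG page header is different).  Recovery would read both the copy at its proper position P and the candidate at P+N, and keep whichever carries the higher version for that page number.

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t pageno;    /* which WAL page this is a copy of */
    uint32_t version;   /* bumped each time the page is rewritten */
    char     data[8192 - 2 * sizeof(uint32_t)];
} WalPageCopy;

/* given the copy found at position P and a candidate found at P+N,
 * keep whichever is the newer version of the same page */
const WalPageCopy *
newer_copy(const WalPageCopy *at_p, const WalPageCopy *at_p_plus_n)
{
    if (at_p_plus_n->pageno == at_p->pageno &&
        at_p_plus_n->version > at_p->version)
        return at_p_plus_n;
    return at_p;
}

int
main(void)
{
    WalPageCopy p = {42, 1, {0}};          /* copy at its proper position */
    WalPageCopy p_plus_n = {42, 3, {0}};   /* later copy written at P+N */

    printf("recovery keeps version %u of page %u\n",
           (unsigned) newer_copy(&p, &p_plus_n)->version,
           (unsigned) p.pageno);
    return 0;
}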

-----------------
Hannu




Re: Analysis of ganged WAL writes

From
Hannu Krosing
Date:
On Tue, 2002-10-08 at 01:27, Tom Lane wrote:
>
> The scheme we now have (with my recent patch) essentially says that the
> commit delay seen by any one transaction is at most two disk rotations.
> Unfortunately it's also at least one rotation :-(, except in the case
> where there is no contention, ie, no already-scheduled WAL write when
> the transaction reaches the commit stage.  It would be nice to be able
> to say "at most one disk rotation" instead --- but I don't see how to
> do that in the absence of detailed information about disk head position.
> 
> Something I was toying with this afternoon: assume we have a background
> process responsible for all WAL writes --- not only filled buffers, but
> the currently active buffer.  It periodically checks to see if there
> are unwritten commit records in the active buffer, and if so schedules
> a write for them.  If this could be done during each disk rotation,
> "just before" the disk reaches the active WAL log block, we'd have an
> ideal solution.  And it would not be too hard for such a process to
> determine the right time: it could measure the drive rotational speed
> by observing the completion times of successive writes to the same
> sector, and it wouldn't take much logic to empirically find the latest
> time at which a write can be issued and have a good probability of
> hitting the disk on time.  (At least, this would work pretty well given
> a dedicated WAL drive, else there'd be too much interference from other
> I/O requests.)
> 
> However, this whole scheme falls down on the same problem we've run into
> before: user processes can't schedule themselves with millisecond
> accuracy.  The writer process might be able to determine the ideal time
> to wake up and make the check, but it can't get the Unix kernel to
> dispatch it then, at least not on most Unixen.  The typical scheduling
> slop is one time slice, which is comparable to if not more than the
> disk rotation time.

The standard for Linux has been a 100Hz time slice, but it has been
configurable for some time.

The latest RedHat (8.0) is built with 500Hz, which makes about 4
slices/rev for 7200 rpm disks (2 for 15000 rpm).
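A quick back-of-the-envelope check of those numbers (nothing more than arithmetic):

#include <stdio.h>

int
main(void)
{
    const double rpms[] = {7200.0, 15000.0};
    const double hzs[] = {100.0, 500.0};
    int i, j;

    for (i = 0; i < 2; i++)
        for (j = 0; j < 2; j++)
        {
            double rev_ms = 60000.0 / rpms[i];    /* one disk revolution */
            double slice_ms = 1000.0 / hzs[j];    /* one scheduler tick */

            printf("%5.0f rpm, HZ=%3.0f: %.1f ms/rev, %.1f ms/slice -> %.1f slices/rev\n",
                   rpms[i], hzs[j], rev_ms, slice_ms, rev_ms / slice_ms);
        }
    return 0;
}

At HZ=500 that gives about 4.2 scheduler ticks per revolution of a 7200 rpm disk and 2 for a 15000 rpm one, matching the figures above.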
> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel.  I haven't heard anything to make me think that that feature
> actually exists.  AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

At least for open source kernels it could be possible to

1. write a patch to kernel 

or 

2. get the authors of kernel aio interested in doing it.

or

3. the third possibility would be using some real-time (RT) OS or a mixed
RT/conventional OS where some threads can be scheduled as hard-RT.
In an RT OS you are supposed to be able to do exactly what you describe.


I think that 2 and 3 could be "outsourced" (the respective developers
talked into supporting it), as both KAIO and RT Linuxen/BSDs are probably
also interested in high-profile applications, so they could boast that
"using our stuff enabled the PostgreSQL database to run twice as fast".

Anyway, getting to near-hardware speeds for a database will need more
specific support from the OS than web browsing or compiling does.

---------------
Hannu









Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> Well, too bad.  If you haven't gotten your commit record down to disk,
> then *you have not committed*.  This is not negotiable.  (If you think
> it is, then turn off fsync and quit worrying ;-))

I've never disputed this, so if I seem to be suggesting that, I've been
unclear. I'm just assuming that the disk can get a confirmation back to the
INSERTing process in much less than one rotation. This would allow that
process to start working again, perhaps in time to complete another
transaction.

> An application that is willing to have multiple transactions in flight
> at the same time can open up multiple backend connections to issue those
> transactions, and thereby perhaps beat the theoretical limit.  But for
> serial transactions, there is not anything we can do to beat that limit.
> (At least not with the log structure we have now.  One could imagine
> dropping a commit record into the nearest one of multiple buckets that
> are carefully scattered around the disk.  But exploiting that would take
> near-perfect knowledge about disk head positioning; it's even harder to
> solve than the problem we're considering now.)

Consider the following scenario:

Time measured in disk rotations.

Time 1.00 - Process A Commits - Causing aio_write to log and wait
Time 1.03 - aio_completes for Process A write - wakes process A
Time 1.05 - Process A Starts another transaction.
Time 1.08 - Process A Commits
etc.

I agree that a process can't proceed from a commit until it receives
confirmation of the write, but if the write has hit the disk before a full
rotation then the process should be able to continue processing new
transactions

> You're failing to distinguish total throughput to the WAL drive from
> response time seen by any one transaction.  Yes, a policy of writing
> each WAL block once when it fills would maximize potential throughput,
> but it would also mean a potentially very large delay for a transaction
> waiting to commit.  The lower the system load, the worse the performance
> on that scale.

You are assuming fsync or fdatasync behavior; I am not. There would be no
delay under the scenario I describe. The transaction would exit commit as
soon as the confirmation of the write is received from the aio system. I
would hope that with a decent aio implementation this would generally be
much less than one rotation.

I think that the single transaction response time is very important and
that's one of the chief problems I sought to solve when I proposed
aio_writes for logging in my original email many moons ago.

> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel.  I haven't heard anything to make me think that that feature
> actually exists.  AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

All aio_write has to do is pass the write off to the device as soon as
aio_write gets it, bypassing the system buffers. The code on the disk's
hardware is very good at knowing when the disk head is coming. IMHO,
bypassing the kernel's less than enlightened writing system is the main
point of using aio_write.

- Curtis



Re: Analysis of ganged WAL writes

From
Greg Copeland
Date:
On Mon, 2002-10-07 at 16:06, Curtis Faith wrote:
> > Well, too bad.  If you haven't gotten your commit record down to disk,
> > then *you have not committed*.  This is not negotiable.  (If you think
> > it is, then turn off fsync and quit worrying ;-))
>

At this point, I think we've come full circle.  Can we all agree that
this concept is a *potential* source of improvement in a variety of
situations?  If we can agree on that, perhaps we should move to the next
stage in the process, validation?

How long do you think it would take to develop something worthy of
testing?  Do we have known test cases which will properly (in)validate
the approach that everyone will agree to?  If code is reasonably clean
so as to pass the smell test and shows a notable performance boost, will
it be seriously considered for inclusion?  If so, I assume it would
become a configure option (--with-aio)?


Regards,
Greg


Re: Analysis of ganged WAL writes

From
Justin Clift
Date:
Greg Copeland wrote:
<snip>
> If so, I assume it would become a configure option (--with-aio)?

Or maybe a GUC "use_aio" ?

:-)

Regards and best wishes,

Justin Clift

> 
> Regards,
> 
>         Greg
> 

-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."  - Indira Gandhi


Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
"Curtis Faith" <curtis@galtair.com> writes:
>> Well, too bad.  If you haven't gotten your commit record down to disk,
>> then *you have not committed*.  This is not negotiable.  (If you think
>> it is, then turn off fsync and quit worrying ;-))

> I've never disputed this, so if I seem to be suggesting that, I've beee
> unclear. I'm just assuming that the disk can get a confirmation back to the
> INSERTing process in much less than one rotation.

You've spent way too much time working with lying IDE drives :-(

Do you really trust a confirm-before-write drive to make that write if
it loses power?  I sure don't.

If you do trust your drive to hold that data across a crash, then ISTM
the whole problem goes away anyway, as writes will "complete" quite
independently of disk rotation.  My Linux box has no problem claiming
that it's completing several thousand TPS with a single client ... and
yes, fsync is on, but it's using an IDE drive, and I don't know how to
disable confirm-before-write on that drive.  (That's why I did these
tests on my old slow HP hardware.)  Basically, the ganging of commit
writes happens inside the disk controller on a setup like that.  You
still don't need aio_write --- unless perhaps to reduce wastage of IDE
bus bandwidth by repeated writes, but that doesn't seem to be a scarce
resource in this context.
        regards, tom lane


Re: Analysis of ganged WAL writes

From
Greg Copeland
Date:
Well, I was thinking that aio may not be available on all platforms,
thus the conditional compile option.   On the other hand, wouldn't you
pretty much want it either on or off for all instances?  I can see that
it would be nice for testing though.  ;)

Greg

On Mon, 2002-10-07 at 16:23, Justin Clift wrote:
> Greg Copeland wrote:
> <snip>
> > If so, I assume it would become a configure option (--with-aio)?
>
> Or maybe a GUC "use_aio" ?
>
> :-)
>
> Regards and best wishes,
>
> Justin Clift
>
> >
> > Regards,
> >
> >         Greg
> >
>
> --
> "My grandfather once told me that there are two kinds of people: those
> who work and those who take the credit. He told me to try to be in the
> first group; there was less competition there."
>    - Indira Gandhi


Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> I may be missing something obvious, but I don't see a way to get more
> than 1 trx/process/revolution, as each previous transaction in that
> process must be written to disk before the next can start, and the only
> way it can be written to the disk is when the disk heads are on the
> right place and that happens exactly once per revolution.

Okay, consider the following scenario.

1) Process A commits when the platter is at 0 degrees.
2) There are enough XLog writes from other processes to fill 1/4 platter
rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit
record and keeps writing other log entries as the head rotates.
3) Process A receives a confirmation of the write before the platter
rotates 60 degrees.
4) Process A continues and adds another commit before the platter rotates
to 90 degrees.

This should be very possible and more and more likely in the future as CPUs
get faster and faster relative to disks.

I'm not suggesting this would happen all the time, just that it's possible
and that an SMP machine with good CPUs and a fast I/O subsystem should be
able to keep the log writing at close to I/O bandwidth limits.

The case of bulk inserts is one where I would expect that for simple tables
we should be able to peg the disks given today's hardware and enough
inserting processes.

- Curtis



Re: Analysis of ganged WAL writes

From
Hannu Krosing
Date:
Curtis Faith wrote on Tue, 08.10.2002 at 01:04:
> > I may be missing something obvious, but I don't see a way to get more
> > than 1 trx/process/revolution, as each previous transaction in that
> > process must be written to disk before the next can start, and the only
> > way it can be written to the disk is when the disk heads are on the
> > right place and that happens exactly once per revolution.
> 
> Okay, consider the following scenario.
> 
> 1) Process A commits when the platter is at 0 degrees.
> 2) There are enough XLog writes from other processes to fill 1/4 platter
> rotation worth of log or 90 degrees. The SCSI drive writes the XLog commit
> record and keeps writing other log entries as the head rotates.
> 3) Process A receives a confirmation of the write before the platter
> rotates 60 degrees.
> 4) Process A continues and adds another commit before the platter rotates
> to 90 degrees.

For this scheme to work there are _very_ specific conditions to be met
(i.e. the number of writing processes and the size of WAL records have to
meet very strict criteria).

I'd suspect that this will not work for 95% of cases.

> This should be very possible and more and more likely in the future as CPUs
> get faster and faster relative to disks.

Your example of >1 trx/proc/rev will work _only_ if no more and no less
than 1/4 of the platter is filled by _other_ log writers.

> I'm not suggesting this would happen all the time, just that it's possible
> and that an SMP machine with good CPUs and a fast I/O subsystem should be
> able to keep the log writing at close to I/O bandwidth limits.

Keeping log writing parallel and close to I/O bandwidth limits is the
real key issue there. 

Tom's test case of fast inserting of small records by a relatively small
number of processes (forcing repeated writes of the same page) also seems
like a borderline case.

> The case of bulk inserts is one where I would expect that for simple tables
> we should be able to peg the disks given today's hardware and enough
> inserting processes.

bulk inserts should probably be chunked at a higher level by inserting
several records inside a single transaction.

-----------
Hannu



Re: Analysis of ganged WAL writes

From
"Zeugswetter Andreas SB SD"
Date:
> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel.  I haven't heard anything to make me think that that feature
> actually exists.  AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

Can the magic be, that kaio directly writes from user space memory to the
disk ? Since in your case all transactions A-E want the same buffer written,
the memory (not its content) will also be the same. This would automatically
write the latest possible version of our WAL buffer to disk.

The problem I can see offhand is how the kaio system can tell which transaction
can be safely notified of the write, or whether the programmer is actually responsible
for not changing the buffer until notified of completion ?

Andreas


Re: Analysis of ganged WAL writes

From
Greg Copeland
Date:
On Tue, 2002-10-08 at 04:15, Zeugswetter Andreas SB SD wrote:
> Can the magic be, that kaio directly writes from user space memory to the
> disk ? Since in your case all transactions A-E want the same buffer written,
> the memory (not its content) will also be the same. This would automatically
> write the latest possible version of our WAL buffer to disk.
>

*Some* implementations allow for zero-copy aio.  That is a savings.  On
heavily used systems, it can be a large savings.

> The problem I can see offhand is how the kaio system can tell which transaction
> can be safely notified of the write, or whether the programmer is actually responsible
> for not changing the buffer until notified of completion ?

That's correct.  The programmer can not change the buffer contents until
notification has completed for that outstanding aio operation.  To do
otherwise results in undefined behavior.  Since some systems do allow
for zero-copy aio operations, requiring that the buffers not be modified
once queued makes a lot of sense.  Of course, even on systems that don't
support zero-copy, changing the buffered data prior to write completion
just seems like a bad idea to me.

Here's a quote from SGI's aio_write man page:
If the buffer pointed to by aiocbp->aio_buf or the control block pointed
to by aiocbp changes or becomes an illegal address prior to asynchronous
I/O completion then the behavior is undefined.  Simultaneous synchronous
operations using the same aiocbp produce undefined results.

And on SunOS we have:

    The aiocbp argument points to an aiocb structure.  If the buffer
    pointed to by aiocbp->aio_buf or the control block pointed to by
    aiocbp becomes an illegal address prior to asynchronous I/O
    completion, then the behavior is undefined.

and

    For any system action that changes the process memory space while
    an asynchronous I/O is outstanding to the address range being
    changed, the result of that action is undefined.
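To make the rule concrete, here is a minimal, self-contained POSIX aio sketch (the file name is just a placeholder): the buffer is left untouched between aio_write() and the completion check, exactly as the man pages require.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *path = "aio_demo.dat";    /* placeholder scratch file */
    static char buf[8192];
    struct aiocb cb;
    const struct aiocb *list[1];
    int fd;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0)
    {
        perror("aio_write");
        return 1;
    }

    /* buf and cb must NOT be modified from here until completion */

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);    /* block until the request finishes */

    if (aio_return(&cb) != (ssize_t) sizeof(buf))
    {
        perror("aio_return");
        return 1;
    }

    /* only now is it safe to reuse buf or cb */
    close(fd);
    return 0;
}

Compile with -lrt on most systems.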


Greg


Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> Can the magic be, that kaio directly writes from user space memory to the 
> disk ?

This makes more assumptions about the disk drive's behavior than I think
are justified...

> Since in your case all transactions A-E want the same buffer written,
> the memory (not its content) will also be the same.

But no, it won't: the successive writes will ask to write different
snapshots of the same buffer.

> The problem I can see offhand is how the kaio system can tell which
> transaction can be safely notified of the write,

Yup, exactly.  Whose snapshot made it down to (stable) disk storage?
        regards, tom lane


Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> Your example of >1 trx/proc/rev will work _only_ if no more and no less
> than 1/4 of the platter is filled by _other_ log writers.

Not really; if 1/2 the platter has been filled we'll still get one more
commit in for a given rotation. If more than a rotation's worth of writing
has occurred, that means we are bumping into the limit of disk I/O, and
that is a limit we can't do anything about without doing interleaved log
files.

> > The case of bulk inserts is one where I would expect that for simple
> > tables we should be able to peg the disks given today's hardware and
> > enough inserting processes.
>
> bulk inserts should probably be chunked at a higher level by inserting
> several records inside a single transaction.

Agreed, that's much more efficient. There are plenty of situations where
the inserts and updates are ongoing rather than initial, Shridhar's
real-world test or TPC benchmarks, for example.

- Curtis



Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> > Since in your case all transactions A-E want the same buffer written,
> > the memory (not it's content) will also be the same.
>
> But no, it won't: the successive writes will ask to write different
> snapshots of the same buffer.

Successive writes would write different NON-OVERLAPPING sections of the
same log buffer. It wouldn't make sense to send three separate copies of
the entire block. That could indeed cause problems.

If a separate log writing process was doing all the writing, it could
pre-gang the writes. However, I'm not sure this is necessary. I'll test the
simpler way first.

> > The problem I can see offhand is how the kaio system can tell which
> > transaction can be safely notified of the write,
>
> Yup, exactly.  Whose snapshot made it down to (stable) disk storage?

If you do as above, it can inform the transactions when the blocks get
written to disk since there are no inconsistent writes. If transactions A,
B and C had commits in block 1023, and the aio system writes that block to
the disk, it can notify all three that their transaction write is complete
when that block (or partial block) is written to disk.

If transaction C's write didn't make it into the buffer, I've got to assume
the aio system or the disk cache logic can handle realizing that it didn't
queue that write and therefore not inform transaction C of a completion.

- Curtis



Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
"Curtis Faith" <curtis@galtair.com> writes:
> Successive writes would write different NON-OVERLAPPING sections of the
> same log buffer. It wouldn't make sense to send three separate copies of
> the entire block. That could indeed cause problems.

So you're going to undo the code's present property that all writes are
block-sized?  Aren't you worried about incurring page-in reads because
the kernel can't know that we don't care about data beyond what we've
written so far in the block?
        regards, tom lane


Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> "Curtis Faith" <curtis@galtair.com> writes:
> > Successive writes would write different NON-OVERLAPPING sections of the
> > same log buffer. It wouldn't make sense to send three separate copies of
> > the entire block. That could indeed cause problems.
>
> So you're going to undo the code's present property that all writes are
> block-sized?  Aren't you worried about incurring page-in reads because
> the kernel can't know that we don't care about data beyond what we've
> written so far in the block?

Yes, I'll try undoing the current behavior.

I'm not really worried about doing page-in reads because the disk's internal
buffers should contain most of the blocks surrounding the end of the log
file. If the successive partial writes exceed a block (which they will in
heavy use) then most of the time this won't be a problem anyway since the
disk will gang the full blocks before writing.

If the inserts are not coming fast enough to fill the log then the disk's
cache should contain the data from the last time that block (or the block
before) was written. Disks have become pretty good at this sort of thing
since writing sequentially is a very common scenario.

It may not work, but one doesn't make significant progress without trying
things that might not work.

If it doesn't work, then I'll make certain that commit log records always
fill the buffer they are written to, with variable length commit records
and something to identify the size of the padding used to fill the rest of
the block.



Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
"Curtis Faith" <curtis@galtair.com> writes:
> I'm not really worried about doing page-in reads because the disk's internal
> buffers should contain most of the blocks surrounding the end of the log
> file. If the successive partial writes exceed a block (which they will in
> heavy use) then most of the time this won't be a problem anyway since the
> disk will gang the full blocks before writing.

You seem to be willing to make quite a large number of assumptions about
what the disk hardware will do or not do.  I trust you're going to test
your results on a wide range of hardware before claiming they have any
general validity ...
        regards, tom lane


Re: Analysis of ganged WAL writes

From
"Curtis Faith"
Date:
> "Curtis Faith" <curtis@galtair.com> writes:
> > I'm not really worried about doing page-in reads because the disk's
> > internal buffers should contain most of the blocks surrounding the end
> > of the log file. If the successive partial writes exceed a block (which
> > they will in heavy use) then most of the time this won't be a problem
> > anyway since the disk will gang the full blocks before writing.
>
> You seem to be willing to make quite a large number of assumptions about
> what the disk hardware will do or not do.  I trust you're going to test
> your results on a wide range of hardware before claiming they have any
> general validity ...

Tom, I'm actually an empiricist. I trust nothing that I haven't tested or
read the code for myself. I've found too many instances of bugs or poor
implementations in the O/S to believe without testing.

On the other hand, one has to make some assumptions in order to devise
useful tests.

I'm not necessarily expecting that I'll come up with something that will
help everyone all the time. I'm just hoping that I can come up with
something that will help those using modern hardware, most of the time.

Even if it works, this will probably become one of those flags that need to
be tested as part of the performance analysis for any given system. Or
perhaps ideally, I'll come up with a LogDiskTester that simulates log
output and determines the best settings to use for a given disk and O/S,
optimized for size or speed, heavy inserts, etc.



Re: Analysis of ganged WAL writes

From
"Zeugswetter Andreas SB SD"
Date:
> > Can the magic be, that kaio directly writes from user space memory to the
> > disk ?
>
> This makes more assumptions about the disk drive's behavior than I think
> are justified...

No, no assumption about the drive, only about the kaio implementation:
namely, that the kaio implementation reads the user memory at the latest
possible time (for the implementation).

> > Since in your case all transactions A-E want the same buffer written,
> > the memory (not its content) will also be the same.
>
> But no, it won't: the successive writes will ask to write different
> snapshots of the same buffer.

That is what I meant, yes. I should have said "memory location".

> > The problem I can see offhand is how the kaio system can tell which
> > transaction can be safely notified of the write,
>
> Yup, exactly.  Whose snapshot made it down to (stable) disk storage?

That is something the kaio implementation could probably know in our
scenario, where we IIRC only append new records inside the page (don't
modify anything up front). The worst thing that can happen is that the kaio
is too pessimistic about what is already on disk (memory already written,
but aio call not yet done).

Andreas


Re: Analysis of ganged WAL writes

From
Mats Lofkvist
Date:
tgl@sss.pgh.pa.us (Tom Lane) writes:

[snip]
> 
> So this does seem to be a nice win, and unless I hear objections
> I will apply it ...
> 

It does indeed look like a great improvement, so is the fix
going to be merged to the 7.3 branch or is it too late for that?
Mats Lofkvist
mal@algonet.se


Re: Analysis of ganged WAL writes

From
Tom Lane
Date:
Mats Lofkvist <mal@algonet.se> writes:
> It does indeed look like a great improvement, so is the fix
> going to be merged to the 7.3 branch or is it too late for that?

Yes, been there done that ...
        regards, tom lane