Re: Proposed LogWriter Scheme, WAS: Potential Large Performance - Mailing list pgsql-hackers

From Curtis Faith
Subject Re: Proposed LogWriter Scheme, WAS: Potential Large Performance
Date
Msg-id DMEEJMCDOJAKPPFACMPMIEDGCEAA.curtis@galtair.com
In response to Re: Proposed LogWriter Scheme, WAS: Potential Large Performance  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Proposed LogWriter Scheme, WAS: Potential Large Performance  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Proposed LogWriter Scheme, WAS: Potential Large  (Greg Copeland <greg@CopelandConsulting.Net>)
List pgsql-hackers
> So, you are saying that we may get back aio confirmation quicker than if
> we issued our own write/fsync because the OS was able to slip our flush
> to disk in as part of someone else's or a general fsync?
> 
> I don't buy that because it is possible our write() gets in as part of
> someone else's fsync and our fsync becomes a no-op, meaning there aren't
> any dirty buffers for that file.  Isn't that also possible?

Separate out the two concepts:

1) Writing of incomplete transactions at the block level by a
background LogWriter. 

I think it doesn't matter whether the write is aio_write or
write, writing blocks when we get them should provide the benefit
I outlined.

Waiting till fsync could miss the opportunity to write before the
head passes the end of the last durable write, because the drive
buffers might empty, causing up to a full rotation's delay.
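
To make point 1 concrete, here is a minimal sketch of a writer that hands
each block to the OS as soon as it fills, so that only the partial tail
block is left to write at commit time. The LogWriter struct, the function
names, and the 8192-byte block size are all assumptions made up for
illustration; this is not PostgreSQL's actual code, and it ignores EINTR
and block alignment after a commit.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define LOG_BLOCK_SIZE 8192   /* assumed block size, for illustration only */

/* Hypothetical log writer state; not PostgreSQL's real structures. */
typedef struct {
    int    fd;                      /* log file descriptor */
    char   block[LOG_BLOCK_SIZE];   /* block currently being filled */
    size_t used;                    /* bytes filled in the current block */
    size_t blocks_written;          /* full blocks handed to the OS so far */
} LogWriter;

/* Write all of buf, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

void log_init(LogWriter *lw, int fd)
{
    assert(lw != NULL);
    lw->fd = fd;
    lw->used = 0;
    lw->blocks_written = 0;
}

/* Append a record. Every block that fills up is written immediately,
 * even though the transactions in it may not have committed yet. */
int log_append(LogWriter *lw, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = LOG_BLOCK_SIZE - lw->used;
        size_t n = len < room ? len : room;

        memcpy(lw->block + lw->used, data, n);
        lw->used += n;
        data += n;
        len -= n;

        if (lw->used == LOG_BLOCK_SIZE) {
            if (write_all(lw->fd, lw->block, LOG_BLOCK_SIZE) < 0)
                return -1;
            lw->blocks_written++;
            lw->used = 0;
        }
    }
    return 0;
}

/* At commit, only the partial tail block remains to write before fsync. */
int log_commit(LogWriter *lw)
{
    if (lw->used > 0) {
        if (write_all(lw->fd, lw->block, lw->used) < 0)
            return -1;
        lw->used = 0;
    }
    return fsync(lw->fd);
}
```

The same structure would work with aio_write in place of write; the point
is only that full blocks leave the process before anyone asks for an fsync.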

2) aio_write vs. normal write.

Since, as you and others have pointed out, aio_write and write are both
asynchronous, the issue becomes whether the copies to the file system
buffers happen synchronously or not.

This is not a big difference, but it seems to me that the OS might be
able to avoid some context switches by grouping the copying in the case
of aio_write. I've heard anecdotal reports that this is significantly
faster for some things, but I don't know for certain.
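
As an illustration of what "grouping" could look like at the API level,
POSIX lets several writes be queued with a single lio_listio() call. The
sketch below is an assumption-laden example, not a claim about what any
particular kernel does with the batch: write_blocks_batched and MAX_BATCH
are made-up names, and whether this actually saves context switches
depends on the platform's AIO implementation. Note also that, without
O_SYNC or O_DSYNC on the file, completion here should only mean the data
has reached the file system buffers, not the disk.

```c
#include <aio.h>
#include <assert.h>
#include <string.h>
#include <sys/types.h>

#define MAX_BATCH 16   /* arbitrary cap for this sketch */

/* Queue nblocks equally sized blocks at consecutive offsets with one
 * lio_listio() call and wait for all of them to complete.
 * Returns the total number of bytes written, or -1 on error. */
ssize_t write_blocks_batched(int fd, char *blocks[], int nblocks,
                             size_t block_size, off_t start)
{
    struct aiocb  cbs[MAX_BATCH];
    struct aiocb *list[MAX_BATCH];
    ssize_t       total = 0;

    assert(nblocks > 0 && nblocks <= MAX_BATCH);

    for (int i = 0; i < nblocks; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = blocks[i];
        cbs[i].aio_nbytes     = block_size;
        cbs[i].aio_offset     = start + (off_t) i * (off_t) block_size;
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
    }

    /* LIO_WAIT: return only after every request in the list has finished.
     * LIO_NOWAIT would return right after queuing instead. */
    if (lio_listio(LIO_WAIT, list, nblocks, NULL) < 0)
        return -1;

    for (int i = 0; i < nblocks; i++) {
        if (aio_error(&cbs[i]) != 0)
            return -1;
        total += aio_return(&cbs[i]);
    }
    return total;
}
```

On some systems this needs to be linked with -lrt.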

> 
> Also, remember the kernel doesn't know where the platter rotation is
> either. Only the SCSI drive can reorder the requests to match this. The
> OS can group based on head location, but it doesn't know much about the
> platter location, and it doesn't even know where the head is.

The kernel doesn't need to know anything about platter rotation. It
just needs to keep the disk write buffers full enough not to cause
a rotational latency.

It's not so much a matter of reordering as it is of getting the data
into the SCSI drive before the head passes the last write's position.
If the SCSI drive's buffers are kept full it can continue writing at
its full throughput. If the writes stop and the buffers empty,
it will need to wait up to a full rotation before it gets to the end
of the log again.
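
To put rough numbers on "up to a full rotation", the arithmetic can be
sketched as below. The function names are made up, the spindle speeds in
the usage note are illustrative rather than taken from this thread, and
the writes-per-second figure is only the worst case in which every write
misses its slot and waits a whole rotation.

```c
#include <assert.h>

/* Time for one full platter rotation, in milliseconds. */
double full_rotation_ms(double rpm)
{
    assert(rpm > 0.0);
    return 60.0 * 1000.0 / rpm;
}

/* Worst-case ceiling on sequential log writes per second if each write
 * has to wait one full rotation before the head reaches the log's end. */
double max_writes_per_sec_if_stalled(double rpm)
{
    return 1000.0 / full_rotation_ms(rpm);
}
```

For example, full_rotation_ms(10000.0) is 6 ms and
full_rotation_ms(15000.0) is 4 ms, so a 15,000 RPM drive that stalls on
every write would top out at 250 such writes per second.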

> Also, does aio return info when the data is in the kernel buffers or
> when it is actually on the disk?   
> 
> Simply, aio allows us to do the write and get notification when it is
> complete.  I don't see how that helps us, and I don't see any other
> advantages to aio.  To use aio, we need to find something that _can't_
> be solved with more traditional Unix API's, and I haven't seen that yet.
> 
> This aio thing is getting out of hand.  It's like we have a hammer, and
> everything looks like a nail, or a use for aio.

Yes, while I think it's probably worth doing and faster, it won't help as
much as just keeping the drive buffers full, even if that's by using write
calls.

I still don't understand the opposition to aio_write. Could we just have
the configuration setup determine whether one or the other is used? I 
don't see why we wouldn't use the faster calls if they were present and
reliable on a given system.
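
A rough sketch of what I mean by letting configuration decide: pick the
write routine once, based on a setting, and have the rest of the log
writer call through it. Everything named here (LogWriteMethod,
select_block_writer, and the two block_write_* functions) is hypothetical
and not an existing PostgreSQL option or API. The aio path simply waits
for completion, so it shows only the switch, not any performance gain.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical setting; not an actual PostgreSQL configuration option. */
typedef enum { LOG_WRITE_PLAIN, LOG_WRITE_AIO } LogWriteMethod;

typedef ssize_t (*BlockWriteFn)(int fd, const void *buf, size_t len, off_t off);

/* Traditional path: one positioned write call. */
static ssize_t block_write_plain(int fd, const void *buf, size_t len, off_t off)
{
    return pwrite(fd, buf, len, off);
}

/* AIO path: queue the write, then wait until it is no longer in progress. */
static ssize_t block_write_aio(int fd, const void *buf, size_t len, off_t off)
{
    struct aiocb        cb;
    const struct aiocb *wait_list[1] = { &cb };

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *) buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_write(&cb) < 0)
        return -1;

    /* A real log writer could do other work here; the sketch just waits. */
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(wait_list, 1, NULL);

    /* Returns the byte count, or -1 if the request failed. */
    return aio_return(&cb);
}

/* Choose the write routine once, from the configured method. */
BlockWriteFn select_block_writer(LogWriteMethod method)
{
    assert(method == LOG_WRITE_PLAIN || method == LOG_WRITE_AIO);
    return method == LOG_WRITE_AIO ? block_write_aio : block_write_plain;
}
```

A build-time check for a usable aio implementation could force the plain
path on systems where the calls are missing or unreliable.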

- Curtis

