Re: Proposed LogWriter Scheme, WAS: Potential Large Performance - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Proposed LogWriter Scheme, WAS: Potential Large Performance
Msg-id 200210052006.g95K6Tm25271@candle.pha.pa.us
In response to Re: Proposed LogWriter Scheme, WAS: Potential Large Performance  ("Curtis Faith" <curtis@galtair.com>)
Responses Re: Proposed LogWriter Scheme, WAS: Potential Large Performance  ("Curtis Faith" <curtis@galtair.com>)
List pgsql-hackers
Curtis Faith wrote:
> > So, you are saying that we may get back aio confirmation quicker than if
> > we issued our own write/fsync because the OS was able to slip our flush
> > to disk in as part of someone else's or a general fsync?
> > 
> > I don't buy that because it is possible our write() gets in as part of
> > someone else's fsync and our fsync becomes a no-op, meaning there aren't
> > any dirty buffers for that file.  Isn't that also possible?
> 
> Separate out the two concepts:
> 
> 1) Writing of incomplete transactions at the block level by a
> background LogWriter. 
> 
> I think it doesn't matter whether the write is aio_write or
> write, writing blocks when we get them should provide the benefit
> I outlined.
> 
> Waiting till fsync could miss the opportunity to write before the 
> head passes the end of the last durable write because the drive
> buffers might empty causing up to a full rotation's delay.

No question about that!  The sooner we can get stuff to the WAL buffers,
the more likely it is that some other transaction will do our fsync work
for us.  Any ideas on how we can do that?
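
To make the "someone else does our fsync" case concrete, here is a minimal single-process sketch. It is not PostgreSQL's WAL code; the class `WalSketch` and its fields `written_upto` and `flushed_upto` are hypothetical names invented for illustration. Records are handed to the kernel with write() as soon as they exist, and a commit skips fsync() when an earlier fsync() already covered its record.

```python
import os
import tempfile


class WalSketch:
    """Toy log: write() records early, fsync() only when a commit needs it."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.written_upto = 0   # bytes handed to the kernel via write()
        self.flushed_upto = 0   # bytes covered by the most recent fsync()
        self.fsync_calls = 0

    def append(self, record):
        """Write the record immediately and return its end position."""
        view = memoryview(record)
        while len(view) > 0:
            n = os.write(self.fd, view)  # may be a partial write
            view = view[n:]
        self.written_upto += len(record)
        return self.written_upto

    def commit(self, end_pos):
        """Make the log durable up to end_pos; return True if we fsync'ed."""
        if self.flushed_upto >= end_pos:
            return False  # an earlier fsync already covered this record
        target = self.written_upto
        os.fsync(self.fd)
        self.flushed_upto = target
        self.fsync_calls += 1
        return True

    def close(self):
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        wal = WalSketch(os.path.join(d, "wal.log"))
        a = wal.append(b"txn A commit\n")
        b = wal.append(b"txn B commit\n")
        did_b = wal.commit(b)  # B flushes, which also covers A's record
        did_a = wal.commit(a)  # A finds its record already flushed
        wal.close()
        return did_b, did_a, wal.fsync_calls
```

In this sketch the second commit costs no fsync() at all, because A's record was already written when B flushed. A real implementation would also need locking between backends, which is left out here.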

> 2) aio_write vs. normal write.
> 
> Since as you and others have pointed out aio_write and write are both
> asynchronous, the issue becomes one of whether or not the copies to the
> file system buffers happen synchronously or not.
> 
> This is not a big difference but it seems to me that the OS might be
> able to avoid some context switches by grouping copying in the case
> of aio_write. I've heard anecdotal reports that this is significantly
> faster for some things but I don't know for certain.

I suppose it is possible, but because we spend so much time in fsync,
that is where we want to focus.  People have recommended mmap of the WAL
file, and that seems like a much more direct way to handle it than aio.
However, we can't control when the data gets sent to disk with an
mmap'ed WAL.  Or rather, we can't write to the mapping and still
withhold writes to the disk file, so we would need some intermediate
step, and extra steps slow things down too.
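
One way to picture the intermediate step is the sketch below. It is an assumption about what such a step could look like, not a proposal from this thread, and `StagedMmapLog`, `append` and `publish` are hypothetical names. Records sit in private memory, where the kernel cannot write them to the file, and are copied into the shared mapping only when we decide they may reach disk. The extra copy is exactly the added cost mentioned above.

```python
import mmap
import os
import tempfile


class StagedMmapLog:
    """Toy mmap'ed log with a private staging buffer in front of the mapping."""

    def __init__(self, path, size=4096):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, size)          # file must be as large as the mapping
        self.map = mmap.mmap(self.fd, size)  # shared, writable mapping of the file
        self.staged = bytearray()            # private memory, not backed by the file
        self.published = 0                   # bytes already copied into the mapping

    def append(self, record):
        """Stage a record; nothing touches the mapping or the file yet."""
        self.staged += record

    def publish(self):
        """Copy staged bytes into the mapping, then ask for them to be flushed."""
        end = self.published + len(self.staged)
        if end > len(self.map):
            raise ValueError("log segment full")
        self.map[self.published:end] = bytes(self.staged)
        self.map.flush()
        self.published = end
        self.staged.clear()

    def close(self):
        self.map.close()
        os.close(self.fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.map")
        log = StagedMmapLog(path)
        log.append(b"txn A")
        with open(path, "rb") as f:
            before = f.read(5)   # file still holds zero bytes
        log.publish()
        with open(path, "rb") as f:
            after = f.read(5)    # record is now in the file
        log.close()
        return before, after
```

Once bytes are stored into the mapping itself, the kernel is free to write those pages back whenever it likes. That is why the staging buffer is needed if writes must be withheld.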


> > This aio thing is getting out of hand.  It's like we have a hammer, and
> > everything looks like a nail, or a use for aio.
> 
> Yes, while I think it's probably worth doing and faster, it won't help as
> much as just keeping the drive buffers full even if that's by using write
> calls.

> I still don't understand the opposition to aio_write. Could we just have
> the configuration setup determine whether one or the other is used? I 
> don't see why we wouldn't use the faster calls if they were present and
> reliable on a given system.

We hesitate to add code relying on new features unless it is a
significant win, and in the aio case, we would have different WAL disk
write models with and without aio, so it clearly could be two code
paths, and with two code paths, we can't as easily improve or optimize.
If we get a 2% boost out of some feature, but it later discourages us
from adding a 5% optimization, it is a loss.  And, in most cases, the 2%
optimization is for a few platforms, while the 5% optimization is for
all.  This code is 15+ years old, so we are looking way down the road,
not just at today's hot feature.

For example, Tom just improved DISTINCT by 25% by optimizing some of the
sorting and function call handling.  If we had more complex threaded
sort code, that may not have been possible, or it may have been possible
for him to optimize only one of the code paths.

I can't tell you how many aio/mmap/fancy-feature discussions we have
had.  We obviously do discuss them, but in the end, they turn out to be
of questionable value for the risk and complexity.  Still, we keep
talking, hoping we are wrong or that some good ideas come out of it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

