Re: Background writer process - Mailing list pgsql-hackers

From Shridhar Daithankar
Subject Re: Background writer process
Date
Msg-id 3FB9C35B.3000201@myrealbox.com
In response to Re: Background writer process  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian wrote:

> Shridhar Daithankar wrote:
> 
>>On Friday 14 November 2003 22:10, Jan Wieck wrote:
>>
>>>Shridhar Daithankar wrote:
>>>
>>>>On Friday 14 November 2003 03:05, Jan Wieck wrote:
>>>>
>>>>>For sure the sync() needs to be replaced by the discussed fsync() of
>>>>>recently written files. And I think the algorithm how much and how often
>>>>>to flush can be significantly improved. But after all, this does not
>>>>>change the real checkpointing at all, and the general framework having a
>>>>>separate process is what we probably want.
>>>>
>>>>Is having fsync for regular data files and sync for WAL segments a
>>>>comfortable compromise?  Or is this going to use fsync for all of them?
>>>>
>>>>IMO, with fsync, we tell kernel that you can write this buffer. It may or
>>>>may not write it immediately, unless it is hard sync.
>>>
>>>I think it's more the other way around. On some systems sync() might
>>>return before all buffers are flushed to disk, while fsync() does not.
>>
>>Oops.. that's bad.
> 
> 
> Yes, one idea I had was to do an fsync on a new file _after_ issuing
> sync, hoping that this will complete after all the sync buffers are
> done.
> 
> 
>>>>Since postgresql can afford lazy writes for data files, I think this
>>>>could work.
>>>
>>>The whole point of a checkpoint is to know for certain that a specific
>>>change is in the datafile, so that it is safe to throw away older WAL
>>>segments.
>>
>>I just made another posting on patches for a thread crossing win32-devel.
>>
>>Essentially I said
>>
>>1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC. (Not sure if the current
>>code does it. The hackery in xlog.c is not exactly trivial.)
> 
> 
> We write WAL, then fsync, so if we write multiple blocks, we can write
> them and fsync once, rather than O_SYNC every write.
> 
> 
>>2. Open data files normally and fsync them only in background writer process.
>>
>>Now BGWriter process will flush everything at the time of checkpointing. It 
>>does not need to flush WAL because of O_SYNC(ideally but an additional fsync 
>>won't hurt). So it just flushes all the file descriptors touched since last 
>>checkpoint, which should not be much of a load because it is flushing those 
>>files intermittently anyways.
>>
>>It could also work nicely if only the background writer fsyncs the data files.
>>Backends can either wait or proceed to other business by the time the disk is
>>flushed. Backends certainly need to wait while committing, and it should be
>>a rather small delay of syncing to disk in the current process as opposed to
>>in the background process.
>>
>>In case of commit, BGWriter could get away with files touched in transaction
>>+WAL as opposed to all files touched since last checkpoint+WAL in case of 
>>checkpoint. I don't know how difficult that would be.
>>
>>What is different in current BGwriter implementation? Use of sync()?
> 
> 
> Well, basically we are still discussing how to do this.  Right now the
> backend writer patch uses sync(), but the final version will use fsync
> or O_SYNC, or maybe nothing.
> 
> The open item is whether a background process can keep the dirty
> buffers cleaned fast enough to keep up with the maximum number of
> backends.  We might need to use multiple processes or threads to do
> this.   We certainly will have a background writer in 7.5 --- the big
> question is whether _all_ writes will go through it.   It certainly would
> be nice if they could, and Tom thinks they can, so we are still exploring
> this.

Given that fsync blocks, the background writer has to scale up, in processes
or threads, as the disk-flushing load grows.

I would vote for threads, for a simple reason: in BGWriter, threads are
needed only to flush files. Get the fd, fsync it, and get the next one. There is
no need to make the entire process thread-safe.

Furthermore, BGWriter has to detect the disk limit. If adding threads does not
improve fsync throughput, it should stop adding them and wait. There is nothing
more to do once the disk is saturated.
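
To make the idea concrete, here is a minimal sketch in Python. It is not
PostgreSQL code and not taken from any patch. The names fsync_files,
should_add_worker and make_dirty_files, the 10% gain threshold, and the cap of
8 threads are all assumptions made up for illustration. Each worker thread
only takes a descriptor and fsyncs it, and the loop stops adding threads once
an extra one no longer improves the measured rate.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def fsync_files(fds, workers):
    """Fsync every descriptor in fds with `workers` threads.

    Returns the achieved rate in files per second.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task is just "take an fd, fsync it"; no shared state is touched.
        list(pool.map(os.fsync, fds))
    elapsed = time.monotonic() - start
    return len(fds) / elapsed if elapsed > 0 else float("inf")


def should_add_worker(prev_rate, new_rate, min_gain=0.10):
    """True if the latest thread count beat the previous rate by min_gain."""
    # min_gain is a hypothetical tuning knob, not a PostgreSQL setting.
    return new_rate >= prev_rate * (1.0 + min_gain)


def make_dirty_files(directory, count, size=8192):
    """Create `count` files with unflushed data; return their descriptors."""
    fds = []
    for i in range(count):
        path = os.path.join(directory, f"seg{i}")
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.write(fd, b"x" * size)
        fds.append(fd)
    return fds


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        workers, prev_rate = 1, 0.0
        while workers <= 8:  # arbitrary cap for the demonstration
            fds = make_dirty_files(d, 64)
            rate = fsync_files(fds, workers)
            for fd in fds:
                os.close(fd)
            print(f"{workers} thread(s): {rate:.0f} fsyncs/sec")
            if not should_add_worker(prev_rate, rate):
                break  # disk looks saturated; stop adding threads
            prev_rate, workers = rate, workers + 1
```

A single timing run like this is noisy, so a real implementation would
presumably average over several rounds before deciding that the disk is
saturated.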

> If the background writer uses fsync, it can write and allow the buffer
> to be reused and fsync later, while if we use O_SYNC, we have to wait
> for the O_SYNC write to happen before reusing the buffer;  that will be
> slower.

Certainly. However, a file opened with O_SYNC would not require a separate
fsync. I suggested it only for WAL. But given the WAL block grouping suggested
in another post, using fsync for all files might be a good idea.
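
The difference between the two approaches can be sketched as follows. This is
an illustration in Python under stated assumptions, not the code in xlog.c.
The function names are hypothetical, and os.O_SYNC is not available on every
platform, so the sketch falls back to 0 (plain writes) where it is missing.
With O_SYNC each write waits for the disk, while the second function writes
all blocks and pays for one fsync at the end.

```python
import os

# Assumption: where the platform lacks O_SYNC we fall back to plain writes.
O_SYNC = getattr(os, "O_SYNC", 0)


def write_all(fd, data):
    """os.write may write fewer bytes than asked; loop until all are out."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_with_o_sync(path, blocks):
    """Every write returns only after the data has reached the disk."""
    flags = os.O_CREAT | os.O_WRONLY | os.O_TRUNC | O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        for block in blocks:
            write_all(fd, block)  # one synchronous wait per block
    finally:
        os.close(fd)


def write_then_fsync_once(path, blocks):
    """Writes go to the OS cache; a single fsync flushes the whole group."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        for block in blocks:
            write_all(fd, block)  # buffer can be reused right away
        os.fsync(fd)              # one wait for all blocks
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file; they differ only in how many
times the caller waits for the disk.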

Just a thought.
 Shridhar


