Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers

From Kevin Brown
Subject Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Msg-id 20040203103605.GC2608@filer
In response to Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Kevin Brown <kevin@sysexperts.com>)
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Bruce Momjian wrote:
> Here is my new idea.  (I will keep throwing out ideas until I hit on a
> good one.)  The bgwriter is going to have to check before every write to
> determine if the file is already recorded as needing fsync during
> checkpoint.  My idea is to have that checking happen during the bgwriter
> buffer scan, rather than at write time.  If we add a shared memory
> boolean for each buffer, backends needing to write buffers can write
> buffers already recorded as safe to write by the bgwriter scanner.  I
> don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.  (We might need a separate
> bgwriter thread to do this or a separate process.)

That seems a bit excessive.

It seems to me that contention is only a problem if you keep a
centralized list of files that have been written by all the backends.
So don't do that.

Instead, have each backend maintain its own separate list in shared
memory.  The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write
lock on the list, copies it into its own memory space, truncates the
list, and then releases the lock.  It then deletes the entries
out of its own list that have entries in the backend list it just read.
It then fsync()s the files that are left, under the assumption that the
backends will fsync() any file they write to directly.

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list.  I'd conjecture that
backends that touch many of the same files are not likely to be touching a
large number of files per checkpoint, and those systems that touch a large
number of files probably do so through a lot of independent backends.


One other thing: I don't know exactly how checkpoints are orchestrated
between individual backends, but it seems clear to me that you want to do
a sync() *first*, then the fsync()s.  The reason is that sync() allows
the OS to order the writes across all the files in the most efficient
manner possible, whereas fsync() only takes care of the blocks belonging
to the file in question.  This won't be an option under Windows, but
on Unix systems it should make a difference.  On Linux it should make
quite a difference, since its sync() won't return until the buffers
have been flushed -- and then the following fsync()s will return almost
instantaneously since their data has already been written (so there
won't be any dirty blocks in those files).  I suppose it's possible that
on some OSes fsync()s could interfere with a running sync(), but for
those OSes we can just drop back to doing only fsync()s.


As usual, I could be completely full of it.  Take this for what it's
worth.  :-)


-- 
Kevin Brown                          kevin@sysexperts.com

