Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Date
Msg-id 200402011333.i11DXmp29214@candle.pha.pa.us
In response to Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
List pgsql-hackers
Tom Lane wrote:
> What I've suggested before is that the bgwriter process can keep track
> of all files that it's written to since the last checkpoint, and fsync
> them during checkpoint (this would likely require giving the checkpoint
> task to the bgwriter instead of launching a separate process for it,
> but that doesn't seem unreasonable).  Obviously this requires only local
> storage in the bgwriter process, and hence no contention.
>
> That leaves us still needing to account for files that are written
> directly by a backend process and not by the bgwriter.  However, I claim
> that if the bgwriter is worth the cycles it's expending, cases in which
> a backend has to write out a page for itself will be infrequent enough
> that we don't need to optimize them.  Therefore it would be enough to
> have backends immediately sync any write they have to do.  (They might
> as well use O_SYNC.)  Note that backends need not sync writes to temp
> files or temp tables, only genuine shared tables.
>
> If it turns out that it's not quite *that* infrequent, a compromise
> position would be to keep a small list of files-needing-fsync in shared
> memory.  Backends that have to evict pages from shared buffers add those
> files to the list; the bgwriter periodically removes entries from the
> list and fsyncs the files.  Only if there is no room in the list does a
> backend have to fsync for itself.  If the list is touched often enough
> that it becomes a source of contention, then the whole bgwriter concept
> is completely broken :-(
>
> Now this last plan does assume that an fsync applied by process X will
> write pages that were dirtied by process Y through a different file
> descriptor for the same file.  There's been some concern raised in the
> past about whether we can assume that.  If not, though, the simpler
> backends-must-sync-their-own-writes plan will still work.
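
For reference, the compromise plan quoted above (a small list of
files-needing-fsync, with a backend syncing for itself only when the
list is full) could look roughly like the sketch below.  All names here
(PendingFsyncList, pending_add, pending_drain, PENDING_MAX, FileId) are
hypothetical and not taken from the PostgreSQL source.  The locking a
real shared-memory list would need is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PENDING_MAX 4           /* hypothetical capacity, small on purpose */

typedef unsigned int FileId;    /* hypothetical stand-in for a file identifier */

typedef struct PendingFsyncList {
    FileId files[PENDING_MAX];
    size_t count;
} PendingFsyncList;

/*
 * Backend side: record that `file` needs an fsync.
 * Returns false when the list is full, in which case the caller
 * has to fsync the file itself.
 * NOTE: no locking here; a shared-memory version would need some.
 */
bool pending_add(PendingFsyncList *list, FileId file)
{
    assert(list != NULL);
    for (size_t i = 0; i < list->count; i++) {
        if (list->files[i] == file)
            return true;        /* already recorded, nothing to do */
    }
    if (list->count == PENDING_MAX)
        return false;           /* no room: caller must sync on its own */
    list->files[list->count++] = file;
    return true;
}

/*
 * bgwriter side: copy all entries into `out` (which must have room for
 * PENDING_MAX entries) and empty the list.  The caller would then
 * fsync each returned file.  Returns the number of entries copied.
 */
size_t pending_drain(PendingFsyncList *list, FileId *out)
{
    assert(list != NULL && out != NULL);
    size_t n = list->count;
    for (size_t i = 0; i < n; i++)
        out[i] = list->files[i];
    list->count = 0;
    return n;
}
```

The interesting case is the false return from pending_add(): that is
the point where a backend falls back to doing its own fsync.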

I am concerned that the bgwriter will not be able to keep up with the
I/O generated by even a single backend restoring a database, let alone a
busy system.  To me, the write() performed by the bgwriter, because it
is I/O, will typically be the bottleneck on any system that is I/O bound
(especially as the kernel buffers fill) and will not be able to keep up
with active backends now freed from writes.

The proposed fallback when the bgwriter cannot keep up is to have the
backends sync the data themselves, which seems like it would just slow
down an I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but
could not come up with a clean way of having the backends tell the
bgwriter about their own non-sync writes.  We thought of using shared
memory or a socket, but both seemed like choke-points.

Here is my new idea.  (I will keep throwing out ideas until I hit on a
good one.)  The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint.  My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time.  If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner.  I
don't think the bgwriter is going to be able to keep up with I/O-bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes.  (We might need a separate
bgwriter thread or a separate process to do this.)
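
A rough sketch of the scan-and-flag idea, assuming a fixed array of
buffers and a bgwriter-private table of files to fsync at checkpoint.
Every name below (Buffer, FsyncSet, scan_and_flag,
backend_can_write_unsynced, NBUFFERS, NFILES) is made up for
illustration and does not match the actual buffer manager structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUFFERS 8      /* hypothetical buffer pool size */
#define NFILES   4      /* hypothetical number of distinct files */

typedef struct Buffer {
    int  file;              /* index of the file this page belongs to */
    bool dirty;             /* page has unwritten changes */
    bool fsync_recorded;    /* the proposed per-buffer boolean */
} Buffer;

/* bgwriter-local record of files to fsync at the next checkpoint */
typedef struct FsyncSet {
    bool needs_fsync[NFILES];
} FsyncSet;

/*
 * Scanner pass: for every dirty buffer not yet flagged, record its file
 * as needing fsync at checkpoint and set the buffer's boolean.
 * No I/O happens here, which is why the scan should be cheap.
 * Returns the number of buffers newly flagged.
 */
size_t scan_and_flag(Buffer *bufs, size_t n, FsyncSet *set)
{
    size_t flagged = 0;
    for (size_t i = 0; i < n; i++) {
        if (!bufs[i].dirty || bufs[i].fsync_recorded)
            continue;
        assert(bufs[i].file >= 0 && bufs[i].file < NFILES);
        set->needs_fsync[bufs[i].file] = true;
        bufs[i].fsync_recorded = true;
        flagged++;
    }
    return flagged;
}

/*
 * Backend side: a dirty buffer whose boolean is set can be written with
 * a plain write(), because its file is already scheduled for fsync.
 */
bool backend_can_write_unsynced(const Buffer *buf)
{
    return buf->dirty && buf->fsync_recorded;
}
```

The point of the split is that the scanner only touches memory, so it
can stay ahead of the backends even when the write() calls themselves
are the bottleneck.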

As I remember, our new queue system has a list of buffers that are most
likely to be replaced, so the bgwriter can scan those first and make
sure they have their booleans set.

There is an issue that these booleans are set without locking, so
backends might need to check them twice: once before the write, and
again afterward, just before they replace the buffer.  The bgwriter
would clear the bits before the checkpoint starts.
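
The double-check could be sketched as below, using C11 atomics in place
of a lock.  This is only an illustration under that assumption; the
names (BufferFlag, flag_init, flag_is_set, bgwriter_set_flag,
bgwriter_clear_flags, backend_may_skip_fsync) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct BufferFlag {
    atomic_bool fsync_recorded;     /* set by the scanner, read by backends */
} BufferFlag;

void flag_init(BufferFlag *f)
{
    assert(f != NULL);
    atomic_init(&f->fsync_recorded, false);
}

bool flag_is_set(BufferFlag *f)
{
    assert(f != NULL);
    return atomic_load(&f->fsync_recorded);
}

/* Scanner: the buffer's file is now recorded for fsync at checkpoint. */
void bgwriter_set_flag(BufferFlag *f)
{
    assert(f != NULL);
    atomic_store(&f->fsync_recorded, true);
}

/* bgwriter: clear every bit before the checkpoint starts. */
void bgwriter_clear_flags(BufferFlag *flags, size_t n)
{
    assert(flags != NULL);
    for (size_t i = 0; i < n; i++)
        atomic_store(&flags[i].fsync_recorded, false);
}

/*
 * Backend: second half of the double-check.  `seen_before_write` is the
 * value of flag_is_set() sampled before the write().  The backend may
 * skip its own fsync only if the bit was set both before the write and
 * now, just before the buffer is replaced.
 */
bool backend_may_skip_fsync(bool seen_before_write, BufferFlag *f)
{
    assert(f != NULL);
    return seen_before_write && atomic_load(&f->fsync_recorded);
}
```

A backend would sample flag_is_set(), do its write(), then call
backend_may_skip_fsync(); a false result means it has to fsync for
itself.  This sketch does not handle the bit being cleared and then set
again between the two checks.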

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
