Thread: Sync vs. fsync during checkpoint

Sync vs. fsync during checkpoint

From
Bruce Momjian
Date:
As some know, win32 doesn't have sync, and some are concerned that sync
isn't reliable enough during checkpoint anyway.

The trick is to somehow record all files modified since the last
checkpoint, and open/fsync/close each one.   My idea is to stat() each
file in each directory and compare the modify time to determine if the
file has been modified since the last checkpoint.  I can't think of an
easier way to efficiently collect all modified files.  In this case, we
let the file system keep track of it for us.

However, on XP, I just tested if files that are kept open have their
modification times modified, and it seems they don't.  If I do:

    while :
    do
        echo test
        sleep 5
    done > x

I see the file size grow every 5 seconds, but I don't see the
modification time change.  Can someone confirm this?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The trick is to somehow record all files modified since the last
> checkpoint, and open/fsync/close each one.   My idea is to stat() each
> file in each directory and compare the modify time to determine if the
> file has been modified since the last checkpoint.

This seems a complete non-starter, as stat() generally has at best
one-second resolution on mod times, even if you assume that the kernel
keeps mod time fully up-to-date at all times.  In any case, it's
difficult to believe that stat'ing everything in a database directory
will be faster than keeping track of it for ourselves.

            regards, tom lane

Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The trick is to somehow record all files modified since the last
> > checkpoint, and open/fsync/close each one.   My idea is to stat() each
> > file in each directory and compare the modify time to determine if the
> > file has been modified since the last checkpoint.
>
> This seems a complete non-starter, as stat() generally has at best
> one-second resolution on mod times, even if you assume that the kernel
> keeps mod time fully up-to-date at all times.  In any case, it's
> difficult to believe that stat'ing everything in a database directory
> will be faster than keeping track of it for ourselves.

Yes, we would have to have a slop factor and fsync anything more than
one second before the last checkpoint.  Any ideas on how to record the
modified files without generating tons of output or locking contention?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Any ideas on how to record the
> modified files without generating tons of output or locking contention?

What I've suggested before is that the bgwriter process can keep track
of all files that it's written to since the last checkpoint, and fsync
them during checkpoint (this would likely require giving the checkpoint
task to the bgwriter instead of launching a separate process for it,
but that doesn't seem unreasonable).  Obviously this requires only local
storage in the bgwriter process, and hence no contention.

That leaves us still needing to account for files that are written
directly by a backend process and not by the bgwriter.  However, I claim
that if the bgwriter is worth the cycles it's expending, cases in which
a backend has to write out a page for itself will be infrequent enough
that we don't need to optimize them.  Therefore it would be enough to
have backends immediately sync any write they have to do.  (They might
as well use O_SYNC.)  Note that backends need not sync writes to temp
files or temp tables, only genuine shared tables.

If it turns out that it's not quite *that* infrequent, a compromise
position would be to keep a small list of files-needing-fsync in shared
memory.  Backends that have to evict pages from shared buffers add those
files to the list; the bgwriter periodically removes entries from the
list and fsyncs the files.  Only if there is no room in the list does a
backend have to fsync for itself.  If the list is touched often enough
that it becomes a source of contention, then the whole bgwriter concept
is completely broken :-(

Now this last plan does assume that an fsync applied by process X will
write pages that were dirtied by process Y through a different file
descriptor for the same file.  There's been some concern raised in the
past about whether we can assume that.  If not, though, the simpler
backends-must-sync-their-own-writes plan will still work.

            regards, tom lane

Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Bruce Momjian
Date:
Tom Lane wrote:
> What I've suggested before is that the bgwriter process can keep track
> of all files that it's written to since the last checkpoint, and fsync
> them during checkpoint (this would likely require giving the checkpoint
> task to the bgwriter instead of launching a separate process for it,
> but that doesn't seem unreasonable).  Obviously this requires only local
> storage in the bgwriter process, and hence no contention.
>
> That leaves us still needing to account for files that are written
> directly by a backend process and not by the bgwriter.  However, I claim
> that if the bgwriter is worth the cycles it's expending, cases in which
> a backend has to write out a page for itself will be infrequent enough
> that we don't need to optimize them.  Therefore it would be enough to
> have backends immediately sync any write they have to do.  (They might
> as well use O_SYNC.)  Note that backends need not sync writes to temp
> files or temp tables, only genuine shared tables.
>
> If it turns out that it's not quite *that* infrequent, a compromise
> position would be to keep a small list of files-needing-fsync in shared
> memory.  Backends that have to evict pages from shared buffers add those
> files to the list; the bgwriter periodically removes entries from the
> list and fsyncs the files.  Only if there is no room in the list does a
> backend have to fsync for itself.  If the list is touched often enough
> that it becomes a source of contention, then the whole bgwriter concept
> is completely broken :-(
>
> Now this last plan does assume that an fsync applied by process X will
> write pages that were dirtied by process Y through a different file
> descriptor for the same file.  There's been some concern raised in the
> past about whether we can assume that.  If not, though, the simpler
> backends-must-sync-their-own-writes plan will still work.

I am concerned that the bgwriter will not be able to keep up with the
I/O generated by even a single backend restoring a database, let alone a
busy system.  To me, the write() performed by the bgwriter, because it
is I/O, will typically be the bottleneck on any system that is I/O bound
(especially as the kernel buffers fill) and will not be able to keep up
with active backends now freed from writes.

The idea of falling back, when the bgwriter cannot keep up, to having the
backends sync the data themselves seems like it would just slow down an
I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but
could not come up with a clean way of having the backends communicate to
the bgwriter about their own non-sync writes.  We had the ideas of using
shared memory or a socket, but these seemed like choke-points.

Here is my new idea.  (I will keep throwing out ideas until I hit on a
good one.)  The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint.  My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time.  If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner.  I
don't think the bgwriter is going to be able to keep up with I/O-bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes.  (We might need a separate
bgwriter thread or process to do this.)

As I remember, our new queue system has a list of buffers that are most
likely to be replaced, so the bgwriter can scan those first and make
sure they have their booleans set.

There is an issue that these booleans are set without locking, so there
might need to be a double-check of them by backends, first before the
write, then after just before they replace the buffer.  The bgwriter
would clear the bits before the checkpoint starts.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Sync vs. fsync during checkpoint

From
Greg Stark
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> As some know, win32 doesn't have sync, and some are concerned that sync
> isn't reliable enough during checkpoint anyway.
> 
> The trick is to somehow record all files modified since the last
> checkpoint, and open/fsync/close each one.

Note that some people believe that if you do this it doesn't guarantee that
any data written to other file descriptors referring to the same files would
also get synced.

I am not one of those people however. Both Solaris and NetBSD kernel hackers
have told me those OS's would work in such a scheme and furthermore that they
cannot imagine any sane VFS that would fail.

I definitely think it's better than calling sync(2) which doesn't guarantee
the blocks are written by any particular time at all.

-- 
greg



Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Kevin Brown
Date:
Bruce Momjian wrote:
> Here is my new idea.  (I will keep throwing out ideas until I hit on a
> good one.)  The bgwriter it going to have to check before every write to
> determine if the file is already recorded as needing fsync during
> checkpoint.  My idea is to have that checking happen during the bgwriter
> buffer scan, rather than at write time.  if we add a shared memory
> boolean for each buffer, backends needing to write buffers can writer
> buffers already recorded as safe to write by the bgwriter scanner.  I
> don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.  (We might need a separate
> bgwriter thread to do this or a separate process.)

That seems a bit excessive.

It seems to me that contention is only a problem if you keep a
centralized list of files that have been written by all the backends.
So don't do that.

Instead, have each backend maintain its own separate list in shared
memory.  The only readers of a given list would be the backend it belongs
to and the bgwriter, and the only time bgwriter attempts to read the
list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write
lock on the list, copies it into its own memory space, truncates the
list, and then releases the read lock.  It then deletes the entries
out of its own list that have entries in the backend list it just read.
It then fsync()s the files that are left, under the assumption that the
backends will fsync() any file they write to directly.

The sum total size of all the lists shouldn't be that much larger than
it would be if you maintained it as a global list.  I'd conjecture that
backends that touch many of the same files are not likely to be touching a
large number of files per checkpoint, and those systems that touch a large
number of files probably do so through a lot of independent backends.


One other thing: I don't know exactly how checkpoints are orchestrated
between individual backends, but it seems clear to me that you want to do
a sync() *first*, then the fsync()s.  The reason is that sync() allows
the OS to order the writes across all the files in the most efficient
manner possible, whereas fsync() only takes care of the blocks belonging
to the file in question.  This won't be an option under Windows, but
on Unix systems it should make a difference.  On Linux it should make
quite a difference, since its sync() won't return until the buffers
have been flushed -- and then the following fsync()s will return almost
instantaneously since their data has already been written (so there
won't be any dirty blocks in those files).  I suppose it's possible that
on some OSes fsync()s could interfere with a running sync(), but for
those OSes we can just drop back to doing only fsync()s.


As usual, I could be completely full of it.  Take this for what it's
worth.  :-)


-- 
Kevin Brown                          kevin@sysexperts.com


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Kevin Brown
Date:
Some Moron at sysexperts.com wrote:
> At checkpoint time, for each backend list, the bgwriter grabs a write
> lock on the list, copies it into its own memory space, truncates the
> list, and then releases the read lock.

Sigh.  I meant to say that it then releases the *write* lock.


-- 
Kevin Brown                          kevin@sysexperts.com


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> Instead, have each backend maintain its own separate list in shared
> memory.  The only readers of a given list would be the backend it belongs
> to and the bgwriter, and the only time bgwriter attempts to read the
> list is at checkpoint time.

> The sum total size of all the lists shouldn't be that much larger than
> it would be if you maintained it as a global list.

I fear that is just wishful thinking.  Consider the system catalogs as a
counterexample of files that are likely to be touched/modified by many
different backends.

The bigger problem though with this is that it makes the problem of
list overflow much worse.  The hard part about shared memory management
is not so much that the available space is small, as that the available
space is fixed --- we can't easily change it after postmaster start.
The more finely you slice your workspace, the more likely it becomes
that one particular part will run out of space.  So the inefficient case
where a backend isn't able to insert something into the appropriate list
will become considerably more of a factor.

> but it seems clear to me that you want to do
> a sync() *first*, then the fsync()s.

Hmm, that's an interesting thought.  On a machine that's doing a lot of
stuff besides running the database, a global sync would be
counterproductive --- but we could easily make it configurable as to
whether to issue the sync() or not.  It wouldn't affect correctness.
        regards, tom lane


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Instead, have each backend maintain its own separate list in shared
> > memory.  The only readers of a given list would be the backend it belongs
> > to and the bgwriter, and the only time bgwriter attempts to read the
> > list is at checkpoint time.
> 
> > The sum total size of all the lists shouldn't be that much larger than
> > it would be if you maintained it as a global list.
> 
> I fear that is just wishful thinking.  Consider the system catalogs as a
> counterexample of files that are likely to be touched/modified by many
> different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot
of backends, just that the number of such files is likely to be relatively
small -- a few tens of files, perhaps.  But that admittedly can add up
fast.  But see below.


> The bigger problem though with this is that it makes the problem of
> list overflow much worse.  The hard part about shared memory management
> is not so much that the available space is small, as that the available
> space is fixed --- we can't easily change it after postmaster start.
> The more finely you slice your workspace, the more likely it becomes
> that one particular part will run out of space.  So the inefficient case
> where a backend isn't able to insert something into the appropriate list
> will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem.  If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.
How quickly depends on the individual kernel's implementation of the
dirty buffer list as it relates to file descriptors.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.  When anything
writes to a file without O_SYNC or a trailing fsync(), it sets the flag
for the relfilenode of what it's writing.  Then at checkpoint time, the
bgwriter can scan the list and fsync() everything that has been flagged.

The relfilenode list should be relatively small in size: at most 16
bytes per item (and that on a 64-bit machine).  A database that has 4096
file objects would have a 64K list at most.  Not bad.

Because each database backend can only see the class objects associated
with the database it's connected to or the global objects (if there's a
way to see all objects I'd like to know about it, but pg_class only
shows objects in the current database or objects which are visible to
all databases), the relfilenode list might have to be broken up into one
list per database, with perhaps a separate list for global objects.

The interesting question in that situation is how to handle object
creation and removal, which should be a relatively rare occurrence
(fortunately), so it supposedly doesn't have to be all that efficient.


-- 
Kevin Brown                          kevin@sysexperts.com


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> The more finely you slice your workspace, the more likely it becomes
>> that one particular part will run out of space.  So the inefficient case
>> where a backend isn't able to insert something into the appropriate list
>> will become considerably more of a factor.

> Well, running out of space in the list isn't that much of a problem.  If
> the backends run out of list space (and the max size of the list could
> be a configurable thing, either as a percentage of shared memory or as
> an absolute size), then all that happens is that the background writer
> might end up fsync()ing some files that have already been fsync()ed.
> But that's not that big of a deal -- the fact they've already been
> fsync()ed means that there shouldn't be any data in the kernel buffers
> left to write to disk, so subsequent fsync()s should return quickly.

Yes, it's a big deal.  You're arguing as though the bgwriter is the
thing that needs to be fast, when actually what we care about is the
backends being fast.  If the bgwriter isn't doing the vast bulk of the
writing (and especially the fsync waits) then we are wasting our time
having one at all.  So we need a scheme that makes it as unlikely as
possible that backends will have to do their own fsyncs.  Small
per-backend fsync lists aren't the way to do that.

> Perhaps a better way to do it would be to store the list of all the
> relfilenodes of everything in pg_class, with a flag for each indicating
> whether or not an fsync() of the file needs to take place.

You're forgetting that we have a fixed-size workspace to do this in ...
and no way to know at postmaster start how many relations there are in
any of our databases, let alone predict how many there might be later on.
        regards, tom lane


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Kevin Brown
Date:
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Well, running out of space in the list isn't that much of a problem.  If
> > the backends run out of list space (and the max size of the list could
> > be a configurable thing, either as a percentage of shared memory or as
> > an absolute size), then all that happens is that the background writer
> > might end up fsync()ing some files that have already been fsync()ed.
> > But that's not that big of a deal -- the fact they've already been
> > fsync()ed means that there shouldn't be any data in the kernel buffers
> > left to write to disk, so subsequent fsync()s should return quickly.
> 
> Yes, it's a big deal.  You're arguing as though the bgwriter is the
> thing that needs to be fast, when actually what we care about is the
> backends being fast.  If the bgwriter isn't doing the vast bulk of the
> writing (and especially the fsync waits) then we are wasting our time
> having one at all.  So we need a scheme that makes it as unlikely as
> possible that backends will have to do their own fsyncs.  Small
> per-backend fsync lists aren't the way to do that.

Ah, okay.  Pardon me, I was writing on low sleep at the time.

If we want to make the backends as fast as possible then they should
defer synchronous writes to someplace else.  But that someplace else
could easily be a process forked by the backend in question whose sole
purpose is to go through the list of files generated by its parent backend
and fsync() them.  The backend can then go about its business and upon
receipt of the SIGCHLD notify anyone that needs to be notified that the
fsync()s have completed.  This approach on any reasonable OS will have
minimal overhead because of copy-on-write page handling in the kernel
and the fact that the child process isn't going to exec() or write to
a bunch of memory.  The advantage is that each backend can maintain its
own list in per-process memory instead of using shared memory.  The
disadvantage is that a given file could have multiple simultaneous (or
close to simultaneous) fsync()s issued against it.  As noted previously,
that might not be such a big deal.

You could still build a list in shared memory of the files that backends
are accessing but it would then be a cache of sorts because it would
be fixed in size.  As soon as you run out of space in the shared list,
you'll have to expire some entries.  An expired entry simply means
that multiple fsync()s might be issued for the file being referred to.
But I suspect that such a list would have far too much contention,
and that it would be more efficient to simply risk issuing multiple
fsync()s against the same file by multiple backend children.

Another advantage to the child-of-backend-fsync()s approach is that it
would cause simultaneous fsync()s to happen, and on more advanced OSes
the OS itself should be able to coalesce the work to be done into a more
efficient pattern of writes to the disk.  That won't be possible if
fsync()s are serialized by PG.  It's not as good as a syscall that would
allow you to fsync() a bunch of file descriptors simultaneously, but it
might be close.

I have no idea whether or not this approach would work in Windows.

> > Perhaps a better way to do it would be to store the list of all the
> > relfilenodes of everything in pg_class, with a flag for each indicating
> > whether or not an fsync() of the file needs to take place.
> 
> You're forgetting that we have a fixed-size workspace to do this in ...
> and no way to know at postmaster start how many relations there are in
> any of our databases, let alone predict how many there might be later on.

Unfortunately, this is going to apply to most any approach.  The number
of blocks being dealt with is not fixed, because even though the cache
itself is fixed in size, the number of block writes it represents (and
thus the number of files involved) is not.  The list of files itself is
not fixed in size, either.

However, this *does* suggest another possible approach: you set up a
fixed size list and fsync() the batch when it fills up.


It sounds like we need to define the particular behavior we want first.
We're optimizing for some combination of throughput and responsiveness,
and those aren't necessarily the same thing.  I suppose this means that
the solution chosen has to have enough knobs to allow the DBA to pick
where on the throughput/responsiveness curve he wants to be.


-- 
Kevin Brown                          kevin@sysexperts.com


Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint

From
Kevin Brown
Date:
I wrote:
> But that someplace else
> could easily be a process forked by the backend in question whose sole
> purpose is to go through the list of files generated by its parent backend
> and fsync() them.  The backend can then go about its business and upon
> receipt of the SIGCHLD notify anyone that needs to be notified that the
> fsync()s have completed.

Duh, what am I thinking?  Of course, the right answer is to have the
child notify anyone that needs notification that fsync()s are done.  No
need for involvement of the parent (i.e., the backend in question)
unless the architecture of PG requires it somehow.



-- 
Kevin Brown                          kevin@sysexperts.com


Re: [pgsql-hackers-win32] Sync vs. fsync during

From
Sailesh Krishnamurthy
Date:
>>>>> "Kevin" == Kevin Brown <kevin@sysexperts.com> writes:
    >> The bigger problem though with this is that it makes the
    >> problem of list overflow much worse.  The hard part about
    >> shared memory management is not so much that the available
    >> space is small, as that the available space is fixed --- we
    >> can't easily change it after postmaster start.  The more finely

Again, I can suggest the shared memory MemoryContext we use in
TelegraphCQ that is based on the OSSP libmm memory manager. We use it
to grow and shrink shared memory at will.

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh