Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers

From Kevin Brown
Subject Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Msg-id 20040207035541.GG2608@filer
In response to Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
List pgsql-hackers
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Well, running out of space in the list isn't that much of a problem.  If
> > the backends run out of list space (and the max size of the list could
> > be a configurable thing, either as a percentage of shared memory or as
> > an absolute size), then all that happens is that the background writer
> > might end up fsync()ing some files that have already been fsync()ed.
> > But that's not that big of a deal -- the fact they've already been
> > fsync()ed means that there shouldn't be any data in the kernel buffers
> > left to write to disk, so subsequent fsync()s should return quickly.
> 
> Yes, it's a big deal.  You're arguing as though the bgwriter is the
> thing that needs to be fast, when actually what we care about is the
> backends being fast.  If the bgwriter isn't doing the vast bulk of the
> writing (and especially the fsync waits) then we are wasting our time
> having one at all.  So we need a scheme that makes it as unlikely as
> possible that backends will have to do their own fsyncs.  Small
> per-backend fsync lists aren't the way to do that.

Ah, okay.  Pardon me; I was writing on too little sleep at the time.

If we want to make the backends as fast as possible then they should
defer synchronous writes to someplace else.  But that someplace else
could easily be a process forked by the backend in question whose sole
purpose is to go through the list of files generated by its parent backend
and fsync() them.  The backend can then go about its business and,
upon receipt of the SIGCHLD, notify anyone who needs to know that the
fsync()s have completed.  On any reasonable OS this approach will have
minimal overhead because of copy-on-write page handling in the kernel
and the fact that the child process isn't going to exec() or write to
a bunch of memory.  The advantage is that each backend can maintain its
own list in per-process memory instead of using shared memory.  The
disadvantage is that a given file could have multiple simultaneous (or
close to simultaneous) fsync()s issued against it.  As noted previously,
that might not be such a big deal.
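
For concreteness, here's a rough sketch of the sort of thing I have in
mind (purely illustrative code, not anything that exists in the tree;
the names and the fixed-size list are made up):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* descriptors this backend has written through; illustrative only */
    static int pending_fds[1024];
    static int npending = 0;

    /* Fork a child to fsync() everything on the list, then keep working. */
    static pid_t
    start_fsync_child(void)
    {
        pid_t   pid = fork();

        if (pid == 0)                   /* child: flush the list and exit */
        {
            int     i;

            for (i = 0; i < npending; i++)
                if (fsync(pending_fds[i]) != 0)
                    _exit(1);           /* parent sees a nonzero status */
            _exit(0);
        }
        return pid;                     /* parent: continue immediately */
    }

    int
    main(void)
    {
        pid_t   child = start_fsync_child();
        int     status;

        /* ... the backend keeps servicing queries here ... */

        /* a real backend would catch SIGCHLD rather than block here */
        if (waitpid(child, &status, 0) == child &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            printf("deferred fsync()s complete\n");
        return 0;
    }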

You could still build a list in shared memory of the files that the
backends are accessing, but it would then have to be a cache of sorts,
because it would
be fixed in size.  As soon as you run out of space in the shared list,
you'll have to expire some entries.  An expired entry simply means
that multiple fsync()s might be issued for the file being referred to.
But I suspect that such a list would have far too much contention,
and that it would be more efficient to simply risk issuing multiple
fsync()s against the same file by multiple backend children.
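
Just to be concrete about what I mean by expiring entries, something
like this (hypothetical; a real version would live in shared memory
and take a lock around every access, which is exactly where the
contention worry comes from):

    #define FSYNC_CACHE_SIZE 256        /* arbitrary, for illustration */

    typedef struct
    {
        unsigned int relfilenode;       /* which file is dirty */
        int          needs_fsync;       /* cleared once somebody syncs it */
    } FsyncCacheEntry;

    typedef struct
    {
        int             next_victim;    /* simple round-robin expiry */
        FsyncCacheEntry entries[FSYNC_CACHE_SIZE];
    } FsyncCache;

    /*
     * Record that a file needs an fsync().  If the cache is full, expire
     * an arbitrary entry; the only cost is a possibly-redundant fsync()
     * of that file later on.
     */
    static void
    fsync_cache_add(FsyncCache *cache, unsigned int relfilenode)
    {
        int     i;

        for (i = 0; i < FSYNC_CACHE_SIZE; i++)
            if (cache->entries[i].needs_fsync &&
                cache->entries[i].relfilenode == relfilenode)
                return;                 /* already tracked */

        for (i = 0; i < FSYNC_CACHE_SIZE; i++)
            if (!cache->entries[i].needs_fsync)
                break;                  /* found a free slot */

        if (i == FSYNC_CACHE_SIZE)      /* full: expire somebody */
        {
            i = cache->next_victim;
            cache->next_victim = (i + 1) % FSYNC_CACHE_SIZE;
        }

        cache->entries[i].relfilenode = relfilenode;
        cache->entries[i].needs_fsync = 1;
    }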

Another advantage of the child-of-backend-fsync() approach is that
fsync()s would be issued simultaneously, and on more advanced OSes
the OS itself should be able to coalesce the work to be done into a more
efficient pattern of writes to the disk.  That won't be possible if
fsync()s are serialized by PG.  It's not as good as a syscall that would
allow you to fsync() a bunch of file descriptors simultaneously, but it
might be close.
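
(For what it's worth, POSIX AIO comes fairly close to that where it's
implemented: aio_fsync() lets a single process queue several
fsync-style requests and then wait for the whole batch.  A sketch,
assuming a working AIO implementation:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    #define MAX_BATCH 64

    /* Queue fsync()-equivalents for all the fds, then wait for them all. */
    static int
    fsync_many(int *fds, int nfds)
    {
        struct aiocb        cbs[MAX_BATCH];
        const struct aiocb *waitlist[MAX_BATCH];
        int                 i;

        if (nfds > MAX_BATCH)
            return -1;                  /* keep the sketch simple */

        for (i = 0; i < nfds; i++)
        {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fds[i];
            if (aio_fsync(O_SYNC, &cbs[i]) != 0)
                return -1;              /* caller falls back to plain fsync() */
            waitlist[i] = &cbs[i];
        }

        /* all requests are in flight; the kernel may coalesce the I/O */
        for (i = 0; i < nfds; i++)
        {
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(waitlist, nfds, NULL);
            if (aio_return(&cbs[i]) != 0)
                return -1;
        }
        return 0;
    }

Whether the platforms we care about implement it well enough to rely
on is another question entirely.)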

I have no idea whether or not this approach would work in Windows.

> > Perhaps a better way to do it would be to store the list of all the
> > relfilenodes of everything in pg_class, with a flag for each indicating
> > whether or not an fsync() of the file needs to take place.
> 
> You're forgetting that we have a fixed-size workspace to do this in ...
> and no way to know at postmaster start how many relations there are in
> any of our databases, let alone predict how many there might be later on.

Unfortunately, this is going to apply to almost any approach.  Even
though the buffer cache itself is fixed in size, the number of block
writes it represents over time (and thus the number of files involved)
is not.  So the list of files needing fsync() isn't bounded, either.

However, this *does* suggest another possible approach: set up a
fixed-size list and fsync() the batch whenever it fills up.
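
In other words, something along these lines (a minimal sketch; the
batch size and the names are arbitrary):

    #include <unistd.h>

    #define BATCH_SIZE 32               /* arbitrary, for illustration */

    static int batch_fds[BATCH_SIZE];
    static int batch_len = 0;

    /* Remember that fd needs an fsync(); flush the whole batch once full. */
    static void
    note_dirty_fd(int fd)
    {
        int     i;

        for (i = 0; i < batch_len; i++)
            if (batch_fds[i] == fd)
                return;                 /* already queued */

        batch_fds[batch_len++] = fd;

        if (batch_len == BATCH_SIZE)
        {
            for (i = 0; i < batch_len; i++)
                (void) fsync(batch_fds[i]);     /* error handling elided */
            batch_len = 0;
        }
    }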


It sounds like we need to define the particular behavior we want first.
We're optimizing for some combination of throughput and responsiveness,
and those aren't necessarily the same thing.  I suppose this means that
the solution chosen has to have enough knobs to allow the DBA to pick
where on the throughput/responsiveness curve he wants to be.


-- 
Kevin Brown                          kevin@sysexperts.com

