Subject: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
From: Kevin Brown <kevin@sysexperts.com>
Msg-id: 20040204213634.GF2608@filer
In response to: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Instead, have each backend maintain its own separate list in shared
> > memory.  The only readers of a given list would be the backend it belongs
> > to and the bgwriter, and the only time bgwriter attempts to read the
> > list is at checkpoint time.
> 
> > The sum total size of all the lists shouldn't be that much larger than
> > it would be if you maintained it as a global list.
> 
> I fear that is just wishful thinking.  Consider the system catalogs as a
> counterexample of files that are likely to be touched/modified by many
> different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot
of backends, just that the number of such files is likely to be relatively
small -- a few tens of files, perhaps.  Admittedly that can add up fast,
but see below.


> The bigger problem though with this is that it makes the problem of
> list overflow much worse.  The hard part about shared memory management
> is not so much that the available space is small, as that the available
> space is fixed --- we can't easily change it after postmaster start.
> The more finely you slice your workspace, the more likely it becomes
> that one particular part will run out of space.  So the inefficient case
> where a backend isn't able to insert something into the appropriate list
> will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem.  If
the backends run out of list space (and the max size of the list could
be a configurable thing, either as a percentage of shared memory or as
an absolute size), then all that happens is that the background writer
might end up fsync()ing some files that have already been fsync()ed.
But that's not that big of a deal -- the fact that they've already been
fsync()ed means that there shouldn't be any data in the kernel buffers
left to write to disk, so subsequent fsync()s should return quickly.
How quickly depends on the individual kernel's implementation of the
dirty buffer list as it relates to file descriptors.
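
Just to sketch what I mean (this is not actual PostgreSQL code -- the
names, the fixed list size, and identifying files by path are all my
own assumptions), a per-backend pending-fsync list with that overflow
behavior might look something like this:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define PENDING_MAX   256   /* assumed per-backend list capacity */
#define SYNC_PATH_LEN 64

typedef struct PendingFsyncList
{
    int  nentries;
    char paths[PENDING_MAX][SYNC_PATH_LEN];
} PendingFsyncList;

/*
 * Backend side: remember a file that will need an fsync() at the next
 * checkpoint.  If the list is full, just fsync() it now; the bgwriter
 * may fsync() it again later, but that second call finds no dirty
 * kernel buffers left for the file and should return quickly.
 */
int
remember_dirty_file(PendingFsyncList *list, int fd, const char *path)
{
    if (list->nentries < PENDING_MAX)
    {
        strncpy(list->paths[list->nentries], path, SYNC_PATH_LEN - 1);
        list->paths[list->nentries][SYNC_PATH_LEN - 1] = '\0';
        list->nentries++;
        return 0;
    }
    return fsync(fd);       /* overflow: sync immediately */
}

/* bgwriter side, at checkpoint: open and fsync() everything remembered. */
int
sync_pending_files(PendingFsyncList *list)
{
    int i;

    for (i = 0; i < list->nentries; i++)
    {
        int fd = open(list->paths[i], O_RDWR);

        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
    }
    list->nentries = 0;
    return 0;
}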

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.  When anything
writes to a file without O_SYNC or a trailing fsync(), it sets the flag
for the relfilenode of what it's writing.  Then at checkpoint time, the
bgwriter can scan the list and fsync() everything that has been flagged.
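
Roughly, and with made-up names (the struct layout here is only
illustrative, not the real catalog definition), the shared structure
and the two sides of the protocol could look like:

#include <stdbool.h>

typedef unsigned int Oid;

typedef struct RelFileNode      /* illustrative: tablespace/database/relation */
{
    Oid spcNode;
    Oid dbNode;
    Oid relNode;
} RelFileNode;

typedef struct RelSyncEntry     /* 16 bytes per entry with typical padding */
{
    RelFileNode rnode;
    bool        needs_fsync;    /* set by writers, cleared by the bgwriter */
} RelSyncEntry;

/* Writer side: flag a relation whose file was written without O_SYNC. */
void
flag_relation_dirty(RelSyncEntry *entries, int nentries, RelFileNode rnode)
{
    int i;

    for (i = 0; i < nentries; i++)
    {
        if (entries[i].rnode.spcNode == rnode.spcNode &&
            entries[i].rnode.dbNode == rnode.dbNode &&
            entries[i].rnode.relNode == rnode.relNode)
        {
            entries[i].needs_fsync = true;
            return;
        }
    }
}

/* bgwriter side, at checkpoint: sync only the flagged relations. */
void
checkpoint_fsync_flagged(RelSyncEntry *entries, int nentries,
                         void (*fsync_relation)(RelFileNode rnode))
{
    int i;

    for (i = 0; i < nentries; i++)
    {
        if (entries[i].needs_fsync)
        {
            fsync_relation(entries[i].rnode);   /* open + fsync the file(s) */
            entries[i].needs_fsync = false;
        }
    }
}

A real version would presumably find the entry through a hash rather
than a linear scan, and would need some care about concurrent access to
the flag, but the flag discipline is the interesting part.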

The relfilenode list should be relatively small in size: at most 16
bytes per item (and that on a 64-bit machine).  A database that has 4096
file objects would have a 64K list at most.  Not bad.

Because each database backend can only see the class objects associated
with the database it's connected to or the global objects (if there's a
way to see all objects I'd like to know about it, but pg_class only
shows objects in the current database or objects which are visible to
all databases), the relfilenode list might have to be broken up into one
list per database, with perhaps a separate list for global objects.
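
So the top-level layout might end up as something like the following
(again just a sketch with made-up names, reusing the Oid and
RelSyncEntry types from the sketch above):

typedef struct DatabaseSyncList
{
    Oid           dbNode;       /* owning database; 0 for the global list */
    int           nentries;
    RelSyncEntry *entries;      /* flag-per-relfilenode array, as above */
} DatabaseSyncList;

typedef struct SyncListDirectory
{
    int               ndatabases;
    DatabaseSyncList *databases;    /* one list per database... */
    DatabaseSyncList  global;       /* ...plus one for the shared catalogs */
} SyncListDirectory;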

The interesting question in that situation is how to handle object
creation and removal.  Fortunately, that should be a relatively rare
occurrence, so it presumably doesn't have to be all that efficient.


-- 
Kevin Brown                          kevin@sysexperts.com

