Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers
From: Kevin Brown
Subject: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Msg-id: 20040204213634.GF2608@filer
In response to: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Instead, have each backend maintain its own separate list in shared
> > memory.  The only readers of a given list would be the backend it
> > belongs to and the bgwriter, and the only time bgwriter attempts to
> > read the list is at checkpoint time.
>
> > The sum total size of all the lists shouldn't be that much larger
> > than it would be if you maintained it as a global list.
>
> I fear that is just wishful thinking.  Consider the system catalogs as
> a counterexample of files that are likely to be touched/modified by
> many different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot
of backends, just that the number of such files is likely to be
relatively small -- a few tens of files, perhaps.  But that admittedly
can add up fast.  But see below.

> The bigger problem though with this is that it makes the problem of
> list overflow much worse.  The hard part about shared memory management
> is not so much that the available space is small, as that the available
> space is fixed --- we can't easily change it after postmaster start.
> The more finely you slice your workspace, the more likely it becomes
> that one particular part will run out of space.  So the inefficient
> case where a backend isn't able to insert something into the
> appropriate list will become considerably more of a factor.

Running out of space in the list isn't that much of a problem.  If the
backends run out of list space (and the maximum size of the list could
be configurable, either as a percentage of shared memory or as an
absolute size), then all that happens is that the background writer may
end up fsync()ing some files that have already been fsync()ed.  But
that's not a big deal: the fact that they've already been fsync()ed
means there should be no dirty data left in the kernel buffers to write
to disk, so subsequent fsync()s should return quickly.
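The overflow behavior described above can be sketched as follows. This is a minimal illustration with hypothetical names (`PendingFsyncList`, `pending_add`, `pending_checkpoint` are not PostgreSQL identifiers), with real shared-memory allocation, locking, and file descriptors omitted: when a backend's fixed-size list fills up, it sets an overflow flag, and at checkpoint time the bgwriter conservatively counts every file as needing an fsync(), accepting that some of those fsync()s will be redundant and therefore cheap.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PENDING_MAX 8          /* fixed capacity, set at postmaster start */

/* Hypothetical per-backend list of files needing fsync() at checkpoint. */
typedef struct PendingFsyncList
{
    int   files[PENDING_MAX];  /* file identifiers (e.g. relfilenodes) */
    int   count;
    bool  overflowed;          /* set when an insert did not fit */
} PendingFsyncList;

/* Record that a file was written without an immediate fsync(). */
static void
pending_add(PendingFsyncList *list, int file)
{
    for (int i = 0; i < list->count; i++)
        if (list->files[i] == file)
            return;             /* already queued; nothing to do */

    if (list->count < PENDING_MAX)
        list->files[list->count++] = file;
    else
        list->overflowed = true;  /* bgwriter must now fsync conservatively */
}

/*
 * Checkpoint-time scan by the bgwriter.  Returns the number of files it
 * would fsync().  On overflow it must assume any file may be dirty, so it
 * fsync()s them all; the already-clean ones should return quickly.
 */
static int
pending_checkpoint(PendingFsyncList *list, int total_files)
{
    int nsync = list->overflowed ? total_files : list->count;

    list->count = 0;
    list->overflowed = false;
    return nsync;
}
```

The design choice here is that overflow degrades performance (extra, mostly no-op fsync() calls) rather than correctness, which is what makes a fixed-size shared-memory list tolerable.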
How quickly depends on the individual kernel's implementation of the
dirty buffer list as it relates to file descriptors.

Perhaps a better way to do it would be to store the list of all the
relfilenodes of everything in pg_class, with a flag for each indicating
whether or not an fsync() of the file needs to take place.  When
anything writes to a file without O_SYNC or a trailing fsync(), it sets
the flag for the relfilenode of what it's writing.  Then at checkpoint
time, the bgwriter can scan the list and fsync() everything that has
been flagged.

The relfilenode list should be relatively small: at most 16 bytes per
item (and that on a 64-bit machine).  A database that has 4096 file
objects would have a 64K list at most.  Not bad.

Because each database backend can only see the class objects associated
with the database it's connected to, plus the global objects (if
there's a way to see all objects I'd like to know about it, but
pg_class only shows objects in the current database or objects which
are visible to all databases), the relfilenode list might have to be
broken up into one list per database, with perhaps a separate list for
global objects.

The interesting question in that situation is how to handle object
creation and removal.  Fortunately, that should be a relatively rare
occurrence, so it shouldn't have to be especially efficient.

--
Kevin Brown					kevin@sysexperts.com
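The relfilenode-table scheme above can be sketched like this. Again the names (`RelFsyncEntry`, `mark_needs_fsync`, `checkpoint_scan`) are hypothetical, a single-database table is assumed, and the actual fsync() call is elided: writers set a per-relfilenode flag, and the bgwriter's checkpoint scan flushes and clears exactly the flagged entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RELATIONS 4096     /* one slot per relfilenode in pg_class */

/* Hypothetical per-database table: one entry per relfilenode. */
typedef struct RelFsyncEntry
{
    unsigned relfilenode;
    bool     needs_fsync;      /* set by any backend that writes the file
                                * without O_SYNC or a trailing fsync() */
} RelFsyncEntry;

static RelFsyncEntry rel_table[MAX_RELATIONS];
static size_t rel_count;

/* Called by a backend after a write that was not synchronously flushed. */
static void
mark_needs_fsync(unsigned relfilenode)
{
    for (size_t i = 0; i < rel_count; i++)
        if (rel_table[i].relfilenode == relfilenode)
        {
            rel_table[i].needs_fsync = true;
            return;
        }

    /* first write to this relation: add a new entry */
    rel_table[rel_count].relfilenode = relfilenode;
    rel_table[rel_count].needs_fsync = true;
    rel_count++;
}

/*
 * Checkpoint scan by the bgwriter: fsync() every flagged relation and
 * clear its flag.  Returns how many files were flushed.
 */
static size_t
checkpoint_scan(void)
{
    size_t flushed = 0;

    for (size_t i = 0; i < rel_count; i++)
        if (rel_table[i].needs_fsync)
        {
            /* real code would open the relation's file and fsync() it */
            rel_table[i].needs_fsync = false;
            flushed++;
        }
    return flushed;
}
```

Note that repeated writes to the same relation between checkpoints cost nothing beyond setting an already-set flag, which is what keeps the table's size bounded by the number of relations rather than the number of writes.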