Thread: Sync vs. fsync during checkpoint
As some know, win32 doesn't have sync(), and some are concerned that sync() isn't reliable enough during checkpoint anyway.

The trick is to somehow record all files modified since the last checkpoint, and open/fsync/close each one. My idea is to stat() each file in each directory and compare the modification time to determine whether the file has been modified since the last checkpoint. I can't think of an easier way to efficiently collect all modified files; in this case, we let the file system keep track of it for us.

However, on XP I just tested whether files that are kept open have their modification times updated, and it seems they don't. If I do:

    while :
    do
        echo test
        sleep 5
    done > x

I see the file size grow every 5 seconds, but I don't see the modification time change. Can someone confirm this?

--
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> The trick is to somehow record all files modified since the last checkpoint, and open/fsync/close each one. My idea is to stat() each file in each directory and compare the modify time to determine if the file has been modified since the last checkpoint.

This seems a complete non-starter, as stat() generally has at best one-second resolution on mod times, even if you assume that the kernel keeps mod times fully up-to-date at all times. In any case, it's difficult to believe that stat'ing everything in a database directory will be faster than keeping track of it for ourselves.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > The trick is to somehow record all files modified since the last checkpoint, and open/fsync/close each one. My idea is to stat() each file in each directory and compare the modify time to determine if the file has been modified since the last checkpoint.
>
> This seems a complete non-starter, as stat() generally has at best one-second resolution on mod times, even if you assume that the kernel keeps mod times fully up-to-date at all times. In any case, it's difficult to believe that stat'ing everything in a database directory will be faster than keeping track of it for ourselves.

Yes, we would have to have a slop factor and fsync anything modified since one second before the last checkpoint. Any ideas on how to record the modified files without generating tons of output or locking contention?

--
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
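[For illustration, a minimal sketch of the directory scan being discussed, with the one-second slop applied: walk one directory, stat() each regular file, and flag anything whose mtime falls at or after (last checkpoint time minus one second). The function name and the caller-supplied last_checkpoint_time are invented for this example; this is not code from the tree.]

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <time.h>

    /*
     * Report regular files in "dir" whose modification time is no earlier
     * than one second before the last checkpoint; these are the
     * open/fsync/close candidates.
     */
    static void
    list_fsync_candidates(const char *dir, time_t last_checkpoint_time)
    {
        DIR        *d = opendir(dir);
        struct dirent *de;

        if (d == NULL)
            return;

        while ((de = readdir(d)) != NULL)
        {
            char        path[1024];
            struct stat st;

            if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);

            /* one-second slop to cover stat()'s coarse mtime resolution */
            if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
                st.st_mtime >= last_checkpoint_time - 1)
                printf("would open/fsync/close: %s\n", path);
        }
        closedir(d);
    }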
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Any ideas on how to record the modified files without generating tons of output or locking contention?

What I've suggested before is that the bgwriter process can keep track of all files that it's written to since the last checkpoint, and fsync them during checkpoint (this would likely require giving the checkpoint task to the bgwriter instead of launching a separate process for it, but that doesn't seem unreasonable). Obviously this requires only local storage in the bgwriter process, and hence no contention.

That leaves us still needing to account for files that are written directly by a backend process and not by the bgwriter. However, I claim that if the bgwriter is worth the cycles it's expending, cases in which a backend has to write out a page for itself will be infrequent enough that we don't need to optimize them. Therefore it would be enough to have backends immediately sync any write they have to do. (They might as well use O_SYNC.) Note that backends need not sync writes to temp files or temp tables, only genuine shared tables.

If it turns out that it's not quite *that* infrequent, a compromise position would be to keep a small list of files-needing-fsync in shared memory. Backends that have to evict pages from shared buffers add those files to the list; the bgwriter periodically removes entries from the list and fsyncs the files. Only if there is no room in the list does a backend have to fsync for itself. If the list is touched often enough that it becomes a source of contention, then the whole bgwriter concept is completely broken :-(

Now this last plan does assume that an fsync applied by process X will write pages that were dirtied by process Y through a different file descriptor for the same file. There's been some concern raised in the past about whether we can assume that. If not, though, the simpler backends-must-sync-their-own-writes plan will still work.

			regards, tom lane
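[A sketch of the division of labour described above, under these assumptions: the bgwriter keeps a purely local array of file paths it has written to since the last checkpoint and fsyncs them at checkpoint time, while a backend that must evict a page itself opens the file with O_SYNC. All function and variable names here are invented for illustration.]

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_PENDING 1024

    /* bgwriter-local bookkeeping: no shared memory, hence no contention */
    static char pending[MAX_PENDING][256];
    static int  npending = 0;

    /* bgwriter: remember a file it has just written a dirty page to */
    static void
    bgwriter_remember(const char *path)
    {
        int         i;

        for (i = 0; i < npending; i++)
            if (strcmp(pending[i], path) == 0)
                return;             /* already on the list */
        if (npending < MAX_PENDING)
            strncpy(pending[npending++], path, sizeof(pending[0]) - 1);
    }

    /* bgwriter: at checkpoint, fsync everything remembered since last time */
    static void
    bgwriter_checkpoint_fsync(void)
    {
        int         i;

        for (i = 0; i < npending; i++)
        {
            int         fd = open(pending[i], O_RDWR);

            if (fd >= 0)
            {
                fsync(fd);
                close(fd);
            }
        }
        npending = 0;
    }

    /* backend: the (hopefully rare) self-evict path writes synchronously */
    static int
    backend_open_for_evict(const char *path)
    {
        return open(path, O_RDWR | O_SYNC);
    }

[The O_SYNC path is the case the message argues should be infrequent; everything the bgwriter writes itself needs no per-write sync at all.]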
Tom Lane wrote:
> What I've suggested before is that the bgwriter process can keep track of all files that it's written to since the last checkpoint, and fsync them during checkpoint (this would likely require giving the checkpoint task to the bgwriter instead of launching a separate process for it, but that doesn't seem unreasonable). Obviously this requires only local storage in the bgwriter process, and hence no contention.
>
> That leaves us still needing to account for files that are written directly by a backend process and not by the bgwriter. However, I claim that if the bgwriter is worth the cycles it's expending, cases in which a backend has to write out a page for itself will be infrequent enough that we don't need to optimize them. Therefore it would be enough to have backends immediately sync any write they have to do. (They might as well use O_SYNC.) Note that backends need not sync writes to temp files or temp tables, only genuine shared tables.
>
> If it turns out that it's not quite *that* infrequent, a compromise position would be to keep a small list of files-needing-fsync in shared memory. Backends that have to evict pages from shared buffers add those files to the list; the bgwriter periodically removes entries from the list and fsyncs the files. Only if there is no room in the list does a backend have to fsync for itself. If the list is touched often enough that it becomes a source of contention, then the whole bgwriter concept is completely broken :-(
>
> Now this last plan does assume that an fsync applied by process X will write pages that were dirtied by process Y through a different file descriptor for the same file. There's been some concern raised in the past about whether we can assume that. If not, though, the simpler backends-must-sync-their-own-writes plan will still work.

I am concerned that the bgwriter will not be able to keep up with the I/O generated by even a single backend restoring a database, let alone a busy system. To me, the write() performed by the bgwriter, because it is I/O, will typically be the bottleneck on any system that is I/O bound (especially as the kernel buffers fill), and it will not be able to keep up with active backends now freed from writes. The idea of falling back, when the bgwriter cannot keep up, to having the backends sync the data themselves seems like it would just slow down an I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but could not come up with a clean way of having the backends communicate to the bgwriter about their own non-sync writes. We thought about using shared memory or a socket, but both seemed like choke points.

Here is my new idea. (I will keep throwing out ideas until I hit on a good one.) The bgwriter is going to have to check before every write whether the file is already recorded as needing fsync at checkpoint. My idea is to have that checking happen during the bgwriter buffer scan, rather than at write time. If we add a shared memory boolean for each buffer, backends needing to write buffers can write the buffers already recorded as safe to write by the bgwriter scanner. I don't think the bgwriter is going to be able to keep up with I/O-bound backends, but I do think it can scan and set those booleans fast enough for the backends to then perform the writes. (We might need a separate bgwriter thread or process to do this.)
As I remember, our new queue system has a list of buffers that are most likely to be replaced, so the bgwriter can scan those first and make sure they have their booleans set. One issue is that these booleans are set without locking, so backends might need to double-check them: once before the write, and again just before they replace the buffer. The bgwriter would clear the bits before the checkpoint starts.

--
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
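[A sketch of the per-buffer flag idea above, including the unlocked double-check. fsync_covered[] is imagined as a shared-memory flag array, one entry per shared buffer, meaning "the file this buffer belongs to is already recorded for fsync at the next checkpoint". All names and the helper functions are hypothetical, invented only to show the control flow.]

    /* One flag per shared buffer, imagined to live in shared memory. */
    extern volatile char fsync_covered[];

    /* Hypothetical helpers: how a buffer actually gets written / synced. */
    extern void write_buffer(int buf_id);           /* plain write() of the page */
    extern void write_buffer_synced(int buf_id);    /* write with O_SYNC, or write+fsync */
    extern void fsync_buffer_file(int buf_id);      /* fsync the buffer's underlying file */

    /*
     * Backend path for evicting a dirty buffer.  The flag is read without a
     * lock, so it is re-checked after the write; if the bgwriter cleared the
     * bits (as it would just before a checkpoint), the backend syncs for itself.
     */
    void
    backend_evict_buffer(int buf_id)
    {
        if (fsync_covered[buf_id])
        {
            write_buffer(buf_id);           /* the checkpoint fsync will cover this file */
            if (!fsync_covered[buf_id])     /* flag cleared under us: play it safe */
                fsync_buffer_file(buf_id);
        }
        else
            write_buffer_synced(buf_id);    /* not covered: backend makes it durable itself */
    }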
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> As some know, win32 doesn't have sync(), and some are concerned that sync() isn't reliable enough during checkpoint anyway.
>
> The trick is to somehow record all files modified since the last checkpoint, and open/fsync/close each one.

Note that some people believe that if you do this, it doesn't guarantee that any data written to other file descriptors referring to the same files would also get synced. I am not one of those people, however. Both Solaris and NetBSD kernel hackers have told me those OS's would work in such a scheme, and furthermore that they cannot imagine any sane VFS that would fail.

I definitely think it's better than calling sync(2), which doesn't guarantee the blocks are written by any particular time at all.

--
greg
Bruce Momjian wrote:
> Here is my new idea. (I will keep throwing out ideas until I hit on a good one.) The bgwriter is going to have to check before every write whether the file is already recorded as needing fsync at checkpoint. My idea is to have that checking happen during the bgwriter buffer scan, rather than at write time. If we add a shared memory boolean for each buffer, backends needing to write buffers can write the buffers already recorded as safe to write by the bgwriter scanner. I don't think the bgwriter is going to be able to keep up with I/O-bound backends, but I do think it can scan and set those booleans fast enough for the backends to then perform the writes. (We might need a separate bgwriter thread or process to do this.)

That seems a bit excessive. It seems to me that contention is only a problem if you keep a centralized list of files that have been written by all the backends. So don't do that.

Instead, have each backend maintain its own separate list in shared memory. The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time the bgwriter attempts to read the list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write lock on the list, copies it into its own memory space, truncates the list, and then releases the read lock. It then deletes the entries out of its own list that have entries in the backend list it just read. It then fsync()s the files that are left, under the assumption that the backends will fsync() any file they write to directly.

The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a global list. I'd conjecture that backends that touch many of the same files are not likely to be touching a large number of files per checkpoint, and those systems that touch a large number of files probably do so through a lot of independent backends.

One other thing: I don't know exactly how checkpoints are orchestrated between individual backends, but it seems clear to me that you want to do a sync() *first*, then the fsync()s. The reason is that sync() allows the OS to order the writes across all the files in the most efficient manner possible, whereas fsync() only takes care of the blocks belonging to the file in question. This won't be an option under Windows, but on Unix systems it should make a difference. On Linux it should make quite a difference, since its sync() won't return until the buffers have been flushed -- and then the following fsync()s will return almost instantaneously since their data has already been written (so there won't be any dirty blocks in those files). I suppose it's possible that on some OSes fsync()s could interfere with a running sync(), but for those OSes we can just drop back to doing only fsync()s.

As usual, I could be completely full of it. Take this for what it's worth. :-)

--
Kevin Brown	kevin@sysexperts.com
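[A sketch of the sync()-before-fsync() ordering suggested above: one global sync() lets the kernel schedule all dirty blocks, then the per-file fsync()s provide the actual guarantee and should find little left to write. The argument list is a hypothetical stand-in for whatever file list has been collected; on win32 the sync() call would simply be omitted.]

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Flush a collected list of modified files at checkpoint: a global
     * sync() first so the kernel can order all the writes, then a per-file
     * fsync() for the actual guarantee.
     */
    void
    checkpoint_flush(char **files, int nfiles)
    {
        int         i;

        sync();                         /* schedule every dirty block at once */

        for (i = 0; i < nfiles; i++)
        {
            int         fd = open(files[i], O_RDWR);

            if (fd >= 0)
            {
                fsync(fd);              /* should return quickly if sync() already flushed */
                close(fd);
            }
        }
    }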
Some Moron at sysexperts.com wrote:
> At checkpoint time, for each backend list, the bgwriter grabs a write lock on the list, copies it into its own memory space, truncates the list, and then releases the read lock.

Sigh. I meant to say that it then releases the *write* lock.

--
Kevin Brown	kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Instead, have each backend maintain its own separate list in shared memory. The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time the bgwriter attempts to read the list is at checkpoint time.

> The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a global list.

I fear that is just wishful thinking. Consider the system catalogs as a counterexample of files that are likely to be touched/modified by many different backends.

The bigger problem though with this is that it makes the problem of list overflow much worse. The hard part about shared memory management is not so much that the available space is small, as that the available space is fixed --- we can't easily change it after postmaster start. The more finely you slice your workspace, the more likely it becomes that one particular part will run out of space. So the inefficient case where a backend isn't able to insert something into the appropriate list will become considerably more of a factor.

> but it seems clear to me that you want to do a sync() *first*, then the fsync()s.

Hmm, that's an interesting thought. On a machine that's doing a lot of stuff besides running the database, a global sync would be counterproductive --- but we could easily make it configurable as to whether to issue the sync() or not. It wouldn't affect correctness.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Instead, have each backend maintain its own separate list in shared memory. The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time the bgwriter attempts to read the list is at checkpoint time.
>
> > The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a global list.
>
> I fear that is just wishful thinking. Consider the system catalogs as a counterexample of files that are likely to be touched/modified by many different backends.

Oh, I'm not arguing that there won't be a set of files touched by a lot of backends, just that the number of such files is likely to be relatively small -- a few tens of files, perhaps. But that admittedly can add up fast. But see below.

> The bigger problem though with this is that it makes the problem of list overflow much worse. The hard part about shared memory management is not so much that the available space is small, as that the available space is fixed --- we can't easily change it after postmaster start. The more finely you slice your workspace, the more likely it becomes that one particular part will run out of space. So the inefficient case where a backend isn't able to insert something into the appropriate list will become considerably more of a factor.

Well, running out of space in the list isn't that much of a problem. If the backends run out of list space (and the max size of the list could be a configurable thing, either as a percentage of shared memory or as an absolute size), then all that happens is that the background writer might end up fsync()ing some files that have already been fsync()ed. But that's not that big of a deal -- the fact they've already been fsync()ed means that there shouldn't be any data in the kernel buffers left to write to disk, so subsequent fsync()s should return quickly. How quickly depends on the individual kernel's implementation of the dirty buffer list as it relates to file descriptors.

Perhaps a better way to do it would be to store the list of all the relfilenodes of everything in pg_class, with a flag for each indicating whether or not an fsync() of the file needs to take place. When anything writes to a file without O_SYNC or a trailing fsync(), it sets the flag for the relfilenode of what it's writing. Then at checkpoint time, the bgwriter can scan the list and fsync() everything that has been flagged.

The relfilenode list should be relatively small in size: at most 16 bytes per item (and that on a 64-bit machine). A database that has 4096 file objects would have a 64K list at most. Not bad.

Because each database backend can only see the class objects associated with the database it's connected to or the global objects (if there's a way to see all objects I'd like to know about it, but pg_class only shows objects in the current database or objects which are visible to all databases), the relfilenode list might have to be broken up into one list per database, with perhaps a separate list for global objects. The interesting question in that situation is how to handle object creation and removal, which should be a relatively rare occurrence (fortunately), so it supposedly doesn't have to be all that efficient.

--
Kevin Brown	kevin@sysexperts.com
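[A sketch of the relfilenode-flag table floated above: a fixed-size array with one entry per relfilenode and a needs_fsync flag that writers set and the checkpointer would later scan and clear. The types and names are illustrative only, and the 4096-entry cap mirrors the sizing example in the message.]

    #include <stdbool.h>

    typedef unsigned int RelFileNodeId; /* illustrative stand-in for a relfilenode */

    typedef struct
    {
        RelFileNodeId relfilenode;
        bool        needs_fsync;
    } FsyncFlagEntry;

    #define MAX_RELS 4096               /* matches the ~4096-object sizing example */

    static FsyncFlagEntry fsync_flags[MAX_RELS];
    static int  nrels = 0;

    /* Called by anything that writes to a file without O_SYNC or its own fsync(). */
    void
    mark_needs_fsync(RelFileNodeId rfn)
    {
        int         i;

        for (i = 0; i < nrels; i++)
        {
            if (fsync_flags[i].relfilenode == rfn)
            {
                fsync_flags[i].needs_fsync = true;
                return;
            }
        }
        if (nrels < MAX_RELS)
        {
            fsync_flags[nrels].relfilenode = rfn;
            fsync_flags[nrels].needs_fsync = true;
            nrels++;
        }
    }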
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> The more finely you slice your workspace, the more likely it becomes that one particular part will run out of space. So the inefficient case where a backend isn't able to insert something into the appropriate list will become considerably more of a factor.

> Well, running out of space in the list isn't that much of a problem. If the backends run out of list space (and the max size of the list could be a configurable thing, either as a percentage of shared memory or as an absolute size), then all that happens is that the background writer might end up fsync()ing some files that have already been fsync()ed. But that's not that big of a deal -- the fact they've already been fsync()ed means that there shouldn't be any data in the kernel buffers left to write to disk, so subsequent fsync()s should return quickly.

Yes, it's a big deal. You're arguing as though the bgwriter is the thing that needs to be fast, when actually what we care about is the backends being fast. If the bgwriter isn't doing the vast bulk of the writing (and especially the fsync waits) then we are wasting our time having one at all. So we need a scheme that makes it as unlikely as possible that backends will have to do their own fsyncs. Small per-backend fsync lists aren't the way to do that.

> Perhaps a better way to do it would be to store the list of all the relfilenodes of everything in pg_class, with a flag for each indicating whether or not an fsync() of the file needs to take place.

You're forgetting that we have a fixed-size workspace to do this in ... and no way to know at postmaster start how many relations there are in any of our databases, let alone predict how many there might be later on.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Well, running out of space in the list isn't that much of a problem. If the backends run out of list space (and the max size of the list could be a configurable thing, either as a percentage of shared memory or as an absolute size), then all that happens is that the background writer might end up fsync()ing some files that have already been fsync()ed. But that's not that big of a deal -- the fact they've already been fsync()ed means that there shouldn't be any data in the kernel buffers left to write to disk, so subsequent fsync()s should return quickly.
>
> Yes, it's a big deal. You're arguing as though the bgwriter is the thing that needs to be fast, when actually what we care about is the backends being fast. If the bgwriter isn't doing the vast bulk of the writing (and especially the fsync waits) then we are wasting our time having one at all. So we need a scheme that makes it as unlikely as possible that backends will have to do their own fsyncs. Small per-backend fsync lists aren't the way to do that.

Ah, okay. Pardon me, I was writing on low sleep at the time.

If we want to make the backends as fast as possible, then they should defer synchronous writes to someplace else. But that someplace else could easily be a process forked by the backend in question whose sole purpose is to go through the list of files generated by its parent backend and fsync() them. The backend can then go about its business and, upon receipt of the SIGCHLD, notify anyone that needs to be notified that the fsync()s have completed. This approach on any reasonable OS will have minimal overhead because of copy-on-write page handling in the kernel and the fact that the child process isn't going to exec() or write to a bunch of memory.

The advantage is that each backend can maintain its own list in per-process memory instead of using shared memory. The disadvantage is that a given file could have multiple simultaneous (or close to simultaneous) fsync()s issued against it. As noted previously, that might not be such a big deal.

You could still build a list in shared memory of the files that backends are accessing, but it would then be a cache of sorts because it would be fixed in size. As soon as you run out of space in the shared list, you'll have to expire some entries. An expired entry simply means that multiple fsync()s might be issued for the file being referred to. But I suspect that such a list would have far too much contention, and that it would be more efficient to simply risk issuing multiple fsync()s against the same file by multiple backend children.

Another advantage to the child-of-backend-fsync()s approach is that it would cause simultaneous fsync()s to happen, and on more advanced OSes the OS itself should be able to coalesce the work to be done into a more efficient pattern of writes to the disk. That won't be possible if fsync()s are serialized by PG. It's not as good as a syscall that would allow you to fsync() a bunch of file descriptors simultaneously, but it might be close. I have no idea whether or not this approach would work in Windows.

> > Perhaps a better way to do it would be to store the list of all the relfilenodes of everything in pg_class, with a flag for each indicating whether or not an fsync() of the file needs to take place.
>
> You're forgetting that we have a fixed-size workspace to do this in ... and no way to know at postmaster start how many relations there are in any of our databases, let alone predict how many there might be later on.

Unfortunately, this is going to apply to almost any approach. The number of blocks being dealt with is not fixed, because even though the cache itself is fixed in size, the number of block writes it represents (and thus the number of files involved) is not. The list of files itself is not fixed in size, either. However, this *does* suggest another possible approach: you set up a fixed-size list and fsync() the batch when it fills up.

It sounds like we need to define the particular behavior we want first. We're optimizing for some combination of throughput and responsiveness, and those aren't necessarily the same thing. I suppose this means that the solution chosen has to have enough knobs to allow the DBA to pick where on the throughput/responsiveness curve he wants to be.

--
Kevin Brown	kevin@sysexperts.com
I wrote:
> But that someplace else could easily be a process forked by the backend in question whose sole purpose is to go through the list of files generated by its parent backend and fsync() them. The backend can then go about its business and, upon receipt of the SIGCHLD, notify anyone that needs to be notified that the fsync()s have completed.

Duh, what am I thinking? Of course, the right answer is to have the child notify anyone that needs notification that the fsync()s are done. No need for involvement of the parent (i.e., the backend in question) unless the architecture of PG requires it somehow.

--
Kevin Brown	kevin@sysexperts.com
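[A sketch of the fork-and-fsync idea from the last two messages: the backend hands its private file list to a forked child, which fsync()s everything and exits; per the correction above, the child itself would notify whoever needs to know, so the parent backend never waits. The list variables and the function name are invented for illustration.]

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Per-backend list kept in ordinary private memory (names are invented). */
    extern char **my_dirty_files;
    extern int  my_dirty_count;

    /*
     * Fork a child that fsync()s the backend's files and exits.  fork() is
     * cheap here because of copy-on-write and because the child never
     * exec()s or writes to much memory.
     */
    pid_t
    start_fsync_child(void)
    {
        pid_t       pid = fork();

        if (pid == 0)
        {
            int         i;

            for (i = 0; i < my_dirty_count; i++)
            {
                int         fd = open(my_dirty_files[i], O_RDWR);

                if (fd >= 0)
                {
                    fsync(fd);
                    close(fd);
                }
            }
            /* ... the child notifies interested parties that the fsync()s are done ... */
            _exit(0);
        }
        return pid;                     /* parent continues its work immediately */
    }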
>>>>> "Kevin" == Kevin Brown <kevin@sysexperts.com> writes: >> The bigger problem though with this is that it makes the >> problem of list overflow much worse. The hard part about >> shared memory management is not so much that the available >> space is small, as that the available space isfixed --- we >> can't easily change it after postmaster start. The more finely Again, I can suggest the shared memory MemoryContext we use in TelegraphCQ that is based on the OSSP libmm memory manager. We use it to grow and shrink shared memory at will. -- Pip-pip Sailesh http://www.cs.berkeley.edu/~sailesh