Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint |
Date | |
Msg-id | 200402011333.i11DXmp29214@candle.pha.pa.us |
In response to | Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
|
List | pgsql-hackers |
Tom Lane wrote:
> What I've suggested before is that the bgwriter process can keep track
> of all files that it's written to since the last checkpoint, and fsync
> them during checkpoint (this would likely require giving the checkpoint
> task to the bgwriter instead of launching a separate process for it,
> but that doesn't seem unreasonable). Obviously this requires only local
> storage in the bgwriter process, and hence no contention.
>
> That leaves us still needing to account for files that are written
> directly by a backend process and not by the bgwriter. However, I claim
> that if the bgwriter is worth the cycles it's expending, cases in which
> a backend has to write out a page for itself will be infrequent enough
> that we don't need to optimize them. Therefore it would be enough to
> have backends immediately sync any write they have to do. (They might
> as well use O_SYNC.) Note that backends need not sync writes to temp
> files or temp tables, only genuine shared tables.
>
> If it turns out that it's not quite *that* infrequent, a compromise
> position would be to keep a small list of files-needing-fsync in shared
> memory. Backends that have to evict pages from shared buffers add those
> files to the list; the bgwriter periodically removes entries from the
> list and fsyncs the files. Only if there is no room in the list does a
> backend have to fsync for itself. If the list is touched often enough
> that it becomes a source of contention, then the whole bgwriter concept
> is completely broken :-(
>
> Now this last plan does assume that an fsync applied by process X will
> write pages that were dirtied by process Y through a different file
> descriptor for the same file. There's been some concern raised in the
> past about whether we can assume that. If not, though, the simpler
> backends-must-sync-their-own-writes plan will still work.
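The compromise in the quoted text (a small list of files-needing-fsync that backends add to and the bgwriter drains) could be sketched roughly as below. This is a single-process illustration only: the names PendingFsyncList, pending_add and pending_drain, the integer file identifiers, and the capacity of 4 are all assumptions made for the example, not PostgreSQL code. A real shared-memory version would need a lock or atomic operations around the list, which is exactly where the contention concern comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PENDING_MAX 4   /* hypothetical capacity; "a small list" */

/* Hypothetical stand-in for a list kept in shared memory. */
typedef struct {
    int    files[PENDING_MAX];  /* hypothetical file identifiers */
    size_t count;               /* number of valid entries */
} PendingFsyncList;

/*
 * Backend side: ask the bgwriter to fsync file_id later.
 * Returns true if the file is on the list (newly added or already there).
 * Returns false if the list is full, in which case the caller has to
 * fsync the file itself.
 */
static bool pending_add(PendingFsyncList *list, int file_id)
{
    assert(list != NULL);
    for (size_t i = 0; i < list->count; i++)
        if (list->files[i] == file_id)
            return true;            /* already queued, nothing to do */
    if (list->count == PENDING_MAX)
        return false;               /* no room: backend syncs for itself */
    list->files[list->count++] = file_id;
    return true;
}

/*
 * bgwriter side: copy all queued files into out[] (which must have room
 * for PENDING_MAX entries), empty the list, and return how many were
 * copied. The caller would then fsync each of them.
 */
static size_t pending_drain(PendingFsyncList *list, int *out)
{
    assert(list != NULL && out != NULL);
    size_t n = list->count;
    for (size_t i = 0; i < n; i++)
        out[i] = list->files[i];
    list->count = 0;
    return n;
}
```

A backend would call pending_add() after evicting a page and fall back to its own fsync only when it returns false; the bgwriter would call pending_drain() periodically and fsync whatever it gets back.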
I am concerned that the bgwriter will not be able to keep up with the
I/O generated by even a single backend restoring a database, let alone
a busy system. To me, the write() performed by the bgwriter, because it
is I/O, will typically be the bottleneck on any system that is I/O
bound (especially as the kernel buffers fill), and the bgwriter will
not be able to keep up with active backends now freed from writes. The
proposed fallback when the bgwriter cannot keep up is to have the
backends sync the data themselves, which seems like it would just slow
down an I/O-bound system further.

I talked to Magnus about this, and we considered various ideas, but
could not come up with a clean way of having the backends communicate
to the bgwriter about their own non-sync writes. We had the ideas of
using shared memory or a socket, but these seemed like choke-points.

Here is my new idea. (I will keep throwing out ideas until I hit on a
good one.) The bgwriter is going to have to check before every write to
determine if the file is already recorded as needing fsync during
checkpoint. My idea is to have that checking happen during the bgwriter
buffer scan, rather than at write time. If we add a shared memory
boolean for each buffer, backends needing to write buffers can write
buffers already recorded as safe to write by the bgwriter scanner. I
don't think the bgwriter is going to be able to keep up with I/O-bound
backends, but I do think it can scan and set those booleans fast enough
for the backends to then perform the writes. (We might need a separate
bgwriter thread to do this or a separate process.)

As I remember, our new queue system has a list of buffers that are most
likely to be replaced, so the bgwriter can scan those first and make
sure they have their booleans set.

There is an issue that these booleans are set without locking, so there
might need to be a double-check of them by backends: first before the
write, then again afterward, just before they replace the buffer.
The bgwriter would clear the bits before the checkpoint starts.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
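The per-buffer boolean idea in the message above could be sketched roughly as follows. It is a single-threaded simulation under stated assumptions: the Pool struct, the function names, the pool size of 8, and the write callbacks are all hypothetical, and the rule that a backend must sync its own write when either check fails is one reading of the proposal, not something the message spells out. Plain bool is used for brevity; flags really shared between processes without locking would need atomic accesses or memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUFFERS 8   /* hypothetical buffer pool size */

/* Hypothetical state; the flags would live in shared memory. */
typedef struct {
    int    file_id[NBUFFERS];        /* file each buffer belongs to */
    bool   fsync_recorded[NBUFFERS]; /* set by the bgwriter scan */
    int    pending[NBUFFERS];        /* bgwriter-local: files to fsync */
    size_t npending;                 /* entries used in pending[] */
} Pool;

typedef enum { COVERED_BY_CHECKPOINT, MUST_SYNC_SELF } EvictResult;
typedef void (*WriteFn)(Pool *p, size_t buf);

/* bgwriter scan: record each candidate's file for the checkpoint fsync,
 * then mark the buffer as safe for a backend to write without syncing. */
static void bgwriter_scan(Pool *p, const size_t *candidates, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t buf = candidates[i];
        assert(buf < NBUFFERS);
        bool known = false;
        for (size_t j = 0; j < p->npending; j++)
            if (p->pending[j] == p->file_id[buf])
                known = true;
        if (!known)
            p->pending[p->npending++] = p->file_id[buf];
        p->fsync_recorded[buf] = true;   /* set only after recording */
    }
}

/* bgwriter clears every flag before a checkpoint starts. */
static void bgwriter_clear_flags(Pool *p)
{
    for (size_t i = 0; i < NBUFFERS; i++)
        p->fsync_recorded[i] = false;
}

/* Backend eviction with the double-check: once before the write and
 * once after it, just before the buffer would be replaced. */
static EvictResult backend_evict(Pool *p, size_t buf, WriteFn write_page)
{
    assert(buf < NBUFFERS);
    bool before = p->fsync_recorded[buf];   /* first check */
    write_page(p, buf);
    bool after = p->fsync_recorded[buf];    /* second check */
    return (before && after) ? COVERED_BY_CHECKPOINT : MUST_SYNC_SELF;
}

/* Stand-in for write(): does nothing in this simulation. */
static void plain_write(Pool *p, size_t buf)
{
    (void)p;
    (void)buf;
}

/* Stand-in that simulates a checkpoint starting during the write. */
static void write_during_checkpoint_start(Pool *p, size_t buf)
{
    (void)buf;
    bgwriter_clear_flags(p);
}
```

The second check is what catches the case where the bits are cleared between the first check and the write completing; without it the backend would wrongly assume the checkpoint's fsync covers its write.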