Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers
From: Kevin Brown
Subject: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Msg-id: 20040203103605.GC2608@filer
In response to: Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
Bruce Momjian wrote:
> Here is my new idea.  (I will keep throwing out ideas until I hit on a
> good one.)  The bgwriter is going to have to check before every write to
> determine if the file is already recorded as needing fsync during
> checkpoint.  My idea is to have that checking happen during the bgwriter
> buffer scan, rather than at write time.  If we add a shared memory
> boolean for each buffer, backends needing to write buffers can write
> buffers already recorded as safe to write by the bgwriter scanner.  I
> don't think the bgwriter is going to be able to keep up with I/O bound
> backends, but I do think it can scan and set those booleans fast enough
> for the backends to then perform the writes.  (We might need a separate
> bgwriter thread or process to do this.)

That seems a bit excessive.  It seems to me that contention is only a problem if you keep a centralized list of files that have been written by all the backends.  So don't do that.  Instead, have each backend maintain its own separate list in shared memory.  The only readers of a given list would be the backend it belongs to and the bgwriter, and the only time the bgwriter attempts to read the list is at checkpoint time.

At checkpoint time, for each backend list, the bgwriter grabs a write lock on the list, copies it into its own memory space, truncates the list, and then releases the lock.  It then deletes the entries out of its own list that have entries in the backend list it just read.  It then fsync()s the files that are left, under the assumption that the backends will fsync() any file they write to directly.

The sum total size of all the lists shouldn't be that much larger than it would be if you maintained it as a single global list.  I'd conjecture that backends that touch many of the same files are not likely to be touching a large number of files per checkpoint, and those systems that touch a large number of files probably do so through a lot of independent backends.
One other thing: I don't know exactly how checkpoints are orchestrated between individual backends, but it seems clear to me that you want to do a sync() *first*, then the fsync()s.  The reason is that sync() allows the OS to order the writes across all the files in the most efficient manner possible, whereas fsync() only takes care of the blocks belonging to the file in question.  This won't be an option under Windows, but on Unix systems it should make a difference.  On Linux it should make quite a difference, since its sync() won't return until the buffers have been flushed -- and then the following fsync()s will return almost instantaneously, since their data has already been written (so there won't be any dirty blocks left in those files).

I suppose it's possible that on some OSes fsync()s could interfere with a running sync(), but for those OSes we can just drop back to doing only fsync()s.

As usual, I could be completely full of it.  Take this for what it's worth. :-)

--
Kevin Brown kevin@sysexperts.com