Thread: simplify register_dirty_segment()
The basic idea is to change register_dirty_segment() to
register_opened_segment(). That is, we don't care whether a segment is
dirty or not: if someone opened it, we will fsync it at checkpoint time.

Currently, register_dirty_segment() is called in mdextend(), mdwrite()
and mdtruncate(). This is costly, since ForwardFsyncRequest() has to
grab the BgWriterCommLock lock exclusively each time, and mdwrite() is
quite frequent.

Benefits:
+ reduce BgWriterCommLock lock contention;
+ simplify the code - we just need to register_opened_segment() when we
  open the segment;
+ reduce the BgWriterShmem->requests[] size;

Costs:
+ we have to fsync() a file even if we made no modification to it. The
  cost is just an open/close of the file, so I think this is acceptable;

Corner case:
+ what if we run out of shared memory for ForwardFsyncRequest()? In the
  original scheme we just fsync() the file ourselves; now we can't do
  that. Instead, we issue a checkpoint request to the bgwriter (letting
  it absorb the pending requests) and wait, then try
  ForwardFsyncRequest() again (a sketch of this follows below).

Comments?

Regards,
Qingqing
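To make the corner-case handling concrete, here is a minimal sketch of
what register_opened_segment() might look like in md.c. It assumes the
same arguments and helper calls as the existing register_dirty_segment()
path; RequestCheckpointAndWait() is a hypothetical placeholder for
whatever mechanism would signal the bgwriter and wait for it.

    /*
     * Sketch only: proposed replacement for register_dirty_segment(),
     * called once when a segment is opened rather than on every write.
     * Signatures of RememberFsyncRequest()/ForwardFsyncRequest() are
     * assumed to match the existing code; RequestCheckpointAndWait()
     * is a placeholder, not an existing function.
     */
    static void
    register_opened_segment(SMgrRelation reln, MdfdVec *seg)
    {
        if (pendingOpsTable)
        {
            /* we are the bgwriter itself: record the request locally */
            RememberFsyncRequest(reln->smgr_rnode, seg->mdfd_segno);
            return;
        }

        /* regular backend: hand the request to the bgwriter */
        while (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
        {
            /*
             * The shared request queue is full.  We can no longer fall
             * back to fsync'ing the file ourselves (the segment may stay
             * open across future checkpoints), so ask the bgwriter for a
             * checkpoint to absorb the queued requests, then retry.
             */
            RequestCheckpointAndWait();
        }
    }

The existing call sites in mdextend(), mdwrite() and mdtruncate() would
then presumably collapse into a single call wherever a segment is first
opened.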
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > That is, we don't care if a segment is dirty or not, if someone opened it, > then we will fsync it at checkpoint time. On platforms that I'm familiar with, an fsync call causes the kernel to spend a significant amount of time groveling through its buffers to see if any are dirty. We shouldn't incur that cost to buy marginal speedups at the application level. (In other words, "it's only an open/close" is wrong.) Also, it's not clear to me how this idea works at all, if a backend holds a relation open across more than one checkpoint. What will re-register the segment for the next cycle? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes > On platforms that I'm familiar with, an fsync call causes the kernel > to spend a significant amount of time groveling through its buffers > to see if any are dirty. We shouldn't incur that cost to buy marginal > speedups at the application level. (In other words, "it's only an > open/close" is wrong.) > I did some tests in SunOS, Linux and windows. Basically, I create 100 files, close them. Reopen them, write(dirty)/read(clean) 8192*100 bytes each, then fsync() them. I mesured the fsync() time. SunOS 5.8 + NFS + SCSI Fsync dirty files: duration: 2404.573 ms Fsync clean files: duration: 598.037 ms Linux 2.4 + Ext3 + IDE Fsync dirty files: duration: 6951.793 ms Fsync clean files: duration: 18.132 ms Window2000 + NTFS + IDE Fsync dirty files: duration: 3005.000 ms Fsync clean files: duration: 1101.000 ms I can't figure out why it tooks so long time in windows and SunOS for clean files - a possible reason is that they have to fsync some inode information like last access time even for clean files. Linux is quite smart in this sense. > Also, it's not clear to me how this idea works at all, if a backend holds > a relation open across more than one checkpoint. What will re-register > the segment for the next cycle? > You are right. A possible (but not clean) solution is like this: The bgwriter maintain a refcount for each file. When the file is open, refcount++, when the file is closing, refcount--. When the refcount goes to zero, Bgwriter could safely remove it from its PendingOpsTable after checkpoint. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > I can't figure out why it tooks so long time in windows and SunOS for clean > files - I told you why: they don't maintain bookkeeping information that allows them to quickly identify dirty buffers belonging to a particular file. Linux does ... but I'm not sure that makes it "smarter", since that bookkeeping has a distributed cost, and the cost might or might not be repaid in any particular system workload. It would be a reasonable bet for a kernel designer to assume that fsync() is generally going to have to wait for some I/O and so a bit of CPU overhead isn't really going to matter. > You are right. A possible (but not clean) solution is like this: The > bgwriter maintain a refcount for each file. When the file is open, > refcount++, when the file is closing, refcount--. When the refcount goes to > zero, Bgwriter could safely remove it from its PendingOpsTable after > checkpoint. Adjusting such a global refcount would require global locks, which is just what you were hoping to avoid :-( regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes > It would be a reasonable > bet for a kernel designer to assume that fsync() is generally going to > have to wait for some I/O and so a bit of CPU overhead isn't really > going to matter. Reasonable. > > Adjusting such a global refcount would require global locks, which is > just what you were hoping to avoid :-( I don't want to avoid the global locks but to alleviate it :-( Think the frequency of open()/close() will be much less than write(). Also the shmem space required. On further thought, I agree that this is unneccessary if for BgWriterCommLock reason - because currently BufMgrLock doesn't bother us too much, which is more intensively used, this lock is just nothing. Regards, Qingqing