Thread: simplify register_dirty_segment()

simplify register_dirty_segment()

From: "Qingqing Zhou"
The basic idea is to change register_dirty_segment() to
register_opened_segment().

That is, we don't care if a segment is dirty or not: if someone opened it,
then we will fsync it at checkpoint time. Currently,
register_dirty_segment() is called in mdextend(), mdwrite() and
mdtruncate(). This is costly, since ForwardFsyncRequest() has to grab the
BgWriterCommLock lock exclusively each time, and mdwrite() is quite frequent.
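In rough pseudocode (a simplified sketch - the real md.c code has more
error handling, and the bgwriter itself records requests in a local
pending-ops table instead of forwarding them):

/* today (sketch): called from every mdextend()/mdwrite()/mdtruncate() */
static void
register_dirty_segment(SMgrRelation reln, MdfdVec *seg)
{
    /* ForwardFsyncRequest() grabs BgWriterCommLock exclusively */
    if (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
        FileSync(seg->mdfd_vfd);    /* queue full: fsync it ourselves */
}

/* proposed (sketch): called only once, where the segment is first opened */
static void
register_opened_segment(SMgrRelation reln, MdfdVec *seg)
{
    ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno);
    /* queue-full handling: see "Corner case" below */
}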

Benefits:
+ reduce BgWriterCommLock lock contention;
+ simplify code - we just need to call register_opened_segment() when we
open the segment;
+ reduce the BgWriterShmem->requests[] size;

Costs:
+ have to fsync() a file even if we made no modifications to it. The cost
is just opening and closing the file, so I think this is acceptable;

Corner case:
+ what if we run out of shared memory for ForwardFsyncRequest()? In the
original way, we just fsync() the file ourselves; now we can't do that.
Instead, we will issue a checkpoint request to the bgwriter and wait for it
(letting it absorb the queued requests), then try ForwardFsyncRequest()
again - see the sketch below.
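Something like this (a sketch; it assumes the existing
RequestCheckpoint(waitforit, warnontime) signal to the bgwriter, and the
retry loop is the new part):

    /*
     * Queue full: we can't just FileSync() the file ourselves anymore,
     * because the segment must stay registered for future checkpoints.
     * Ask the bgwriter for a checkpoint (it absorbs the request queue),
     * then retry.
     */
    while (!ForwardFsyncRequest(reln->smgr_rnode, seg->mdfd_segno))
        RequestCheckpoint(true, false);     /* true = wait for it */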


Comments?

Regards,
Qingqing




Re: simplify register_dirty_segment()

From: Tom Lane
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> That is, we don't care if a segment is dirty or not: if someone opened it,
> then we will fsync it at checkpoint time.

On platforms that I'm familiar with, an fsync call causes the kernel
to spend a significant amount of time groveling through its buffers
to see if any are dirty.  We shouldn't incur that cost to buy marginal
speedups at the application level.  (In other words, "it's only an
open/close" is wrong.)

Also, it's not clear to me how this idea works at all, if a backend holds
a relation open across more than one checkpoint.  What will re-register
the segment for the next cycle?
        regards, tom lane


Re: simplify register_dirty_segment()

From: "Qingqing Zhou"
"Tom Lane" <tgl@sss.pgh.pa.us> writes
> On platforms that I'm familiar with, an fsync call causes the kernel
> to spend a significant amount of time groveling through its buffers
> to see if any are dirty.  We shouldn't incur that cost to buy marginal
> speedups at the application level.  (In other words, "it's only an
> open/close" is wrong.)
>

I did some tests on SunOS, Linux and Windows. Basically, I create 100 files
and close them, then reopen them, write (dirty) or read (clean) 8192*100
bytes in each, and fsync() them. I measured the fsync() time.
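The test program was roughly like this (a simplified sketch; error checking
is omitted, and the Windows run of course used the equivalent Win32 calls):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define NFILES  100
#define NBLOCKS 100                     /* 8192*100 bytes per file */

int
main(int argc, char **argv)
{
    int     dirty = (argc > 1 && strcmp(argv[1], "dirty") == 0);
    char    buf[8192];
    int     fds[NFILES];
    struct timeval t0, t1;
    int     i, j;

    memset(buf, 'x', sizeof(buf));

    /* files already exist from a previous "dirty" run; reopen them */
    for (i = 0; i < NFILES; i++)
    {
        char    name[32];

        snprintf(name, sizeof(name), "fsync_test.%d", i);
        fds[i] = open(name, O_RDWR | O_CREAT, 0600);
        for (j = 0; j < NBLOCKS; j++)
        {
            if (dirty)
                write(fds[i], buf, sizeof(buf));    /* dirties the pages */
            else
                read(fds[i], buf, sizeof(buf));     /* leaves them clean */
        }
    }

    gettimeofday(&t0, NULL);
    for (i = 0; i < NFILES; i++)
        fsync(fds[i]);
    gettimeofday(&t1, NULL);

    printf("Fsync %s files: duration: %.3f ms\n",
           dirty ? "dirty" : "clean",
           (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0);
    return 0;
}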

Platform                    Fsync dirty files   Fsync clean files
SunOS 5.8 + NFS + SCSI      2404.573 ms          598.037 ms
Linux 2.4 + Ext3 + IDE      6951.793 ms           18.132 ms
Windows 2000 + NTFS + IDE   3005.000 ms         1101.000 ms

I can't figure out why it took so long on Windows and SunOS for clean
files - a possible reason is that they have to fsync some inode
information, like the last access time, even for clean files. Linux is
quite smart in this sense.

> Also, it's not clear to me how this idea works at all, if a backend holds
> a relation open across more than one checkpoint.  What will re-register
> the segment for the next cycle?
>

You are right. A possible (but not clean) solution is like this: the
bgwriter maintains a refcount for each file. When the file is opened,
refcount++; when the file is closed, refcount--. When the refcount goes to
zero, the bgwriter can safely remove the file from its PendingOpsTable
after the next checkpoint.
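Roughly (just a sketch of the data structure; the request-forwarding
details are hand-waved):

/* sketch: a refcounted entry in the bgwriter's PendingOpsTable */
typedef struct
{
    RelFileNode rnode;      /* hash key: which relation */
    BlockNumber segno;      /* which segment of it */
    int         refcount;   /* backends currently holding it open */
} PendingOperationEntry;

/*
 * open:       forward an OPEN request  -> bgwriter does entry->refcount++
 * close:      forward a CLOSE request  -> bgwriter does entry->refcount--
 * checkpoint: fsync every entry, then drop those whose refcount == 0
 */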

Regards,
Qingqing




Re: simplify register_dirty_segment()

From: Tom Lane
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> I can't figure out why it took so long on Windows and SunOS for clean
> files -

I told you why: they don't maintain bookkeeping information that allows
them to quickly identify dirty buffers belonging to a particular file.
Linux does ... but I'm not sure that makes it "smarter", since that
bookkeeping has a distributed cost, and the cost might or might not
be repaid in any particular system workload.  It would be a reasonable
bet for a kernel designer to assume that fsync() is generally going to
have to wait for some I/O and so a bit of CPU overhead isn't really
going to matter.

> You are right. A possible (but not clean) solution is like this: the
> bgwriter maintains a refcount for each file. When the file is opened,
> refcount++; when the file is closed, refcount--. When the refcount goes to
> zero, the bgwriter can safely remove the file from its PendingOpsTable
> after the next checkpoint.

Adjusting such a global refcount would require global locks, which is
just what you were hoping to avoid :-(
        regards, tom lane


Re: simplify register_dirty_segment()

From: "Qingqing Zhou"
"Tom Lane" <tgl@sss.pgh.pa.us> writes
> It would be a reasonable
> bet for a kernel designer to assume that fsync() is generally going to
> have to wait for some I/O and so a bit of CPU overhead isn't really
> going to matter.

Reasonable.

>
> Adjusting such a global refcount would require global locks, which is
> just what you were hoping to avoid :-(

I wasn't trying to avoid the global lock, only to alleviate it :-( The
frequency of open()/close() should be much lower than that of write(), and
the shmem space required would shrink as well. On further thought, though, I
agree this is unnecessary as far as BgWriterCommLock is concerned: even
BufMgrLock, which is used much more intensively, doesn't bother us too much
currently, so this lock is just nothing.

Regards,
Qingqing