Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint - Mailing list pgsql-hackers
From           | Kevin Brown
Subject        | Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
Date           |
Msg-id         | 20040207035541.GG2608@filer
In response to | Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses      | Re: [pgsql-hackers-win32] Sync vs. fsync during checkpoint
List           | pgsql-hackers
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Well, running out of space in the list isn't that much of a problem.
> > If the backends run out of list space (and the max size of the list
> > could be a configurable thing, either as a percentage of shared memory
> > or as an absolute size), then all that happens is that the background
> > writer might end up fsync()ing some files that have already been
> > fsync()ed.  But that's not that big of a deal -- the fact they've
> > already been fsync()ed means that there shouldn't be any data in the
> > kernel buffers left to write to disk, so subsequent fsync()s should
> > return quickly.
>
> Yes, it's a big deal.  You're arguing as though the bgwriter is the
> thing that needs to be fast, when actually what we care about is the
> backends being fast.  If the bgwriter isn't doing the vast bulk of the
> writing (and especially the fsync waits) then we are wasting our time
> having one at all.  So we need a scheme that makes it as unlikely as
> possible that backends will have to do their own fsyncs.  Small
> per-backend fsync lists aren't the way to do that.

Ah, okay.  Pardon me; I was writing on too little sleep at the time.

If we want to make the backends as fast as possible, then they should
defer synchronous writes to someplace else.  But that someplace else could
easily be a process forked by the backend in question, whose sole purpose
is to walk the list of files generated by its parent backend and fsync()
them.  The backend can then go about its business, and upon receipt of
SIGCHLD notify anyone waiting that the fsync()s have completed.  On any
reasonable OS this has minimal overhead, thanks to copy-on-write page
handling in the kernel and the fact that the child process isn't going to
exec() or write to much memory.
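Roughly, what I have in mind is something like the following (just a
sketch, not anything resembling real PostgreSQL code: the pending_files
list, the file names, and the reap_fsync_child() handler are all made-up
stand-ins, and the parent simply waits at the end so the SIGCHLD handshake
is visible):

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-backend list of files still needing an fsync(). */
static const char *pending_files[] = {"base/1/16384", "base/1/16385"};
static const int n_pending = 2;

static volatile sig_atomic_t fsync_done = 0;

/* SIGCHLD handler: reap the child and remember that the batch finished. */
static void
reap_fsync_child(int signo)
{
    int         save_errno = errno;

    (void) signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        fsync_done = 1;
    errno = save_errno;
}

int
main(void)
{
    struct sigaction sa;
    sigset_t    chld,
                oldmask;
    pid_t       pid;

    sa.sa_handler = reap_fsync_child;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGCHLD, &sa, NULL);

    /* Block SIGCHLD so the "done" check below is race-free. */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    sigprocmask(SIG_BLOCK, &chld, &oldmask);

    pid = fork();
    if (pid == 0)
    {
        /* Child: sees a copy-on-write copy of the parent's list. */
        int         i;

        for (i = 0; i < n_pending; i++)
        {
            int         fd = open(pending_files[i], O_RDWR);

            if (fd < 0)
                continue;       /* file may have been dropped; not fatal */
            if (fsync(fd) != 0)
                _exit(1);       /* report failure through the exit status */
            close(fd);
        }
        _exit(0);
    }
    else if (pid < 0)
    {
        perror("fork");
        return 1;
    }

    /*
     * Parent: in the real scheme the backend would go about its business
     * here.  For the sketch, just wait until SIGCHLD says the batch is done.
     */
    while (!fsync_done)
        sigsuspend(&oldmask);
    sigprocmask(SIG_SETMASK, &oldmask, NULL);

    printf("fsync batch complete\n");
    return 0;
}

The only state the child needs is the parent's in-memory list, and fork()'s
copy-on-write semantics hand it that for free; in the real thing the parent
wouldn't block, it would just field the SIGCHLD whenever it arrives.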
One advantage of this approach is that each backend can maintain its own
list in per-process memory instead of in shared memory.  The disadvantage
is that a given file could have multiple simultaneous (or nearly
simultaneous) fsync()s issued against it; as noted previously, that may not
be such a big deal.

You could still build a shared-memory list of the files the backends are
touching, but it would then be a cache of sorts, because it would be fixed
in size.  As soon as the shared list fills up, you have to expire some
entries, and an expired entry simply means that multiple fsync()s might be
issued for the file it referred to.  But I suspect such a list would have
far too much contention, and that it would be more efficient to simply risk
issuing multiple fsync()s against the same file from multiple backend
children.

Another advantage of the child-of-backend-fsync() approach is that the
fsync()s happen concurrently, so on more advanced OSes the kernel should be
able to coalesce the outstanding work into a more efficient pattern of
writes to the disk.  That isn't possible if the fsync()s are serialized by
PG.  It's not as good as a syscall that could fsync() a whole set of file
descriptors at once, but it might be close.  I have no idea whether this
approach would work on Windows.

> > Perhaps a better way to do it would be to store the list of all the
> > relfilenodes of everything in pg_class, with a flag for each indicating
> > whether or not an fsync() of the file needs to take place.
>
> You're forgetting that we have a fixed-size workspace to do this in ...
> and no way to know at postmaster start how many relations there are in
> any of our databases, let alone predict how many there might be later on.

Unfortunately, that limitation applies to almost any approach.  The number
of blocks being dealt with is not fixed: even though the cache itself is
fixed in size, the number of block writes it represents (and thus the
number of files involved) is not.  The list of files itself is not fixed in
size, either.  However, this *does* suggest another possible approach: set
up a fixed-size list and fsync() the whole batch whenever the list fills up
(see the sketch at the end of this message).

It sounds like we need to define the behavior we want before picking a
mechanism.  We're optimizing for some combination of throughput and
responsiveness, and those aren't necessarily the same thing.  I suppose
that means whatever solution is chosen needs enough knobs to let the DBA
pick where on the throughput/responsiveness curve he wants to be.
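The fixed-size-list idea might look something like this (again only a
sketch: FSYNC_LIST_SIZE, remember_fsync(), and flush_fsync_list() are
invented names, the list here holds file descriptors rather than
relfilenodes, and real code would have to check the fsync() return values):

#include <fcntl.h>
#include <unistd.h>

#define FSYNC_LIST_SIZE 64      /* fixed workspace; could be a tunable knob */

static int  pending_fds[FSYNC_LIST_SIZE];
static int  n_pending = 0;

/* fsync() everything currently in the list, then empty it. */
static void
flush_fsync_list(void)
{
    int         i;

    for (i = 0; i < n_pending; i++)
        (void) fsync(pending_fds[i]);   /* real code must check for errors */
    n_pending = 0;
}

/*
 * Note that 'fd' needs an fsync().  If the fixed-size list is already
 * full, flush the whole batch first; the worst case is only that some
 * files get fsync()ed more often than strictly necessary.
 */
static void
remember_fsync(int fd)
{
    int         i;

    /* Skip duplicates already queued in this batch. */
    for (i = 0; i < n_pending; i++)
        if (pending_fds[i] == fd)
            return;

    if (n_pending == FSYNC_LIST_SIZE)
        flush_fsync_list();

    pending_fds[n_pending++] = fd;
}

int
main(void)
{
    /* Toy usage: dirty a file, queue it, then flush (e.g. at checkpoint). */
    int         fd = open("/tmp/fsync_batch_demo", O_CREAT | O_RDWR, 0600);

    if (fd >= 0)
    {
        (void) write(fd, "x", 1);
        remember_fsync(fd);
        flush_fsync_list();
        close(fd);
    }
    return 0;
}

The list size then becomes one of the knobs: a larger batch gives the
kernel more writes to coalesce, while a smaller one bounds how long any
single flush can stall the caller.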
--
Kevin Brown                                             kevin@sysexperts.com