> > My goal there was to further improve group commit. When running pgbench -j10 -c10, it was common to see fsyncs that alternated between flushing 1 transaction and 9 transactions, because the first one to the gate would go through it and slam it on all the others, and it would take one fsync cycle for it to reopen.
Hmm, yeah. I remember someone (Peter Geoghegan, I think) mentioning behavior like that before, but I had not made the connection to this issue at that time. This blog post is pretty depressing:
It suggests that an fsync in progress blocks out not only other fsyncs, but other writes to the same file, which for our purposes is just awful. More Googling around reveals that this is apparently well-known to Linux kernel developers and that they don't seem excited about fixing it. :-(
I think they already did. I don't see the effect in ext4, even on a rather old kernel like 2.6.32, using the code from the link above.
<crazy-idea>I wonder if we could write WAL to two different files in alternation, so that we could be writing to one file while fsync-ing the other.</crazy-idea>
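To make the two-file idea concrete, here is a rough sketch in Python rather than anything resembling the real WAL code. The function name `write_alternating` and the file names `wal_a` and `wal_b` are made up for illustration, and it ignores everything that would make this hard in practice (record ordering across the two files at recovery time, partial writes, error handling).

```python
import os
import threading


def write_alternating(batches, dir_path):
    """Write batches alternately to two files, fsyncing one file in the
    background while the next batch is written to the other file."""
    # Hypothetical file names, chosen only for this sketch.
    paths = [os.path.join(dir_path, "wal_a"), os.path.join(dir_path, "wal_b")]
    fds = [os.open(p, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
           for p in paths]
    pending = None  # thread fsyncing the file written in the previous step
    try:
        for i, batch in enumerate(batches):
            fd = fds[i % 2]
            # This write can overlap with the fsync of the *other* file.
            os.write(fd, batch)
            if pending is not None:
                pending.join()  # the previous batch is durable from here on
            pending = threading.Thread(target=os.fsync, args=(fd,))
            pending.start()
        if pending is not None:
            pending.join()  # wait for the last batch to reach disk
    finally:
        for fd in fds:
            os.close(fd)
    return paths
```

At any moment the write and the in-flight fsync target different files, which is the whole point; whether that actually sidesteps the blocking described in the blog post would depend on the filesystem.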
I thought the most promising thing, once there were timers and sleeps with resolution much better than a centisecond, was to record the time at which each fsync finished, and then sleep until "then + commit_delay". That way you don't do any harm to the sleeper, as the write head is not positioned to process the fsync until then anyway, and you give other workers the chance to get their commit records in.
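The arithmetic behind "then + commit_delay" is simple enough to sketch. This is Python for illustration only; `delay_before_fsync` and `wait_for_commit_window` are hypothetical names, times are in seconds on a monotonic clock, and nothing here is taken from the actual backend code.

```python
import time


def delay_before_fsync(last_fsync_end, commit_delay, now):
    """Seconds to wait so that the next fsync starts no earlier than
    last_fsync_end + commit_delay; zero if that moment has already passed."""
    return max(0.0, (last_fsync_end + commit_delay) - now)


def wait_for_commit_window(last_fsync_end, commit_delay):
    """Sleep until the window opens, letting other workers queue their
    commit records in the meantime."""
    time.sleep(delay_before_fsync(last_fsync_end, commit_delay,
                                  time.monotonic()))
```

A backend arriving well after the previous fsync finished would not sleep at all, so the delay only costs something when an fsync has just completed.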
But then I kind of lost interest, because anyone who cares very much about commit performance will probably get a nonvolatile write cache, and anything done would be too hardware/platform dependent.
Of course a BBU isn't magic: the kernel still has to spend time scrubbing the buffer pool and sending the dirty buffers to the disk/controller when it gets an fsync, even if the confirmation does come back quickly. But it still seems too hardware/platform dependent to find a general purpose optimization.