Thread: WALWriteLock contention
WALWriteLock contention is measurable on some workloads. In studying
the problem briefly, a couple of questions emerged:

1. Doesn't it suck to rewrite an entire 8kB block every time, instead
of only the new bytes (and maybe a few bytes following that to spoil
any old data that might be there)? I mean, the OS page size is 4kB on
Linux. If we generate 2kB of WAL and then flush, we're likely to dirty
two OS blocks instead of one. The OS isn't going to be smart enough to
notice that one of those pages didn't really change, so we're
potentially generating some extra disk I/O. My colleague Jan Wieck has
some (inconclusive) benchmark results that suggest this might actually
be hurting us significantly. More research is needed, but I thought
I'd ask if we've ever considered NOT doing that, or if we should
consider it.

2. I don't really understand why WALWriteLock is set up to prohibit
two backends from flushing WAL at the same time. That seems
unnecessary. Suppose we've got two backends that flush WAL one after
the other. Assume (as is not unlikely) that the second one's flush
position is ahead of the first one's flush position. So the first one
grabs WALWriteLock and does the flush, and then the second one grabs
WALWriteLock for its turn to flush and has to wait for an entire spin
of the platter to complete before its fsync() can be satisfied. If
we'd just let the second guy issue his fsync() right away, odds are
good that the disk would have satisfied both in a single rotation.
Now it's possible that the second request would've arrived too late
for that to work out, but AFAICS in that case we're no worse off than
we are now. And if it does work out we're better off. The only
reasons I can see why we might NOT want to do this are (1) if we're
trying to compensate for some OS-level bugginess, which is a
horrifying thought, or (2) if we think the extra system calls will
cost more than we save by piggybacking the flushes more efficiently.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
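To make point 1 concrete, here is a minimal sketch (illustrative only,
not PostgreSQL code; it assumes an 8kB WAL block and 4kB OS pages, and
the function names are made up) contrasting a full-block rewrite with
writing only the newly generated bytes:

#include <sys/types.h>
#include <unistd.h>

#define WAL_BLOCK_SIZE 8192             /* matches the default XLOG block size */
#define OS_PAGE_SIZE   4096             /* typical Linux page size */

/*
 * Current behaviour, roughly: the whole 8kB block containing the new
 * WAL data is written out, so both 4kB OS pages backing it are dirtied
 * even if only the first 2kB actually changed.
 */
static ssize_t
write_full_block(int fd, off_t block_start, const char *block)
{
    return pwrite(fd, block, WAL_BLOCK_SIZE, block_start);
}

/*
 * The alternative being asked about: write only the newly generated
 * bytes (perhaps plus a little padding to spoil any stale data that
 * follows), dirtying a single OS page in the 2kB example above.
 */
static ssize_t
write_new_bytes_only(int fd, off_t block_start, const char *block,
                     size_t new_bytes)
{
    return pwrite(fd, block, new_bytes, block_start);
}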
On 05/15/2015 09:06 AM, Robert Haas wrote:
> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time. That seems
> unnecessary. Suppose we've got two backends that flush WAL one after
> the other. Assume (as is not unlikely) that the second one's flush
> position is ahead of the first one's flush position. So the first one
> grabs WALWriteLock and does the flush, and then the second one grabs
> WALWriteLock for its turn to flush and has to wait for an entire spin
> of the platter to complete before its fsync() can be satisfied. If
> we'd just let the second guy issue his fsync() right away, odds are
> good that the disk would have satisfied both in a single rotation.
> Now it's possible that the second request would've arrived too late
> for that to work out, but AFAICS in that case we're no worse off than
> we are now. And if it does work out we're better off. The only

This is a bit out of my depth, but it sounds similar (from a user
perspective) to the difference between synchronous and asynchronous
commit. If we are willing to trust that PostgreSQL/the OS will do what
it is supposed to do, then it seems logical that what you describe
above would definitely be a net win.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
Robert Haas <robertmhaas@gmail.com> writes:
> WALWriteLock contention is measurable on some workloads. In studying
> the problem briefly, a couple of questions emerged:

> 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
> of only the new bytes (and maybe a few bytes following that to spoil
> any old data that might be there)?

It does, but it's not clear how to avoid torn-write conditions without
that.

> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time. That seems
> unnecessary.

Hm, perhaps so.

            regards, tom lane
On Fri, May 15, 2015 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> WALWriteLock contention is measurable on some workloads. In studying
>> the problem briefly, a couple of questions emerged:
>
>> 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
>> of only the new bytes (and maybe a few bytes following that to spoil
>> any old data that might be there)?
>
> It does, but it's not clear how to avoid torn-write conditions without
> that.

Can you elaborate? I don't understand how repeatedly overwriting the
same bytes with themselves accomplishes anything at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 15, 2015 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> WALWriteLock contention is measurable on some workloads. In studying
> the problem briefly, a couple of questions emerged:
> ...
> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time. That seems
> unnecessary. Suppose we've got two backends that flush WAL one after
> the other. Assume (as is not unlikely) that the second one's flush
> position is ahead of the first one's flush position. So the first one
> grabs WALWriteLock and does the flush, and then the second one grabs
> WALWriteLock for its turn to flush and has to wait for an entire spin
> of the platter to complete before its fsync() can be satisfied. If
> we'd just let the second guy issue his fsync() right away, odds are
> good that the disk would have satisfied both in a single rotation.
> Now it's possible that the second request would've arrived too late
> for that to work out, but AFAICS in that case we're no worse off than
> we are now. And if it does work out we're better off. The only
> reasons I can see why we might NOT want to do this are (1) if we're
> trying to compensate for some OS-level bugginess, which is a
> horrifying thought, or (2) if we think the extra system calls will
> cost more than we save by piggybacking the flushes more efficiently.
I implemented this 2-3 years ago, just dropping the WALWriteLock
immediately before the fsync and then picking it up again immediately
after, and was surprised that I saw absolutely no improvement. Of
course it surely depends on the IO stack, but from what I saw it seemed
that once an fsync landed in the kernel, any future ones on that file
were blocked rather than consolidated. Alas, I can't find the patch
anymore; I can make more of an effort to dig it up if anyone cares,
although it would probably be easier to reimplement it than to find it
and rebase it.
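Roughly, the shape of that experiment was as follows (a minimal sketch
using plain pthread and POSIX calls rather than the real WALWriteLock
machinery; the names and details here are illustrative, not the
original patch):

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
static off_t flushed_upto = 0;      /* analogue of the shared flush pointer */

static void
flush_wal(int wal_fd, const char *buf, size_t len, off_t upto)
{
    /* write the dirty WAL buffers while holding the "WALWriteLock" */
    pthread_mutex_lock(&wal_write_lock);
    pwrite(wal_fd, buf, len, upto - (off_t) len);
    pthread_mutex_unlock(&wal_write_lock);

    /* fsync with the lock released, so another backend can flush too */
    fsync(wal_fd);

    /* retake the lock only for the post-fsync bookkeeping */
    pthread_mutex_lock(&wal_write_lock);
    if (upto > flushed_upto)
        flushed_upto = upto;
    pthread_mutex_unlock(&wal_write_lock);
}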
I vaguely recall thinking that the post-fsync bookkeeping could be moved to a spin lock, with a fair bit of work, so that the WALWriteLock would not need to be picked up again, but the whole avenue didn't seem promising enough for me to worry about that part in detail.
My goal there was to further improve group commit. When running
pgbench -j10 -c10, it was common to see fsyncs that alternated between
flushing 1 transaction and 9 transactions, because the first one to the
gate would go through it and slam it on all the others, and it would
take one fsync cycle for it to reopen.
Cheers,
Jeff
On Fri, May 15, 2015 at 9:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I implemented this 2-3 years ago, just dropping the WALWriteLock
> immediately before the fsync and then picking it up again immediately
> after, and was surprised that I saw absolutely no improvement. Of
> course it surely depends on the IO stack, but from what I saw it
> seemed that once an fsync landed in the kernel, any future ones on
> that file were blocked rather than consolidated.

Interesting.

> Alas, I can't find the patch anymore; I can make more of an effort to
> dig it up if anyone cares, although it would probably be easier to
> reimplement it than to find it and rebase it.
>
> I vaguely recall thinking that the post-fsync bookkeeping could be
> moved to a spin lock, with a fair bit of work, so that the
> WALWriteLock would not need to be picked up again, but the whole
> avenue didn't seem promising enough for me to worry about that part
> in detail.
>
> My goal there was to further improve group commit. When running
> pgbench -j10 -c10, it was common to see fsyncs that alternated
> between flushing 1 transaction and 9 transactions, because the first
> one to the gate would go through it and slam it on all the others,
> and it would take one fsync cycle for it to reopen.

Hmm, yeah. I remember somebody (Peter Geoghegan, I think) mentioning
behavior like that before, but I had not made the connection to this
issue at that time. This blog post is pretty depressing:

http://oldblog.antirez.com/post/fsync-different-thread-useless.html

It suggests that an fsync in progress blocks out not only other
fsyncs, but other writes to the same file, which for our purposes is
just awful. More Googling around reveals that this is apparently
well-known to Linux kernel developers and that they don't seem excited
about fixing it. :-(

<crazy-idea>I wonder if we could write WAL to two different files in
alternation, so that we could be writing to one file while fsync-ing
the other.</crazy-idea>

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
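For reference, the test in that post is roughly of this shape (a
minimal, illustrative sketch assuming Linux/POSIX; the file name,
sizes and counts are arbitrary): one thread fsync()s the file in a
loop while the main thread times small writes to the same file.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int fd;

static void *
fsync_loop(void *arg)
{
    (void) arg;
    for (;;)
        fsync(fd);      /* keep an fsync in flight more or less constantly */
    return NULL;
}

int
main(void)
{
    char        buf[512];
    pthread_t   t;
    struct timeval start, end;

    memset(buf, 'x', sizeof(buf));
    fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_APPEND, 0600);
    pthread_create(&t, NULL, fsync_loop, NULL);

    for (int i = 0; i < 100; i++)
    {
        gettimeofday(&start, NULL);
        write(fd, buf, sizeof(buf));    /* does the in-progress fsync stall this? */
        gettimeofday(&end, NULL);
        printf("write %d took %ld us\n", i,
               (long) ((end.tv_sec - start.tv_sec) * 1000000L +
                       (end.tv_usec - start.tv_usec)));
    }
    return 0;
}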
On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>
>
Won't the order of transaction replay during recovery cause problems
if we do alternation while writing? I think this is one of the reasons
WAL is written sequentially. Another thing is that during recovery,
currently whenever we encounter a mismatch between the stored CRC and
the actual record CRC, we call it the end of recovery, but with writing
to 2 files simultaneously we might need to rethink that rule.

I think the first point in your mail, related to the rewrite of an 8K
block each time, needs more thought and maybe some experimentation to
check whether writing in smaller units based on OS page size or sector
size leads to any meaningful gains. Another thing is that if there is
high write activity, then group commit should help in reducing IO for
repeated writes, and in the tests we can try changing commit_delay to
see if that can help (if the tests are already tuned with respect to
commit_delay, then ignore this point).
On May 17, 2015, at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> <crazy-idea>I wonder if we could write WAL to two different files in
>> alternation, so that we could be writing to one file while fsync-ing
>> the other.</crazy-idea>
>
> Won't the order of transaction replay during recovery cause problems
> if we do alternation while writing? I think this is one of the
> reasons WAL is written sequentially. Another thing is that during
> recovery, currently whenever we encounter a mismatch between the
> stored CRC and the actual record CRC, we call it the end of recovery,
> but with writing to 2 files simultaneously we might need to rethink
> that rule.

Well, yeah. That's why I said it was a crazy idea.

> I think the first point in your mail, related to the rewrite of an 8K
> block each time, needs more thought and maybe some experimentation to
> check whether writing in smaller units based on OS page size or
> sector size leads to any meaningful gains. Another thing is that if
> there is high write activity, then group commit should help in
> reducing IO for repeated writes, and in the tests we can try changing
> commit_delay to see if that can help (if the tests are already tuned
> with respect to commit_delay, then ignore this point).

I am under the impression that using commit_delay usefully is pretty
hard but, of course, I could be wrong.

...Robert
On Sun, May 17, 2015 at 2:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>
> It suggests that an fsync in progress blocks out not only other
> fsyncs, but other writes to the same file, which for our purposes is
> just awful. More Googling around reveals that this is apparently
> well-known to Linux kernel developers and that they don't seem excited
> about fixing it. :-(

He doesn't say, but I wonder if that is really Linux, or if it is the
ext2, 3 and maybe 4 filesystems specifically. This blog post talks
about the per-inode mutex that is held while writing with direct IO.
Maybe fsyncing buffered IO is similarly constrained in those
filesystems.

https://www.facebook.com/notes/mysql-at-facebook/xfs-ext-and-per-inode-mutexes/10150210901610933

> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>

If that is an ext3-specific problem, using multiple files might not
help you anyway, because ext3 famously fsyncs *all* files when you ask
for one file to be fsynced, as discussed in Greg Smith's PostgreSQL
9.0 High Performance in chapter 4 (page 79).

-- 
Thomas Munro
http://www.enterprisedb.com
On May 17, 2015, at 5:57 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sun, May 17, 2015 at 2:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>>
>> It suggests that an fsync in progress blocks out not only other
>> fsyncs, but other writes to the same file, which for our purposes is
>> just awful. More Googling around reveals that this is apparently
>> well-known to Linux kernel developers and that they don't seem excited
>> about fixing it. :-(
>
> He doesn't say, but I wonder if that is really Linux, or if it is the
> ext2, 3 and maybe 4 filesystems specifically. This blog post talks
> about the per-inode mutex that is held while writing with direct IO.

Good point. We should probably test ext4 and xfs on a newish kernel.

...Robert
>> My goal there was to further improve group commit. When running
>> pgbench -j10 -c10, it was common to see fsyncs that alternated
>> between flushing 1 transaction and 9 transactions, because the first
>> one to the gate would go through it and slam it on all the others,
>> and it would take one fsync cycle for it to reopen.
>
> Hmm, yeah. I remember somebody (Peter Geoghegan, I think) mentioning
> behavior like that before, but I had not made the connection to this
> issue at that time. This blog post is pretty depressing:
>
> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>
> It suggests that an fsync in progress blocks out not only other
> fsyncs, but other writes to the same file, which for our purposes is
> just awful. More Googling around reveals that this is apparently
> well-known to Linux kernel developers and that they don't seem excited
> about fixing it. :-(
I think they already did. I don't see the effect in ext4, even on a rather old kernel like 2.6.32, using the code from the link above.
> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>
I thought the most promising thing, once there were timers and sleeps
with resolution much better than a centisecond, was to record the time
at which each fsync finished, and then sleep until "then +
commit_delay". That way you don't do any harm to the sleeper, as the
write head is not positioned to process the fsync until then anyway,
and you give other workers the chance to get their commit records in.
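A minimal sketch of that timing scheme (assuming POSIX clock_nanosleep;
the names are purely illustrative):

#include <time.h>

static struct timespec last_fsync_done;

/* called right after an fsync() returns */
static void
note_fsync_done(void)
{
    clock_gettime(CLOCK_MONOTONIC, &last_fsync_done);
}

/* sleep until last_fsync_done + commit_delay before joining the next flush */
static void
wait_for_flush_window(long commit_delay_us)
{
    struct timespec target = last_fsync_done;

    target.tv_nsec += commit_delay_us * 1000L;
    if (target.tv_nsec >= 1000000000L)
    {
        target.tv_sec += target.tv_nsec / 1000000000L;
        target.tv_nsec %= 1000000000L;
    }

    /*
     * The disk can't service our fsync before this deadline anyway, so
     * the sleep costs the sleeper nothing and lets other backends get
     * their commit records into the same flush.
     */
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL);
}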
But then I kind of lost interest, because anyone who cares very much about commit performance will probably get a nonvolatile write cache, and anything done would be too hardware/platform dependent.
Of course a BBU isn't magic; the kernel still has to spend time
scrubbing the buffer pool and sending the dirty ones to the
disk/controller when it gets an fsync, even if the confirmation does
come back quickly. But it still seems too hardware/platform dependent
to find a general purpose optimization.
Cheers,
Jeff
On Mon, May 18, 2015 at 1:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On May 17, 2015, at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> <crazy-idea>I wonder if we could write WAL to two different files in
>>> alternation, so that we could be writing to one file while fsync-ing
>>> the other.</crazy-idea>
>>
>> Won't the order of transaction replay during recovery cause problems
>> if we do alternation while writing? I think this is one of the
>> reasons WAL is written sequentially. Another thing is that during
>> recovery, currently whenever we encounter a mismatch between the
>> stored CRC and the actual record CRC, we call it the end of recovery,
>> but with writing to 2 files simultaneously we might need to rethink
>> that rule.
>
> Well, yeah. That's why I said it was a crazy idea.
Another idea could be to try to write in units of the disk sector size,
which I think in most cases is 512 bytes (some newer disks do have
larger sectors, so it should be configurable in some way). I think with
this we ideally don't need a CRC for each WAL record, as that data will
be either written or not written. Even if we don't want to rely on the
fact that sector-sized writes are atomic, we can have a configurable
CRC per writeable unit (which in this scheme would be 512 bytes).

It can have a dual benefit. First, it can help us in minimizing the
repeated-writes problem, and second, by eliminating the need to have a
CRC for each record it can reduce the WAL volume and CPU load.
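As a purely illustrative sketch (not a worked-out proposal; the field
layout and CRC here are made up), a 512-byte writeable unit carrying
its own CRC could look something like this:

#include <stdint.h>
#include <string.h>

#define WAL_UNIT_SIZE 512               /* assumed sector size; would need
                                         * to be configurable */

typedef struct WalUnit
{
    uint32_t    crc;                        /* CRC of the rest of the unit */
    uint16_t    payload_len;                /* bytes of WAL data actually used */
    char        payload[WAL_UNIT_SIZE - 6]; /* WAL record data, zero-padded */
} WalUnit;

/* bitwise CRC-32C, standing in for whatever CRC the real code would use */
static uint32_t
unit_crc(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc = ~0u;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}

static void
fill_unit(WalUnit *unit, const char *wal_data, uint16_t len)
{
    memset(unit, 0, sizeof(*unit));
    unit->payload_len = len;
    memcpy(unit->payload, wal_data, len);
    /* checksum everything after the crc field itself */
    unit->crc = unit_crc((const char *) unit + sizeof(uint32_t),
                         sizeof(*unit) - sizeof(uint32_t));
}

On replay, a unit whose CRC doesn't verify would presumably be treated
as the end of usable WAL, much as a bad record CRC is today.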