WALWriteLock contention

From: Robert Haas

WALWriteLock contention is measurable on some workloads.  In studying
the problem briefly, a couple of questions emerged:

1. Doesn't it suck to rewrite an entire 8kB block every time, instead
of only the new bytes (and maybe a few bytes following that to spoil
any old data that might be there)?  I mean, the OS page size is 4kB on
Linux.  If we generate 2kB of WAL and then flush, we're likely to
dirty two OS blocks instead of one.  The OS isn't going to be smart
enough to notice that one of those pages didn't really change, so
we're potentially generating some extra disk I/O.  My colleague Jan
Wieck has some (inconclusive) benchmark results that suggest this
might actually be hurting us significantly.  More research is needed,
but I thought I'd ask if we've ever considered NOT doing that, or if
we should consider it.
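
To make the idea concrete, here is a minimal standalone sketch of the
write pattern I have in mind.  The file name, sizes, and offsets are
illustrative only; this is not the actual xlog.c code:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define WAL_BLCKSZ 8192

    int
    main(void)
    {
        char    page[WAL_BLCKSZ];
        int     fd = open("fakewal", O_CREAT | O_WRONLY, 0600);
        off_t   flushed_to = 0;     /* end of the previous flush */
        off_t   insert_at = 2048;   /* pretend 2kB of new WAL exists */

        memset(page, 0, sizeof(page));
        memset(page, 'x', insert_at);

        /* Today: rewrite the full 8kB page, dirtying two 4kB OS pages. */
        pwrite(fd, page, WAL_BLCKSZ, 0);

        /* Alternative: write only the new bytes; here that touches
         * just a single 4kB OS page. */
        pwrite(fd, page + flushed_to, insert_at - flushed_to, flushed_to);

        fsync(fd);
        close(fd);
        return 0;
    }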

2. I don't really understand why WALWriteLock is set up to prohibit
two backends from flushing WAL at the same time.  That seems
unnecessary.  Suppose we've got two backends that flush WAL one after
the other.  Assume (as is not unlikely) that the second one's flush
position is ahead of the first one's flush position.  So the first one
grabs WALWriteLock and does the flush, and then the second one grabs
WALWriteLock for its turn to flush and has to wait for an entire spin
of the platter to complete before its fsync() can be satisfied.  If
we'd just let the second guy issue his fsync() right away, odds are
good that the disk would have satisfied both in a single rotation.
Now it's possible that the second request would've arrived too late
for that to work out, but AFAICS in that case we're no worse off than
we are now.  And if it does work out we're better off.  The only
reasons I can see why we might NOT want to do this are (1) if we're
trying to compensate for some OS-level bugginess, which is a
horrifying thought, or (2) if we think the extra system calls will
cost more than we save by piggybacking the flushes more efficiently.
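
As a minimal illustration of what I mean, a toy standalone program
(not a patch; everything here is made up) in which each backend issues
its own fsync() with no lock held:

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    static int wal_fd;

    static void *
    backend(void *arg)
    {
        long    slot = (long) arg;

        /* Each "backend" appends its own commit record... */
        pwrite(wal_fd, "commit", 6, slot * 4096);
        /* ...and issues its own fsync(), with no lock held across it,
         * hoping the disk satisfies both flushes in one rotation. */
        fsync(wal_fd);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t t1, t2;

        wal_fd = open("fakewal", O_CREAT | O_WRONLY, 0600);
        pthread_create(&t1, NULL, backend, (void *) 0L);
        pthread_create(&t2, NULL, backend, (void *) 1L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        close(wal_fd);
        return 0;
    }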

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WALWriteLock contention

From: "Joshua D. Drake"

On 05/15/2015 09:06 AM, Robert Haas wrote:

> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time.  That seems
> unnecessary.  Suppose we've got two backends that flush WAL one after
> the other.  Assume (as is not unlikely) that the second one's flush
> position is ahead of the first one's flush position.  So the first one
> grabs WALWriteLock and does the flush, and then the second one grabs
> WALWriteLock for its turn to flush and has to wait for an entire spin
> of the platter to complete before its fsync() can be satisfied.  If
> we'd just let the second guy issue his fsync() right away, odds are
> good that the disk would have satisfied both in a single rotation.
> Now it's possible that the second request would've arrived too late
> for that to work out, but AFAICS in that case we're no worse off than
> we are now.  And if it does work out we're better off.  The only

This is a bit out of my depth, but from a user perspective it sounds
similar to the difference between synchronous and asynchronous commit.
If we are willing to trust that PostgreSQL and the OS will do what
they are supposed to do, then it seems logical that what you describe
above would be a net win.

JD
-- 
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.



Re: WALWriteLock contention

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> WALWriteLock contention is measurable on some workloads.  In studying
> the problem briefly, a couple of questions emerged:

> 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
> of only the new bytes (and maybe a few bytes following that to spoil
> any old data that might be there)?

It does, but it's not clear how to avoid torn-write conditions without
that.

> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time.  That seems
> unnecessary.

Hm, perhaps so.

        regards, tom lane



Re: WALWriteLock contention

From: Robert Haas

On Fri, May 15, 2015 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> WALWriteLock contention is measurable on some workloads.  In studying
>> the problem briefly, a couple of questions emerged:
>
>> 1. Doesn't it suck to rewrite an entire 8kB block every time, instead
>> of only the new bytes (and maybe a few bytes following that to spoil
>> any old data that might be there)?
>
> It does, but it's not clear how to avoid torn-write conditions without
> that.

Can you elaborate?  I don't understand how repeatedly overwriting the
same bytes with themselves accomplishes anything at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WALWriteLock contention

From: Jeff Janes

On Fri, May 15, 2015 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> WALWriteLock contention is measurable on some workloads.  In studying
> the problem briefly, a couple of questions emerged:

...

> 2. I don't really understand why WALWriteLock is set up to prohibit
> two backends from flushing WAL at the same time.  That seems
> unnecessary.  Suppose we've got two backends that flush WAL one after
> the other.  Assume (as is not unlikely) that the second one's flush
> position is ahead of the first one's flush position.  So the first one
> grabs WALWriteLock and does the flush, and then the second one grabs
> WALWriteLock for its turn to flush and has to wait for an entire spin
> of the platter to complete before its fsync() can be satisfied.  If
> we'd just let the second guy issue his fsync() right away, odds are
> good that the disk would have satisfied both in a single rotation.
> Now it's possible that the second request would've arrived too late
> for that to work out, but AFAICS in that case we're no worse off than
> we are now.  And if it does work out we're better off.  The only
> reasons I can see why we might NOT want to do this are (1) if we're
> trying to compensate for some OS-level bugginess, which is a
> horrifying thought, or (2) if we think the extra system calls will
> cost more than we save by piggybacking the flushes more efficiently.

I implemented this 2-3 years ago, just dropping the WALWriteLock
immediately before the fsync and then picking it up again immediately
after, and was surprised that I saw absolutely no improvement.  Of
course it surely depends on the IO stack, but from what I saw it
seemed that once a fsync landed in the kernel, any future ones on that
file were blocked rather than consolidated.  Alas, I can't find the
patch anymore; I can make more of an effort to dig it up if anyone
cares.  Although it would probably be easier to reimplement it than it
would be to find it and rebase it.
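
For illustration, here is a rough standalone reconstruction of the
shape of that experiment, with a pthread mutex standing in for
WALWriteLock.  This is not the original patch, just the idea:

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
    static int wal_fd;

    static void
    flush_wal(const char *rec, size_t len, off_t pos)
    {
        pthread_mutex_lock(&wal_write_lock);
        pwrite(wal_fd, rec, len, pos);      /* write still serialized */
        pthread_mutex_unlock(&wal_write_lock);

        fsync(wal_fd);                      /* fsync no longer serialized */

        pthread_mutex_lock(&wal_write_lock);
        /* post-fsync bookkeeping (update shared flush pointer) here */
        pthread_mutex_unlock(&wal_write_lock);
    }

    int
    main(void)
    {
        wal_fd = open("fakewal", O_CREAT | O_WRONLY, 0600);
        flush_wal("commit", 6, 0);
        close(wal_fd);
        return 0;
    }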

I vaguely recall thinking that the post-fsync bookkeeping could be
moved to a spin lock, with a fair bit of work, so that the
WALWriteLock would not need to be picked up again, but the whole
avenue didn't seem promising enough for me to worry about that part in
detail.

My goal there was to further improve group commit.  When running
pgbench -j10 -c10, it was common to see fsyncs that alternated between
flushing 1 transaction, and 9 transactions.  Because the first one to
the gate would go through it and slam it on all the others, and it
would take one fsync cycle for it to reopen.

Cheers,

Jeff

Re: WALWriteLock contention

From: Robert Haas

On Fri, May 15, 2015 at 9:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I implemented this 2-3 years ago, just dropping the WALWriteLock immediately
> before the fsync and then picking it up again immediately after, and was
> surprised that I saw absolutely no improvement.  Of course it surely depends
> on the IO stack, but from what I saw it seemed that once a fsync landed in
> the kernel, any future ones on that file were blocked rather than
> consolidated.

Interesting.

> Alas, I can't find the patch anymore; I can make more of an
> effort to dig it up if anyone cares.  Although it would probably be easier
> to reimplement it than it would be to find it and rebase it.
>
> I vaguely recall thinking that the post-fsync bookkeeping could be moved to
> a spin lock, with a fair bit of work, so that the WALWriteLock would not
> need to be picked up again, but the whole avenue didn't seem promising
> enough for me to worry about that part in detail.
>
> My goal there was to further improve group commit.  When running pgbench
> -j10 -c10, it was common to see fsyncs that alternated between flushing 1
> transaction, and 9 transactions. Because the first one to the gate would go
> through it and slam it on all the others, and it would take one fsync cycle
> for it to reopen.

Hmm, yeah.  I remember somebody (Peter Geoghegan, I think) mentioning
behavior like that before, but I had not made the connection to this
issue at that time.  This blog post is pretty depressing:

http://oldblog.antirez.com/post/fsync-different-thread-useless.html

It suggests that an fsync in progress blocks out not only other
fsyncs, but other writes to the same file, which for our purposes is
just awful.  More Googling around reveals that this is apparently
well-known to Linux kernel developers and that they don't seem excited
about fixing it.  :-(
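
Here's a quick standalone test in the spirit of that post: a
background thread hammers fsync() while the main thread times its own
write() calls.  Sketch only; the file name is made up, and results
will vary by kernel and filesystem:

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static int fd;

    static void *
    syncer(void *arg)
    {
        (void) arg;
        for (;;)
            fsync(fd);          /* hammer the file with fsyncs */
        return NULL;
    }

    static double
    now(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int
    main(void)
    {
        pthread_t   t;
        char        buf[4096] = {0};
        double      worst = 0.0;
        int         i;

        fd = open("synctest", O_CREAT | O_WRONLY, 0600);
        pthread_create(&t, NULL, syncer, NULL);
        for (i = 0; i < 10000; i++)
        {
            double  start = now();
            double  elapsed;

            /* If an in-progress fsync blocks writers, this spikes. */
            write(fd, buf, sizeof(buf));
            elapsed = now() - start;
            if (elapsed > worst)
                worst = elapsed;
        }
        printf("worst write() latency: %.6f s\n", worst);
        return 0;
    }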

<crazy-idea>I wonder if we could write WAL to two different files in
alternation, so that we could be writing to one file while fsync-ing
the other.</crazy-idea>
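
As a toy sketch of the shape of that, ignoring every recovery-ordering
problem (file names and batch sizes made up):

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        int     fds[2];
        int     cur = 0;
        int     batch;

        fds[0] = open("wal.a", O_CREAT | O_WRONLY, 0600);
        fds[1] = open("wal.b", O_CREAT | O_WRONLY, 0600);

        for (batch = 0; batch < 8; batch++)
        {
            /* Write the current batch of records to one file... */
            write(fds[cur], "records...", 10);
            /* ...and flush the previous batch in the other file.  In
             * a real design the fsync would be issued concurrently by
             * another backend; fsync on A wouldn't block writes to B. */
            fsync(fds[1 - cur]);
            cur = 1 - cur;      /* alternate files */
        }
        fsync(fds[1 - cur]);    /* flush the final batch */
        close(fds[0]);
        close(fds[1]);
        return 0;
    }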

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WALWriteLock contention

From: Amit Kapila

On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>
>

Won't the order of transaction replay during recovery cause problems
if we alternate between files while writing?  I think this is one of
the reasons WAL is written sequentially.  Another thing is that
during recovery, whenever we currently encounter a mismatch between
the stored CRC and the actual record CRC, we treat it as the end of
recovery, but with writing to 2 files simultaneously we might need to
rethink that rule.

I think the first point in your mail, related to the rewrite of an 8K
block each time, needs more thought and maybe some experimentation to
check whether writing in smaller units based on the OS page size or
sector size leads to any meaningful gains.  Another thing is that if
there is high write activity, then group commits should help in
reducing IO for repeated writes, and in the tests we can try changing
commit_delay to see if that can help (if the tests are already tuned
with respect to commit_delay, then ignore this point).
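
For example, a hypothetical starting point for such a test could be
(values are illustrative only and would need tuning per workload):

    # postgresql.conf (illustrative values)
    commit_delay = 20       # usec to wait before flushing WAL buffers
    commit_siblings = 5     # min concurrent open transactions required
                            # before the delay is applied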


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: WALWriteLock contention

From: Robert Haas

On May 17, 2015, at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> <crazy-idea>I wonder if we could write WAL to two different files in
>> alternation, so that we could be writing to one file while fsync-ing
>> the other.</crazy-idea>
>
> Won't the order of transaction replay during recovery cause problems
> if we alternate between files while writing?  I think this is one of
> the reasons WAL is written sequentially.  Another thing is that
> during recovery, whenever we currently encounter a mismatch between
> the stored CRC and the actual record CRC, we treat it as the end of
> recovery, but with writing to 2 files simultaneously we might need to
> rethink that rule.

Well, yeah. That's why I said it was a crazy idea.

> I think the first point in your mail, related to the rewrite of an 8K
> block each time, needs more thought and maybe some experimentation to
> check whether writing in smaller units based on the OS page size or
> sector size leads to any meaningful gains.  Another thing is that if
> there is high write activity, then group commits should help in
> reducing IO for repeated writes, and in the tests we can try changing
> commit_delay to see if that can help (if the tests are already tuned
> with respect to commit_delay, then ignore this point).

I am under the impression that using commit_delay usefully is pretty
hard but, of course, I could be wrong.

...Robert

Re: WALWriteLock contention

From: Thomas Munro

On Sun, May 17, 2015 at 2:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>
> It suggests that an fsync in progress blocks out not only other
> fsyncs, but other writes to the same file, which for our purposes is
> just awful.  More Googling around reveals that this is apparently
> well-known to Linux kernel developers and that they don't seem excited
> about fixing it.  :-(

He doesn't say, but I wonder if that is really Linux, or if it is the
ext2, 3 and maybe 4 filesystems specifically.  This blog post talks
about the per-inode mutex that is held while writing with direct IO.
Maybe fsyncing buffered IO is similarly constrained in those
filesystems.

https://www.facebook.com/notes/mysql-at-facebook/xfs-ext-and-per-inode-mutexes/10150210901610933

> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>

If that is an ext3-specific problem, using multiple files might not
help you anyway, because ext3 famously fsyncs *all* files when you
ask for one file to be fsynced, as discussed in Greg Smith's
PostgreSQL 9.0 High Performance in chapter 4 (page 79).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WALWriteLock contention

From: Robert Haas

On May 17, 2015, at 5:57 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, May 17, 2015 at 2:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>>
>> It suggests that an fsync in progress blocks out not only other
>> fsyncs, but other writes to the same file, which for our purposes is
>> just awful.  More Googling around reveals that this is apparently
>> well-known to Linux kernel developers and that they don't seem excited
>> about fixing it.  :-(
>
> He doesn't say, but I wonder if that is really Linux, or if it is the
> ext2, 3 and maybe 4 filesystems specifically.  This blog post talks
> about the per-inode mutex that is held while writing with direct IO.

Good point. We should probably test ext4 and xfs on a newish kernel.

...Robert


Re: WALWriteLock contention

From: Jeff Janes

>> My goal there was to further improve group commit.  When running
>> pgbench -j10 -c10, it was common to see fsyncs that alternated between
>> flushing 1 transaction, and 9 transactions.  Because the first one to
>> the gate would go through it and slam it on all the others, and it
>> would take one fsync cycle for it to reopen.
>
> Hmm, yeah.  I remember somebody (Peter Geoghegan, I think) mentioning
> behavior like that before, but I had not made the connection to this
> issue at that time.  This blog post is pretty depressing:
>
> http://oldblog.antirez.com/post/fsync-different-thread-useless.html
>
> It suggests that an fsync in progress blocks out not only other
> fsyncs, but other writes to the same file, which for our purposes is
> just awful.  More Googling around reveals that this is apparently
> well-known to Linux kernel developers and that they don't seem excited
> about fixing it.  :-(

I think they already did.  I don't see the effect in ext4, even on a
rather old kernel like 2.6.32, using the code from the link above.

> <crazy-idea>I wonder if we could write WAL to two different files in
> alternation, so that we could be writing to one file while fsync-ing
> the other.</crazy-idea>

I thought the most promising thing, once there were timers and sleeps
with resolution much better than a centisecond, was to record the time
at which each fsync finished, and then sleep until "then +
commit_delay".  That way you don't do any harm to the sleeper, as the
write head is not positioned to process the fsync until then anyway,
and you give other workers the chance to get their commit records in.
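
In code, the idea would have looked something like this: a standalone
sketch assuming POSIX clock_nanosleep(), with a made-up delay value,
not anything I actually wrote:

    #include <fcntl.h>
    #include <time.h>
    #include <unistd.h>

    #define COMMIT_DELAY_NS (20 * 1000L)    /* say, 20 microseconds */

    static struct timespec last_fsync_done;

    /* Sleep until last fsync finish time + commit_delay, on the theory
     * that the write head can't service another fsync before then. */
    static void
    wait_for_flush_window(void)
    {
        struct timespec target = last_fsync_done;

        target.tv_nsec += COMMIT_DELAY_NS;
        if (target.tv_nsec >= 1000000000L)
        {
            target.tv_sec += 1;
            target.tv_nsec -= 1000000000L;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL);
    }

    int
    main(void)
    {
        int     fd = open("fakewal", O_CREAT | O_WRONLY, 0600);

        wait_for_flush_window();    /* let others add commit records */
        fsync(fd);
        clock_gettime(CLOCK_MONOTONIC, &last_fsync_done);
        close(fd);
        return 0;
    }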

But then I kind of lost interest, because anyone who cares very much
about commit performance will probably get a nonvolatile write cache,
and anything done would be too hardware/platform dependent.

Of course a BBU isn't magic; the kernel still has to spend time
scrubbing the buffer pool and sending the dirty ones to the
disk/controller when it gets an fsync, even if the confirmation does
come back quickly.  But it still seems too hardware/platform dependent
to find a general-purpose optimization.

Cheers,

Jeff

Re: WALWriteLock contention

From: Amit Kapila

On Mon, May 18, 2015 at 1:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On May 17, 2015, at 11:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 17, 2015 at 7:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > <crazy-idea>I wonder if we could write WAL to two different files in
> > alternation, so that we could be writing to one file while fsync-ing
> > the other.</crazy-idea>
>
> Won't the order of transaction replay during recovery cause problems
> if we alternate between files while writing?  I think this is one of
> the reasons WAL is written sequentially.  Another thing is that
> during recovery, whenever we currently encounter a mismatch between
> the stored CRC and the actual record CRC, we treat it as the end of
> recovery, but with writing to 2 files simultaneously we might need to
> rethink that rule.
>
>
> Well, yeah. That's why I said it was a crazy idea.
>

Another idea could be to write in units of the disk sector size, which
I think in most cases is 512 bytes (some newer disks do have larger
sectors, so it should be configurable in some way).  I think with this
we ideally wouldn't need a CRC for each WAL record, as each unit would
be either fully written or not written at all.  Even if we don't want
to rely on the fact that sector-sized writes are atomic, we can have a
configurable CRC per writable unit (which in this scheme would be 512
bytes).

It can have a dual benefit.  First, it can help us minimize the
repeated-writes problem, and second, by eliminating the need for a CRC
on each record it can reduce the WAL volume and CPU load.
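
A rough sketch of what one such writable unit could look like (sizes
and names are illustrative, and PostgreSQL would use its own CRC
implementation rather than zlib's):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>       /* for crc32(); illustration only */

    #define SECTOR_SZ  512
    #define PAYLOAD_SZ (SECTOR_SZ - sizeof(uint32_t))

    /* Fill one writable unit: payload plus trailing CRC. */
    static void
    seal_sector(unsigned char sector[SECTOR_SZ],
                const unsigned char *payload, size_t len)
    {
        uint32_t crc;

        memset(sector, 0, SECTOR_SZ);
        memcpy(sector, payload, len < PAYLOAD_SZ ? len : PAYLOAD_SZ);
        crc = (uint32_t) crc32(0L, sector, PAYLOAD_SZ);
        memcpy(sector + PAYLOAD_SZ, &crc, sizeof(crc));
    }

    /* On recovery, a unit whose CRC doesn't match was torn/unwritten. */
    static int
    sector_is_valid(const unsigned char sector[SECTOR_SZ])
    {
        uint32_t stored;

        memcpy(&stored, sector + PAYLOAD_SZ, sizeof(stored));
        return stored == (uint32_t) crc32(0L, sector, PAYLOAD_SZ);
    }

    int
    main(void)
    {
        unsigned char sector[SECTOR_SZ];

        seal_sector(sector, (const unsigned char *) "some WAL bytes", 14);
        return sector_is_valid(sector) ? 0 : 1;
    }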


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com