Thread: [HACKERS] subscription worker signalling wal writer too much

[HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
If I publish a pgbench workload and subscribe to it, the subscription worker is signalling the wal writer thousands of times a second, once for every async commit.  This has a noticeable performance cost.

I don't think it is ever necessary to signal the wal writer here, unless wal writer is taking the long sleep for power saving purposes.  If up to wal_writer_delay of "committed" transactions get lost on a crash, it doesn't really matter because they will be replayed again once replication resumes. However, might delaying the advance of the hot_standby_feedback by up to the amount of wal_writer_delay be a problem?  I would think any set-up which depends on the standby never being a few dozen milliseconds behind is already doomed.

If I want to suppress signalling, what would be the right way to communicate to XLogSetAsyncXactLSN that it is being called in a subscription worker?

Another approach would be to make XLogSetAsyncXactLSN signalling less aggressive for everyone, not just subscription workers.  There is no point in signalling it more than once for a given WriteRqstPtr page boundary.  So each backend could locally remember the last boundary for which it signalled wal writer, and not signal again for the same boundary.  This would be especially effective for a subscription worker, as it should be pretty common for almost all the async commits to be coming from the same process.  Or, the last boundary could be kept in shared memory (XLogCtl) so all backends can share it, at the expense of additional work needed to be done under a spin lock.

Other ideas?

Thanks,

Jeff

Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
If I publish a pgbench workload and subscribe to it, the subscription worker is signalling the wal writer thousands of times a second, once for every async commit.  This has a noticeable performance cost.

I've used a local variable to avoid waking up the wal writer more than once for the same page boundary.  This reduces the number of wake-ups by about 7/8.

I'm testing it by doing 1e6 transactions over 8 clients while replication is in effect, then waiting for the logical replica to catch up.  This cycle takes 183.1 seconds in HEAD, and 162.4 seconds with the attached patch. N=14, p-value for difference of the means 6e-17.

If I suppress all wake-ups just to see what would happen, it further reduces the runtime to 153.7.

Cheers,

Jeff
Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From: Andres Freund
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> 
> > If I publish a pgbench workload and subscribe to it, the subscription
> > worker is signalling the wal writer thousands of times a second, once for
> > every async commit.  This has a noticeable performance cost.
> >
> 
> I've used a local variable to avoid waking up the wal writer more than once
> for the same page boundary.  This reduces the number of wake-ups by about
> 7/8.

Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page...   As this
isn't specific to logical-rep, I don't think that's ok.


Have you chased down why there's that many wakeups?  Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?


Greetings,

Andres Freund



Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> > If I publish a pgbench workload and subscribe to it, the subscription
> > worker is signalling the wal writer thousands of times a second, once for
> > every async commit.  This has a noticeable performance cost.
> >
>
> I've used a local variable to avoid waking up the wal writer more than once
> for the same page boundary.  This reduces the number of wake-ups by about
> 7/8.

Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page...   As this
isn't specific to logical-rep, I don't think that's ok.

The guarantee is based on wal_writer_delay, not on SIGUSR1, so I don't think this changes that. (Also, it isn't really a guarantee: the fsync can take many seconds to complete once we do initiate it, and there is absolutely nothing we can do about that, other than doing the fsync synchronously in the first place.)

The reason for kicking the wal writer at page boundaries is so that hint bits can get set earlier than they otherwise could. But I don't think kicking it multiple times per page boundary can help in that effort.
 

Have you chased down why there's that many wakeups?  Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?

I'll have to dig into that some more.  The 7/8 reduction I cited was just in calls to SetLatch from that part of the code; I didn't measure whether the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I determined that reduction, so it wasn't truly wake-ups I measured.  Actual wake-ups were measured only indirectly via the impact on performance.  I'll need to figure out how to instrument that without distorting the performance too much in the process.

Cheers,

Jeff

Re: [HACKERS] subscription worker signalling wal writer too much

From: Andres Freund
On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
> 
> > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
> > wrote:
> > >
> > > > If I publish a pgbench workload and subscribe to it, the subscription
> > > > worker is signalling the wal writer thousands of times a second, once
> > for
> > > > every async commit.  This has a noticeable performance cost.
> > > >
> > >
> > > I've used a local variable to avoid waking up the wal writer more than
> > once
> > > for the same page boundary.  This reduces the number of wake-ups by about
> > > 7/8.
> >
> > Maybe I'm missing something here, but isn't that going to reduce our
> > guarantees about when asynchronously committed xacts are flushed out?
> > You can easily fit a number of commits into the same page...   As this
> > isn't specific to logical-rep, I don't think that's ok.
> >
> 
> The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
> this changes that. (Also, it isn't really a guarantee, the fsync can take
> many seconds to complete once we do initiate it, and there is absolutely
> nothing we can do about that, other than do the fsync synchronously in the
> first place).

Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?


> > Have you chased down why there's that many wakeups?  Normally I'd have
> > expected that a number of the SetLatch() calls get consolidated
> > together, but I guess walwriter is "too quick" in waking up and
> > resetting the latch?

> I'll have to dig into that some more.  The 7/8 reduction I cited was just
> in calls to SetLatch from that part of the code, I didn't measure whether
> the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> determined that reduction, so it wasn't truly wake ups I measured.  Actual
> wake ups were measured only indirectly via the impact on performance.  I'll
> need to figure out how to instrument that without distorting the
> performance too much in the process..

I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like that.
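For example, something along these lines should work (assuming Linux with perf and syscall tracepoints available; the pid is a placeholder):

```shell
# Count kill() syscalls issued by one backend over a 10 s window,
# without instrumenting postgres itself:
perf stat -e 'syscalls:sys_enter_kill' -p <backend_pid> -- sleep 10
```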


Greetings,

Andres Freund



Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Wed, Jun 14, 2017 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
> > wrote:
> > >
> > > > If I publish a pgbench workload and subscribe to it, the subscription
> > > > worker is signalling the wal writer thousands of times a second, once
> > for
> > > > every async commit.  This has a noticeable performance cost.
> > > >
> > >
> > > I've used a local variable to avoid waking up the wal writer more than
> > once
> > > for the same page boundary.  This reduces the number of wake-ups by about
> > > 7/8.
> >
> > Maybe I'm missing something here, but isn't that going to reduce our
> > guarantees about when asynchronously committed xacts are flushed out?
> > You can easily fit a number of commits into the same page...   As this
> > isn't specific to logical-rep, I don't think that's ok.
> >
>
> The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
> this changes that. (Also, it isn't really a guarantee, the fsync can take
> many seconds to complete once we do initiate it, and there is absolutely
> nothing we can do about that, other than do the fsync synchronously in the
> first place).

Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?

If it is taking the big sleep, then we always wake it up, since acd4c7d58baf09f.  

I didn't change that part.  I only changed what we do if it is not hibernating.


> > Have you chased down why there's that many wakeups?  Normally I'd have
> > expected that a number of the SetLatch() calls get consolidated
> > together, but I guess walwriter is "too quick" in waking up and
> > resetting the latch?

> I'll have to dig into that some more.  The 7/8 reduction I cited was just
> in calls to SetLatch from that part of the code, I didn't measure whether
> the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> determined that reduction, so it wasn't truly wake ups I measured.  Actual
> wake ups were measured only indirectly via the impact on performance.  I'll
> need to figure out how to instrument that without distorting the
> performance too much in the process..

I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like hta.t


It looks like only limited consolidation was happening; the number of kills invoked was more than half of the number of SetLatch calls.  I think the reason is that when kill(owner_pid, SIGUSR1) is called, the kernel immediately suspends the calling process and gives the signalled process a chance to run in that time slice.  The Wal Writer gets woken up, sees that it only has a partial page to write and decides not to do anything, but resets the latch.  It does this fast enough that the subscription worker hasn't migrated CPUs yet.

But, if the WAL Writer sees it has already flushed up to the highest eligible page boundary, why doesn't the SetAsync code see the same thing?  Because it isn't flushed that high, but only written that high.  It looks like a story of three git commits:

commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   Sun Nov 13 09:00:57 2011 +0000

    Wakeup WALWriter as needed for asynchronous commit performance.


commit db76b1efbbab2441428a9ef21f7ac9ba43c52482
Author: Andres Freund <andres@anarazel.de>
Date:   Mon Feb 15 22:15:35 2016 +0100

    Allow SetHintBits() to succeed if the buffer's LSN is new enough.


commit 7975c5e0a992ae9a45e03d145e0d37e2b5a707f5
Author: Andres Freund <andres@anarazel.de>
Date:   Mon Feb 15 23:52:38 2016 +0100

    Allow the WAL writer to flush WAL at a reduced rate.


The first change made the WALWriter wake up and do a write and flush whenever an async commit occurred and there was a completed WAL page to be written.  This way the hint bits could be set while the heap page was still hot, because they couldn't get set until the WAL covering the hinted-at transaction commit is flushed.

The second change said we can set hint bits before the WAL covering the hinted-at transaction is flushed, as long as the page LSN is newer than that (so the wal covering the hinted-at transaction commit must be flushed before the page containing the hint bit can be written).

Then the third change makes the wal writer only write the full pages most of the times it is woken up, not flush them.  It only flushes them once every wal_writer_delay or wal_writer_flush_after, regardless of how often it is woken up.

It seems like the third change rendered the first one useless.  Wouldn't it better, and much simpler, just to have reverted the first change once the second one was done?  If that were done, would the third change still be needed?  (It did seem to add some other features as well, so I'm going to assume we still want those...).

But now the first change is even worse than useless, it is positively harmful.  The only thing to stop it from waking the WALWriter for every async commit is this line:

        /* if we have already flushed that far, we're done */
        if (WriteRqstPtr <= LogwrtResult.Flush)
            return;

Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't become true because we woke the walwriter; it only becomes true once the timeout expires (which it would have done anyway with no help from us), or once wal_writer_flush_after is exceeded.  So now every async commit is waking the walwriter.  Most of those wake-ups do nothing (there is not a completely new patch to write), some write one completed page but don't flush anything, and very few do a flush, and that one would have been done anyway.

So for the sake of wal_writer_flush_after, I guess the simplest thing to do would be to repeat the flushbytes calculation and wake the walwriter only if it looks like that will be exceeded?  Attached patch does that.

Maybe the computation of flushbytes should be put into a macro, rather than repeated in the code?  Or even done on the fly, I don't see any reason it needs to be stored in a variable before testing.

This new patch is simpler than the previous one, and more effective at speeding up replication. I assume it would speed up pgbench with synchronous_commit turned off (or against unlogged tables) as well, but I don't think I have the hardware needed to test that.

Cheers,

Jeff

Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Thu, Jun 15, 2017 at 3:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This new patch is simpler than the previous one, and more effective at speeding up replication. I assume it would speed up pgbench with synchronous_commit turned off (or against unlogged tables) as well, but I don't think I have the hardware needed to test that.

If I use the 'tpcb-func' script embodied in the attached patch to pgbench, then I can see the performance difference against unlogged tables using 8 clients on an 8-CPU virtual machine. The normal tpcb-like script has too much communication overhead (bouncing from pgbench to the postgres backend 7 times per transaction) to show the difference. I also had to set autovacuum_vacuum_cost_delay=0, as otherwise auto analyze holds a snapshot long enough to bloat the HOT chains, which injects a great deal of variability into the timings.

Commit 7975c5e0a992ae9 in the 9.6 branch causes a regression of about 10%, and my patch from the previous email redeems that regression.  It also gives the same improvement against 10dev HEAD.

Cheers,

Jeff
Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From: Andres Freund
Hi,

On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> > this afaics would allow wal writer to go into sleep mode with half a
> > page filled, and it'd not be woken up again until the page is filled.
> > No?
> >
> 
> If it is taking the big sleep, then we always wake it up, since
> acd4c7d58baf09f.
> 
> I didn't change that part.  I only changed what we do if it not hibernating.

Right, I hadn't read enough of the context.


> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch.  I think the reason
> is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice.  The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch.  It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.

I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless.  In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superfluous context switches.  I think we can avoid doing so quite
easily, as e.g. with the attached patch.  Could you check how much that
helps your benchmark?


> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written.  This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
> 
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
> 
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them.  It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
> 
> It seems like the third change rendered the first one useless.

I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?

I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing.  Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work.  I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);


> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit.  In fact
you'll always hit that case unless there are concurrent writes also
touching said page.


> If that were done, would the third change still be
> needed?  (It did seem to add some other features as well, so I'm going to
> assume we still want those...).

It's still useful, yes.  It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.



> But now the first change is even worse than useless, it is positively
> harmful.  The only thing to stop it from waking the WALWriter for every
> async commit is this line:
> 
>         /* if we have already flushed that far, we're done */
>         if (WriteRqstPtr <= LogwrtResult.Flush)
>             return;
> 
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> becomes false due to us waking walwriter, it only becomes false once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded.  So now every async commit is
> waking the walwriter.  Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.

s/completely new patch/completely new page/?

In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic.  Have to think about
this some more...


Greetings,

Andres Freund


Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:


> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch.  I think the reason
> is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice.  The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch.  It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.

I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless.  In cases where walwriter is busy, that
actually causes a lot of superflous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superflous context switches.  I think we can avoid doing so quite
easily, as e.g. with the attached patch.  Could you check how much that
helps your benchmark?

That patch doesn't make any improvement to the pgbench --unlogged-tables benchmark.  I expected that, because this benchmark exposes the opposite problem: the wal writer isn't busy, it is constantly being woken up and then immediately goes back to sleep.



> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written.  This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
>
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
>
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them.  It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
>
> It seems like the third change rendered the first one useless.

I don't think so. Isn't the walwriter writing out the contents of the
WAL is quite important because it makes room in wal_buffers for new WAL?

If we wanted to do that, I think the asynchronous committer should just issue the write directly (unless wal_sync_method is open_datasync or open_sync).  Writes are usually extremely fast and it is not worth trying to offload them to a background process, as your half of the signalling takes longer than just doing the write yourself.  When that is not true and writes start blocking due to extreme io congestion, then everything is going to start blocking anyway (the wal_buffers head will run into the tail, and the tail can't advance because the writes are blocking), so while offloading writes to a background process is then good in theory, it isn't very effective.  If everyone is stuck, it doesn't matter which process is directly stuck and which ones are indirectly stuck.
 

I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing.  Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work.  I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
        if (ProcGlobal->walwriterLatch)
                SetLatch(ProcGlobal->walwriterLatch);


My understanding is that you can't start on a new file until the old file is completely synced, because the bookkeeping can currently only handle one file at a time.  So if you signal the wal writer to do the sync, you would just end up immediately blocking on it to finish.  Getting around that would require substantially more changes; we would need to implement a queue of full log files not yet synced so that we could move on without waiting.  If we had such a queue, then sure, this would be a good idea to just wake the wal writer.  But currently, it only matters for large transactions (\copy, insert ... select, bulk updates).  With small transactions, you can't fill up log files quickly enough for it to make much of a difference.  So I think this is something to do, but it is independent of what I am currently going on about.

 

> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit.  In fact
you'll always hit that case unless there's also concurrent writes also
touching said page.

That is true, but I don't see any residual regression when going from pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied.  So I think the hint bit issue only shows up when hitting the same page aggressively, in which case the LSN does get advanced quickly enough that db76b1efbba solves it.  Perhaps a different test case could show that issue.  I also didn't see any improvement from the original 4de82f7d7c50a81e, so maybe 8 CPUs just isn't enough to detect the problem with setting hint bits on permanent tables with synchronous_commit=false.



> If that were done, would the third change still be
> needed?  (It did seem to add some other features as well, so I'm going to
> assume we still want those...).

It's still useful, yes.  It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.

But the only reason it would do that is because we keep waking it up.  If we stop waking it up, it would stop flushing each page, even if it still had the logic to flush for each page there.




> But now the first change is even worse than useless, it is positively
> harmful.  The only thing to stop it from waking the WALWriter for every
> async commit is this line:
>
>         /* if we have already flushed that far, we're done */
>         if (WriteRqstPtr <= LogwrtResult.Flush)
>             return;
>
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> becomes false due to us waking walwriter, it only becomes false once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded.  So now every async commit is
> waking the walwriter.  Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.

s/completely new patch/completely new page/?

Yes.  My mental tab completion sometimes goes awry.
 

In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic.  Have to think about
this some more...

I think if we want to write each page, we should do it in the foreground.  Conceptually, it could just be another enum value in synchronous_commit, but synchronous_commit is getting rather unwieldy already.

Cheers,

Jeff

Re: [HACKERS] subscription worker signalling wal writer too much

From: Robert Haas
On Mon, Jul 3, 2017 at 11:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> My understanding is that you can't start on a new file until the old file is
> completely synced, because the book keeping can currently only handle one
> file at a time.  So if you signal the wal writer to do the sync, you would
> just end up immediately blocking on it to finish.  Getting around that would
> require substantially more changes; we would need to implement a queue of
> full log files not yet synced so that we could move on without waiting.

Yeah.  I bet that effort would be reasonably worthwhile.  I believe
that the forced end-of-segment syncs are costing us a noticeable
amount of performance.  Once or twice I took a very half-hearted run
at doing what you describe here, but gave up pretty quickly; it seems
like a somewhat tricky problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] subscription worker signalling wal writer too much

From: Jeff Janes
On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:

 

> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit.  In fact
you'll always hit that case unless there's also concurrent writes also
touching said page.

That is true, but I don't see any residual regression when going from pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied.  So I think the hint bit issue only shows up when hitting the same page aggressively, in which case the LSN does get advanced quickly enough that db76b1efbba solves it.  Perhaps a different test case could show that issue.  I also didn't see any improvement from the original 4de82f7d7c50a81e, so maybe 8 CPUs just isn't enough to detect the problem with setting hint-buts with permanent tables with synchronous_commit=false.

I've refined my benchmarking, using a smaller scaling factor (8, same as my CPU count) and running the tests for longer. The smaller size means there are more intensive updates on individual pgbench_accounts pages, and the longer run times allow the unset hint bits to build up for longer (usually I like short test runs with a large number of randomized repetitions, but that doesn't work well here).  Since the scale factor is the same as the number of CPUs, I've bumped the thread count so that it is more likely someone will choose a non-contended value of pgbench_branches.bid to update at any given moment.

pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set synchronous_commit=false'

This testing did show the importance of waking up the wal writer so that hint bits can be set:

commit                  TPS
cfafd8beadcd6f          22200.48
cfafd8beadcd6f_nowake   30766.83
median of 14 runs reported, p-val on difference of means is 2e-22.

Where nowake is a hackish disabling of the wake-up introduced in 4de82f7d7c50a8, forward-ported to the 9.6 branch.  (I still wake it if it is doing the big sleep; I just took out the part that wakes it up even from small sleeps.)

So my revised benchmark is capable of detecting this effect, even with only 8 CPUs.

When I move to the next commit, where we set hint bits as long as the page was re-dirtied afterward, I get these numbers:

db76b1efbbab24          30742.69
db76b1efbbab24_nowake   31072.97

The difference between these is not statistically significant, nor is the difference between them and cfafd8beadcd6f.

So I think the logic you introduced in db76b1efbbab24 captures all the gain there is to be captured, and 4de82f7d7c50a8 can be reverted.  The only reason to wake up the wal writer is if it is taking the big sleep.

Of course, I can't prove a negative.  There could always be some test case which demonstrates that 4de82f7d7c50a8 still matters in spite of db76b1efbbab24.  So to be ultra-conservative, attached is a patch which should preserve all wake-ups other than the ones known to be useless.

7975c5e0a               27810.66
7975c5e0a_patched       30821.41

So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on the difference of the means).  It repeatedly wakes the wal writer up to flush a page which the wal writer simply refuses to flush.  This can't serve any purpose.  My patch still wakes it up to write the page if it has not been written yet, and also still wakes it to flush a range of pages if WalWriterFlushAfter is met.  But it just doesn't wake it up to flush a page which has already been written and will not get flushed.

I'd prefer the code-simplifying change of just not waking it up anymore except when it is doing the big sleep, but I can't reliably detect a performance difference between that and the more conservative change posted here.

Cheers,

Jeff
Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From
Kyotaro HORIGUCHI
Date:
At Sat, 26 Aug 2017 14:45:20 -0700, Jeff Janes <jeff.janes@gmail.com> wrote in
<CAMkU=1zP931=v-ypFZMDxHAGGQwuVNUPONZE5_ggV+onkXEdvQ@mail.gmail.com>
> On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> 
> > On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> >> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> >>
> >>
> >>
> >
> >> > Wouldn't it
> >> > better, and much simpler, just to have reverted the first change once
> >> the
> >> > second one was done?
> >>
> >> I think both can actually happen independently, no? It's pretty common
> >> for the page lsn to be *older* than the corresponding commit.  In fact
> >> you'll always hit that case unless there's also concurrent writes also
> >> touching said page.
> >>
> >
> > That is true, but I don't see any residual regression when going from
> > pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied.  So I
> > think the hint bit issue only shows up when hitting the same page
> > aggressively, in which case the LSN does get advanced quickly enough that
> > db76b1efbba solves it.  Perhaps a different test case could show that
> > issue.  I also didn't see any improvement from the original 4de82f7d7c50a81e,
> > so maybe 8 CPUs just isn't enough to detect the problem with setting
> > hint-buts with permanent tables with synchronous_commit=false.
> >
> 
> I've refined my benchmarking, using a smaller scaling factor (8, same as my
> CPU count) and running the tests for longer. The smaller size means there
> are more intensive updates on individual pgbench_accounts pages, and the
> longer run times allows be unset hint bits to build up for longer (usually
> I like short test runs with a large number of randomized repetitions, but
> that doesn't work well here).  Since the scale factor is the same as the
> number of CPUs, I've bumped the thread count so that it is more likely
> someone will choose a non-contended value of pgbench_branches.bid to update
> at any given moment.
> 
> pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set
> synchronous_commit=false'
> 
> This testing did show the importance of waking up the wal writer so that
> hint bits can be set:
> 
> commit     TPS
> cfafd8beadcd6f 22200.48
> cfafd8beadcd6f_nowake 30766.83
> median of 14 runs reported, p-val on difference of means is 2e-22.
> 
> Where nowake is a hackish disabling of the wake up introduced in
> 4de82f7d7c50a8,
> forward ported to 9.6 branch.  (I still wake it if is is doing the big
> sleep, I just took out the part that wake it up even from small sleeps)
> 
> So my revised benchmark is capable of detecting this effect, even with only
> 8 CPUs.
> 
> When I move to next commit where we set hint bits as long as the page was
> re-dirtied after, get these numbers:
> 
> db76b1efbbab24 30742.69
> db76b1efbbab24_nowake 31072.97
> 
> The difference between these is not statistically different, nor is the
> difference between them and cfafd8beadcd6f.
> 
> So I think the logic you introduced in db76b1efbbab24 captures all the gain
> there is to be captured, and 4de82f7d7c50a8 can be reverted.  The only
> reason to wake up the wal writer is if it is taking the big sleep.
> 
> Of course, I can't prove a negative.  There could always be some test case
> which demonstrates that 4de82f7d7c50a8 still matters in spite of
> db76b1efbbab24.
> So to be ultra-conservative, attached is a patch which should preserve all
> wake-ups other than the ones known to be useless.
> 
> 7975c5e0a 27810.66
> 7975c5e0a_patched 30821.41
> 
> So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on
> difference of the means).  It repeatedly wakes the wal-writer up to flush a
> page which the wal-writer simply refuses to flush.  This can't serve any
> purpose.  My patch still wakes it up to write the page if it has not been
> written yet, and also still wakes it to flush a range of pages of
> WalWriterFlushAfter is met.  But it just doesn't wake it up to flush a page
> which has already been written and will not get flushed.
> 
> I'd prefer the code-simplifying change of just not waking it up anymore
> except for when it is doing the big sleep, but I can't reliably detect a
> performance difference between that and the more conservative change posted
> here.

FWIW I tried this patch, taking the following numbers using
pgbench --unlogged-tables -T 60 -j 8 -c 8. All numbers are from
the second of two successive runs.

TA: # of times XLogSetAsyncXactLSN was called while the wal writer was awake.
TB: # of times XLogSetAsyncXactLSN skipped SetLatch in the TA case.
WA: # of times WalWriterMain called XLogBackgroundFlush().
WB: # of times XLogBackgroundFlush() skipped flushing in the WA case.


A: without this patch : 1745 tps (wal_writer_flush_after = 1MB (default))

WB/WA =      9/141100 : 0.9% of all XLogBackgroundFlush() did nothing. (woke by SetLatch 141091 times)
TB/TA =  27300/103000 : 26% of SetLatch is avoided. (12899 times called)


B: with this patch : 2041 tps (wal_writer_flush_after = 1MB (default))

WB/WA =     46/  1600 : 2.8% of all XLogBackgroundFlush() did nothing. (woke by SetLatch 1553 times)
TB/TA = 115822/118000 : 97% of SetLatch is avoided. (4178 times called)


C: with this patch : 2033 tps (but wal_writer_flush_after = 0)

WB/WA =    117/  1700 : 6.9% of all XLogBackgroundFlush() did nothing. (woke by SetLatch 1582 times)
TB/TA = 110615/118000 : 93% of SetLatch is avoided. (7785 times called)


The condition for skipping SetLatch in the latest patch (v2)
looks reasonable, and the effect is obvious, at least in this
test. The needless SetLatch calls are very effectively reduced,
actually improving performance. The increased number of bytes
flushed at once seems to be the main cause of the performance
improvement.

Though I am a bit uneasy about having similar code on both sides
(XLogBackgroundFlush and XLogSetAsyncXactLSN), +1 to this from
me.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] subscription worker signalling wal writer too much

From
Robert Haas
Date:
On Tue, Sep 19, 2017 at 8:48 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Though a bit uneasy to have similar code on both side
> (XLogBackgroundFlush and XLogSetAsyncXactLSN) but +1 to this from
> me.

This patch wasn't formatted very nicely; attached is a version that
pgindent likes better and doesn't bust past 80 columns.

I think this patch needs significantly better comments.  There's no
explanation of the rather complex formula the patch proposed to use.
Maybe it's just copied from someplace else that does have a comment,
but in that case, the comments here should mention that.  And the
comment in the other place should probably be updated too, like:

/* Make sure to keep this in sync with the similar logic in
XLogDoAwesomeStuff() */

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] subscription worker signalling wal writer too much

From
Michael Paquier
Date:
On Thu, Oct 5, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This patch wasn't formatted very nicely; attached is a version that
> pgindent likes better and doesn't bust past 80 columns.
>
> I think this patch needs significantly better comments.  There's no
> explanation of the rather complex formula the patch proposed to use.
> Maybe it's just copied from someplace else that does have a comment,
> but in that case, the comments here should mention that.  And the
> comment in the other place should probably be updated too, like:
>
> /* Make sure to keep this in sync with the similar logic in
> XLogDoAwesomeStuff() */

I have just gone through this thread, and reducing the number of times
the WAL writer is woken up is something worth doing. The last version
is v3, which Robert sent in
https://www.postgresql.org/message-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
As far as I can see, the smarter formula presented here comes from
XLogBackgroundFlush(). Jeff, could you update the patch to reflect
that the new condition you are adding comes from there? I am switching
the patch to "waiting on author" for now.
--
Michael


Re: [HACKERS] subscription worker signalling wal writer too much

From
Michael Paquier
Date:
On Tue, Nov 21, 2017 at 3:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I have just gone through this thread, and reducing the amount of times
> the WAL writer is woken up is something worth doing. The last version
> if v3 that Robert has sent in
> https://www.postgresql.org/message-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
> As far as I can see, the smarter formula presented here comes from
> XLogBackgroundFlush(). Jeff, could you update the patch to reflect
> that the new condition you are adding comes from there? I am switching
> the patch as "waiting on author" for now.

It has been one week since this update; in consequence, I am marking
this patch as returned with feedback for now. Please make sure to
register a new CF entry if you can get a new version out.
-- 
Michael