Re: [HACKERS] subscription worker signalling wal writer too much - Mailing list pgsql-hackers
From: Jeff Janes
Subject: Re: [HACKERS] subscription worker signalling wal writer too much
Msg-id: CAMkU=1znvaWoxD6Rp6qFjmXdzCkD2_PWozJKr3_jENppB-3XRA@mail.gmail.com
In response to: Re: [HACKERS] subscription worker signalling wal writer too much (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch. I think the reason
> is that when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice. The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch. It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.
I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless. In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There are also a lot of
superfluous context switches. I think we can avoid doing so quite
easily, as e.g. with the attached patch. Could you check how much that
helps your benchmark?
That patch doesn't improve the pgbench --unlogged-tables benchmark. I expected that, because this benchmark exposes the opposite problem: the wal writer isn't busy; it is constantly being woken up, only to go immediately back to sleep.
> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written. This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
>
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long as the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
>
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them. It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
>
> It seems like the third change rendered the first one useless.
I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?
If we wanted to do that, I think the asynchronous committer should just issue the write directly (unless wal_sync_method is open_datasync or open_sync). Writes are usually extremely fast, and it is not worth trying to offload them to a background process, as your half of the signalling takes longer than just doing the write yourself. When that is not true and writes start blocking due to extreme IO congestion, then everything is going to start blocking anyway (the wal_buffers head will run into the tail, and the tail can't advance because the writes are blocking), so while offloading writes to a background process is then good in theory, it isn't very effective. If everyone is stuck, it doesn't matter which process is directly stuck and which ones are indirectly stuck.
I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing. Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work. I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
if (ProcGlobal->walwriterLatch)
    SetLatch(ProcGlobal->walwriterLatch);
My understanding is that you can't start on a new file until the old file is completely synced, because the bookkeeping can currently only handle one file at a time. So if you signal the wal writer to do the sync, you would just end up immediately blocking on it to finish. Getting around that would require substantially more changes: we would need to implement a queue of full log files not yet synced, so that we could move on without waiting. If we had such a queue, then sure, it would be a good idea to just wake the wal writer. But currently, this only matters for large transactions (\copy, insert ... select, bulk updates). With small transactions, you can't fill up log files quickly enough for it to make much of a difference. So I think this is something to do, but it is independent of what I am currently going on about.
> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.
That is true, but I don't see any residual regression when going from pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I think the hint bit issue only shows up when hitting the same page aggressively, in which case the LSN does get advanced quickly enough that db76b1efbba solves it. Perhaps a different test case could show that issue. I also didn't see any improvement from the original 4de82f7d7c50a81e, so maybe 8 CPUs just isn't enough to detect the problem with setting hint bits on permanent tables with synchronous_commit=false.
> If that were done, would the third change still be
> needed? (It did seem to add some other features as well, so I'm going to
> assume we still want those...).
It's still useful, yes. It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.
But the only reason it would do that is because we keep waking it up. If we stop waking it up, it would stop flushing each page, even if it still had the logic to flush for each page there.
> But now the first change is even worse than useless, it is positively
> harmful. The only thing to stop it from waking the WALWriter for every
> async commit is this line:
>
> /* if we have already flushed that far, we're done */
> if (WriteRqstPtr <= LogwrtResult.Flush)
> return;
>
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> become true due to us waking walwriter; it only becomes true once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded. So now every async commit is
> waking the walwriter. Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.
s/completely new patch/completely new page/?
Yes. My mental tab completion sometimes goes awry.
In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic. Have to think about
this some more...
I think if we want to write each page, we should do it in the foreground. Conceptually, it could just be another enum value in synchronous_commit, but synchronous_commit is getting rather unwieldy already.
Cheers,
Jeff