Re: [PATCH] Improve performance of NOTIFY over many databases (issueblocking on AccessExclusiveLock on object 0 of class 1262 of database 0) - Mailing list pgsql-hackers

From Martijn van Oosterhout
Subject Re: [PATCH] Improve performance of NOTIFY over many databases (issueblocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
Date
Msg-id CADWG95sxuQur+5FyM5iKYj=bXgdN4iJsVSL5oSMxPmNbVcCC+g@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
List pgsql-hackers
Hoi Tom,

Thank you for the detailed response. Sorry the delay, I was on holiday.

You are absolutely correct when you point out that the queue pointers
were accessed without the lock and this is probably unsafe.  For the
first two patches this is can be remedied, though I find it makes the
logic a bit harder to follow.  The comments will need to be updated to
reflect the new logic. I hope to post something soon.

As for your point about the third patch, you are right that it's
probably not saving many cycles. However I do think it's worthwhile
actually optimising this loop, because the number of backends that are
listening is likely to be much smaller than the total number of
backends, so there's a lot of cycles being wasted here already. Fixing
the thundering herd issue (like in sinval as you point out) doesn't
actually reduce the amount of work being done, just spreads it out.
Since readers and writers block each other, blocking a writer means
blocking commits across the whole cluster.

There are a number of possible improvements here:

1. Do what sinval does and separate the reader and writer locks so
they can't block each other. This is the ultimate solution, but it's a
significant refactor and it's not clear that's actually worthwhile
here. This would almost be adopting the sinvaladt structure wholesale.

2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.

3. The other idea from sinval where you only wake up one worker at a
time is a good one as you point out. This seems quite doable, however,
it seems wasteful to try and wake everyone up the moment we switch to
a new page. The longer you delay the lower the chance you need to wake
anyone at all because they've because they'll have caught up by
themselves. A single SLRU page can hold hundreds, or even thousands of
messages.

Do 2 & 3 seem like a good direction to go? I can probably work something up.

Thanks in advance,
Martijn


> Martijn van Oosterhout <kleptog@gmail.com> writes:
> > Please find attached updated versions of the patches, I've now tested
> > them. Also attached is a reproduction script to verify that they
> > actually work.
>
> I looked through these (a bit cursorily).
>
> I'm generally on board with the idea of 0001, but not with the patch
> details.  As coded, asyncQueueAdvanceTail is supposing that it can
> examine the shared QUEUE_HEAD and QUEUE_TAIL pointers without any
> lock whatsoever.  That's probably unsafe, and if it is safe for some
> reason, you haven't made the argument why.  Moreover, it seems
> unnecessary to make any such assumption.  Why not put back the
> advanceTail tests you removed, but adjust them so that advanceTail
> isn't set true unless QUEUE_HEAD and QUEUE_TAIL point to different
> pages?  (Note that in the existing coding, those tests are made
> while holding an appropriate lock, so it's safe to look at those
> pointers there.)
>
> It might be a good idea to make a macro encapsulating this new,
> more complicated rule for setting advanceTail, instead of relying
> on keeping the various call sites in sync.
>
> More attention to comments is also needed.  For instance, you've
> made a lie out of the documentation of the tail pointer:
>
>     QueuePosition tail;         /* the global tail is equivalent to the pos of
>                                  * the "slowest" backend */
>
> It needs to say something like "is <= the pos of the slowest backend",
> instead.  I think the explanation of why this algorithm is good could
> use more effort, too.
>
> Comments for 0002 are about the same: for no explained reason, and
> certainly no savings, you've put the notify_all test in an unsafe
> place rather than a safe one (viz, two lines down, *after* taking
> the relevant lock).  And 0002 needs more commentary about why
> its optimization is safe and useful, too.  In particular it's
> not obvious why QUEUE_HEAD being on a different page from QUEUE_TAIL
> has anything to do with whether we should wake up other backends.
>
> I'm not very persuaded by 0003, mainly because it seems likely to
> me that 0001 and 0002 will greatly reduce the possibility that
> the early-exit can happen.  So it seems like it's adding cycles
> (in a spot where we hold exclusive lock) without a good chance of
> saving any cycles.
>
> Taking a step back in hopes of seeing the bigger picture ...
> as you already noted, these changes don't really fix the "thundering
> herd of wakeups" problem, they just arrange for it to happen
> once per SLRU page rather than once per message.  I wonder if we
> could improve matters by stealing an idea from the sinval code:
> when we're trying to cause advance of the global QUEUE_TAIL, waken
> only the slowest backend, and have it waken the next-slowest after
> it's done.  In sinval there are some additional provisions to prevent
> a nonresponsive backend from delaying matters for other backends,
> but I think maybe we don't need that here.  async.c doesn't have
> anything equivalent to sinval reset, so there's no chance of
> overruling a slow backend's failure to advance its pos pointer,
> so there's not much reason not to just wait till it does do so.
>
> A related idea is to awaken only one backend at a time when we
> send a new message (i.e., advance QUEUE_HEAD) but I think that
> would likely be bad.  The hazard with the chained-wakeups method
> is that a slow backend blocks wakeup of everything else.  We don't
> care about that hugely for QUEUE_TAIL advance, because we're just
> hoping to free some SLRU space.  But for QUEUE_HEAD advance it'd
> mean failing to deliver notifies in a timely way, which we do care
> about.  (Also, if I remember correctly, the processing on that side
> only requires shared lock so it's less of a problem if many backends
> do it at once.)
>
>                         regards, tom lane



-- 
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/



pgsql-hackers by date:

Previous
From: Amit Khandekar
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Daniel Gustafsson
Date:
Subject: Re: pg_upgrade version checking questions