Thread: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Robert Haas
Date:
Kevin didn't send out an official gavel-banging announcement of the
end of CommitFest 2009-07 (possibly because I neglected until today to
give him privileges to actually change it in the web application), but
I'd just like to take a minute to thank him publicly for his efforts.
We started this CommitFest with something like 60 patches, which is
definitely on the larger side for a CommitFest, and Kevin did a great
job staying on top of what was going on with all of them and, I felt,
really helped keep us on track.  At the same time, I felt he did this
with a very light touch that made the whole thing go very smoothly.
So -- thanks, Kevin!

I also appreciate the efforts of all those who reviewed.  Good reviews
are really critical to keep the burden from building up on committers,
and I appreciate the efforts of everyone who contributed, in many
cases probably on their own time.  I'm particularly grateful to the
people who were vigilant about spelling, grammar, coding style,
whitespace, and other nitpicky little issues that are not much fun,
but which at least for me are a major time sink if they're still
lingering when it comes time to do the actual commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> I'd just like to take a minute to thank him publicly for his
> efforts.  We started this CommitFest with something like 60
> patches, which is definitely on the larger side for a CommitFest,
> and Kevin did a great job staying on top of what was going on with
> all of them and, I felt, really helped keep us on track.  At the
> same time, I felt he did this with a very light touch that made
> the whole thing go very smoothly.  So -- thanks, Kevin!
You're welcome.  It was educational for me.  I don't think I want to
try to handle two in a row, and I think your style is better suited
than mine to the final CF for a release, but I might be able to take
on the 2010-11 CF if people want that.
My hand was not always so light behind the scenes, though -- I sent
or received about 100 off-list emails to try to keep things moving. 
Hopefully nobody was too offended by my nagging.  :-)
Oh, and thanks for putting together the CF web application.  Without
that, I couldn't have done half as well as I did.
> I also appreciate the efforts of all those who reviewed.
Yes, I'll second that!  I've always been impressed with the
PostgreSQL community, and managing this CF gave me new insights and
appreciation for the intelligence, professionalism, and community
spirit of its members -- authors, reviewers, and committers.
-Kevin


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Greg Smith
Date:
Kevin Grittner wrote:
> I don't think I want to try to handle two in a row, and I think your style is better suited
> than mine to the final CF for a release, but I might be able to take on the 2010-11 CF if people want that

Ha, you just put yourself right back on the hook with that comment, and 
Robert does seem like the right guy for CF-4 @ 2011-01.  Leaving the 
question of what's going to happen with CF-2 next month.

I think the crucial thing with the 2010-09 CF is that we have to get 
serious progress made sorting out all the sync rep ideas before/during 
that one.  The review Yeb did and subsequent discussion was really 
helpful, but the scope on that needs to actually get nailed down to 
*something* concrete if it's going to get built early enough in the 9.1 
release to be properly reviewed and tested for more than one round.  
Parts of the design and scope still feel like they're expanding to me, 
and I think having someone heavily involved in the next CF who is 
willing to push on nailing down that particular area is pretty 
important.  Will volunteer myself if I can stay on schedule to make it 
past the major time commitment sink I've had so far this year by then.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Robert Haas
Date:
On Wed, Aug 18, 2010 at 7:46 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>>
>> I don't think I want to try to handle two in a row, and I think your style
>> is better suited
>> than mine to the final CF for a release, but I might be able to take on
>> the 2010-11 CF if people want that
>
> Ha, you just put yourself right back on the hook with that comment, and
> Robert does seem like the right guy for CF-4 @ 2011-01.  Leaving the
> question of what's going to happen with CF-2 next month.

My reputation precedes me, apparently.  Although I appreciate everyone
so far being willing to avoid mentioning exactly what that reputation
might be.  :-)

> I think the crucial thing with the 2010-09 CF is that we have to get serious
> progress made sorting out all the sync rep ideas before/during that one.
>  The review Yeb did and subsequent discussion was really helpful, but the
> scope on that needs to actually get nailed down to *something* concrete if
> it's going to get built early enough in the 9.1 release to be properly
> reviewed and tested for more than one round.  Parts of the design and scope
> still feel like they're expanding to me, and I think having someone heavily
> involved in the next CF who is willing to push on nailing down that
> particular area is pretty important.  Will volunteer myself if I can stay on
> schedule to make it past the major time commitment sink I've had so far this
> year by then.

Sitting on Sync Rep is a job and a half by itself, without adding all
the other CF work on top of it.  Maybe we should try to find two
vi^Holunteers: a CommitFest Manager (CFM) and a Major Feature
Babysitter (MBS).  At any rate, we should definitely NOT wait another
month to start thinking about Sync Rep again.  I haven't actually
looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed
the view that the biggest thing standing in the way of a halfway
decent Sync Rep implementation was a number of polling loops that
needed to be replaced with something that wouldn't introduce
up-to-100ms delays.  And so far we haven't seen a patch for that.
Somebody write one.  And then let's get it reviewed and committed RSN.

It may seem like we're early in the release cycle yet, but for a
feature of this magnitude we are not.  We committed way too much big
stuff at the very end of the last release cycle; Hot Standby was still
being cleaned up in May after commit in November.  We'll be lucky to
commit sync rep that early.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 04:46, Robert Haas wrote:
> At any rate, we should definitely NOT wait another
> month to start thinking about Sync Rep again.

Agreed. EnterpriseDB is interested in having that feature, so I'm on the 
hook to spend time on it regardless of commitfests.

> I haven't actually
> looked at any of the Sync Rep code AT ALL but IIRC Heikki expressed
> the view that the biggest thing standing in the way of a halfway
> decent Sync Rep implementation was a number of polling loops that
> needed to be replaced with something that wouldn't introduce
> up-to-100ms delays.

Well, that's the only uncontroversial thing about it that doesn't 
require any fighting over the UI or desired behavior. That's why I've 
focused on that first, and also because it's useful regardless of 
synchronous replication. But once that's done, we'll have to nail down 
how synchronous replication is supposed to behave, and how to configure it.

>  And so far we haven't seen a patch for that.
> Somebody write one.  And then let's get it reviewed and committed RSN.

Fujii is on vacation, but I've started working on it. The two issues 
with Fujii's latest patch are that it would not respond promptly on 
platforms where signals don't interrupt sleep, and it suffers the 
classic race condition that pselect() was invented for. I'm going to 
replace pg_usleep() with select(), and use the so called "self-pipe 
trick" to get over the race condition. I have that written up but I want 
to do some testing and cleanup before posting the patch.

>   It may seem like we're early in the release cycle yet, but for a
> feature of this magnitude we are not.  We committed way too much big
> stuff at the very end of the last release cycle; Hot Standby was still
> being cleaned up in May after commit in November.  We'll be lucky to
> commit sync rep that early.

Agreed. We need to decide the scope and minimum set of features real 
soon to get something concrete finished.

BTW, on what platforms signals don't interrupt sleep? Although that 
issue has been discussed many times before, I couldn't find any 
reference to a real platform in the archives.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> BTW, on what platforms signals don't interrupt sleep? Although that 
> issue has been discussed many times before, I couldn't find any 
> reference to a real platform in the archives.

I've got one in captivity (my old HPUX box).  Happy to test whatever you
come up with.

Considering that pg_usleep is implemented with select, I'm not following
what you mean by "replace pg_usleep() with select()"?
        regards, tom lane


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Alvaro Herrera
Date:
Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
> On 19/08/10 04:46, Robert Haas wrote:

> >  And so far we haven't seen a patch for that.
> > Somebody write one.  And then let's get it reviewed and committed RSN.
> 
> Fujii is on vacation, but I've started working on it. The two issues 
> with Fujii's latest patch are that it would not respond promptly on 
> platforms where signals don't interrupt sleep, and it suffers the 
> classic race condition that pselect() was invented for. I'm going to 
> replace pg_usleep() with select(), and use the so called "self-pipe 
> trick" to get over the race condition. I have that written up but I want 
> to do some testing and cleanup before posting the patch.

Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
select() doesn't handle pipes, only sockets.  You may need some extra
hack to make it work there.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Magnus Hagander
Date:
On Thu, Aug 19, 2010 at 17:08, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
>> On 19/08/10 04:46, Robert Haas wrote:
>
>> >  And so far we haven't seen a patch for that.
>> > Somebody write one.  And then let's get it reviewed and committed RSN.
>>
>> Fujii is on vacation, but I've started working on it. The two issues
>> with Fujii's latest patch are that it would not respond promptly on
>> platforms where signals don't interrupt sleep, and it suffers the
>> classic race condition that pselect() was invented for. I'm going to
>> replace pg_usleep() with select(), and use the so called "self-pipe
>> trick" to get over the race condition. I have that written up but I want
>> to do some testing and cleanup before posting the patch.
>
> Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
> select() doesn't handle pipes, only sockets.  You may need some extra
> hack to make it work there.

We have a pipe implementation using sockets that is used on Windows
for just this reason, IIRC.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 18:08, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of Thu Aug 19 02:02:34 -0400 2010:
>> On 19/08/10 04:46, Robert Haas wrote:
>
>>>   And so far we haven't seen a patch for that.
>>> Somebody write one.  And then let's get it reviewed and committed RSN.
>>
>> Fujii is on vacation, but I've started working on it. The two issues
>> with Fujii's latest patch are that it would not respond promptly on
>> platforms where signals don't interrupt sleep, and it suffers the
>> classic race condition that pselect() was invented for. I'm going to
>> replace pg_usleep() with select(), and use the so called "self-pipe
>> trick" to get over the race condition. I have that written up but I want
>> to do some testing and cleanup before posting the patch.
>
> Hmm, IIRC the self-pipe trick doesn't work on Windows, mainly because
> select() doesn't handle pipes, only sockets.  You may need some extra
> hack to make it work there.

That's fine, we can do the naive set-a-flag implementation on Windows 
because on Windows our "signals" are only delivered at 
CHECK_FOR_INTERRUPTS(), so we don't have the race condition to begin with.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 16:38, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> BTW, on what platforms signals don't interrupt sleep? Although that
>> issue has been discussed many times before, I couldn't find any
>> reference to a real platform in the archives.
>
> I've got one in captivity (my old HPUX box).  Happy to test whatever you
> come up with.
>
> Considering that pg_usleep is implemented with select, I'm not following
> what you mean by "replace pg_usleep() with select()"?

Instead of using pg_usleep(), call select() directly, waiting not only 
for the timeout, but also for data to arrive on the "self-pipe". The 
signal handler writes a byte to the self-pipe, waking up the select(). 
That way the select() is interrupted by the signal arriving, even if 
signals per se don't interrupt it. And it closes the race condition 
involved with setting a flag in the signal handler and checking that in 
the main loop.
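For illustration, a minimal stand-alone sketch of the self-pipe trick as
described above (the names, and the choice of SIGUSR1, are mine, not
anything from the actual patch):

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

static int selfpipe[2];                 /* [0] = read end, [1] = write end */

static void wakeup_handler(int signo)
{
    int save_errno = errno;
    (void) write(selfpipe[1], "x", 1);  /* wakes up the select() below */
    errno = save_errno;
}

static void selfpipe_init(void)
{
    pipe(selfpipe);
    /* Non-blocking, so the handler can never stall on a full pipe
     * and draining can never block on an empty one. */
    fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
    fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
    signal(SIGUSR1, wakeup_handler);
}

/* Sleep for up to 'usecs', returning early if SIGUSR1 arrives.  If the
 * signal arrived just before select() was entered, the byte is already
 * sitting in the pipe, so select() returns at once: no race. */
static void interruptible_sleep(long usecs)
{
    fd_set rfds;
    struct timeval tv;
    char buf[16];

    FD_ZERO(&rfds);
    FD_SET(selfpipe[0], &rfds);
    tv.tv_sec = usecs / 1000000L;
    tv.tv_usec = usecs % 1000000L;
    (void) select(selfpipe[0] + 1, &rfds, NULL, NULL, &tv);

    /* Clear the pipe after waking, so stale bytes don't cause a
     * spurious early return from the next sleep. */
    while (read(selfpipe[0], buf, sizeof(buf)) > 0)
        ;
}
```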

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 19/08/10 16:38, Tom Lane wrote:
>> Considering that pg_usleep is implemented with select, I'm not following
>> what you mean by "replace pg_usleep() with select()"?

> Instead of using pg_usleep(), call select() directly, waiting not only 
> for the timeout, but also for data to arrive on the "self-pipe". The 
> signal handler writes a byte to the self-pipe, waking up the select(). 
> That way the select() is interrupted by the signal arriving, even if 
> signals per se don't interrupt it. And it closes the race condition 
> involved with setting a flag in the signal handler and checking that in 
> the main loop.

Hmm, but couldn't you still do that inside pg_usleep?  Signal handlers
that do that couldn't know if they were interrupting a sleep per se,
so this would have to be a backend-wide convention.
        regards, tom lane


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 19:57, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 19/08/10 16:38, Tom Lane wrote:
>>> Considering that pg_usleep is implemented with select, I'm not following
>>> what you mean by "replace pg_usleep() with select()"?
>
>> Instead of using pg_usleep(), call select() directly, waiting not only
>> for the timeout, but also for data to arrive on the "self-pipe". The
>> signal handler writes a byte to the self-pipe, waking up the select().
>> That way the select() is interrupted by the signal arriving, even if
>> signals per se don't interrupt it. And it closes the race condition
>> involved with setting a flag in the signal handler and checking that in
>> the main loop.
>
> Hmm, but couldn't you still do that inside pg_usleep?  Signal handlers
> that do that couldn't know if they were interrupting a sleep per se,
> so this would have to be a backend-wide convention.

I don't understand, do what inside pg_usleep()?

We only need to respond quickly to one signal, the one that tells 
walsender "there's some new WAL that you should send". We can rely on 
polling for all the other signals like SIGHUP for config reload or 
shutdown request, like we do today.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 19/08/10 19:57, Tom Lane wrote:
>> Hmm, but couldn't you still do that inside pg_usleep?  Signal handlers
>> that do that couldn't know if they were interrupting a sleep per se,
>> so this would have to be a backend-wide convention.

> I don't understand, do what inside pg_usleep()?

I was envisioning still using pg_usleep, and having this interaction
between signal handlers and the delay be private to pg_usleep, rather
than putting such ugly code into forty-nine different places.  There
are a *lot* of places where we have loops that break down delays
into at-most-one-second pg_usleep calls, and if we're going to have a
hack like this we should fix them all, not just two or three that SR
cares about.

But I'm still not seeing how this self-pipe hack avoids a race
condition.  If the signal handler is sending a byte whenever it
executes, then you could have bytes already stacked up in the pipe
from previous interrupts that didn't happen to come while inside
pg_usleep.  If you clear those before sleeping, you have a race
condition, and if you don't, then you fail to sleep the intended
amount of time even though there was no interrupt this time.
        regards, tom lane


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 20:18, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 19/08/10 19:57, Tom Lane wrote:
>>> Hmm, but couldn't you still do that inside pg_usleep?  Signal handlers
>>> that do that couldn't know if they were interrupting a sleep per se,
>>> so this would have to be a backend-wide convention.
>
>> I don't understand, do what inside pg_usleep()?
>
> I was envisioning still using pg_usleep, and having this interaction
> between signal handlers and the delay be private to pg_usleep, rather
> than putting such ugly code into forty-nine different places.  There
> are a *lot* of places where we have loops that break down delays
> into at-most-one-second pg_usleep calls, and if we're going to have a
> hack like this we should fix them all, not just two or three that SR
> cares about.

Hmm, will need to think about a suitable API for that. The nice thing 
would be that we could implement it using pselect() where available. 
(And reliable - the Linux select() man page says that glibc's pselect() 
is emulated using select(), and suffers from the very same race 
condition pselect() was invented to solve. How awful is that!?)

> But I'm still not seeing how this self-pipe hack avoids a race
> condition.  If the signal handler is sending a byte whenever it
> executes, then you could have bytes already stacked up in the pipe
> from previous interrupts that didn't happen to come while inside
> pg_usleep.  If you clear those before sleeping, you have a race
> condition, and if you don't, then you fail to sleep the intended
> amount of time even though there was no interrupt this time.

You clear the pipe after waking up. Before sending all the pending WAL 
and deciding that you're fully caught up again:

for (;;)
{
    1. select()
    2. clear pipe
    3. send any pending WAL
}

There's more information on the self-pipe trick at e.g. 
http://lwn.net/Articles/177897/

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 19/08/10 20:18, Tom Lane wrote:
>> But I'm still not seeing how this self-pipe hack avoids a race
>> condition.  If the signal handler is sending a byte whenever it
>> executes, then you could have bytes already stacked up in the pipe
>> from previous interrupts that didn't happen to come while inside
>> pg_usleep.  If you clear those before sleeping, you have a race
>> condition, and if you don't, then you fail to sleep the intended
>> amount of time even though there was no interrupt this time.

> You clear the pipe after waking up.

Hmm ... that doesn't answer my second objection about failing to sleep
the expected amount of time, but on reflection I guess that can't be
done: we have to be able to cope with interrupts occurring just before
the sleep actually begins, and there's no way to define "just before"
except by reference to the calling code having done whatever it might
need to do before/after sleeping.

> Hmm, will need to think about a suitable API for that.

Offhand I'd suggest something like

SetSleepInterrupt()    -- called by signal handlers, writes pipe
ClearSleepInterrupt()    -- called by sleep-and-do-something loops, clears pipe

pg_usleep() itself remains the same, but it is now guaranteed to return
immediately if SetSleepInterrupt is called, or has been called since the
last ClearSleepInterrupt.

> The nice thing 
> would be that we could implement it using pselect() where available. 
> (And reliable - the Linux select() man page says that glibc's pselect() 
> is emulated using select(), and suffers from the very same race 
> condition pselect() was invented to solve. How awful is that!?)

Ick.  So how would you tell if it's trustworthy?
        regards, tom lane


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
tomas@tuxteam.de
Date:

On Thu, Aug 19, 2010 at 08:36:09PM +0300, Heikki Linnakangas wrote:

[...]

> Hmm, will need to think about a suitable API for that. The nice thing would 
> be that we could implement it using pselect() where available. (And 
> reliable - the Linux select() man page says that glibc's pselect() is 
> emulated using select(), and suffers from the very same race condition 
> pselect() was invented to solve. How awful is that!?)

It is indeed. It seems, though, that from Linux kernel 2.6.16 and glibc
2.4 on, things look better [1]. As a reference, Debian stable (not known
to adventure too far into the present ;-) is libc 2.7 on kernel 2.6.26.

Of course, "enterprise" GNU/Linux distros are said to be even more
conservative...

[1] <http://lwn.net/Articles/176911/>

Regards
-- tomás


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 19/08/10 20:59, Tom Lane wrote:
> Offhand I'd suggest something like
>
> SetSleepInterrupt()    -- called by signal handlers, writes pipe
> ClearSleepInterrupt()    -- called by sleep-and-do-something loops, clears pipe
>
> pg_usleep() itself remains the same, but it is now guaranteed to return
> immediately if SetSleepInterrupt is called, or has been called since the
> last ClearSleepInterrupt.

Hmm, we have pg_usleep() calls in some fairly low-level functions, like 
mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we 
don't want those pg_usleep()s to return immediately. And pg_usleep() is 
used in some client code too. I think we need a separate sleep function 
for this.

Another idea is to not use unix signals at all, but ProcSendSignal() and 
ProcWaitForSignal(). We would not need the signal handler at all. 
Walsender would use ProcWaitForSignal() instead of pg_usleep() and 
backends that want to wake it up would use ProcSendSignal(). The problem 
is that there is currently no way to specify a timeout, but I presume 
the underlying semaphore operations have that capability, and we could 
expose it.

Actually ProcSendSignal()/ProcWaitForSignal() won't work as is, because 
walsender doesn't have a PGPROC entry, but you could easily build a 
similar mechanism, using PGSemaphoreLock/Unlock like 
ProcSendSignal()/ProcWaitForSignal() does.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Hmm, we have pg_usleep() calls in some fairly low-level functions, like 
> mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we 
> don't want those pg_usleep()s to return immediately. And pg_usleep() is 
> used in some client code too. I think we need a separate sleep function 
> for this.

Well, we'd need some careful thought about which sleeps need what, but I
don't necessarily have an objection to a separate interruptable sleep
function.

> Another idea is to not use unix signals at all, but ProcSendSignal() and 
> ProcWaitForSignal(). We would not need the signal handler at all. 
> Walsender would use ProcWaitForSignal() instead of pg_usleep() and 
> backends that want to wake it up would use ProcSendSignal().

You keep on proposing solutions that only work for walsender :-(.
Most of the other places where we have pg_usleep actually do want
a timed sleep, I believe.  It's also unclear that we can always expect
ProcSendSignal to be usable --- for example, stuff like SIGHUP would
be sent by processes that might not be connected to shared memory.

> The problem 
> is that there is currently no way to specify a timeout, but I presume 
> the underlying semaphore operations have that capability, and we could 
> expose it.

They don't, or at least the semop-based implementation doesn't.
        regards, tom lane


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Heikki Linnakangas
Date:
On 20/08/10 16:24, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Hmm, we have pg_usleep() calls in some fairly low-level functions, like
>> mdunlink() and s_lock(). If someone has called SetSleepInterrupt(), we
>> don't want those pg_usleep()s to return immediately. And pg_usleep() is
>> used in some client code too. I think we need a separate sleep function
>> for this.
>
> Well, we'd need some careful thought about which sleeps need what, but I
> don't necessarily have an objection to a separate interruptable sleep
> function.

If we have to, we could also support multiple interrupts with multiple 
self-pipes, so that you can choose at pg_usleep() which ones to wake up on.

>> Another idea is to not use unix signals at all, but ProcSendSignal() and
>> ProcWaitForSignal(). We would not need the signal handler at all.
>> Walsender would use ProcWaitForSignal() instead of pg_usleep() and
>> backends that want to wake it up would use ProcSendSignal().
>
> You keep on proposing solutions that only work for walsender :-(.

Well yes, the other places where we use pg_usleep() are not really a 
problem as is. If as a side-effect we can make them respond more quickly 
to signals, with small changes, that's good, but walsender is the only 
one that's performance critical.

That said, a select() based solution is my current favorite.

> Most of the other places where we have pg_usleep actually do want
> a timed sleep, I believe.  It's also unclear that we can always expect
> ProcSendSignal to be usable --- for example, stuff like SIGHUP would
> be sent by processes that might not be connected to shared memory.
>
>> The problem
>> is that there is currently no way to specify a timeout, but I presume
>> the underlying semaphore operations have that capability, and we could
>> expose it.
>
> They don't, or at least the semop-based implementation doesn't.

There's semtimedop(). I don't know how portable it is, but it seems to 
exist at least on Linux, Solaris, HPUX and AIX. On what platforms do we 
use sysv semaphores?
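On platforms that do have it, a timed wait on a SysV semaphore via
semtimedop() would look something like this (illustrative only, given
the portability caveats under discussion):

```c
#define _GNU_SOURCE             /* for semtimedop() on glibc */
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <time.h>

/* Wait for semaphore 0 of 'semid' to become positive, or give up after
 * 'timeout_ms' milliseconds.  Returns 0 on wakeup; -1 with
 * errno == EAGAIN if the timeout expired first. */
static int sem_timed_wait(int semid, long timeout_ms)
{
    struct sembuf op = {0, -1, 0};      /* P: wait for sem > 0, decrement */
    struct timespec ts;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
    return semtimedop(semid, &op, 1, &ts);
}
```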

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!

From
Tom Lane
Date:
[ It's way past time to change the thread title ]

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 20/08/10 16:24, Tom Lane wrote:
>> You keep on proposing solutions that only work for walsender :-(.

> Well yes, the other places where we use pg_usleep() are not really a 
> problem as is.

Well, yes they are.  They cause unnecessary process wakeups and thereby
consume cycles even when the database is idle.  See for example a
longstanding complaint here:
https://bugzilla.redhat.com/show_bug.cgi?id=252129

If we're going to go to the trouble of having a mechanism like this,
I'd like it to fix that problem so I can close out that bug.

> There's semtimedop(). I don't know how portable it is, but it seems to 
> exist at least on Linux, Solaris, HPUX and AIX.

It's not on my HPUX, and I don't see it in the Single Unix Spec.

> On what platforms do we use sysv semaphores?

AFAIK, everything except Windows and extremely old versions of OS X.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 20/08/10 17:28, Tom Lane wrote:
> [ It's way past time to change the thread title ]
>
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 20/08/10 16:24, Tom Lane wrote:
>>> You keep on proposing solutions that only work for walsender :-(.
>
>> Well yes, the other places where we use pg_usleep() are not really a
>> problem as is.
>
> Well, yes they are.  They cause unnecessary process wakeups and thereby
> consume cycles even when the database is idle.  See for example a
> longstanding complaint here:
> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>
> If we're going to go to the trouble of having a mechanism like this,
> I'd like it to fix that problem so I can close out that bug.

Hmm, if you want to put bgwriter and walwriter to deep sleep, then 
someone will need to wake them up when they have work to do. Currently 
they poll. Maybe they should just sleep longer, like 10 seconds, if 
there hasn't been any work to do in the last X wakeups.

We've been designing the new sleep facility so that the event that wakes 
up the sleep is sent from the signal handler in the same process, but it 
seems that all the potential users would actually want to be woken up 
from *another* process, so the signal handler seems like an unnecessary 
middleman. Particularly on Windows where signals are simulated with 
pipes and threads, while you could just send a Windows event directly 
from one process to another.

A common feature that all the users of this facility want is that once 
the event is sent, re-sending it is a fast no-op until re-enabled by the 
receiver. For example, if we need backends to wake up bgwriter after 
dirtying a buffer, you don't want to waste many cycles determining that 
bgwriter is already active and doesn't need to be woken up.

Let's call these "latches". I'm thinking of something like this:

/* Similar to LWLockId */
typedef enum
{
    BgwriterLatch,
    WalwriterLatch,
    /* plus one for each walsender */
} LatchId;

/*
 * Wait for given latch to be set. Only one process can wait
 * for a given latch at a time.
 */
WaitLatch(LatchId latch, long timeout);

/*
 * Sets latch. Returns quickly if the latch is set already.
 */
SetLatch(LatchId latch);

/*
 * Clear the latch. Calling WaitLatch after this will sleep, unless
 * the latch is set again before the WaitLatch call.
 */
ResetLatch(LatchId latch);

There would be a boolean for each latch in shared memory, to indicate if 
the latch is "armed", allowing quick return from SetLatch if the latch 
is already set. Plus a signal to wake up the waiting process (maybe use 
procsignal.c), and the self-pipe trick within the receiving process to 
make it race condition free. On Windows, the signal and the self-pipe 
trick are replaced with Windows events.
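To make that concrete, here is a minimal self-contained sketch of the Unix side of the idea (all names are invented; this is not the patch itself): the armed flag gives the fast no-op on re-send, and the self-pipe makes the wait interruptible from a signal handler.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/select.h>
#include <unistd.h>

static int selfpipe[2];                 /* [0] = read end, [1] = write end */
static volatile sig_atomic_t latch_is_set = 0;

static void
init_latch(void)
{
    if (pipe(selfpipe) != 0)
        _exit(1);
    /* non-blocking, so draining the pipe never hangs */
    fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
    fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
}

/* Arm the latch; re-sending is a fast no-op until the latch is reset. */
static void
set_latch(void)
{
    if (latch_is_set)
        return;                         /* fast path: already armed */
    latch_is_set = 1;
    (void) write(selfpipe[1], "x", 1);  /* wake any sleeping select() */
}

static void
reset_latch(void)
{
    latch_is_set = 0;
}

/* Sleep until the latch is set, then drain the wakeup bytes. */
static void
wait_latch(void)
{
    char    buf[16];
    fd_set  readfds;

    while (!latch_is_set)
    {
        FD_ZERO(&readfds);
        FD_SET(selfpipe[0], &readfds);
        if (select(selfpipe[0] + 1, &readfds, NULL, NULL, NULL) < 0 &&
            errno != EINTR)
            break;
    }
    while (read(selfpipe[0], buf, sizeof(buf)) > 0)
        ;
}
```

A signal handler can safely call set_latch() here, since it only touches a sig_atomic_t flag and calls write(), both of which are async-signal-safe.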

I'll try out this approach tomorrow..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 20/08/10 17:28, Tom Lane wrote:
>> Well, yes they are.  They cause unnecessary process wakeups and thereby
>> consume cycles even when the database is idle.  See for example a
>> longstanding complaint here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>> 
>> If we're going to go to the trouble of having a mechanism like this,
>> I'd like it to fix that problem so I can close out that bug.

> Hmm, if you want to put bgwriter and walwriter to deep sleep, then 
> someone will need to wake them up when they have work to do. Currently 
> they poll. Maybe they should just sleep longer, like 10 seconds, if 
> there hasn't been any work to do in the last X wakeups.

I should probably clarify that I don't expect this patch to solve that
problem all by itself.  But ISTM that a guaranteed-interruptible version
of pg_usleep is a necessary prerequisite before we can even think about
dialing down the database's CPU consumption at idle.

> We've been designing the new sleep facility so that the event that wakes 
> up the sleep is sent from the signal handler in the same process, but it 
> seems that all the potential users would actually want to be woken up 
> from *another* process, so the signal handler seems like an unnecessary 
> middleman.

No, because there are also lots of cases where the signal is arriving
from a non-Postgres process; SIGTERM arriving from init being the killer
case that you simply don't get to propose an alternative design for.
More generally, I'm uninterested in decoupling cases like SIGHUP for
postgresql.conf change from the existing signal mechanism, because it's
just too useful to be able to trigger that externally.

> Particularly on Windows where signals are simulated with 
> pipes and threads, while you could just send a Windows event directly 
> from one process to another.

Windows of course has different constraints, and in particular it lacks
the constraint that we must be able to respond to system-defined
signals.  So I wouldn't object to the underlying implementation being
different for Windows.

> [ "latch" proposal ]

This seems reasonably clean as far as signal conditions generated
internally to Postgres go, but I remain unclear on how it helps for
response to actual signals.
        regards, tom lane


Excerpts from Tom Lane's message of lun ago 23 19:44:02 -0400 2010:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> > [ "latch" proposal ]
> 
> This seems reasonably clean as far as signal conditions generated
> internally to Postgres go, but I remain unclear on how it helps for
> response to actual signals.

This could probably replace the signalling between postmaster and
autovac launcher, as well.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 24/08/10 04:08, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of lun ago 23 19:44:02 -0400 2010:
>> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>
>>> [ "latch" proposal ]
>>
>> This seems reasonably clean as far as signal conditions generated
>> internally to Postgres go, but I remain unclear on how it helps for
>> response to actual signals.
>
> This could probably replace the signalling between postmaster and
> autovac launcher, as well.

Hmm, postmaster needs to stay out of shared memory..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 24/08/10 02:44, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> [ "latch" proposal ]
>
> This seems reasonably clean as far as signal conditions generated
> internally to Postgres go, but I remain unclear on how it helps for
> response to actual signals.

You can set the latch in the signal handler.

Here's a first attempt at implementing that. To demonstrate how it
works, I modified walsender to use the new latch facility, also to
respond quickly to SIGHUP and SIGTERM.

There are two kinds of latches, local and global. Local latches can only
be set from the same process - allowing you to replace pg_usleep() with
something that is always interruptible by signals (by setting the latch
in the signal handler). The global latches work the same, and indeed the
implementation is the same, but the latch resides in shared memory, and
can be set by any process attached to shared memory. On Unix, when you
set a latch waited for by another process, the setter sends SIGUSR1 to
the waiting process, and the signal handler sends the byte to the
self-pipe to wake up the select().

On Windows, we use WaitEvent to wait on a latch, and SetEvent to wake
up. The difference between global and local latches is that for global
latches, the Event object needs to be created upfront at postmaster
startup so that it's inherited by all child processes, and stored in
shared memory. A local Event object can be created only in the process
that needs it.

I put the code in src/backend/storage/ipc/latch.c now, but it probably
ought to go in src/backend/portability instead, with a separate
win32_latch.c file for Windows.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's a first attempt at implementing that. To demonstrate how it works, I
> modified walsender to use the new latch facility, also to respond quickly to
> SIGHUP and SIGTERM.

Great!

> There's two kinds of latches, local and global. Local latches can only be
> set from the same process - allowing you to replace pg_usleep() with
> something that is always interruptible by signals (by setting the latch in
> the signal handler). The global latches work the same, and indeed the
> implementation is the same, but the latch resides in shared memory, and can
> be set by any process attached to shared memory. On Unix, when you set a
> latch waited for by another process, the setter sends SIGUSR1 to the waiting
> process, and the signal handler sends the byte to the self-pipe to wake up
> the select().

According to this explanation, the latch which walsender uses seems to be
a global one. If that's true, walsender should call InitSharedLatch rather than
InitLatch?

> /*
>  * XXX: Should we invent an API to wait for data coming from the
>  * client connection too? It's not critical, but we could then
>  * eliminate the timeout altogether and go to sleep for good.
>  */

Yes, it would be very helpful when walsender waits for the ACK from
the standby in upcoming synchronous replication.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 27/08/10 10:39, Fujii Masao wrote:
> On Thu, Aug 26, 2010 at 7:40 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> There's two kinds of latches, local and global. Local latches can only be
>> set from the same process - allowing you to replace pg_usleep() with
>> something that is always interruptible by signals (by setting the latch in
>> the signal handler). The global latches work the same, and indeed the
>> implementation is the same, but the latch resides in shared memory, and can
>> be set by any process attached to shared memory. On Unix, when you set a
>> latch waited for by another process, the setter sends SIGUSR1 to the waiting
>> process, and the signal handler sends the byte to the self-pipe to wake up
>> the select().
>
> According to this explanation, the latch which walsender uses seems to be
> a global one. If that's true, walsender should call InitSharedLatch rather than
> InitLatch?

Yes, it should.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Tom Lane wrote:
> Well, yes they are.  They cause unnecessary process wakeups and thereby
> consume cycles even when the database is idle.  See for example a
> longstanding complaint here:
> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>
> If we're going to go to the trouble of having a mechanism like this,
> I'd like it to fix that problem so I can close out that bug.
>   

The way the background writer wakes up periodically to absorb fsync 
requests is already way too infrequent on a busy system.  That component 
of the bug report reeks of "not a bug" to me.  Don't like the default?  
Increase bgwriter_delay; move on with your life.  We're not going to 
increase the default, to screw over regular users, just to instead 
prioritize people who prefer low power consumption.  There's a knob for 
it, you can tune the other way if you want.  And based on this pile of 
data I'm sorting through the last few weeks, my guess is that if the 
default is going anywhere in 9.1 it's to make the BGW run *more* often, 
not less.  What those asking for the default change don't realize is 
that if the BGW stops doing useful work, backends will start doing more 
disk writes with their own fsync calls, and now you've just messed with 
how often the disks have to be powered up.  I could probably 
construct a test case that uses more power with the behavior they think 
they want than the current one does.  The only clear case where this is 
always a win is when the system is totally idle.

The complaint that there's no similar way to detune the logger for lower 
power use, something you can't really tweak on your own, is a much more 
reasonable demand.

I have a patch that adds a new column to pg_stat_bgwriter to count how 
often backend fsync calls happen directly, because the background writer 
couldn't be bothered to absorb them.  If this latching idea goes 
somewhere, that should be a reasonable way to measure if the new 
implementation is getting in the way of that particular processing 
requirement.  It's important to realize that the fsync absorb function 
of the background writer has become absolutely critical on modern 
systems that hit high transaction rates, so any refactoring of its basic 
design needs to stay responsive to those incoming backend requests.  I 
don't see any high-level issues with the latch design approach Heikki 
has proposed in regards to that.  I'll take a look at some of the other 
test cases I have here to see if they can help quantify its impact on 
this aspect of BGW behavior.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Tom Lane wrote:
>>
>> Well, yes they are.  They cause unnecessary process wakeups and thereby
>> consume cycles even when the database is idle.  See for example a
>> longstanding complaint here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>>
>> If we're going to go to the trouble of having a mechanism like this,
>> I'd like it to fix that problem so I can close out that bug.
>
> The way the background writer wakes up periodically to absorb fsync requests
> is already way too infrequent on a busy system.

Maybe instead of a fixed-duration sleep we could wake it up when it
needs to do something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 27, 2010 at 5:54 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The way the background writer wakes up periodically to absorb fsync requests
>> is already way too infrequent on a busy system.

> Maybe instead of a fixed-duration sleep we could wake it up when it
> needs to do something.

*Any* fixed delay is going to be too long for some people and not long
enough for others; and the very same system might fall into both
categories at different times of day.  I don't think "make
bgwriter_delay customizable" is an adequate answer.  We've put up with
that so far because it wasn't possible to do better given the
infrastructure we had for waiting; but if we're going to try to improve
the infrastructure, we should have the ambition of getting rid of this
type of problem.
        regards, tom lane


Greg Smith <greg@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Well, yes they are.  They cause unnecessary process wakeups and thereby
>> consume cycles even when the database is idle.  See for example a
>> longstanding complaint here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129

> ...  The only clear case where this is 
> always a win is when the system is totally idle.

If you'll climb down off that horse for a moment: yeah, the idle case is
*exactly* what they're complaining about.  In particular, the complaint
is that it's unreasonable to have Postgres running on a machine at all
unless it's actively being used, because it forces significant CPU power
drain anyway.  That gets in the way of our plan for world domination,
no?  If you can't have a PG sitting unobtrusively in the background,
waiting for you to have a need for it, it won't get installed in the
first place.  People will pick mysql, or something else with a smaller
footprint, to put on their laptops, and then we lose some more mindshare.
I see this as just another facet of the argument about whether it's okay
to have default settings that try to take over the entire machine.
        regards, tom lane


On Sat, Aug 28, 2010 at 2:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  That gets in the way of our plan for world domination,
> no?  If you can't have a PG sitting unobtrusively in the background,
> waiting for you to have a need for it, it won't get installed in the
> first place.  People will pick mysql, or something else with a smaller
> footprint, to put on their laptops, and then we lose some more mindshare.

People are always complaining about hosting companies not supporting
Postgres but put yourself in the shoes of a hosting company that wants
to have a few hundred hosting clients all on the same box, either as a
shared box or more likely these days as VMs. That's perfectly
reasonable if they're all idle nearly all the time but if they're all
waking up every few milliseconds that's going to consume a substantial
amount of cpu before you even start handling requests.

--
greg


Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
>   
>> ...  The only clear case where this is 
>> always a win is when the system is totally idle.
>>     
> If you'll climb down off that horse for a moment: yeah, the idle case is
> *exactly* what they're complaining about.

I wasn't on a horse here--I saw the specific case they must be most 
annoyed with.  If it's possible to turn this around into a "push" model 
instead of how it works now, without making performance (and 
maybe even power use!) suffer for the worst or even average case, that 
refactoring could end up outperforming the current model.  I'd like the 
background writer to never go to sleep at all, really, if enough work 
comes in to keep it always busy, and I think that might fall out nicely 
as a result of aiming for the other end--making it sleep deeper when 
it's not needed.

Just pointing out the very specific place where I suspect that will go 
wrong if it's not done carefully is the fsync absorption, because it's 
already broken enough that we're reworking it here.  PostgreSQL already 
ships a bad enough default configuration without deciding that yet 
another spot should be yielded to somebody else's priorities, unless that actually 
meets performance goals.  I think I can quantify what those should be 
with a test case for this part.

> I see this as just another facet of the argument about whether it's okay
> to have default settings that try to take over the entire machine.
>   

Mostly agreed here.  So long as the result is automatic enough to not 
introduce yet another GUC set to the wrong thing for busy systems by 
default, this could turn out great.  The list of things you must change 
to get the database to work right for serious work is already far too 
long, and I dread the thought of putting yet another on there just for 
the sake of lower power use.  Another part of the plan for world 
domination is that Oracle DBAs install the database for a test and say 
"wow, that wasn't nearly as complicated as I'm used to and it performs 
well, that was nice".  That those people matter too is all I'm saying.  
I could easily file a bug from their perspective saying "Background 
writer is a lazy SOB in default config" that would be no less valid than 
the one being discussed here.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
Here's a 2nd version of the "latch" patch. Now with a Windows
implementation. Comments welcome.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
On Tue, Aug 31, 2010 at 4:06 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's a 2nd version of the "latch" patch. Now with a Windows
> implementation. Comments welcome.

Seems good.

Two minor comments:

> rc = WaitForSingleObject(latch->event, timeout / 1000);
> if (rc == WAIT_FAILED)
> {
>     ereport(ERROR,
>         (errcode_for_socket_access(),
>          errmsg("WaitForSingleObject() failed: error code %d", (int) GetLastError())));
> }
> if (rc == WAIT_TIMEOUT)
>     break;        /* timeout exceeded */

We should also check "rc == WAIT_OBJECT_0"?

> static volatile HANDLE waitingEvent = false;

s/false/NULL?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> /*
>>  * XXX: Should we invent an API to wait for data coming from the
>>  * client connection too? It's not critical, but we could then
>>  * eliminate the timeout altogether and go to sleep for good.
>>  */
>
> Yes, it would be very helpful when walsender waits for the ACK from
> the standby in upcoming synchronous replication.

I propose to change WaitLatch() so that it accepts the socket
descriptor as an argument, to wait for data coming from the
client connection. WaitLatch() monitors not only the self-pipe
but also the given socket. If -1 is supplied, it checks only
the self-pipe.

The socket should be monitored by using poll() if the platform
has it, since poll() is usually more efficient.

So I'd like to change Unix implementation of WaitLatch() as
follows. Thought?

-------------------
#define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout)

void WaitLatchAndSocket(Latch *latch, int sock, long timeout)
{
    ...
    FD_SET(selfpipe_readfd, &input_mask);
    if (sock != -1)
        FD_SET(sock, &input_mask);

#ifdef HAVE_POLL
    poll(...)
#else
    select(...)
#endif   /* HAVE_POLL */
    ...
}
-------------------

Windows implementation of WaitLatchAndSocket() seems not to be
so simple. We would need to wait for both the latch event and
the packet from the socket by using WaitForMultipleObjectsEx().

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Well, yes they are.  They cause unnecessary process wakeups and thereby
> >> consume cycles even when the database is idle.  See for example a
> >> longstanding complaint here:
> >> https://bugzilla.redhat.com/show_bug.cgi?id=252129
> 
> > ...  The only clear case where this is 
> > always a win is when the system is totally idle.
> 
> If you'll climb down off that horse for a moment: yeah, the idle case is
> *exactly* what they're complaining about.  In particular, the complaint
> is that it's unreasonable to have Postgres running on a machine at all
> unless it's actively being used, because it forces significant CPU power
> drain anyway.  That gets in the way of our plan for world domination,
> no?  If you can't have a PG sitting unobtrusively in the background,
> waiting for you to have a need for it, it won't get installed in the
> first place.  People will pick mysql, or something else with a smaller
> footprint, to put on their laptops, and then we lose some more mindshare.
> I see this as just another facet of the argument about whether it's okay
> to have default settings that try to take over the entire machine.

FYI, last week I was running PG 8.4.X with default settings on a T43
laptop running XP with 1 Gig of RAM and it caused a racing game I was
playing to run jerkily.  When I shut down the Postgres server, the
problem was fixed.  I was kind of surprised that an idle PG server could
cause that.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 31/08/10 15:47, Fujii Masao wrote:
> On Fri, Aug 27, 2010 at 4:39 PM, Fujii Masao<masao.fujii@gmail.com>  wrote:
>>> /*
>>>   * XXX: Should we invent an API to wait for data coming from the
>>>   * client connection too? It's not critical, but we could then
>>>   * eliminate the timeout altogether and go to sleep for good.
>>>   */
>>
>> Yes, it would be very helpful when walsender waits for the ACK from
>> the standby in upcoming synchronous replication.
>
> I propose to change WaitLatch() so that it accepts the socket
> descriptor as an argument, to wait for data coming from the
> client connection.

Yeah, we probably should do that now.

>  WaitLatch() monitors not only the self-pipe
> but also the given socket. If -1 is supplied, it checks only
> the self-pipe.

The obvious next question is how to wait for multiple sockets and a 
latch at the same time? Perhaps we should have a select()-like interface 
where you can pass multiple file descriptors. Then again, looking at the 
current callers of select() in the backend, apart from postmaster they 
all wait for only one fd.

> The socket should be monitored by using poll() if the platform
> has it, since poll() is usually more efficient.

Yeah, I guess.

> So I'd like to change Unix implementation of WaitLatch() as
> follows. Thought?
>
> -------------------
> #define WaitLatch(latch, timeout) WaitLatchAndSocket(latch, -1, timeout)
>
> void WaitLatchAndSocket(Latch *latch, int sock, long timeout)
> {
>      ...
>
>      FD_SET(selfpipe_readfd,&input_mask);
>      if (sock != -1)
>          FD_SET(sock,&input_mask);
>
> #ifdef HAVE_POLL
>      poll(...)
> #else
>      select(...)
> #endif   /* HAVE_POLL */
>
>      ...
> }
> -------------------

Yep.

> Windows implementation of WaitLatchAndSocket() seems not to be
> so simple. We would need to wait for both the latch event and
> the packet from the socket by using WaitForMultipleObjectsEx().

Well, we already use WaitForMultipleObjectsEx() to implement select() on 
Windows, so it should be straightforward to copy that. I'll look into that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The obvious next question is how to wait for multiple sockets and a latch at
> the same time? Perhaps we should have a select()-like interface where you
> can pass multiple file descriptors. Then again, looking at the current
> callers of select() in the backend, apart from postmaster they all wait for
> only one fd.

Currently backends have not waited for multiple sockets, so I don't think that
interface is required for now. Similarly, we don't need to wait for the socket
to be ready to *write* because there is no use case for now.

>> Windows implementation of WaitLatchAndSocket() seems not to be
>> so simple. We would need to wait for both the latch event and
>> the packet from the socket by using WaitForMultipleObjectsEx().
>
> Well, we already use WaitForMultipleObjectsEx() to implement select() on
> Windows, so it should be straightforward to copy that. I'll look into that.

Agreed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 02/09/10 06:46, Fujii Masao wrote:
> On Wed, Sep 1, 2010 at 4:11 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> The obvious next question is how to wait for multiple sockets and a latch at
>> the same time? Perhaps we should have a select()-like interface where you
>> can pass multiple file descriptors. Then again, looking at the current
>> callers of select() in the backend, apart from postmaster they all wait for
>> only one fd.
>
> Currently backends have not waited for multiple sockets, so I don't think that
> interface is required for now. Similarly, we don't need to wait for the socket
> to be ready to *write* because there is no use case for now.

Ok, here's an updated patch with WaitLatchOrSocket that lets you do that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Ok, here's an updated patch with WaitLatchOrSocket that lets you do that.

Minor code review items:

s/There is three/There are three/ in unix_latch.c header comment

The header comment points out the correct usage of ResetLatch, but I
think it should also emphasize that the correct usage of SetLatch is
to set whatever flags indicate work-to-do before SetLatch.

I don't care for the Asserts in InitLatch and InitSharedLatch.
Initialization functions ought not assume that the struct they are
told to initialize contains anything but garbage.  And *especially*
not assume that without documentation.

s/inter-proess/inter-process/, at least 2 places

Does ReleaseLatch() have any actual use-case, and if so what would it be?
I think we could do without it.

The WaitLatch timeout API could use a bit of refinement.  I'd suggest
defining negative timeout as meaning wait forever, so that timeout = 0
can be used for "check but don't wait".  Also, it seems like the
function shouldn't just return void but should return a bool to show
whether it saw the latch set or timed out.  (Yeah, I realize the caller
could look into the latch to find that out, but callers really ought to
treat latches as opaque structs.)

I don't think you have the select-failed logic right in
WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
test to make, which I think ain't the case.  Just "continue" around
the loop.
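Putting the timeout refinement and the EINTR fix together, the Unix wait loop might look roughly like this (a sketch against a bare self-pipe fd, with invented names; not the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait for a byte to arrive on the self-pipe.  timeout < 0 means wait
 * forever, timeout == 0 means "check but don't wait", and the return
 * value says whether we were woken (true) or timed out (false).
 */
static bool
wait_latch_sketch(int selfpipe_readfd, long timeout_ms)
{
    for (;;)
    {
        fd_set          readfds;
        struct timeval  tv;
        struct timeval *tvp = NULL;     /* NULL = wait forever */
        int             rc;

        FD_ZERO(&readfds);
        FD_SET(selfpipe_readfd, &readfds);

        if (timeout_ms >= 0)
        {
            tv.tv_sec = timeout_ms / 1000;
            tv.tv_usec = (timeout_ms % 1000) * 1000;
            tvp = &tv;
        }

        rc = select(selfpipe_readfd + 1, &readfds, NULL, NULL, tvp);
        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* readfds is not valid after a failure */
            return false;
        }
        if (rc == 0)
            return false;       /* timed out */
        return true;            /* the self-pipe was written to */
    }
}
```

Note that, as written, an EINTR restarts the full timeout rather than continuing with the remainder.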

Comment for unix_latch's latch_sigusr1_handler refers to SetEvent,
which is a Windows-ism.

+             * XXX: Is it safe to elog(ERROR) in a signal handler?

No, it isn't.

It seems like both implementations are #include'ing more than they
ought to --- why replication/walsender.h, in particular?
I don't think unix_latch needs spin.h either.

+typedef struct
+{
+    volatile sig_atomic_t    is_set;
+    volatile sig_atomic_t    owner_pid;
+} Latch;

I don't believe it is either sane or portable to declare struct fields
as volatile.  You need to attach the volatile qualifier to the function
arguments instead, eg
extern WaitLatch(volatile Latch *latch, ...)

Also, using sig_atomic_t for owner_pid is entirely not sane.
On many platforms sig_atomic_t is only a byte, and besides
which you have no need for that field to be settable by a
signal handler.
        regards, tom lane


On Fri, Sep 3, 2010 at 5:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Does ReleaseLatch() have any actual use-case, and if so what would it be?
> I think we could do without it.

On Unix, we can probably live without that. But on Windows, a process
needs to free its SharedEventHandles slot when it exits, so that the
slot is available for an upcoming process that uses a latch.

> The WaitLatch timeout API could use a bit of refinement.  I'd suggest
> defining negative timeout as meaning wait forever, so that timeout = 0
> can be used for "check but don't wait".  Also, it seems like the
> function shouldn't just return void but should return a bool to show
> whether it saw the latch set or timed out.  (Yeah, I realize the caller
> could look into the latch to find that out, but callers really ought to
> treat latches as opaque structs.)

Agreed.

> I don't think you have the select-failed logic right in
> WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
> test to make, which I think ain't the case.  Just "continue" around
> the loop.

EINTR already makes us go back to the top of the loop since FD_ISSET(pipe)
is not checked. Then we would clear the pipe and break out of the loop
because of "latch->is_set == true".

> +                        * XXX: Is it safe to elog(ERROR) in a signal handler?
>
> No, it isn't.

We should use elog(FATAL) or check proc_exit_inprogress, instead?

+        if (errno != EAGAIN && errno != EWOULDBLOCK)
+        {
+            /*
+             * XXX: Is it safe to elog(ERROR) in a signal handler?
+             */
+            elog(ERROR, "write() on self-pipe failed: %m");
+        }
+        if (errno == EINTR)
+            goto retry;

"errno == EINTR)" seems to be never checked.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Fujii Masao <masao.fujii@gmail.com> writes:
>> +               * XXX: Is it safe to elog(ERROR) in a signal handler?
>> 
>> No, it isn't.

> We should use elog(FATAL) or check proc_exit_inprogress, instead?

elog(FATAL) is *certainly* not a better idea.  I think there's really
nothing that can be done, you just have to silently ignore the error.
BTW, if we retry, there had probably better be a limit on how many times
to retry ...
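A bounded-retry version that stays silent on failure might look like this (a sketch with an invented name, not the patch):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Send one wakeup byte to the self-pipe from a signal handler.
 * Errors are silently ignored, since we can't safely elog() here,
 * and EINTR retries are bounded rather than open-ended.
 */
static void
send_selfpipe_byte(int selfpipe_writefd)
{
    int tries;

    for (tries = 0; tries < 10; tries++)
    {
        if (write(selfpipe_writefd, "x", 1) >= 0)
            return;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;             /* pipe full: a wakeup is already pending */
        if (errno != EINTR)
            return;             /* nothing safe to report from a handler */
        /* EINTR: retry, but only a bounded number of times */
    }
}
```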

> +        if (errno != EAGAIN && errno != EWOULDBLOCK)
> +        {
> +            /*
> +             * XXX: Is it safe to elog(ERROR) in a signal handler?
> +             */
> +            elog(ERROR, "write() on self-pipe failed: %m");
> +        }
> +        if (errno == EINTR)
> +            goto retry;

> "errno == EINTR)" seems to be never checked.

Another issue with coding like that is that it supposes elog() won't
change errno.
        regards, tom lane


On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>>> +               * XXX: Is it safe to elog(ERROR) in a signal handler?
>>>
>>> No, it isn't.
>
>> We should use elog(FATAL) or check proc_exit_inprogress, instead?
>
> elog(FATAL) is *certainly* not a better idea.  I think there's really
> nothing that can be done, you just have to silently ignore the error.

Hmm... some functions called from a signal handler use elog(FATAL), e.g.,
RecoveryConflictInterrupt() does that when an unknown conflict mode is
given as an argument. Are these calls unsafe, too?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 02/09/10 23:13, Tom Lane wrote:
> The WaitLatch ...timeout API could use a bit of refinement.  I'd suggest
> defining negative timeout as meaning wait forever, so that timeout = 0
> can be used for "check but don't wait".  Also, it seems like the
> function shouldn't just return void but should return a bool to show
> whether it saw the latch set or timed out.

In the case of WaitLatchOrSocket, the caller might want to know whether
the latch was set, the socket became readable, or it timed out. So we
need three different return values.
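One way to express the three outcomes is a small result enum instead of a bool; a sketch with hypothetical names (the eventual API may look quite different):

```c
#include <assert.h>

/* Hypothetical result codes for WaitLatchOrSocket; the names are
 * illustrative, not the actual API. */
typedef enum
{
    WL_TIMEOUT = 0,         /* neither event: the timeout expired */
    WL_LATCH_SET,           /* the latch was set */
    WL_SOCKET_READABLE      /* the socket became readable */
} WaitResult;

/* Map what select() reported to a single result code; if both events
 * fired at once, report the latch, which the caller checks first. */
static WaitResult
classify_wait(int latch_is_set, int sock_readable)
{
    if (latch_is_set)
        return WL_LATCH_SET;
    if (sock_readable)
        return WL_SOCKET_READABLE;
    return WL_TIMEOUT;
}
```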
> (Yeah, I realize the caller could look into the latch to find that
> out, but callers really ought to treat latches as opaque structs.)


Hmm, maybe we need a TestLatch function to check if a latch is set.

> I don't think you have the select-failed logic right in
> WaitLatchOrSocket; on EINTR it will suppose that FD_ISSET is a valid
> test to make, which I think ain't the case.  Just "continue" around
> the loop.

Yep.

I also realized that the timeout handling is a bit surprising with 
interrupts. After EINTR we call select() again with the same timeout, so 
a signal effectively restarts the timer. We seem to have similar 
behavior in a couple of other places, in pgstat.c and auth.c. So maybe 
that's OK and just needs to be documented, but I thought I'd bring it up.

> It seems like both implementations are #include'ing more than they
> ought to --- why replication/walsender.h, in particular?

The Windows implementation needs it for the max_wal_senders variable, to
allocate enough shared Event objects in LatchShmemInit. In unix_latch.c
it's not needed.

> Also, using sig_atomic_t for owner_pid is entirely not sane.
> On many platforms sig_atomic_t is only a byte, and besides
> which you have no need for that field to be settable by a
> signal handler.

Hmm, true, it doesn't need to be set from a signal handler, but is there
an atomicity problem if one process calls ReleaseLatch while another
process is in SetLatch? ReleaseLatch sets owner_pid to 0, while SetLatch
reads it and calls kill() on it. Can we assume that pid_t is atomic, or
do we need a spinlock to protect it? (The Windows implementation has a
similar issue with HANDLE instead of pid_t.)

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Fujii Masao <masao.fujii@gmail.com> writes:
> On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> elog(FATAL) is *certainly* not a better idea.  I think there's really
>> nothing that can be done, you just have to silently ignore the error.

> Hmm.. some functions called by a signal handler use elog(FATAL), e.g.,
> RecoveryConflictInterrupt() does that when an unknown conflict mode is
> given as an argument. Are these calls unsafe, too?

[ shrug... ]  I stated before that the Hot Standby patch is doing
utterly unsafe things in signal handlers.  Simon rejected that.
I am waiting for irrefutable evidence to emerge from the field
(and am very confident that it will be forthcoming...) before
I argue with him further.  Meanwhile, I'm not going to accept anything
unsafe in a core facility like this patch is going to be.
        regards, tom lane


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 02/09/10 23:13, Tom Lane wrote:
>>> (Yeah, I realize the caller
>>> could look into the latch to find that out, but callers really ought to
>>> treat latches as opaque structs.)

> Hmm, maybe we need a TestLatch function to check if a latch is set.

+1.  It could be a macro for now.  I wish that we could declare Latch as
an opaque struct, but we probably need to let callers embed it in a
larger struct, so we'll just have to rely on callers to code cleanly.
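As a rough illustration of the macro idea, assuming a latch struct whose flag is a sig_atomic_t (names hypothetical, not the eventual API):

```c
#include <assert.h>
#include <signal.h>     /* sig_atomic_t */

/* A cut-down Latch struct, just enough to illustrate the macro; the
 * real struct also carries an owner pid (and, on Windows, an Event). */
typedef struct Latch
{
    volatile sig_atomic_t is_set;
} Latch;

/* TestLatch as a macro, per the suggestion above: nonzero iff the
 * latch is currently set, without callers poking into the struct. */
#define TestLatch(latch) ((latch)->is_set != 0)
```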

> I also realized that the timeout handling is a bit surprising with 
> interrupts. After EINTR we call select() again with the same timeout, so 
> a signal effectively restarts the timer.

Actually it's platform-dependent.  On some machines (I think
BSD-derived) the value of the timeout struct gets reduced by the elapsed
time before returning, so that if you just loop around you don't get
more than the originally specified delay time in total.

> We seem to have similar 
> behavior in a couple of other places, in pgstat.c and auth.c. So maybe 
> that's OK and just needs to be documented, but I thought I'd bring it up.

Yeah.  I am hoping that this facility will greatly reduce the need for
callers to care about the exact timeout delay, so I think that what we
should do for now is just document that the timeout might be several
times as long as specified, and see how it goes from there.

We could fix the problem if we had to, by adding some gettimeofday()
calls and computing the remaining delay time each time through the
loop.  But let's avoid doing that until it's clear we need it.
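The gettimeofday()-based fix Tom describes could compute the remaining delay like this sketch (illustrative names, not actual backend code):

```c
#include <assert.h>
#include <sys/time.h>

/*
 * Sketch of the remaining-delay computation: after an EINTR, recompute
 * how much of the original timeout is left instead of restarting the
 * full delay on the next select() call.
 */
static long
remaining_usecs(const struct timeval *start,
                const struct timeval *now,
                long timeout_usecs)
{
    long elapsed = (now->tv_sec - start->tv_sec) * 1000000L
                 + (now->tv_usec - start->tv_usec);

    /* Clamp to zero once the deadline has passed. */
    return (elapsed >= timeout_usecs) ? 0 : timeout_usecs - elapsed;
}
```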

>> Also, using sig_atomic_t for owner_pid is entirely not sane.

> Hmm, true, it doesn't need to be set from a signal handler, but is there 
> an atomicity problem if one process calls ReleaseLatch while another 
> process is in SetLatch?

If there is *any* possibility of that happening then you have far worse
problems than whether the field is atomically readable or not: the
behavior will be unpredictable at just slightly larger timescales.
This is the reason why I think it'd be better if ReleaseLatch simply
didn't exist.  That'd discourage people from designing dynamic latch
structures, which are fundamentally going to be subject to race
conditions.
        regards, tom lane


On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> On Fri, Sep 3, 2010 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> elog(FATAL) is *certainly* not a better idea.  I think there's really
>>> nothing that can be done, you just have to silently ignore the error.
>
>> Hmm.. some functions called by a signal handler use elog(FATAL), e.g.,
>> RecoveryConflictInterrupt() does that when an unknown conflict mode is
>> given as an argument. Are these calls unsafe, too?
>
> [ shrug... ]  I stated before that the Hot Standby patch is doing
> utterly unsafe things in signal handlers.  Simon rejected that.
> I am waiting for irrefutable evidence to emerge from the field
> (and am very confident that it will be forthcoming...) before
> I argue with him further.  Meanwhile, I'm not going to accept anything
> unsafe in a core facility like this patch is going to be.

Oh.  I thought you had ignored his objections and fixed it.  Why are
we releasing 9.0 with this problem again?  Surely this is nuts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ shrug... ]  I stated before that the Hot Standby patch is doing
>> utterly unsafe things in signal handlers.  Simon rejected that.
>> I am waiting for irrefutable evidence to emerge from the field
>> (and am very confident that it will be forthcoming...) before
>> I argue with him further.  Meanwhile, I'm not going to accept anything
>> unsafe in a core facility like this patch is going to be.

> Oh.  I thought you had ignored his objections and fixed it.  Why are
> we releasing 9.0 with this problem again?  Surely this is nuts.

My original review of hot standby found about half a dozen things
I thought were broken:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
After a *very* long-drawn-out fight I fixed one of them
(max_standby_delay), largely still over Simon's objections.  I don't
have the energy to repeat that another half-dozen times, so I'm going
to wait for the suspected problems to be proven by field experience.
        regards, tom lane


On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> [ shrug... ]  I stated before that the Hot Standby patch is doing
>>> utterly unsafe things in signal handlers.  Simon rejected that.
>>> I am waiting for irrefutable evidence to emerge from the field
>>> (and am very confident that it will be forthcoming...) before
>>> I argue with him further.  Meanwhile, I'm not going to accept anything
>>> unsafe in a core facility like this patch is going to be.
>
>> Oh.  I thought you had ignored his objections and fixed it.  Why are
>> we releasing 9.0 with this problem again?  Surely this is nuts.
>
> My original review of hot standby found about half a dozen things
> I thought were broken:
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
> After a *very* long-drawn-out fight I fixed one of them
> (max_standby_delay), largely still over Simon's objections.  I don't
> have the energy to repeat that another half-dozen times, so I'm going
> to wait for the suspected problems to be proven by field experience.

Bummer.  Allow me to cast a vote for doing something about the fact
that handle_standby_sig_alarm() thinks it can safely acquire an LWLock
in a signal handler.  I think we should be making our decisions on
what to change in the code based on what is technically sound, rather
than based on how much the author complains about changing it.  Of
course there may be cases where there is a legitimate difference of
opinion concerning the best way forward, but I don't believe this is
one of them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 03/09/10 17:51, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 02/09/10 23:13, Tom Lane wrote:
>>> Also, using sig_atomic_t for owner_pid is entirely not sane.
>
>> Hmm, true, it doesn't need to be set from a signal handler, but is there
>> an atomicity problem if one process calls ReleaseLatch while another
>> process is in SetLatch?
>
> If there is *any* possibility of that happening then you have far worse
> problems than whether the field is atomically readable or not: the
> behavior will be unpredictable at just slightly larger timescales.
> This is the reason why I think it'd be better if ReleaseLatch simply
> didn't exist.  That'd discourage people from designing dynamic latch
> structures, which are fundamentally going to be subject to race
> conditions.

Each Walsender needs a latch, and walsenders come and go. The version I 
first experimented with had no ReleaseLatch function; instead any 
process could call WaitLatch on any shared latch, as long as only one 
process waits on a given latch at a time. But it had exactly the same 
problem: 
WaitLatch had to set the pid on the Latch struct to allow other 
processes to send the signal. Another process could call SetLatch and 
read the pid field, while WaitLatch is just setting it. I think we'll 
have to put a spinlock there, if we can't assume that assignment of 
pid_t is atomic. It's not the end of the world..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 03/09/10 17:51, Tom Lane wrote:
>> If there is *any* possibility of that happening then you have far worse
>> problems than whether the field is atomically readable or not: the
>> behavior will be unpredictable at just slightly larger timescales.

> Each Walsender needs a latch, and walsenders come and go.

Well, then we need to think extremely hard about the circumstances in
which we need to send a cross-process latch signal to walsenders and
what the behavior needs to be in the race conditions.

> WaitLatch had to set the pid on the Latch struct to allow other 
> processes to send the signal. Another process could call SetLatch and 
> read the pid field, while WaitLatch is just setting it. I think we'll 
> have to put a spinlock there, if we can't assume that assignment of 
> pid_t is atomic. It's not the end of the world..

Yes it is.  Signal handlers can't take spinlocks (what if they interrupt
while the mainline is holding the lock?).

It's probably not too unreasonable to assume that pid_t assignment is
atomic.  But I'm still thinking that we have bigger problems than that
if there are really cases where SetLatch can execute at approximately
the same time as a latch owner is coming or going.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 03/09/10 21:16, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> WaitLatch had to set the pid on the Latch struct to allow other
>> processes to send the signal. Another process could call SetLatch and
>> read the pid field, while WaitLatch is just setting it. I think we'll
>> have to put a spinlock there, if we can't assume that assignment of
>> pid_t is atomic. It's not the end of the world..
>
> Yes it is.  Signal handlers can't take spinlocks (what if they interrupt
> while the mainline is holding the lock?).

Ok, I see.

> It's probably not too unreasonable to assume that pid_t assignment is
> atomic.  But I'm still thinking that we have bigger problems than that
> if there are really cases where SetLatch can execute at approximately
> the same time as a latch owner is coming or going.

I don't see how to avoid it. A walsender, or any process really, can 
exit at any time. It can make the latch inaccessible to others before it 
exits to minimize the window, but it's always going to be possible that 
another process is just about to call SetLatch when you exit.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 03/09/10 21:16, Tom Lane wrote:
>> It's probably not too unreasonable to assume that pid_t assignment is
>> atomic.  But I'm still thinking that we have bigger problems than that
>> if there are really cases where SetLatch can execute at approximately
>> the same time as a latch owner is coming or going.

> I don't see how to avoid it. A walsender, or any process really, can 
> exit at any time. It can make the latch inaccessible to others before it 
> exits to minimize the window, but it's always going to be possible that 
> another process is just about to call SetLatch when you exit.

Well, in that case what we need to do is presume that the latch object
has a continuing existence but the owner/receiver can come and go.
I would suggest that InitLatch needs to initialize the object into a
valid but unowned state; there is *no* deinitialize operation; and
there are AcquireLatch and ReleaseLatch operations to become owner
or stop being owner.  We also need to define the semantics of SetLatch
on an unowned latch --- does this set a signal condition that will be
available to the next owner?

This amount of complexity might be overkill for local latches, but I
think we need it for shared ones.
        regards, tom lane
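Tom's proposal amounts to a small state machine. As a single-process toy (all names hypothetical; a real SetLatch on an owned latch would also kill() the owner to wake it, which is omitted here):

```c
#include <assert.h>
#include <signal.h>      /* sig_atomic_t */
#include <sys/types.h>   /* pid_t */
#include <unistd.h>      /* getpid() */

/*
 * Sketch of the proposed latch life cycle: the object is initialized
 * once into a valid but unowned state, there is no deinitialize,
 * owners come and go, and a set on an unowned latch is remembered
 * for the next owner.
 */
typedef struct Latch
{
    volatile sig_atomic_t is_set;
    pid_t                 owner_pid;   /* 0 = valid but unowned */
} Latch;

static void InitSharedLatch(Latch *l) { l->is_set = 0; l->owner_pid = 0; }
static void OwnLatch(Latch *l)        { l->owner_pid = getpid(); }
static void DisownLatch(Latch *l)     { l->owner_pid = 0; }

/* Record the signal condition; it survives until explicitly reset,
 * so a set while the latch is unowned is visible to the next owner. */
static void SetLatch(Latch *l)         { l->is_set = 1; }
static void ResetLatch(Latch *l)       { l->is_set = 0; }
static int  LatchIsSet(const Latch *l) { return l->is_set != 0; }
```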


Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> [ shrug... ]  I stated before that the Hot Standby patch is doing
>>> utterly unsafe things in signal handlers.  Simon rejected that.
>>> I am waiting for irrefutable evidence to emerge from the field
>>> (and am very confident that it will be forthcoming...) [...]
> 
>> [...]Why are
>> we releasing 9.0 with this problem again?  Surely this is nuts.

Will the docs give enough info so that release note readers
will know when they're giving well-informed consent to volunteer
to produce such field evidence?




On Fri, Sep 3, 2010 at 3:11 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> [ shrug... ]  I stated before that the Hot Standby patch is doing
>>>> utterly unsafe things in signal handlers.  Simon rejected that.
>>>> I am waiting for irrefutable evidence to emerge from the field
>>>> (and am very confident that it will be forthcoming...) [...]
>>
>>> [...]Why are
>>> we releasing 9.0 with this problem again?  Surely this is nuts.
>
> Will the docs give enough info so that release note readers
> will know when they're giving well-informed consent to volunteer
> to produce such field evidence?

Yeah, exactly.  Good news: you can now run queries on the standby.
Bad news: we've abandoned our policy of not releasing with known bugs.
Have fun and enjoy using PostgreSQL, the world's most advanced open
source databSegmentation fault

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 03/09/10 19:38, Robert Haas wrote:
> On Fri, Sep 3, 2010 at 12:10 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Robert Haas<robertmhaas@gmail.com>  writes:
>>> On Fri, Sep 3, 2010 at 10:07 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>>> [ shrug... ]  I stated before that the Hot Standby patch is doing
>>>> utterly unsafe things in signal handlers.  Simon rejected that.
>>>> I am waiting for irrefutable evidence to emerge from the field
>>>> (and am very confident that it will be forthcoming...) before
>>>> I argue with him further.  Meanwhile, I'm not going to accept anything
>>>> unsafe in a core facility like this patch is going to be.
>>
>>> Oh.  I thought you had ignored his objections and fixed it.  Why are
>>> we releasing 9.0 with this problem again?  Surely this is nuts.
>>
>> My original review of hot standby found about half a dozen things
>> I thought were broken:
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg00178.php
>> After a *very* long-drawn-out fight I fixed one of them
>> (max_standby_delay), largely still over Simon's objections.  I don't
>> have the energy to repeat that another half-dozen times, so I'm going
>> to wait for the suspected problems to be proven by field experience.
>
> Bummer.  Allow me to cast a vote for doing something about the fact
> that handle_standby_sig_alarm() thinks it can safely acquire an LWLock
> in a signal handler.  I think we should be making our decisions on
> what to change in the code based on what is technically sound, rather
> than based on how much the author complains about changing it.  Of
> course there may be cases where there is a legitimate difference of
> opinion concerning the best way forward, but I don't believe this is
> one of them.

Hmm, just to make the risk more concrete, here's one scenario that could 
happen:

1. Startup process tries to acquire cleanup lock on a page. It's pinned, 
so it has to wait, and calls ResolveRecoveryConflictWithBufferPin().
2. ResolveRecoveryConflictWithBufferPin enables the standby SIGALRM 
handler by calling enable_standby_sig_alarm(), and calls 
ProcWaitForSignal().

3. ProcWaitForSignal() calls semop() (assuming SysV semaphores here) to 
wait for the process semaphore.

4. Max standby delay is reached and SIGALRM fired. CheckStandbyTimeout() 
is called in signal handler. CheckStandbyTimeout() calls 
SendRecoveryConflictWithBufferPin(), which calls CancelDBBackends()

5. CancelDBBackends() tries to acquire ProcArrayLock in exclusive mode. 
It's being held by another process, so we have to sleep

6. To sleep, LWLockAcquire calls PGSemaphoreLock, which calls semop() to 
wait on the process semaphore.

So we now have the same process nested twice inside a semop() call. 
Looking at the Linux signal(7) man page, it is undefined what happens 
if semop() is re-entered in a signal handler while another semop() call 
is happening in the main line of execution. Assuming it somehow works, 
which semop() call is going to return when the semaphore is incremented?

Maybe that's OK; if I'm reading the deadlock checker code correctly, it 
also calls semop() to increment another process's semaphore, and the 
deadlock checker can be invoked from a signal handler while in semop() 
to wait on our process's semaphore. BTW, sem_post(), which is the POSIX 
function for incrementing a semaphore, is listed as a safe function to 
call in a signal handler. But it's certainly fishy.

A safer approach would be to just PGSemaphoreUnlock() in the signal 
handler, and do all the other processing outside it. You'd still call 
semop() within semop(), but at least it would be closer to the semop() 
within semop() we already do in the deadlock checker. And there would be 
less randomness from timing and lock contention involved, making it 
easier to test the behavior on various platforms.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Fri, Sep 3, 2010 at 4:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Maybe that's OK; if I'm reading the deadlock checker code correctly, it
> also calls semop() to increment another process's semaphore, and the
> deadlock checker can be invoked from a signal handler while in semop() to
> wait on our process's semaphore. BTW, sem_post(), which is the POSIX
> function for incrementing a semaphore, is listed as a safe function to
> call in a signal handler. But it's certainly fishy.

Color me confused; I may need to backpedal rapidly here.  I had
thought that what Tom was complaining about was the fact that the
signal handler was taking LWLocks, which I would have thought to be
totally unsafe.  But it seems the deadlock detector does the same
thing, more or less because the signal handlers are set up so that
they don't do anything unless we're within a limited section of code
where nothing too interesting can happen.  I'm not too sure why we
think that it's safe to invoke the deadlock detector that way, but
it's also not too clear to me why this case is any worse.

It furthermore appears that Simon's reply to Tom's complaint about
this function was: "This was modelled very closely on
handle_sig_alarm() and was reviewed by other hackers. I'm not great on
that, as you know, so if you can explain what it is I can't do, and
how that differs from handle_sig_alarm running the deadlock detector
in the same way, then I'll work on it some more."

I guess I need the same explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> A safer approach would be to just PGSemaphoreUnlock() in the signal 
> handler, and do all the other processing outside it.

I don't see any particularly good reason to assume that
PGSemaphoreUnlock is safe either: you're still talking about nested
semop operations.

The pre-existing SIGALRM handler uses a self-signal (kill(MyProcPid,
SIGINT)) to kick the process off any wait it might be doing.  I'd rather
do something like that.

Or maybe the work you're doing on latches would help ...
        regards, tom lane
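The self-signal trick Tom mentions can be reduced to a toy like this, with the handler doing nothing but setting a flag (hypothetical names, not the actual SIGALRM machinery):

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Toy version of the self-signal approach: the handler only sets a
 * flag, and kill(getpid(), ...) kicks the process so that a blocking
 * syscall in the mainline returns with EINTR; all real processing
 * then happens in mainline code.  Illustrative only.
 */
static volatile sig_atomic_t got_wakeup = 0;

static void
wakeup_handler(int signo)
{
    got_wakeup = 1;     /* flag only; real work happens in mainline */
}

static void
kick_self(void)
{
    signal(SIGUSR1, wakeup_handler);
    kill(getpid(), SIGUSR1);    /* delivered before kill() returns */
}
```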


Robert Haas <robertmhaas@gmail.com> writes:
> Color me confused; I may need to backpedal rapidly here.  I had
> thought that what Tom was complaining about was the fact that the
> signal handler was taking LWLocks, which I would have thought to be
> totally unsafe.

Well, it's unsafe if the signal could interrupt mainline code that is
trying to take an LWLock or already holds the same LWLock --- or, if you
consider deadlock risks against other processes, already holding other
LWLocks might be problematic too.  And trying to use facilities like
elog or palloc is also pretty unsafe if the mainline is using them too.

The reason the deadlock checker is okay is that there is only a *very*
narrow range of mainline code that it could possibly be interrupting,
basically nothing but the PGSemaphoreLock() call in ProcSleep.
(There's a lot of other code inside the loop there, but notice that it's
protected by tests that ensure that it won't run unless the deadlock
checker already did.)

One salient thing about ProcSleep is that you shouldn't call it while
holding any LWLocks, and another is that if you're sleeping while
holding regular heavyweight locks, the deadlock checker is exactly what
will get you out of trouble if there's a deadlock.

Now the HS case likewise appears to be set up so that the signal can
only directly interrupt ProcWaitForSignal, so I think the core issue is
whether any deadlock situations are possible.  Given that this gets
called from a low-level place like LockBufferForCleanup, I don't feel
too comfortable about that.  I certainly haven't seen any analysis or
documentation of what locks can safely be held at that point.
The deadlock checker only tries to take the LockMgr LWLocks, so
extrapolating from whether it is safe to whether touching the
ProcArrayLock is safe seems entirely unfounded.

It might be worth pointing out here that LockBufferForCleanup is already
known to be a risk factor for undetected deadlocks, even without HS in
the picture, because of the possibility of deadlocks involving a chain
of both heavyweight locks and LWLocks.  Whether HS makes it materially
worse may be something that we need field experience to determine.
        regards, tom lane


On Fri, Sep 3, 2010 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now the HS case likewise appears to be set up so that the signal can
> only directly interrupt ProcWaitForSignal, so I think the core issue is
> whether any deadlock situations are possible.  Given that this gets
> called from a low-level place like LockBufferForCleanup, I don't feel
> too comfortable about that.  I certainly haven't seen any analysis or
> documentation of what locks can safely be held at that point.
> The deadlock checker only tries to take the LockMgr LWLocks, so
> extrapolating from whether it is safe to whether touching the
> ProcArrayLock is safe seems entirely unfounded.
>
> It might be worth pointing out here that LockBufferForCleanup is already
> known to be a risk factor for undetected deadlocks, even without HS in
> the picture, because of the possibility of deadlocks involving a chain
> of both heavyweight locks and LWLocks.  Whether HS makes it materially
> worse may be something that we need field experience to determine.

OK, well, that does make it more clear what you're worried about, but
it seems that moving the work out of the signal handler isn't going to
solve this problem.  The core issue appears to be whether the caller
might be holding a lock which another process might attempt to take
while already holding ProcArrayLock, or (in the case of a three or
more way deadlock) whether the caller might be holding a lock that
some other process could try to acquire after seizing ProcArrayLock.
As far as I can tell from looking through the code, the only locks we
ever acquire while holding ProcArrayLock are XidGenLock and
known_assigned_xids_lck (which is a spin lock).  The latter is never
held more than VERY briefly.  XidGenLock is generally also held only
briefly, not acquiring any other locks in the meantime, but
GetNewTransactionId() is not obviously (to me) safe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


On Fri, Sep 3, 2010 at 9:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> So we now have the same process nested twice inside a semop() call.
> Looking at the Linux signal(7) man page, it is undefined what happens
> if semop() is re-entered in a signal handler while another semop() call
> is happening in the main line of execution. Assuming it somehow works,
> which semop() call is going to return when the semaphore is incremented?
>

Fwiw I wouldn't be too worried about semop specifically. It's a
syscall and will always return with EINTR. What makes functions
async-unsafe is when they might do extra processing manipulating data
structures in user space such as mallocing memory. POSIX seems to be
giving semop license to do that but realistically I can't imagine any
implementation doing so.

What I might wonder about is what happens if the signal is called just
after the semop completes or just before it starts.

And someone mentioned calling elog from within the signal handler --
doesn't elog call palloc() and sprintf? That can't be async-safe.

-- 
greg


On Fri, 2010-09-03 at 18:24 -0400, Tom Lane wrote:
> Now the HS case likewise appears to be set up so that the signal can
> only directly interrupt ProcWaitForSignal, so I think the core issue
> is
> whether any deadlock situations are possible.  Given that this gets
> called from a low-level place like LockBufferForCleanup, I don't feel
> too comfortable about that.

LockBufferForCleanup is only ever called during recovery by
heap_xlog_clean() or btree_xlog_vacuum().

The actions taken to replay a WAL record are independent of all other
WAL records from a locking perspective, so replay of every WAL record
starts with no LWLocks held by the startup process. LockBufferForCleanup
is taken early on in replaying a heap or btree cleanup record, and so we
can easily check that no other LWLocks are held while it is called.

>   I certainly haven't seen any analysis or
> documentation of what locks can safely be held at that point.
> The deadlock checker only tries to take the LockMgr LWLocks, so
> extrapolating from whether it is safe to whether touching the
> ProcArrayLock is safe seems entirely unfounded.

So the startup process takes just one LWLock, ProcArrayLock, and is not
holding any other LWLock when it does. The deadlock checker attempts to
get and hold all of the other lock partition locks. So the deadlock
checker already does the thing you're saying might be dangerous, and the
startup process doesn't.

The ProcArrayLock is only taken as a way of signaling other backends. If
that is particularly unsafe we could redesign that aspect.

> It might be worth pointing out here that LockBufferForCleanup is
> already
> known to be a risk factor for undetected deadlocks, even without HS in
> the picture, because of the possibility of deadlocks involving a chain
> of both heavyweight locks and LWLocks.  Whether HS makes it materially
> worse may be something that we need field experience to determine.

You may be right that it will be a problem.

The deadlock risk we're protecting against is a deadlock involving both
normal locks and buffer pins. We're safer having this code than not,
IMHO.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 03/09/10 21:50, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 03/09/10 21:16, Tom Lane wrote:
>>> It's probably not too unreasonable to assume that pid_t assignment is
>>> atomic.  But I'm still thinking that we have bigger problems than that
>>> if there are really cases where SetLatch can execute at approximately
>>> the same time as a latch owner is coming or going.
>
>> I don't see how to avoid it. A walsender, or any process really, can
>> exit at any time. It can make the latch inaccessible to others before it
>> exits to minimize the window, but it's always going to be possible that
>> another process is just about to call SetLatch when you exit.
>
> Well, in that case what we need to do is presume that the latch object
> has a continuing existence but the owner/receiver can come and go.
> I would suggest that InitLatch needs to initialize the object into a
> valid but unowned state; there is *no* deinitialize operation; and
> there are AcquireLatch and ReleaseLatch operations to become owner
> or stop being owner.

I think we have just a terminology issue. What you're describing is 
exactly how it works now, if you just s/InitLatch/AcquireLatch. At the 
moment there's no need for an initialization function other than the 
InitLatch/AcquireLatch that associates the latch with the current 
process. I can add one for the sake of future-proofing, and to have 
better-defined behavior for setting a latch that has not been owned by 
anyone yet, but it's not strictly necessary.

>  We also need to define the semantics of SetLatch
> on an unowned latch --- does this set a signal condition that will be
> available to the next owner?

At the moment, no. Perhaps that would be useful; separating the Init and 
Acquire operations would be needed to make that sane.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 03/09/10 21:50, Tom Lane wrote:
>> Well, in that case what we need to do is presume that the latch object
>> has a continuing existence but the owner/receiver can come and go.
>> I would suggest that InitLatch needs to initialize the object into a
>> valid but unowned state; there is *no* deinitialize operation; and
>> there are AcquireLatch and ReleaseLatch operations to become owner
>> or stop being owner.

> I think we have just a terminology issue. What you're describing is 
> exactly how it works now, if you just s/InitLatch/AcquireLatch.

No, it isn't.  What I'm suggesting requires breaking InitLatch into two
operations.

>> We also need to define the semantics of SetLatch
>> on an unowned latch --- does this set a signal condition that will be
>> available to the next owner?

> At the moment, no. Perhaps that would be useful, separating the Init and 
> Acquire operations is needed to make that sane.

Exactly.  I'm not totally sure either if it would be useful, but the
current design makes it impossible to allow that.

BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
ill chosen: ReleaseLatch sounds way too much like something that would
just unlock or clear the latch.  Perhaps OwnLatch/DisownLatch, or
something along that line.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 06/09/10 17:18, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> I think we have just a terminology issue. What you're describing is
>> exactly how it works now, if you just s/InitLatch/AcquireLatch.
>
> No, it isn't.  What I'm suggesting requires breaking InitLatch into two
> operations.
>
>>> We also need to define the semantics of SetLatch
>>> on an unowned latch --- does this set a signal condition that will be
>>> available to the next owner?
>
>> At the moment, no. Perhaps that would be useful, separating the Init and
>> Acquire operations is needed to make that sane.
>
> Exactly.  I'm not totally sure either if it would be useful, but the
> current design makes it impossible to allow that.

Ok, I've split the Init and Acquire steps into two.

> BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
> ill chosen: ReleaseLatch sounds way too much like something that would
> just unlock or clear the latch.  Perhaps OwnLatch/DisownLatch, or
> something along that line.

Yeah, I see what you mean. Although, maybe it's just me but Own/Disown
looks ugly. Anyone have a better suggestion?

Here's an updated patch, with all the issues reported this far fixed,
except for that naming issue, and Fujii's suggestion to use poll()
instead of select() where available. I've also polished it quite a bit,
improving comments etc. Magnus, can you take a look at the Windows
implementation to check that it's sane? At least it seems to work.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
Hi,

On 09/06/2010 06:27 PM, Heikki Linnakangas wrote:
> Here's an updated patch, with all the issues reported this far fixed,
> except for that naming issue, and Fujii's suggestion to use poll()
> instead of select() where available. I've also polished it quite a bit,
> improving comments etc. Magnus, can you take a look at the Windows
> implementation to check that it's sane? At least it seems to work.

Is pselect() really as unportable as stated in the patch? What platforms 
have problems with pselect()?

Using the self-pipe trick, don't we risk running into the open file 
handles limitation? Or is it just two handles per process?

Do I understand correctly that the purpose of this patch is to work 
around the brokenness of select() on very few platforms? Or is there any 
additional feature that plain signals don't give us?

> + * It's important to reset the latch*before*  checking if there's work to
> + * do. Otherwise, if someone sets the latch between the check and the
> + * ResetLatch call, you will miss it and Wait will block.

Why doesn't WaitLatch() clear it? What's the use case for waiting for a 
latch and *not* wanting to reset it?

Regards

Markus


Markus Wanner <markus@bluegap.ch> writes:
> Is pselect() really as unportable as stated in the patch? What platforms 
> have problems with pselect()?

Well, it's not defined in the Single Unix Spec, which is our customary
reference for portability.  Also, it's alleged that some platforms have
it but in a form that's not actually any safer than select().  For
example, I read in the Darwin man page for it

IMPLEMENTATION NOTES
     The pselect() function is implemented in the C library as a wrapper
     around select().

and that man page appears to be borrowed verbatim from FreeBSD.

> Using the self-pipe trick, don't we risk running into the open file 
> handles limitation? Or is it just two handles per process?

It's just two handles per process.
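A minimal sketch of the self-pipe trick under discussion; this is illustrative code with hypothetical helper names modeled on the patch's terminology, not the committed PostgreSQL implementation:

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

static int selfpipe_readfd = -1;
static int selfpipe_writefd = -1;

/* Create the pipe; both ends are made non-blocking so the signal
 * handler can never block on a full pipe. Just two fds per process. */
int
initSelfPipe(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0)
        return -1;
    selfpipe_readfd = fds[0];
    selfpipe_writefd = fds[1];
    return 0;
}

/* Called from a signal handler: wake up anyone select()ing on the read
 * end. EAGAIN just means the pipe is already full, which is harmless:
 * the sleeper will wake up anyway. */
void
sendSelfPipeByte(void)
{
    ssize_t rc;

    do
    {
        rc = write(selfpipe_writefd, "", 1);
    } while (rc < 0 && errno == EINTR);
}

/* Drain the pipe after select() reports it readable; reading a larger
 * buffer saves syscalls when several bytes have queued up. Returns on
 * EAGAIN once the pipe is empty, thanks to O_NONBLOCK. */
void
drainSelfPipe(void)
{
    char buf[16];

    while (read(selfpipe_readfd, buf, sizeof(buf)) > 0)
        ;                       /* keep reading until empty */
}
```

The waiter then includes selfpipe_readfd in its select() read set alongside any other descriptors it cares about.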
        regards, tom lane


On 09/06/2010 08:46 PM, Tom Lane wrote:
> Well, it's not defined in the Single Unix Spec, which is our customary
> reference for portability.

FWIW, I bet the self-pipe trick isn't mentioned there, either... any 
good precedent that it actually works as expected on all of the target 
platforms? Existing users of the self-pipe trick?

(You are certainly aware that pselect() is defined in newer versions of 
POSIX).

> Also, it's alleged that some platforms have
> it but in a form that's not actually any safer than select().  For
> example, I read in the Darwin man page for it
>
> IMPLEMENTATION NOTES
>       The pselect() function is implemented in the C library as a wrapper
>       around select().

Ouch. Indeed, quick googling reveals the following source code for Darwin:

http://www.opensource.apple.com/source/Libc/Libc-391.5.22/gen/FreeBSD/pselect.c

Now that you are mentioning it, I seem to recall that even glibc had a 
user-space "implementation" of pselect. Fortunately, that's quite some 
years ago.

> and that man page appears to be borrowed verbatim from FreeBSD.

At least FreeBSD seems to have fixed this about 8 months ago:
http://svn.freebsd.org/viewvc/base?view=revision&revision=200725

Maybe Darwin catches up eventually?


AFAICT the custom select() implementation we are using for Windows could 
easily be changed to mimic pselect() instead. Thus most reasonably 
up-to-date Linux distributions plus Windows certainly provide a workable 
pselect() syscall. Would it be worth using pselect() for those (and 
maybe others that support pselect() appropriately)?

> It's just two handles per process.

Good. How about syscall overhead? One more write operation to the 
self-pipe per signal from within the signal handler and one read to 
actually clear the 'ready' state of the pipe from the waiter portion of 
the code, right?

Do we plan to replace all (or most) existing internal signals with these 
latches to circumvent the interruption problem? Or just the ones we need 
to wait for using pg_usleep()?

For Postgres-R, I'd probably have to extend it to call select() not only 
on the self-pipe, but on at least one other socket as well (to talk to 
the GCS). As long as that's possible, it looks like a more portable 
replacement for the pselect() variant that's currently in place there.

Regards

Markus


Markus Wanner <markus@bluegap.ch> writes:
> AFAICT the custom select() implementation we are using for Windows could 
> easily be changed to mimic pselect() instead. Thus most reasonably 
> up-to-date Linux distributions plus Windows certainly provide a workable 
> pselect() syscall. Would it be worth using pselect() for those (and 
> maybe others that support pselect() appropriately)?

I don't entirely see the point of opening ourselves up to the risk of
using a pselect that's not safe under the hood.  In any case, on most
modern platforms poll() is preferable to any variant of select().
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 06/09/10 23:10, Markus Wanner wrote:
> Good. How about syscall overhead? One more write operation to the
> self-pipe per signal from within the signal handler and one read to
> actually clear the 'ready' state of the pipe from the waiter portion of
> the code, right?

Right.

> Do we plan to replace all (or most) existing internal signals with these
> latches to circumvent the interruption problem? Or just the ones we need
> to wait for using pg_usleep()?

At least the poll loops in bgwriter and walwriter need to be replaced if 
we want to fix the issue Tom mentioned earlier that the server currently 
wakes up periodically, waking up the CPU which wastes electricity. 
There's no hurry to replace other code.

> For Postgres-R, I'd probably have to extend it to call select() not only
> on the self-pipe, but on at least one other socket as well (to talk to
> the GCS). As long as that's possible, it looks like a more portable
> replacement for the pselect() variant that's currently in place there.

Yeah, that would be a straightforward extension.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 06/09/10 20:24, Markus Wanner wrote:
> On 09/06/2010 06:27 PM, Heikki Linnakangas wrote:
>> + * It's important to reset the latch*before* checking if there's work to
>> + * do. Otherwise, if someone sets the latch between the check and the
>> + * ResetLatch call, you will miss it and Wait will block.
>
> Why doesn't WaitLatch() clear it? What's the use case for waiting for a
> latch and *not* wanting to reset it?

Setting a latch that's already set is very fast, so you want to keep it 
set until the last moment. See the coding in walsender for example, it 
goes to some lengths to avoid clearing the latch until it's very sure 
there's no more work for it to do. That helps to keep the overhead in 
backends committing transactions low. (no-one has tried to measure that 
yet, though)
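The reset-before-check ordering can be sketched in a deliberately simplified, single-process form. The real latch lives in shared memory and adds a signal/self-pipe wakeup; only the flag logic is shown here so the ordering argument is visible. Names follow the patch, but the code is illustrative:

```c
#include <signal.h>
#include <stdbool.h>

typedef struct Latch
{
    volatile sig_atomic_t is_set;
} Latch;

void
SetLatch(Latch *latch)
{
    /* Fast no-op if already set: that is why callers can afford to
     * leave the latch set until the last moment. */
    if (!latch->is_set)
        latch->is_set = true;   /* real code would also wake the sleeper */
}

void
ResetLatch(Latch *latch)
{
    latch->is_set = false;
}

/* One iteration of the canonical wait loop. ResetLatch comes *before*
 * the work check: if someone set the latch after the check but before
 * the reset, the set would be wiped out and the wakeup lost. */
bool
serviceLatch(Latch *latch, bool *work_pending)
{
    ResetLatch(latch);          /* first reset ... */
    if (*work_pending)          /* ... then check for work */
    {
        *work_pending = false;  /* "do" the work */
        return true;
    }
    /* here the real loop would WaitLatch() until SetLatch wakes it */
    return false;
}
```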

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


On 09/07/2010 09:06 AM, Heikki Linnakangas wrote:
> Setting a latch that's already set is very fast, so you want to keep it
> set until the last moment. See the coding in walsender for example, it
> goes to some lengths to avoid clearing the latch until it's very sure
> there's no more work for it to do. That helps to keep the overhead in
> backends committing transactions low. (no-one has tried to measure that
> yet, though)

Understood, thanks.

Markus


Hi,

On 09/06/2010 11:03 PM, Tom Lane wrote:
> I don't entirely see the point of opening ourselves up to the risk of
> using a pselect that's not safe under the hood.

It should be possible to reliably determine the platforms that provide 
an atomic pselect(). For those, I'm hesitant to use a "trick", where 
pselect() clearly provides a simpler and more "official" alternative. 
Especially considering that those platforms form the vast majority for 
running Postgres on.

What I'm most concerned about is the write() syscall within the signal 
handler. If that fails for another reason than those covered, we miss 
the signal. As Heikki points out in the comment, it's hard to deal with 
such a failure.

Regarding the exact implementation, the positioning of drainSelfPipe in 
Heikki's implementation seems strange to me. Most descriptions of the 
self-pipe trick [1] [2] [4] put the drainSelfPipe() just after the 
select(), where you can be sure there actually is something to read. 
(Except [3], which recommends putting it inside the signal handler, 
which I find even more frightening).

Maybe you can read more than one byte at a time in drainSelfPipe(), to 
save some syscalls?

Talking about the trick itself again: I found a lot of descriptions and 
mentioning of the self-pipe trick, but so far I only found an unknown 
window manager [5] and the custom inetd that's mentioned in the LWN 
article [4] which really use that trick. Digging deeper revealed that 
there's a sigsafe library [6] as well as the bglibs [7], which both seem 
to use the self-pipe trick as well (of which the latter doesn't even care 
about the write()'s return value in the signal handler). Neither of these 
libraries seems to be used in any project of relevance.

Overall I got the impression that people like to describe the trick, 
because it sounds so nifty and clever. However, I'd feel more 
comfortable if I knew there were some medium to large projects that 
actually use that trick. But AFAICT not even Bernstein's qmail does.

> In any case, on most
> modern platforms poll() is preferable to any variant of select().

Only Linux provides a ppoll() variant. And poll() itself doesn't replace 
pselect().


Overall, I'm glad this gets addressed. Note that this is a long standing 
issue for Postgres-R and it's covered with a lengthy comment in its TODO 
file [8].

Regards

Markus Wanner


[1] D. J. Bernstein, The self-pipe trick
http://cr.yp.to/docs/selfpipe.html

[2] Emile van Bergen, Avoiding races with Unix signals and select()
http://www.xs4all.nl/~evbergen/unix-signals.html

[3] Alex Pennace, Safe UNIX Signal Handling Tips
http://osiris.978.org/~alex/safesignalhandling.html

[4] LWN Article: The new pselect() system call, mentions the self-pipe 
trick in a comment
http://lwn.net/Articles/176911/

[5] Karmen: a window manager
http://freshmeat.net/projects/karmen

[6] sigsafe library
http://www.slamb.org/projects/sigsafe/

[7] Bruce Guenter, one stop library package
http://untroubled.org/bglibs/

[8] Postgres-R TODO entry

http://git.postgres-r.org/?p=Postgres-R;a=blob;f=src/backend/replication/TODO;h=7bfc37ee9629943b9ff052d571b9d933ab38a0a8;hb=HEAD#l12



Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 08/09/10 20:36, Markus Wanner wrote:
> On 09/06/2010 11:03 PM, Tom Lane wrote:
>> I don't entirely see the point of opening ourselves up to the risk of
>> using a pselect that's not safe under the hood.
>
> It should be possible to reliably determine the platforms that provide
> an atomic pselect(). For those, I'm hesitant to use a "trick", where
> pselect() clearly provides a simpler and more "official" alternative.
> Especially considering that those platforms form the vast majority for
> running Postgres on.

Perhaps, but I'm equally concerned that having different implementations 
for different platforms means that all implementations get less testing 
than if we use only one. Because of that I'm actually reluctant to even 
use poll() where available instead of select(). At least in the first 
phase, until someone demonstrates that there's a measurable difference 
in performance. We only call poll/select when we're about to sleep, so 
it's not really that performance critical anyway.

> What I'm most concerned about is the write() syscall within the signal
> handler. If that fails for another reason than those covered, we miss
> the signal. As Heikki points out in the comment, it's hard to deal with
> such a failure.

Yeah, there isn't much you can do about it. Perhaps you could set a 
"mayday flag" (a global boolean variable) if it fails, and check that in 
the main loop, elogging a warning there instead. But I don't think we 
need to go to such lengths, realistically the write() will never fail or 
you have bigger problems.

> Maybe you can read more than one byte at a time in drainSelfPipe(), to
> save some syscalls?

Perhaps, although it should be very rare to have more than one byte in 
the pipe. SetLatch doesn't write another byte if the latch is already 
set, so you only get multiple bytes in the pipe if many processes set 
the latch at the same instant.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 08/09/10 20:36, Markus Wanner wrote:
>> It should be possible to reliably determine the platforms that provide
>> an atomic pselect(). For those, I'm hesitant to use a "trick", where
>> pselect() clearly provides a simpler and more "official" alternative.
>> Especially considering that those platforms form the vast majority for
>> running Postgres on.

> Perhaps, but I'm equally concerned that having different implementations 
> for different platforms means that all implementations get less testing 
> than if we use only one.

There's that, and there's also that Markus' premise is full of holes.
Exactly how will you determine that pselect is safe at compile time?
Even if you correctly determine that, how can you be sure that the
finished executable will only be run against a version of libc that has
a safe implementation?  Considering that we know that major platforms
such as FreeBSD have changed their implementations *very* recently,
it seems foolish to assume that an executable built on a machine with
corrected pselect could not be run on one with an older implementation.

> Because of that I'm actually reluctant to even 
> use poll() where available instead of select(). At least in the first 
> phase, until someone demonstrates that there's a measurable difference 
> in performance.

select() is demonstrably a loser whenever the process has a lot of open
files.  Also, we have plenty of experience with substituting poll() for
select(), so I'm not too worried about copy-and-pasting such code.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Martijn van Oosterhout
Date:
On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote:
> Do I understand correctly that the purpose of this patch is to work
> around the brokenness of select() on very few platforms? Or is there any
> additional feature that plain signals don't give us?

If the issue is just that select() doesn't get interrupted and we don't
care about a couple of syscalls, would it not be better to simply use
sigaction to turn on SA_RESTART just prior to the select() and turn it
off just after. Or are these systems so broken that select() won't be
interrupted, even if the signal handler is explicitly configured to do
so?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Martijn van Oosterhout <kleptog@svana.org> writes:
> If the issue is just that select() doesn't get interrupted and we don't
> care about a couple of syscalls, would it not be better to simply use
> sigaction to turn on SA_RESTART just prior to the select() and turn it
> off just after. Or are these systems so broken that select() won't be
> interrupted, even if the signal handler is explicitly configured to do
> so?

I think you mean turn *off* SA_RESTART.  We'd have to do that for each
signal that we were concerned about allowing to interrupt the select(),
so it's more than just two added calls.  Another small problem is that
the latch code doesn't/shouldn't know what handlers are active, and
AFAICS you can't use sigaction() to flip that flag without setting the
handler address too.  So while maybe we could do it that way, it'd be
pretty dang messy.

In my mind the main value of the Latch code will be to have a clean
platform-independent API for waiting.  Why all the angst about whether
the implementation underneath is clean or not?  It's more important that
it *works* and we don't have to worry about whether it will break on
platform XYZ.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 08/09/10 23:07, Martijn van Oosterhout wrote:
> On Mon, Sep 06, 2010 at 07:24:59PM +0200, Markus Wanner wrote:
>> Do I understand correctly that the purpose of this patch is to work
>> around the brokenness of select() on very few platforms? Or is there any
>> additional feature that plain signals don't give us?
>
> If the issue is just that select() doesn't get interrupted and we don't
> care about a couple of syscalls, would it not be better to simply use
> sigaction to turn on SA_RESTART just prior to the select() and turn it
> off just after. Or are these systems so broken that select() won't be
> interrupted, even if the signal handler is explicitly configured to do
> so?

I don't know if SA_RESTART is portable. But in any case, that will do 
nothing about the race condition where the signal arrives just *before* 
the select() call.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


On 09/08/2010 08:18 PM, Tom Lane wrote:
> Considering that we know that major platforms
> such as FreeBSD have changed their implementations *very* recently,
> it seems foolish to assume that an executable built on a machine with
> corrected pselect could not be run on one with an older implementation.

FWIW testing a recent development (i.e. 9.0-devel) version of FreeBSD 
still failed to properly support pselect().

> Also, we have plenty of experience with substituting poll() for
> select(), so I'm not too worried about copy-and-pasting such code.

It's certainly a simpler change than the self-pipe trick vs. pselect(), yes.

I'm happy to go with the self-pipe trick. A quick micro-benchmark didn't 
even show any significant difference compared to pselect(), so from that 
perspective, it's not a big deal.

And we'd then have a major project using the self-pipe trick. (I would 
still like to know what others exist).

Regards

Markus Wanner


On 09/08/2010 08:01 PM, Heikki Linnakangas wrote:
> Yeah, there isn't much you can do about it. Perhaps you could set a
> "mayday flag" (a global boolean variable) if it fails, and check that in
> the main loop, elogging a warning there instead. But I don't think we
> need to go to such lengths, realistically the write() will never fail or
> you have bigger problems.

Hm.. I think I'd like to see such a mayday flag. Just so we at least 
have a chance of finding out that something has gone wrong - whether or 
not there's a bigger problem.

> Perhaps, although it should be very rare to have more than one byte in
> the pipe. SetLatch doesn't write another byte if the latch is already
> set, so you only get multiple bytes in the pipe if many processes set
> the latch at the same instant.

Depending on what you use these latches for, it might not be that rare 
anymore. Trying to read more than one byte at a time doesn't cost anything.

Regards

Markus


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 06/09/10 19:27, Heikki Linnakangas wrote:
> On 06/09/10 17:18, Tom Lane wrote:
>> BTW, on reflection the AcquireLatch/ReleaseLatch terminology seems a bit
>> ill chosen: ReleaseLatch sounds way too much like something that would
>> just unlock or clear the latch. Perhaps OwnLatch/DisownLatch, or
>> something along that line.
>
> Yeah, I see what you mean. Although, maybe it's just me but Own/Disown
> looks ugly. Anyone have a better suggestion?

Magnus suggested AssociateLatch, given that the description of the 
function is that it associates the latch with the current process. I 
went with Own/Disown after all, it feels more precise, and having made 
the changes it doesn't look that ugly to me anymore.

> Here's an updated patch, with all the issues reported this far fixed,
> except for that naming issue, and Fujii's suggestion to use poll()
> instead of select() where available. I've also polished it quite a bit,
> improving comments etc. Magnus, can you take a look at the Windows
> implementation to check that it's sane? At least it seems to work.

We discussed the patch over IM; there's one minor point I'd like 
to get into the archives:

> It seems that NumSharedLatches() is entirely wrongly placed if it's in
>   the win32 specific code! That needs to be somewhere shared, so people find it,

Yeah. There's a notice of that in OwnLatch(), but it would be nice if we 
could make it even more prominent. One idea is to put it in latch.h as:

#define NumSharedLatches() (max_wal_senders /* + something else in the 
future */ )

When it's a #define, we don't need to put #include "walsender.h" in 
latch.h, it's enough to put it in win32_latch.c. It's a bit weird to 
have a #define in one header file that doesn't work unless you #include 
another header file in where you use it, but it would work. Any opinions 
on whether that's better than having NumSharedLatches() defined in the 
Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in 
win32_latch.c, but I'm not sure.

Barring any last-minute objections, I'll commit this in the next few 
days. This patch doesn't affect walreceiver yet; I think the next step 
is to use the latches to eliminate the polling loop in walreceiver too, 
so that as soon as a piece of WAL is fsync'd to disk in the standby, 
it's applied.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 06/09/10 19:27, Heikki Linnakangas wrote:
>> It seems that NumSharedLatches() is entirely wrongly placed if it's in
>> the win32 specific code! That needs to be somewhere shared, so people find it,

> Yeah. There's a notice of that in OwnLatch(), but it would be nice if we 
> could make it even more prominent. One idea is to put in latch.h as:

> #define NumSharedLatches() (max_wal_senders /* + something else in the 
> future */ )

> When it's a #define, we don't need to put #include "walsender.h" in 
> latch.h, it's enough to put it in win32_latch.c. It's a bit weird to 
> have a #define in one header file that doesn't work unless you #include 
> another header file in where you use it, but it would work.

We have other precedents for that.  But having said that ...

> Any opinions 
> on whether that's better than having NumSharedLatches() defined in the 
> Win32-specific win32_latch.c file? I'm inclined to leave it as it is, in 
> win32_latch.c, but I'm not sure.

I'd leave it alone.  I see no very good reason to expose
NumSharedLatches globally.

> Barring any last-minute objections, I'll commit this in the next few 
> days. This patch doesn't affect walreceiver yet; I think the next step 
> is to use the latches to eliminate the polling loop in walreceiver too, 
> so that as soon as a piece of WAL is fsync'd to disk in the standby, 
> it's applied.

I will do some work as well once it's in.  Since I was the one
complaining about extra wakeups in the other processes, I'm willing
to go fix 'em.
        regards, tom lane


Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 11/09/10 18:02, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Barring any last-minute objections, I'll commit this in the next few
>> days. This patch doesn't affect walreceiver yet; I think the next step
>> is to use the latches to eliminate the polling loop in walreceiver too,
>> so that as soon as a piece of WAL is fsync'd to disk in the standby,
>> it's applied.
>
> I will do some work as well once it's in.  Since I was the one
> complaining about extra wakeups in the other processes, I'm willing
> to go fix 'em.

Committed. I'll take a look at making walreceiver respond quickly when 
WAL arrives in the standby, using latches, but that shouldn't interfere 
with what you're doing.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


On Sat, 2010-09-11 at 19:15 +0300, Heikki Linnakangas wrote:
> Committed. I'll take a look at making walreceiver respond quickly when 
> WAL arrives in the standby, using latches, but that shouldn't interfere 
> with what you're doing.

I glanced at the code, and I see (in OwnLatch()):

+       if (latch->owner_pid != 0)
+               elog(ERROR, "latch already owned");
+       latch->owner_pid = MyProcPid;

But it looks like there may be a race there. I assume the callers are
supposed to have a lock at this point (and it looks like the current
caller is safe, but I didn't look in detail). Something still seems
strange to me though -- why throw an ERROR there if it can't happen (or
indicates an inconsistent state when it does happen)? And it doesn't
look safe to _not_ throw an error (due to a race) if it does happen.

It seems like OwnLatch is only supposed to be used when you already
know that you own it, because there's no error code returned or blocking
behavior, it just throws an ERROR (unless it's safe _not_ to throw an
error in the event of a race, which I don't think it is). But if that's
the case, what's the point of OwnLatch()?

Perhaps add a few comments to describe to other users of the API how to
properly own a shared latch?

Regards,
    Jeff Davis



Jeff Davis <pgsql@j-davis.com> writes:
> I glanced at the code, and I see (in OwnLatch()):

> +       if (latch->owner_pid != 0)
> +               elog(ERROR, "latch already owned");
> +       latch->owner_pid = MyProcPid;

> But it looks like there may be a race there.

Yeah, that error check is only intended to catch gross logic errors,
not to guard against race conditions.  I don't think we really could
prevent a race there without adding a spinlock, which seems like
overkill.

> ... why throw an ERROR there if it can't happen (or
> indicates an inconsistent state when it does happen)?

Are you suggesting that an Assert would be sufficient?
        regards, tom lane


On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote:
> > ... why throw an ERROR there if it can't happen (or
> > indicates an inconsistent state when it does happen)?
> 
> Are you suggesting that an Assert would be sufficient?
> 

I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
aren't available in production systems, which I assume is why elog was
used); but we should be consistent and document that:
  (a) it shouldn't happen
  (b) that it's just a sanity check and we're ignoring the race

However, that also means that the whole concept of OwnLatch/DisownLatch
is entirely redundant, and only there for asserts because it doesn't do
anything else. That seems a little strange to me, as well, so (at
minimum) it should be documented that the functions really have no
effect on execution and are required only to support debugging.

Regards,
    Jeff Davis



On Sun, 2010-09-12 at 10:13 -0700, Jeff Davis wrote:
> I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
> aren't available in production systems, which I assume is why elog was
> used); but we should be consistent and document that:
>  (a) it shouldn't happen
>  (b) that it's just a sanity check and we're ignoring the race

I should add that Assert seems to imply both of those things (at least
to me), so that would solve one of the confusing aspects of the API.

The other confusing part that I think still needs comments is that
OwnLatch/DisownLatch don't really do anything; they just carry
information for sanity checks later.

Regards,
    Jeff Davis



Jeff Davis <pgsql@j-davis.com> writes:
> However, that also means that the whole concept of OwnLatch/DisownLatch
> is entirely redundant, and only there for asserts because it doesn't do
> anything else. That seems a little strange to me, as well, so (at
> minimum) it should be documented that the functions really have no
> effect on execution and are required only to support debugging.

Uh, this is nonsense.  You have to have something like these functions
to support transferring ownership of a latch from one process to
another, which is required at least for the walreceiver usage.

It's correct that the latch code itself isn't trying very hard to avoid
a race condition in acquiring ownership, but that doesn't make the whole
thing useless, it just means that we're assuming that will be avoided
by logic elsewhere.  If there is a bug elsewhere that allows two
different processes to try to take ownership of the same latch, the
current coding will expose that bug soon enough.
        regards, tom lane


On Sun, 2010-09-12 at 14:12 -0400, Tom Lane wrote:
> Uh, this is nonsense.  You have to have something like these functions
> to support transferring ownership of a latch from one process to
> another, which is required at least for the walreceiver usage.

Oh, I see. It's needed to know where to send the signal, of course.

Regards,
	Jeff Davis



Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 12/09/10 20:13, Jeff Davis wrote:
> On Sun, 2010-09-12 at 12:29 -0400, Tom Lane wrote:
>>> ... why throw an ERROR there if it can't happen (or
>>> indicates an inconsistent state when it does happen)?
>>
>> Are you suggesting that an Assert would be sufficient?
>
> I'm not too picky about whether it's Assert, ERROR, or PANIC (Asserts
> aren't available in production systems, which I assume is why elog was
> used);

Right, OwnLatch is not performance-critical, so it's better to elog() 
IMHO.

> but we should be consistent and document that:
>   (a) it shouldn't happen
>   (b) that it's just a sanity check and we're ignoring the race

Would this be sufficient?

--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
     if (selfpipe_readfd == -1)
         initSelfPipe();
 
+    /* sanity check */
     if (latch->owner_pid != 0)
         elog(ERROR, "latch already owned");
     latch->owner_pid = MyProcPid;

Or you want to suggest something better?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote:
> > but we should be consistent and document that:
> >   (a) it shouldn't happen
> >   (b) that it's just a sanity check and we're ignoring the race
> 
> Would this be sufficient?
> 
> --- a/src/backend/port/unix_latch.c
> +++ b/src/backend/port/unix_latch.c
> @@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
>       if (selfpipe_readfd == -1)
>           initSelfPipe();
> 
> +    /* sanity check */
>       if (latch->owner_pid != 0)
>           elog(ERROR, "latch already owned");
>       latch->owner_pid = MyProcPid;
> 
> Or you want to suggest something better?

Perfect. I was just slightly confused reading it the first time, and I
think that would have cleared it up.

Thanks,
	Jeff Davis




Re: Interruptible sleeps (was Re: CommitFest 2009-07: Yay, Kevin! Thanks, reviewers!)

From
Heikki Linnakangas
Date:
On 13/09/10 20:43, Jeff Davis wrote:
> On Mon, 2010-09-13 at 09:10 +0300, Heikki Linnakangas wrote:
>>> but we should be consistent and document that:
>>>    (a) it shouldn't happen
>>>    (b) that it's just a sanity check and we're ignoring the race
>>
>> Would this be sufficient?
>>
>> --- a/src/backend/port/unix_latch.c
>> +++ b/src/backend/port/unix_latch.c
>> @@ -156,6 +156,7 @@ OwnLatch(volatile Latch *latch)
>>        if (selfpipe_readfd == -1)
>>            initSelfPipe();
>>
>> +    /* sanity check */
>>        if (latch->owner_pid != 0)
>>            elog(ERROR, "latch already owned");
>>        latch->owner_pid = MyProcPid;
>>
>> Or you want to suggest something better?
>
> Perfect. I was just slightly confused reading it the first time, and I
> think that would have cleared it up.

Ok, added that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com