Thread: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

Hello,

I'm investigating a mysterious hang problem on PostgreSQL 9.2.8.  If many 
sessions use temporary tables whose rows are deleted on commit, the hang 
occurs.  I'd like to show you the stack trace, but I'm trying to figure out 
how to reproduce the problem.  IIRC, the stack trace was as follows.  The 
standby server was running normally.

...
SyncRepWaitForLSN
CommitTransaction
CommitTransactionCommand
ProcessCatchupEvent
HandleCatchupInterrupt
procsignal_sigusr1_handler
<SIGUSR1 received>
recv
...
ReadCommand
PostgresMain
...


Looking at smgrtruncate(), the sinval message is sent even when the 
truncated relation is a temporary relation.  However, I think the sinval 
message is not necessary for temp relations, because each session doesn't 
see the temp relations of other sessions.  So, the following code seems 
better.  This avoids sinval queue overflow which leads to SIGUSR1.  Is this 
correct?

if (SmgrIsTemp(reln))   /* Do his on behalf of sinval message handler */   smgrclosenode(reln->smgr_rnode);
else   CacheInvalidateSmgr(reln->smgr_rnode);


Regards
MauMau




"MauMau" <maumau307@gmail.com> writes:
> Looking at smgrtruncate(), the sinval message is sent even when the 
> truncated relation is a temporary relation.  However, I think the sinval 
> message is not necessary for temp relations, because each session doesn't 
> see the temp relations of other sessions.

This seems like a pretty unsafe suggestion, because the smgr level doesn't
know or care whether relations are temp; files are files.  In any case it
would only paper over one specific instance of whatever problem you're
worried about, because sinval messages definitely do need to be sent in
general.
        regards, tom lane



From: "Tom Lane" <tgl@sss.pgh.pa.us>
> This seems like a pretty unsafe suggestion, because the smgr level doesn't
> know or care whether relations are temp; files are files.  In any case it
> would only paper over one specific instance of whatever problem you're
> worried about, because sinval messages definitely do need to be sent in
> general.

I'm sorry I don't show the exact problem yet.  Apart from that, I understood 
that you insist it's not appropriate for smgr to be aware of whether the 
file is a temporary relation, in terms of module layering.  However, it 
doesn't seem so in the current implementation.  md.c, which is a layer under 
or part of smgr, uses SmgrIsTemp().  In addition, as the name suggests, 
SmgrIsTemp() is a function of smgr, which is defined in smgr.h.  So, it's 
not inappropriate for smgr to use it.

What I wanted to ask is whether and why sinval messages are really necessary 
for session-private objects like temp relations.  I thought shared inval is, 
as the name indicates, for objects "shared" among sessions.  If so, sinval 
for session-private objects doesn't seem to match the concept.

Regards
MauMau




On Wed, Jul 23, 2014 at 12:17 PM, MauMau <maumau307@gmail.com> wrote:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> This seems like a pretty unsafe suggestion, because the smgr level doesn't
>> know or care whether relations are temp; files are files.  In any case it
>> would only paper over one specific instance of whatever problem you're
>> worried about, because sinval messages definitely do need to be sent in
>> general.
>
> I'm sorry I don't show the exact problem yet.  Apart from that, I understood
> that you insist it's not appropriate for smgr to be aware of whether the
> file is a temporary relation, in terms of module layering.  However, it
> doesn't seem so in the current implementation.  md.c, which is a layer under
> or part of smgr, uses SmgrIsTemp().  In addition, as the name suggests,
> SmgrIsTemp() is a function of smgr, which is defined in smgr.h.  So, it's
> not inappropriate for smgr to use it.
>
> What I wanted to ask is whether and why sinval messages are really necessary
> for session-private objects like temp relations.  I thought shared inval is,
> as the name indicates, for objects "shared" among sessions.  If so, sinval
> for session-private objects doesn't seem to match the concept.

I think the problem here is that it actually is possible for one
session to access the temporary objects of another session:

rhaas=# create temp table fructivore (a int);
CREATE TABLE
rhaas=# select 'fructivore'::regclass::oid; oid
-------24578
(1 row)

Switch windows:

rhaas=# select 24578::regclass;      regclass
----------------------pg_temp_2.fructivore
(1 row)

rhaas=# alter table pg_temp_2.fructivore add column b int;
ALTER TABLE

Now, we could prohibit that specific thing.  But at the very least, it
has to be possible for one session to drop another session's temporary
objects, because autovacuum does it eventually, and superusers will
want to do it sooner to shut autovacuum up.  So it's hard to reason
about whether and to what extent it's safe to not send sinval messages
for temporary objects.

I think you might be approaching this problem from the wrong end,
though.  The question in my mind is: why does the
StartTransactionCommand() / CommitTransactionCommand() pair in
ProcessCatchupEvent() end up writing a commit record?  The obvious
possibility that occurs to me is that maybe rereading the invalidated
catalog entries causes a HOT prune, and maybe there ought to be some
way for a transaction that has only done HOT pruning to commit
asynchronously, just as we already do for transactions that only
modify temporary tables.  Or, failing that, maybe there's a way to
suppress synchronous commit for this particular transaction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2014-07-24 11:17:15 -0400, Robert Haas wrote:
> I think you might be approaching this problem from the wrong end,
> though.

Yep.

>  The question in my mind is: why does the
> StartTransactionCommand() / CommitTransactionCommand() pair in
> ProcessCatchupEvent() end up writing a commit record?  The obvious
> possibility that occurs to me is that maybe rereading the invalidated
> catalog entries causes a HOT prune, and maybe there ought to be some
> way for a transaction that has only done HOT pruning to commit
> asynchronously, just as we already do for transactions that only
> modify temporary tables.  Or, failing that, maybe there's a way to
> suppress synchronous commit for this particular transaction.

I think we should do what the first paragraph in
http://archives.postgresql.org/message-id/20140707155113.GB1136%40alap3.anarazel.de
outlined. As Tom says somewhere downthread that requires some code
review, but other than that it should get rid of a fair amount of
problems.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



From: "Robert Haas" <robertmhaas@gmail.com>
> I think the problem here is that it actually is possible for one
> session to access the temporary objects of another session:
> Now, we could prohibit that specific thing.  But at the very least, it
> has to be possible for one session to drop another session's temporary
> objects, because autovacuum does it eventually, and superusers will
> want to do it sooner to shut autovacuum up.  So it's hard to reason
> about whether and to what extent it's safe to not send sinval messages
> for temporary objects.

I was a bit surprised to know that one session can access the data of 
another session's temporary tables.  That implenentation nay be complicating 
the situation -- extra sinval messages.


> I think you might be approaching this problem from the wrong end,
> though.  The question in my mind is: why does the
> StartTransactionCommand() / CommitTransactionCommand() pair in
> ProcessCatchupEvent() end up writing a commit record?  The obvious
> possibility that occurs to me is that maybe rereading the invalidated
> catalog entries causes a HOT prune, and maybe there ought to be some
> way for a transaction that has only done HOT pruning to commit
> asynchronously, just as we already do for transactions that only
> modify temporary tables.  Or, failing that, maybe there's a way to
> suppress synchronous commit for this particular transaction.

I could figure out what log record was output in the transaction started in 
ProcessCatchupEvent() by inserting elog() in XLogInsert().  The log record 
was (RM_STANDBY_ID, XLOG_STANDBY_LOCK).

The cause of the hang turned out clear.  It was caused as follows:

1. When a transaction commits which used a temporary table created with ON 
COMMIT DELETE ROWS, the sinval catchup signal (SIGUSR1) was issued from 
smgrtruncate().  This is because the temporary table is truncated at 
transaction end.

2. Another session, which is waiting for a client request, receives SIGUSR1. 
It calls ProcessCatchupEvent().

3. ProcessCatchupEvent() calls StartTransactionCommand(), emits the 
XLOG_STANDBY_LOCK WAL record, and then calls CommitTransactionCommand().

4. It then calls SyncRepWaitForLSN(), which in turn waits on the latch.

5. But the WaitLatch() never returns, because the session is already running 
inside the SIGUSR1 handler in step 2.  WaitLatch() needs SIGUSR1 to 
complete.

I think there is a problem with the latch and SIGUSR1 mechanism.  How can we 
fix this problem?

Regards
MauMau





From: "Andres Freund" <andres@2ndquadrant.com>
> I think we should do what the first paragraph in
> http://archives.postgresql.org/message-id/20140707155113.GB1136%40alap3.anarazel.de
> outlined. As Tom says somewhere downthread that requires some code
> review, but other than that it should get rid of a fair amount of
> problems.

As mentioned in the mail I've just sent,  there seems to be a problem around 
the latch and/or sinval catchup implementation.

Or, is it bad that many things are done in SIGUSR1 handler?  If some 
processing in SIGUSR1 handler requires waiting on a latch, it hangs at 
WaitLatch().  Currently, the only processing in the backend which requires a 
latch may be to wait for the sync standby.  However, in the future, the 
latch may be used for more tasks.

Another problem is, who knows WaitLatch() can return prematurely (before the 
actual waited-for event does SetLatch()) due to the SIGUSR1 issued for 
sinval catchup event?

How should we tackle these problem?

Regards
MauMau




"MauMau" <maumau307@gmail.com> writes:
> [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]

Ugh.

One line of thought is that it's pretty unsafe to be doing anything
as complicated as transaction start/commit in a signal handler, even one
that is sure it's not interrupting anything else.  The only excuse
ProcessCatchupEvent has for that is that it's "trying to ensure proper
cleanup from an error".  Perhaps with closer analysis we could convince
ourselves that errors thrown from AcceptInvalidationMessages() wouldn't
need transaction abort cleanup.

For that matter, it's not exactly clear that an error thrown out of
there would result in good things even with the transaction
infrastructure.  Presuming that we're waiting for client input, an
error longjmp out of ProcessCatchupEvent could mean taking control
away from OpenSSL, and I bet that won't end well.  Maybe we should
just be doing
PG_TRY    AcceptInvalidationMessages();PG_CATCH    elog(FATAL, "curl up and die");

ProcessIncomingNotify is also using
StartTransactionCommand()/CommitTransactionCommand(), so that would
need some thought too.

Or we could try to get rid of the need to do anything beyond setting
a flag in the interrupt handler itself.  But I'm afraid that's probably
unworkable, especially now that we use SA_RESTART on all signals.

Another line of thought is that we've been way too uncritical about
shoving different kinds of events into the SIGUSR1 multiplexor.
It might be a good idea to separate "high level" interrupts from
"low level" ones, using say SIGUSR1 for the former and SIGUSR2
for the latter.  However, that doesn't sound very back-patchable,
even assuming that we can come up with a clean division.
        regards, tom lane



On 2014-07-26 11:32:24 -0400, Tom Lane wrote:
> "MauMau" <maumau307@gmail.com> writes:
> > [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
> 
> Ugh.
> 
> One line of thought is that it's pretty unsafe to be doing anything
> as complicated as transaction start/commit in a signal handler, even one
> that is sure it's not interrupting anything else.

Yea, that's really not nice.

> The only excuse
> ProcessCatchupEvent has for that is that it's "trying to ensure proper
> cleanup from an error".  Perhaps with closer analysis we could convince
> ourselves that errors thrown from AcceptInvalidationMessages() wouldn't
> need transaction abort cleanup.

Hm. I'm not convinced that's going to be easy.

Wouldn't it be better to move the catchup interrupt processing out of
the signal handler? For normal backends we only enable when reading from
the client and DoingCommandRead is set. How about setting a variable in
the signal handler and doing the actual catchup processing after the
recv() returned EINTR? That'd require either renegging on SA_RESTART or
using WaitLatchOrSocket() and nonblocking send/recv.  I think that'd be
a much safer model and after researching it a bit for the idle in
transaction timeout thing it doesn't look super hard. Even windows seems
to already support everything necessary.

> Or we could try to get rid of the need to do anything beyond setting
> a flag in the interrupt handler itself.  But I'm afraid that's probably
> unworkable, especially now that we use SA_RESTART on all signals.

Yea :(. Several platforms actually ignore SA_RESTART for
send/recv/select/... under some circumstances (notably linux), but it'd
probably be hard to get it right for all.

I don't think we can continue to use the blocking calls for much longer
unless we allow them to be interruptible. But I doubt that change would
be backpatchable.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Andres Freund <andres@2ndquadrant.com> writes:
> Wouldn't it be better to move the catchup interrupt processing out of
> the signal handler? For normal backends we only enable when reading from
> the client and DoingCommandRead is set. How about setting a variable in
> the signal handler and doing the actual catchup processing after the
> recv() returned EINTR?

Only it won't.  See SA_RESTART.  I think turning that off is a nonstarter,
as per previous discussions.

> That'd require either renegging on SA_RESTART or
> using WaitLatchOrSocket() and nonblocking send/recv.

Yeah, I was wondering about using WaitLatchOrSocket for client I/O too.
We already have a hook that lets us do the actual recv even when using
OpenSSL, and in principle that function could do interrupt-service-like
functions if it got kicked off the recv().

Anything in this line is going to be a bigger change than I'd want to
back-patch, though.  Are we OK with not fixing the problem in the back
branches?  Given the shortage of field complaints, that might be all
right.
        regards, tom lane



On 2014-07-26 13:58:38 -0400, Tom Lane wrote:

> Andres Freund <andres@2ndquadrant.com> writes:
> > That'd require either renegging on SA_RESTART or
> > using WaitLatchOrSocket() and nonblocking send/recv.
> 
> Yeah, I was wondering about using WaitLatchOrSocket for client I/O too.
> We already have a hook that lets us do the actual recv even when using
> OpenSSL, and in principle that function could do interrupt-service-like
> functions if it got kicked off the recv().

I've started playing with this. Looks clearly worthwile.

I think if we do it right we pretty much can get rid of the whole
prepare_for_client_read() machinery and handle everything via
ProcessInterrupts(). EnableCatchupInterrupt() et al don't really fill me
with joy.

I'm not yet entirely sure where the interrupt processing should happen,
but I guess that'll fall out of the work at some point. The important
bit imo is to *not* not do anything but return with BIO_set_retry_*()
from my_sock_read/write(). That then allows us to implement stuff like
the idle transaction timeout with much fewer problems.

I probably won't finish doing this before leaving on holidays, so nobody
should hesitate to look themselves if interested. If not, I plan to pick
this up again.  I think it's a prerequisite to getting rid of the FATAL
for recovery conflict interrupts which I really would like to do.

> Anything in this line is going to be a bigger change than I'd want to
> back-patch, though.

Agreed. I don't think it will, but it very well could have performance
implications. Besides the obvious risk of bugs...

> Are we OK with not fixing the problem in the back
> branches?  Given the shortage of field complaints, that might be all
> right.

I'm not really comfortable with that. How about simply flagging a couple
contexts to not do the SyncRepWaitForLsn() dance? Possibly just by doing
something ugly like
SetConfigOption('synchronous_commit', 'off', PGC_INTERNAL,
PGC_S_OVERRIDE, GUC_ACTION_LOCAL, true, ERROR)?
during startup, inval and similar transaction commands? Not pretty, but
it looks simple enough to be backpatchable.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]

I must add one thing.  After some client processes closed the connection 
without any hang, their server processes were stuck with a stack trace like 
this (I'll look for and show the exact stack trace tomorrow):

WaitLatchOrSocket
SyncRepWaitForLSN
CommitTransaction
CommitTransactionCommand
ProcessCatchupEvent
...
shmem_exit
proc_exit_prepare
proc_exit
PostgresMain
...

The process appears to be hanging during session termination.  So, it's not 
the problem only during client request wait.


> Another line of thought is that we've been way too uncritical about
> shoving different kinds of events into the SIGUSR1 multiplexor.
> It might be a good idea to separate "high level" interrupts from
> "low level" ones, using say SIGUSR1 for the former and SIGUSR2
> for the latter.  However, that doesn't sound very back-patchable,
> even assuming that we can come up with a clean division.

This seems to be one step in the right direction.  There are two issues in 
the current implementation:

[Issue 1]
[ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
This is (partly) because the latch wakeup and other processing use the same 
SIGUSR1 in normal backend, autovacuum launcher/worker, and the background 
worker with database access.  On the other hand, other background daemon 
processes properly use SIGUSR1 only for latch wakeup, and SIGUSR2 for other 
tasks.

[Issue 2]
WaitLatch() returns prematurely due to the sinval catchup signal, even 
though the target event (e.g. reply from standby) hasn't occurred and called 
SetLatch() yet.  This is because procsignal_sigusr1_handler() always calls 
latch_sigusr1_handler().
This is probably not related to the cause of the hang.

So, as you suggest, I think it would be a good idea to separate 
latch_sigusr1_handler() call into a different function solely for it like 
other daemon processes,  and leave the rest in procsignal_sigusr1_handler() 
and rename function to procsignal_sigusr2_handler() or procsignal_handler(). 
Originally, it's not natural that the current procsignal_sigusr1_handler() 
contains latch_sigusr1_handler() call, because latch processing is not based 
on the procsignal mechanism (SetLatch() doesn't call SendProcSignal()).

I'll try the fix tomorrow if possible.  What kind of problems do you hink of 
for back-patching?

Regards
MauMau





From: "MauMau" <maumau307@gmail.com>
> I must add one thing.  After some client processes closed the connection 
> without any hang, their server processes were stuck with a stack trace 
> like this (I'll look for and show the exact stack trace tomorrow):

I found two kinds of stack traces:

#0  0x0000003199ec488f in poll () from /lib64/libc.so.6
#1  0x0000000000609f24 in WaitLatchOrSocket ()
#2  0x000000000063ad92 in SyncRepWaitForLSN ()
#3  0x00000000004ad474 in CommitTransaction ()
#4  0x00000000004aef53 in CommitTransactionCommand ()
#5  0x000000000064b547 in shmem_exit ()
#6  0x000000000064b625 in proc_exit_prepare ()
#7  0x000000000064b6a8 in proc_exit ()
#8  0x0000000000668a94 in PostgresMain ()
#9  0x0000000000617f2c in ServerLoop ()
#10 0x000000000061ae96 in PostmasterMain ()
#11 0x00000000005b2ccf in main ()

#0  0x0000003f4badf258 in poll () from /lib64/libc.so.6
#1  0x0000000000619b94 in WaitLatchOrSocket ()
#2  0x0000000000640c4c in SyncRepWaitForLSN ()
#3  0x0000000000491c18 in RecordTransactionCommit ()
#4  0x0000000000491d98 in CommitTransaction ()
#5  0x0000000000493135 in CommitTransactionCommand ()
#6  0x0000000000653fc5 in ProcessCatchupEvent ()
#7  0x00000000006540ed in HandleCatchupInterrupt ()
#8  0x00000000006533e3 in procsignal_sigusr1_handler ()
#9  <signal handler called>
#10 0x0000003f4bae96b0 in recv () from /lib64/libc.so.6
#11 0x00000000005b75f6 in secure_read ()
#12 0x00000000005c223b in pq_recvbuf ()
#13 0x00000000005c263b in pq_getbyte ()
#14 0x000000000066e081 in PostgresMain ()
#15 0x0000000000627d81 in PostmasterMain ()
#16 0x00000000005c4803 in main ()


> I'll try the fix tomorrow if possible.  What kind of problems do you hink 
> of for back-patching?

I could reproduce the problem with 9.2.8, but have not yet with 9.5dev. 
I'll try with 9.2.9, and create the fix.

Regards
MauMau





On 2014-07-26 20:20:05 +0200, Andres Freund wrote:
> On 2014-07-26 13:58:38 -0400, Tom Lane wrote:
> 
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > That'd require either renegging on SA_RESTART or
> > > using WaitLatchOrSocket() and nonblocking send/recv.
> > 
> > Yeah, I was wondering about using WaitLatchOrSocket for client I/O too.
> > We already have a hook that lets us do the actual recv even when using
> > OpenSSL, and in principle that function could do interrupt-service-like
> > functions if it got kicked off the recv().
> 
> I've started playing with this. Looks clearly worthwile.
> 
> I think if we do it right we pretty much can get rid of the whole
> prepare_for_client_read() machinery and handle everything via
> ProcessInterrupts(). EnableCatchupInterrupt() et al don't really fill me
> with joy.

One thing I am wondering about around this is: Why are we only
processing catchup events when DoingCommandRead? There's other paths
where we can wait for data from the client for a long time. Obviously we
don't want to process async.c stuff from inside copy, but I don't see
why that's the case for sinval.c.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Andres Freund <andres@2ndquadrant.com> writes:
> One thing I am wondering about around this is: Why are we only
> processing catchup events when DoingCommandRead? There's other paths
> where we can wait for data from the client for a long time. Obviously we
> don't want to process async.c stuff from inside copy, but I don't see
> why that's the case for sinval.c.

It might be all right to do it during copy, but I'd just as soon treat
that as a separate issue.  If you merge it into the basic patch then it
might be hard to get rid of if there are problems.
        regards, tom lane



On 2014-07-28 15:29:57 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > One thing I am wondering about around this is: Why are we only
> > processing catchup events when DoingCommandRead? There's other paths
> > where we can wait for data from the client for a long time. Obviously we
> > don't want to process async.c stuff from inside copy, but I don't see
> > why that's the case for sinval.c.
> 
> It might be all right to do it during copy, but I'd just as soon treat
> that as a separate issue.  If you merge it into the basic patch then it
> might be hard to get rid of if there are problems.

Yea, not planning to merge it. Just wondering to make sure I understand
all the implications.

Another thing I'm wondering about - also not for the basic patch - is
accepting termination while writing to the client. It's rather annoying
that we currently don't allow to pg_terminate_backend() when writing to
the client.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 2014-07-26 20:20:05 +0200, Andres Freund wrote:
> On 2014-07-26 13:58:38 -0400, Tom Lane wrote:
>
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > That'd require either renegging on SA_RESTART or
> > > using WaitLatchOrSocket() and nonblocking send/recv.
> >
> > Yeah, I was wondering about using WaitLatchOrSocket for client I/O too.
> > We already have a hook that lets us do the actual recv even when using
> > OpenSSL, and in principle that function could do interrupt-service-like
> > functions if it got kicked off the recv().
>
> I've started playing with this. Looks clearly worthwile.
>
> I think if we do it right we pretty much can get rid of the whole
> prepare_for_client_read() machinery and handle everything via
> ProcessInterrupts(). EnableCatchupInterrupt() et al don't really fill me
> with joy.
>
> I'm not yet entirely sure where the interrupt processing should happen,
> but I guess that'll fall out of the work at some point. The important
> bit imo is to *not* not do anything but return with BIO_set_retry_*()
> from my_sock_read/write(). That then allows us to implement stuff like
> the idle transaction timeout with much fewer problems.
>
> I probably won't finish doing this before leaving on holidays, so nobody
> should hesitate to look themselves if interested. If not, I plan to pick
> this up again.  I think it's a prerequisite to getting rid of the FATAL
> for recovery conflict interrupts which I really would like to do.

I tried to get something reviewable before leaving, but alas, there's
far too many edgecases to consider. A good part of it has to do with my
decision to always operate the underlying socket in nonblocking mode and
do all the writing using latches. I think that's the right decision,
because it allows reliable interruptions everywhere and is easier than
duplicated codepaths. But I realize that's not guaranteed to be well
liked.

a) Latches aren't ready early enough. We need to read/write to the
socket long before MyProc is initialized. To have sane behaviour during
early startup we need to be canceleable there was well.

Right now I'm just busy looping in that case, but that's obviously not
acceptable. My best idea is to have another latch that we can use during
early startup. Say *MyProcLatch. Initially that points to a process
local latch and once initialized it points to MyProc->procLatch.

b) Latches don't support WL_WRITEABLE without WL_READABLE. I've simply
changed the code in both latch implementations to poll for errors in
both cases and set both readable/writeable if an error occurs. That
seems easy enough. Are there any bigger caveats for changing this?

c) There's a couple more or less undocumented
pgwin32_waitforsinglesocket() calls in be-secure.c. Afaics they're
likely partially already not required and definitely not required after
socket handling is baded on latches.

d) prepare_for_client_read(), client_read_ended() are now really quite a
misnomer. Because interrupts interrupt send/recv appropriately and
because we do *not* want to process them inside my_sock_read|write (so
we don't recursively enter openssl to send the FATAL to the client)
they're not used from within ssl anymore. But on a higher level, just
like:
    prepare_for_client_read();
    CHECK_FOR_INTERRUPTS();
    client_read_ended();
Not sure if that's the right abstraction for just a:
    EnableNotifyInterrupt();
    EnableCatchupInterrupt();
    CHECK_FOR_INTERRUPTS();
    DisableNotifyInterrupt();
    DisableCatchupInterrupt();
where the Enable* just set a variable so CHECK_FOR_INTERRUPTS can
process the events if occuring.

Doing it that way allows to throw away *large* chunks of code from
sinval.c and async.c because they don't have to care about doing
dangerous stuff from signal handlers anymore.

But the above needs a new name.


Even though it ends up being a bit more work than I anticipated, the
result still seems like a significant improvement over the current
code.

Will get back to this once I'm back (~10 days).  Unless somebody else
wants to pick this up. I've at attached my *heavily WIP* patch.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
From: "MauMau" <maumau307@gmail.com>
>> I'll try the fix tomorrow if possible.  What kind of problems do you hink
>> of for back-patching?
>
> I could reproduce the problem with 9.2.8, but have not yet with 9.5dev.
> I'll try with 9.2.9, and create the fix.

I could also reproduce the problem with 9.2.9, but I couldn't with 9.5devel.

However, I could confirm that the attached patch solves the problem.  The
patch is based on 9.2.9.  To adjust this patch for 9.3 and later, set the
background worker's SIGUSR2 handler to procsignal_sigusr2_handler like
normal backends.  Could you review and commit this?  We wish the fix for 9.1
and above.

To reproduce the problem, you can do as follows with the attached files.
Originally, test.sql had many columns the customer is actually using, but we
cannot show the real DDL as it's the customer's asset.

$ createdb test
$ pgbench -c 20 -j 20 -T 600 -s -n -f test.pgbench

Synchronous streaming replication and hot standby need to be used.

When I ran this on a 4-core RHEL5 box, the problem arose within a few
minutes.  pgbench continues to run emitting a lot of messages like below.
Over time, the number of normal backends will increases, and dozens of which
remain stuck with the stack trace I showed before.

CREATE TABLE
psql:test.sql:1: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit
index "test_table_pkey" for table "test_table"

I have no idea why 9.5devel didn't show the problem.  One difference I
noticed is that pgbench didn't output the message of implicit index creation
for the primary key.

Regards
MauMau

Attachment
Please find attached the revised patch.  I failed to change SIGUSR1 to
SIGUSR2 when sending sinval catchup signal.

In addition, I changed wal sender to not receive SIGUSR1/2 from
SendProcSignal().  Without this, wal sender will excessively wake up or
terminate by the signals.  This is an existing problem.

Regards
MauMau

Attachment
I've tracked down the real root cause.  The fix is very simple.  Please
check the attached one-liner patch.

The cause is that the temporary relations are truncated unconditionally
regardless of whether they are accessed in the transaction or not.  That is,
the following sequence of steps result in the hang:

1. A session creates a temporary table with ON COMMIT DELETE ROWS.  It adds
the temporary table to the list of relations that should be truncated at
transaction commit.

2. The session receives a sinval catchup signal (SIGUSR1) from another
session.  It starts a transaction and processes sinval messages in the
SIGUSR1 signal handler.  No WAL is output while processing the sinval
messages.

3. When the transaction commits, the list of temporary relations are checked
to see if they need to be truncated.

4. The temporary table created in step 1 is truncated.  To truncate a
relation, Access Exclusive lock is acquired on it.  When hot standby is
used, acquiring an Access Exclusive lock generates a WAL record
(RM_STANDBY_ID, XLOG_STANDBY_LOCK).

5. The transaction waits on a latch for a reply from a synchronous standby,
because it wrote some WAL.  But the latch wait never returns, because the
latch needs to receive SIGUSR1 but the SIGUSR1 handler is already in
progress from step 2.


The correct behavior is for the transaction not to truncate the temporary
table in step 4, because the transaction didn't use the temporary table.

I confirmed that the fix is already in 9.3 and 9.5devel, so I just copied
the code fragment from 9.5devel to 9.2.9.  The attached patch is for 9.2.9.
I didn't check 9.4 and other versions.  Why wasn't the fix applied to 9.2?

Finally, I found a very easy way to reproduce the problem:

1. On terminal session 1, start psql and run:
  CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS;
Leave the psql session open.

2. On terminal session 2, run:
  pgbench -c8 -t500 -s1 -n -f test.sql dbname
[test.sql]
CREATE TEMPORARY TABLE t (c int) ON COMMIT DELETE ROWS;
DROP TABLE t;

3. On the psql session on terminal session 1, run any SQL statement.  It
doesn't reply.  The backend is stuck at SyncRepWaitForLSN().

Regards
MauMau

Attachment
On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
> I've tracked down the real root cause.  The fix is very simple.  Please
> check the attached one-liner patch.
>
> I confirmed that the fix is already in 9.3 and 9.5devel, so I just copied
> the code fragment from 9.5devel to 9.2.9.  The attached patch is for 9.2.9.
> I didn't check 9.4 and other versions.  Why wasn't the fix applied to 9.2?

It was considered a performance improvement - i.e. a feature - rather
than a bug fix, so it was only added to master.  That was after the
release of 9.2 and before the release of 9.3, so it's in newer
branches but not older ones.

Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Branch: master Release: REL9_3_BR [c9d7dbacd] 2013-01-29 10:43:33 +0200
   Skip truncating ON COMMIT DELETE ROWS temp tables, if the transaction hasn't   touched any temporary tables.
   We could try harder, and keep track of whether we've inserted to any temp   tables, rather than accessed them, and
whichtemp tables have been inserted   to. But this is dead simple, and already covers many interesting scenarios.
 

I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
problem.  As the commit message says, it's dead simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
>> I've tracked down the real root cause.  The fix is very simple.  Please
>> check the attached one-liner patch.

> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
> problem.  As the commit message says, it's dead simple.

While I have no great objection to back-porting Heikki's patch, it seems
like a very large stretch to call this a root-cause fix.  At best it's
band-aiding one symptom in a rather fragile way.
        regards, tom lane



> Robert Haas <robertmhaas@gmail.com> writes:
>> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
>> problem.  As the commit message says, it's dead simple.

From: "Tom Lane" <tgl@sss.pgh.pa.us>
> While I have no great objection to back-porting Heikki's patch, it seems
> like a very large stretch to call this a root-cause fix.  At best it's
> band-aiding one symptom in a rather fragile way.

Thank you, Robert san.  I'll be waiting for it to be back-ported to the next 
9.1/9.2 release.

Yes, I think this failure is only one potential symptom caused by the 
implemnentation mistake -- handling both latch wakeup and other tasks that 
wait on a latch in the SIGUSR1 handler.  Although there may be no such tasks 
now, I'd like to correct and clean up the implementation as follows to avoid 
similar problems in the future.  I think it's enough to do this only for 
9.5.  Please correct me before I go deeper in the wrong direction.

* The SIGUSR1 handler only does latch wakeup.  Any other task is done in 
other signal handlers such as SIGUSR2.  Many daemon postgres processes 
follow this style, but the normal backend, autovacuum daemons, and 
background workers don't now.

* InitializeLatchSupport() in unix_latch.c calls pqsignal(SIGUSR1, 
latch_sigusr1_handler).  Change the argument of latch_sigusr1_handler() 
accordingly.

* Remove SIGUSR1 handler registration and process-specific SIGUSR1 handler 
functions from all processes.  We can eliminate many SIGUSR1 handler 
functions which have the same contents.

Regards
MauMau






On Mon, Aug 11, 2014 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
>>> I've tracked down the real root cause.  The fix is very simple.  Please
>>> check the attached one-liner patch.
>
>> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
>> problem.  As the commit message says, it's dead simple.
>
> While I have no great objection to back-porting Heikki's patch, it seems
> like a very large stretch to call this a root-cause fix.  At best it's
> band-aiding one symptom in a rather fragile way.

Yeah, that's true, but I'm not clear that we have another
back-patchable fix, so doing something almost-certainly-harmless to
alleviate the immediate pain seems worthwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

From
Andres Freund
Date:
On 2014-08-12 11:04:00 -0400, Robert Haas wrote:
> On Mon, Aug 11, 2014 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
> >>> I've tracked down the real root cause.  The fix is very simple.  Please
> >>> check the attached one-liner patch.
> >
> >> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
> >> problem.  As the commit message says, it's dead simple.
> >
> > While I have no great objection to back-porting Heikki's patch, it seems
> > like a very large stretch to call this a root-cause fix.  At best it's
> > band-aiding one symptom in a rather fragile way.
> 
> Yeah, that's true, but I'm not clear that we have another
> back-patchable fix, so doing something almost-certainly-harmless to
> alleviate the immediate pain seems worthwhile.

Isn't that still leaving the very related issue of waits due to hot
pruning open?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On Tue, Aug 12, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-12 11:04:00 -0400, Robert Haas wrote:
>> On Mon, Aug 11, 2014 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Robert Haas <robertmhaas@gmail.com> writes:
>> >> On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
>> >>> I've tracked down the real root cause.  The fix is very simple.  Please
>> >>> check the attached one-liner patch.
>> >
>> >> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
>> >> problem.  As the commit message says, it's dead simple.
>> >
>> > While I have no great objection to back-porting Heikki's patch, it seems
>> > like a very large stretch to call this a root-cause fix.  At best it's
>> > band-aiding one symptom in a rather fragile way.
>>
>> Yeah, that's true, but I'm not clear that we have another
>> back-patchable fix, so doing something almost-certainly-harmless to
>> alleviate the immediate pain seems worthwhile.
>
> Isn't that still leaving the very related issue of waits due to hot
> pruning open?

Yes.  Do you have a back-patchable solution for that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

From
Andres Freund
Date:
On 2014-08-12 11:56:41 -0400, Robert Haas wrote:
> On Tue, Aug 12, 2014 at 11:06 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-08-12 11:04:00 -0400, Robert Haas wrote:
> >> On Mon, Aug 11, 2014 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> > Robert Haas <robertmhaas@gmail.com> writes:
> >> >> On Fri, Aug 8, 2014 at 10:30 AM, MauMau <maumau307@gmail.com> wrote:
> >> >>> I've tracked down the real root cause.  The fix is very simple.  Please
> >> >>> check the attached one-liner patch.
> >> >
> >> >> I'd support back-porting that commit to 9.1 and 9.2 as a fix for this
> >> >> problem.  As the commit message says, it's dead simple.
> >> >
> >> > While I have no great objection to back-porting Heikki's patch, it seems
> >> > like a very large stretch to call this a root-cause fix.  At best it's
> >> > band-aiding one symptom in a rather fragile way.
> >>
> >> Yeah, that's true, but I'm not clear that we have another
> >> back-patchable fix, so doing something almost-certainly-harmless to
> >> alleviate the immediate pain seems worthwhile.
> >
> > Isn't that still leaving the very related issue of waits due to hot
> > pruning open?
> 
> Yes.  Do you have a back-patchable solution for that?

The easiest thing I can think of is sprinkling a few
SetConfigOption('synchronous_commit', 'off',               PGC_INTERNAL, PGC_S_OVERRIDE,
GUC_ACTION_LOCAL,true, ERROR);
 
in some strategic places. From a quick look:
* all of autovacuum
* locally in ProcessCompletedNotifies
* locally in ProcessIncomingNotify
* locally in ProcessCatchupEvent
* locally in InitPostgres

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-12 11:56:41 -0400, Robert Haas wrote:
>> Yes.  Do you have a back-patchable solution for that?

> The easiest thing I can think of is sprinkling a few
> SetConfigOption('synchronous_commit', 'off',
>                 PGC_INTERNAL, PGC_S_OVERRIDE,
>                 GUC_ACTION_LOCAL, true, ERROR);

This still seems to me to be applying a band-aid that covers over some
symptoms; it's not dealing with the root cause that we've overloaded
the signal handling mechanism too much.   What reason is there to think
that there are no other symptoms of that?
        regards, tom lane



Re: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

From
Andres Freund
Date:
On 2014-08-12 13:11:55 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-12 11:56:41 -0400, Robert Haas wrote:
> >> Yes.  Do you have a back-patchable solution for that?
> 
> > The easiest thing I can think of is sprinkling a few
> > SetConfigOption('synchronous_commit', 'off',
> >                 PGC_INTERNAL, PGC_S_OVERRIDE,
> >                 GUC_ACTION_LOCAL, true, ERROR);
> 
> This still seems to me to be applying a band-aid that covers over some
> symptoms; it's not dealing with the root cause that we've overloaded
> the signal handling mechanism too much.   What reason is there to think
> that there are no other symptoms of that?

I'm not arguing against fixing that. I think we need to do both,
although I am wary of fixing the signal handling in the back
branches. Fixing the signal handling won't get rid of the problem that
one e.g. might not be able to log in anymore if all synchronous standbys
are down and login caused hot pruning.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations

From
Andres Freund
Date:
Hi,

On 2014-07-26 18:16:01 +0200, Andres Freund wrote:
> On 2014-07-26 11:32:24 -0400, Tom Lane wrote:
> > "MauMau" <maumau307@gmail.com> writes:
> > > [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
> > 
> > Ugh.
> > 
> > One line of thought is that it's pretty unsafe to be doing anything
> > as complicated as transaction start/commit in a signal handler, even one
> > that is sure it's not interrupting anything else.
> 
> Yea, that's really not nice.

MauMau, we don't do this anymore. Could you verify that the issue is
fixed for you?

I'd completely forgotten that this thread made me work on moving
everything complicated out of signal handlers...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services