Thread: plperlu stored procedure seems to freeze for a minute

plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
A rather weird observation from the log files of our server (9.5 beta1):

2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value $success in
concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function
"mb_timeseriesdata_zmq" 
[lots of other stuff from different connections]
2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
                WDS::Macrobond::Utils::decode_result("") called at line 30
                main::__ANON__("gen_wqehur") called at -e line 0
                eval {...} called at -e line 0

Two messages from the same line of the same plperlu stored procedure, 68
seconds apart. So what is this line 36?

    confess "impossible result '$success' (payload=$payload)";

What? The first message clearly comes from interpolating $success
(which is undef at that point) into the argument. The second from
confess itself. What could cause a plperlu procedure to freeze for 68
seconds between the call to confess and its output?

Is it possible that only the writing of the log entry is delayed?

Another weird thing: $success is undef because a ØMQ rpc call[1] timed
out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
to have a default timeout of 60 seconds, and I don't set one). But at
09:24:45 (i.e. the time of the error message) the answer for that RPC
call arrived. So it kind of looks like confess waited for the message to
arrive (which makes no sense at all) or maybe that confess waited for
something which also blocked the sending of the request (because
according to the server logs, the RPC request only arrived there at
09:24:45 and was answered within 1 second), but that doesn't make any
sense either. (Just noticed that 60 + 68 == 128, which is also a round
number).

    hp


[1] ØMQ is an IPC framework: See http://zeromq.org/ We use it to make
    RPC calls from stored procedures to a server process.


--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
Adrian Klaver
Date:
On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> A rather weird observation from the log files of our server (9.5 beta1):
>
> 2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value $success
in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> 2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function
"mb_timeseriesdata_zmq"
> [lots of other stuff from different connections]
> 2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
>                  WDS::Macrobond::Utils::decode_result("") called at line 30
>                  main::__ANON__("gen_wqehur") called at -e line 0
>                  eval {...} called at -e line 0
>
> Two messages from the same line of the same plperlu stored procedure, 68
> seconds apart. So what is this line 36?
>
>      confess "impossible result '$success' (payload=$payload)";
>
> What? The first message clearly comes from interpolating $success
> (which is undef at that point) into the argument. The second from
> confess itself. What could cause a plperlu procedure to freeze for 68
> seconds between the call to confess and its output?
>
> Is it possible that only the writing of the log entry is delayed?
>
> Another weird thing: $success is undef because a ØMQ rpc call[1] timed

And the call is?

> out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
> to have a default timeout of 60 seconds, and I don't set one). But at

Network timeout?

> 09:24:45 (i.e. the time of the error message) the answer for that RPC
> call arrived. So it kind of looks like confess waited for the message to
> arrive (which makes no sense at all) or maybe that confess waited for
> something which also blocked the sending of the request (because
> according to the server logs, the RPC request only arrived there at
> 09:24:45 and was answered within 1 second), but that doesn't make any

So if the request timed out how did you get a reply, a second request?

> sense either. (Just noticed that 60 + 68 == 128, which is also a round
> number).


This looks like a question for a ØMQ list.

Another thought: have you tried the code outside Postgres to see if it works?

>
>     hp
>
>
> [1] ØMQ is an IPC framework: See http://zeromq.org/ We use it to make
>      RPC calls from stored procedures to a server process.
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: plperlu stored procedure seems to freeze for a minute

From
Alvaro Herrera
Date:
Peter J. Holzer wrote:

> Two messages from the same line of the same plperlu stored procedure, 68
> seconds apart. So what is this line 36?
>
>     confess "impossible result '$success' (payload=$payload)";
>
> What? The first message clearly comes from interpolating $success
> (which is undef at that point) into the argument. The second from
> confess itself. What could cause a plperlu procedure to freeze for 68
> seconds between the call to confess and its output?

Try running it under strace.  Maybe it's trying to resolve some name and
its resolver takes a minute to discover it can't, or something like that.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: plperlu stored procedure seems to freeze for a minute

From
Kevin Grittner
Date:
On Tue, Dec 1, 2015 at 8:51 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

> What could cause a plperlu procedure to freeze for 68
> seconds between the call to confess and its output?

> Another weird thing: $success is undef because a ØMQ rpc call[1] timed
> out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
> to have a default timeout of 60 seconds, and I don't set one).

One thing I would check is the TCP keepalive settings.

If you don't sort it out, please read this and post with more
information; it's tough to guess at a cause without knowing the
version of PostgreSQL or the OS involved, etc.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
> On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> >A rather weird observation from the log files of our server (9.5 beta1):
> >
> >2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value $success
in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function
"mb_timeseriesdata_zmq"
> >[lots of other stuff from different connections]
> >2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >                 WDS::Macrobond::Utils::decode_result("") called at line 30
> >                 main::__ANON__("gen_wqehur") called at -e line 0
> >                 eval {...} called at -e line 0
> >
> >Two messages from the same line of the same plperlu stored procedure, 68
> >seconds apart. So what is this line 36?
> >
> >     confess "impossible result '$success' (payload=$payload)";
> >
> >What? The first message clearly comes from interpolating $success
> >(which is undef at that point) into the argument. The second from
> >confess itself. What could cause a plperlu procedure to freeze for 68
> >seconds between the call to confess and its output?
> >
> >Is it possible that only the writing of the log entry is delayed?
> >
> >Another weird thing: $success is undef because a ØMQ rpc call[1] timed
>
> And the call is?

The sequence is:

my $req_sck = zmq_socket($context, ZMQ_REQ);
zmq_connect($req_sck, $url);
my $qry_msg = join(...);
zmq_send($req_sck, $qry_msg);
my $res_msg = zmq_msg_init();
my $rv = zmq_msg_recv($res_msg, $req_sck);
# check rv here.
my $data = zmq_msg_data($res_msg); # $data is "" here
my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages are from this function

(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
I'm not using it).

I omitted that because I don't think it's terribly relevant here.
Details of the usage of ØMQ are better discussed on the ØMQ mailing
list.

But there is something else which may be relevant: ØMQ uses threads
internally, and I don't actually know whether zmq_msg_recv returning
means that the read(2) call (or whatever) on the socket terminates.
It may actually continue in another thread. But I still don't see how
that could block the main thread (or wake it up again in a place which
has nothing to do with ØMQ (confess is a standard Perl function to print
a stack trace and die)). Or - just thinking aloud here - I fear I'm
abusing you guys as support teddy bears[1] - maybe it's the other way
round: confess dies, so maybe it frees some lock during cleanup which
allows the message which should have been sent by zmq_send to finally go
out on the wire. But that still doesn't explain the 68 seconds spent in
confess ...

Postgres worker processes are single-threaded, are they? Is there
something else which could interact badly with a moderately complex
multithreaded I/O library used from a stored procedure?

I suspect such an interaction because I cannot reproduce the problem
outside of a stored procedure. A standalone Perl script doing the same
requests doesn't get a timeout.

I guess Alvaro is right: I should strace the postgres worker process
while it executes the stored procedure. The problem of course is that
it happens often enough to be annoying, but rarely enough that it's not
easily reproducible.

> >out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
> >to have a default timeout of 60 seconds, and I don't set one). But at
>
> Network timeout?

That was my first guess, but I don't see where it would come from. Or
why it only is there if I call the code from a stored procedure, not
from a standalone script.

> >09:24:45 (i.e. the time of the error message) the answer for that RPC
> >call arrived. So it kind of looks like confess waited for the message to
> >arrive (which makes no sense at all) or maybe that confess waited for
> >something which also blocked the sending of the request (because
> >according to the server logs, the RPC request only arrived there at
> >09:24:45 and was answered within 1 second), but that doesn't make any
>
> So if the request timed out how did you get a reply, a second request?

Nope. I don't really "get" the reply. I just see in the logs of the
other server that it sent a reply at that time.

The time line is like this

time    postgres process        mb_dal process
T       zmq_send()
        zmq_msg_recv()

T+60    zmq_msg_recv returns
        an empty message;
        decode() is called
        which notices that the
        message is empty and
        calls confess();

T+128   confess() prints a      receives the message sent at T;
        stacktrace;             sends a reply;

        hp

[1] For those who don't know the story: Someone at a help desk noticed
that people often started to explain their problem and somewhere in the
middle said "never mind, I think I know what the problem is" and went
away. So he put a big teddy bear on the desk and asked people to explain
their problems to the bear. So a support teddy bear is someone who you
talk to mostly to clear your own mind, not because you expect a
solution.

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> But there is something else which may be relevant: ØMQ uses threads
> internally, and I don't actually know whether zmq_msg_recv returning
> means that the read(2) call (or whatever) on the socket terminates.

... ugh ...

> Postgres worker processes are single-threaded, are they? Is there
> something else which could interact badly with a moderately complex
> multithreaded I/O library used from a stored procedure?

Yes, lots.  If you cause additional threads to appear inside a backend
process, things could break arbitrarily badly.  It's up to you to ensure
that none of those extra threads ever escape to execute any non-Perl
code.  I suspect this could easily explain the problems you're seeing.

            regards, tom lane


Re: plperlu stored procedure seems to freeze for a minute

From
Adrian Klaver
Date:
On 12/01/2015 09:58 AM, Peter J. Holzer wrote:
> On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
>> On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
>>> A rather weird observation from the log files of our server (9.5 beta1):
>>>
>>> 2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value $success
in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
>>> 2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function
"mb_timeseriesdata_zmq"
>>> [lots of other stuff from different connections]
>>> 2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
>>>                  WDS::Macrobond::Utils::decode_result("") called at line 30
>>>                  main::__ANON__("gen_wqehur") called at -e line 0
>>>                  eval {...} called at -e line 0
>>>
>>> Two messages from the same line of the same plperlu stored procedure, 68
>>> seconds apart. So what is this line 36?
>>>
>>>      confess "impossible result '$success' (payload=$payload)";
>>>
>>> What? The first message clearly comes from interpolating $success
>>> (which is undef at that point) into the argument. The second from
>>> confess itself. What could cause a plperlu procedure to freeze for 68
>>> seconds between the call to confess and its output?
>>>
>>> Is it possible that only the writing of the log entry is delayed?
>>>
>>> Another weird thing: $success is undef because a ØMQ rpc call[1] timed
>>
>> And the call is?
>
> The sequence is:
>
> my $req_sck = zmq_socket($context, ZMQ_REQ);
> zmq_connect($req_sck, $url);
> my $qry_msg = join(...);
> zmq_send($req_sck, $qry_msg);
> my $res_msg = zmq_msg_init();
> my $rv = zmq_msg_recv($res_msg, $req_sck);
> # check rv here.
> my $data = zmq_msg_data($res_msg); # $data is "" here
> my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages are from this function
>
> (Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
> I'm not using it).
>
> I omitted that because I don't think it's terribly relevant here.
> Details of the usage of ØMQ are better discussed on the ØMQ mailing
> list.
>
> But there is something else which may be relevant: ØMQ uses threads
> internally, and I don't actually know whether zmq_msg_recv returning

Except I see this here:

http://api.zeromq.org/4-0:zmq-socket

Thread safety

ØMQ sockets are not thread safe. Applications MUST NOT use a socket from
multiple threads except after migrating a socket from one thread to
another with a "full fence" memory barrier.

> means that the read(2) call (or whatever) on the socket terminates.
> It may actually continue in another thread. But I still don't see how
> that could block the main thread (or wake it up again in a place which
> has nothing to do with ØMQ (confess is a standard Perl function to print
> a stack trace and die)). Or - just thinking aloud here - I fear I'm
> abusing you guys as support teddy bears[1] - maybe it's the other way
> round: confess dies, so maybe it frees some lock during cleanup which
> allows the message which should have been sent by zmq_send to finally go
> out on the wire. But that still doesn't explain the 68 seconds spent in
> confess ...
>
> Postgres worker processes are single-threaded, are they? Is there
> something else which could interact badly with a moderately complex
> multithreaded I/O library used from a stored procedure?
>
> I suspect such an interaction because I cannot reproduce the problem
> outside of a stored procedure. A standalone Perl script doing the same
> requests doesn't get a timeout.
>
> I guess Alvaro is right: I should strace the postgres worker process
> while it executes the stored procedure. The problem of course is that
> it happens often enough to be annoying, but rarely enough that it's not
> easily reproducible.

From here:

http://api.zeromq.org/4-0:zmq-connect

It seems you need something like this (I am not a Perl programmer, so approach carefully):

my $rc = zmq_connect($req_sck, $url);

Then you will have an error code to examine.

Have you looked at the Notes at the bottom of this page:

http://www.postgresql.org/docs/9.4/interactive/plperl-trusted.html

>
>>> out (after 60 seconds, which is also a mystery, because ØMQ doesn't seem
>>> to have a default timeout of 60 seconds, and I don't set one). But at
>>
>> Network timeout?
>
> That was my first guess, but I don't see where it would come from. Or
> why it only is there if I call the code from a stored procedure, not
> from a standalone script.
>
>>> 09:24:45 (i.e. the time of the error message) the answer for that RPC
>>> call arrived. So it kind of looks like confess waited for the message to
>>> arrive (which makes no sense at all) or maybe that confess waited for
>>> something which also blocked the sending of the request (because
>>> according to the server logs, the RPC request only arrived there at
>>> 09:24:45 and was answered within 1 second), but that doesn't make any
>>
>> So if the request timed out how did you get a reply, a second request?
>
> Nope. I don't really "get" the reply. I just see in the logs of the
> other server that it sent a reply at that time.
>
> The time line is like this
>
> time    postgres process        mb_dal process
> T       zmq_send()
>          zmq_msg_recv()
>
> T+60    zmq_msg_recv returns
>          an empty message;
>          decode() is called
>          which notices that the
>          message is empty and
>          calls confess();
>
> T+128   confess() prints a      receives the message sent at T;
>          stacktrace;             sends a reply;
>
>          hp
>
> [1] For those who don't know the story: Someone at a help desk noticed
> that people often started to explain their problem and somewhere in the
> middle said "never mind, I think I know what the problem is" and went
> away. So he put a big teddy bear on the desk and asked people to explain
> their problems to the bear. So a support teddy bear is someone who you
> talk to mostly to clear your own mind, not because you expect a
> solution.
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-01 10:20:09 -0800, Adrian Klaver wrote:
> On 12/01/2015 09:58 AM, Peter J. Holzer wrote:
> >On 2015-12-01 07:16:04 -0800, Adrian Klaver wrote:
> >>On 12/01/2015 06:51 AM, Peter J. Holzer wrote:
> >>>A rather weird observation from the log files of our server (9.5 beta1):
> >>>
> >>>2015-12-01 09:23:37 CET [26265]: [42-1] user=fragner,db=wds,pid=26265 WARNING:  Use of uninitialized value
$success in concatenation (.) or string at /usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >>>2015-12-01 09:23:37 CET [26265]: [43-1] user=fragner,db=wds,pid=26265 CONTEXT:  PL/Perl function
"mb_timeseriesdata_zmq"
> >>>[lots of other stuff from different connections]
> >>>2015-12-01 09:24:45 CET [26265]: [44-1] user=fragner,db=wds,pid=26265 ERROR:  impossible result '' (payload=) at
/usr/local/share/perl/5.20.2/WDS/Macrobond/Utils.pm line 36.
> >>>                 WDS::Macrobond::Utils::decode_result("") called at line 30
> >>>                 main::__ANON__("gen_wqehur") called at -e line 0
> >>>                 eval {...} called at -e line 0
> >>>
> >>>Two messages from the same line of the same plperlu stored procedure, 68
> >>>seconds apart. So what is this line 36?
> >>>
> >>>     confess "impossible result '$success' (payload=$payload)";
> >>>
> >>>What? The first message clearly comes from interpolating $success
> >>>(which is undef at that point) into the argument. The second from
> >>>confess itself. What could cause a plperlu procedure to freeze for 68
> >>>seconds between the call to confess and its output?
> >>>
> >>>Is it possible that only the writing of the log entry is delayed?
> >>>
> >>>Another weird thing: $success is undef because a ØMQ rpc call[1] timed
> >>
> >>And the call is?
> >
> >The sequence is:
> >
> >my $req_sck = zmq_socket($context, ZMQ_REQ);
> >zmq_connect($req_sck, $url);
> >my $qry_msg = join(...);
> >zmq_send($req_sck, $qry_msg);
> >my $res_msg = zmq_msg_init();
> >my $rv = zmq_msg_recv($res_msg, $req_sck);
> ># check rv here.
> >my $data = zmq_msg_data($res_msg); # $data is "" here
> >my $result = WDS::Macrobond::Utils::decode_result($data); # the error messages are from this function
> >
> >(Yeah, ØMQ is quite low-level. There is a higher level Perl Module, but
> >I'm not using it).
> >
> >I omitted that because I don't think it's terribly relevant here.
> >Details of the usage of ØMQ are better discussed on the ØMQ mailing
> >list.
> >
> >But there is something else which may be relevant: ØMQ uses threads
> >internally, and I don't actually know whether zmq_msg_recv returning
>
> Except I see this here:
>
> http://api.zeromq.org/4-0:zmq-socket
>
> Thread safety
>
> ØMQ sockets are not thread safe. Applications MUST NOT use a socket from
> multiple threads except after migrating a socket from one thread to another
> with a "full fence" memory barrier.

Well yes, but I don't use a ØMQ socket in multiple threads. It is
created in the stored procedure and destroyed at the end (just checked
the strace output: yes, it is. For a moment I wasn't sure whether lexical
variables in plperlu procedures go out of scope.). It's the ØMQ library
itself which creates extra threads (and it should terminate them
properly, and as far as I can see from strace it does).


[...]
> From here:
>
> http://api.zeromq.org/4-0:zmq-connect
>
> It seems something like(I am not a Perl programmer, so approach carefully):
>
> my $rc = zmq_connect($req_sck, $url);
>
> Then you will have an error code to examine.

Yes. I have been a bit sloppy with error checking. I check only the
return value of the zmq_msg_recv() call which returns the empty message.
It is possible that the problem actually occurs earlier and I should
check those calls as well. Mea culpa.

However, in at least one case the failed call was indeed zmq_msg_recv()
not one of the earlier ones (see my next mail).
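
For the record, the extra checking would look roughly like this. This is
only a sketch (untested), and it assumes that the low-level Perl binding
reports errors the same way as the C API does, i.e. a -1 return value with
$! set; $context and $url are set up as before, and @query_fields is just
a stand-in for whatever goes into the query message:

use Errno;    # makes %! available, e.g. $!{EINTR}

my $req_sck = zmq_socket($context, ZMQ_REQ)
    or die "zmq_socket failed: $!";

zmq_connect($req_sck, $url) == 0
    or die "zmq_connect to $url failed: $!";

my $qry_msg = join("\t", @query_fields);
zmq_send($req_sck, $qry_msg) >= 0
    or die "zmq_send failed: $!";

my $res_msg = zmq_msg_init();
my $rv = zmq_msg_recv($res_msg, $req_sck);
if ($rv < 0) {
    # $!{EINTR} here would mean the call was interrupted by a signal
    die "zmq_msg_recv failed: $!";
}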

> Have you looked at the Notes at the bottom of this page:
>
> http://www.postgresql.org/docs/9.4/interactive/plperl-trusted.html

I have. I don't think that's a problem here: Debian perl is built with
both multiplicity and ithreads, and I would assume that the .deb
packages from postgresql.org use the shared library provided by the
system.

But even if that wasn't the case it should not be a problem as only
plperlu stored procedures are called.

    hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-01 18:58:31 +0100, Peter J. Holzer wrote:
> I suspect such an interaction because I cannot reproduce the problem
> outside of a stored procedure. A standalone Perl script doing the same
> requests doesn't get a timeout.
>
> I guess Alvaro is right: I should strace the postgres worker process
> while it executes the stored procedure. The problem of course is that
> it happens often enough to be annoying, but rarely enough that it's not
> easily reproducible.

I did manage to catch a timeout once with strace in the meantime,
although that one was much more straightforward and less mysterious than
the original case: postgres process sends message, about 10 seconds
later it receives a SIGALRM which interrupts an epoll, reply hasn't yet
arrived, error message to client and log file. No waits in functions
which shouldn't wait or messages which arrive much later than they were
(presumably) sent.

The strace doesn't show a reason for the SIGALRM, though: no alarm(2) or
setitimer(2) system call. (I attached strace to a running postgres
process just after I got the prompt from "psql" and before I typed
"select * from mb_search('export');". I used a different (but very
similar) stored procedure for those tests because it is much easier to
find a search which is slow enough to trigger a timeout at least
sometimes than a data request, which normally finishes in
milliseconds.)

So I guess my next task will be to find out where that SIGALRM comes
from and/or whether I can just restart the zmq_msg_recv if it happens.
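
If restarting turns out to be safe, the receive would look roughly like
this (again just a sketch; it assumes that an interrupted zmq_msg_recv
returns -1 with $! set to EINTR, as the C function is documented to do):

use Errno;

my $res_msg = zmq_msg_init();
my $rv;
do {
    $rv = zmq_msg_recv($res_msg, $req_sck);
} while ($rv < 0 && $!{EINTR});    # retry if a stray signal interrupted us

if ($rv < 0) {
    die "zmq_msg_recv failed: $!";
}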

    hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-01 13:13:27 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > Postgres worker processes are single-threaded, are they? Is there
> > something else which could interact badly with a moderately complex
> > multithreaded I/O library used from a stored procedure?
>
> Yes, lots.  If you cause additional threads to appear inside a backend
> process, things could break arbitrarily badly.  It's up to you to ensure
> that none of those extra threads ever escape to execute any non-Perl
> code.

Actually, non-ØMQ code. Perl doesn't like to be unexpectedly
multithreaded either. Yes, those threads should only ever execute code
from the ØMQ library. In fact they are automatically created and
destroyed by the library and there is no way to control them from Perl
code (there may be a way to do that from the C API, but I don't remember
seeing that in the manual).

> I suspect this could easily explain the problems you're seeing.

Quite.

    hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-01 20:55:02 +0100, Peter J. Holzer wrote:
> On 2015-12-01 18:58:31 +0100, Peter J. Holzer wrote:
> > I suspect such an interaction because I cannot reproduce the problem
> > outside of a stored procedure. A standalone Perl script doing the same
> > requests doesn't get a timeout.
[...]
> The strace doesn't show a reason for the SIGALRM, though: no alarm(2) or
> setitimer(2) system call. (I attached strace to a running postgres
> process just after I got the prompt from "psql" and before I typed
> "select * from mb_search('export');". I used a different (but very
> similar) stored procedure for those tests because it is much easier to
> find a search which is slow enough to trigger a timeout at least
> sometimes than a data request, which normally finishes in
> milliseconds.)
>
> So I guess my next task will be to find out where that SIGALRM comes
> from and/or whether I can just restart the zmq_msg_recv if it happens.

Ok, I think I know where that SIGALRM comes from: It's the
AuthenticationTimeout. What I'm seeing in strace (if I attach it early
enough) is that during authentication the postgres worker process calls
setitimer with a 60 second timeout twice. This matches the comment in
backend/postmaster/postmaster.c:

         * Note: AuthenticationTimeout is applied here while waiting for the
         * startup packet, and then again in InitPostgres for the duration of any
         * authentication operations.  So a hostile client could tie up the
         * process for nearly twice AuthenticationTimeout before we kick him off.

As explained in backend/utils/misc/timeout.c, the interval timer is
never disarmed: if a timeout is cancelled, the SIGALRM still fires and
postgres just sees that it has nothing to do and resumes whatever it was
doing.

This is also what I'm seeing: 60 seconds after start, the process
receives a SIGALRM.

If the process is idle or in a "normal" SQL statement at the time, that's
not a problem. But if it is in one of my stored procedures which is
currently calling a ØMQ function which is waiting for some I/O
(zmq_msg_recv(), most likely), that gets interrupted and returns an
error which my code doesn't know how to handle (yet). So the error gets
back to the user.

A strange interaction between postgres and ØMQ indeed. But now that I
know what's causing it I can handle that. Thanks for your patience.

    hp


--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
Jim Nasby
Date:
On 12/2/15 9:26 AM, Peter J. Holzer wrote:
> As explained in backend/utils/misc/timeout.c, the interval timer is
> never disarmed: if a timeout is cancelled, the SIGALRM still fires and
> postgres just sees that it has nothing to do and resumes whatever it was
> doing.

Hrm, if those timers are really just for auth purposes then perhaps they
should be cancelled. But aside from that, there's certainly other things
that can signal a backend (including fairly normal things, like DDL).
Offhand I don't think functions run in a CRITICAL block (and I don't
think it'd be a good idea for them to). So really, functions have to be
able to handle being interrupted.

> This is also what I'm seeing: 60 seconds after start, the process
> receives a SIGALRM.
>
> If the process is idle or in a "normal" SQL statement at the time, that's
> not a problem. But if it is in one of my stored procedures which is
> currently calling a ØMQ function which is waiting for some I/O
> (zmq_msg_recv(), most likely), that gets interrupted and returns an
> error which my code doesn't know how to handle (yet). So the error gets
> back to the user.

Yeah, it'd be nice to detect that this had happened. Or maybe it's worth
it to ignore SIGALRM while a UDF is running. (We certainly wouldn't
want to do a HOLD_INTERRUPTS() though...)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-02 19:07:55 -0600, Jim Nasby wrote:
> On 12/2/15 9:26 AM, Peter J. Holzer wrote:
> >As explained in backend/utils/misc/timeout.c, the interval timer is
> >never disarmed: if a timeout is cancelled, the SIGALRM still fires and
> >postgres just sees that it has nothing to do and resumes whatever it was
> >doing.
>
> Hrm, if those timers are really just for auth purposes then perhaps they
> should be cancelled. But aside from that, there's certainly other things
> that can signal a backend (including fairly normal things, like DDL).

Yep. I noticed that, too. In one of my test runs I got two signals
instead of the one I expected. Checking the logs I found that it seemed to
be caused by another user dropping a table.

> Offhand I don't think functions run in a CRITICAL block (and I don't think
> it'd be a good idea for them to). So really, functions have to be able
> to handle being interrupted.

Right. I think that should be mentioned somewhere in the manual.
Something like this:

    Note: PostgreSQL uses signals for various purposes. These signals
    may be delivered while a user-defined function is executed.
    Therefore user-defined functions must be able to handle being
    interrupted, in particular they must expect system calls to fail
    with errno=EINTR and handle that case appropriately.

I'm not sure whether that's an issue with all procedural languages. If it
is, it should probably go into "Chapter 39. Procedural Languages". If it
is specific to plperl(u), I would put it in "42.8.2. Limitations and
Missing Features".

> Yeah, it'd be nice to detect that this had happened. Or maybe it's worth it
> to ignore SIGALRM while a UDF is running.

Can those signals be safely ignored? Just blocking them (so that they
are delivered after the UDF finishes) might be safer. But even that may
be a problem: If the UDF then executes some SQL, could that rely on
signals being delivered? I have no idea.

    hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> Can those signals be safely ignored? Just blocking them (so that they
> are delivered after the UDF finishes) might be safer. But even that may
> be a problem: If the UDF then executes some SQL, could that rely on
> signals being delivered? I have no idea.

The minute you start fooling with a backend's signal behavior, we're
going to politely refuse to support whatever breakage you run into.
We aren't sending those signals just for amusement's sake.

            regards, tom lane


Re: plperlu stored procedure seems to freeze for a minute

From
"Peter J. Holzer"
Date:
On 2015-12-03 10:02:18 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > Can those signals be safely ignored? Just blocking them (so that they
> > are delivered after the UDF finishes) might be safer. But even that may
> > be a problem: If the UDF then executes some SQL, could that rely on
> > signals being delivered? I have no idea.
>
> The minute you start fooling with a backend's signal behavior, we're
> going to politely refuse to support whatever breakage you run into.

As I understood Jim he was talking about possible changes to postgresql
to shield UDFs from those signals, not something the author of a UDF
should do.


> We aren't sending those signals just for amusement's sake.

Right. That's why I was sceptical whether those signals could be
ignored. I wouldn't have thought so, but Jim clearly knows a lot more
about the inner workings of postgresql than I do (which is easy - I know
almost nothing) and maybe he knows of a way (something like "we can
ignore signals while executing the UDF and just assume that we missed at
least one signal and call the magic synchronize state function
afterwards")

    hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/


Re: plperlu stored procedure seems to freeze for a minute

From
Jim Nasby
Date:
On 12/4/15 4:15 AM, Peter J. Holzer wrote:
> On 2015-12-03 10:02:18 -0500, Tom Lane wrote:
>> >"Peter J. Holzer"<hjp-pgsql@hjp.at>  writes:
>>> > >Can those signals be safely ignored? Just blocking them (so that they
>>> > >are delivered after the UDF finishes) might be safer. But even that may
>>> > >be a problem: If the UDF then executes some SQL, could that rely on
>>> > >signals being delivered? I have no idea.
>> >
>> >The minute you start fooling with a backend's signal behavior, we're
>> >going to politely refuse to support whatever breakage you run into.
> As I understood Jim he was talking about possible changes to postgresql
> to shield UDFs from those signals, not something the author of a UDF
> should do.

No, I was suggesting that:

1) It might be possible for Postgres error handling to detect that a signal
had been received since entering a UDF if the UDF reported an error, and
make that information available as part of the standard error report.

2) If there are alarms being set specifically for authentication, they
should probably be cancelled after authentication is successful.

I certainly do NOT think it would be wise to disable all signal handling
during UDF execution. Just for starters, that means that if a UDF went
into an infinite loop your only way to recover would be to PANIC the
entire database. It would probably create a bunch of other problems as well.

In other words, UDFs *must* be capable of handling an interrupt.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com