Thread: PQConsumeinput stuck on recv

PQConsumeinput stuck on recv

From
Andre Oliveira Freitas
Date:
Hi, I've been experiencing an issue. We use an open-source VoIP software whose backend is PostgreSQL. Initially we had a twin-server setup (one server running the VoIP software, another one running the pg instance). Due to company growth we were running into performance issues, so we rolled out a new architecture using multiple VoIP servers connected to the single pg instance. Since then, the VoIP software started misbehaving - it randomly stops responding, and only a restart gets it back up running. It is random throughout the servers, time-of-day, day-of-week... we haven't found a correlation between it and any other metric like CPU usage, network traffic and such.

Since it's been happening for a few weeks now, every time it freezes we take a gcore dump and check it in gdb... and after a lot of hair pulling and learning about the innards of the VoIP software we see that most often the software is stuck in this call trace:

#0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5

The software shares a database connection between threads, and controls its access through a mutex, so once one thread that acquires the mutex gets stuck in the location above, all other threads starts pilling up behind the mutex, and that's apparently the reason the software stops responding for most of its functions (while other functions that do not depend on the database works normally).

And it stays stuck on it forever... at one situation we took two gcore dumps spaced 10 minutes apart, and they look almost identical, with the same thread stuck on recv and all the others waiting for the lock to be released.

I wonder if anyone has any tip on what to look for next... Besides the implementation of the VoIP software itself, we are looking into network issues (we are seeing a bunch of TCP retransmissions between some servers and the db), but otherwise no other app running on those servers has presented any weird behavior like this VoIP software. We don't understand what would cause recv to get stuck like this.

BTW we're running debian 9, pg 9.6.3, and the VoIP sofware (along most of the other apps) uses libpq of a slightly older version (9.4.15).

Thanks in advance.

Andre

Re: PQConsumeinput stuck on recv

From
Andres Freund
Date:
Hi,

On 2018-02-23 13:33:18 -0300, Andre Oliveira Freitas wrote:
> Since it's been happening for a few weeks now, every time it freezes we
> take a gcore dump and check it in gdb... and after a lot of hair pulling
> and learning about the innards of the VoIP software we see that most often
> the software is stuck in this call trace:
> 
> #0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523)
> at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
> #1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> #2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> #3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5

So it's just receiving data from the network. Have you verified whether
the connection is actually stable? Any chance it's just waiting for the
network to time out.  Might be worth configuring tcp timeouts, to make
sure its unrelated to that.

What is the server showing as activity while the client is waiting?
Could you show the corresponding pg_stat_activity row?

> The software shares a database connection between threads, and controls its
> access through a mutex, so once one thread that acquires the mutex gets
> stuck in the location above, all other threads starts pilling up behind the
> mutex, and that's apparently the reason the software stops responding for
> most of its functions (while other functions that do not depend on the
> database works normally).

Hm, have you compiled libpq with threading support? Or use a
distribution that compiles it with that? While I don't see an obvious
connection to that stacktrace it seems worthwhile to verify.

A mutex protecting individual connection, while also necessary if
connections are shared, doesn't achieve the same.

> I wonder if anyone has any tip on what to look for next...

Any chance you're occasionally forking and then interacting with the
connection in the forked process?

Greetings,

Andres Freund


Re: PQConsumeinput stuck on recv

From
Andre Oliveira Freitas
Date:
2018-02-23 14:20 GMT-03:00 Andres Freund <andres@anarazel.de>:
>
> Hi,
>
> On 2018-02-23 13:33:18 -0300, Andre Oliveira Freitas wrote:
> > Since it's been happening for a few weeks now, every time it freezes we
> > take a gcore dump and check it in gdb... and after a lot of hair pulling
> > and learning about the innards of the VoIP software we see that most often
> > the software is stuck in this call trace:
> >
> > #0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523)
> > at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
> > #1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> > #2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
> > #3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5
>
> So it's just receiving data from the network. Have you verified whether
> the connection is actually stable? Any chance it's just waiting for the
> network to time out.  Might be worth configuring tcp timeouts, to make
> sure its unrelated to that.
>
> What is the server showing as activity while the client is waiting?
> Could you show the corresponding pg_stat_activity row?

Sure,


datid;datname;pid;usesysid;usename;application_name;client_addr;client_hostname;client_port;backend_start;xact_start;query_start;state_change;wait_event_type;wait_event;state;backend_xid;backend_xmin;query
16,4;freeswitch;13,322;16,386;freeswitch;freeswitch;192.168.0.101;;43,276;2018-02-20
14:11:56;;2018-02-20 14:24:15;2018-02-20 14:24:15;;;idle;;;COMMIT
16,4;freeswitch;13,323;16,386;freeswitch;freeswitch;192.168.0.101;;43,328;2018-02-20
14:11:57;;2018-02-20 14:24:57;2018-02-20 14:24:57;;;idle;;;COMMIT
16,4;freeswitch;13,416;16,386;freeswitch;freeswitch;192.168.0.101;;51,681;2018-02-20
14:16:02;;2018-02-20 14:16:31;2018-02-20 14:16:31;;;idle;;;insert into
sip_authentication (nonce,expires,profile_name,hostname, last_nc)
values('363d02f6-cb9a-4791-9e05-d18473a18812', 1519147649, 'internal',
'srv-1', 0)
16,4;freeswitch;13,472;16,386;freeswitch;freeswitch;192.168.0.101;;33,481;2018-02-20
14:22:13;;2018-02-20 14:25:09;2018-02-20 14:25:09;;;idle;;;select
command from aliases where alias='show status'

The problematic connection is the third one; by checking the timings
on query_start it appears to be the last query executed by the stuck
connection before it became stuck. There are no outstanding locks in
any of the tables the VoIP software normally uses.

>
>
> > The software shares a database connection between threads, and controls its
> > access through a mutex, so once one thread that acquires the mutex gets
> > stuck in the location above, all other threads starts pilling up behind the
> > mutex, and that's apparently the reason the software stops responding for
> > most of its functions (while other functions that do not depend on the
> > database works normally).
>
> Hm, have you compiled libpq with threading support? Or use a
> distribution that compiles it with that? While I don't see an obvious
> connection to that stacktrace it seems worthwhile to verify.
>
> A mutex protecting individual connection, while also necessary if
> connections are shared, doesn't achieve the same.

I'm using the libpq that comes with debian, however I can install the
library from the official repository to be sure, I assume the one from
the official repo has it enabled.

>
> > I wonder if anyone has any tip on what to look for next...
>
> Any chance you're occasionally forking and then interacting with the
> connection in the forked process?

As far as I know, no. The software forks at the beginning but from
then on, its just threads.

>
> Greetings,
>
> Andres Freund

If it is of any help, here is the link to the implementation that
consumes PQconsumeInput:

https://freeswitch.org/stash/projects/FS/repos/freeswitch/browse/src/switch_pgsql.c#128

It is a function to check if the database connection is up before
running a query. I'm not a mantainer nor an expert in pg, but we
reviewed the implementation and seems OK.

One thing that is bothering me by looking at the gdb backtraces is
that recv always seems to be receiving a non-zero value in flags, even
though libpq seems to pass zero. I don't know if its of any relevance.

-- 

Thanks in advance,

Andre


Re: PQConsumeinput stuck on recv

From
Andres Freund
Date:
Hi,

On 2018-02-23 15:59:18 -0300, Andre Oliveira Freitas wrote:
> I'm using the libpq that comes with debian, however I can install the
> library from the official repository to be sure, I assume the one from
> the official repo has it enabled.

That should be fine. Could you get the backtrace again, after installing
debugging symbols? It'd certainly be helpful to see the exact path to
the blocking syscall.

Greetings,

Andres Freund


Re: PQConsumeinput stuck on recv

From
Andre Oliveira Freitas
Date:
Hi,

I was able to capture the backtrace again, now with libpq debugging symbols.

Thread 15 (Thread 0x7f8cec068700 (LWP 68)):
#0  0x00007f8d252a1d9b in __libc_recv (fd=150, buf=0x7f8cf0034410,
n=16384, flags=623517083, flags@entry=0) at
../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1  0x00007f8d26689783 in recv (__flags=0, __n=<optimized out>,
__buf=<optimized out>, __fd=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/socket2.h:44
#2  pqsecure_raw_read (conn=conn@entry=0x7f8cf001e390, ptr=<optimized
out>, len=<optimized out>) at
/build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:230
#3  0x00007f8d26689863 in pqsecure_read
(conn=conn@entry=0x7f8cf001e390, ptr=<optimized out>, len=<optimized
out>) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:217
#4  0x00007f8d266810ea in pqReadData (conn=conn@entry=0x7f8cf001e390)
at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-misc.c:683
#5  0x00007f8d2667e6f2 in PQconsumeInput (conn=0x7f8cf001e390) at
/build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-exec.c:1704

In this case, I also checked the pg_stat_activity and this particular
connection on pg server side was idle for 15 minutes. I killed it
using pg_terminate_backend, and then somehow the application resumed
execution.

As you can see, recv has received a non-zero value in flags, which has
the MSG_WAITALL bit is set, however in this case the application
expects the read to be non-blocking.

Thanks in advance,

André Freitas

2018-02-23 16:10 GMT-03:00 Andres Freund <andres@anarazel.de>:
> Hi,
>
> On 2018-02-23 15:59:18 -0300, Andre Oliveira Freitas wrote:
>> I'm using the libpq that comes with debian, however I can install the
>> library from the official repository to be sure, I assume the one from
>> the official repo has it enabled.
>
> That should be fine. Could you get the backtrace again, after installing
> debugging symbols? It'd certainly be helpful to see the exact path to
> the blocking syscall.
>
> Greetings,
>
> Andres Freund



--

André Luis O. Freitas
System Architect

Rua do Rócio, 220 - Cj. 72
São Paulo - SP - 04552-000
55 11 4063 4222

afreitas@callix.com.br
www.callix.com.br


Re: PQConsumeinput stuck on recv

From
Andres Freund
Date:
Hi,


On 2018-03-01 19:20:04 -0300, Andre Oliveira Freitas wrote:
> I was able to capture the backtrace again, now with libpq debugging symbols.
> 
> Thread 15 (Thread 0x7f8cec068700 (LWP 68)):
> #0  0x00007f8d252a1d9b in __libc_recv (fd=150, buf=0x7f8cf0034410,
> n=16384, flags=623517083, flags@entry=0) at
> ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
> #1  0x00007f8d26689783 in recv (__flags=0, __n=<optimized out>,
> __buf=<optimized out>, __fd=<optimized out>) at
> /usr/include/x86_64-linux-gnu/bits/socket2.h:44
> #2  pqsecure_raw_read (conn=conn@entry=0x7f8cf001e390, ptr=<optimized
> out>, len=<optimized out>) at
> /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:230
> #3  0x00007f8d26689863 in pqsecure_read
> (conn=conn@entry=0x7f8cf001e390, ptr=<optimized out>, len=<optimized
> out>) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:217
> #4  0x00007f8d266810ea in pqReadData (conn=conn@entry=0x7f8cf001e390)
> at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-misc.c:683
> #5  0x00007f8d2667e6f2 in PQconsumeInput (conn=0x7f8cf001e390) at
> /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-exec.c:1704
> 
> In this case, I also checked the pg_stat_activity and this particular
> connection on pg server side was idle for 15 minutes. I killed it
> using pg_terminate_backend, and then somehow the application resumed
> execution.

There's something decidedly weird going on. Libpq always keeps the
connection in nonblocking mode internally. Blocking is implemented by
using select on the socket. So this should never block.

Is there any chance parts of your application changes the sockets
block-y-ness?  Is see your code is passing the socket around, so perhaps
that's happening somewhere outside of the file?


> As you can see, recv has received a non-zero value in flags

Well, no, not really. recv() has a 0 flags, it's just libc's internal
implementation that appears to be showing up weird afaict.


Greetings,

Andres Freund


Re: PQConsumeinput stuck on recv

From
Andre Oliveira Freitas
Date:
Ok, on one hand glad to hear that is an artifact, on the other hand
that means the issue is still out there...

I'm not a maintainer of Freeswitch, I am an user of it. However I am
trying to supply the maintainers with any relevant information
regarding this issue.

As you mentioned, I see the socket being modified and passed around in
the source code. Do you think the implementation is OK? I do not know
if that's standard practice when consuming libpq.

2018-03-02 17:36 GMT-03:00 Andres Freund <andres@anarazel.de>:
> Hi,
>
>
> On 2018-03-01 19:20:04 -0300, Andre Oliveira Freitas wrote:
>> I was able to capture the backtrace again, now with libpq debugging symbols.
>>
>> Thread 15 (Thread 0x7f8cec068700 (LWP 68)):
>> #0  0x00007f8d252a1d9b in __libc_recv (fd=150, buf=0x7f8cf0034410,
>> n=16384, flags=623517083, flags@entry=0) at
>> ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
>> #1  0x00007f8d26689783 in recv (__flags=0, __n=<optimized out>,
>> __buf=<optimized out>, __fd=<optimized out>) at
>> /usr/include/x86_64-linux-gnu/bits/socket2.h:44
>> #2  pqsecure_raw_read (conn=conn@entry=0x7f8cf001e390, ptr=<optimized
>> out>, len=<optimized out>) at
>> /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:230
>> #3  0x00007f8d26689863 in pqsecure_read
>> (conn=conn@entry=0x7f8cf001e390, ptr=<optimized out>, len=<optimized
>> out>) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:217
>> #4  0x00007f8d266810ea in pqReadData (conn=conn@entry=0x7f8cf001e390)
>> at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-misc.c:683
>> #5  0x00007f8d2667e6f2 in PQconsumeInput (conn=0x7f8cf001e390) at
>> /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-exec.c:1704
>>
>> In this case, I also checked the pg_stat_activity and this particular
>> connection on pg server side was idle for 15 minutes. I killed it
>> using pg_terminate_backend, and then somehow the application resumed
>> execution.
>
> There's something decidedly weird going on. Libpq always keeps the
> connection in nonblocking mode internally. Blocking is implemented by
> using select on the socket. So this should never block.
>
> Is there any chance parts of your application changes the sockets
> block-y-ness?  Is see your code is passing the socket around, so perhaps
> that's happening somewhere outside of the file?
>
>
>> As you can see, recv has received a non-zero value in flags
>
> Well, no, not really. recv() has a 0 flags, it's just libc's internal
> implementation that appears to be showing up weird afaict.
>
>
> Greetings,
>
> Andres Freund



--

André Luis O. Freitas
System Architect

Rua do Rócio, 220 - Cj. 72
São Paulo - SP - 04552-000
55 11 4063 4222

afreitas@callix.com.br
www.callix.com.br


Re: PQConsumeinput stuck on recv

From
Andres Freund
Date:
Hi,

On 2018-03-05 17:57:51 -0300, Andre Oliveira Freitas wrote:
> Ok, on one hand glad to hear that is an artifact, on the other hand
> that means the issue is still out there...
> 
> I'm not a maintainer of Freeswitch, I am an user of it. However I am
> trying to supply the maintainers with any relevant information
> regarding this issue.
> 
> As you mentioned, I see the socket being modified and passed around in
> the source code. Do you think the implementation is OK? I do not know
> if that's standard practice when consuming libpq.

It's ok to pass the socket around, it'd not be ok to change whether the
socket blocks or not.  I don't have the time to review the freeswitch
code to see whether it does so.

- Andres