PQConsumeinput stuck on recv - Mailing list pgsql-general

From Andre Oliveira Freitas
Subject PQConsumeinput stuck on recv
Date
Msg-id CAN6ijTDiFnzeWyDkbcL9dYj6-G9nQAsW1gL6OK+bQMrUBgn8iQ@mail.gmail.com
Responses Re: PQConsumeinput stuck on recv  (Andres Freund <andres@anarazel.de>)
List pgsql-general
Hi, I've been experiencing an issue. We use open-source VoIP software whose backend is PostgreSQL. Initially we had a two-server setup (one server running the VoIP software, another running the pg instance). Due to company growth we were running into performance issues, so we rolled out a new architecture with multiple VoIP servers connected to the single pg instance. Since then, the VoIP software has been misbehaving: it randomly stops responding, and only a restart gets it back up and running. The freezes appear random across servers, time of day, and day of week, and we haven't found a correlation with any other metric such as CPU usage or network traffic.

Since it's been happening for a few weeks now, every time it freezes we take a gcore dump and check it in gdb. After a lot of hair pulling and learning about the innards of the VoIP software, we see that most often the software is stuck in this call trace:

#0 in __libc_recv (fd=409, buf=0x7f2c4802e6c0, n=16384, flags=1898970523) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
#1 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#2 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
#3 in PQconsumeInput () from /usr/lib/x86_64-linux-gnu/libpq.so.5

The software shares a database connection between threads and controls access to it through a mutex, so once a thread that holds the mutex gets stuck at the location above, all other threads start piling up behind the mutex. That is apparently why the software stops responding for most of its functions (while functions that do not depend on the database work normally).

And it stays stuck there forever. On one occasion we took two gcore dumps 10 minutes apart, and they look almost identical, with the same thread stuck on recv and all the others waiting for the lock to be released.

I wonder if anyone has any tip on what to look for next... Besides the implementation of the VoIP software itself, we are looking into network issues (we are seeing a bunch of TCP retransmissions between some servers and the db), but otherwise no other app running on those servers has presented any weird behavior like this VoIP software. We don't understand what would cause recv to get stuck like this.
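
For reference, here is a minimal sketch of what a bounded wait in front of a socket read could look like, in case it helps frame the question. It is neither libpq's code nor the VoIP software's. The helper names `wait_readable` and `recv_with_timeout` are hypothetical, and the sketch only assumes a plain POSIX socket descriptor (with libpq, that descriptor would be the one returned by `PQsocket(conn)`). The idea is that `poll()` gives up after a timeout, whereas a blocking `recv()` on a silent connection has nothing to wake it.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: wait until fd has data to read or timeout_ms elapses.
 * Returns 1 if readable (or the peer hung up), 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if interrupted by a signal. Note: this restarts the full
     * timeout, which is good enough for a sketch. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc == 0 ? 0 : 1;
}

/* Hypothetical helper: like recv(), but fails with errno = ETIMEDOUT
 * instead of blocking indefinitely when no data arrives in time. */
static ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    int ready = wait_readable(fd, timeout_ms);

    if (ready < 0)
        return -1;              /* poll() failed, errno already set */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* nothing arrived within timeout_ms */
        return -1;
    }
    return recv(fd, buf, len, 0);
}
```

With something like this, a thread holding the connection mutex would get an error back after the timeout rather than waiting on the socket while every other thread queues up behind it.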

BTW we're running Debian 9 and pg 9.6.3, and the VoIP software (along with most of the other apps) uses a slightly older libpq (9.4.15).

Thanks in advance.

Andre
