Thread: Postgres gets stuck

Postgres gets stuck

From
"Craig A. James"
Date:
I'm having a rare but deadly problem.  On our web servers, a process occasionally gets stuck, and can't be unstuck.
Once it's stuck, all Postgres activities cease.  "kill -9" is required to kill it -- signals 2 and 15 don't work, and
"/etc/init.d/postgresql stop" fails.

Here's what the process table looks like:

$ ps -ef | grep postgres
postgres 30713     1  0 Apr24 ?        00:02:43 /usr/local/pgsql/bin/postmaster -p 5432 -D /disk3/postgres/data
postgres 25423 30713  0 May08 ?        00:03:34 postgres: writer process
postgres 25424 30713  0 May08 ?        00:00:02 postgres: stats buffer process
postgres 25425 25424  0 May08 ?        00:00:02 postgres: stats collector process
postgres 11918 30713 21 07:37 ?        02:00:27 postgres: production webuser 127.0.0.1(21772) SELECT
postgres 31624 30713  0 16:11 ?        00:00:00 postgres: production webuser [local] idle
postgres 31771 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12422) idle
postgres 31772 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12421) idle
postgres 31773 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12424) idle
postgres 31774 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12425) idle
postgres 31775 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12426) idle
postgres 31776 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12427) idle
postgres 31777 30713  0 16:12 ?        00:00:00 postgres: production webuser 127.0.0.1(12428) idle

The SELECT process is the one that's stuck.  top(1) and other indicators show that nothing is going on at all (no CPU
usage, normal memory usage); the process seems to be blocked waiting for something.  (The "idle" processes are attached
to a FastCGI program.)

This has happened on *two different machines*, both doing completely different tasks.  The first one is essentially a
read-only warehouse that serves lots of queries, and the second one is the server we use to load the warehouse.  In both
cases, Postgres has been running for a long time, and is issuing SELECT statements that it's issued millions of times
before with no problems.  No other processes are accessing Postgres, just the web services.

This is a deadly bug, because our web site goes dead when this happens, and it requires an administrator to log in and
kill the stuck postgres process then restart Postgres.  We've installed failover system so that the web site is diverted
to a backup server, but since this has happened twice in one week, we're worried.

Any ideas?

Details:

    Postgres 8.0.3
    Linux 2.6.12-1.1381_FC3smp i686 i386

    Dell 2-CPU Xeon system (hyperthreading is enabled)
    4 GB memory
    2 120 GB disks (SATA on machine 1, IDE on machine 2)

Thanks,
Craig

Re: Postgres gets stuck

From
Chris
Date:
> This is a deadly bug, because our web site goes dead when this happens,
> and it requires an administrator to log in and kill the stuck postgres
> process then restart Postgres.  We've installed failover system so that
> the web site is diverted to a backup server, but since this has happened
> twice in one week, we're worried.
>
> Any ideas?

Sounds like a deadlock issue.

Do you have query logging turned on?

Also, edit your postgresql.conf file and add (or uncomment):

stats_command_string = true

and restart postgresql.

then you'll be able to:

select * from pg_stat_activity;

to see what queries postgres is running and that might give you some clues.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Postgres gets stuck

From
"Qingqing Zhou"
Date:
""Craig A. James"" <cjames@modgraph-usa.com> wrote
> I'm having a rare but deadly problem.  On our web servers, a process
> occasionally gets stuck, and can't be unstuck.  Once it's stuck, all
> Postgres activities cease.  "kill -9" is required to kill it --
> signals 2 and 15 don't work, and "/etc/init.d/postgresql stop" fails.
>
> Details:
>
>    Postgres 8.0.3
>

[Scanning 8.0.4 ~ 8.0.7 ...] Didn't find a related bug fix in the upgrade
releases. Can you attach to the problematic process and "bt" it (so we
can see where it is stuck)?

Regards,
Qingqing



Re: Postgres gets stuck

From
"Craig A. James"
Date:
Chris wrote:
>
>> This is a deadly bug, because our web site goes dead when this
>> happens, ...
>
> Sounds like a deadlock issue.
> ...
> stats_command_string = true
> and restart postgresql.
> then you'll be able to:
> select * from pg_stat_activity;
> to see what queries postgres is running and that might give you some clues.

Thanks, good advice.  You're absolutely right, it's stuck on a mutex.  After doing what you suggest, I discovered that
the query in progress is a user-written function (mine).  When I log in as root, and use "gdb -p <pid>" to attach to the
process, here's what I find.  Notice the second function in the stack, a mutex lock:

(gdb) bt
#0  0x0087f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0096cbfe in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2  0x008ff67b in _L_mutex_lock_3220 () from /lib/tls/libc.so.6
#3  0x4f5fc1b4 in ?? ()
#4  0x00dc5e64 in std::string::_Rep::_S_empty_rep_storage () from /usr/local/pgsql/lib/libchmoogle.so
#5  0x009ffcf0 in ?? () from /usr/lib/libz.so.1
#6  0xbfe71c04 in ?? ()
#7  0xbfe71e50 in ?? ()
#8  0xbfe71b78 in ?? ()
#9  0x009f7019 in zcfree () from /usr/lib/libz.so.1
#10 0x009f7019 in zcfree () from /usr/lib/libz.so.1
#11 0x009f8b7c in inflateEnd () from /usr/lib/libz.so.1
#12 0x00c670a2 in ~basic_unzip_streambuf (this=0xbfe71be0) at zipstreamimpl.h:332
#13 0x00c60b61 in OpenBabel::OBConversion::Read (this=0x1, pOb=0xbfd923b8, pin=0xffffffea) at istream:115
#14 0x00c60fd8 in OpenBabel::OBConversion::ReadString (this=0x8672b50, pOb=0xbfd923b8) at obconversion.cpp:780
#15 0x00c19d69 in chmoogle_ichem_mol_alloc () at stl_construct.h:120
#16 0x00c1a203 in chmoogle_ichem_normalize_parent () at stl_construct.h:120
#17 0x00c1b172 in chmoogle_normalize_parent_sdf () at vector.tcc:243
#18 0x0810ae4d in ExecMakeFunctionResult ()
#19 0x0810de2e in ExecProject ()
#20 0x08115972 in ExecResult ()
#21 0x08109e01 in ExecProcNode ()
#22 0x00000020 in ?? ()
#23 0xbed4b340 in ?? ()
#24 0xbf92d9a0 in ?? ()
#25 0xbed4b0c0 in ?? ()
#26 0x00000000 in ?? ()

It looks to me like my code is trying to read the input parameter (a fairly long string, maybe 2K) from a buffer that
was gzip'ed by Postgres for the trip between the client and server.  My suspicion is that it's an incompatibility
between malloc() libraries.  libz (gzip compression) is calling something called zcfree, which then appears to be
intercepted by something that's (probably statically) linked into my library.  And somewhere along the way, a mutex gets
set, and then ... it's stuck forever.

ps(1) shows that this thread had been running for about 7 hours, and the job status showed that this function had been
successfully called about 1 million times, before this mutex lock occurred.

Any ideas?

Thanks,
Craig

Re: Postgres gets stuck

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> My suspicion is that it's an incompatibility between malloc()
> libraries.

On Linux there's only supposed to be one malloc, ie, glibc's version.
On other platforms I'd be worried about threaded vs non-threaded libc
(because the backend is not threaded), but not Linux.

There may be a more basic threading problem here, though, rooted in the
precise fact that the backend isn't threaded.  If you're trying to use
any libraries that assume they can have multiple threads, I wouldn't be
at all surprised to see things go boom.  C++ exception handling could be
problematic too.
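
For illustration only: one way a process with a single thread can still end up waiting on a mutex is by trying to lock a mutex it already holds (for example, if library code is re-entered while the lock is held). Nothing in this thread establishes that as the cause here. The sketch below uses the POSIX error-checking mutex type so that the second lock attempt is reported as EDEADLK instead of hanging; relock_result is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Lock an error-checking mutex twice from the same thread and return
 * the result of the second attempt.  Returns -1 if setup fails. */
int relock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);   /* first acquisition succeeds */
    assert(rc == 0);

    rc = pthread_mutex_lock(&m);   /* same thread again: reported, not blocked */

    pthread_mutex_unlock(&m);      /* release the single lock we hold */
    pthread_mutex_destroy(&m);
    return rc;
}
```

With PTHREAD_MUTEX_ERRORCHECK the second call returns EDEADLK. With a PTHREAD_MUTEX_NORMAL mutex the same second call would block forever, which from the outside looks like a process with no CPU usage sitting in a lock wait.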

Or it could be a garden variety glibc bug.  How up-to-date is your
platform?

            regards, tom lane

Re: Postgres gets stuck

From
"Craig A. James"
Date:
Tom Lane wrote:
> >My suspicion is that it's an incompatibility between malloc()
> >libraries.
>
> On Linux there's only supposed to be one malloc, ie, glibc's version.
> On other platforms I'd be worried about threaded vs non-threaded libc
> (because the backend is not threaded), but not Linux.

I guess I misinterpreted the Postgres manual, which says (in 31.9, "C Language Functions"),

    "When allocating memory, use the PostgreSQL functions palloc and pfree
    instead of the corresponding C library functions malloc and free."

I imagined that perhaps palloc/pfree used mutexes for something.  But if I understand you, palloc() and pfree() are
just wrappers around malloc() and free(), and don't (for example) make their own separate calls to brk(2), sbrk(2), or
their kin.  If that's the case, then you answered my question - it's all ordinary malloc/free calls in the end, and
that's not the source of the problem.

> There may be a more basic threading problem here, though, rooted in the
> precise fact that the backend isn't threaded.  If you're trying to use
> any libraries that assume they can have multiple threads, I wouldn't be
> at all surprised to see things go boom.

No threading anywhere.  None of the libraries use threads or mutexes.  It's just plain old vanilla C/C++ scientific
algorithms.

>  C++ exception handling could be problematic too.

No C++ exceptions are thrown anywhere in the code, 'tho I suppose one of the I/O libraries could throw an exception,
e.g. when reading from a file.  But there's no evidence of this after millions of identical operations succeeded.  In
addition, the stack trace shows it to be stuck in a memory operation, not an I/O operation.

> Or it could be a garden variety glibc bug.  How up-to-date is your
> platform?

I guess this is the next place to look.  From the few answers I've gotten, it sounds like this isn't a known Postgres
issue, and my stack trace doesn't seem to be familiar to anyone on this forum.  Oh well... thanks for your help.

Craig

Re: Postgres gets stuck

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> I guess I misinterpreted the Postgres manual, which says (in 31.9, "C Language Functions"),

>     "When allocating memory, use the PostgreSQL functions palloc and pfree
>     instead of the corresponding C library functions malloc and free."

> I imagined that perhaps palloc/pfree used mutexes for something.  But if I understand you, palloc() and pfree() are
> just wrappers around malloc() and free(), and don't (for example) make their own separate calls to brk(2), sbrk(2), or
> their kin.

Correct.  palloc/pfree are all about managing the lifetime of memory
allocations, so that (for example) a function can return a palloc'd data
structure without worrying about whether that creates a long-term memory
leak.  But ultimately they just use malloc/free, and there's certainly
not any threading or mutex considerations in there.
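
As a rough illustration of the lifetime idea described above (and not PostgreSQL's actual implementation), here is a toy "context" allocator in C. Every allocation is an ordinary malloc() block that is also recorded in a list, so a caller can hand back allocated data and everything is released together later. All names (ToyContext, toy_alloc, toy_strdup, toy_reset) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Header placed in front of each allocation.  The union with
 * max_align_t keeps the payload that follows suitably aligned. */
typedef union ToyChunk {
    union ToyChunk *next;
    max_align_t align;
} ToyChunk;

/* A context is just a list of the malloc() blocks handed out from it. */
typedef struct ToyContext {
    ToyChunk *chunks;
    size_t live;        /* number of blocks not yet released */
} ToyContext;

/* Allocate size bytes with plain malloc() and remember the block. */
void *toy_alloc(ToyContext *ctx, size_t size)
{
    ToyChunk *c;

    assert(ctx != NULL);
    if (size > SIZE_MAX - sizeof(ToyChunk))
        return NULL;
    c = malloc(sizeof(ToyChunk) + size);
    if (c == NULL)
        return NULL;
    c->next = ctx->chunks;
    ctx->chunks = c;
    ctx->live++;
    return c + 1;       /* payload starts right after the header */
}

/* Copy a string into the context; the caller never frees it directly. */
char *toy_strdup(ToyContext *ctx, const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = toy_alloc(ctx, n);

    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Release every block in the context with plain free(); returns the count. */
size_t toy_reset(ToyContext *ctx)
{
    size_t freed = 0;

    while (ctx->chunks != NULL) {
        ToyChunk *next = ctx->chunks->next;
        free(ctx->chunks);
        ctx->chunks = next;
        freed++;
    }
    ctx->live = 0;
    return freed;
}
```

A function can return the result of toy_strdup() without the caller tracking it; a single toy_reset() at the end of the unit of work frees everything. There is no locking anywhere in this sketch, only malloc() and free() underneath.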

> No threading anywhere.  None of the libraries use threads or mutexes.  It's just plain old vanilla C/C++ scientific
> algorithms.

Darn, my best theory down the drain.

>> Or it could be a garden variety glibc bug.  How up-to-date is your
>> platform?

> I guess this is the next place to look.

Let us know how it goes...

            regards, tom lane