Thread: syslog enabled causes random hangs?
I'm encountering strange hangs in postgresql backends at random moments.
They seem to be associated with attempts to issue log entries via syslog.
I have run backtraces on the hung backends a few times, and they routinely
trace into system libraries where it looks like a stuck syslog call. So
far, I haven't had this problem with any other apps, so I'm thinking it's
a condition being aggravated by Postgres.

The machines I'm seeing this on are RedHat 8 and Gentoo Linux machines
(x86). We're running compiled-from-source 7.3.3 on both machines (by hand
on the RH8 machine, and built by Gentoo's package manager, because it's a
build-from-source distribution). On my Gentoo test boxes, it happens
occasionally using syslog-ng 1.6.0_rc3 and glibc 2.3.2-r1. On the RedHat 8
machine, it's sysklogd-1.4.1-10 and "glibc-devel-2.3.2-4.80.6". I switched
to not using syslog on the RH8 machine because it was hanging backends
quite regularly, and that's rather bad in production. :)

Has anyone else had problems using the syslog option?

I need to rebuild PostgreSQL on my test machine so I'll have unstripped
binaries, but once that's done, I'll bring on the stack traces when I get
another hang.
"Arthur Ward" <award@dominionsciences.com> writes: > I'm encountering strange hangs in postgresql backends at random moments. > They seem to be associated with attempts to issue log entries via syslog. > I have run backtraces on the hung backends a few times, and they routinely > trace into system libraries where it looks like a stuck syslog call. So > far, I haven't had this problem with any other apps, so I'm thinking it's > a condition being aggravated by Postgres. How verbose are your Postgres logging settings? On most platforms I've dealt with, syslog drops messages on the floor if it starts to get saturated. It may be that the Linux implementation has worse behavior than that under heavy load :-(. In any case I'd suggest filing a bug against syslog. There's very little we can do about it if the syslog library call hangs up. Personally I find it more reliable to pipe the postmaster's stderr to some sort of log-rotation program than to depend on syslog. It seems that the Apache folks have found the same, since they include a suitable log-rotation filter in their distribution ... regards, tom lane
""Arthur Ward"" <award@dominionsciences.com> writes: > I'm encountering strange hangs in postgresql backends at random moments. > They seem to be associated with attempts to issue log entries via syslog. > I have run backtraces on the hung backends a few times, and they routinely > trace into system libraries where it looks like a stuck syslog call. So > far, I haven't had this problem with any other apps, so I'm thinking it's > a condition being aggravated by Postgres. > > The machines I'm seeing this on are RedHat 8 and Gentoo Linux machines > (x86). We're running compiled-from-source 7.3.3 on both machines (by hand > on the RH8 machine, and built by Gentoo's package manager, because it's a > build-from-source distribution). On my Gentoo test boxes, it happens > occasionally using syslog-ng 1.6.0_rc3 and glibc 2.3.2-r1. On the RedHat 8 > machine, it's sysklogd-1.4.1-10 and "glibc-devel-2.3.2-4.80.6". I switched > to not using syslog on the RH8 machine because it was hanging backends > quite regularly, and that's rather bad in production. :) > > Has anyone else had problems using the syslog option? > > I need to rebuild PostgreSQL on my test machine so I'll have unstripped > binaries, but once that's done, I'll bring on the stack traces when I get > another hang. Try to put a "-" before the name of your log file, this say to syslogd to not flush each message on the disk, so your line should appear somethink like: LOCAL0.* -/var/log/postgresql.log Regards Gaetano Mendola
> "Arthur Ward" <award@dominionsciences.com> writes: >> I'm encountering strange hangs in postgresql backends at random >> moments. They seem to be associated with attempts to issue log entries >> via syslog. I have run backtraces on the hung backends a few times, >> and they routinely trace into system libraries where it looks like a >> stuck syslog call. So far, I haven't had this problem with any other >> apps, so I'm thinking it's a condition being aggravated by Postgres. > > How verbose are your Postgres logging settings? Defaults. (single-user test machine -- not much activity) On what I was doing today, it seemed to happen most often with the occurrance of a checkpoint, which would normally log a message about recycling the transaction log files. > On most platforms I've dealt with, syslog drops messages on the floor if > it starts to get saturated. It may be that the Linux implementation has > worse behavior than that under heavy load :-(. In any case I'd suggest > filing a bug against syslog. There's very little we can do about it if > the syslog library call hangs up. After rebuilding to get a non-stripped copy of Postgresql, I did a little searching on my backtraces. Some of the first hits are pointing at a deadlock condition in glibc, although I'm not sure how or why this is affecting PostgreSQL running independently of my app and its signal handler. Maybe some more digging will enlighten me... > Personally I find it more reliable to pipe the postmaster's stderr to > some sort of log-rotation program than to depend on syslog. It seems > that the Apache folks have found the same, since they include a suitable > log-rotation filter in their distribution ... Considering deadlocks in the system libraries sound a bit scary to me, I'm definitely convinced to change my development machine now... As I originally wrote, I'd already changed the production machine. FWIW, since I already went to the trouble (and for the sake of people searching the archives in the future), here's what I was seeing in the backtraces after rebuilding this afternoon: This process: 19763 pts/2 S 0:00 postgres: checkpoint subprocess Was stuck here: (gdb) bt #0 0x402cf077 in semop () from /lib/libc.so.6 (gdb) And the other: 19720 pts/2 S 0:04 postgres: award Trucking [local] UPDATE (gdb) bt #0 0x4021cec6 in sigsuspend () from /lib/libc.so.6 #1 0x424b6218 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0 #3 0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0x402ca21c in vsyslog () from /lib/libc.so.6 #5 0x402c9d8f in syslog () from /lib/libc.so.6 #6 0x08150a57 in write_syslog () (gdb)
"Arthur Ward" <award@dominionsciences.com> writes: > FWIW, since I already went to the trouble (and for the sake of people > searching the archives in the future), here's what I was seeing in the > backtraces after rebuilding this afternoon: > This process: > 19763 pts/2 S 0:00 postgres: checkpoint subprocess > Was stuck here: > (gdb) bt > #0 0x402cf077 in semop () from /lib/libc.so.6 > (gdb) Hmm. It's odd that gdb isn't showing a backtrace any deeper than that --- are you sure you rebuilt with debug symbols enabled? But anyway, I'd say this process is just waiting for someone else to release a lock. Likely the lock holder is this other process: > And the other: > 19720 pts/2 S 0:04 postgres: award Trucking [local] UPDATE > (gdb) bt > #0 0x4021cec6 in sigsuspend () from /lib/libc.so.6 > #1 0x424b6218 in __pthread_wait_for_restart_signal () > from /lib/libpthread.so.0 > #2 0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0 > #3 0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0 > #4 0x402ca21c in vsyslog () from /lib/libc.so.6 > #5 0x402c9d8f in syslog () from /lib/libc.so.6 > #6 0x08150a57 in write_syslog () > (gdb) This seems worthy of an inquiry to the glibc maintainers, or at least some research to find out what syslog() could block on. regards, tom lane
>> And the other:
>> 19720 pts/2 S 0:04 postgres: award Trucking [local] UPDATE
>> (gdb) bt
>> #0 0x4021cec6 in sigsuspend () from /lib/libc.so.6
>> #1 0x424b6218 in __pthread_wait_for_restart_signal ()
>>    from /lib/libpthread.so.0
>> #2 0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0
>> #3 0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0
>> #4 0x402ca21c in vsyslog () from /lib/libc.so.6
>> #5 0x402c9d8f in syslog () from /lib/libc.so.6
>> #6 0x08150a57 in write_syslog ()
>> (gdb)
>
> This seems worthy of an inquiry to the glibc maintainers, or at least
> some research to find out what syslog() could block on.

I've been digging, and the best information I've been able to come up
with is that there may be an unresolved glibc bug involving syslog and
threads dating back to at least 2001, or that glibc's syslog() is not
re-entrant and deadlocks when called from inside a signal handler while
already executing. Neither of those makes a whole lot of sense to me,
though:

1. The PostgreSQL backend shouldn't be getting any signals (at least none
   external to PostgreSQL).

2. PostgreSQL doesn't use threads, right? That occurred to me this
   morning, and I'm a little disturbed by the appearance of libpthread in
   the backtrace.

The disgusting thing is that all the mail threads I've followed either
dead-end, or mention creating a signal-safe and/or thread-safe wrapper
around syslog inside the calling application. Yuck! The only information
I've found which was particularly enlightening was a mention that POSIX
doesn't specify that syslog() should be reentrant, and (according to the
poster) glibc's is not and does not claim to be (which I confirmed in the
on-line manual). Blah.

I'm ready to chalk this one up to glibc inadequacies and put it behind me
now, especially when the simple adjustment of pipe-to-log-rotator does
just fine.
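In case it helps anyone digging through the archives later, here's a
contrived standalone sketch of the scenario those mail threads describe:
a program interrupting its own syslog() call with a signal handler that
calls syslog() again. To be clear, this is only my illustration of the
theory, not anything taken from Postgres, and since the hang depends on
the alarm landing while glibc's internal lock is held, it may take many
iterations to show up (if it shows up at all).

/*
 * Contrived re-entrancy demo.  syslog() is not async-signal-safe; if the
 * handler runs while the interrupted syslog() call holds glibc's internal
 * lock, the nested call waits on that same lock forever.
 *
 * Build with something like "cc reentrancy.c -lpthread"; linking libpthread
 * mirrors the backtraces above, where pthread_mutex_lock shows up inside
 * vsyslog().
 */
#include <signal.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

static void
log_from_handler(int signo)
{
    /* The suspect pattern: calling syslog() again from a signal handler. */
    syslog(LOG_DEBUG, "message from signal handler (signal %d)", signo);
    alarm(1);
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_from_handler;
    sigaction(SIGALRM, &sa, NULL);

    openlog("syslog-reentrancy-test", LOG_PID, LOG_LOCAL0);
    alarm(1);

    /* Spam syslog and wait for an alarm to land mid-call. */
    for (;;)
        syslog(LOG_DEBUG, "message from main loop");

    return 0;    /* not reached */
}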
award@dominionsciences.com writes:
> Neither of those makes a whole lot of sense to me, though:
> 1. The PostgreSQL backend shouldn't be getting any signals (at least none
>    external to PostgreSQL).

Hm. Postgres does sometimes issue elog messages from inside a signal
handler. I doubt that it would do so while interrupting a non-interrupt
elog issuance, but wouldn't swear to it.

> 2. PostgreSQL doesn't use threads, right? That occurred to me this
>    morning, and I'm a little disturbed by the appearance of libpthread
>    in the backtrace.

Postgres doesn't, but that doesn't mean that your glibc wouldn't
internally assume that it might be in a threaded app. See the recent
flamewar about threaded vs unthreaded libc in pgsql-hackers.

It's really too bad that your gdb backtrace didn't show anything past
the write_syslog level (which is an elog subroutine). If we could see
whether the elog had been issued from a signal handler, and if so what
it had interrupted, we'd have an idea whether this is a known syslog
deficiency or not. Do you want to rebuild with debug symbols and try
the backtrace again?

> The disgusting thing is that all the mail threads I've followed either
> dead-end, or mention creating a signal-safe and/or thread-safe wrapper
> around syslog inside the calling application. Yuck!

Agreed, but we might be looking at needing to do that in Postgres,
if this turns out to be something that the glibc boys know about and
won't fix.

        regards, tom lane
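PS: To make "do that in Postgres" a bit more concrete, the sketch below is
the general shape I have in mind. None of this exists in the tree today,
and the function and flag names are made up; the point is only to show one
way of keeping a signal handler from re-entering syslog() while an
interrupted call still holds glibc's internal lock.

/*
 * Hypothetical signal-safe syslog wrapper (sketch only, not Postgres code).
 * Two layers of defense: drop the message if we are somehow re-entered,
 * and hold off signals for the duration of the real vsyslog() call so no
 * handler can run while the internal lock is held.
 */
#include <signal.h>
#include <stdarg.h>
#include <syslog.h>

static volatile sig_atomic_t in_syslog = 0;

void
safe_syslog(int priority, const char *fmt, ...)
{
    sigset_t    block,
                save;
    va_list     ap;

    /* If a handler re-entered us mid-call, drop the message rather than
     * deadlock on glibc's lock. */
    if (in_syslog)
        return;

    /* Block signals so no handler can interrupt the real call. */
    sigfillset(&block);
    sigprocmask(SIG_BLOCK, &block, &save);

    in_syslog = 1;
    va_start(ap, fmt);
    vsyslog(priority, fmt, ap);
    va_end(ap);
    in_syslog = 0;

    sigprocmask(SIG_SETMASK, &save, NULL);
}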
> It's really too bad that your gdb backtrace didn't show anything past
> the write_syslog level (which is an elog subroutine). If we could see
> whether the elog had been issued from a signal handler, and if so what
> it had interrupted, we'd have an idea whether this is a known syslog
> deficiency or not. Do you want to rebuild with debug symbols and try
> the backtrace again?

Since you think it might be necessary to add a workaround in Postgresql,
I'll try it again when I install 7.3.4 on my test machine. That will
probably be in a few days. I need to get back to real work again. :-)

I'm sure I rebuilt 7.3.3 with the symbols left in place. Off the top of
your head, do you know an easy way to tell if a binary still has symbols?
I'm back with more on the funky glibc-syslog-Postgres deadlocking behavior:

> It's really too bad that your gdb backtrace didn't show anything past
> the write_syslog level (which is an elog subroutine). If we could see
> whether the elog had been issued from a signal handler, and if so what
> it had interrupted, we'd have an idea whether this is a known syslog
> deficiency or not. Do you want to rebuild with debug symbols and try
> the backtrace again?

A minor ah-ha here: whoever wrote the original Gentoo build for Postgres
left out --enable-debug, even though Gentoo has a system-wide flag for
"no, I don't want symbols" (which I did not set).

I was testing a different piece of my app at home, and triggered the
deadlock by accident. Being frustrated, I took the time to rebuild PG with
symbols, and also rebuilt Python (for other reasons), which accounts for
the RExec exception in the traceback here. The exception normally comes
back to the client with no problems; I had forgotten to comment out the
poison line in Python 2.2.3's RExec module so my plpython code would
continue working. This is Postgresql 7.3.4 running against glibc 2.3.2.

It just occurred to me that I didn't trace the postmaster or the other
always-on processes, just the two dead backends. In a couple of attempts
to recreate the problem again, I had no failures. :-( The original problem
was on a different machine running a different part of my project, and
seemed more reproducible (or I could have just had a bad day), so I can go
back to that at some point if tracing the other processes is a good idea.

postgres 27259 3.2 6.4 47488 20600 pts/0 S 03:30 0:05 postgres: postgres Trucking [local] VACUUM

(gdb) bt
#0  0x4dbf81a7 in semop () from /lib/libc.so.6
#1  0x080f2dd7 in PGSemaphoreLock (sema=0x1, interruptOK=1 '\001') at pg_sema.c:434
#2  0x08106dd3 in ProcWaitForSignal () at proc.c:895
#3  0x08101878 in LockBufferForCleanup (buffer=1021) at bufmgr.c:1946
#4  0x080c380b in lazy_vacuum_heap (onerel=0x420a20f8, vacrelstats=0x8288f88) at vacuumlazy.c:474
#5  0x080c3476 in lazy_scan_heap (onerel=0x420a20f8, vacrelstats=0x8288f88, Irel=0x8297818, nindexes=3) at vacuumlazy.c:426
#6  0x080c31e2 in lazy_vacuum_rel (onerel=0x21, vacstmt=0x1) at vacuumlazy.c:158
#7  0x080c02ee in vacuum_rel (relid=1021, vacstmt=0x8287138, expected_relkind=114 'r') at vacuum.c:829
#8  0x080bfd32 in vacuum (vacstmt=0x8287138) at vacuum.c:290
#9  0x0810a3a1 in pg_exec_query_string (query_string=0x8287138, dest=Remote, parse_context=0x827bb30) at postgres.c:789
#10 0x0810afdf in PostgresMain (argc=4, argv=0xbfff7cc8, username=0x822e1d9 "postgres") at postgres.c:2013
#11 0x080f5175 in DoBackend (port=0x822e0a8) at postmaster.c:2310
#12 0x080f4d2f in BackendStartup (port=0x822e0a8) at postmaster.c:1932
#13 0x080f3f95 in ServerLoop () at postmaster.c:1009
#14 0x080f3989 in PostmasterMain (argc=1, argv=0x82162e0) at postmaster.c:788
#15 0x080d684c in main (argc=1, argv=0xbfff85c4) at main.c:210
#16 0x4db327a7 in __libc_start_main () from /lib/libc.so.6

postgres 27235 5.7 3.0 41900 9784 pts/0 S 03:29 0:15 postgres: award Trucking [local] INSERT

(gdb) bt
#0  0x4db45ed6 in sigsuspend () from /lib/libc.so.6
#1  0x4de40218 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#2  0x4de419a0 in __pthread_alt_lock () from /lib/libpthread.so.0
#3  0x4de3ec17 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x4dbf334c in vsyslog () from /lib/libc.so.6
#5  0x4dbf2ebf in syslog () from /lib/libc.so.6
#6  0x0814a49b in write_syslog (level=4, line=0xbfff70fc "ERROR:  plpython: Unable to create rexec.RExec instance\nexceptions.RuntimeError: This code is not secure in Python 2.2 and 2.3") at elog.c:728
#7  0x08149bf9 in elog (lev=-1073779172, fmt=0x421a44fa "plpython: %s\n%s") at elog.c:383
#8  0x4214b496 in PLy_elog (elevel=20, fmt=0x421a4a80 "Unable to create rexec.RExec instance") at plpython.c:2811
#9  0x421490ff in PLy_procedure_compile (proc=0x83abdb8, src=0x8359610 "\n# TD[\"args\"][0] - audit_table (with schema) \n# TD[\"relid\"] = OID of the table being triggered on\n\nif not SD.has_key(\"plans\"):\n\tSD[\"plans\"] = {}\n\nif SD[\"plans\"].has_key(TD[\"relid\"]):\n\tcachedPlans = SD"...) at plpython.c:1203
#10 0x42148f32 in PLy_procedure_create (fcinfo=0xbfff78c0, is_trigger=1 '\001', procTup=0x420b1910, key=0xbfff7730 "5210458_trigger") at plpython.c:1177
#11 0x42148ba8 in PLy_procedure_get (fcinfo=0xbfff78c0, is_trigger=1 '\001') at plpython.c:1043
#12 0x4214758e in plpython_call_handler (fcinfo=0xbfff78c0) at plpython.c:423
#13 0x0814b4cc in fmgr_security_definer (fcinfo=0xbfff78c0) at fmgr.c:681
#14 0x080bbb8f in ExecCallTriggerFunc (trigdata=0xbfff79e8, finfo=0x82aff98, per_tuple_context=0x827c9d0) at trigger.c:1121
#15 0x080bc39a in DeferredTriggerExecute (event=0x82c2ba8, itemno=-4, rel=0x4f815b, trigdesc=0x828bff8, finfo=0x82afcf0, per_tuple_context=0x827c9d0) at trigger.c:1656
#16 0x080bc685 in deferredTriggerInvokeEvents (immediate_only=1 '\001') at trigger.c:1804
#17 0x080bc7fb in DeferredTriggerEndQuery () at trigger.c:1964
#18 0x0810a4c7 in finish_xact_command (forceCommit=0 '\0') at postgres.c:970
#19 0x0810a287 in pg_exec_query_string (query_string=0xbfff7aac, dest=Remote, parse_context=0x827bb30) at postgres.c:897
#20 0x0810afdf in PostgresMain (argc=4, argv=0xbfff7cc8, username=0x822e1d9 "award") at postgres.c:2013
#21 0x080f5175 in DoBackend (port=0x822e0a8) at postmaster.c:2310
#22 0x080f4d2f in BackendStartup (port=0x822e0a8) at postmaster.c:1932
#23 0x080f3f95 in ServerLoop () at postmaster.c:1009
#24 0x080f3989 in PostmasterMain (argc=1, argv=0x82162e0) at postmaster.c:788
#25 0x080d684c in main (argc=1, argv=0xbfff85c4) at main.c:210
#26 0x4db327a7 in __libc_start_main () from /lib/libc.so.6
"Arthur Ward" <award@dominionsciences.com> writes: > I'm back with more on the funky glibc-syslog-Postgres deadlocking behavior: It looks to me like the guy doing VACUUM is simply waiting for the other guy to release a page-level lock. The other guy is running a deferred trigger and so I'd expect him to be holding one or two page-level locks, on the page or pages containing the tuple or tuples passed to the trigger. Nothing evidently wrong there. The real question is why does vsyslog() have anything to block on, when it's running in an unthreaded process? Seeing that you are using plpython, I wonder if Python is confusing matters somehow. regards, tom lane
> It looks to me like the guy doing VACUUM is simply waiting for the other
> guy to release a page-level lock. The other guy is running a deferred
> trigger and so I'd expect him to be holding one or two page-level locks,
> on the page or pages containing the tuple or tuples passed to the
> trigger. Nothing evidently wrong there.

If I remember what I was working on the other day when this whole thing
started, I think it was a single backend and a checkpoint that collided.
I'll trace that combination, assuming it happens again.

> The real question is why does vsyslog() have anything to block on, when
> it's running in an unthreaded process? Seeing that you are using
> plpython, I wonder if Python is confusing matters somehow.

Oof. I'm using plpython all over the place; I don't think this problem
has happened in any location that can work without it easily. :-/
"Arthur Ward" <award@dominionsciences.com> writes: >> It looks to me like the guy doing VACUUM is simply waiting for the other >> guy to release a page-level lock. The other guy is running a deferred >> trigger and so I'd expect him to be holding one or two page-level locks, >> on the page or pages containing the tuple or tuples passed to the >> trigger. Nothing evidently wrong there. > If I remember what I was working on the other day when this whole thing > started, I think it was a single backend and a checkpoint that collided. > I'll trace that combination, assuming it happens again. A checkpoint would also have reason to wait for a page-level lock, if the stuck backend was holding one. I am wondering though if the stuck condition consistently happens while trying to fire a trigger? That would be very interesting ... not sure what it'd mean though ... >> The real question is why does vsyslog() have anything to block on, when >> it's running in an unthreaded process? Seeing that you are using >> plpython, I wonder if Python is confusing matters somehow. > Oof. I'm using plpython all over the place; I don't think this problem has > happened in any location that can work without it easily. :-/ It looks to me like your plpython code is all dead in the water, seeing that your Python installation is refusing creation of rexec. (AFAIK the only workaround is to downgrade Python to a version that allows rexec.) If you're using it all over the place, how come you haven't noticed that?? regards, tom lane
> A checkpoint would also have reason to wait for a page-level lock, if
> the stuck backend was holding one. I am wondering though if the stuck
> condition consistently happens while trying to fire a trigger? That
> would be very interesting ... not sure what it'd mean though ...

Hmm. I'm really at a loss as to how I could test this theory. I don't
have a truly reproducible test case for any of these syslog problems,
just one set of code with really bad luck one afternoon. Race conditions
(or things that smell a lot like them) stink.

> It looks to me like your plpython code is all dead in the water, seeing
> that your Python installation is refusing creation of rexec. (AFAIK the
> only workaround is to downgrade Python to a version that allows rexec.)
> If you're using it all over the place, how come you haven't noticed
> that??

I did notice, and it was an oversight. I had just rebuilt Python 2.2.3
(for unrelated reasons) and forgot to comment out the poison line in
rexec.py where they raise the "stop using RExec" exception. It behaves
properly once that line is commented out (properly meaning it works the
same as with earlier versions of Python; it appears all that changed was
the insertion of the exception at the beginning of the RExec constructor).
I tried to get the deadlock behavior again after fixing rexec.py, but my
luck wasn't bad enough for another three runs, so I posted the case I had
traces for.

An idea just popped into my head, though. Perhaps I can create procedures
in plpgsql and plpython which do nothing but spew notices (which would in
turn be sent to syslog), and run one or two copies to see if they'll die
on their own given sufficient time. That seems worthwhile, especially if
I can get a deadlock in plpgsql, since it will take the blame away from
both triggers and plpython. Does this sound like a reasonable experiment?
I may try setting this up on my home machine tonight to run for a few
days.
"Arthur Ward" <award@dominionsciences.com> writes: > An idea just popped into my head, though. Perhaps I can create procedures > in plpgsql and plpython which do nothing but spew notices (which would in > turn be sent to syslog), and run one or two copies to see if they'll die > on their own given sufficient time. That seems worthwhile, especially if I > can get a deadlock in plpgsql, since it will take the blame away from both > triggers and plpython. Does this sound like a reasonable experiment? Sure. Note it seems possible that the failure occurs only during a particular backend's first or second attempt to send something to syslog (since the first attempt would do openlog). So you should probably run a test that involves sending only a few messages per session, rather than vast numbers of messages from one session. regards, tom lane