Thread: BUG: Infinite syslog() loop
I woke up this morning to a filled root partition on one of my servers. I found that both debug and syslog were filled with over 50MB each of postgres errors: [what appears to be a standard cleanup cron job] . . . . Apr 21 04:00:03 alpha postgres[3371]: [85] DEBUG: --Relation pg_toast_16408-- Apr 21 04:00:03 alpha postgres[3371]: [86-1] DEBUG: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Apr 21 04:00:03 alpha postgres[3371]: [86-2] ^ITotal CPU 0.00s/0.00u sec elapsed 0.00 sec. Apr 22 04:00:01 alpha postgres[17926]: [1] DEBUG: connection: host=[local] user=postgres database=template1 Apr 22 04:00:02 alpha postgres[17941]: [1] DEBUG: connection: host=[local] user=postgres database=template1 Apr 22 04:00:02 alpha postgres[362]: [1] DEBUG: statistics collector process (pid 364) exited with exit code 1 Apr 22 04:00:02 alpha postgres[362]: [2] DEBUG: statistics collector process (pid 17945) exited with exit code 1 . . . . Apr 22 04:16:32 alpha postgres[362]: [445836] DEBUG: statistics collector process (pid 1629) exited with exit code 1 The 'statistics collector exited' error message goes on to repeat for almost half a million times (as shown), before the files filled the root fs and truncated. It took about 16.5 minutes to spew this much, to both logs, before it wedged the machine and finally fully gave up. As far as I know the machine did not have zero reported space available (after minfree) at the time this started, but even if it did, an infinite loop of syslogs is only going to exacerbate the problem, hence this bug report. I've shut down postgres on the machine (since we're not using it at this instant afaik) and have kept the logs in question, so if someone can tell me what to look at, I'll do what I can to gather it. Email me directly, as I'm not on any pgsql lists. The machine is a stock Debian Woody box, up to date, with no local changes to the postgres configuration at all. It's running 7.2.1-2woody2. Erik Walthinsen <omega@temple-baptist.com> - System Administrator __ / \ GStreamer - The only way to stream! | | M E G A ***** http://gstreamer.net/ ***** _\ /_
On Tue, 2003-04-22 at 20:41, Erik Walthinsen wrote: > I woke up this morning to a filled root partition on one of my servers. > I found that both debug and syslog were filled with over 50MB each of > postgres errors: > > [what appears to be a standard cleanup cron job] > . . . . > Apr 21 04:00:03 alpha postgres[3371]: [85] DEBUG: --Relation ... > The machine is a stock Debian Woody box, up to date, with no local changes > to the postgres configuration at all. It's running 7.2.1-2woody2. Change your syslog configuration to throw away debug messages from postgres (user0) -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight, UK http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C ======================================== "Nay, in all these things we are more than conquerors through him that loved us." Romans 8:37
Erik Walthinsen <omega@temple-baptist.com> writes: > Apr 22 04:00:02 alpha postgres[362]: [1] DEBUG: statistics collector > process (pid 364) exited with exit code 1 > Apr 22 04:00:02 alpha postgres[362]: [2] DEBUG: statistics collector > process (pid 17945) exited with exit code 1 > . . . . > Apr 22 04:16:32 alpha postgres[362]: [445836] DEBUG: statistics collector > process (pid 1629) exited with exit code 1 > The machine is a stock Debian Woody box, up to date, with no local changes > to the postgres configuration at all. It's running 7.2.1-2woody2. Hmm. Evidently the pgstat subprocess is dying immediately on launch. Checking the archives, I see that back then pgstat was in the habit of reporting its woes on stderr no matter what the elog destination was supposed to be :-(. So the actually useful message went to postmaster stderr, which more than likely is routed to /dev/null by your postmaster startup script. If you want to try to figure out what went wrong, you could alter the start script to enable collection of stderr output and then try to reproduce the problem. regards, tom lane
Erik Walthinsen <omega@temple-baptist.com> writes: > The problem though is still that whatever code is spawning pgstat is doing > so forever. Are you suggesting that because the error output isn't going > to the right place that postmaster is not detecting that it died? I'd > think that postmaster would instead check the return value (else how does > it determine atm that it failed?) and not loop once it's failed. Actually, the postmaster is deliberately designed to restart pgstat. On the whole I think that's a net plus for reliability, not a minus. There might be some value in limiting the rate at which we try to spawn it, though. regards, tom lane
I said: > Actually, the postmaster is deliberately designed to restart pgstat. > On the whole I think that's a net plus for reliability, not a minus. > There might be some value in limiting the rate at which we try to spawn > it, though. As of CVS tip, the postmaster will not try to launch pgstat oftener than once a minute. regards, tom lane
On Fri, 2003-04-25 at 23:27, Erik Walthinsen wrote: > On Fri, 25 Apr 2003, Oliver Elphick wrote: > > > > I woke up this morning to a filled root partition on one of my servers. > > > I found that both debug and syslog were filled with over 50MB each of > > > postgres errors: > > Change your syslog configuration to throw away debug messages from > > postgres (user0) > > Um, NO. > > I will not configure my system to throw away logging data that's required > to keep track of normal system operations, because there's a bug causing > infinite looping. Oops. I hadn't read your message properly. I've frequently seen complaints of debug messages in syslog, because people don't realise they need to configure syslog rahter than postgresql to control them. -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight, UK http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C ======================================== "Submit yourselves therefore to God. Resist the devil, and he will flee from you." James 4:7
On Fri, 25 Apr 2003, Tom Lane wrote: > As of CVS tip, the postmaster will not try to launch pgstat oftener than > once a minute. Cool! Now I just have to convince Debian to bring that change into woody <g> Thanx, Omega Erik Walthinsen <omega@temple-baptist.com> - System Administrator __ / \ GStreamer - The only way to stream! | | M E G A ***** http://gstreamer.net/ ***** _\ /_
On Fri, 25 Apr 2003, Oliver Elphick wrote: > > I woke up this morning to a filled root partition on one of my servers. > > I found that both debug and syslog were filled with over 50MB each of > > postgres errors: > Change your syslog configuration to throw away debug messages from > postgres (user0) Um, NO. I will not configure my system to throw away logging data that's required to keep track of normal system operations, because there's a bug causing infinite looping. This 'solution' in no way solves the problem, which in actuality has very little to do with syslog and everything to do with the fact that postgresql managed to get into an infinite loop. Infinite loops typically cause the process to eat up all available processing time, hence the ability to write nearly half a million log messages in just over 16 minutes. My server has better things to do. Throwing away the log messages is equivalent to sticking your fingers in your ears to hide from a nuclear air-burst. I still welcome any advice that will help me track down the cause of this bug and get it fixed. Erik Walthinsen <omega@temple-baptist.com> - System Administrator __ / \ GStreamer - The only way to stream! | | M E G A ***** http://gstreamer.net/ ***** _\ /_
On Fri, 25 Apr 2003, Tom Lane wrote: > Hmm. Evidently the pgstat subprocess is dying immediately on launch. > Checking the archives, I see that back then pgstat was in the habit of > reporting its woes on stderr no matter what the elog destination was > supposed to be :-(. So the actually useful message went to postmaster > stderr, which more than likely is routed to /dev/null by your postmaster > startup script. > > If you want to try to figure out what went wrong, you could alter the > start script to enable collection of stderr output and then try to > reproduce the problem. It is possible that the disk was beyond minfree at that point, which would explain why pgstat died. I'll have to test that scenario on a UML box to avoid squishing my production server. The problem though is still that whatever code is spawning pgstat is doing so forever. Are you suggesting that because the error output isn't going to the right place that postmaster is not detecting that it died? I'd think that postmaster would instead check the return value (else how does it determine atm that it failed?) and not loop once it's failed. Erik Walthinsen <omega@temple-baptist.com> - System Administrator __ / \ GStreamer - The only way to stream! | | M E G A ***** http://gstreamer.net/ ***** _\ /_